Post on 31-Jul-2020
TRANSCRIPT
Timothy Lanfear
ENDURING DIFFERENTIATION
2
WHERE ARE WE?
3
LIFE AFTER DENNARD SCALING
[Chart: 40 Years of Microprocessor Trend Data, 1980–2020. Transistor counts (thousands) keep climbing while single-threaded performance slows from 1.5X per year to 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010–2015 collected by K. Rupp.]
4
GPU-ACCELERATED PERFORMANCE
[Chart: AMBER performance (ns/day) on a single GPU server, rising each generation: K20 in 2013 (AMBER 12, CUDA 4), K40 in 2014 (AMBER 14, CUDA 5), K80 in 2015 (AMBER 14, CUDA 6), P100 in 2016 (AMBER 16, CUDA 8); 0–40 ns/day scale.]
[Chart: GoogleNet training performance (images/s), rising from 8x K80 in 2014 (cuDNN 2, CUDA 6) through 8x Maxwell in 2015 (cuDNN 4, CUDA 7) and DGX-1 in 2016 (cuDNN 6, CUDA 8, NCCL 1.6) to DGX-1V in 2017 (cuDNN 7, CUDA 9, NCCL 2); 0–12,000 i/s scale.]
5
TESLA PLATFORM ADVANTAGE
Delivered value grows over time
Life Sciences (AMBER): 10X perf in 8 years. Oil & Gas (RTM): 6X perf in 6 years.
TESLA VALUE: $15-20M COST SAVINGS DELIVERED
AMBER performance: nanoseconds per day delivered on a single server with GPUs and CPUs
6
GPU-ACCELERATED EFFICIENCY
On Track To Meet Exascale Goal
[Chart: GFLOPS per watt of top GPU systems in Green500 lists, with NVIDIA projections for V100: Eurotech Aurora (K20) 3.2, TSUBAME-KFC (K20X) 4.4, TiTech W780I (K80) 5.3, SaturnV (P100) 9.5, TSUBAME 3.0 (P100) 14.1, and V100 projected toward the 35 GF/W exascale goal.]
13/13 Greenest Supercomputers Powered by Tesla P100
TSUBAME 3.0
Kukai
AIST AI Cloud
RAIDEN GPU subsystem
Piz Daint
Wilkes-2
GOSAT-2 (RCF2)
DGX Saturn V
Reedbush-H
JADE
Facebook Cluster
Cedar
DAVIDE
7
HOW ARE WE DOING THIS?
• What are the most important dimensions of our differentiation?
• Why are GPUs so much more efficient than CPUs?
• How can we continue scaling performance/efficiency as Moore’s Law fades?
• Why can’t competitors replicate GPU efficiency, performance, scaling, etc., with
lots of weak CPU cores? (e.g., Intel KNC/KNL/KNM)
• How is optimizing GPUs for AI affecting their suitability for HPC?
And, is our differentiation sustainable?
8
ENERGY EFFICIENCY
9
[Figure: energy cost of operations on a 28 nm IC with a 20 mm die. A 64-bit DP operation costs 20 pJ; a 256-bit access to an 8 kB SRAM costs 50 pJ; moving 256 bits on-chip costs 26 pJ to 256 pJ depending on distance, up to 1000 pJ across the full 20 mm die; an efficient off-chip link costs 500 pJ per 256 bits; a DRAM read/write costs 16000 pJ.]
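A quick sanity check on the figure's numbers shows why communication, not arithmetic, dominates the energy budget (a sketch; the per-operation energies are taken directly from the figure above):

```python
# Energy per operation on a 28 nm IC, in picojoules (figures from the slide).
E_DP_FLOP = 20      # one 64-bit double-precision operation
E_SRAM_256B = 50    # 256-bit access to an 8 kB SRAM
E_WIRE_20MM = 1000  # moving 256 bits across the 20 mm die
E_OFFCHIP = 500     # 256 bits over an efficient off-chip link
E_DRAM = 16000      # 256-bit DRAM read/write

# Equivalent number of DP operations per data movement:
print(E_DRAM / E_DP_FLOP)       # 800.0: one DRAM access costs 800 flops
print(E_WIRE_20MM / E_DP_FLOP)  # 50.0: crossing the die costs 50 flops
print(E_SRAM_256B / E_DP_FLOP)  # 2.5: a local SRAM access is comparatively cheap
```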
COMPUTATION VERSUS COMMUNICATIONS
10
CPU: 126 pJ/flop (SP)
Optimized for latency; deep cache hierarchy (Broadwell E5 v4, 14 nm)
GPU: 28 pJ/flop (SP)
Optimized for throughput; explicit management of on-chip memory (Pascal, 16 nm)
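Energy per flop translates directly into an efficiency ceiling: one watt is 10^12 pJ per second, so sustainable GFLOPS per watt is simply the reciprocal of the pJ/flop figure. A small sketch using the two numbers above:

```python
def gflops_per_watt(pj_per_flop):
    # 1 W = 1e12 pJ/s, so flop/s per watt = 1e12 / (pJ per flop).
    return (1e12 / pj_per_flop) / 1e9

cpu_sp = gflops_per_watt(126)  # Broadwell E5 v4, single precision
gpu_sp = gflops_per_watt(28)   # Pascal, single precision
print(round(cpu_sp, 1))  # 7.9 GFLOPS/W
print(round(gpu_sp, 1))  # 35.7 GFLOPS/W, a ~4.5x efficiency advantage
```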
11
HOW IS POWER SPENT IN A CPU?
Out-of-order, high-performance core (Alpha 21264; Natarajan [2003]):
clock + pins 45%, register file 14%, fetch 11%, rename 10%, data supply 5%, ALU 4%
In-order, embedded core (Dally [2008], embedded in-order CPU):
instruction supply 42%, clock + control logic 24%, data supply 17%, issue 11%, register file 11%, ALU 6%
12
Overhead: 985 pJ; payload arithmetic: 15 pJ
Source: Azizi [PhD 2010]
13
SIMPLER CORES = ENERGY EFFICIENCY
14
Overhead: 15 pJ; payload arithmetic: 15 pJ
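The two breakdowns above reduce to a single figure of merit: what fraction of each instruction's energy goes into useful arithmetic. A minimal sketch with the slide's numbers:

```python
def payload_fraction(overhead_pj, payload_pj):
    # Fraction of per-instruction energy spent on actual arithmetic.
    return payload_pj / (overhead_pj + payload_pj)

# Out-of-order CPU: 985 pJ of overhead per 15 pJ of arithmetic.
print(payload_fraction(985, 15))  # 0.015, i.e. ~1.5% efficient
# Simpler throughput core: overhead comparable to the payload.
print(payload_fraction(15, 15))   # 0.5, i.e. 50% efficient
```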
15
THROUGHPUT PROCESSORS
16
RISE OF LEAKAGE
17
FREQUENCY VS. LEAKAGE
Source: Gordon Moore, Intel; IEEE
19
OPTIMIZED FOR DATACENTER EFFICIENCY
40% More Performance in a Rack
V100 Max Performance (computer vision): 13 kW rack, 4 nodes of 8x V100, 13 ResNet-50 networks trained per day
V100 Max Efficiency (computer vision): 13 kW rack, 7 nodes of 8x V100, 18 ResNet-50 networks trained per day
ResNet-50 training; Max Efficiency run with V100 @ 160 W | V100 performance measured on pre-production hardware.
80% Perf at Half the Power
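The rack arithmetic behind these two claims can be checked from the node counts and throughputs quoted above (a sketch using only the slide's figures):

```python
# 13 kW rack configurations from the slide.
maxperf_nodes, maxperf_nets = 4, 13  # 8x V100 nodes at full power
maxeff_nodes, maxeff_nets = 7, 18    # 8x V100 nodes with V100 @ 160 W

# "40% more performance in a rack": whole-rack throughput ratio.
rack_gain = maxeff_nets / maxperf_nets
# "80% perf at half the power": per-node throughput ratio.
per_node = (maxeff_nets / maxeff_nodes) / (maxperf_nets / maxperf_nodes)
print(round(rack_gain, 2))  # 1.38
print(round(per_node, 2))   # 0.79
```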
20
SP ENERGY EFFICIENCY @ 28 NM
[Chart: single-precision GFLOPS per watt rising across the Fermi, Kepler, and Maxwell generations; 0–25 GFLOPS/W scale.]
21
HETEROGENEOUS COMPUTING
22
OPTIMIZING SERIAL/PARALLEL EXECUTION
Application code splits across both processors: parallel work, the majority of operations, runs on the GPU; serial work, the system and sequential operations, runs on the CPU.
23
TWO TYPES OF ACCELERATORS
Heterogeneous computing model: complementary processors work together, a CPU optimized for serial tasks plus a GPU accelerator optimized for parallel tasks.
Many-Weak-Cores (MWC) model: a single CPU core type handles both serial and parallel work, as in Xeon Phi (and others) with many weak serial cores.
24
EXTENSIBILITY
25
NVLINK: A MEMORY FABRIC, NOT A NETWORK
DGX-1: 8 NVLink-Connected GPUs
26
LATENCY HIDING FOR LOAD/STORE/ATOMICS
Where are the NICs? There are no NICs.
27
STRONG SCALING
[Chart: AlexNet training speed-up vs. one Knights Landing server, for 1–128 KNL servers against 1x DGX-1 (0x–40x scale). One GPU-accelerated server, the NVIDIA DGX-1, is faster than 128 Knights Landing servers.]
28
[Chart: LAMMPS (molecular dynamics) speed-up vs. one KNL server, for up to 32 KNL nodes. A single GPU-accelerated server wins: 2x P100 is 13x faster, 4x P100 is 21x faster, and 8x P100 is 35x faster; an 8x Tesla P100 PCIe server beats 32 KNL servers.]
[Chart: GTC-P (plasma turbulence) speed-up vs. one KNL server, for up to 64 KNL nodes. The 2x and 4x P100 configurations scale up to 8x P100 at 8x faster; an 8x Tesla P100 PCIe server beats 64 KNL servers.]
STRONG SCALING
29
[Chart: LQCD (high-energy physics) Gflop/s, SATURNV DGX servers vs. the SuperMUC supercomputer (0–144 CPU nodes): 1x DGX-1 delivers 8K Gflop/s, 2x DGX-1 15K Gflop/s, and 4x DGX-1 20K Gflop/s, while the CPU nodes reach 2K, 3K, 5K, and 7K Gflop/s. QUDA version 0.9beta using double-half mixed precision; DDalphaAMG using double-single.]
CPU servers needed to match the performance of SATURNV: 2,300 for S3D (discovering new fuel for engines); 3,800 for SPECFEM3D (simulating earthquakes).
STRONG SCALING
30
NEW TENSOR CORE
New CUDA TensorOp instructions and data formats
4×4 matrix processing array: D_FP32 = A_FP16 × B_FP16 + C_FP32
Optimized for deep learning
Activation inputs, weight inputs, output results
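A minimal pure-Python sketch of the TensorOp semantics above (not hardware code): FP16 rounding of the A and B inputs is emulated with the standard library's half-precision `struct` format, and Python floats stand in for the FP32 accumulator.

```python
import struct

def to_fp16(x):
    # Round a float to IEEE 754 half precision and back,
    # emulating FP16 storage of the A and B operands.
    return struct.unpack('e', struct.pack('e', x))[0]

def tensor_op(A, B, C):
    # One 4x4 step of D = A_FP16 x B_FP16 + C_FP32:
    # products of rounded FP16 inputs, accumulated at higher precision.
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]  # FP32 accumulator input
            for k in range(n):
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D

A = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]  # activation inputs
B = [[0.5] * 4 for _ in range(4)]             # weight inputs
C = [[1.0] * 4 for _ in range(4)]             # accumulator inputs
print(tensor_op(A, B, C)[0][0])  # 6.0: 1 + 0.5*(1+2+3+4)
```

On real hardware this is one warp-level instruction; the point of the sketch is only the data flow: inputs rounded to FP16, sums kept at full precision.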
31
TESLA PLATFORM
32
TESLA IS A PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI
TESLA GPU & SYSTEMS
NVIDIA SDK
INDUSTRY FRAMEWORKS & TOOLS
APPLICATIONS
ECOSYSTEM TOOLS
Applications: HPC (+450 applications), frameworks, internet services
Enterprise applications: manufacturing, automotive, healthcare, finance, retail, defense, …
NVIDIA SDK: Deep Learning SDK and ComputeWorks, including cuDNN, cuBLAS, cuSPARSE, NCCL, TensorRT, and the DeepStream SDK
Tesla GPU and systems: Tesla GPU, NVIDIA DGX-1, NVIDIA HGX-1, delivered via system OEMs and cloud
33
MULTIPLE GROWTH MARKETS
Growth markets for the Tesla platform: HPC, AI training, AI inference, video transcoding, and desktop virtualization (10M users; 40 years of video uploaded per day)
34
CONCLUSION
35
PASCAL TO VOLTA
• Area: ~600 mm2 → ~800 mm2 (~33% more area)
• Process: small improvement from Pascal to Volta (a few percent)
• Clocks: similar dynamic range, power limited
• Memory BW (sustained): 50% improvement
• Communications (NVLink): 160 GB/s → 300 GB/s (almost double!)
• AI (Tensor Cores): ~20 TFLOPS → 120 TFLOPS (~6x!)
Architecture with Technology
36
REVOLUTIONARY PERFORMANCE FOR HPC AND AI
Single Platform for Data Science and Computational Science
3X AI Performance in 1 Year: LSTM (neural machine translation) training time falls from 15 days on 2x CPU to 18 hours on 1x P100 and 6 hours on 1x V100.
1.5X HPC Performance in 1 Year: V100 performance normalized to P100 is 1.8x for Molecular Dynamics (VMD), 1.5x for Physics (QUDA), 1.6x for Seismic (RTM), and 1.5x for STREAM.
NMT training for 13 epochs | German -> English, WMT15 subset | System config: CPU = 2x Xeon E5-2699 v4 w/ P100s or V100s
QUDA, RTM, STREAM | System config: 2x Xeon E5-2690 v4, 2.6 GHz, w/ 1x Tesla P100 or V100
VMD | System config: Xeon E5-2698 v3 w/ 1x Tesla P100 and E5-2697A v4 w/ 1x Tesla V100 | V100 measured on pre-production hardware
37
GPU PERFORMANCE COMPARISON

                         P100           V100            Ratio
Training acceleration    10 TFLOPS      120 TFLOPS      12x
Inference acceleration   21 TFLOPS      120 TFLOPS      6x
FP64/FP32                5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 bandwidth           720 GB/s       900 GB/s        1.2x
NVLink bandwidth         160 GB/s       300 GB/s        1.9x
L2 cache                 4 MB           6 MB            1.5x
L1 caches                1.3 MB         10 MB           7.7x
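Given the absolute P100 and V100 numbers in the table, the ratio column can be reproduced mechanically; a quick sketch (row labels paraphrased from the table):

```python
# (P100, V100) pairs from the comparison table.
specs = {
    "Training acceleration (TFLOPS)": (10, 120),
    "Inference acceleration (TFLOPS)": (21, 120),
    "FP64 (TFLOPS)": (5, 7.5),
    "HBM2 bandwidth (GB/s)": (720, 900),
    "NVLink bandwidth (GB/s)": (160, 300),
    "L2 cache (MB)": (4, 6),
    "L1 caches (MB)": (1.3, 10),
}
for name, (p100, v100) in specs.items():
    print(f"{name}: {v100 / p100:.1f}x")  # e.g. NVLink: 300/160 = 1.9x
```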
38
GPU TRAJECTORY
[Chart: the microprocessor trend data, 1980–2020, extended with GPU-computing performance growing at 1.5X per year, versus 1.1X per year for single-threaded performance, and projecting 1000X by 2025. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010–2015 collected by K. Rupp.]
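The 1000X-by-2025 projection is simply compound growth at the chart's 1.5X-per-year rate; a quick check:

```python
import math

rate = 1.5  # GPU-computing performance growth per year (from the chart)
# Years of 1.5X/year growth needed for a cumulative 1000X:
years = math.log(1000) / math.log(rate)
print(round(years, 1))    # 17.0: about 17 years
print(round(rate ** 17))  # 985: i.e. roughly 1000X over that span
```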
APPLICATIONS
SYSTEMS
ALGORITHMS
CUDA
ARCHITECTURE