John Danskin Vice President GPU Architecture
Salishan 2017
HETEROGENEOUS COMPUTING: CHALLENGES & DIRECTIONS
TOPICS
Why Throughput-Optimized Processors are Efficient
Throughput Optimized = GPU; Latency Optimized = CPU
Why You Need Both Kinds of Processors
Why Fat Nodes are Better than Thin Nodes
A Qualitative Discussion
THROUGHPUT VS LATENCY
[Images: one brick vs. many bricks]
WHY THROUGHPUT IS EFFICIENT
Power = Capacitance × Voltage² × frequency
frequency ∝ Voltage / Resistance
⇒ Power ∝ frequency³
⇒ Energy/Operation ∝ frequency²
Caution: Qualitative Approximations
Serial Performance Costs Power
Wide & Slow is Efficient
Implication: 1.5 GHz should be ~7x more efficient than 4 GHz
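A quick check of that implication, using the Energy/Operation ∝ frequency² approximation above:

\[
\frac{E_\mathrm{op}(4\,\mathrm{GHz})}{E_\mathrm{op}(1.5\,\mathrm{GHz})}
  = \left(\frac{4.0}{1.5}\right)^{2} \approx 7.1
\]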
WHY THROUGHPUT IS EFFICIENT
Power = Capacitance × Voltage² × frequency
Capacitance ∝ Area (!! Caution)
Energy/Op ∝ Area/Op
⇒ Small, Simple Cores are Efficient*
*Unless Communication Explodes
Caution: Qualitative Argument
NVIDIA GP100: 610 mm², 3840 cores
IBM P8: 650 mm², 12 × 8 cores
40x Difference in Area/Core*
*Redefined core
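The ~40x figure follows from the die areas above, reading the P8's 12 × 8 as 96 (redefined) cores — a sketch of the arithmetic, not a formal comparison:

\[
\frac{650\,\mathrm{mm}^2 / 96}{610\,\mathrm{mm}^2 / 3840}
  \approx \frac{6.8\,\mathrm{mm}^2}{0.16\,\mathrm{mm}^2} \approx 43\times
\]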
GP100: Highly Evolved Throughput Engine
Not a Sea of CPUs
Mostly Computation
Tiny Caches
14 MB Register File
Single Instruction, Multiple Thread (SIMT)
Not vectors; manages divergence (see the sketch below)
SW Coherence
Programming model
Compilers
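A minimal CUDA sketch of what SIMT divergence handling means for the programmer (the kernel and names are illustrative, not from the talk): threads in a 32-lane warp may branch differently, and the hardware serializes the taken paths with inactive lanes masked, then reconverges.

```
// Illustrative kernel: each thread picks its own side of the branch.
// SIMT hardware runs the "then" path with the other lanes masked,
// then the "else" path, then reconverges -- no explicit vector code.
__global__ void simt_divergence(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;   // lanes where the predicate holds
    else
        out[i] = -in[i];         // remaining lanes of the warp
}
```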
BASIC PATTERN OF COMPUTATION
Parallel ("Hurry Up") → Serial ("Wait") → Parallel ("Hurry Up") → Serial ("Wait") → …
Serial Sections: Parallel Resources Wait; Resource Utilization Low ⇒ Push Latency When Serial
Parallel Sections: Every Unit Active ⇒ Push Efficiency When Parallel
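A minimal CUDA sketch of this pattern, assuming a hypothetical iterative solver: the parallel phase runs on the GPU, the serial phase on the CPU, and the throughput units idle at every synchronization.

```
#include <cuda_runtime.h>

// Hypothetical parallel phase: every GPU lane busy ("hurry up").
__global__ void parallel_phase(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 0.5f * x[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    for (int step = 0; step < 10; ++step) {
        parallel_phase<<<(n + 255) / 256, 256>>>(x, n);
        cudaDeviceSynchronize();   // "wait": the GPU idles from here...
        x[0] += x[n - 1];          // ...through this serial CPU section
    }
    cudaFree(x);
    return 0;
}
```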
ARCHITECTURAL OPTIONS
Sea of Latency Processors → Poor Throughput
Sea of Throughput Processors → Poor Utilization
Heterogeneous Mix → Possibly Perfect, But:
Processor Balance, Processor Transitions, I/O Balance
BALANCE: DGX-1 NODE FOR DEEP LEARNING
Balance: 2 CPUs, 8 GPUs, 4 NICs
Flexible Balance via I/O Fabric Switches
PCIe Limits Bandwidth
Future: Disaggregate? Ick?
BALANCE: SUMMIT NODE
2 CPUs, 6 GPUs, 2 NICs
I/O via CPU Fixes Ratio
Very High Bandwidth CPU/GPU Connection
Flexible Ratio
Future: NICs Closer to GPUs?
[Diagram: two sockets, each a P9 CPU with 256 GB DDR4 and a NIC, linked by coherent NVLINK2 to three Volta GPUs, each with 16 GB HBM2]
LOC/TOC PROCESSING TRANSITIONS
Dependency Loop Bottlenecks:
Bandwidth: CPU Scale (1x) ⇒ Fast Fabric (see IBM Minsky)
Latency: Interesting Work in Progress ⇒ Hide Dependent Latencies! (sketch below)
[Diagram: Basic Computational Loop — a stimulus/response dependency loop between the CPU (latency, 1x ops) and GPUs (throughput, 10–30x ops)]
NVLINK is Fast Enough™
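One standard way to hide dependent latency across the LOC/TOC boundary is to pipeline the dependency loop, so the CPU prepares batch k+1 while the GPU processes batch k. A CUDA sketch with two streams (the names and double-buffer depth are my choices, not the work-in-progress the slide alludes to):

```
#include <cuda_runtime.h>

// Hypothetical throughput phase operating on one batch.
__global__ void process(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20, n_batches = 8;
    float *host[2], *dev[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMallocHost(&host[s], n * sizeof(float));  // pinned, for async copies
        cudaMalloc(&dev[s], n * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int b = 0; b < n_batches; ++b) {
        int s = b & 1;                     // ping-pong between two slots
        cudaStreamSynchronize(stream[s]);  // wait only for this slot's last use
        // ... serial CPU work that fills host[s] goes here ...
        cudaMemcpyAsync(dev[s], host[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(dev[s], n);
        cudaMemcpyAsync(host[s], dev[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
        // While this batch runs, the next iteration prepares the other
        // slot: the CPU<->GPU dependency latency is overlapped, not paid.
    }
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaFreeHost(host[s]);
        cudaFree(dev[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```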
FAT VS THIN NODES

                                         THIN (1X)    FAT (10X)
Node Count                               10X          1X
Flat Hierarchy                           Y            N (but hierarchy can be hidden)
Memory per Node (Flat System Total)      1X           10X
NIC BW per System (Surface/Volume)       100%         3D: 46%, 5D: 60%
Targets for All-to-All                   10X          1X
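One way to read the Surface/Volume row, assuming an ideal D-dimensional decomposition: a node F times fatter has F^((D-1)/D) times the surface, and the system has 1/F as many nodes, so total NIC bandwidth demand scales as

\[
F^{\frac{D-1}{D}} \cdot \frac{1}{F} = F^{-1/D},
\qquad 10^{-1/3} \approx 46\%\ \text{(3D)},
\quad 10^{-1/5} \approx 63\%\ \text{(5D)}
\]

which matches the 46% and, roughly, the 60% quoted above.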
FAT NODE NETWORK I/O
Some I/O Hidden Inside a Fast, Efficient Local Network
[Diagram: 8 thin nodes vs. 2 fat nodes]
% of I/O Hidden Inside Fat Nodes Depends on Dimensionality of Problem
FAT NODE I/O & DIMENSIONAL GRIDS
[Plot: External I/O Requirements (0–100%) vs. Node Fatness (1–32) — ideal node surface (network) vs. node volume (capacity) in 1–6 dimensions]
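A small host-side program (plain C++, so it also compiles under nvcc) that regenerates the curves under the same ideal F^(-1/D) model; this is my reading of the plot, not the talk's exact data:

```
#include <cmath>
#include <cstdio>

int main()
{
    const int fatness[] = {1, 4, 8, 16, 32};
    std::printf("fatness    1D    2D    3D    4D    5D    6D\n");
    for (int f : fatness) {
        std::printf("%7d", f);
        // External I/O fraction left after merging f thin nodes
        // into one fat node of a D-dimensional grid: f^(-1/D).
        for (int d = 1; d <= 6; ++d)
            std::printf("  %3.0f%%", 100.0 * std::pow((double)f, -1.0 / d));
        std::printf("\n");
    }
    return 0;
}
```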
FAT NODE I/O & GLOBAL HASHING
Applications: Crypto, Gene Splicing
Characteristic Operation: Atomic to Random System Location
Problem: Tiny Global References Never Local to Node, However Fat
Solution: Batch & Sort Updates by Target
10x Fat Nodes have 10x fewer Targets, 10x Fewer Remote Operations
Fat Targets can sort local operations for memory locality
Fat Nodes Increase Target Locality
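A sketch of the batch-and-sort idea in C++ (the net_send primitive and the hashing scheme are hypothetical; only the strategy is from the slide):

```
#include <algorithm>
#include <cstdint>
#include <vector>

struct Update { std::uint64_t key; std::uint64_t value; };

// Instead of firing one tiny remote atomic per update, accumulate
// updates locally, bucket them by owning node, and ship one batch per
// target. Fewer, fatter nodes mean fewer targets and fewer messages.
void flush_updates(std::vector<Update> &pending, int n_nodes)
{
    auto owner = [n_nodes](const Update &u) {
        return static_cast<int>(u.key % n_nodes);  // hash key -> owning node
    };
    std::sort(pending.begin(), pending.end(),
              [&](const Update &a, const Update &b) {
                  return owner(a) < owner(b);
              });

    std::size_t start = 0;
    for (std::size_t i = 1; i <= pending.size(); ++i) {
        if (i == pending.size() || owner(pending[i]) != owner(pending[start])) {
            // net_send(owner(pending[start]), &pending[start], i - start);
            // (net_send is a stand-in for whatever messaging layer exists;
            //  the receiver can sort its batch by address for locality.)
            start = i;
        }
    }
    pending.clear();
}
```

Fatter nodes shrink n_nodes, so each flush produces fewer, larger batches, and the receiving node can sort its batch by address before applying it.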
FAT NODE DIRECTION
Scaling Fat Nodes Introduces New Problems: Coherence, Wire Length, Locality
Questions:
What is a node? What matters? What can be jettisoned?
Coherency Domain? Operating System? Physical Volume?
What is optimal size for a node?
What does physics have to say?
Challenges, not answers
Fat Node Redefinition
SUMMARY
Throughput Processors Fundamentally More Efficient than Latency Processors
Heterogeneous Systems More Efficient than Homogeneous Systems
Heterogeneous LOC/TOC Transition
Bandwidth Solved (Summit/Sierra)
Watch Dependent Latency
Fat Nodes
Increase Node Memory
Increase Network Efficiency