
HETEROGENEOUS COMPUTING: CHALLENGES & DIRECTIONS

John Danskin, Vice President, GPU Architecture

Salishan 2017

TOPICS

Why Throughput-Optimized Processors are Efficient

Throughput Optimized = GPU; Latency Optimized = CPU

Why You Need Both Kinds of Processors

Why Fat Nodes are Better than Thin Nodes

A Qualitative Discussion

THROUGHPUT VS LATENCY

One Brick vs. Many Bricks

WHY THROUGHPUT IS EFFICIENT

Power = Capacitance × Voltage² × frequency

frequency ∝ Voltage (to first order, for a fixed design)

⇒ Power ∝ frequency³

⇒ Energy/Operation ∝ frequency²

Caution: Qualitative Approximations

Serial Performance Costs Power

Wide & Slow is Efficient

Implication: 1.5 GHz should be ~7x more efficient than 4 GHz
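The implication above can be checked in a few lines. This is an illustrative sketch of the slide's qualitative approximation (Energy/Op ∝ frequency²), not a real power model:

```python
# Qualitative sketch: if frequency scales with voltage, then
# Power = C * V^2 * f ∝ f^3, so Energy/Op = Power/f ∝ f^2.

def relative_energy_per_op(f_fast_ghz, f_slow_ghz):
    """Energy-per-operation ratio under the E/op ∝ f^2 approximation."""
    return (f_fast_ghz / f_slow_ghz) ** 2

ratio = relative_energy_per_op(4.0, 1.5)
print(f"4 GHz vs 1.5 GHz: ~{ratio:.1f}x energy per operation")  # ~7.1x
```

(4 / 1.5)² ≈ 7.1, matching the slide's "~7x" figure.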

WHY THROUGHPUT IS EFFICIENT

Power = Capacitance × Voltage² × frequency

Capacitance ∝ Area (!! caution)

⇒ Energy/Op ∝ Area/Op

⇒ Small Simple Cores are Efficient*

*Unless Communication Explodes

Caution: Qualitative Argument

NVIDIA GP100: 610 mm², 3840 cores
IBM POWER8: 650 mm², 12 cores × 8 SMT threads

~40x Difference in Area/Core*

*"Core" redefined: the POWER8 count treats 12 cores × 8 SMT threads as 96 cores
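The ~40x figure follows from the die areas and core counts above. A back-of-envelope check, using the footnote's redefinition of a POWER8 "core" as an SMT thread:

```python
# Back-of-envelope check of the ~40x area-per-core figure, with POWER8
# "cores" redefined as 12 cores x 8 SMT threads (per the footnote).

gp100_area_mm2, gp100_cores = 610, 3840
p8_area_mm2, p8_cores = 650, 12 * 8   # 96 "cores" after redefinition

ratio = (p8_area_mm2 / p8_cores) / (gp100_area_mm2 / gp100_cores)
print(f"Area/core ratio: ~{ratio:.0f}x")
```

This lands at roughly 43x, consistent with the slide's rounded "40x".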

GP100: Highly Evolved Throughput Engine

Not a Sea of CPUs

Mostly Computation: Tiny Caches, 14MB Register File

Single Instruction Multiple Thread (SIMT): Not Vectors; Manages Divergence

SW Coherence: Programming Model, Compilers

BASIC PATTERN OF COMPUTATION

Parallel ("Hurry Up") → Serial ("Wait") → Parallel ("Hurry Up") → Serial ("Wait")

Serial Sections:
• Parallel Resources Wait
• Resource Utilization Low
⇒ Push Latency When Serial

Parallel Sections:
• Every Unit Active
⇒ Push Efficiency When Parallel
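The cost of serial sections can be made concrete with an Amdahl's-law-style sketch (my illustration, not from the slides): the larger the parallel machine, the more time the parallel resources spend waiting on serial work.

```python
# Illustration (not from the slides): with `units` parallel units, a
# workload that is `serial_frac` serial leaves those units idle during
# serial sections, capping average resource utilization.

def utilization(serial_frac, units):
    parallel_frac = 1.0 - serial_frac
    time = serial_frac + parallel_frac / units   # normalized runtime
    return (parallel_frac / units) / time        # fraction of time units are busy

print(f"{utilization(0.10, 100):.1%}")  # 10% serial work, 100 units
```

Even 10% serial work drops utilization of 100 parallel units to under 10%, which is why the serial sections are worth accelerating with a latency-optimized processor.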

ARCHITECTURAL OPTIONS

Sea of Latency Processors: Poor Throughput

Sea of Throughput Processors: Poor Utilization

Heterogeneous Mix: Possibly Perfect, But:
• Processor Balance
• Processor Transitions
• I/O Balance

BALANCE: DGX-1 NODE FOR DEEP LEARNING

Balance: 2 CPUs, 8 GPUs, 4 NICs

Flexible Balance via I/O Fabric Switches

PCIe Limits Bandwidth

Future: Disaggregate? Ick?

BALANCE: SUMMIT NODE

2 CPUs, 6 GPUs, 2 NICs

I/O via CPU Fixes Ratio

Very High Bandwidth CPU/GPU Connection

Flexible Ratio

Future: NICs Closer to GPUs?

[Node diagram: two sockets, each with a P9 CPU (256GB DDR4), a NIC, and three Volta GPUs (16GB HBM2 each), connected by coherent NVLINK2]

LOC/TOC (Latency-Optimized Core / Throughput-Optimized Core) PROCESSING TRANSITIONS

Basic Computational Loop: Stimulus → Response

CPU - Latency (1x Ops) ↔ GPUs - Throughput (10-30x Ops): the Dependency Loop

Dependency Loop Bottlenecks:

Bandwidth: CPU Scale (1x) ⇒ Fast Fabric ⇒ See IBM Minsky

Latency: Interesting Work in Progress ⇒ Hide Dependent Latencies!

NVLINK is Fast Enough™
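The bandwidth/latency split above can be sketched with a simple round-trip model. The numbers here are hypothetical placeholders, not figures from the talk:

```python
# Hypothetical model of one CPU<->GPU dependency-loop iteration:
# time = round_trip_latency + bytes_moved / link_bandwidth.
# A faster fabric fixes the bandwidth term; the dependent-latency term
# stays, and must be hidden (overlapped or pipelined) instead.

def loop_time_us(bytes_moved, bw_gb_s, latency_us):
    transfer_us = bytes_moved / (bw_gb_s * 1e3)  # GB/s -> bytes per microsecond
    return latency_us + transfer_us

small = loop_time_us(4_096, 80.0, 10.0)       # tiny payload: latency-bound
large = loop_time_us(64_000_000, 80.0, 10.0)  # big payload: bandwidth-bound
print(f"small transfer: {small:.1f} us, large transfer: {large:.1f} us")
```

For small dependent transfers the latency term dominates completely, which is why "Hide Dependent Latencies!" remains an open problem even once the fabric is fast enough.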

FAT VS THIN NODES

                          THIN (1X)   FAT (10X)
Node Count                10X         1X
Flat Hierarchy            Y           N (but hierarchy can be hidden)
Memory per Node           1X          10X
  (Flat System Total)
NIC BW per System         100%        3D: 46%, 5D: 60%
  (Surface/Volume)
Targets for All-to-All    10X         1X
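The NIC-bandwidth row follows from a surface/volume argument: making a node 10x fatter shrinks external traffic per unit of capacity by roughly fat^(-1/d) for a d-dimensional nearest-neighbor problem. This idealized model (my sketch) reproduces the 3D entry exactly and lands near the 5D entry:

```python
# Idealized surface/volume model: a node 10x fatter in a d-dimensional
# nearest-neighbor problem needs only fat**(-1/d) of the external
# (surface) bandwidth per unit of capacity (volume).

def external_io_fraction(fat_factor, dims):
    return fat_factor ** (-1.0 / dims)

for d in (3, 5):
    print(f"{d}D: {external_io_fraction(10, d):.0%}")
```

This gives 46% for 3D and about 63% for 5D, close to the table's 46%/60%.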

FAT NODE NETWORK I/O

Some I/O Hidden Inside a Fast, Efficient Local Network

8 Thin Nodes vs. 2 Fat Nodes

% of I/O Hidden Inside Fat Nodes Depends on Dimensionality of Problem

FAT NODE I/O & DIMENSIONAL GRIDS

[Chart: external I/O requirements (0%-100%) vs. node size (1, 4, 8, 16, 32): ideal node surface (network) vs. node volume (capacity), for problems in 1-6 dimensions]

FAT NODE I/O & GLOBAL HASHING

Applications: Crypto, Gene Splicing

Characteristic Operation: Atomic to a Random System Location

Problem: Tiny Global References are Never Local to the Node, However Fat

Solution: Batch & Sort Updates by Target

10x Fat Nodes have 10x Fewer Targets, 10x Fewer Remote Operations

Fat Targets can Sort Local Operations for Memory Locality

Fat Nodes Increase Target Locality
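The batch-and-sort idea can be sketched in a few lines. This is an illustration, not the talk's implementation; `node_of` is a hypothetical global hash layout:

```python
# Sketch of "batch & sort updates by target": buffer random global
# atomics, bucket them by destination node, and ship one sorted batch
# per target instead of many tiny remote operations.

from collections import defaultdict

NUM_NODES = 4

def node_of(address):
    return address % NUM_NODES          # hypothetical address-to-node hash

def batch_updates(updates):
    """Group (address, delta) updates into one sorted batch per node."""
    batches = defaultdict(list)
    for addr, delta in updates:
        batches[node_of(addr)].append((addr, delta))
    # sort each batch by address for memory locality at the target
    return {node: sorted(b) for node, b in batches.items()}

print(batch_updates([(10, 1), (3, 1), (6, 1), (2, 1)]))
```

With fatter nodes there are fewer distinct targets, so each batch is larger and the per-message overhead amortizes better, which is the slide's locality argument.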

FAT NODE DIRECTION

Scaling Fat Nodes Introduces New Problems: Coherence, Wire Length, Locality

Questions:

What is a node? What matters? What can be jettisoned?

Coherency Domain? Operating System? Physical Volume?

What is the optimal size for a node?

What does physics have to say?

Challenges, not answers

Fat Node Redefinition

SUMMARY

Throughput Processors are Fundamentally More Efficient than Latency Processors

Heterogeneous Systems are More Efficient than Homogeneous Systems

Heterogeneous LOC/TOC Transition:
• Bandwidth Solved (Summit/Sierra)
• Watch Dependent Latency

Fat Nodes:
• Increase Node Memory
• Increase Network Efficiency
