
HETEROGENEOUS COMPUTING: CHALLENGES & DIRECTIONS

John Danskin, Vice President, GPU Architecture

Salishan 2017

TOPICS

Why Throughput-Optimized Processors are Efficient

Throughput Optimized = GPU; Latency Optimized = CPU

Why You Need Both Kinds of Processors

Why Fat Nodes are Better than Thin Nodes

A Qualitative Discussion

THROUGHPUT VS LATENCY

One Brick vs. Many Bricks

WHY THROUGHPUT IS EFFICIENT

Power = Capacitance × Voltage² × frequency

frequency ∝ Voltage (to first order, for a fixed design)

⇒ Power ∝ frequency³

⇒ Energy/Operation ∝ frequency²

Caution: Qualitative Approximations

Serial Performance Costs Power

Wide & Slow is Efficient

Implication: 1.5 GHz should be ~7x more efficient than 4 GHz
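The implication above can be checked in a few lines. This is an illustrative sketch of the slide's qualitative approximation (Energy/Op ∝ frequency²), not a real power model:

```python
# Qualitative sketch: if frequency scales with voltage, then
# Power = C * V^2 * f ∝ f^3, so Energy/Op = Power/f ∝ f^2.

def relative_energy_per_op(f_fast_ghz, f_slow_ghz):
    """Energy-per-operation ratio under the E/op ∝ f^2 approximation."""
    return (f_fast_ghz / f_slow_ghz) ** 2

ratio = relative_energy_per_op(4.0, 1.5)
print(f"4 GHz vs 1.5 GHz: ~{ratio:.1f}x energy per operation")  # ~7.1x
```

(4 / 1.5)² ≈ 7.1, matching the slide's "~7x" figure.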

WHY THROUGHPUT IS EFFICIENT

Power = Capacitance × Voltage² × frequency

Capacitance ∝ Area (!! caution)

⇒ Energy/Op ∝ Area/Op

⇒ Small Simple Cores are Efficient*

*Unless Communication Explodes

Caution: Qualitative Argument

NVIDIA GP100: 610 mm², 3840 cores
IBM POWER8: 650 mm², 12 cores × 8 SMT threads

~40x Difference in Area/Core*

*"Core" redefined: the POWER8 count treats 12 cores × 8 SMT threads as 96 cores
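The ~40x figure follows from the die areas and core counts above. A back-of-envelope check, using the footnote's redefinition of a POWER8 "core" as an SMT thread:

```python
# Back-of-envelope check of the ~40x area-per-core figure, with POWER8
# "cores" redefined as 12 cores x 8 SMT threads (per the footnote).

gp100_area_mm2, gp100_cores = 610, 3840
p8_area_mm2, p8_cores = 650, 12 * 8   # 96 "cores" after redefinition

ratio = (p8_area_mm2 / p8_cores) / (gp100_area_mm2 / gp100_cores)
print(f"Area/core ratio: ~{ratio:.0f}x")
```

This lands at roughly 43x, consistent with the slide's rounded "40x".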

GP100: Highly Evolved Throughput Engine

Not a Sea of CPUs

Mostly Computation: Tiny Caches, 14MB Register File

Single Instruction Multiple Thread (SIMT): Not Vectors; Manages Divergence

SW Coherence: Programming Model, Compilers

BASIC PATTERN OF COMPUTATION

Parallel ("Hurry Up") → Serial ("Wait") → Parallel ("Hurry Up") → Serial ("Wait")

Serial Sections:
• Parallel Resources Wait
• Resource Utilization Low
⇒ Push Latency When Serial

Parallel Sections:
• Every Unit Active
⇒ Push Efficiency When Parallel
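The cost of serial sections can be made concrete with an Amdahl's-law-style sketch (my illustration, not from the slides): the larger the parallel machine, the more time the parallel resources spend waiting on serial work.

```python
# Illustration (not from the slides): with `units` parallel units, a
# workload that is `serial_frac` serial leaves those units idle during
# serial sections, capping average resource utilization.

def utilization(serial_frac, units):
    parallel_frac = 1.0 - serial_frac
    time = serial_frac + parallel_frac / units   # normalized runtime
    return (parallel_frac / units) / time        # fraction of time units are busy

print(f"{utilization(0.10, 100):.1%}")  # 10% serial work, 100 units
```

Even 10% serial work drops utilization of 100 parallel units to under 10%, which is why the serial sections are worth accelerating with a latency-optimized processor.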

ARCHITECTURAL OPTIONS

Sea of Latency Processors: Poor Throughput

Sea of Throughput Processors: Poor Utilization

Heterogeneous Mix: Possibly Perfect, But:
• Processor Balance
• Processor Transitions
• I/O Balance

BALANCE: DGX-1 NODE FOR DEEP LEARNING

Balance: 2 CPUs, 8 GPUs, 4 NICs

Flexible Balance via I/O Fabric Switches

PCIe Limits Bandwidth

Future: Disaggregate? Ick?

BALANCE: SUMMIT NODE

2 CPUs, 6 GPUs, 2 NICs

I/O via CPU Fixes Ratio

Very High Bandwidth CPU/GPU Connection

Flexible Ratio

Future: NICs Closer to GPUs?

[Node diagram: two sockets, each with a P9 CPU (256GB DDR4), a NIC, and three Volta GPUs (16GB HBM2 each), connected by coherent NVLINK2]

LOC/TOC (Latency-Optimized Core / Throughput-Optimized Core) PROCESSING TRANSITIONS

Basic Computational Loop: Stimulus → Response

CPU - Latency (1x Ops) ↔ GPUs - Throughput (10-30x Ops): the Dependency Loop

Dependency Loop Bottlenecks:

Bandwidth: CPU Scale (1x) ⇒ Fast Fabric ⇒ See IBM Minsky

Latency: Interesting Work in Progress ⇒ Hide Dependent Latencies!

NVLINK is Fast Enough™
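The bandwidth/latency split above can be sketched with a simple round-trip model. The numbers here are hypothetical placeholders, not figures from the talk:

```python
# Hypothetical model of one CPU<->GPU dependency-loop iteration:
# time = round_trip_latency + bytes_moved / link_bandwidth.
# A faster fabric fixes the bandwidth term; the dependent-latency term
# stays, and must be hidden (overlapped or pipelined) instead.

def loop_time_us(bytes_moved, bw_gb_s, latency_us):
    transfer_us = bytes_moved / (bw_gb_s * 1e3)  # GB/s -> bytes per microsecond
    return latency_us + transfer_us

small = loop_time_us(4_096, 80.0, 10.0)       # tiny payload: latency-bound
large = loop_time_us(64_000_000, 80.0, 10.0)  # big payload: bandwidth-bound
print(f"small transfer: {small:.1f} us, large transfer: {large:.1f} us")
```

For small dependent transfers the latency term dominates completely, which is why "Hide Dependent Latencies!" remains an open problem even once the fabric is fast enough.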

FAT VS THIN NODES

                          THIN (1X)   FAT (10X)
Node Count                10X         1X
Flat Hierarchy            Y           N (but hierarchy can be hidden)
Memory per Node           1X          10X
  (Flat System Total)
NIC BW per System         100%        3D: 46%, 5D: 60%
  (Surface/Volume)
Targets for All-to-All    10X         1X
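The NIC-bandwidth row follows from a surface/volume argument: making a node 10x fatter shrinks external traffic per unit of capacity by roughly fat^(-1/d) for a d-dimensional nearest-neighbor problem. This idealized model (my sketch) reproduces the 3D entry exactly and lands near the 5D entry:

```python
# Idealized surface/volume model: a node 10x fatter in a d-dimensional
# nearest-neighbor problem needs only fat**(-1/d) of the external
# (surface) bandwidth per unit of capacity (volume).

def external_io_fraction(fat_factor, dims):
    return fat_factor ** (-1.0 / dims)

for d in (3, 5):
    print(f"{d}D: {external_io_fraction(10, d):.0%}")
```

This gives 46% for 3D and about 63% for 5D, close to the table's 46%/60%.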

FAT NODE NETWORK I/O

Some I/O Hidden Inside a Fast, Efficient Local Network

8 Thin Nodes vs. 2 Fat Nodes

% of I/O Hidden Inside Fat Nodes Depends on Dimensionality of Problem

FAT NODE I/O & DIMENSIONAL GRIDS

[Chart: external I/O requirements (0%-100%) vs. node size (1, 4, 8, 16, 32): ideal node surface (network) vs. node volume (capacity), for problems in 1-6 dimensions]

FAT NODE I/O & GLOBAL HASHING

Applications: Crypto, Gene Splicing

Characteristic Operation: Atomic to a Random System Location

Problem: Tiny Global References are Never Local to the Node, However Fat

Solution: Batch & Sort Updates by Target

10x Fat Nodes have 10x Fewer Targets, 10x Fewer Remote Operations

Fat Targets can Sort Local Operations for Memory Locality

Fat Nodes Increase Target Locality
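The batch-and-sort idea can be sketched in a few lines. This is an illustration, not the talk's implementation; `node_of` is a hypothetical global hash layout:

```python
# Sketch of "batch & sort updates by target": buffer random global
# atomics, bucket them by destination node, and ship one sorted batch
# per target instead of many tiny remote operations.

from collections import defaultdict

NUM_NODES = 4

def node_of(address):
    return address % NUM_NODES          # hypothetical address-to-node hash

def batch_updates(updates):
    """Group (address, delta) updates into one sorted batch per node."""
    batches = defaultdict(list)
    for addr, delta in updates:
        batches[node_of(addr)].append((addr, delta))
    # sort each batch by address for memory locality at the target
    return {node: sorted(b) for node, b in batches.items()}

print(batch_updates([(10, 1), (3, 1), (6, 1), (2, 1)]))
```

With fatter nodes there are fewer distinct targets, so each batch is larger and the per-message overhead amortizes better, which is the slide's locality argument.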

FAT NODE DIRECTION

Scaling Fat Nodes Introduces New Problems: Coherence, Wire Length, Locality

Questions:

What is a node? What matters? What can be jettisoned?

Coherency Domain? Operating System? Physical Volume?

What is the optimal size for a node?

What does physics have to say?

Challenges, not answers

Fat Node Redefinition

SUMMARY

Throughput Processors are Fundamentally More Efficient than Latency Processors

Heterogeneous Systems are More Efficient than Homogeneous Systems

Heterogeneous LOC/TOC Transition:
• Bandwidth Solved (Summit/Sierra)
• Watch Dependent Latency

Fat Nodes:
• Increase Node Memory
• Increase Network Efficiency
