Vanguard TTI 12/6/11

Page 1:

Vanguard TTI 12/6/11

Page 2:

Q: What do these have in common?

Tianhe-1A: ~4 PFLOPS peak, 2.5 PFLOPS sustained, ~7,000 NVIDIA GPUs, about 5 MW

3G smartphone: baseband processing 10 GOPS; applications processing 1 GOPS and increasing; power limit of 300 mW

Page 3:

?

Page 4:

Both are Based on NVIDIA Chips

Fermi: 3 × 10⁹ transistors, 512 cores

Tegra-2 (T20): 3 ARM cores, GPU, audio, video, etc.

Page 5:

More Fundamentally

Both

are power limited

get performance from parallelism

need 100x performance increase in 10 years

Page 6:

100x performance in 10 years, Moore’s Law will take care of that, right?

Page 7:

100x performance in 10 years, Moore’s Law will take care of that, right?

Wrong!

Page 8:

Moore’s Law gives us transistors, which we used to turn into scalar performance

Moore, Electronics 38(8) April 19, 1965

Page 9:

But ILP was ‘mined out’ in 2000

[Figure: single-thread performance in ps/instruction, log scale from 1e-4 to 1e+7, plotted 1980 to 2020; trend segments labeled 52%/year, 74%/year, and 19%/year, with cumulative gaps annotated 30:1, 1,000:1, and 30,000:1]

Dally et al. “The Last Classical Computer”, ISAT Study, 2001

Page 10:

And L³ energy scaling ended in 2005

Moore, ISSCC Keynote, 2003

Page 11:

Result: The End of Historic Scaling

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Page 12:

Historic scaling is at an end!

To continue performance scaling of all sizes of computer systems requires addressing two challenges:

Power and Parallelism

Much of the economy depends on this

Page 13:

The Power Challenge

Page 14:

In the past we had constant-field scaling:

L' = L/2, V' = V/2

E' = CV² = E/8, f' = 2f

D' = 1/L² = 4D, P' = P

Halve L and get 8× the capability for the same power.
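Writing out the algebra behind those numbers (my arithmetic, not an extra slide), with capacitance scaling as $C' = C/2$:

$E' = C'V'^2 = \tfrac{C}{2}\left(\tfrac{V}{2}\right)^2 = \tfrac{E}{8}, \qquad f' = 2f, \qquad D' = \tfrac{1}{L'^2} = 4D$

$P' = D' \, E' \, f' = 4 \cdot \tfrac{1}{8} \cdot 2 \cdot P = P$

Each halving of L delivered 4× the devices at 2× the frequency, 8× the capability, in the same power envelope.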

Page 15:

Now voltage is held nearly constant:

L' = L/2, V' = V

E' = CV² = E/2, f' = 2f*

D' = 1/L² = 4D, P' = 4P

Halve L and get 2× the capability for the same power, in ¼ the area.

*f no longer scales as 1/L, but it doesn't matter: we couldn't power it if it did.
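The same algebra with voltage held constant (again my arithmetic, not a slide):

$E' = C'V^2 = \tfrac{C}{2}V^2 = \tfrac{E}{2}, \qquad f' = 2f, \qquad D' = 4D$

$P' = D' \, E' \, f' = 4 \cdot \tfrac{1}{2} \cdot 2 \cdot P = 4P$

Holding power constant instead of letting it quadruple, only 2× the operations per second fit in the budget, since energy per operation now only halves.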

Page 16:

Performance = Efficiency

Efficiency = Locality

Page 17:

Locality

Page 18:

The High Cost of Data Movement: fetching operands costs more than computing on them

[Figure: energy per operation on a 20 mm die in a 28 nm process: 64-bit DP arithmetic 20 pJ; 256-bit access to an 8 kB SRAM 50 pJ; 256-bit on-chip bus traversals 26 pJ and 256 pJ, rising to 1 nJ; efficient off-chip link 500 pJ; DRAM read/write 16 nJ]
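Two ratios from the figure's numbers (my arithmetic, not a slide) sharpen the point:

$\frac{E_{\text{DRAM}}}{E_{\text{DP op}}} = \frac{16\,\text{nJ}}{20\,\text{pJ}} = 800, \qquad \frac{E_{\text{off-chip}}}{E_{\text{DP op}}} = \frac{500\,\text{pJ}}{20\,\text{pJ}} = 25$

An algorithm therefore needs hundreds of operations per DRAM access before arithmetic, rather than data movement, dominates the energy bill.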

Page 19:

Scaling makes locality even more important

Page 20:

It's not about the FLOPS.

It's about data movement.

Algorithms should be designed to perform more work per unit data movement.

Programming systems should further optimize this data movement.

Architectures should facilitate this by providing an exposed hierarchy and efficient communication.
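One hedged illustration of "more work per unit data movement" (a CUDA sketch of mine, not code from the talk): a tiled matrix multiply that stages tiles in shared memory so each value fetched from DRAM is reused TILE times. It assumes n is a multiple of TILE and a grid of (n/TILE) × (n/TILE) blocks of TILE × TILE threads.

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];   // on-chip staging, SRAM-class energy
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // One DRAM fetch per element per tile step...
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        // ...amortized over TILE multiply-adds served from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

Raising TILE raises the work done per DRAM word moved, which by the energy figures above is where the power goes.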

Page 21:

System Sketch

Page 22:

Echelon Chip Floorplan

[Floorplan: an array of SMs (each SM built from multiple lanes) grouped around NOCs, with LOCs alongside; L2 banks and XBAR in the center; DRAM I/O and NW I/O at the die edges; 17 mm on a side, 290 mm², 10 nm process]

Page 23:

Overhead

Page 24:

An out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add): the scheduling overhead is 80× the arithmetic it schedules, and 4,000× for the add.

Page 25:

SM Lane Architecture

[Figure: SM lane architecture. Datapath: FP/Int units and an LS/BR unit fed by operand register files (ORFs) and a main RF, with L0/L1 address paths and a net port to LD/ST and LM banks 0-3. Control path: L0 I$, thread PCs, active PCs, and a scheduler issuing instructions.]

64 threads, 4 active threads; 2 DFMAs (4 FLOPS/clock); ORF bank: 16 entries (128 bytes); L0 I$: 64 instructions (1 KB); LM bank: 8 KB (32 KB total)

Page 26:

Solving the Power Challenge – 1, 2, 3

Page 27:

Solving the ExaScale Power Problem

Page 28:

More Fundamentally

Both

are power limited

get performance from parallelism

need 100x performance increase in 10 years

Page 29:

Parallelism

Page 30:

Parallel programming is not inherently any more difficult than serial programming

However, we can make it a lot more difficult

Page 31:

A simple parallel program

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Page 32:

Why is this easy?

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

No machine details. All parallelism is expressed. Synchronization is semantic (in the reduction).
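For concreteness, here is one hedged mapping of that pseudocode onto a CUDA thread array (my sketch; Molecule, MAX_NEIGHBORS, NUM_FORCE_TERMS, and force_term are illustrative names, not from the talk). The outer forall becomes the launch; the reduction stays private to each thread, so no locks appear:

#define MAX_NEIGHBORS   8
#define NUM_FORCE_TERMS 2

struct Molecule {
    float x, y, z, force;
    int   neighbors[MAX_NEIGHBORS];
    int   num_neighbors;
};

// Illustrative pair term; a real code evaluates physical potentials here.
__device__ float force_term(int f, const Molecule& a, const Molecule& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return (f + 1) * rsqrtf(dx * dx + dy * dy + dz * dz + 1e-6f);
}

__global__ void compute_forces(Molecule* mols, int n) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per molecule
    if (m >= n) return;
    float sum = 0.0f;                               // private reduction
    for (int j = 0; j < mols[m].num_neighbors; ++j)           // nested forall
        for (int f = 0; f < NUM_FORCE_TERMS; ++f)             // doubly nested
            sum += force_term(f, mols[m], mols[mols[m].neighbors[j]]);
    mols[m].force = sum;
}

// Launch the thread array: compute_forces<<<(n + 255) / 256, 256>>>(d_mols, n);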

Page 33:

We could make it hard

pid = fork() ; // explicitly managing threads

lock(struct.lock) ;   // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock) ;

code = send(pid, tag, &msg) ; // partition across nodes

Page 34:

Programmers, tools, and architecture need to play their positions

Programmer

Architecture

Tools

Page 35:

Programmers, tools, and architecture need to play their positions

Programmer: algorithm; all of the parallelism; abstract locality

Architecture: fast mechanisms; exposed costs

Tools: combinatorial optimization; mapping; selection of mechanisms

Page 36:

Programmers, tools, and architecture need to play their positions

Programmer:

forall molecule in set {                    // launch a thread array
  forall neighbor in molecule.neighbors {   // nested forall
    forall force in forces {                // doubly nested
      molecule.force = reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools: map foralls in time and space; map molecules across memories; stage data up/down the hierarchy; select mechanisms

Architecture: exposed storage hierarchy; fast comm/sync/thread mechanisms

Page 37:

Fundamental and Incidental Obstacles to Programmability

Fundamental:

Expressing 10⁹-way parallelism

Expressing locality to deal with >100:1 global:local energy

Balancing load across 10⁹ cores

Incidental:

Dealing with multiple address spaces

Partitioning data across nodes

Aggregating data to amortize message overhead

Page 38:

The fundamental problems are hard enough. We must eliminate the incidental ones.

Page 39:

Execution Model

[Figure: execution model. Threads and objects (A, B) live in a global address space with an abstract memory hierarchy; threads send active messages to one another, perform loads/stores, and do bulk transfers up and down the hierarchy.]

Page 40:

Thread array creation, messages, block transfers, collective operations – at the “speed of light”

Page 41:

Fermi

Hardware thread-array creation

Fast __syncthreads()

Shared memory
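All three mechanisms are visible directly in CUDA; a minimal sketch of mine (not from the slides) that launches a thread array, reduces in shared memory, and synchronizes with the fast barrier:

__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];          // shared memory, allocated per block
    int i = threadIdx.x;
    buf[i] = in[blockIdx.x * 256 + i];
    __syncthreads();                    // fast barrier across the thread array
    for (int s = 128; s > 0; s >>= 1) { // tree reduction in shared memory
        if (i < s) buf[i] += buf[i + s];
        __syncthreads();
    }
    if (i == 0) out[blockIdx.x] = buf[0];
}

// Hardware thread-array creation happens at launch:
// block_sum<<<n_blocks, 256>>>(d_in, d_out);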

Page 42:

Scalar ISAs don’t matter

Parallel ISAs – the mechanisms for threads, communication, and synchronization make a huge difference.

Page 43:

Abstract description of Locality – not mapping

compute_forces::inner(molecules, forces) {
  tunable N ;                                // partition count, picked by the autotuner
  set part_molecules[N] ;
  part_molecules = subdivide(molecules, N) ;

  forall(i in 0:N-1) {
    compute_forces(part_molecules[i]) ;
  }
}

Page 44:

Abstract description of Locality – not mapping

compute_forces::inner(molecules, forces) {
  tunable N ;                                // partition count, picked by the autotuner
  set part_molecules[N] ;
  part_molecules = subdivide(molecules, N) ;

  forall(i in 0:N-1) {
    compute_forces(part_molecules[i]) ;
  }
}

Autotuner picks number and size of partitions - recursively

No need to worry about “ghost molecules”: with a global address space, it just works.

Page 45:

Autotuning Search Spaces

T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle. Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation. In IEEE PACT, pages 237-248, 2000.

[Figure: execution time of matrix multiplication as a function of unroll factor and tile size]

Architecture enables simple and effective autotuning
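A minimal sketch of the search loop such an autotuner runs (my code; mapped_kernel is a stand-in for the computation launched with a given value of the tunable, not an API from the talk):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real computation mapped with partition count n.
__global__ void mapped_kernel(int n) { }

static float time_config(int n_partitions) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    mapped_kernel<<<n_partitions, 128>>>(n_partitions);  // run with this mapping
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);              // measured, not modeled
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    int best_n = 0;
    float best_ms = 1e30f;
    for (int n = 2; n <= 1024; n *= 2) {  // sweep the tunable, e.g. N partitions
        float ms = time_config(n);
        if (ms < best_ms) { best_ms = ms; best_n = n; }
    }
    printf("autotuner picked N = %d (%.3f ms)\n", best_n, best_ms);
    return 0;
}

Real systems search several tunables at once (the tile sizes and unroll factors above), but the loop is the same shape.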

Page 46:

Performance of Auto-tuner

                    Conv2D   SGEMM   FFT3D   SUmb
Cell          Auto   96.4     129     57      10.5
              Hand   85       119     54      –
Cluster       Auto   26.7     91.3    5.5     1.65
              Hand   24       90      5.5     –
Cluster of    Auto   19.5     32.4    0.55    0.49
PS3s          Hand   19       30      0.23    –

Measured Raw Performance of Benchmarks: auto-tuner vs. hand-tuned version in GFLOPS.

For FFT3D, performance is with fusion of leaf tasks.

SUmb is too complicated to be hand-tuned.

Page 47:

More Fundamentally

Both

are power limited

get performance from parallelism

need 100x performance increase in 10 years

Page 48:

A Prescription

Page 49:

Research

Need a research vehicle (an experimental system) to co-design architecture, programming system, and applications

Productive parallel programming: express all the parallelism and locality; let the compiler and run-time map it to the target machine; leverage an existing ecosystem

Mechanisms for threads, comm, and sync: eliminate ‘incidental’ programming issues; enable fine-grain execution

Power: locality (an exposed memory hierarchy and the software to use it) and overhead (move scheduling to the compiler)

Others are investing; if we don't invest, we will be left behind.

Page 50:

Education

We need parallel programmers, but we are training serial programmers and serial thinkers.

Parallelism belongs throughout the CS curriculum:

Programming

Algorithms: parallel algorithms, with analysis focused on communication rather than counting ops

Systems: models need to include locality

Page 51:

A Bright Future from Supercomputers to Cellphones

Eliminate overhead and exploit locality to get 100x power efficiency

Easy parallelism with a coordinated team

Programmer

Tools

Architecture

100x performance increase in 10 years

Page 52:

Page 53:

Granularity

Page 54:

#Threads increasing faster than problem size.

Page 55:

Number of Threads increasing faster than problem size

Page 56:

Number of Threads increasing faster than problem size

Weak Scaling

Page 57:

Number of Threads increasing faster than problem size

Weak Scaling

Strong Scaling

Page 58:

Smaller sub-problem per thread

Page 59:

Smaller sub-problem per thread

Page 60:

Smaller sub-problem per thread

More frequent comm, sync, and thread operations

Page 61:

Smaller sub-problem per thread

More frequent comm, sync, and thread operations

Page 62:

This fine-grain parallelism is multi-level and irregular

Page 63:

To support this requires fast mechanisms for:

Thread arrays: create, terminate, suspend, resume; hardware allocation of resources (threads, registers, shared memory) to a thread array, with locality

Communication: data movement up and down the hierarchy; fast active messages (message-driven computing)

Synchronization: collective operations (e.g., barrier, reduce) and pairwise producer-consumer (a sketch of the latter follows below)
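As a concrete instance of the pairwise case, a hedged CUDA sketch of mine (not from the talk): producer-consumer synchronization through a flag, with no global barrier. It assumes the two threads sit in different warps of one block so the consumer's spin cannot starve the producer.

#include <cstdio>

__global__ void pair_sync() {
    __shared__ volatile int   flag;
    __shared__ volatile float mailbox;
    if (threadIdx.x == 0) flag = 0;
    __syncthreads();

    if (threadIdx.x == 0) {            // producer (warp 0)
        mailbox = 42.0f;
        __threadfence_block();         // make the data visible first...
        flag = 1;                      // ...then signal
    } else if (threadIdx.x == 32) {    // consumer (warp 1)
        while (flag == 0) { }          // pairwise wait; nobody else blocks
        printf("consumer got %f\n", mailbox);
    }
}

int main() {
    pair_sync<<<1, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}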

Page 64:

Execution Model

[Figure: execution model. Threads and objects (A, B) live in a global address space with an abstract memory hierarchy; threads send active messages to one another, perform loads/stores, and do bulk transfers up and down the hierarchy.]

Page 65:

J-Machine Speedup with Strong Scaling

Noakes et al. “The J-Machine Multicomputer: an Architectural Evaluation”, ISCA, 1993, pp.224-235

Page 66:

J-Machine Speedup with Strong Scaling

Noakes et al. “The J-Machine Multicomputer: an Architectural Evaluation”, ISCA, 1993, pp.224-235

2 characters per node