Vanguard TTI, 12/6/11
Q: What do these have in common?
Tianhe-1A: ~4 PFLOPS peak, 2.5 PFLOPS sustained, ~7,000 NVIDIA GPUs, about 5 MW
3G smart phone: baseband processing 10 GOPS, applications processing 1 GOPS and increasing, power limit of 300 mW
?
Both are Based on NVIDIA Chips
Fermi: 3 × 10⁹ transistors, 512 cores
Tegra-2 (T20): 3 ARM cores, GPU, audio, video, etc.
More Fundamentally
Both
are power limited
get performance from parallelism
need 100x performance increase in 10 years
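The 100x target implies a compound growth rate far above what post-2000 single-thread scaling delivers; a one-line check:

```python
# 100x in 10 years requires a compound growth rate of 100**(1/10),
# i.e. roughly 58% per year -- about triple the ~19%/year that
# single-thread performance has grown since 2000.
rate = 100 ** (1 / 10)
print(f"{(rate - 1) * 100:.0f}% per year")
```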
100x performance in 10 years, Moore’s Law will take care of that, right?
100x performance in 10 years, Moore’s Law will take care of that, right?
Wrong!
Moore’s Law gives us transistors, which we used to turn into scalar performance.
Moore, Electronics 38(8) April 19, 1965
But ILP was ‘mined out’ in 2000
[Figure: single-thread performance, Perf (ps/Inst) with a linear trend line, on a log scale from 1e-4 to 1e+7 over 1980–2020. Growth of 52%/year, peaking at 74%/year, falls to 19%/year after 2000; the diverging trends compound to gaps of 30:1, 1,000:1, and 30,000:1.]
Dally et al. “The Last Classical Computer”, ISAT Study, 2001
And L³ energy scaling ended in 2005
Moore, ISSCC Keynote, 2003
Result: The End of Historic Scaling
C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
Historic scaling is at an end!
To continue performance scaling of all sizes of computer systems requires addressing two challenges:
Power and Parallelism
Much of the economy depends on this
The Power Challenge
In the past we had constant-field scaling:
L' = L/2, V' = V/2
E' = CV² = E/8, f' = 2f
D' = 1/L² = 4D, P' = P
Halve L and get 8× the capability for the same power.
Now voltage is held nearly constant:
L' = L/2, V' = V
E' = CV² = E/2, f' = 2f*
D' = 1/L² = 4D, P' = 4P
Halve L and get 2× the capability for the same power, in ¼ the area.
*f is no longer scaling as 1/L, but it doesn’t matter: we couldn’t power it if it did.
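The two scaling regimes above can be checked with a toy model (the helper function and its factor arguments are mine, not from the talk):

```python
# Toy model of the scaling rules on the two slides above.
# Constant-field (Dennard) scaling: L' = L/2, V' = V/2.
# Constant-voltage (post-Dennard) scaling: L' = L/2, V' = V.
def scale(L_factor, V_factor, f_factor):
    C = L_factor              # capacitance scales with L
    E = C * V_factor ** 2     # switching energy E = C * V^2
    D = 1 / L_factor ** 2     # density: devices per unit area
    P = D * E * f_factor      # power density: devices x energy x frequency
    return E, D, P

# Constant-field: halve L and V, double f
print(scale(0.5, 0.5, 2))   # E' = E/8, D' = 4D, P' = P

# Constant-voltage: halve L, hold V, double f
print(scale(0.5, 1.0, 2))   # E' = E/2, D' = 4D, P' = 4P
```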
Performance = Efficiency
Efficiency = Locality
Locality
The High Cost of Data Movement: fetching operands costs more than computing on them
[Figure: energy cost of data movement on a 20mm chip at 28nm. A 64-bit DP operation: 20 pJ. A 256-bit access to an 8 kB SRAM: 50 pJ. Moving 256 bits over on-chip buses: 26 pJ to 256 pJ depending on distance, up to 1 nJ across the chip. An efficient off-chip link: 500 pJ. A DRAM read/write: 16 nJ.]
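One way to read the figure above is as ratios against the cost of arithmetic; a quick check (energy values from the slide, the dictionary and labels are my own framing):

```python
# Energy per operation at 28nm, in picojoules (values from the slide above)
ENERGY_PJ = {
    "64-bit DP op": 20,
    "256-bit SRAM access (8 kB)": 50,
    "256-bit on-chip bus, chip-scale": 1000,   # 1 nJ
    "efficient off-chip link": 500,
    "DRAM read/write": 16000,                  # 16 nJ
}

flop = ENERGY_PJ["64-bit DP op"]
for op, pj in ENERGY_PJ.items():
    print(f"{op}: {pj} pJ = {pj / flop:.0f}x a DP op")
# A DRAM access costs 800x a double-precision operation, so an
# algorithm must do many operations per operand fetched from DRAM.
```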
Scaling makes locality even more important
It’s not about the FLOPS.
It’s about data movement.
Algorithms should be designed to perform more work per unit data movement.
Programming systems should further optimize this data movement.
Architectures should facilitate this by providing an exposed hierarchy and efficient communication.
System Sketch
Echelon Chip Floorplan
[Figure: 17mm die in a 10nm process, 290mm². L2 banks and a crossbar (XBAR) at the center; a tiled array of SMs, each built from lanes, grouped under NOC routers, with LOC (latency-optimized core) blocks interspersed; DRAM I/O and NW I/O ringing the perimeter.]
Overhead
An Out-of-Order Core spends 2 nJ to schedule a 25 pJ FMUL (or a 0.5 pJ integer add)
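The slide’s numbers imply two to four orders of magnitude of scheduling overhead; a quick check (variable names are mine):

```python
# Scheduling overhead of an out-of-order core (numbers from the slide above)
schedule_nj = 2.0        # energy to schedule one instruction
fmul_nj = 0.025          # 25 pJ floating-point multiply
int_add_nj = 0.0005      # 0.5 pJ integer add

# Scheduling costs ~80x the FMUL it issues, ~4000x an integer add
print(round(schedule_nj / fmul_nj), round(schedule_nj / int_add_nj))
```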
SM Lane Architecture
[Figure: one SM lane. Control path: scheduler, thread PCs / active PCs, and an L0 I$ feeding instructions. Data path: ORFs feeding FP/Int, FP/Int, and LS/BR units; RF, L0/L1 address generation, and a network port connecting LM banks 0–3 to the LD/ST path.]
64 threads, 4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 bytes)
L0 I$: 64 instructions (1 KByte)
LM bank: 8 KB (32 KB total)
Solving the Power Challenge – 1, 2, 3
Solving the ExaScale Power Problem
More Fundamentally
Both
are power limited
get performance from parallelism
need 100x performance increase in 10 years
Parallelism
Parallel programming is not inherently any more difficult than serial programming
However, we can make it a lot more difficult
A simple parallel program
forall molecule in set {                    // launch a thread array
    forall neighbor in molecule.neighbors { // nested
        forall force in forces {            // doubly nested
            molecule.force = reduce_sum(force(molecule, neighbor))
        }
    }
}
Why is this easy?
forall molecule in set {                    // launch a thread array
    forall neighbor in molecule.neighbors { // nested
        forall force in forces {            // doubly nested
            molecule.force = reduce_sum(force(molecule, neighbor))
        }
    }
}
No machine details.
All parallelism is expressed.
Synchronization is semantic (in the reduction).
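A minimal sketch of the same pattern in ordinary Python, assuming a made-up pairwise force and toy data (the loop-and-reduce structure, not the physics, is the point):

```python
# Sketch of the forall pattern above: all parallelism is expressed as
# loops over the data, and the only synchronization is the reduction.
# The force model and the data are illustrative, not from the talk.

def force(molecule, neighbor):
    # placeholder pairwise force: inverse square of the separation
    return 1.0 / (molecule - neighbor) ** 2

molecules = {1.0: [2.0, 4.0], 2.0: [1.0, 4.0]}  # position -> neighbors

net_force = {}
for m, neighbors in molecules.items():          # forall molecule in set
    net_force[m] = sum(                         # reduce_sum
        force(m, n) for n in neighbors          # forall neighbor
    )
print(net_force)
```

Every iteration is independent, so a runtime is free to map the outer loop onto a thread array and the reduction onto hardware collectives.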
We could make it hard
pid = fork() ; // explicitly managing threads
lock(struct.lock) ;   // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock) ;
code = send(pid, tag, &msg) ; // partition across nodes
Programmers, tools, and architecture need to play their positions
Programmer
Architecture
Tools
Programmers, tools, and architecture need to play their positions
Programmer
Architecture
Tools
Programmer: algorithm, all of the parallelism, abstract locality
Architecture: fast mechanisms, exposed costs
Tools: combinatorial optimization, mapping, selection of mechanisms
Programmers, tools, and architecture need to play their positions
Programmer
Architecture
Tools
forall molecule in set {                    // launch a thread array
    forall neighbor in molecule.neighbors { // nested
        forall force in forces {            // doubly nested
            molecule.force = reduce_sum(force(molecule, neighbor))
        }
    }
}
Tools: map foralls in time and space; map molecules across memories; stage data up/down the hierarchy; select mechanisms
Architecture: exposed storage hierarchy; fast comm/sync/thread mechanisms
Fundamental and Incidental Obstacles to Programmability
Fundamental:
Expressing 10⁹-way parallelism
Expressing locality to deal with >100:1 global:local energy
Balancing load across 10⁹ cores
Incidental:
Dealing with multiple address spaces
Partitioning data across nodes
Aggregating data to amortize message overhead
The fundamental problems are hard enough. We must eliminate the incidental ones.
Execution Model
[Figure: threads and objects in a global address space over an abstract memory hierarchy. Thread A sends an active message to thread B; threads reach objects via load/store; bulk transfers move data through the hierarchy.]
Thread array creation, messages, block transfers, collective operations – at the “speed of light”
Fermi
Hardware thread-array creation
Fast __syncthreads()
Shared memory
Scalar ISAs don’t matter
Parallel ISAs – the mechanisms for threads, communication, and synchronization make a huge difference.
Abstract description of Locality – not mapping
compute_forces::inner(molecules, forces) {
    tunable N ;
    set part_molecules[N] ;
    part_molecules = subdivide(molecules, N) ;

    forall(i in 0:N-1) {
        compute_forces(part_molecules[i]) ;
    }
}
Abstract description of Locality – not mapping
compute_forces::inner(molecules, forces) {
    tunable N ;
    set part_molecules[N] ;
    part_molecules = subdivide(molecules, N) ;

    forall(i in 0:N-1) {
        compute_forces(part_molecules[i]) ;
    }
}
Autotuner picks number and size of partitions - recursively
No need to worry about “ghost molecules” with global address space, it just works
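What “the autotuner picks N” means can be sketched with a toy cost model (entirely illustrative: the cost function, cache size, and search range are my assumptions, not the real tuner):

```python
# Toy autotuner in the spirit of the slide: evaluate candidate values
# of the tunable N, score each mapping, keep the best. A real system
# times actual runs and searches recursively at each hierarchy level.

def run_with_partitions(n_parts, work=1_000_000, cache_lines=1024):
    # toy cost: fixed per-partition overhead, plus a penalty whenever a
    # partition overflows the "cache" (favoring locality, as in the talk)
    per_part = work / n_parts
    overflow = max(0.0, per_part - cache_lines)
    return n_parts * 10 + overflow * n_parts * 0.01

best_n = min(range(1, 4097), key=run_with_partitions)
print(best_n)   # the largest N whose partitions still fit the cache
```

The search balances per-partition overhead against locality, which is exactly the trade-off the tunable abstracts away from the programmer.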
Autotuning Search Spaces
T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle, “Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation,” IEEE PACT, pp. 237–248, 2000.
[Figure: execution time of matrix multiplication for unrolling and tiling.]
Architecture enables simple and effective autotuning
Performance of Auto-tuner
|                 | Conv2D | SGEMM | FFT3D | SUmb |
|-----------------|--------|-------|-------|------|
| Cell (Auto)     | 96.4   | 129   | 57    | 10.5 |
| Cell (Hand)     | 85     | 119   | 54    | —    |
| Cluster (Auto)  | 26.7   | 91.3  | 5.5   | 1.65 |
| Cluster (Hand)  | 24     | 90    | 5.5   | —    |
| Cluster of PS3s (Auto) | 19.5 | 32.4 | 0.55 | 0.49 |
| Cluster of PS3s (Hand) | 19   | 30   | 0.23 | —    |
Measured Raw Performance of Benchmarks: auto-tuner vs. hand-tuned version in GFLOPS.
For FFT3D, performance is with fusion of leaf tasks.
SUmb is too complicated to be hand-tuned.
More Fundamentally
Both
are power limited
get performance from parallelism
need 100x performance increase in 10 years
A Prescription
Research
Need a research vehicle (experimental system): co-design architecture, programming system, and applications.
Productive parallel programming: express all the parallelism and locality; compiler and run-time map to the target machine; leverage an existing eco-system.
Mechanisms for threads, comm, sync: eliminate ‘incidental’ programming issues; enable fine-grain execution.
Power: locality (an exposed memory hierarchy and software to use it); overhead (move scheduling to the compiler).
Others are investing; if we don’t invest, we will be left behind.
Education
We need parallel programmers, but we are training serial programmers and serial thinkers.
Parallelism belongs throughout the CS curriculum:
Programming
Algorithms: parallel algorithms, with analysis focused on communication, not on counting ops
Systems: models need to include locality
A Bright Future from Supercomputers to Cellphones
Eliminate overhead and exploit locality to get 100x power efficiency
Easy parallelism with a coordinated team
Programmer
Tools
Architecture
100x performance increase in 10 years
Granularity
#Threads increasing faster than problem size.
Number of Threads increasing faster than problem size
Number of Threads increasing faster than problem size
Weak Scaling
Number of Threads increasing faster than problem size
Weak Scaling, Strong Scaling
Smaller sub-problem per thread
Smaller sub-problem per thread
Smaller sub-problem per thread
More frequent comm, sync, and thread operations
Smaller sub-problem per thread
More frequent comm, sync, and thread operations
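The weak- vs. strong-scaling point can be made concrete with illustrative numbers (the sizes below are mine):

```python
# Grain size per thread under weak vs. strong scaling (toy numbers).
def grain(problem_size, threads):
    return problem_size / threads

base = grain(1_000_000, 1_000)       # 1000 elements per thread

# Weak scaling: problem grows with the machine -> grain stays constant
weak = grain(10_000_000, 10_000)

# Strong scaling: same problem on 10x the threads -> grain shrinks 10x,
# so comm, sync, and thread operations happen 10x more often
strong = grain(1_000_000, 10_000)

print(base, weak, strong)   # 1000.0 1000.0 100.0
```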
This fine-grain parallelism is multi-level and irregular
To support this requires fast mechanisms for:
Thread arrays: create, terminate, suspend, resume; hardware allocation of resources (threads, registers, shared memory) to a thread array, with locality.
Communication: data movement up and down the hierarchy; fast active messages (message-driven computing).
Synchronization: collective operations (e.g., barrier, reduce); pairwise (producer-consumer).
J-Machine Speedup with Strong Scaling
Noakes et al. “The J-Machine Multicomputer: an Architectural Evaluation”, ISCA, 1993, pp.224-235
2 characters per node