TRANSCRIPT
-
Algorithms/Architecture Co-design for Exascale Computer
Stanislav Sedukhin The University of Aizu
-
Outline
! Scalar Fused Multiply-Add operation as a workhorse of current scientific computing
! Current State-of-the-Art and Historical Observation (review of 50+ single-chip uProcessors with FMA units)
! Tree- and Torus-structured Machines
! Arithmetic (Scalar Multiply-Add) → Algebra (Matrix Multiply-Add): Algebraic Path Problem
! Design Space of Extremely-scalable GEneral Matrix Multiply-add (GEMM) Algorithms/Architecture (best algorithm/architecture selection)
! GEMM-based Orbital Algorithms and Unified Architecture for Multidimensional Data Processing
! Conclusion
9/24/2014 2 © S. Sedukhin, University of Aizu
-
9/24/2014 3
SCALAR FUSED MULTIPLY-ADD OPERATION AS A WORKHORSE OF CURRENT SCIENTIFIC COMPUTING
© S. Sedukhin, University of Aizu
-
It is Time to Rethink Computing

Cost of Computing:
| Date | US$ (2013) per GFLOP/S | Computer |
| 1961 | 0.8 × 10^12 | IBM 1620 (x 10^6 @ $64,000 each) |
| 1984 | 42.8 × 10^6 | Cray X-MP/48 |
| 1997 | 42 × 10^3 | Beowulf Pentium |
| 2000 | 836 | KLAT2 Athlon |
| 2003 | 100 | KASY0 Athlon |
| 2007 | 52 | Microwulf |
| 2011 | 1.80 | HPU4Science |
| 2013 | 0.22 | Sony PlayStation 4 |
Source: http://en.wikipedia.org/wiki/FLOPS

Before:
• Reduction of expensive arithmetic operations (mostly multiplication) via fast algorithms:
  – FFT (Cooley-Tukey),
  – MMA (Strassen, Winograd, …)
• Algorithm Serialization
• Avoiding storage of, and multiplication by, 0:
  – Sparse Algorithms

Now and Later On:
• Minimization of expensive data movement is more important than reduction of cheap operations.
• Deeply parallelize the original (with regular data access), not the fast, algorithm.
• "Go ahead, multiply by zero!" (Dr. J. Gustafson)
• Time to Compute Sparse as Dense

9/24/2014 4
© S. Sedukhin, University of Aizu
-
Scalar Fused Multiply-Add (FMA)

Fused Multiply-Add (FMA): (3-read & 1-write) Floating-Point Unit
[Figure: FMA unit connected to a register file holding the hardwired constants r0 = 0.0 and r1 = 1.0 plus general registers r2, r3, …, rn]

• two (SP/DP) floating-point scalar operations (× and +) in a single cycle (2 FLOPs/cycle);
• improved accuracy due to only one final rounding;
• a few (3÷6) cycles of latency;
• standard "addition" and "multiplication" by using the hardwired constants 0.0 and 1.0;
• allows an efficient software implementation of division and square root;
• use of the FMA operation results in the practical "disappearance" of the four basic arithmetic operations: add, subtract, multiply, divide.

c ← FMA(a, b, c) = a × b + c:
  a × b, when c = 0.0 : multiplication;
  a (or b) + c, when b (or a) = 1.0 : addition;
  a (or b), when b (or a) = 1.0 and c = 0.0 : copy.
© S. Sedukhin, University of Aizu
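The special cases above can be checked directly with the C standard library's fma(), which computes a·b + c with a single rounding; a minimal sketch (the r0/r1 constants are only illustrative stand-ins for the hardwired register values):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const double r0 = 0.0, r1 = 1.0;   /* hardwired constants of the register file */
    double a = 2.5, b = 4.0, c = 3.0;

    double mul  = fma(a, b, r0);       /* a*b + 0.0   -> multiplication */
    double add  = fma(a, r1, c);       /* a*1.0 + c   -> addition       */
    double copy = fma(a, r1, r0);      /* a*1.0 + 0.0 -> copy           */
    double full = fma(a, b, c);        /* a*b + c with one final rounding */

    printf("mul=%g add=%g copy=%g fma=%g\n", mul, add, copy, full);
    return 0;
}
```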
-
9/24/2014 6
Processor Architecture Vendor Year # DP FMA/clock Clock Speed (GHz) Cores DP peak (GFLOPS) TDP (Watt) #Transistors (x Million) Fab. process (nm) Die Size (mm^2)
POWER1 RIOS-‐1 IBM 1990 1 0.041 1 0.082 4 6.9 1000
PA-‐8000 PA-‐8000 HP 1996 1 0.2 1 0.4 3.8 500 337.69
SuperH SH-‐4 SuperH Hitachi 1998 1 0.2 1 1.4 1.5 3.2 250 42.25
Pentium III Xeon 550 P6 Intel 1999 2 0.55 1 2.2 39.5 9.5 250 128
Itanium Itanium (Merced) Intel 2001 2 0.8 1 3.2 130 25 180 544
Opteron 850 K8 (SledgeHammer) AMD 2004 2 2.4 1 9.6 89 105.9 130 193
Itanium 2 Itanium 2 (Madison) Intel 2005 2 1.67 1 6.67 130 221 130 544
Xeon 3.80E NetBurst Intel 2005 2 3.8 1 15.2 110 189 90 135
Opteron 880 K8 (Egypt) AMD 2005 4 2.4 2 19.2 95 233 90
Opteron 8220 SE K8 (Santa Rosa) AMD 2006 4 2.8 2 22.4 119.2 243 90
Itanium 2 9010 Itanium 2 9000 (Montecito) Intel 2006 4 1.6 2 12.8 104 1720 90 544
Xeon 7140M NetBurst Intel 2006 4 3.4 2 27.2 150 1328 65 435
SPARC64 VI SPARC64 VI Fujitsu 2007 4 2.4 2 19.2 120 543 90 421.25
Opteron 8360 SE K10 (Barcelona) AMD 2007 8 2.5 4 40 119 463 65
Blue Gene/P PowerPC 450 IBM 2007 8 0.85 4 13.6 36.7 208 90 121
Xeon X7350 Core Intel 2007 8 2.93 4 46.88 130 582 65 503
SPARC64 VII SPARC64 VII Fujitsu 2008 8 2.88 4 46.08 600 65 445
Xeon X7460 Penryn Intel 2008 12 2.66 6 63.84 130 1900 45 503
PowerXCell 8i Cell IBM 2008 16 3.2 8 109 92 250 65 212
Tesla C1060 GT200 NVIDIA 2008 30 1.3 240 78 187.8 1400 65 470
Opteron 8439 SE K10 (Istanbul) AMD 2009 12 2.8 6 67.2 137 904 45
Xeon X7560 Nehalem Intel 2009 16 2.266 8 72.512 130 2300 45 684
SPARC64 VII+ SPARC64 VII+ Fujitsu 2010 8 3 4 48 45
Itanium 9350 Itanium 9300 (Tukwila) Intel 2010 16 1.73 4 55.36 185 2046 65 698.75
Xeon E7-‐8870 Westmere (Nehalem-‐C) Intel 2010 20 2.4 10 96 130 2600 32
Opteron 6180 SE K10 (Magny-‐Cours) AMD 2010 24 2.5 12 120 140 904 45
POWER7 POWER7 IBM 2010 32 4.14 8 264.96 150 1200 45 567
Tesla C2070 Fermi NVIDIA 2010 224 1.15 448 515 247 3100 40 529
Radeon HD 5870 Evergreen (Cypress XT) AMD 2010 320 0.85 1600 544 188 2154 40 334
Opteron 6282 SE Bulldozer (Interlagos) AMD 2011 32 2.6 16 166.4 140 1200 32
HPC-‐ACE SPARC64 VIIIfx SPARC64 (Venus) Fujitsu 2011 32 2 8 128 58 760 45 513
Xeon E5-‐4650 Sandy Bridge Intel 2011 32 2.7 8 172.8 130 2270 32
Godson-‐3 L3B Loongson 3B ICT 2011 64 1.05 8 128 40 582.6 65 300
Tesla M2090 Fermi NVIDIA 2011 256 1.301 512 666 225 3000 40
Radeon HD 6970 Northern Islands (Cayman XT) AMD 2011 384 0.88 1536 676 250 2640 40 389
Opteron 6386 SE Piledriver (Abu Dhabi) AMD 2012 32 2.8 16 179.2 140 1308 32
Itanium 9560 Itanium 9500 (Poulson) Intel 2012 48 2.53 8 242.88 170 3100 32 544
HPC-‐ACE SPARC64 VIXfx SPARC64 Fujitsu 2012 64 1.848 16 236.5 110 45
Blue Gene/Q PowerPC A2 IBM 2012 64 1.6 18 204.8 55 1470 45 428
Xeon Phi 5110P MIC Intel 2012 480 1.053 60 1011 225 5000 22 350
Radeon HD 7970 Southern Islands (Tahiti XT) AMD 2012 512 0.925 2048 947 230 4313 28 352
Radeon HD 7970 GHz Edition Southern Islands (Tahiti XT2) AMD 2012 512 1.05 2048 1075 230 4313 28 352
Tesla K20X Kepler GK110 NVIDIA 2012 896 0.732 2688 1312 235 7080 28 561
Quadro K6000 GK110 NVIDIA 2013 960 0.901 2880 1732 225 7080 28 561
Core i7-‐4770K Haswell Intel 2013 32 3.8 4 243.2 84 1400 22 177
SPARC64 XIfx SPARC64 XIfx Fujitsu 2014 272 2.2 34 1196.8 160 3750 20 600
FirePro W9100 Hawaii XT AMD 2014 1408 0.93 2816 2618.9 275 6200 28 438
FirePro S9150 Hawaii XT GL AMD 2014 1408 0.9 2816 2534.4 235 6200 28 438
GeForce GTX Titan Black GK110-‐430 NVIDIA 2014 960 0.889 2880 1707 250 7080 28 561
POWER8 POWER8 IBM 2014 24 4.2 12 201.6 250 4200 22 675
50+ Single-chip μ-Processors (1990 – 2014)
© S. Sedukhin, University of Aizu
-
[Figure: Number of DP FMA units per clock (1–8192, log scale) and number of transistors (millions, log scale) for the surveyed processors, 1990–2020]
9/24/2014 7 © S. Sedukhin, University of Aizu
-
[Figure: Trends of clock speed (GHz), die size (mm²), and other per-chip parameters of the surveyed processors, 1990–2020, log scale]
9/24/2014 8 © S. Sedukhin, University of Aizu
-
9/24/2014 9
[Figure: DP peak performance (GFLOPS, log scale) of the surveyed processors, 1990–2020; labeled points range from IBM POWER1 (0.08), HP PA-8000 (0.4), Hitachi SuperH-4 (1.4), and Intel Pentium III (2.2) up to NVIDIA K20X (1312), NVIDIA GK110 (1732), AMD S9150 (2534), and AMD W9100 (2619)]

2014: Peak Performance (GFLOPS)
1. AMD W9100: 2619
2. AMD S9150: 2534
3. NVIDIA GTX 780 Ti: 1707
4. NVIDIA GTX 980: 1537
5. Fujitsu SPARC64 XIfx: 1197
6. Intel Haswell-EP: 1008
7. IBM POWER8: 202
© S. Sedukhin, University of Aizu
-
[Figure: Power/throughput (W/GFLOPS) vs. area/throughput (mm²/GFLOPS) for the surveyed processors, with 1 W/mm² and 0.1 W/mm² reference lines; labeled points range from early parts (Merced, Madison, Pentium III, Montecito, Xeon 3.80E) to recent ones (POWER8, SPARC64 XIfx, Tesla K20X, GTX 780/980, W9100, S9150, Haswell/Haswell-EP)]

9/24/2014 10
© S. Sedukhin, University of Aizu
-
9/24/2014 11

[Figure: 2014 peak-performance comparison on a 10–10,000 GFLOPS log scale]

2014:
| Processor | Vendor | Clock Frequency (GHz) | Die Area (mm²) | TDP (Watts) | # DP FMA units | Peak Performance (GFLOPS) | GFLOPS/W | Area/GFLOPS (mm²/GFLOPS) |
| POWER8 | IBM | 4.2 | 675 | 250 | 24 | 201.6 | 0.81 | 3.35 |
| SPARC64 XIfx | Fujitsu | 2.2 | 600 | 160 | 272 | 1197 | 7.5 | 0.50 |
| S9150 | AMD | 0.90 | 438 | 235 | 1408 | 2534 | 10.8 | 0.17 |

Performance_peak (GFLOPS) = #FMA units × 2 FLOPs × F (GHz), where F is the clock frequency.
© S. Sedukhin, University of Aizu
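A quick numeric check of the peak-performance formula against the three table rows (all values taken from the table above); a minimal sketch:

```c
#include <stdio.h>

/* Peak DP performance: #FMA units x 2 FLOPs per FMA x clock frequency (GHz) */
static double peak_gflops(int fma_units, double freq_ghz) {
    return fma_units * 2.0 * freq_ghz;
}

int main(void) {
    struct { const char *name; int fma; double ghz; double tdp; } chip[] = {
        { "POWER8",       24,   4.2,  250.0 },
        { "SPARC64 XIfx", 272,  2.2,  160.0 },
        { "S9150",        1408, 0.90, 235.0 },
    };
    for (int i = 0; i < 3; i++) {
        double p = peak_gflops(chip[i].fma, chip[i].ghz);
        printf("%-13s %8.1f GFLOPS  %5.2f GFLOPS/W\n",
               chip[i].name, p, p / chip[i].tdp);
    }
    return 0;   /* prints 201.6, 1196.8, and 2534.4 GFLOPS */
}
```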
-
Computing Efficiency: Performance per Watt
(target: 50 GFLOPS/Watt for an exaflop supercomputer @ 20 MW)

[Figure: GFLOPS/W of the surveyed processors, 1990–2020, log scale; labeled points range from IBM POWER1 (0.02) and Intel Itanium (0.03) up to NVIDIA K20X (5.6), NVIDIA GK110 (7.7), AMD W9100 (9.5), and AMD S9150 (10.8)]

2014: Efficiency (GFLOPS/W)
AMD S9150: 10.8
AMD W9100: 9.5
NVIDIA GTX 980: 9.3
Intel Haswell-EP: 8.4
Fujitsu SPARC64 XIfx: 7.5
NVIDIA GTX 780 Ti: 6.8
IBM POWER8: 0.8
9/24/2014 12 © S. Sedukhin, University of Aizu
-
Even more FMA Units in Supercomputers

! IBM Roadrunner's peak performance: 12,960 PowerXCell 8i × 8 SPEs × 2-way FMAs/clock = 207,360 FMAs/clock (@3.2 GHz) ≈ 1.3 PFLOPS
! K Supercomputer: 80,000 SPARC64 VIIIfx chips, each an 8-core CPU at 128 GFLOPS, i.e. 640,000 cores × 4-way FMAs/clock = 2.56M FMAs/clock (@2 GHz) = 10.24 PFLOPS
! IBM BlueGene/Q (Sequoia): 1.3M cores × 4-way FMAs/clock = 5.2M FMAs/clock ≈ 10M FLOPs/clock (@2 GHz) ≈ 20 PFLOPS
! IBM Mira: 16-core Power A2 (@1.6 GHz); 750K cores × 4-way = 3M FMAs/clock ≈ 6.14M FLOPs/clock ≈ 9.83 PFLOPS
! GPU-based Supers:
  – Tianhe-1A: Tesla M2050 (Fermi): 256 FMA/clock (@1.6 GHz) × 7,168 GPUs ≈ 1.8M FMAs/clock ≈ 5.9 PFLOPS
  – TSUBAME 2: Tesla M2050 (Fermi): 256 FMA/clock (@1.6 GHz) × 4,224 GPUs ≈ 1.1M FMAs/clock ≈ 3.5 PFLOPS
! Tianhe-2: 16,000 nodes × (2 Intel Xeon Ivy Bridge + 3 Intel Xeon Phi):
  – Phi: 480 × 3 = 1440 FMA/clock/node × 16,000 ≈ 23M FMAs/clock × 2 FLOPs @1.053 GHz ≈ 48.5 PFLOPS
  – Ivy Bridge: 12 cores × 4 FMAs/cycle × 2 = 96 FMA/clock/node × 16,000 ≈ 1.54M FMAs/clock × 2 FLOPs @2.2 GHz ≈ 6.8 PFLOPS
  – Total: 23M + 1.54M ≈ 25M FMAs/clock; 48.5 PFLOPS + 6.8 PFLOPS ≈ 55 PFLOPS
  – Footprint: 720 m² ≈ 27 m × 27 m
! Synchronization of processes in such multi-node GALS-Supers is provided by distributed global "Barrier Synchronization" in hardware and/or software!
  – "Barrier Synchronization" on a 32-processor shared-memory SGI Origin 300 system: 232,000 cycles ≈ 22 MFLOPs
  – The time of "Barrier Synchronization" depends on the size of the system
  – Scalability of MPI_Barrier in the "Earth Simulator": ~3 μsec, i.e. 333 kHz while the operational frequency is 1 GHz, a difference of 3000 times…
9/24/2014 13 © S. Sedukhin, University of Aizu
-
Source: Matzke, D., "Will physical scalability sabotage performance gains?," Computer, vol. 30, no. 9, pp. 37-39, Sep. 1997
Source: Agarwal, V., Hrishikesh, M. S., Keckler, S. W., Burger, D., "Clock rate versus IPC: the end of the road for conventional microarchitectures," SIGARCH Comput. Archit. News 28, 2 (May 2000), 248-259
[Figure: Clock speed (GHz) of the surveyed processors, 1990–2020, log scale]
Trends for “Clock Region” or “Span of Control”
9/24/2014 14 © S. Sedukhin, University of Aizu
-
More Performance and Less Power Consumption by Frequency Reduction

• Performance P (GFLOPS) = #FPUs × 2 FLOPs × F (GHz), where F is the clock frequency.
• The number of FPUs (#FPUs) defines the area A of a system. The clock period T = 1/F should be long enough to "cover" A in a single clock.
• The maximum area "reachable" in a single clock period would be
  A = πR², for planar technology (area of a circle);
  A = (4/3)πR³, for 3D technology (volume of a ball/sphere);
  where the "reachability" radius R = S·T = S/F and S is the speed of clock-signal propagation in the medium (wires, optics, etc.): S = kĊ, with the speed of light Ċ and 0.5 < k < 1.
9/24/2014 15 © S. Sedukhin, University of Aizu
-
More Performance and Less Power Consumption by Frequency Reduction

• Reducing the frequency F m times, i.e. increasing the "reachability" radius R m times, increases the area A, and therefore the number of FPUs,
  m² times for planar technology; m³ times for cubical technology.
• Because each FPU becomes m times "slower", the performance P increases
  m times for planar technology; m² times for cubical technology.
• The power consumption E of a processor is a function of the frequency F: E = C·V²·F, where C is capacitance and V is voltage. Hence, reducing the frequency m times proportionally reduces the power.

Reality:
• VLSI technology uses not Euclidean but Manhattan geometry (metric), i.e. the area is not a circle/sphere but a square/cube!
• Historically, VLSI technology has increased the chip area (die size) very slowly (only ×2, from 350 to 700 mm², over the last 25 years), while decreasing the feature size exponentially (×50, from 1000 to 20 nm, over the same period)!
9/24/2014 16 © S. Sedukhin, University of Aizu
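The scaling argument above can be made concrete with a toy model: pick a reduction factor m and scale the FPU count, performance, and per-FPU dynamic power accordingly, comparing planar and cubical technology. A minimal sketch, assuming a purely illustrative baseline of 1000 FPUs at 1 GHz and 1 W per FPU:

```c
#include <stdio.h>

int main(void) {
    double fpus = 1000.0, f = 1.0, p_fpu = 1.0;   /* illustrative baseline */

    for (int m = 1; m <= 8; m *= 2) {
        double f_m = f / m;                               /* reduced frequency     */
        double perf2d = (fpus * m * m) * 2.0 * f_m;       /* m^2 FPUs (planar)     */
        double perf3d = (fpus * m * m * m) * 2.0 * f_m;   /* m^3 FPUs (cubical)    */
        double p_fpu_m = p_fpu / m;                       /* E = C*V^2*F ~ 1/m     */
        printf("m=%d  planar: %8.0f GFLOPS  cubical: %9.0f GFLOPS  per-FPU power: %.3f W\n",
               m, perf2d, perf3d, p_fpu_m);
    }
    return 0;  /* planar performance grows ~m, cubical ~m^2, per-FPU power falls ~1/m */
}
```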
-
The Light-Speed Barrier limits the size of a synchronous computer by the ratio D ≤ S / F, where D is the diameter of the system (mm), S is the speed of clock-signal propagation (mm/s), S = kĊ, with the speed of light Ċ ≈ 3×10^11 mm/s and 0.5 < k < 1, and F is the operational frequency in Hz (1/s).

[Figure: die size vs. frequency with the light-speed and heat barriers; D ≈ (Die Size)^1/2]

Classical theory of computing: the time of transferring data to the processor is neglected. It is a theory of slow computing, as classical mechanics is the theory of slow motion.

Relativistic theory of computing: pays most attention to transferring data to the processor. This is a theory of fast computing, as in relativistic, speed-of-light mechanics.

Tianhe-2 footprint: 720 ≈ 27×27 (m²). Chip area: 700 ≈ 27×27 (mm²).
9/24/2014 17 © S. Sedukhin, University of Aizu
-
9/24/2014 18

Typical DGEMM Performance on CPUs & GPUs

• AMD "Tahiti" HD 7970: 512 FMAs @0.925 GHz (947 GFLOPS)
• AMD "Cayman" HD 6970: 384 FMAs @0.880 GHz (676 GFLOPS)
• NVIDIA "Fermi" Tesla M2090: 512 FMAs @0.650 GHz (666 GFLOPS)
• NVIDIA "Kepler" GTX 670 OC: 96 FMAs @1.085 GHz (122 GFLOPS)
• Intel "Sandy Bridge" Core i7 3960X: 32 FMAs @2.7 GHz (173 GFLOPS)
• AMD "Bulldozer" FX-8150: 32 FMAs @2.6 GHz (166 GFLOPS)

Source: Matsumoto, K.; Nakasato, N.; Sedukhin, S.G., "Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs," High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 396-405, 10-16 Nov. 2012

For today's supercomputers with more than 10^6 FMA units, Nmax ≈ 10^7 and N1/2 ≈ 10^6. Such performance scalability is not acceptable for mobile/embedded applications.
© S. Sedukhin, University of Aizu
-
Tree-structured Flat Machines: Low Performance Scalability

• Today's scientific-oriented computers are very inertial: initial data are very far from the FPUs
• A pipelined FPU with a few cycles of latency requires a few concurrent threads (no fine-grained implementation)
• The storage hierarchy adds more overhead with each additional level of hierarchy (coarse-grained memory accesses)
• Data in the storage hierarchy are copied at each level: no computing-in-place is possible
• Chips are power limited and most power is spent on (global) data movement and replication
• Processor/memory/interconnect are scaled differently (progressive computer scaling is provided by drastically increasing the data or problem size, Nmax for FLOPSmax)
9/24/2014 19
Source: Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA
© S. Sedukhin, University of Aizu
-
Target: Mesh/Torus-structured Machines

• Data reuse by local data movement between FPUs (not by using hierarchical data storage and global multiple data replication)
• Fine-grained data processing (computing and data access/exchange)
• Combination with parallel read-out sensors and stacked memory is possible (computing-in-place)
• Processor/Memory/Interconnect as a unified element of structural scalability which keeps a single image of the system, like a biological cell
• 2D/3D machines for computing 2D/3D tensor data
• More specialized, but more reactive, computer organization
• Global synchronization by locally coordinated (asynchronous) massive data circulation

9/24/2014 20
[Figure: a torus of processing elements (16 shown), each combining a processor (P), local memory (LM), and a network interface (N), with i, j, k orbit directions]
© S. Sedukhin, University of Aizu
-
Relations between ARITHMETIC & ALGEBRA
9/24/2014 21 © S. Sedukhin, University of Aizu
-
FMA and Algebraic Semiring

FMA: { R, (×, +), 0.0, 1.0 }
• Set of real numbers: R
• Fused arithmetic multiply-and-add: (×, +)
• Two constants from R: 0.0 and 1.0

Semiring: { S, ⊗, ⊕, 0, 1 }
• Set of numbers: S
• Two algebraic operations: multiply ⊗ and add ⊕
• Two constants from S: 0 and 1
9/24/2014 22 © S. Sedukhin, University of Aizu
-
Algebraic Path Problem

• Problems from different disciplines represented as a single algorithmic scheme (rich in FMA operations)
  – Linear Algebra
    - computing the inverse of a matrix
  – Graph and Network Problems
    - transitive & reflexive closure and transitive reduction
    - shortest-distance problems (distance functions)
    - capacity problems (max flow, network capacity, tunnel problem)
    - connectivity measures for reliability networks
    - stochastic communication network problems
  – Regular Language Problems
    - correspondence: regular expressions and finite-state automata
• Unification based on the theory of algebraic semirings
  – Different semirings for different applications
• Solution as a case of a matrix closure problem
9/24/2014 23 © S. Sedukhin, University of Aizu
-
The Algebraic Path Problem: Definition

! Let G = (V, E, w) be a weighted graph, where V = {0, 1, …, n−1} is a set of n vertices, E = V × V is a set of edges, and w: E → S is an edge-weighting function whose values are taken from the set S.
! The weighting function belongs to the so-called path algebra or algebraic semiring (S, ⊕, ⊗, ∗, 0, 1).
9/24/2014 24 © S. Sedukhin, University of Aizu
-
Scalar Semiring

• A closed semiring (S, ⊕, ⊗, ∗, 0, 1) is an algebraic structure defined by:
  – a set of scalar elements S;
  – two binary operations:
    • addition ⊕ : S × S → S,
    • multiplication ⊗ : S × S → S;
  – a unary operation called closure ∗ : S → S;
  – two constants 0 and 1 in S, where 0 and 1 are the neutral elements for ⊕ and ⊗, respectively.
9/24/2014 25 © S. Sedukhin, University of Aizu
-
Different Semirings and Associated Problems

9/24/2014 26

| S | ⊕ | ⊗ | closure (*) | 0 | 1 | Basic Problem | Core Algorithm |
| {0,1} | ∨ | ∧ | a* = 1 | 0 | 1 | Transitive Closure | Warshall |
| ℝ | + | × | a* = (1 − a)^−1 | 0 | 1 | Matrix Inversion | Gauss-Jordan |
| ℝ+ ∪ {+∞} | min | + | a* = 0 | +∞ | 0 | All-Pairs Shortest Paths | Floyd |
| ℝ+ ∪ {−∞} | max | + | a* = 0 | −∞ | 0 | Maximum Cost (Critical Path) | ? |
| ℝ+ ∪ {+∞} | max | min | a* = ∞ | 0 | +∞ | Maximum Capacity Paths | ? |
| ℝ[0,1] | max | × | a* = 1 | 0 | 1 | Maximum Reliability Paths | ? |
| ℝ[0,1] | min | × | a* = 1 | 0 | 1 | Minimum Reliability Paths | New |
| ℝ+ ∪ {+∞} | min | max | a* = 0 | +∞ | 0 | Minimum Spanning Tree | Maggs-Plotkin |

Given an n×n matrix A, the distance/cost matrix of a weighted n-vertex graph, the APP is to find the closure A* of the matrix A in the different algebraic semirings:
A* = I_n ⊕ A ⊕ A² ⊕ A³ ⊕ … = I ⊕ (A ⊗ A*)
Solution by the unified Gauss-Jordan/Warshall/Floyd (GJWF) algorithm.
© S. Sedukhin, University of Aizu
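As one concrete instance of the GJWF scheme, the Floyd algorithm computes the closure in place in the (min, +) semiring (all-pairs shortest paths); a minimal sketch with a hypothetical 4-vertex graph:

```c
#include <stdio.h>

#define N 4
#define INF 1e30   /* the semiring "zero" for (min, +): no path */

/* Floyd (GJWF in the (min,+) semiring): d[i][j] <- min(d[i][j], d[i][k] + d[k][j]) */
static void floyd_closure(double d[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
}

int main(void) {
    /* Hypothetical weighted 4-vertex graph; INF = no edge, 0 on the diagonal */
    double d[N][N] = {
        { 0,   3,   INF, 7   },
        { 8,   0,   2,   INF },
        { 5,   INF, 0,   1   },
        { 2,   INF, INF, 0   },
    };
    floyd_closure(d);
    for (int i = 0; i < N; i++, printf("\n"))
        for (int j = 0; j < N; j++)
            printf("%6.1f ", d[i][j]);
    return 0;
}
```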
-
Matrix Semiring

• (S^{b×b}, ⊕, ⊗, ∗, 0, I) is an algebraic structure defined by:
  – a set of b×b matrices S^{b×b} over a closed scalar semiring (S, ⊕, ⊗, ∗, 0, 1);
  – two binary operations:
    • matrix addition ⊕ : S^{b×b} × S^{b×b} → S^{b×b},
    • matrix multiplication ⊗ : S^{b×b} × S^{b×b} → S^{b×b};
  – a unary operation, the closure ∗ of a matrix: S^{b×b} → S^{b×b};
  – two b×b matrices of constants in S^{b×b}:
    - 0, where all elements are equal to 0 (zero matrix),
    - I, where all diagonal elements are equal to 1 (identity matrix).
9/24/2014 27 © S. Sedukhin, University of Aizu
-
Scalar FMA Operation ⇔ Matrix FMA Operation

Arithmetic: Scalar Fused Multiply-Add
• (×, +)-algebra (MMA):     c ← c + a·b
• (+, min)-algebra (SPP):   c ← min(c, a + b)
• (+, max)-algebra (CRP):   c ← max(c, a + b)
• (min, max)-algebra (MCP): c ← max(c, min(a, b))
• (×, max)-algebra (MRP):   c ← max(c, a × b)
• (max, min)-algebra (MST): c ← min(c, max(a, b))

Matrix Algebra: Matrix Fused Multiply-Add
• (×, +)-algebra:     c_ij ← c_ij + Σ_{k=0}^{n−1} c_ik · c_kj
• (+, min)-algebra:   c_ij ← min{ c_ij, min_{k=0}^{n−1} (c_ik + c_kj) }
• (+, max)-algebra:   c_ij ← max{ c_ij, max_{k=0}^{n−1} (c_ik + c_kj) }
• (min, max)-algebra: c_ij ← max{ c_ij, max_{k=0}^{n−1} min(c_ik, c_kj) }
• (×, max)-algebra:   c_ij ← max{ c_ij, max_{k=0}^{n−1} (c_ik × c_kj) }
• (max, min)-algebra: c_ij ← min{ c_ij, min_{k=0}^{n−1} max(c_ik, c_kj) }

MMA – Matrix Multiply-Add; SPP – Shortest Paths Problem; CRP – Critical Path Problem; MCP – Maximum Capacity Paths; MRP – Maximum Reliability Paths; MST – Minimum Spanning Tree

9/24/2014 28
© S. Sedukhin, University of Aizu
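The table above maps directly onto code: each algebra is just a different scalar "multiply" and "add" plugged into the same n³ matrix multiply-add loop. A minimal sketch (the function names and the 3×3 data are illustrative only):

```c
#include <stdio.h>

#define N 3

typedef double (*op_t)(double, double);

/* scalar "multiply" and "add" candidates for the different algebras */
static double f_add(double a, double b) { return a + b; }
static double f_mul(double a, double b) { return a * b; }
static double f_min(double a, double b) { return a < b ? a : b; }

/* Generic matrix fused multiply-add: C <- C (+) (A (x) B) in the chosen semiring */
static void mfma(double c[N][N], double a[N][N], double b[N][N],
                 op_t otimes, op_t oplus) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] = oplus(c[i][j], otimes(a[i][k], b[k][j]));
}

int main(void) {
    double a[N][N] = {{0,2,9},{7,0,3},{4,6,0}};
    double b[N][N] = {{0,2,9},{7,0,3},{4,6,0}};
    double mma[N][N] = {{0}};                      /* (x,+): ordinary GEMM         */
    double spp[N][N] = {{0,2,9},{7,0,3},{4,6,0}};  /* (+,min): shortest-path update */

    mfma(mma, a, b, f_mul, f_add);   /* MMA: c_ij += a_ik * b_kj        */
    mfma(spp, a, b, f_add, f_min);   /* SPP: c_ij = min(c_ij, a_ik+b_kj) */

    printf("MMA c[0][2]=%g  SPP c[0][2]=%g\n", mma[0][2], spp[0][2]);
    return 0;
}
```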
-
Accelerators: Scalar FMA Unit ⇔ Matrix FMA Unit

Arithmetic: Scalar Fused Multiply-Update Unit ⇔ Matrix Algebra: Matrix FMA Array Processor

No need to understand how the FMA unit is internally constructed! No need to understand how the "Big Multiplier" is internally constructed!

9/24/2014 29
© S. Sedukhin, University of Aizu
-
XGEMM-based Algorithms/Architecture

• GEMM in Different Algebras (XGEMM):
  C ← A ⊗ B ⊕ C
  C ← Aᵀ ⊗ B ⊕ C
  C ← A ⊗ Bᵀ ⊕ C
• Chaining Matrix Products:
  D ← Aᵀ ⊗ B ⊗ C
  D ← A ⊗ B ⊗ Cᵀ
• Focal-Plane I/O for Streaming Data
• Computing-near-Data

Streaming 2D Data from Sensor Array or Stacked Memory

30 9/24/2014
© S. Sedukhin, University of Aizu
-
EXTREMELY-SCALABLE GENERAL MATRIX MULTIPLY-ADD ALGORITHM
Algorithmically and Technologically
9/24/2014 31 © S. Sedukhin, University of Aizu
-
Matrix-by-Matrix Multiply-Add (MMA)

C ← A×B + C, where A, B, and C are dense n×n matrices:
  c^(k)(i, j) ← a(i, k)·b(k, j) + c^(k−1)(i, j), accumulated over k = 0, …, n−1, for 0 ≤ i, j < n.

• The computational index space is a bounded grid of n×n×n index points: ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³, where at each index point p = (i, j, k)ᵀ ∈ ℑ we have to update c(i, j) by implementing the scalar multiply-add operation c(i, j) ← a(i, k)·b(k, j) + c(i, j); i.e. all three scalars a(i, k), b(k, j), and c(i, j) should be available at this index point before computing starts.
• Because there are only 3n² input matrix data but n³ index points, no more than n² data-independent index points can be activated (i.e., no more than n² multiply-add operations per time-step) if no data replication is considered.
• Hence, n is the minimal number of "read-compute-write" steps to implement a matrix-by-matrix multiply-add.

Example: p = (2, 1, 2)ᵀ ∈ ℑ:
  c^(2)(2, 1) ← a(2, 2)·b(2, 1) + c^(1)(2, 1), i.e. C^(2)[2,1] ← A[2,2]×B[2,1] + C^(1)[2,1]

32 9/24/2014
© S. Sedukhin, University of Aizu
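In code, the n³ index points are simply the iterations of a triple loop, with one scalar multiply-add per point; a minimal sketch:

```c
#include <stdio.h>

#define N 4

/* C <- A*B + C: one scalar multiply-add per index point (i, j, k) of the n^3 grid */
static void mma(double c[N][N], const double a[N][N], const double b[N][N]) {
    for (int k = 0; k < N; k++)          /* any loop order visits the same n^3 points */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + 1; b[i][j] = j + 1; }
    mma(c, a, b);
    printf("c[2][3] = %g\n", c[2][3]);   /* sum over k of 3*4 = 48 */
    return 0;
}
```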
-
Time-Space Scheduling of MMA

! A time-scheduling function step(p): Z³ → Z, for all p = (i, j, k)ᵀ from ℑ
  – linear or modular form: step(p) = αᵀp or step(p) = (αᵀp) mod n, where α = (α_i, α_j, α_k)ᵀ is a time-scheduling vector
! A space-scheduling function allocation(p): Z³ → Z², for all p = (i, j, k)ᵀ from ℑ
  – linear projection method: allocation(p) = S × p, where S is a 2×3 space transformation matrix corresponding to the projection vector η, such that S×η = 0 and αᵀη ≠ 0

9/24/2014 33
© S. Sedukhin, University of Aizu
-
Broadcast-Compute-Shift (BCS) Scheduling

Grid-based scheduling. The time-scheduling function: step(p) = αᵀ·p, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (0, 0, 1)ᵀ, i.e. step(p) = k, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³.
On each step(p) = s ∈ [0, n), broadcasting of the n-element column a_s ∈ A and the n-element row b_sᵀ ∈ B is required to update the n²-element matrix C^(s). Hence, BCS scheduling corresponds to implementing, on each time-step s = 0, 1, …, n−1, the rank-1 update C^(s+1) ← C^(s) + a_s ⊗ b_sᵀ.
All initial/intermediate/final data reside inside the index space ℑ.
The total number of "broadcast-compute-shift" steps is n.

34 9/24/2014
© S. Sedukhin, University of Aizu
-
All-Shift-Compute (ASC) Scheduling

Systolic or mesh scheduling. The time-scheduling function: step(p) = αᵀ·p, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (1, 1, 1)ᵀ, i.e. step(p) = i + j + k, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³.
It is a "broadcast-to-pipeline" or "time-multiplexing" version of BCS scheduling. All initial data are aligned and located on a hyper-plane outside of the index space ℑ; the same holds for the final data.
The total number of "all-shift-compute" steps is 3n−2.

35 9/24/2014
© S. Sedukhin, University of Aizu
-
Broadcast-Compute-Roll (BCR) Scheduling

Cylindrical scheduling. The time-scheduling function: step(p) = (αᵀ·p) mod n, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (−1, 0, 1)ᵀ, i.e. step(p) = (k − i) mod n, for p = (i, j, k)ᵀ ∈ ℑ ⊆ Z³.
This scheduling requires, on each step(p) = s ∈ [0, n), that the s-th diagonal n-vector a_s ∈ A be broadcast (α_j^A = 0) along the j-axis to compute (update) the matrix C, and then matrices B and C are rolled opposite the i-axis/orbit (α_i^B = −1) and along the k-orbit (α_k^C = 1).
All initial/intermediate/final data reside inside the index space ℑ.
The total number of "broadcast-compute-roll" steps is n.

36 9/24/2014
© S. Sedukhin, University of Aizu
-
Compute-Roll-All (CRA) Scheduling

Orbital or toroidal scheduling. The modular time-scheduling function: step(p) = (αᵀ·p) mod n, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (±1, ±1, ±1)ᵀ, i.e. step(p) = (±i ± j ± k) mod n, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z_n³.
The computational index space ℑ is a 3D torus. At each step(p) = s ∈ [0, n):
  computing: c(i, j) ← c(i, j) + a(i, k)·b(k, j);
  roll-all: a(i, k), b(k, j), and c(i, j) are rolled along the ±j-, ±i-, and ±k-orbits, respectively.
All initial/intermediate/final data reside inside the index space ℑ.
The total number of "compute-roll-all" steps is n.

37 9/24/2014
© S. Sedukhin, University of Aizu
-
3D Index-Space → 2D Processor-Space Allocation

step(p) = [−i − j + k] mod n; shown for step(p) = 0, n = 4.

[Figure: the 4×4×4 toroidal index space at step(p) = 0 and its three projections along (−1,0,0)ᵀ, (0,−1,0)ᵀ, and (0,0,−1)ᵀ onto a 4×4 processor torus, giving B-, A-, and C-stationary layouts respectively, each with the stationary matrix in a canonical layout while the other two matrices are rolled; the C-stationary variant is Cannon's Algorithm [1969].]

9/24/2014 38
© S. Sedukhin, University of Aizu
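The C-stationary projection is exactly Cannon's algorithm: after an initial skew of A (row i shifted left by i) and B (column j shifted up by j), each of the n compute-roll steps performs one local multiply-add per PE and then rolls A along its rows and B along its columns. A minimal single-process sketch, simulating an n×n torus of PEs each holding one element:

```c
#include <stdio.h>

#define N 4

/* circular left-shift of row i of m by one position */
static void roll_row(double m[N][N], int i) {
    double t = m[i][0];
    for (int j = 0; j < N - 1; j++) m[i][j] = m[i][j + 1];
    m[i][N - 1] = t;
}

/* circular up-shift of column j of m by one position */
static void roll_col(double m[N][N], int j) {
    double t = m[0][j];
    for (int i = 0; i < N - 1; i++) m[i][j] = m[i + 1][j];
    m[N - 1][j] = t;
}

static void cannon(double c[N][N], double a[N][N], double b[N][N]) {
    /* initial skew: row i of A shifted left i times, column j of B shifted up j times */
    for (int i = 0; i < N; i++) for (int s = 0; s < i; s++) roll_row(a, i);
    for (int j = 0; j < N; j++) for (int s = 0; s < j; s++) roll_col(b, j);

    for (int step = 0; step < N; step++) {           /* n compute-roll steps        */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][j] * b[i][j];        /* each PE: one local FMA      */
        for (int i = 0; i < N; i++) roll_row(a, i);  /* roll A along its rows       */
        for (int j = 0; j < N; j++) roll_col(b, j);  /* roll B along its columns    */
    }
}

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i * N + j; b[i][j] = (i == j); }
    cannon(c, a, b);                                 /* with B = I, C must equal A  */
    printf("c[1][2] = %g (expect %g)\n", c[1][2], (double)(1 * N + 2));
    return 0;
}
```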
-
Comparison of Different MMA Implementations

| | BCS | ASC | BCR | CRA |
| # time-steps | n | 3n−2 | n | n |
| Index Space | 3D Grid | 3D Mesh | 3D Cylinder | 3D Torus |
| I/O Data Location | Inside | Outside | Inside | Inside |
| Data reuse/step | 2n | 1…3n(n+1)/2…1 | n + n² | 2n² |
| Data Movement (Global/Local) | Broadcast/Shift | All Shift | Broadcast/Roll | All Roll |
| Computing-in-Place | Possible | Impossible | Possible | Possible |
| 3D → 2D Projection | 2D Grid: Agarwal et al. (1994); van de Geijn (1995) | 2D Mesh: Kung, Leiserson (1979) | 2D Torus: Fox, Otto, Hey (1987) | 2D Torus: Cannon (1969) |
| Scalability | Bad | Good | Bad | Good |
| Final Selection | No | No | No | Yes |

39 9/24/2014
39 9/24/2014 © S. Sedukhin, University of Aizu
-
Three Forms of MMA

• Any time-step scheduling function defines only the matrix data distribution among the active index points and does not specify how the computing is actually performed.
• By using the same time-scheduling function we can implement either:

(k): C ← A·B + C   ⇔  c(i, j) ← c(i, j) + Σ_{k=0}^{n−1} a(i, k)·b(k, j);
(i): B ← Aᵀ·C + B  ⇔  b(k, j) ← b(k, j) + Σ_{i=0}^{n−1} a(i, k)·c(i, j);
(j): A ← C·Bᵀ + A  ⇔  a(i, k) ← a(i, k) + Σ_{j=0}^{n−1} c(i, j)·b(k, j).

Accumulation along k, i, and j gives the NN-, TN-, and NT-forms (¾ of GEMM).

40 9/24/2014
© S. Sedukhin, University of Aizu
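The three accumulation directions are just three orderings of the same n³ index points, each updating a different matrix in place; a minimal sketch:

```c
#include <stdio.h>

#define N 4

/* (k): NN-form, C <- A*B + C */
static void nn(double c[N][N], const double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++) for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j];
}
/* (i): TN-form, B <- A^T*C + B (accumulation along i) */
static void tn(double b[N][N], const double a[N][N], const double c[N][N]) {
    for (int k = 0; k < N; k++) for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) b[k][j] += a[i][k] * c[i][j];
}
/* (j): NT-form, A <- C*B^T + A (accumulation along j) */
static void nt(double a[N][N], const double c[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++) for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++) a[i][k] += c[i][j] * b[k][j];
}

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1; b[i][j] = 2; }
    nn(c, a, b);            /* C = A*B     -> every c[i][j] = 8    */
    tn(b, a, c);            /* B += A^T*C  -> every b[k][j] = 34   */
    nt(a, c, b);            /* A += C*B^T  -> every a[i][k] = 1089 */
    printf("c=%g b=%g a=%g\n", c[0][0], b[0][0], a[0][0]);
    return 0;
}
```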
-
Chaining of Matrix Products

• Because in the orbital scheduling the resulting data properly resides inside the index space, by using the three forms of accumulation (k), (i), and (j) we can efficiently implement chaining of MMAs.
• Two-MMA chaining:
  (j, k): E ← (C·Bᵀ + A)·D + E;
  (i, k): E ← D·(Aᵀ·C + B) + E;
  (k, j): E ← (A·B + C)·Dᵀ + E;
  (i, j): E ← Dᵀ·(Aᵀ·C + B) + E;
  (j, i): E ← (C·Bᵀ + A)ᵀ·D + E;
  (k, i): E ← Dᵀ·(A·B + C) + E.
• 2n time-steps are needed to complete this chaining.

41 9/24/2014
© S. Sedukhin, University of Aizu
-
GEMM-BASED ORBITAL ALGORITHMS
9/24/2014 42 © S. Sedukhin, University of Aizu
-
Systolic → Orbital Rescheduling

• Linear Algebra
  – Matrix-by-matrix multiplication,
  – Solution of linear systems (matrix inversion),
  – LU, QR, SVD decompositions,
  – ...
• Digital Signal Processing
  – FIR, IIR, 1D/2D convolution,
  – 2D DFT, DCT, DHT, ...
  – Dynamic scene analysis,
  – Image resampling,
  – Interpolation,
  – 1D/2D median filtering,
  – Geometric warping,
  – ...
• Non-numerical applications
  – Data structures (stack/queue, searching, priority queue, sorting),
  – Graph algorithms (transitive closure, minimum spanning tree, connected components, ...),
  – Language recognition (string matching, regular expressions),
  – Dynamic programming,
  – Relational database operations,
  – Encoders (polynomial division),
  – ...
9/24/2014 43 © S. Sedukhin, University of Aizu
-
9/24/2014 44 © S. Sedukhin, University of Aizu
-
2-Dimensional Separable Transforms

• Let X = [x(i, k)] be n×n signal or image data. A 2D forward transform of X is defined as
  x̃(u, v) = Σ_{i=0}^{n−1} Σ_{k=0}^{n−1} c(i, u) · x(i, k) · c(k, v),
  or, in matrix form, X̃ = Cᵀ × X × C.
• A 2D inverse transform: X = C × X̃ × Cᵀ.
• The transform coefficient matrix C can be
  – symmetric (C = Cᵀ) and unitary (C⁻¹ = C*ᵀ), as for the DFT or DHT;
  – unitary and real, as in the DCT;
  – only ±1 and symmetric and orthogonal, as in the DWHT.
• Chaining of matrix products: each 2D separable transform is computed as a chain of two matrix products, using the accumulation-order variants (j, k), (i, k), (k, j), (i, j), (j, i), (k, i) listed earlier.
• In total, 2n time-steps are needed to implement any 2D transform.

9/24/2014 45
© S. Sedukhin, University of Aizu
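A 2D separable transform is just two chained GEMMs: first T = Cᵀ·X, then X̃ = T·C. A minimal sketch using a 2×2 Walsh-Hadamard coefficient matrix, chosen only because its entries are ±1:

```c
#include <stdio.h>

#define N 2

/* R <- A^T * B */
static void gemm_tn(double r[N][N], const double a[N][N], const double b[N][N]) {
    for (int u = 0; u < N; u++) for (int j = 0; j < N; j++) {
        r[u][j] = 0.0;
        for (int i = 0; i < N; i++) r[u][j] += a[i][u] * b[i][j];
    }
}
/* R <- A * B */
static void gemm_nn(double r[N][N], const double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++) for (int v = 0; v < N; v++) {
        r[i][v] = 0.0;
        for (int k = 0; k < N; k++) r[i][v] += a[i][k] * b[k][v];
    }
}

int main(void) {
    double c[N][N] = {{1, 1}, {1, -1}};      /* 2x2 Walsh-Hadamard coefficients */
    double x[N][N] = {{3, 5}, {2, 7}};       /* illustrative 2x2 "image"        */
    double t[N][N], xt[N][N];

    gemm_tn(t, c, x);    /* T   = C^T * X              */
    gemm_nn(xt, t, c);   /* X~  = T * C = C^T * X * C  */

    for (int i = 0; i < N; i++, printf("\n"))
        for (int j = 0; j < N; j++) printf("%6.1f ", xt[i][j]);
    return 0;
}
```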
-
9/24/2014 46 © S. Sedukhin, University of Aizu
-
9/24/2014 47 © S. Sedukhin, University of Aizu
-
9/24/2014 48 © S. Sedukhin, University of Aizu
-
9/24/2014 49 © S. Sedukhin, University of Aizu
-
Data Manipulation by Matrix-Matrix Multiply-Add

! A generic form of the MMA operation: D ← MMA[⊗,⊕](A, B, C) : D ← A ⊗ B ⊕ C
! Row/column interchange: D(n,n) ← MMA[×,+](P(n,n), A(n,n), zero(n,n)), where P(n,n) is an (i,j)-permutation matrix
! Rows/columns rotation
! Scalar data replication (broadcast)
9/24/2014 50 © S. Sedukhin, University of Aizu
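For example, swapping two rows of A is a single MMA with a permutation matrix as the left operand and a zero matrix as the accumulator; a minimal sketch:

```c
#include <stdio.h>

#define N 3

/* D <- P*A + C : generic (x,+) matrix multiply-add */
static void mma(double d[N][N], const double p[N][N],
                const double a[N][N], const double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            d[i][j] = c[i][j];
            for (int k = 0; k < N; k++) d[i][j] += p[i][k] * a[k][j];
        }
}

int main(void) {
    /* P is the identity with rows 0 and 1 interchanged */
    double p[N][N] = {{0,1,0},{1,0,0},{0,0,1}};
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double zero[N][N] = {{0}}, d[N][N];

    mma(d, p, a, zero);            /* D = P*A + 0 : rows 0 and 1 of A swapped */
    for (int i = 0; i < N; i++, printf("\n"))
        for (int j = 0; j < N; j++) printf("%g ", d[i][j]);
    return 0;
}
```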
-
Global Reduction and Broadcast

! Generic form: B(n,n) ← ones(n,n) ⊗ A(n,n) ⊗ ones(n,n)
! For example, summation and broadcast:
  – C(n,n) ← MMA[×,+](ones(n,n), A(n,n), zero(n,n))
  – D(n,n) ← MMA[×,+](C(n,n), ones(n,n), zero(n,n))
! Maximum and broadcast:
  – C(n,n) ← MMA[×,max](ones(n,n), A(n,n), −inf(n,n))
  – C(n,n) ← MMA[×,max](C(n,n), ones(n,n), −inf(n,n))
9/24/2014 51 © S. Sedukhin, University of Aizu
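Multiplying by the all-ones matrix on the left sums each column of A (and replicates the result down the rows); multiplying the result by ones on the right then puts the grand total in every position, i.e. a global reduction followed by a broadcast. A minimal sketch:

```c
#include <stdio.h>

#define N 3

/* D <- A*B in the ordinary (x,+) algebra */
static void matmul(double d[N][N], const double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            d[i][j] = 0.0;
            for (int k = 0; k < N; k++) d[i][j] += a[i][k] * b[k][j];
        }
}

int main(void) {
    double ones[N][N], a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double c[N][N], d[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) ones[i][j] = 1.0;

    matmul(c, ones, a);   /* c[i][j] = column sum of A, repeated in every row      */
    matmul(d, c, ones);   /* d[i][j] = total sum of A (45), broadcast everywhere   */

    printf("column sums: %g %g %g  total: %g\n", c[0][0], c[0][1], c[0][2], d[1][2]);
    return 0;
}
```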
-
CONCLUSION IN ONE SLIDE
9/24/2014 52 © S. Sedukhin, University of Aizu
-
9/24/2014
Matrix Array Processor
• 2D/3D Toroidal Array Processor for Computing-near-Data
  – Basic operation: matrix multiply-add
  – 3D toroidally connected banks of memory with attached scalar FMA units
  – Using slower and simpler cores
  – Multidimensional I/O
• Keeping integrity of data
• Highly scalable computing/memory/interconnect fabric

Accelerated Functions
• Matrix Mathematics
  – GEMM (BLAS Level 3): C = C + A×B; C = C + Aᵀ×B; C = C + A×Bᵀ
  – Linear Algebra
• Graph (Path) Algorithms: GEMM in different algebras
  – Transitive Closure
  – All-Pairs Shortest/Longest Paths
  – Critical Path
  – Maximum Capacity Path
  – Most Reliable Path
  – Minimum Spanning Tree
• Multidimensional Linear Transforms: 2D/3D Fwd/Inv separable transforms, Y = (Cᵀ×(X×C))×C, X = (C×(Y×Cᵀ))×Cᵀ
  – DFT, DCT, DHT, DWH, DST
• Data Manipulation: rotation, permutation, transposition, copy, replication, reduction, broadcast

Target Applications
• Medical Imaging / Visualization
• Radar Systems
• Sonar Systems
• Defense and Security IT
• Surveillance
• Wireless Communications
• Network Processing
• Voice and Pattern Recognition
• Computational Chemistry
• Climate Modeling
• Data Mining and Analysis
• Game Physics / Physics Simulation
• Life Sciences & Biotechnology
  – Computational chemistry and biology
  – In silico drug discovery
  – Gene sequencing
  – Pharmacogenomics
  – Protein folding
  – Molecular dynamics
  – Personalized medicine
  – Genomics, Proteomics, Metabolomics
  – Simulation of biological systems
• Geophysical Science
  – Seismic data processing
  – Petroleum reservoir modeling
• Financial Analysis and Modeling
• …
53 9/24/2014 © S. Sedukhin, University of Aizu
-
9/24/2014
The Human Brain
• Number of neurons: 10^11
• Number of synapses (adult): 10^14 (2,000-5,000 per neuron)
• Power consumption (adult): 20-40 Watts (0.5-4 nW/neuron)
• Maximum firing frequency of a neuron: 250-2,000 Hz (0.5-4 ms intervals)
• Signal propagation speed inside an axon: 90 m/s sheathed, …
-
THANK YOU !
9/24/2014 55 © S. Sedukhin, University of Aizu