TRANSCRIPT
-
Algorithms/Architecture Co-design for Exascale Computer
Stanislav Sedukhin The University of Aizu
-
Outline
! Scalar Fused Multiply-Add operation as a workhorse of current scientific computing
! Current State-of-the-Art and Historical Observation (review of 50+ single-chip uProcessors with FMA units)
! Tree- and Torus-structured Machines
! Arithmetic (Scalar Multiply-Add) → Algebra (Matrix Multiply-Add): Algebraic Path Problem
! Design Space of Extremely-scalable GEneral Matrix Multiply-add (GEMM) Algorithms/Architecture (best algorithm/architecture selection)
! GEMM-based Orbital Algorithms and Unified Architecture for Multidimensional Data Processing
! Conclusion
9/24/2014 2 © S. Sedukhin, University of Aizu
-
9/24/2014 3
SCALAR FUSED MULTIPLY-ADD OPERATION AS A WORKHORSE OF CURRENT SCIENTIFIC COMPUTING
© S. Sedukhin, University of Aizu
-
It is Time to Rethink Computing

Cost of Computing:
| Date | US$ (2013) per GFLOP/S | Computer |
| 1961 | 0.8 × 10^12 | IBM 1620 (x 10^6 @ $64,000 each) |
| 1984 | 42.8 × 10^6 | Cray X-MP/48 |
| 1997 | 42 × 10^3 | Beowulf Pentium |
| 2000 | 836 | KLAT2 Athlon |
| 2003 | 100 | KASY0 Athlon |
| 2007 | 52 | Microwulf |
| 2011 | 1.80 | HPU4Science |
| 2013 | 0.22 | Sony PlayStation 4 |
Source: http://en.wikipedia.org/wiki/FLOPS

Before:
• Reduction of expensive arithmetic operations (mostly multiplication) via fast algorithms:
  – FFT (Cooley-Tukey),
  – MMA (Strassen, Winograd, …)
• Algorithm Serialization
• Avoiding storage of, and multiplication by, 0:
  – Sparse Algorithms

Now and Later On:
• Minimization of expensive data movement is more important than reduction of cheap operations.
• Deeply parallelize the original (with regular data access), not the fast, algorithm.
• "Go ahead, multiply by zero!" (Dr. J. Gustafson)
• Time to Compute Sparse as Dense

9/24/2014 4
© S. Sedukhin, University of Aizu
-
Scalar Fused Multiply-Add (FMA)

Fused Multiply-Add (FMA): (3-read & 1-write) Floating-Point Unit
[Figure: FMA unit connected to a register file holding the hardwired constants r0 = 0.0 and r1 = 1.0 plus general registers r2, r3, …, rn]

• two (SP/DP) floating-point scalar operations (× and +) in a single cycle (2 FLOPs/cycle);
• improved accuracy due to only one final rounding;
• a few (3÷6) cycles of latency;
• standard "addition" and "multiplication" by using the hardwired constants 0.0 and 1.0;
• allows an efficient software implementation of division and square root;
• use of the FMA operation results in the practical "disappearance" of the four basic arithmetic operations: add, subtract, multiply, divide.

c ← FMA(a, b, c) = a × b + c:
  a × b, when c = 0.0 : multiplication;
  a (or b) + c, when b (or a) = 1.0 : addition;
  a (or b), when b (or a) = 1.0 and c = 0.0 : copy.
© S. Sedukhin, University of Aizu
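The special cases above can be checked directly with the C standard library's fma(), which computes a·b + c with a single rounding; a minimal sketch (the r0/r1 constants are only illustrative stand-ins for the hardwired register values):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const double r0 = 0.0, r1 = 1.0;   /* hardwired constants of the register file */
    double a = 2.5, b = 4.0, c = 3.0;

    double mul  = fma(a, b, r0);       /* a*b + 0.0   -> multiplication */
    double add  = fma(a, r1, c);       /* a*1.0 + c   -> addition       */
    double copy = fma(a, r1, r0);      /* a*1.0 + 0.0 -> copy           */
    double full = fma(a, b, c);        /* a*b + c with one final rounding */

    printf("mul=%g add=%g copy=%g fma=%g\n", mul, add, copy, full);
    return 0;
}
```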
-
9/24/2014 6
Processor Architecture Vendor Year # DP FMA/clock Clock Speed (GHz) Cores DP peak (GFLOPS) TDP (Watt) #Transistors (x Million) Fab. process (nm) Die Size (mm^2)
POWER1 RIOS-‐1 IBM 1990 1 0.041 1 0.082 4 6.9 1000
PA-‐8000 PA-‐8000 HP 1996 1 0.2 1 0.4 3.8 500 337.69
SuperH SH-‐4 SuperH Hitachi 1998 1 0.2 1 1.4 1.5 3.2 250 42.25
Pentium III Xeon 550 P6 Intel 1999 2 0.55 1 2.2 39.5 9.5 250 128
Itanium Itanium (Merced) Intel 2001 2 0.8 1 3.2 130 25 180 544
Opteron 850 K8 (SledgeHammer) AMD 2004 2 2.4 1 9.6 89 105.9 130 193
Itanium 2 Itanium 2 (Madison) Intel 2005 2 1.67 1 6.67 130 221 130 544
Xeon 3.80E NetBurst Intel 2005 2 3.8 1 15.2 110 189 90 135
Opteron 880 K8 (Egypt) AMD 2005 4 2.4 2 19.2 95 233 90
Opteron 8220 SE K8 (Santa Rosa) AMD 2006 4 2.8 2 22.4 119.2 243 90
Itanium 2 9010 Itanium 2 9000 (Montecito) Intel 2006 4 1.6 2 12.8 104 1720 90 544
Xeon 7140M NetBurst Intel 2006 4 3.4 2 27.2 150 1328 65 435
SPARC64 VI SPARC64 VI Fujitsu 2007 4 2.4 2 19.2 120 543 90 421.25
Opteron 8360 SE K10 (Barcelona) AMD 2007 8 2.5 4 40 119 463 65
Blue Gene/P PowerPC 450 IBM 2007 8 0.85 4 13.6 36.7 208 90 121
Xeon X7350 Core Intel 2007 8 2.93 4 46.88 130 582 65 503
SPARC64 VII SPARC64 VII Fujitsu 2008 8 2.88 4 46.08 600 65 445
Xeon X7460 Penryn Intel 2008 12 2.66 6 63.84 130 1900 45 503
PowerXCell 8i Cell IBM 2008 16 3.2 8 109 92 250 65 212
Tesla C1060 GT200 NVIDIA 2008 30 1.3 240 78 187.8 1400 65 470
Opteron 8439 SE K10 (Istanbul) AMD 2009 12 2.8 6 67.2 137 904 45
Xeon X7560 Nehalem Intel 2009 16 2.266 8 72.512 130 2300 45 684
SPARC64 VII+ SPARC64 VII+ Fujitsu 2010 8 3 4 48 45
Itanium 9350 Itanium 9300 (Tukwila) Intel 2010 16 1.73 4 55.36 185 2046 65 698.75
Xeon E7-‐8870 Westmere (Nehalem-‐C) Intel 2010 20 2.4 10 96 130 2600 32
Opteron 6180 SE K10 (Magny-‐Cours) AMD 2010 24 2.5 12 120 140 904 45
POWER7 POWER7 IBM 2010 32 4.14 8 264.96 150 1200 45 567
Tesla C2070 Fermi NVIDIA 2010 224 1.15 448 515 247 3100 40 529
Radeon HD 5870 Evergreen (Cypress XT) AMD 2010 320 0.85 1600 544 188 2154 40 334
Opteron 6282 SE Bulldozer (Interlagos) AMD 2011 32 2.6 16 166.4 140 1200 32
HPC-‐ACE SPARC64 VIIIfx SPARC64 (Venus) Fujitsu 2011 32 2 8 128 58 760 45 513
Xeon E5-‐4650 Sandy Bridge Intel 2011 32 2.7 8 172.8 130 2270 32
Godson-‐3 L3B Loongson 3B ICT 2011 64 1.05 8 128 40 582.6 65 300
Tesla M2090 Fermi NVIDIA 2011 256 1.301 512 666 225 3000 40
Radeon HD 6970 Northern Islands (Cayman XT) AMD 2011 384 0.88 1536 676 250 2640 40 389
Opteron 6386 SE Piledriver (Abu Dhabi) AMD 2012 32 2.8 16 179.2 140 1308 32
Itanium 9560 Itanium 9500 (Poulson) Intel 2012 48 2.53 8 242.88 170 3100 32 544
HPC-‐ACE SPARC64 VIXfx SPARC64 Fujitsu 2012 64 1.848 16 236.5 110 45
Blue Gene/Q PowerPC A2 IBM 2012 64 1.6 18 204.8 55 1470 45 428
Xeon Phi 5110P MIC Intel 2012 480 1.053 60 1011 225 5000 22 350
Radeon HD 7970 Southern Islands (Tahiti XT) AMD 2012 512 0.925 2048 947 230 4313 28 352
Radeon HD 7970 GHz Edition Southern Islands (Tahiti XT2) AMD 2012 512 1.05 2048 1075 230 4313 28 352
Tesla K20X Kepler GK110 NVIDIA 2012 896 0.732 2688 1312 235 7080 28 561
Quadro K6000 GK110 NVIDIA 2013 960 0.901 2880 1732 225 7080 28 561
Core i7-‐4770K Haswell Intel 2013 32 3.8 4 243.2 84 1400 22 177
SPARC64 XIfx SPARC64 XIfx Fujitsu 2014 272 2.2 34 1196.8 160 3750 20 600
FirePro W9100 Hawaii XT AMD 2014 1408 0.93 2816 2618.9 275 6200 28 438
FirePro S9150 Hawaii XT GL AMD 2014 1408 0.9 2816 2534.4 235 6200 28 438
GeForce GTX Titan Black GK110-‐430 NVIDIA 2014 960 0.889 2880 1707 250 7080 28 561
POWER8 POWER8 IBM 2014 24 4.2 12 201.6 250 4200 22 675
50+ Single-chip μ-Processors (1990 – 2014)
© S. Sedukhin, University of Aizu
-
[Figure: Number of DP FMA units per clock (1–8192, log scale) and number of transistors (millions, log scale) for the surveyed processors, 1990–2020]
9/24/2014 7 © S. Sedukhin, University of Aizu
-
[Figure: Trends of clock speed (GHz), die size (mm²), and other per-chip parameters of the surveyed processors, 1990–2020, log scale]
9/24/2014 8 © S. Sedukhin, University of Aizu
-
9/24/2014 9
[Figure: DP peak performance (GFLOPS, log scale) of the surveyed processors, 1990–2020; labeled points range from IBM POWER1 (0.08), HP PA-8000 (0.4), Hitachi SuperH-4 (1.4), and Intel Pentium III (2.2) up to NVIDIA K20X (1312), NVIDIA GK110 (1732), AMD S9150 (2534), and AMD W9100 (2619)]

2014: Peak Performance (GFLOPS)
1. AMD W9100: 2619
2. AMD S9150: 2534
3. NVIDIA GTX 780 Ti: 1707
4. NVIDIA GTX 980: 1537
5. Fujitsu SPARC64 XIfx: 1197
6. Intel Haswell-EP: 1008
7. IBM POWER8: 202
© S. Sedukhin, University of Aizu
-
[Figure: Power/throughput (W/GFLOPS) vs. area/throughput (mm²/GFLOPS) for the surveyed processors, with 1 W/mm² and 0.1 W/mm² reference lines; labeled points range from early parts (Merced, Madison, Pentium III, Montecito, Xeon 3.80E) to recent ones (POWER8, SPARC64 XIfx, Tesla K20X, GTX 780/980, W9100, S9150, Haswell/Haswell-EP)]

9/24/2014 10
© S. Sedukhin, University of Aizu
-
9/24/2014 11

[Figure: 2014 peak-performance comparison on a 10–10,000 GFLOPS log scale]

2014:
| Processor | Vendor | Clock Frequency (GHz) | Die Area (mm²) | TDP (Watts) | # DP FMA units | Peak Performance (GFLOPS) | GFLOPS/W | Area/GFLOPS (mm²/GFLOPS) |
| POWER8 | IBM | 4.2 | 675 | 250 | 24 | 201.6 | 0.81 | 3.35 |
| SPARC64 XIfx | Fujitsu | 2.2 | 600 | 160 | 272 | 1197 | 7.5 | 0.50 |
| S9150 | AMD | 0.90 | 438 | 235 | 1408 | 2534 | 10.8 | 0.17 |

Performance_peak (GFLOPS) = #FMA units × 2 FLOPs × F (GHz), where F is the clock frequency.
© S. Sedukhin, University of Aizu
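A quick numeric check of the peak-performance formula against the three table rows (all values taken from the table above); a minimal sketch:

```c
#include <stdio.h>

/* Peak DP performance: #FMA units x 2 FLOPs per FMA x clock frequency (GHz) */
static double peak_gflops(int fma_units, double freq_ghz) {
    return fma_units * 2.0 * freq_ghz;
}

int main(void) {
    struct { const char *name; int fma; double ghz; double tdp; } chip[] = {
        { "POWER8",       24,   4.2,  250.0 },
        { "SPARC64 XIfx", 272,  2.2,  160.0 },
        { "S9150",        1408, 0.90, 235.0 },
    };
    for (int i = 0; i < 3; i++) {
        double p = peak_gflops(chip[i].fma, chip[i].ghz);
        printf("%-13s %8.1f GFLOPS  %5.2f GFLOPS/W\n",
               chip[i].name, p, p / chip[i].tdp);
    }
    return 0;   /* prints 201.6, 1196.8, and 2534.4 GFLOPS */
}
```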
-
Computing Efficiency: Performance per Watt
(target: 50 GFLOPS/Watt for an exaflop supercomputer @ 20 MW)

[Figure: GFLOPS/W of the surveyed processors, 1990–2020, log scale; labeled points range from IBM POWER1 (0.02) and Intel Itanium (0.03) up to NVIDIA K20X (5.6), NVIDIA GK110 (7.7), AMD W9100 (9.5), and AMD S9150 (10.8)]

2014: Efficiency (GFLOPS/W)
AMD S9150: 10.8
AMD W9100: 9.5
NVIDIA GTX 980: 9.3
Intel Haswell-EP: 8.4
Fujitsu SPARC64 XIfx: 7.5
NVIDIA GTX 780 Ti: 6.8
IBM POWER8: 0.8
9/24/2014 12 © S. Sedukhin, University of Aizu
-
Even more FMA Units in Supercomputers

! IBM Roadrunner's peak performance: 12,960 PowerXCell 8i × 8 SPEs × 2-way FMAs/clock = 207,360 FMAs/clock (@3.2 GHz) ≈ 1.3 PFLOPS
! K Supercomputer: 80,000 SPARC64 VIIIfx chips, each an 8-core CPU at 128 GFLOPS, i.e. 640,000 cores × 4-way FMAs/clock = 2.56M FMAs/clock (@2 GHz) = 10.24 PFLOPS
! IBM BlueGene/Q (Sequoia): 1.3M cores × 4-way FMAs/clock = 5.2M FMAs/clock ≈ 10M FLOPs/clock (@2 GHz) ≈ 20 PFLOPS
! IBM Mira: 16-core Power A2 (@1.6 GHz); 750K cores × 4-way = 3M FMAs/clock ≈ 6.14M FLOPs/clock ≈ 9.83 PFLOPS
! GPU-based Supers:
  – Tianhe-1A: Tesla M2050 (Fermi): 256 FMA/clock (@1.6 GHz) × 7,168 GPUs ≈ 1.8M FMAs/clock ≈ 5.9 PFLOPS
  – TSUBAME 2: Tesla M2050 (Fermi): 256 FMA/clock (@1.6 GHz) × 4,224 GPUs ≈ 1.1M FMAs/clock ≈ 3.5 PFLOPS
! Tianhe-2: 16,000 nodes × (2 Intel Xeon Ivy Bridge + 3 Intel Xeon Phi):
  – Phi: 480 × 3 = 1440 FMA/clock/node × 16,000 ≈ 23M FMAs/clock × 2 FLOPs @1.053 GHz ≈ 48.5 PFLOPS
  – Ivy Bridge: 12 cores × 4 FMAs/cycle × 2 = 96 FMA/clock/node × 16,000 ≈ 1.54M FMAs/clock × 2 FLOPs @2.2 GHz ≈ 6.8 PFLOPS
  – Total: 23M + 1.54M ≈ 25M FMAs/clock; 48.5 PFLOPS + 6.8 PFLOPS ≈ 55 PFLOPS
  – Footprint: 720 m² ≈ 27 m × 27 m
! Synchronization of processes in such multi-node GALS-Supers is provided by distributed global "Barrier Synchronization" in hardware and/or software!
  – "Barrier Synchronization" on a 32-processor shared-memory SGI Origin 300 system: 232,000 cycles ≈ 22 MFLOPs
  – The time of "Barrier Synchronization" depends on the size of the system
  – Scalability of MPI_Barrier in the "Earth Simulator": ~3 μsec, i.e. 333 kHz while the operational frequency is 1 GHz, a difference of 3000 times…
9/24/2014 13 © S. Sedukhin, University of Aizu
-
Source: Matzke, D., "Will physical scalability sabotage performance gains?," Computer, vol. 30, no. 9, pp. 37-39, Sep. 1997
Source: Agarwal, V., Hrishikesh, M. S., Keckler, S. W., Burger, D., "Clock rate versus IPC: the end of the road for conventional microarchitectures," SIGARCH Comput. Archit. News 28, 2 (May 2000), 248-259
[Figure: Clock speed (GHz) of the surveyed processors, 1990–2020, log scale]
Trends for “Clock Region” or “Span of Control”
9/24/2014 14 © S. Sedukhin, University of Aizu
-
More Performance and Less Power Consumption by Frequency Reduction

• Performance P (GFLOPS) = #FPUs × 2 FLOPs × F (GHz), where F is the clock frequency.
• The number of FPUs (#FPUs) defines the area A of a system. The clock period T = 1/F should be long enough to "cover" A in a single clock.
• The maximum area "reachable" in a single clock period would be
  A = πR², for planar technology (area of a circle);
  A = (4/3)πR³, for 3D technology (volume of a ball/sphere);
  where the "reachability" radius R = S·T = S/F and S is the speed of clock-signal propagation in the medium (wires, optics, etc.): S = kĊ, with the speed of light Ċ and 0.5 < k < 1.
9/24/2014 15 © S. Sedukhin, University of Aizu
-
More Performance and Less Power Consumption by Frequency Reduction

• Reducing the frequency F m times, i.e. increasing the "reachability" radius R m times, increases the area A, and therefore the number of FPUs,
  m² times for planar technology; m³ times for cubical technology.
• Because each FPU becomes m times "slower", the performance P increases
  m times for planar technology; m² times for cubical technology.
• The power consumption E of a processor is a function of the frequency F: E = C·V²·F, where C is capacitance and V is voltage. Hence, reducing the frequency m times proportionally reduces the power.

Reality:
• VLSI technology uses not Euclidean but Manhattan geometry (metric), i.e. the area is not a circle/sphere but a square/cube!
• Historically, VLSI technology has increased the chip area (die size) very slowly (only ×2, from 350 to 700 mm², over the last 25 years), while decreasing the feature size exponentially (×50, from 1000 to 20 nm, over the same period)!
9/24/2014 16 © S. Sedukhin, University of Aizu
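The scaling argument above can be made concrete with a toy model: pick a reduction factor m and scale the FPU count, performance, and per-FPU dynamic power accordingly, comparing planar and cubical technology. A minimal sketch, assuming a purely illustrative baseline of 1000 FPUs at 1 GHz and 1 W per FPU:

```c
#include <stdio.h>

int main(void) {
    double fpus = 1000.0, f = 1.0, p_fpu = 1.0;   /* illustrative baseline */

    for (int m = 1; m <= 8; m *= 2) {
        double f_m = f / m;                               /* reduced frequency     */
        double perf2d = (fpus * m * m) * 2.0 * f_m;       /* m^2 FPUs (planar)     */
        double perf3d = (fpus * m * m * m) * 2.0 * f_m;   /* m^3 FPUs (cubical)    */
        double p_fpu_m = p_fpu / m;                       /* E = C*V^2*F ~ 1/m     */
        printf("m=%d  planar: %8.0f GFLOPS  cubical: %9.0f GFLOPS  per-FPU power: %.3f W\n",
               m, perf2d, perf3d, p_fpu_m);
    }
    return 0;  /* planar performance grows ~m, cubical ~m^2, per-FPU power falls ~1/m */
}
```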
-
The Light-Speed Barrier limits the size of a synchronous computer by the ratio D ≤ S / F, where D is the diameter of the system (mm), S is the speed of clock-signal propagation (mm/s), S = kĊ, with the speed of light Ċ ≈ 3×10^11 mm/s and 0.5 < k < 1, and F is the operational frequency in Hz (1/s).

[Figure: die size vs. frequency with the light-speed and heat barriers; D ≈ (Die Size)^1/2]

Classical theory of computing: the time of transferring data to the processor is neglected. It is a theory of slow computing, as classical mechanics is the theory of slow motion.

Relativistic theory of computing: pays most attention to transferring data to the processor. This is a theory of fast computing, as in relativistic, speed-of-light mechanics.

Tianhe-2 footprint: 720 ≈ 27×27 (m²). Chip area: 700 ≈ 27×27 (mm²).
9/24/2014 17 © S. Sedukhin, University of Aizu
-
9/24/2014 18

Typical DGEMM Performance on CPUs & GPUs

• AMD "Tahiti" HD 7970: 512 FMAs @0.925 GHz (947 GFLOPS)
• AMD "Cayman" HD 6970: 384 FMAs @0.880 GHz (676 GFLOPS)
• NVIDIA "Fermi" Tesla M2090: 512 FMAs @0.650 GHz (666 GFLOPS)
• NVIDIA "Kepler" GTX 670 OC: 96 FMAs @1.085 GHz (122 GFLOPS)
• Intel "Sandy Bridge" Core i7 3960X: 32 FMAs @2.7 GHz (173 GFLOPS)
• AMD "Bulldozer" FX-8150: 32 FMAs @2.6 GHz (166 GFLOPS)

Source: Matsumoto, K.; Nakasato, N.; Sedukhin, S.G., "Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs," High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 396-405, 10-16 Nov. 2012

For today's supercomputers with more than 10^6 FMA units, Nmax ≈ 10^7 and N1/2 ≈ 10^6. Such performance scalability is not acceptable for mobile/embedded applications.
© S. Sedukhin, University of Aizu
-
Tree-structured Flat Machines: Low Performance Scalability

• Today's scientific-oriented computers are very inertial: initial data are very far from the FPUs
• A pipelined FPU with a few cycles of latency requires a few concurrent threads (no fine-grained implementation)
• The storage hierarchy adds more overhead with each additional level of hierarchy (coarse-grained memory accesses)
• Data in the storage hierarchy are copied at each level: no computing-in-place is possible
• Chips are power limited and most power is spent on (global) data movement and replication
• Processor/memory/interconnect are scaled differently (progressive computer scaling is provided by drastically increasing the data or problem size, Nmax for FLOPSmax)
9/24/2014 19
Source: Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA
© S. Sedukhin, University of Aizu
-
Target: Mesh/Torus-structured Machines

• Data reuse by local data movement between FPUs (not by using hierarchical data storage and global multiple data replication)
• Fine-grained data processing (computing and data access/exchange)
• Combination with parallel read-out sensors and stacked memory is possible (computing-in-place)
• Processor/Memory/Interconnect as a unified element of structural scalability which keeps a single image of the system, like a biological cell
• 2D/3D machines for computing 2D/3D tensor data
• More specialized, but more reactive, computer organization
• Global synchronization by locally coordinated (asynchronous) massive data circulation

9/24/2014 20
[Figure: a torus of processing elements (16 shown), each combining a processor (P), local memory (LM), and a network interface (N), with i, j, k orbit directions]
© S. Sedukhin, University of Aizu
-
Relations between ARITHMETIC & ALGEBRA
9/24/2014 21 © S. Sedukhin, University of Aizu
-
FMA and Algebraic Semiring

FMA: { R, (×, +), 0.0, 1.0 }
• Set of real numbers: R
• Fused arithmetic multiply-and-add: (×, +)
• Two constants from R: 0.0 and 1.0

Semiring: { S, ⊗, ⊕, 0, 1 }
• Set of numbers: S
• Two algebraic operations: multiply ⊗ and add ⊕
• Two constants from S: 0 and 1
9/24/2014 22 © S. Sedukhin, University of Aizu
-
Algebraic Path Problem

• Problems from different disciplines represented as a single algorithmic scheme (rich in FMA operations)
  – Linear Algebra
    - computing the inverse of a matrix
  – Graph and Network Problems
    - transitive & reflexive closure and transitive reduction
    - shortest-distance problems (distance functions)
    - capacity problems (max flow, network capacity, tunnel problem)
    - connectivity measures for reliability networks
    - stochastic communication network problems
  – Regular Language Problems
    - correspondence: regular expressions and finite-state automata
• Unification based on the theory of algebraic semirings
  – Different semirings for different applications
• Solution as a case of a matrix closure problem
9/24/2014 23 © S. Sedukhin, University of Aizu
-
The Algebraic Path Problem: Definition

! Let G = (V, E, w) be a weighted graph, where V = {0, 1, …, n−1} is a set of n vertices, E = V × V is a set of edges, and w: E → S is an edge-weighting function whose values are taken from the set S.
! The weighting function belongs to the so-called path algebra or algebraic semiring (S, ⊕, ⊗, ∗, 0, 1).
9/24/2014 24 © S. Sedukhin, University of Aizu
-
Scalar Semiring

• A closed semiring (S, ⊕, ⊗, ∗, 0, 1) is an algebraic structure defined by:
  – a set of scalar elements S;
  – two binary operations:
    • addition ⊕ : S × S → S,
    • multiplication ⊗ : S × S → S;
  – a unary operation called closure ∗ : S → S;
  – two constants 0 and 1 in S, where 0 and 1 are the neutral elements for ⊕ and ⊗, respectively.
9/24/2014 25 © S. Sedukhin, University of Aizu
-
Different Semirings and Associated Problems

9/24/2014 26

| S | ⊕ | ⊗ | closure (*) | 0 | 1 | Basic Problem | Core Algorithm |
| {0,1} | ∨ | ∧ | a* = 1 | 0 | 1 | Transitive Closure | Warshall |
| ℝ | + | × | a* = (1 − a)^−1 | 0 | 1 | Matrix Inversion | Gauss-Jordan |
| ℝ+ ∪ {+∞} | min | + | a* = 0 | +∞ | 0 | All-Pairs Shortest Paths | Floyd |
| ℝ+ ∪ {−∞} | max | + | a* = 0 | −∞ | 0 | Maximum Cost (Critical Path) | ? |
| ℝ+ ∪ {+∞} | max | min | a* = ∞ | 0 | +∞ | Maximum Capacity Paths | ? |
| ℝ[0,1] | max | × | a* = 1 | 0 | 1 | Maximum Reliability Paths | ? |
| ℝ[0,1] | min | × | a* = 1 | 0 | 1 | Minimum Reliability Paths | New |
| ℝ+ ∪ {+∞} | min | max | a* = 0 | +∞ | 0 | Minimum Spanning Tree | Maggs-Plotkin |

Given an n×n matrix A, the distance/cost matrix of a weighted n-vertex graph, the APP is to find the closure A* of the matrix A in the different algebraic semirings:
A* = I_n ⊕ A ⊕ A² ⊕ A³ ⊕ … = I ⊕ (A ⊗ A*)
Solution by the unified Gauss-Jordan/Warshall/Floyd (GJWF) algorithm.
© S. Sedukhin, University of Aizu
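As one concrete instance of the GJWF scheme, the Floyd algorithm computes the closure in place in the (min, +) semiring (all-pairs shortest paths); a minimal sketch with a hypothetical 4-vertex graph:

```c
#include <stdio.h>

#define N 4
#define INF 1e30   /* the semiring "zero" for (min, +): no path */

/* Floyd (GJWF in the (min,+) semiring): d[i][j] <- min(d[i][j], d[i][k] + d[k][j]) */
static void floyd_closure(double d[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
}

int main(void) {
    /* Hypothetical weighted 4-vertex graph; INF = no edge, 0 on the diagonal */
    double d[N][N] = {
        { 0,   3,   INF, 7   },
        { 8,   0,   2,   INF },
        { 5,   INF, 0,   1   },
        { 2,   INF, INF, 0   },
    };
    floyd_closure(d);
    for (int i = 0; i < N; i++, printf("\n"))
        for (int j = 0; j < N; j++)
            printf("%6.1f ", d[i][j]);
    return 0;
}
```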
-
Matrix Semiring

• (S^{b×b}, ⊕, ⊗, ∗, 0, I) is an algebraic structure defined by:
  – a set of b×b matrices S^{b×b} over a closed scalar semiring (S, ⊕, ⊗, ∗, 0, 1);
  – two binary operations:
    • matrix addition ⊕ : S^{b×b} × S^{b×b} → S^{b×b},
    • matrix multiplication ⊗ : S^{b×b} × S^{b×b} → S^{b×b};
  – a unary operation, the closure ∗ of a matrix: S^{b×b} → S^{b×b};
  – two b×b matrices of constants in S^{b×b}:
    - 0, where all elements are equal to 0 (zero matrix),
    - I, where all diagonal elements are equal to 1 (identity matrix).
9/24/2014 27 © S. Sedukhin, University of Aizu
-
Scalar FMA Operation ⇔ Matrix FMA Operation

Arithmetic: Scalar Fused Multiply-Add
• (×, +)-algebra (MMA):     c ← c + a·b
• (+, min)-algebra (SPP):   c ← min(c, a + b)
• (+, max)-algebra (CRP):   c ← max(c, a + b)
• (min, max)-algebra (MCP): c ← max(c, min(a, b))
• (×, max)-algebra (MRP):   c ← max(c, a × b)
• (max, min)-algebra (MST): c ← min(c, max(a, b))

Matrix Algebra: Matrix Fused Multiply-Add
• (×, +)-algebra:     c_ij ← c_ij + Σ_{k=0}^{n−1} c_ik · c_kj
• (+, min)-algebra:   c_ij ← min{ c_ij, min_{k=0}^{n−1} (c_ik + c_kj) }
• (+, max)-algebra:   c_ij ← max{ c_ij, max_{k=0}^{n−1} (c_ik + c_kj) }
• (min, max)-algebra: c_ij ← max{ c_ij, max_{k=0}^{n−1} min(c_ik, c_kj) }
• (×, max)-algebra:   c_ij ← max{ c_ij, max_{k=0}^{n−1} (c_ik × c_kj) }
• (max, min)-algebra: c_ij ← min{ c_ij, min_{k=0}^{n−1} max(c_ik, c_kj) }

MMA – Matrix Multiply-Add; SPP – Shortest Paths Problem; CRP – Critical Path Problem; MCP – Maximum Capacity Paths; MRP – Maximum Reliability Paths; MST – Minimum Spanning Tree

9/24/2014 28
© S. Sedukhin, University of Aizu
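The table above maps directly onto code: each algebra is just a different scalar "multiply" and "add" plugged into the same n³ matrix multiply-add loop. A minimal sketch (the function names and the 3×3 data are illustrative only):

```c
#include <stdio.h>

#define N 3

typedef double (*op_t)(double, double);

/* scalar "multiply" and "add" candidates for the different algebras */
static double f_add(double a, double b) { return a + b; }
static double f_mul(double a, double b) { return a * b; }
static double f_min(double a, double b) { return a < b ? a : b; }

/* Generic matrix fused multiply-add: C <- C (+) (A (x) B) in the chosen semiring */
static void mfma(double c[N][N], double a[N][N], double b[N][N],
                 op_t otimes, op_t oplus) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] = oplus(c[i][j], otimes(a[i][k], b[k][j]));
}

int main(void) {
    double a[N][N] = {{0,2,9},{7,0,3},{4,6,0}};
    double b[N][N] = {{0,2,9},{7,0,3},{4,6,0}};
    double mma[N][N] = {{0}};                      /* (x,+): ordinary GEMM         */
    double spp[N][N] = {{0,2,9},{7,0,3},{4,6,0}};  /* (+,min): shortest-path update */

    mfma(mma, a, b, f_mul, f_add);   /* MMA: c_ij += a_ik * b_kj        */
    mfma(spp, a, b, f_add, f_min);   /* SPP: c_ij = min(c_ij, a_ik+b_kj) */

    printf("MMA c[0][2]=%g  SPP c[0][2]=%g\n", mma[0][2], spp[0][2]);
    return 0;
}
```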
-
Accelerators: Scalar FMA Unit ⇔ Matrix FMA Unit

Arithmetic: Scalar Fused Multiply-Update Unit ⇔ Matrix Algebra: Matrix FMA Array Processor

No need to understand how the FMA unit is internally constructed! No need to understand how the "Big Multiplier" is internally constructed!

9/24/2014 29
© S. Sedukhin, University of Aizu
-
XGEMM-based Algorithms/Architecture

• GEMM in Different Algebras (XGEMM):
  C ← A ⊗ B ⊕ C
  C ← Aᵀ ⊗ B ⊕ C
  C ← A ⊗ Bᵀ ⊕ C
• Chaining Matrix Products:
  D ← Aᵀ ⊗ B ⊗ C
  D ← A ⊗ B ⊗ Cᵀ
• Focal-Plane I/O for Streaming Data
• Computing-near-Data

Streaming 2D Data from Sensor Array or Stacked Memory

30 9/24/2014
© S. Sedukhin, University of Aizu
-
EXTREMELY-SCALABLE GENERAL MATRIX MULTIPLY-ADD ALGORITHM
Algorithmically and Technologically
9/24/2014 31 © S. Sedukhin, University of Aizu
-
Matrix-by-Matrix Multiply-Add (MMA)

C ← A×B + C, where A, B, and C are dense n×n matrices:
  c^(k)(i, j) ← a(i, k)·b(k, j) + c^(k−1)(i, j), accumulated over k = 0, …, n−1, for 0 ≤ i, j < n.

• The computational index space is a bounded grid of n×n×n index points: ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³, where at each index point p = (i, j, k)ᵀ ∈ ℑ we have to update c(i, j) by implementing the scalar multiply-add operation c(i, j) ← a(i, k)·b(k, j) + c(i, j); i.e. all three scalars a(i, k), b(k, j), and c(i, j) should be available at this index point before computing starts.
• Because there are only 3n² input matrix data but n³ index points, no more than n² data-independent index points can be activated (i.e., no more than n² multiply-add operations per time-step) if no data replication is considered.
• Hence, n is the minimal number of "read-compute-write" steps to implement a matrix-by-matrix multiply-add.

Example: p = (2, 1, 2)ᵀ ∈ ℑ:
  c^(2)(2, 1) ← a(2, 2)·b(2, 1) + c^(1)(2, 1), i.e. C^(2)[2,1] ← A[2,2]×B[2,1] + C^(1)[2,1]

32 9/24/2014
© S. Sedukhin, University of Aizu
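In code, the n³ index points are simply the iterations of a triple loop, with one scalar multiply-add per point; a minimal sketch:

```c
#include <stdio.h>

#define N 4

/* C <- A*B + C: one scalar multiply-add per index point (i, j, k) of the n^3 grid */
static void mma(double c[N][N], const double a[N][N], const double b[N][N]) {
    for (int k = 0; k < N; k++)          /* any loop order visits the same n^3 points */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + 1; b[i][j] = j + 1; }
    mma(c, a, b);
    printf("c[2][3] = %g\n", c[2][3]);   /* sum over k of 3*4 = 48 */
    return 0;
}
```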
-
Time-Space Scheduling of MMA

! A time-scheduling function step(p): Z³ → Z, for all p = (i, j, k)ᵀ from ℑ
  – linear or modular form: step(p) = αᵀp or step(p) = (αᵀp) mod n, where α = (α_i, α_j, α_k)ᵀ is a time-scheduling vector
! A space-scheduling function allocation(p): Z³ → Z², for all p = (i, j, k)ᵀ from ℑ
  – linear projection method: allocation(p) = S × p, where S is a 2×3 space transformation matrix corresponding to the projection vector η, such that S×η = 0 and αᵀη ≠ 0

9/24/2014 33
© S. Sedukhin, University of Aizu
-
Broadcast-Compute-Shift (BCS) Scheduling

Grid-based scheduling. The time-scheduling function: step(p) = αᵀ·p, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (0, 0, 1)ᵀ, i.e. step(p) = k, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³.
On each step(p) = s ∈ [0, n), broadcasting of the n-element column a_s ∈ A and the n-element row b_sᵀ ∈ B is required to update the n²-element matrix C^(s). Hence, BCS scheduling corresponds to implementing, on each time-step s = 0, 1, …, n−1, the rank-1 update C^(s+1) ← C^(s) + a_s ⊗ b_sᵀ.
All initial/intermediate/final data reside inside the index space ℑ.
The total number of "broadcast-compute-shift" steps is n.

34 9/24/2014
© S. Sedukhin, University of Aizu
-
All-Shift-Compute (ASC) Scheduling

Systolic or mesh scheduling. The time-scheduling function: step(p) = αᵀ·p, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (1, 1, 1)ᵀ, i.e. step(p) = i + j + k, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³.
It is a "broadcast-to-pipeline" or "time-multiplexing" version of BCS scheduling. All initial data are aligned and located on a hyper-plane outside of the index space ℑ; the same holds for the final data.
The total number of "all-shift-compute" steps is 3n−2.

35 9/24/2014
© S. Sedukhin, University of Aizu
-
Broadcast-Compute-Roll (BCR) Scheduling

Cylindrical scheduling. The time-scheduling function: step(p) = (αᵀ·p) mod n, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (−1, 0, 1)ᵀ, i.e. step(p) = (k − i) mod n, for p = (i, j, k)ᵀ ∈ ℑ ⊆ Z³.
This scheduling requires, on each step(p) = s ∈ [0, n), that the s-th diagonal n-vector a_s ∈ A be broadcast (α_j^A = 0) along the j-axis to compute (update) the matrix C, and then matrices B and C are rolled opposite the i-axis/orbit (α_i^B = −1) and along the k-orbit (α_k^C = 1).
All initial/intermediate/final data reside inside the index space ℑ.
The total number of "broadcast-compute-roll" steps is n.

36 9/24/2014
© S. Sedukhin, University of Aizu
-
Compute-Roll-All (CRA) Scheduling

Orbital or toroidal scheduling. The modular time-scheduling function: step(p) = (αᵀ·p) mod n, where α = (α_i^B, α_j^A, α_k^C)ᵀ = (±1, ±1, ±1)ᵀ, i.e. step(p) = (±i ± j ± k) mod n, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z_n³.
The computational index space ℑ is a 3D torus. At each step(p) = s ∈ [0, n):
  computing: c(i, j) ← c(i, j) + a(i, k)·b(k, j);
  roll-all: a(i, k), b(k, j), and c(i, j) are rolled along the ±j-, ±i-, and ±k-orbits, respectively.
All initial/intermediate/final data reside inside the index space ℑ.
The total number of "compute-roll-all" steps is n.

37 9/24/2014
© S. Sedukhin, University of Aizu
-
3D Index-Space → 2D Processor-Space Allocation

step(p) = [−i − j + k] mod n; shown for step(p) = 0, n = 4.

[Figure: the 4×4×4 toroidal index space at step(p) = 0 and its three projections along (−1,0,0)ᵀ, (0,−1,0)ᵀ, and (0,0,−1)ᵀ onto a 4×4 processor torus, giving B-, A-, and C-stationary layouts respectively, each with the stationary matrix in a canonical layout while the other two matrices are rolled; the C-stationary variant is Cannon's Algorithm [1969].]

9/24/2014 38
© S. Sedukhin, University of Aizu
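The C-stationary projection is exactly Cannon's algorithm: after an initial skew of A (row i shifted left by i) and B (column j shifted up by j), each of the n compute-roll steps performs one local multiply-add per PE and then rolls A along its rows and B along its columns. A minimal single-process sketch, simulating an n×n torus of PEs each holding one element:

```c
#include <stdio.h>

#define N 4

/* circular left-shift of row i of m by one position */
static void roll_row(double m[N][N], int i) {
    double t = m[i][0];
    for (int j = 0; j < N - 1; j++) m[i][j] = m[i][j + 1];
    m[i][N - 1] = t;
}

/* circular up-shift of column j of m by one position */
static void roll_col(double m[N][N], int j) {
    double t = m[0][j];
    for (int i = 0; i < N - 1; i++) m[i][j] = m[i + 1][j];
    m[N - 1][j] = t;
}

static void cannon(double c[N][N], double a[N][N], double b[N][N]) {
    /* initial skew: row i of A shifted left i times, column j of B shifted up j times */
    for (int i = 0; i < N; i++) for (int s = 0; s < i; s++) roll_row(a, i);
    for (int j = 0; j < N; j++) for (int s = 0; s < j; s++) roll_col(b, j);

    for (int step = 0; step < N; step++) {           /* n compute-roll steps        */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][j] * b[i][j];        /* each PE: one local FMA      */
        for (int i = 0; i < N; i++) roll_row(a, i);  /* roll A along its rows       */
        for (int j = 0; j < N; j++) roll_col(b, j);  /* roll B along its columns    */
    }
}

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i * N + j; b[i][j] = (i == j); }
    cannon(c, a, b);                                 /* with B = I, C must equal A  */
    printf("c[1][2] = %g (expect %g)\n", c[1][2], (double)(1 * N + 2));
    return 0;
}
```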
-
Comparison of Different MMA Implementations

| | BCS | ASC | BCR | CRA |
| # time-steps | n | 3n−2 | n | n |
| Index Space | 3D Grid | 3D Mesh | 3D Cylinder | 3D Torus |
| I/O Data Location | Inside | Outside | Inside | Inside |
| Data reuse/step | 2n | 1…3n(n+1)/2…1 | n + n² | 2n² |
| Data Movement (Global/Local) | Broadcast/Shift | All Shift | Broadcast/Roll | All Roll |
| Computing-in-Place | Possible | Impossible | Possible | Possible |
| 3D → 2D Projection | 2D Grid: Agarwal et al. (1994); van de Geijn (1995) | 2D Mesh: Kung, Leiserson (1979) | 2D Torus: Fox, Otto, Hey (1987) | 2D Torus: Cannon (1969) |
| Scalability | Bad | Good | Bad | Good |
| Final Selection | No | No | No | Yes |

39 9/24/2014
39 9/24/2014 © S. Sedukhin, University of Aizu
-
Three Forms of MMA

• Any time-step scheduling function defines only the matrix data distribution among the active index points and does not specify how the computing is actually performed.
• By using the same time-scheduling function we can implement either:

(k): C ← A·B + C   ⇔  c(i, j) ← c(i, j) + Σ_{k=0}^{n−1} a(i, k)·b(k, j);
(i): B ← Aᵀ·C + B  ⇔  b(k, j) ← b(k, j) + Σ_{i=0}^{n−1} a(i, k)·c(i, j);
(j): A ← C·Bᵀ + A  ⇔  a(i, k) ← a(i, k) + Σ_{j=0}^{n−1} c(i, j)·b(k, j).

Accumulation along k, i, and j gives the NN-, TN-, and NT-forms (¾ of GEMM).

40 9/24/2014
© S. Sedukhin, University of Aizu
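The three accumulation directions are just three orderings of the same n³ index points, each updating a different matrix in place; a minimal sketch:

```c
#include <stdio.h>

#define N 4

/* (k): NN-form, C <- A*B + C */
static void nn(double c[N][N], const double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++) for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j];
}
/* (i): TN-form, B <- A^T*C + B (accumulation along i) */
static void tn(double b[N][N], const double a[N][N], const double c[N][N]) {
    for (int k = 0; k < N; k++) for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) b[k][j] += a[i][k] * c[i][j];
}
/* (j): NT-form, A <- C*B^T + A (accumulation along j) */
static void nt(double a[N][N], const double c[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++) for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++) a[i][k] += c[i][j] * b[k][j];
}

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1; b[i][j] = 2; }
    nn(c, a, b);            /* C = A*B     -> every c[i][j] = 8    */
    tn(b, a, c);            /* B += A^T*C  -> every b[k][j] = 34   */
    nt(a, c, b);            /* A += C*B^T  -> every a[i][k] = 1089 */
    printf("c=%g b=%g a=%g\n", c[0][0], b[0][0], a[0][0]);
    return 0;
}
```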
-
Chaining of Matrix Products

• Because in the orbital scheduling the resulting data properly resides inside the index space, by using the three forms of accumulation (k), (i), and (j) we can efficiently implement chaining of MMAs.
• Two-MMA chaining:
  (j, k): E ← (C·Bᵀ + A)·D + E;
  (i, k): E ← D·(Aᵀ·C + B) + E;
  (k, j): E ← (A·B + C)·Dᵀ + E;
  (i, j): E ← Dᵀ·(Aᵀ·C + B) + E;
  (j, i): E ← (C·Bᵀ + A)ᵀ·D + E;
  (k, i): E ← Dᵀ·(A·B + C) + E.
• 2n time-steps are needed to complete this chaining.

41 9/24/2014
© S. Sedukhin, University of Aizu
-
GEMM-BASED ORBITAL ALGORITHMS
9/24/2014 42 © S. Sedukhin, University of Aizu
-
Systolic → Orbital Rescheduling

• Linear Algebra
  – Matrix-by-matrix multiplication,
  – Solution of linear systems (matrix inversion),
  – LU, QR, SVD decompositions,
  – ...
• Digital Signal Processing
  – FIR, IIR, 1D/2D convolution,
  – 2D DFT, DCT, DHT, ...
  – Dynamic scene analysis,
  – Image resampling,
  – Interpolation,
  – 1D/2D median filtering,
  – Geometric warping,
  – ...
• Non-numerical applications
  – Data structures (stack/queue, searching, priority queue, sorting),
  – Graph algorithms (transitive closure, minimum spanning tree, connected components, ...),
  – Language recognition (string matching, regular expressions),
  – Dynamic programming,
  – Relational database operations,
  – Encoders (polynomial division),
  – ...
9/24/2014 43 © S. Sedukhin, University of Aizu
-
9/24/2014 44 © S. Sedukhin, University of Aizu
-
2-Dimensional Separable Transforms

• Let X = [x(i, k)] be n×n signal or image data. A 2D forward transform of X is defined as
  x̃(u, v) = Σ_{i=0}^{n−1} Σ_{k=0}^{n−1} c(i, u) · x(i, k) · c(k, v),
  or, in matrix form, X̃ = Cᵀ × X × C.
• A 2D inverse transform: X = C × X̃ × Cᵀ.
• The transform coefficient matrix C can be
  – symmetric (C = Cᵀ) and unitary (C⁻¹ = C*ᵀ), as for the DFT or DHT;
  – unitary and real, as in the DCT;
  – only ±1 and symmetric and orthogonal, as in the DWHT.
• Chaining of matrix products: each 2D separable transform is computed as a chain of two matrix products, using the accumulation-order variants (j, k), (i, k), (k, j), (i, j), (j, i), (k, i) listed earlier.
• In total, 2n time-steps are needed to implement any 2D transform.

9/24/2014 45
© S. Sedukhin, University of Aizu
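A 2D separable transform is just two chained GEMMs: first T = Cᵀ·X, then X̃ = T·C. A minimal sketch using a 2×2 Walsh-Hadamard coefficient matrix, chosen only because its entries are ±1:

```c
#include <stdio.h>

#define N 2

/* R <- A^T * B */
static void gemm_tn(double r[N][N], const double a[N][N], const double b[N][N]) {
    for (int u = 0; u < N; u++) for (int j = 0; j < N; j++) {
        r[u][j] = 0.0;
        for (int i = 0; i < N; i++) r[u][j] += a[i][u] * b[i][j];
    }
}
/* R <- A * B */
static void gemm_nn(double r[N][N], const double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++) for (int v = 0; v < N; v++) {
        r[i][v] = 0.0;
        for (int k = 0; k < N; k++) r[i][v] += a[i][k] * b[k][v];
    }
}

int main(void) {
    double c[N][N] = {{1, 1}, {1, -1}};      /* 2x2 Walsh-Hadamard coefficients */
    double x[N][N] = {{3, 5}, {2, 7}};       /* illustrative 2x2 "image"        */
    double t[N][N], xt[N][N];

    gemm_tn(t, c, x);    /* T   = C^T * X              */
    gemm_nn(xt, t, c);   /* X~  = T * C = C^T * X * C  */

    for (int i = 0; i < N; i++, printf("\n"))
        for (int j = 0; j < N; j++) printf("%6.1f ", xt[i][j]);
    return 0;
}
```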
-
9/24/2014 46 © S. Sedukhin, University of Aizu
-
9/24/2014 47 © S. Sedukhin, University of Aizu
-
9/24/2014 48 © S. Sedukhin, University of Aizu
-
9/24/2014 49 © S. Sedukhin, University of Aizu
-
Data Manipulation by Matrix-Matrix Multiply-Add

! A generic form of the MMA operation: D ← MMA[⊗,⊕](A, B, C) : D ← A ⊗ B ⊕ C
! Row/column interchange: D(n,n) ← MMA[×,+](P(n,n), A(n,n), zero(n,n)), where P(n,n) is an (i,j)-permutation matrix
! Rows/columns rotation
! Scalar data replication (broadcast)
9/24/2014 50 © S. Sedukhin, University of Aizu
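For example, swapping two rows of A is a single MMA with a permutation matrix as the left operand and a zero matrix as the accumulator; a minimal sketch:

```c
#include <stdio.h>

#define N 3

/* D <- P*A + C : generic (x,+) matrix multiply-add */
static void mma(double d[N][N], const double p[N][N],
                const double a[N][N], const double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            d[i][j] = c[i][j];
            for (int k = 0; k < N; k++) d[i][j] += p[i][k] * a[k][j];
        }
}

int main(void) {
    /* P is the identity with rows 0 and 1 interchanged */
    double p[N][N] = {{0,1,0},{1,0,0},{0,0,1}};
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double zero[N][N] = {{0}}, d[N][N];

    mma(d, p, a, zero);            /* D = P*A + 0 : rows 0 and 1 of A swapped */
    for (int i = 0; i < N; i++, printf("\n"))
        for (int j = 0; j < N; j++) printf("%g ", d[i][j]);
    return 0;
}
```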
-
Global Reduction and Broadcast

! Generic form: B(n,n) ← ones(n,n) ⊗ A(n,n) ⊗ ones(n,n)
! For example, summation and broadcast:
  – C(n,n) ← MMA[×,+](ones(n,n), A(n,n), zero(n,n))
  – D(n,n) ← MMA[×,+](C(n,n), ones(n,n), zero(n,n))
! Maximum and broadcast:
  – C(n,n) ← MMA[×,max](ones(n,n), A(n,n), −inf(n,n))
  – C(n,n) ← MMA[×,max](C(n,n), ones(n,n), −inf(n,n))
9/24/2014 51 © S. Sedukhin, University of Aizu
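Multiplying by the all-ones matrix on the left sums each column of A (and replicates the result down the rows); multiplying the result by ones on the right then puts the grand total in every position, i.e. a global reduction followed by a broadcast. A minimal sketch:

```c
#include <stdio.h>

#define N 3

/* D <- A*B in the ordinary (x,+) algebra */
static void matmul(double d[N][N], const double a[N][N], const double b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            d[i][j] = 0.0;
            for (int k = 0; k < N; k++) d[i][j] += a[i][k] * b[k][j];
        }
}

int main(void) {
    double ones[N][N], a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double c[N][N], d[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) ones[i][j] = 1.0;

    matmul(c, ones, a);   /* c[i][j] = column sum of A, repeated in every row      */
    matmul(d, c, ones);   /* d[i][j] = total sum of A (45), broadcast everywhere   */

    printf("column sums: %g %g %g  total: %g\n", c[0][0], c[0][1], c[0][2], d[1][2]);
    return 0;
}
```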
-
CONCLUSION IN ONE SLIDE
9/24/2014 52 © S. Sedukhin, University of Aizu
-
9/24/2014
Matrix Array Processor
• 2D/3D Toroidal Array Processor for Computing-near-Data
  – Basic operation: matrix multiply-add
  – 3D toroidally connected banks of memory with attached scalar FMA units
  – Using slower and simpler cores
  – Multidimensional I/O
• Keeping integrity of data
• Highly scalable computing/memory/interconnect fabric

Accelerated Functions
• Matrix Mathematics
  – GEMM (BLAS Level 3): C = C + A×B; C = C + Aᵀ×B; C = C + A×Bᵀ
  – Linear Algebra
• Graph (Path) Algorithms: GEMM in different algebras
  – Transitive Closure
  – All-Pairs Shortest/Longest Paths
  – Critical Path
  – Maximum Capacity Path
  – Most Reliable Path
  – Minimum Spanning Tree
• Multidimensional Linear Transforms: 2D/3D Fwd/Inv separable transforms, Y = (Cᵀ×(X×C))×C, X = (C×(Y×Cᵀ))×Cᵀ
  – DFT, DCT, DHT, DWH, DST
• Data Manipulation: rotation, permutation, transposition, copy, replication, reduction, broadcast

Target Applications
• Medical Imaging / Visualization
• Radar Systems
• Sonar Systems
• Defense and Security IT
• Surveillance
• Wireless Communications
• Network Processing
• Voice and Pattern Recognition
• Computational Chemistry
• Climate Modeling
• Data Mining and Analysis
• Game Physics / Physics Simulation
• Life Sciences & Biotechnology
  – Computational chemistry and biology
  – In silico drug discovery
  – Gene sequencing
  – Pharmacogenomics
  – Protein folding
  – Molecular dynamics
  – Personalized medicine
  – Genomics, Proteomics, Metabolomics
  – Simulation of biological systems
• Geophysical Science
  – Seismic data processing
  – Petroleum reservoir modeling
• Financial Analysis and Modeling
• …
53 9/24/2014 © S. Sedukhin, University of Aizu
-
9/24/2014
The Human Brain
• Number of neurons: 10^11
• Number of synapses (adult): 10^14 (2,000-5,000 per neuron)
• Power consumption (adult): 20-40 Watts (0.5-4 nW/neuron)
• Maximum firing frequency of a neuron: 250-2,000 Hz (0.5-4 ms intervals)
• Signal propagation speed inside an axon: 90 m/s sheathed, …
-
THANK YOU !
9/24/2014 55 © S. Sedukhin, University of Aizu