A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
MAGMA QR, 1 GPU, All Available Cores
Mitch Horton, Stanimire Tomov, Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
20 July, 2011
Outline
1) Motivation
2) Algorithm Description
3) Algorithm Tuning
4) Algorithm Optimization
5) Results
6) What's Next
7) Power Efficiency
Motivation
Moore’s Law
The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. This trend has continued for more than half a century and is expected to continue until 2015 or 2020 or later.
Wikipedia
Kepler: 6,000,000,000 (2011)
Motivation
May’s Law
Software efficiency halves every 18 months, compensating Moore's Law.
Wikipedia
Motivation
Nothing you can't spell will ever work.
Will Rogers
Sourcebook of Parallel Computing
Dongarra, Foster, Fox
Algorithm Description
MAGMA QR, 1 GPU, All Available Cores
2 x 6 cores
3360 x 3360
NB = 128, IB = 12, OB = 128
QUARK LAPACK Factorization (6 cores, Q)
GPU Update
Sequential LAPACK Update (6 cores, P)
MAGMA 1.0+
Optimized Panel Factorization, 6 cores
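The slides split QR into a factorization step and update steps. A minimal sketch of that split, assuming a plain right-looking blocked QR, is given below. It is not MAGMA code: everything runs on the CPU through NumPy, `blocked_qr` is a hypothetical name, and only the block size NB is modeled (IB, OB, P, and Q are not). The comments mark which part plays the role of the panel factorization and which plays the role of the update.

```python
import numpy as np

def blocked_qr(a, nb=128):
    """Right-looking blocked QR (illustrative only). Returns (Q, R) with A = Q @ R."""
    a = np.array(a, dtype=float)          # work on a copy
    m, n = a.shape
    q_total = np.eye(m)
    steps = min(m, n)
    for k in range(0, steps, nb):
        kb = min(nb, steps - k)
        # Panel factorization: factor the current block column.
        # (In the slides this role belongs to the CPU cores.)
        q_panel, r_panel = np.linalg.qr(a[k:, k:k + kb], mode="complete")
        a[k:, k:k + kb] = r_panel
        # Trailing-matrix update: apply Q_panel^T to the columns to the right.
        # (In the slides this role belongs mostly to the GPU.)
        a[k:, k + kb:] = q_panel.T @ a[k:, k + kb:]
        # Accumulate the orthogonal factor so that A = Q_total @ a holds.
        q_total[:, k:] = q_total[:, k:] @ q_panel
    return q_total, np.triu(a)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a0 = rng.standard_normal((512, 512))
    q, r = blocked_qr(a0, nb=128)
    print(np.allclose(q @ r, a0))
```

Forming each panel's full Q explicitly keeps the sketch short; a production code would instead keep the Householder reflectors in compact form and apply them blockwise.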
Algorithm Tuning Nightmare: 8 x 6 cores, double precision
[Figure: tuned values of Q, P, NB, OB, and IB (axis 0 to 500) versus Matrix Size (800 to 18080)]
Algorithm Tuning Nightmare: 2 x 6 cores, double precision
[Figure: tuned values of Q, P, NB, OB, and IB (axis 0 to 500) versus Matrix Size (800 to 18080)]
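One way to see why tuning five parameters per matrix size becomes a nightmare is to write the search as an exhaustive sweep. The sketch below is only an illustration: `tune` and `benchmark` are hypothetical names, the candidate values are placeholders, and the constraints Q + P <= total cores and IB <= NB are assumptions, not rules stated in the slides.

```python
from itertools import product

def tune(benchmark, total_cores, nb_values, ob_values, ib_values):
    """Try every (Q, P, NB, OB, IB) combination; return (best_config, tried).

    benchmark(q, p, nb, ob, ib) must return a score such as Gflop/s
    (higher is better). It is supplied by the caller.
    """
    best_config, best_score, tried = None, float("-inf"), 0
    for q in range(1, total_cores + 1):           # cores for the factorization
        for p in range(0, total_cores - q + 1):   # cores for the CPU update (assumed Q + P <= total)
            for nb, ob, ib in product(nb_values, ob_values, ib_values):
                if ib > nb:                       # assumed constraint: inner block fits in NB
                    continue
                tried += 1
                score = benchmark(q, p, nb, ob, ib)
                if score > best_score:
                    best_config, best_score = (q, p, nb, ob, ib), score
    return best_config, tried

if __name__ == "__main__":
    # Toy stand-in for a real timing run, peaked at the values shown on the slides.
    def toy(q, p, nb, ob, ib):
        return -(abs(q - 6) + abs(p - 6) + abs(nb - 128) + abs(ob - 128) + abs(ib - 12))

    config, tried = tune(toy, 12, [64, 128, 192], [64, 128, 192], [8, 12, 16])
    print(config, tried)
```

Every added candidate value multiplies the number of timing runs, and the whole sweep has to be repeated for each matrix size, precision, and machine.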
Algorithm Optimization
Performance of multicore QR factorization on Tall Skinny Matrices: Comparing Different Algorithms
12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]
Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak
Single Node, Single GPU
[Figure: Gflop/s (0 to 20) versus Matrix Size (0 to 20000); series: Recursive, Left Looking Execution, Left Looking Insertion, Parallel MKL]
5920 x 128, 4 cores, IB=8
800 x 64, 6 cores, IB=12
23840 x 192, 2 cores, IB=16
Single Precision
Left Looking Insertion, Left Looking Execution, Recursive
Results
Performance of MAGMA QR with 1 GPU and all Available Cores: Comparing Precisions
48 Cores (8 x 6-cores), 2.8 GHz AMD Opteron 8439 SE, 129 GB, Peak 1080 Gflop/s [ig]
1 GeForce GTX 480, 1.401 GHz Clock, Theoretical Peak: 1.401 * 2 * 480 = 1.34496 Tflop/s
numactl --interleave=all
[Figure: Gflop/s (0 to 900) versus Matrix Size (0 to 25000); series: Single, Double, Complex Single, Complex Double]
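The theoretical peak quoted for the GeForce GTX 480 is a product of clock rate, floating-point operations per cycle per core, and core count. A small sketch of that arithmetic follows; `peak_gflops` is a hypothetical helper, and the factor of 2 is read here as one multiply-add per cycle, which is an interpretation rather than something the slide states.

```python
def peak_gflops(clock_ghz, flops_per_cycle, cores):
    """Theoretical peak in Gflop/s: clock (GHz) x flops per cycle per core x cores."""
    return clock_ghz * flops_per_cycle * cores

if __name__ == "__main__":
    # Values taken from the slide's formula: 1.401 * 2 * 480.
    gtx480 = peak_gflops(1.401, 2, 480)
    print(f"GTX 480 theoretical peak: {gtx480 / 1000:.5f} Tflop/s")
```

This is an upper bound on single precision throughput only; the measured curves sit well below it because a factorization cannot keep every unit busy on every cycle.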
Results
Performance of MAGMA QR with 1 GPU and all Available Cores: Comparing Precisions
12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]
Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak
Single Node, Single GPU
[Figure: Gflop/s (0 to 800) versus Matrix Size (0 to 20000); series: Double, Single, Complex Single, Complex Double]
Results
Performance of MAGMA QR with 1 GPU and all Available Cores, Double Precision: Comparing Against MAGMA 1.0 and MKL
12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]
Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak
Single Node, Single GPU
[Figure: Gflop/s (0 to 300) versus Matrix Size (0 to 20000); series: 1 GPU, 2 sockets, New Approach; 1 GPU, 1 Socket, MAGMA 1.0; 1 GPU, 1 Core, MAGMA 1.0; 0 GPUs, 1 Socket, MKL]
Results
Performance of MAGMA LU with 1 GPU and all Available Cores, Single Precision: Comparing Against MAGMA 1.0 and MKL
12 Cores (2 x 6-cores) 2.8 GHz X5660, 23 GB, 270 Gflop/s Peak [keeneland]
Tesla M2070, 1.1 GHz, 5.4 GB, 1.03 Tflop/s Peak
Single Node, Single GPU
[Figure: Gflop/s (0 to 600) versus Matrix Size (0 to 10000); series: 1 GPU, All Cores; 1 GPU, 1 Socket, MAGMA 1.0; 0 GPUs, 2 Sockets, MKL]
Power Efficiency
Math is free. Transistors are free. Power is expensive. Performance Per Watt = Performance.
Jen Hsun (gen shyuhn) [jensen] Huang, CEO of Nvidia
Power Efficiency
Peak performance of any system is essentially limited by the amount of power it can draw and the amount of heat it can dissipate. Consequently, performance per watt of a GPU design translates directly into peak performance of a system that uses that design.
While performance per watt is useful, absolute power requirements are also important. Claims of improved performance per watt may be used to mask increasing power demands. For instance, though newer generation GPU architectures may provide better performance per watt, continued performance increases can negate the gains in efficiency, and the GPUs continue to consume large amounts of power.
Wikipedia
Power Efficiency
A Google engineer has warned that if the performance per watt of today's computers doesn't improve, the electrical costs of running them could end up far greater than the initial hardware price tag.
"If performance per watt is to remain constant over the next few years, power costs could easily overtake hardware costs, possibly by a large margin," Luiz Andre Barroso, who previously designed processors for Digital Equipment Corp., said in a September paper.
Google recently unveiled a major new datacenter site in a remote part of Oregon, where power costs are a fraction of those at Google's home base in Silicon Valley.
Power Efficiency
This has nothing to do with being "green." Every system and subsystem has to fit within some power budget.
Brough Turner 2007
Power Efficiency
"You want a battery on this device and that device that lasts three or four days? I do, too. Well, if we have much more high-performing systems at much lower watts that will trickle down into your cell phone and this recorder and everything else. So, if I can get a processor that runs on 5 watts, as opposed to 100 watts or whatever . . . Voila, batteries are going to last a lot longer ..."
ORNL scientific computing chief Jeff Nichols