MAGMA uses a hybridization methodology where algorithms of interest are split into tasks of varying granularity and their execution scheduled over the available hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger more parallelizable ones, often Level 3 BLAS, are scheduled on the GPU.
PERFORMANCE & ENERGY EFFICIENCY
INDUSTRY COLLABORATION
MAGMA LU factorization in double precision arithmetic

[Figure: Gflop/s and Gflops/Watt vs. matrix size, comparing:]
  NVIDIA K40 GPU (Kepler): 15 MP x 192 @ 0.88 GHz
  NVIDIA P100 GPU (Pascal): 56 MP x 64 @ 1.19 GHz
  NVIDIA V100 GPU (Volta): 80 MP x 64 @ 1.37 GHz
  CPU (MKL): Intel Xeon E5-2650 v3 (Haswell), 2 x 10 cores @ 2.30 GHz
FEATURES AND SUPPORT
Linear system solvers
Eigenvalue problem solvers
Auxiliary BLAS
Batched LA
Sparse LA
CPU/GPU Interface
Multiple precision support
Non-GPU-resident factorizations
Multicore and multi-GPU support
MAGMA Analytics/DNN
LAPACK testing
Linux
Windows
Mac OS
HYBRID ALGORITHMS
MAGMA 2.3 for CUDA
MAGMA MIC 1.4 for Intel Xeon Phi
clMAGMA 1.4 for OpenCL
MAGMA (Matrix Algebra on GPU and Multicore Architectures) is a collection of next-generation linear algebra libraries for heterogeneous architectures. MAGMA is designed and implemented by the team that developed LAPACK and ScaLAPACK, incorporating the latest developments in hybrid synchronization- and communication-avoiding algorithms, as well as dynamic runtime systems. Interfaces for the current LAPACK and BLAS standards are supported so that computational scientists can seamlessly port any linear-algebra-reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/many-core CPUs and multi-GPUs to deliver the fastest possible time to an accurate solution within given energy constraints.

FIND OUT MORE AT http://icl.utk.edu/magma
IN COLLABORATION WITH / WITH SUPPORT FROM / SPONSORED BY
U.S. Department of Defense, National Science Foundation
The objective of the Innovative Computing Laboratory’s IPCC is the development and optimization of numerical
linear algebra libraries and technologies for applications, while tackling current challenges in
heterogeneous Intel® Xeon Phi™ coprocessor-based High Performance Computing.
Long-term collaboration and support on the development of clMAGMA,
the OpenCL™ port of MAGMA.
Long-term collaboration and support on the development of MAGMA.
Intel Parallel Computing Center
Sparse matrix-vector product (SpMV) in double precision arithmetic, on matrices from The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/
MAGMA 2.3 DRIVER ROUTINES

ABBREVIATIONS
  GE        General
  SPD/HPD   Symmetric/Hermitian Positive Definite
  TR        Triangular
  D&C       Divide & Conquer
  B&I It.   Bisection & Inverse Iteration
  MP        Mixed-precision Iterative Refinement

NAMING CONVENTION: magma_{routine name}[_gpu]

LINEAR EQUATIONS
  GE        Solve using LU                              {sdcz}gesv
  GE        Solve using MP                              {zc,ds}gesv
  SPD/HPD   Solve using Cholesky                        {sdcz}posv
  SPD/HPD   Solve using MP                              {zc,ds}posv
LEAST SQUARES
  GE        Solve LLS using QR                          {sdcz}gels
  GE        Solve using MP                              {zc,ds}geqrsv
STANDARD EVP
  GE        Compute e-values, optionally e-vectors      {sdcz}geev
  SY/HE     Compute all e-values, optionally e-vectors  {sd}syevd, {cz}heevd
  SY/HE     Range (D&C)                                 {cz}heevdx
  SY/HE     Range (B&I It.)                             {cz}heevx
  SY/HE     Range (MRRR)                                {cz}heevr
STANDARD SVP
  GE        Compute SVD, optionally s-vectors           {sdcz}gesvd, {sdcz}gesdd
GENERALIZED EVP
  SPD/HPD   Compute all e-values, optionally e-vectors  {sd}sygvd, {cz}hegvd
  SPD/HPD   Range (D&C)                                 {cz}hegvdx
  SPD/HPD   Range (B&I It.)                             {cz}hegvx
  SPD/HPD   Range (MRRR)                                {cz}hegvr

INTERFACES: CPU and GPU (per-routine availability is marked with ✓ in the original poster table)
SPARSE
BATCHED
ROUTINES: BiCG, BiCGSTAB, Block-Asynchronous Jacobi, CG, CGS, GMRES, IDR, Iterative refinement, LOBPCG, LSQR, QMR, TFQMR
PRECONDITIONERS: ILU / IC, Jacobi, ParILU, ParILUT, Block Jacobi, ISAI
KERNELS: SpMV, SpMM
DATA FORMATS: CSR, ELL, SELL-P, CSR5, HYB
[Figure: SpMV performance in GFLOP/s over the test matrices, on an NVIDIA Pascal P100 GPU: 56 MP x 64 @ 1.19 GHz]
• Deep learning
• Structural mechanics
• Astrophysics
• Sparse direct solvers
• High-order FEM simulations
BATCHED FACTORIZATION OF A SET OF SMALL MATRICES IN PARALLEL
ROUTINES
✓ LU, QR, and Cholesky
✓ Solvers and matrix inversion
✓ All BLAS 3 (fixed + variable)
✓ SYMV, GEMV (fixed + variable)
MAGMA Batched Framework & Abstractions
  Algorithmic Variants, Kernel Design, Inlining & Code Generation, Autotuning
  DEVICES: CPUs, GPUs, Coprocessors (KNC/KNL)
  APPLICATIONS / LIBRARIES
Numerous applications require factorization of many small matrices.

PERFORMANCE OF BATCHED LU in double precision arithmetic on 1 million matrices
[Figure: GFLOP/s vs. matrix size, on P100 and V100 GPUs]
Legend: CSR (cuSPARSE 8.0), HYB (cuSPARSE 8.0), SELL-P (MAGMA 2.3), CSR5 (MAGMA 2.3)