MAGMA uses a hybridization methodology where algorithms of interest are split into tasks of varying granularity and their execution scheduled over the available hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger more parallelizable ones, often Level 3 BLAS, are scheduled on the GPU.
PERFORMANCE & ENERGY EFFICIENCY
INDUSTRY COLLABORATION
MAGMA LU factorization in double precision arithmetic

[Figure: Gflop/s and Gflops/Watt vs. matrix size, comparing:]
  NVIDIA K40 GPU (Kepler): 15 MP x 192 @ 0.88 GHz
  NVIDIA P100 GPU (Pascal): 56 MP x 64 @ 1.19 GHz
  NVIDIA V100 GPU (Volta): 80 MP x 64 @ 1.37 GHz
  CPU (MKL): Intel Xeon E5-2650 v3 (Haswell), 2 x 10 cores @ 2.30 GHz
FEATURES AND SUPPORT
Linear system solvers
Eigenvalue problem solvers
Auxiliary BLAS
Batched LA
Sparse LA
CPU/GPU Interface
Multiple precision support
Non-GPU-resident factorizations
Multicore and multi-GPU support
MAGMA Analytics/DNN
LAPACK testing
Linux
Windows
Mac OS
HYBRID ALGORITHMS
MAGMA 2.3 for CUDA
MAGMA MIC 1.4 for Intel Xeon Phi
clMAGMA 1.4 for OpenCL
MAGMA (Matrix Algebra on GPU and Multicore Architectures) is a collection of next-generation linear algebra libraries for heterogeneous architectures. MAGMA is designed and implemented by the team that developed LAPACK and ScaLAPACK, incorporating the latest developments in hybrid synchronization- and communication-avoiding algorithms, as well as dynamic runtime systems. Interfaces for the current LAPACK and BLAS standards are supported so that computational scientists can seamlessly port any linear-algebra-reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/many-core CPUs and multi-GPUs to deliver the fastest possible time to an accurate solution within given energy constraints.

FIND OUT MORE AT http://icl.utk.edu/magma
IN COLLABORATION WITH / WITH SUPPORT FROM / SPONSORED BY
U.S. Department of Defense, National Science Foundation
The objective of the Innovative Computing Laboratory’s IPCC is the development and optimization of numerical
linear algebra libraries and technologies for applications, while tackling current challenges in
heterogeneous Intel® Xeon Phi™ coprocessor-based High Performance Computing.
Long-term collaboration and support on the development of clMAGMA,
the OpenCL™ port of MAGMA.
Long-term collaboration and support on the development of MAGMA.
Intel Parallel Computing Center
Sparse matrix-vector product (SpMV) in double precision arithmetic, on matrices from The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/
MAGMA 2.3 DRIVER ROUTINES

ABBREVIATIONS
  GE        General
  SPD/HPD   Symmetric/Hermitian Positive Definite
  TR        Triangular
  D&C       Divide & Conquer
  B&I It.   Bisection & Inverse Iteration
  MP        Mixed-precision Iterative Refinement

NAMING CONVENTION: magma_{routine name}[_gpu]

LINEAR EQUATIONS
  GE        Solve using LU                              {sdcz}gesv
  GE        Solve using MP                              {zc,ds}gesv
  SPD/HPD   Solve using Cholesky                        {sdcz}posv
  SPD/HPD   Solve using MP                              {zc,ds}posv
LEAST SQUARES
  GE        Solve LLS using QR                          {sdcz}gels
  GE        Solve using MP                              {zc,ds}geqrsv
STANDARD EVP
  GE        Compute e-values, optionally e-vectors      {sdcz}geev
  SY/HE     Compute all e-values, optionally e-vectors  {sd}syevd, {cz}heevd
  SY/HE     Range (D&C)                                 {cz}heevdx
  SY/HE     Range (B&I It.)                             {cz}heevx
  SY/HE     Range (MRRR)                                {cz}heevr
STANDARD SVP
  GE        Compute SVD, optionally s-vectors           {sdcz}gesvd, {sdcz}gesdd
GENERALIZED EVP
  SPD/HPD   Compute all e-values, optionally e-vectors  {sd}sygvd, {cz}hegvd
  SPD/HPD   Range (D&C)                                 {cz}hegvdx
  SPD/HPD   Range (B&I It.)                             {cz}hegvx
  SPD/HPD   Range (MRRR)                                {cz}hegvr

INTERFACES: CPU and GPU (per-routine availability is marked with ✓ in the original poster table)
SPARSE
BATCHED
ROUTINES: BiCG, BiCGSTAB, Block-Asynchronous Jacobi, CG, CGS, GMRES, IDR, Iterative refinement, LOBPCG, LSQR, QMR, TFQMR
PRECONDITIONERS: ILU / IC, Jacobi, ParILU, ParILUT, Block Jacobi, ISAI
KERNELS: SpMV, SpMM
DATA FORMATS: CSR, ELL, SELL-P, CSR5, HYB
[Figure: SpMV performance in GFLOP/s over the test matrices, on an NVIDIA Pascal P100 GPU: 56 MP x 64 @ 1.19 GHz]
• Deep learning
• Structural mechanics
• Astrophysics
• Sparse direct solvers
• High-order FEM simulations
BATCHED FACTORIZATION OF A SET OF SMALL MATRICES IN PARALLEL
ROUTINES
✓ LU, QR, and Cholesky
✓ Solvers and matrix inversion
✓ All BLAS 3 (fixed + variable)
✓ SYMV, GEMV (fixed + variable)
MAGMA Batched Framework & Abstractions
  Algorithmic Variants, Kernel Design, Inlining & Code Generation, Autotuning
  DEVICES: CPUs, GPUs, Coprocessors (KNC/KNL)
  APPLICATIONS / LIBRARIES
Numerous applications require factorization of many small matrices.

PERFORMANCE OF BATCHED LU in double precision arithmetic on 1 million matrices
[Figure: GFLOP/s vs. matrix size, on P100 and V100 GPUs]
Legend: CSR (cuSPARSE 8.0), HYB (cuSPARSE 8.0), SELL-P (MAGMA 2.3), CSR5 (MAGMA 2.3)