latoolbox: a multi-platform sparse linear algebra toolbox

Karlsruhe Institute of Technology

KIT – University of the State of Baden-Wuerttemberg and

National Research Center of the Helmholtz Association www.kit.edu

LAtoolbox:A Multi-platform Sparse Linear Algebra ToolboxVincent Heuveline, Dimitar Lukarski, Jan-Philipp WeissNVIDIA GPU Technology Conference • San Jose, CA • May 17, 2012 • Session S0291

Engineering Mathematics and Computing Lab (EMCL)

http://www.gputechconf.com/page/home.html


Motivation HiFlow3 LAtoolbox Performance Test Summary

The Shift towards Multicore/ManycoreParadigm shift

The paradigm shift towards manycore is well on the wayBut still many problems have to be overcome

Scalable algorithms and implementations, ubiquitous parallelism,programmers’ productivity, ...

Many processors concepts have evolved (and some have gone)x86, SPARC (2-12 cores)NVIDIA / AMD GPUs (240 - 1536 cores)Accelerated Processing Units(2-4 cores + 400 cores, hybrid CPU/GPU)NVIDIA Denver (ARM-CPUs + GPUs on a chip)Cell, ClearSpeed (8+1, 92 cores)Intel SCC, MIC (48, 32/80 cores)Tile64, Epiphany (64, 64 - 4096 cores)

Manycore trend: much higher core count, less complex cores2/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT



Manycore – Revolution or Evolution?

Manycore DevicesHigh core count and incredible computing performanceFine-grained parallelism, stream processors + wide vector unitsFast on-chip bandwidth, caches + local memory

Impact on the SoftwareWithout changing the software there will be no more speedupsFor performance benefits the mathematical models, algorithms,numerical schemes and programs need to be adapted,re-engineered and further developedNew ways of parallel thinking required

3/22 May 17, 2012 Heuveline, Lukarski, Weiss - A Multi-Platform Sparse Linear Algebra Toolbox EMCL / KIT



The Programming Challenge

Zoo of languages and programming approachesOpenMP, CUDA, OpenCL, OpenACC, ArBB, TBB, pthreads,UPC/PGAS, Cilk Plus, and many many more

Who can handle them all? Which is the best approach?We need some further abstractions!

Three main questionsPortability: How portable is my program?(in terms of platforms and performance)Flexibility: Can I easily modify the software?(new modules, algorithms, platforms, functionality, ...)Scalability: How scalable is my algorithm/programming model?(ready for future platforms? ready for exascale?)




The Algorithmic ChallengeWe have to rethink our methods and approaches!

Typical way of parallel thinking (?)Cluster-like problem decomposition with many fat nodes

Substructure problem into huge blocks (of same size)Process blocks in parallelMinimize communication between blocksBut typically: sequential processing on the block-level

New way of parallel thinkingDecompose problem into blocks (of varying size)

There should be only a lower bound on the block size

Process blocks in sequential order (or in parallel)But: use scalable parallelism on the block-level !




Two Solution ApproachesWe would like to show you two things:

I: How to build scalable and efficient algorithmsHow to re-arrange algorithms for scalable parallelismHow to make them ready for GPUs and multicore CPUsHow to keep mathematical quality and efficiency

II: How to build portable software (this talk!)Modular building of solver schemes

Single code base for GPUs, CPUs and other acceleratorsSeamless integration of accelerators by unified interfacesChoice of the platform can be taken at run time

Highly optimized implementationsTransparent to the programmer/user




Two Solution ApproachesWe would like to show you two things:

I: How to build scalable and efficient algorithms

We have demonstrated new algorithms for parallel preconditionerson GPUs in Session S0289, Wednesday, May 16, 9:00am.Please check these slides for further details.

II: How to build portable software (this talk!)Modular building of solver schemes

Single code base for GPUs, CPUs and other acceleratorsSeamless integration of accelerators by unified interfacesChoice of the platform can be taken at run time

Highly optimized implementationsTransparent to the programmer/user




HiFlow3 - Parallel Finite Element SoftwareApplication Fields: Numerical simulation in computationalfluid dynamics, meteorology, medical engineering and energyresearchPerformance: Scalability proven on huge clusters andGPU-accelerated systemsFlexibility: Generic, modular and extensible software formulti-purpose simulation based on object-orientation in C++




HiFlow3 - Modules

Mesh: Distributed and flexiblemesh handlingFEM: Finite element spacesDoF: Management of degrees offreedomIntegration routines: Numericalintegration and assemblyroutinesLinear algebra/solvers:Multi-platform linear algebra andhardware-aware computing




Features of the LAtoolboxCommunication and computation layer of HiFlow3

Multi-platform parallelization(GPUs, multicore-CPUs, accelerators, ...)Unified interfaces across platforms for all linear algebraoperations and solversAbstract data structuresPlatform-adapted library implementations chosen at run timeNo hardware knowledge required for building applicationsMinimal communication costs for intra-node data exchange

Solvers and preconditionersIterative linear solvers: CG, (F)GMRES, geometric multigridNewton-like solvers for nonlinear problemsParallel preconditioners for GPUs and CPUs




LAtoolbox - A Two Level Library

Parallelism is introduced on two levels:Coarse-grained parallelism by means of distributed grids anddistributed data structuresFine-grained parallelism by means of platform-optimized linearalgebra back-ends




Data Distribution and Communicationfor Sparse Matrix-Vector Multiplication (SpMV)

P0

P1

P2

P3

Figure: Domain partitioning: Degrees of Freedom (DoF) of process P0 aremarked in green (interior DoF in diagonal block); the remaining DoFrepresent inter-process couplings for process P0 (ghost DoF inoff-diagonal block)




Data Distribution and Communicationfor Sparse Matrix-Vector Multiplication (cont.) P0

︸︷︷︸

diagonal block

•••••

︸︷︷︸

interior

+

P1 P2 P3

︸︷︷︸

offdiagonal block

••••

︸︷︷︸

ghost

Sparse Matrix-Vector Multiplication:start asynchronous communicationexchange ghost valuesyint = Adiag xint

synchronize communicationyint = yint + Aoffdiag xghost




Advanced Features

Advanced Features of the LAtoolboxNo irregular memory access over the PCIe bus by blockingtechniquesRaw access to the data is possible with explicit declaration ofdevice-specific classes (e.g. for nested loops, irregular memoryaccess, etc.)Possible mixture of different types of platforms in the cluster(e.g. nodes with and without GPUs)




lmpLAtoolbox - Sample Source Code 1/2

lVector<double> *x, *y ;

// initialize a vector on a specific platformx = init_vector<double>(size,"vec x", platf, impl);

// clone y as xy = x->CloneWithoutContent();

// Usage of BLAS 1y->CopyFrom(*x); // y = xx->Scale(5.0); // x = x * 5.0y->Axpy(*x, 3.3 ); // y = y + 3.3*x




lmpLAtoolbox - Sample Source Code 2/2lMatrix<double> *mat, *mat_c;

// init empty matrix on a specific platformmat = init_matrix<double>(0,0,0,"A",platf,impl,CSR);

// init CPU matrixmat_c = init_matrix<double>(0,0,0,"A",CPU,OpenMP,CSR);mat_c->ReadFile("matrix.mt");

// Copy the sparse structuremat->CopyStructureFrom(*mat_c);// Copy only the values of the matrixmat->CopyFrom(*mat_c);

mat->VectorMult(*y, x); // x = mat*y




Example: Geometric Multigrid SolverModel problem: −∆u = f in Ω := (0, 1)2 \ (0, 0.5)2

Solve Poisson problem on the L-shaped domainZero Dirichlet boundary conditions, i. e. u = 0 on ∂Ω

Locally refined and statically adapted meshMixed finite elements: Q1 elements on quadrilaterals and P1elements on triangles (avoiding hanging nodes)Problem size: 3,211,425 DoF

Matrix-based geometric multigrid solverStart with zero initial guessRelative residual of 10−6 taken as stopping criterionVarious parallel smoothers: Multi-colored Gauss-Seidel,power(q)-pattern enhanced ILU(p), FSAI




Solution in the L-shaped domain

Figure: Locally refined mesh for the L-shaped domain (left) and discretesolution of the Poisson problem for f = 8π2 cos(2πx) cos(2πy) (right).




Performance Data for the Multigrid SolverRun time comparison for various smoothers and platforms

0

5

10

15

20

25

30

35

40

CPUseq CPUOpenMP GPU

Tim

e [s

ec]

Poisson problem - multi-grid V-cycle

r-JacobiGS

SGSILU(0,1)ILU(1,1)ILU(1,2)

FSAI1FSAI2FSAI3

Platform: 2x Xeon quadcore CPU + Tesla S1070 4G memoryBest multigrid solver with ILU(1,2) takes 2.75sec on the GPUSignificantly faster than the best preconditioned CG solver(with ILU(1,2)) on the GPU with 158sec




Example: CG Solver for a 3D problem3D-Neumann-Laplace problem

−∆u = f, in Ω = [0, 1]3

∂u∂n

= 0, on ∂Ω

Configuration of the conjugate gradient solver:Q1 finite elementswith 2.1 M degrees of freedomCG solver with absolute tolerance 10−14

510 iterations neededDouble precision computations

Solution on a cluster with 16 GPUs:8 nodes with two NVIDIATesla M1060 GPU each, 4G memory20 Gbit/s Infiniband (small DDR switch)




Strong Scalability Resultsfor the 3D Neumann problem on a 16-GPU-cluster

1

2

4

8

1 2 4 8

Spe

edup

number of nodes

Strong Scalability Test on a GPU Cluster

IdealCPU

1xGPU/node2xGPU/node

Close to linear speedup for all GPU and CPU configurationsSame source code for all setups




SummaryWe have presented the easy-to-use Linear Algebra Toolbox

Modular building of applicationsSame source code for all platformsNo hardware knowledge requiredCan be compiled on all platformsChoice of the platform can be taken at run time

Two-level parallelism with minimal communication costsDistributed data across nodesLinear algebra backends for GPUs, CPUs, and others

We have demonstratedScalability – very good strong scalability for GPUs and CPUsPortability – same source code for CPUs and GPUsFlexibility – easy modifications of solvers and algorithmsExtensibility – add new backends for new platforms




Contact and AcknowledgementsCheck our open-source software

The LAtoolbox is part of the HiFlow3 open source parallel finiteelement package with a rich suite of solvers.See http://www.hiflow3.org

Further [email protected]@kit.eduhttp://www.emcl.kit.eduhttp://www.hiflow3.org

The Engineering Mathematics and Computing Lab(EMCL) at KIT is an NVIDIA CUDA Research Centerwith kind support by NVIDIA.


latoolbox: a multi-platform sparse linear algebra toolbox

Documents