A foundational communication layer and a linear algebraic programming methodology
A. N. Yzelman
Computing Technology Lab
Huawei Technologies Zurich
30th of October, 2019
Contents
Lightweight Parallel Foundations
  Communication
  Performance model
  Execution, interoperability, and more

GraphBLAS
  How to achieve high performance: past lessons learned
  Lessons learned applied to GraphBLAS
Most work done while at the Paris R&D centre.
Huawei
Huawei is a large company with expanding business areas.
I around 180 000 employees worldwide
I around 70 000 engage in R&D
I 23 research centres in Europe alone
I 100+ billion USD yearly revenue
I 25–30 percent year-over-year revenue increases
Computing Technologies Lab in Zurich (est. late 2019):
I long-term fundamental research
I hardware, architecture, software, tools, theory, and models:
  I microarchitectures, systems, interconnects, storage, memory, ...
  I compilers, HW/SW co-design, HW design tools, SW tools, ...
  I model-driven design, lower bounds, programming models, ...
I Unique (in Huawei)
I novel lab to foster open research.
Lightweight Parallel Foundations
LPF is a communications layer akin to BSPlib.
What is LPF not about:
I politics
I making parallel programming easier
I replacing BSPlib, MPI, MapReduce, Spark, . . .
What it is about:
I ‘close to the metal’ performance
I guaranteed performance
I enhanced robustness
I formal semantics
I interoperability
Joint work with Wijnand Suijlen (Huawei Technologies France).
I introductory paper: https://arxiv.org/abs/1906.03196
Application Programming Interface
Communication:
I lpf_get, lpf_put: one-sided, non-buffered, non-blocking
I lpf_sync: block until local comms (in & out) completed
I lpf_msg_attr_t, lpf_sync_attr_t: allows for extensions
Akin to bsp_hpget, bsp_hpput, and bsp_sync.
No other comms primitives!
As in BSPlib, memory registration is required:
I lpf_memslot_t: LPF registers slots instead of addresses
I lpf_register_local, lpf_register_global: local vs. remote reference
I lpf_deregister: takes both local & global slots
Changes are valid immediately.
Overlapping writes
Suppose p > 1 processes all execute this BSPlib code:
I bsp_push_reg( &l_err, sizeof(l_err) );
I bsp_push_reg( &g_err, sizeof(g_err) );
I bsp_sync();
I ...
I bsp_hpput( 0, &l_err, 0, &g_err, 0, sizeof(l_err) );
I bsp_sync();
Then g_err at process 0 becomes undefined.
LPF has clear semantics for such cases:
I for each memory region subject to overlapping writes
I consider all incoming communication requests M
I LPF must guarantee the result is a serialisation of M
I Overlapping reads and writes remain undefined(!)
Communication example
Suppose p > 1 processes, k of which have a local error l_err.
Each process executes:
I lpf_err_t l_err, g_err; lpf_memslot_t s_lerr, s_gerr;
I lpf_register_local( ctx, &l_err, sizeof(l_err), &s_lerr );
I lpf_register_global( ctx, &g_err, sizeof(g_err), &s_gerr );
I ...
I if l_err != LPF_SUCCESS then
    lpf_put( ctx, s_lerr, 0, 0, s_gerr, offset, sizeof(l_err), LPF_MSG_DEFAULT );
I lpf_sync( ctx, LPF_SYNC_DEFAULT );
By conflict resolution, g_err at PID 0 will be a valid error code.
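To make this concrete, below is a hedged sketch of the same error broadcast as a complete SPMD function. It only uses the primitives and the put signature quoted on this slide; the header name is an assumption, and it presumes the context's slot and message capacities were already reserved (buffer control is covered later in this deck).

    #include <lpf/core.h> /* assumed header name for the LPF core API */
    #include <stdio.h>

    /* every process reports a local error flag; erroneous processes overwrite
     * g_err at PID 0, and overlapping writes are serialised, so PID 0 ends up
     * with one of the written (valid) error codes */
    void report_errors( lpf_t ctx, lpf_pid_t s, lpf_pid_t p, lpf_args_t args )
    {
        (void) p; (void) args;
        lpf_err_t l_err = LPF_SUCCESS; /* ... outcome of some local computation ... */
        lpf_err_t g_err = LPF_SUCCESS;
        lpf_memslot_t s_lerr, s_gerr;

        /* assumes enough slot and message capacity was reserved beforehand */
        lpf_register_local(  ctx, &l_err, sizeof(l_err), &s_lerr );
        lpf_register_global( ctx, &g_err, sizeof(g_err), &s_gerr );

        if( l_err != LPF_SUCCESS ) {
            /* write my error code into g_err at PID 0, at offset 0 */
            lpf_put( ctx, s_lerr, 0, 0, s_gerr, 0, sizeof(l_err), LPF_MSG_DEFAULT );
        }
        lpf_sync( ctx, LPF_SYNC_DEFAULT );

        if( s == 0 && g_err != LPF_SUCCESS ) {
            (void) printf( "some process reported an error\n" );
        }
    }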
Performance model
Communication is not free.
Bulk Synchronous Parallel (BSP, Valiant 1990):
I The current superstep i is ended by an lpf_sync
I Let t_i^s be the number of bytes transmitted by PID s at step i
I Let r_i^s be the number of bytes received by PID s at step i
I Each call to lpf_put at process s to k increases t_i^s and r_i^k
I Each call to lpf_get at process s from k increases r_i^s and t_i^k
I The subsequent lpf_sync takes at most h_i·g + l time,
  I with h_i = max_s max(t_i^s, r_i^s) the superstep's h-relation, and
  I with g, l machine-specific parameters.
I Majority of effort goes into ensuring compliance to model.
Some questions:
I How does an algorithm know the value for g and l?
I Can we guarantee anything about the other LPF primitives?
Immortal algorithms: reduction
Every process has a number α to be reduced at process 0:
I direct all-to-one: (p − 1)(g + 1) + l flops.
I binary tree: ⌈log_2 p⌉(g + l + 1) flops.
I optimal: min_{b ∈ {2,3,...,p}} ⌈log_b p⌉((b − 1)(g + 1) + l) flops.
I.e., optimal algorithm depends on p, g, and l. Examples:
I p = 100, g = 250, l = 10 000, cost in flops:
  I 71 757 for b = 2 (binary tree),
  I 24 518 for b = 10,
  I 34 849 for b = 100 (direct).
I p = 100, g = 25, l = 10 000, direct is optimal at 12 574 flops.
I p = 10 000, g = 250, l = 100 000, b = 100 is optimal.
Rationale: an immortal algorithm requires run-time introspection of the performance model's parameters.
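The choice of b can be made at run time. A minimal sketch (plain C, not part of the LPF API; optimal_fanout is a name invented for this illustration) that evaluates the cost formula above for all fan-outs:

    #include <stddef.h>

    /* given p processes and the machine parameters g and l (e.g. as obtained
     * via lpf_probe), return the fan-out b that minimises
     *   ceil(log_b p) * ((b - 1) * (g + 1) + l)                              */
    static size_t optimal_fanout( size_t p, double g, double l )
    {
        size_t best_b = 2;
        double best_cost = -1.0;
        for( size_t b = 2; b <= p; ++b ) {
            /* depth = ceil( log_b p ), computed without floating point */
            size_t depth = 0;
            for( size_t reach = 1; reach < p; reach *= b ) { ++depth; }
            const double cost = (double)depth * ( (double)(b - 1) * (g + 1.0) + l );
            if( best_cost < 0.0 || cost < best_cost ) { best_cost = cost; best_b = b; }
        }
        return best_b;
    }

    /* e.g. optimal_fanout( 100, 250.0, 10000.0 ) == 10, at 24 518 flops,
     * matching the first example above                                       */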
Machine and performance introspection
lpf_probe returns the number of parallel processes the machine supports, how many are currently free, as well as two functions:
I double g( bsp_pid_t p, size_t wordsize, bsp_sync_attr_t );
I double l( bsp_pid_t p, size_t wordsize, bsp_sync_attr_t );
Other performance guarantees
Time spent in communication (lpf_sync) defined by:
I algorithm’s puts/gets (source, destination, and size);
I machine parameters captured by g , l ;
I BSP’s performance model.
Time spent in computation:
I user code;
I calls to LPF primitives(!).
Some performance guarantees:
I lpf_register_local, lpf_register_global: O(size)
I lpf_get, lpf_put, lpf_deregister: Θ(1)
I lpf_probe: Ω(1)
Rationale: asymptotic behaviour must be defined or process-local complexity analysis becomes impossible.
Memory guarantees & buffer control
Memory may be constrained:
I small manycore device, or
I core counts that grow faster than available memory.
Even taking Θ(p) memory per process may be too much.
LPF’s internal buffers are explicitly controlled at run-time:
I lpf_resize_memory_register( lpf_t context, size_t max_regs )
I lpf_resize_message_queue( lpf_t context, size_t max_msgs )
Compare these to calling ahead to make a reservation.
Initial values (fresh contexts): 0 memslots, 0 messages(!)
I first calls in an LPF program are likely these two primitives.
LPF doesn’t forgive:
I registering without reservation: undefined behaviour (UB).
I sending (or receiving!) a message without reservation: UB.
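A minimal sketch of what this looks like in practice; the header name and the assumption that new capacities take effect at the next lpf_sync are marked in the comments:

    #include <lpf/core.h> /* assumed header name */

    void spmd( lpf_t ctx, lpf_pid_t s, lpf_pid_t p, lpf_args_t args )
    {
        (void) s; (void) args;
        /* reserve capacity before any registration or communication;
         * skipping this is undefined behaviour */
        lpf_resize_memory_register( ctx, 2 );   /* up to two memory slots         */
        lpf_resize_message_queue( ctx, p );     /* up to p messages per superstep */
        lpf_sync( ctx, LPF_SYNC_DEFAULT );      /* assumed: new capacities activate here */
        /* ... registrations, puts/gets, and further supersteps follow ... */
    }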
Some performance results
We have a POSIX Threads LPF implementation:

[Plot: effective GFLOP/s (5n log(n)/T) versus vector length log(n), comparing FFTW, MKL, and LPF - Pthread.]

HPBSP FFT compared to FFTW3 and Intel MKL
I 8-socket shared-memory Intel Xeon E7-8890 v2
I Local memory: 2.8 Gbyte/s/core.
I immortal FFT algorithm: Valiant (’90), Inda & Bisseling (’01).
Some performance results
Ibverbs LPF implementation + PThreads LPF = Hybrid LPF:

[Plot: effective GFLOP/s (5n log(n)/T) versus vector length log(n), comparing FFTW, MKL, and LPF - Hybrid - RB.]

HPBSP FFT compared to FFTW3 and Intel MKL
I 8 nodes times 2 sockets Intel Xeon E5-2650
I Local memory: 4.3 Gbyte/s/core, FDR IB: 0.4 Gbyte/s/core.
I immortal FFT algorithm: Valiant (’90), Inda & Bisseling (’01).
Execution types
An LPF SPMD program has the following signature:
I void (*f)( lpf_t context, lpf_pid_t s, lpf_pid_t P, lpf_args_t args );
An LPF SPMD program is started through one of:
I lpf_exec( lpf_t context, lpf_pid_t P, lpf_spmd_t program, lpf_args_t args );
I lpf_hook( lpf_init_t init, lpf_spmd_t program, lpf_args_t args );
I lpf_rehook( lpf_t ctx, lpf_spmd_t program, lpf_args_t args );
These primitives block until the LPF program completes.
lpf_args_t consists of six fields:
I const void * input, size_t input_size,
I void * output, size_t output_size,
I const lpf_func_t f_symbols, size_t f_size.
Hello World
void hello_world( lpf_t ctx, lpf_pid_t s, lpf_pid_t P, lpf_args_t args ) {
    (void) ctx; (void) args;
    (void) printf( "Hello world from PID %d / %d\n", s, P );
}

int main( int argc, char ** argv ) {
    const lpf_err_t err = lpf_exec(
        LPF_ROOT, LPF_MAX_P,
        &hello_world, LPF_NO_ARGS
    );
    return err == LPF_SUCCESS ? 0 : 255;
}
Encapsulation
Libraries that implement algorithms using LPF should
I define an lpf_spmd_t entry function, and
I perform I/O using lpf_args_t.
The library should be aware of exec vs. hook:
I in exec, should distribute input or read input in parallel
I in (re)hook, transfer input 1:1 between host and slave
Example:
I double * x, * y; size_t n;
I ... // initialise x, y, n and retrieve or compute x
I lpf_args_t fft_args; fft_args.input = x; fft_args.output = y;
I fft_args.input_size = fft_args.output_size = 2 * n * sizeof(double);
I lpf_rehook( context, &lpf_fft, fft_args );
I ... // use the FFT of x, now stored in y
Interoperability
Over the past decade, a proliferation of parallel programming frameworks:
I Big Data: MapReduce, Spark, Giraph, Flink, Ligra, PowerGraph, Storm, Hive, Presto, Samza, Heron, Kudu, . . .
I HPC: MPI, OpenMP, BSPlib, PThreads, Cilk, TBB, Legion, OCR, Chapel, Charm++, PaRSec, OmpSs, StarPU, HPX, . . .
LPF does not aim to make all of these obsolete.
Big Data's success is due to programmer efficiency,
I not due to performance efficiency.
LPF instead enables interoperability: immortal algorithms should
I transparently integrate with any parallel framework, thus
I be enabled for use as widely as possible.
Rationale: users should use tools best suited for the job at hand.
Interoperability
The lpf_hook is like lpf_rehook,
I but from arbitrary parallel contexts, not only LPF contexts;
I requires a valid lpf_init_t instance;
I creating an lpf_init_t is implementation-defined.
In our LPF implementation, one can be retrieved using TCP/IP:
I char * hostname = NULL, * portname = NULL;
I lpf_pid_t process_id = 0, nprocs = 0;
I lpf_args_t args = LPF_NO_ARGS;
I ... // user code decides the above variables
I lpf_init_t init = LPF_INIT_NONE;
I lpf_mpi_initialize_over_tcp( hostname, portname, 30000, process_id, nprocs, &init );
I lpf_hook( init, &spmd, args );
I ... // user code continues making use of args.output
Interoperability
A bridge between Big Data and HPC (Y., PMAA ’16):
I Spark I/O via native RDDs and native Scala interfaces;
I Rely on serialisation and the JNI, switch to C;
I Intercept Spark's exec. model via lpf_hook, switch to SPMD;
I Set up and enable inter-process RDMA communications.
Test matrix: cage15, n = 5 154 859, nz = 99 199 551.
Robustness
LPF allows for the following return codes (lpf_err_t):
I LPF_SUCCESS
I LPF_OUT_OF_MEMORY
I LPF_ERR_FATAL
All errors except LPF_ERR_FATAL:
I must have no side-effects
I must be mitigable
On encountering LPF_ERR_FATAL:
I any subsequent call to any LPF primitive results in UB;
I user code can only clean up and return.
All errors are local;
I no globally consistent error codes, even for lpf_sync(!)
Formal semantics
LPF’s 12 primitives have their semantics formally defined.
Enables use of formal methods in parallel software. E.g.,
I verification of (correct) use of LPF primitives
I automatic cost analysis
Combined with sequential verification tools:
I verification of user programs
I verified parallel code generation
State of an LPF machine is a vector of sequential states;
I only the lpf_sync induces global change,
I calling any other primitive incurs local changes only.
BSPlib: Tesson & Loulergue, 2007; Gava & Fortin, 2008; Jakobsson et al., 2017; Jakobsson, 2018. For LPF:
I modified to work with unbuffered DRMA
Extensibility
Functional and performance semantics extended through attributes:
I lpf_msg_attr_t: an attribute attached to a put or get
I lpf_sync_attr_t: an attribute attached to a superstep
Example of a change in functional semantics:
I Message attribute changing a put into an accumulate
Example of a change in performance semantics:
I optimisations for pre-defined communication patterns
I zero-cost synchronisation (Alpert & Philbin, ’97)
Example of a change in both:
I message attribute introducing staleness (Xing et al., ’15)
Summary
LPF is a communication layer for immortal algorithms:
1. allows implementing portable immortal algorithms
2. allows interoperable use of immortal algorithms
LPF is robust:
I formal specification of primitives
I error codes and mitigation
LPF supports easier-to-use, higher-level libraries:
I Encapsulation (lpf_rehook) aids library design (FFT, e.g.)
I Bulk-Synchronous Message Passing (BSMP, send/move)
I Collectives library (Suijlen, 2019)
I BSPlib interface
LPF is extensible.
GraphBLAS
Motivation
What is GraphBLAS not about:
I politics
I replacing BSPlib, MPI, MapReduce, Spark, . . .
What it is about:
I making parallel programming easier
I close-to-the-metal performance
I guaranteed performance
I interoperability
Community:
http://www.graphblas.org
Note that the standard API is C11, ours is C++11.
I joint work with Jonathan M. Nash & Daniel Di Nardo
History
APIs & Software:
I Basic Linear Algebra Subprograms, Lawson et al., 1979
I Sparse BLAS, Remington & Pozo, 1996
I Combinatorial BLAS (CombBLAS), Buluc & Gilbert, 2011
I GraphBLAS, GraphBLAS.org, 2015 onwards
I GraphBLAS Template Library (GBTL), C++, CPU + GPU; McMillan et al., CMU and others, 2015 onwards
I This work, C++11, CPU + cluster + mobile; Y. et al., Huawei Paris, 2016 onwards
I SuiteSparse::GraphBLAS, C11, CPU; Davis, TAMU, 2018 onwards
I GraphBLAST, C11 (+Gunrock), GPU; Yang et al., U. Illinois and others, 2019 onwards.
History
Conceptually, the idea of using math concepts in programming, orgeneralised linear algebraic concepts, seems somewhat recurrent:
I The Design and Analysis of Computer Algorithms,Aho, Hopcroft, Ullman (1974)
I Introduction to Algorithms (first edition?),Cormen, Leiserson, Rivest (1990)
I Elements of Programming,Alexander Stepanov & Paul McJones (2009)
I Graph Algorithms in the Language of Linear Algebra,Jeremy Kepner & John Gilbert (2011)
I From Mathematics to Generic Programming,Alexander Stepanov & Daniel Rose (2015)
Core concepts
I scalars, α ∈ D: standard C++ (POD) types
I vectors, x ∈ Dn, sparse and dense: grb::Vector< D > x;
I matrices, A ∈ Dm×n, sparse: grb::Matrix< D > A;
I In the above templates, D may be void for pattern-only data.
I unary operators, f : D1 → D2
I binary operators, g : D1 × D2 → D3
I monoid, < D1, D2, D3, ⊕, 0 >
  I ⊕ : D1 × D2 → D3
  I ∀a ∈ D1, b ∈ D2 : ⊕(a, 0) = a, ⊕(0, b) = b
  I ⊕ must be associative
I semiring, < D1, D2, D3, D4, ⊕, ⊗, 0, 1 >
  I ⊗ : D1 × D2 → D3, forms a monoid with 1
  I ⊕ : D3 × D4 → D4, forms a commutative monoid with 0
  I Left-distributive: ⊗(m, ⊕(a, b)) = ⊕(⊗(m, a), ⊗(m, b))
  I Right-distributive: ⊗(⊕(a, b), m) = ⊕(⊗(a, m), ⊗(b, m))
  I ∀a ∈ D1, b ∈ D2 : ⊗(a, 0) = 0, ⊗(0, b) = 0
GraphBLAS
Graph algorithms in the language of linear algebra:
I a graph (V, E) is a sparse matrix A ∈ D_E^{|V|×|V|}
I an edge is a nonzero (i, j) coordinate in A
I an edge weight is a nonzero a_ij ∈ A
I a vertex is a coordinate i, 0 ≤ i < |V|
I a vertex weight is a vector nonzero x_i, x ∈ D_V^{|V|}
Graph algos as (sparse) linear algebra, parametrised in semirings:
I different from C++ overloaded '+', '*' operators
I same sparse container may be subject to different semirings, reinterpreting computation for statically typed A and x:
  I grb::mxv( y, A, x, ring1 );
  I grb::mxv( y, A, x, ring2 );
Kepner & Gilbert: GA in the Language of LA, DOI: 10.1137/1.9780898719918 (SIAM, 2011)
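A minimal, self-contained sketch of this idea in plain C++11 (deliberately not the grb:: API quoted above; all names below are the sketch's own): the same compressed container, multiplied under two different semirings.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // a semiring here is just an "add", a "multiply", and the additive identity
    template< typename D >
    struct MySemiring {
        D (*add)( D, D );
        D (*mul)( D, D );
        D zero;
    };

    // CRS storage: row pointers, column indices, values
    struct MyCRS {
        std::vector< std::size_t > row_start, col;
        std::vector< double > val;
    };

    // y = A x over an arbitrary semiring: the container does not change,
    // only the interpretation of "+" and "*" does
    std::vector< double > my_mxv( const MyCRS &A, const std::vector< double > &x,
                                  const MySemiring< double > &ring )
    {
        std::vector< double > y( A.row_start.size() - 1, ring.zero );
        for( std::size_t i = 0; i + 1 < A.row_start.size(); ++i )
            for( std::size_t k = A.row_start[ i ]; k < A.row_start[ i + 1 ]; ++k )
                y[ i ] = ring.add( y[ i ], ring.mul( A.val[ k ], x[ A.col[ k ] ] ) );
        return y;
    }

    // ring1: the usual (+, *) semiring -- a standard SpMV
    const MySemiring< double > ring1 = {
        []( double a, double b ) { return a + b; },
        []( double a, double b ) { return a * b; },
        0.0 };
    // ring2: the (min, +) semiring -- one relaxation step of shortest paths
    const MySemiring< double > ring2 = {
        []( double a, double b ) { return a < b ? a : b; },
        []( double a, double b ) { return a + b; },
        std::numeric_limits< double >::infinity() };

Calling my_mxv( A, x, ring1 ) performs the usual multiplication, while my_mxv( A, x, ring2 ) reinterprets the same A as distances, mirroring the two grb::mxv calls above.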
Examples
(1) PageRank
I canonical example of a graph LA algo
I 'regular' LA (R, +, ∗, 0, 1): dot products, sparse matrix–dense vector multiplication, element-wise vector ops, ...
I With L the link matrix and row_sum its row sums, the inner loop only, in pseudo-code with the standard and GraphBLAS C++ primitives used:

for iter = 0 to max_iter while r > tol do
    δ = (pr, row_sum)                        grb::dot
    pr_next = pr .* row_sum                  grb::apply
    δ = 1/n (α ∗ δ + 1 − α)                  standard C++ scalar ops
    pr_next = pr_next · L                    grb::vxm
    pr_next = pr_next .+ δ                   grb::foldl
    r = ||pr − pr_next||_1                   grb::norm
    pr = pr_next                             std::swap
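As an illustration only, the following plain C++11 sketch transliterates the pseudo-code above step by step, with dense std::vectors and a CRS link matrix; the roles of row_sum, α, and tol are taken as on the slide, and all names belong to this sketch rather than to any GraphBLAS API.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct LinkCRS { std::vector< std::size_t > row_start, col; std::vector< double > val; };

    // one PageRank solve following the slide's inner loop; pr and row_sum are
    // dense vectors of length n, L is the sparse link matrix
    void pagerank( const LinkCRS &L, const std::vector< double > &row_sum,
                   std::vector< double > &pr, double alpha, double tol,
                   std::size_t max_iter )
    {
        const std::size_t n = pr.size();
        double r = tol + 1.0;
        for( std::size_t iter = 0; iter < max_iter && r > tol; ++iter ) {
            double delta = 0.0;                                       // grb::dot
            for( std::size_t i = 0; i < n; ++i ) delta += pr[ i ] * row_sum[ i ];
            std::vector< double > pr_next( n );                       // grb::apply
            for( std::size_t i = 0; i < n; ++i ) pr_next[ i ] = pr[ i ] * row_sum[ i ];
            delta = ( alpha * delta + 1.0 - alpha ) / n;              // scalar ops
            std::vector< double > t( n, 0.0 );                        // grb::vxm
            for( std::size_t i = 0; i + 1 < L.row_start.size(); ++i )
                for( std::size_t k = L.row_start[ i ]; k < L.row_start[ i + 1 ]; ++k )
                    t[ L.col[ k ] ] += pr_next[ i ] * L.val[ k ];
            for( std::size_t i = 0; i < n; ++i ) t[ i ] += delta;     // grb::foldl
            r = 0.0;                                                  // grb::norm
            for( std::size_t i = 0; i < n; ++i ) r += std::fabs( pr[ i ] - t[ i ] );
            pr.swap( t );                                             // std::swap
        }
    }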
Examples

(2) Nearest-neighbours
I a canonical graph algorithm example
I modified semiring: (N0, min, +, ∞, 0)
I starting from vertex d:

[Figure: a 7-vertex weighted example graph on vertices a–g and its adjacency matrix A; the vector x, initialised to 0 at d and to the semiring 'zero' ∞ elsewhere, is repeatedly updated as x += A x under the (min, +) semiring. Graph illustration in TikZ from http://www.texample.net/tikz/examples/prims-algorithm/]

Walkthrough of the first iteration, nonzero by nonzero:
I x_d^in = 0, a_ad = 5, x_a^in = 'zero' = ∞, so x_a^out = min{0 + 5, ∞} = 5.
I Now move to the next row and see how vertex b interacts with d: w(d) = 0, w(d, b) = 9, x_b = ∞ ('zero'), so y_b = min{0 + 9, ∞} = 9.
I Finishing the iteration in one go: x now holds the shortest 1-hop distances from d.

Second iteration (x += A x again):
I From a to b: min{5 + 7, 9} = 9. From a to d: min{5 + 5, 0} = 0.
I From b to a, c, d, e, followed by d to a, b, e, f.
I From e to b, c, f, g; now only f remains to do in this iteration.
I From f to e, g. x now holds the shortest distances from d within two hops.

Stop condition: keep iterating until x does not change.
Examples
(2) Single-source shortest paths
I Modify the semiring: (N0, min, +, ∞, 0) and start from vertex d
I In C++, with uint an unsigned int:
grb::semiring< uint, grb::operators::min, grb::operators::add,
    grb::identities::infinity, grb::identities::zero > spR;
n = grb::nrows( A ); grb::Vector< bool > mask( n );
grb::Vector< uint > x( n ), y( n );
grb::set( x, d, 0 ); grb::set( mask, x ); grb::set( y, x );
grb::Vector< bool > eWiseEq( n ); bool eq = false;
while !eq do
    grb::mxv< descriptors::invert_mask | descriptors::no_casting
        | grb::descriptors::in_place >( y, mask, A, x, spR );
    grb::apply( eWiseEq, x, y, operators::is_equal< uint >() );
    eq = true; std::swap( x, y );
    grb::foldl( eq, eWiseEq, operators::is_equal< bool >() );
return x
Examples
Gabor Szarnyas recently produced an overview of algos:
I Breadth-First Search, Θ(|E|)
I Single-source Shortest Paths, Bellman-Ford, Θ(|V||E|)
I All-pairs shortest paths, Θ(|V|³)
I Minimum spanning tree, Boruvka, |E| log |V|
I Maximum flow, Θ(|V||E|²)
Some algorithms not at best known complexities:
I Dijkstra (SSSP), Prim (MST), max. independent sets, ...
Some recent work:
I Linear Algebraic Depth-First Search, Spampinato et al., 2019
I DFS previously a canonical ‘counter-example’ to GraphBLAS
I Adds permutation-based stack semantics to GraphBLAS
Summary
GraphBLAS is:
I a way to express graph algorithms in linear algebra
I a C11 standard with two compliant implementations
  I Tim Davis' SuiteSparse::GraphBLAS
  I IBM's GPI with a C11 GraphBLAS wrapper (Moreira et al.)
I two native C++ implementations (CMU, Huawei)
I increasingly many algorithms: LAGraph
Debate:
I LA is the ‘right way’ to look at graph algorithms
I one alternative: ‘think like a vertex’, a lot of adoption
Less debatable:
I Sparse LA software techniques can speed up graph algos
Central obstacles for SpMV multiplication performance
Shared-memory:
I limited memory throughput,
I inefficient cache use, and
I non-uniform memory access (NUMA) issues.
Distributed-memory:
I inefficient network use.
Initial analysis of Ax = b (SpMV) by roofline:
I read input matrix once
I two flops per nonzero
Arithmetic intensity is very low:
I minimise footprint of A
I vectorisation shouldn't help
(Image taken from da Silva et al., DOI 10.1155/2013/428078, Creative Commons Attribution License)
Bandwidth
Compression leads to better performance:
I Coordinate format storage (COO):
i = (0, 0, 1, 1, 2, 2, 2, 3)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a_00, a_04, . . . , a_32)

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}
The coordinate (COO) format: two flops versus five data words.
Θ(3nz) storage.
Bandwidth
Compression leads to better performance:
I Compressed Row Storage (CRS):
i_start = (0, 2, 4, 7, 8)    (replaces i = (0, 0, 1, 1, 2, 2, 2, 3))
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a_00, a_04, . . . , a_32)

for i = 0 to m − 1
    for k = i_start_i to i_start_{i+1} − 1
        y_i := y_i + v_k · x_{j_k}
The CRS format:
From Θ(3nz) storage to Θ(2nz + m + 1).
Can do the same column-wise, leading to CCS.
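For reference, the two kernels above written out in plain C++11 (a sketch; the vector and index types are this sketch's own):

    #include <cstddef>
    #include <vector>

    // COO kernel: streams i, j, v per nonzero -- Θ(3 nz) storage
    void spmv_coo( const std::vector< std::size_t > &i, const std::vector< std::size_t > &j,
                   const std::vector< double > &v, const std::vector< double > &x,
                   std::vector< double > &y )
    {
        for( std::size_t k = 0; k < v.size(); ++k )
            y[ i[ k ] ] += v[ k ] * x[ j[ k ] ];
    }

    // CRS kernel: the row index is implicit in i_start -- Θ(2 nz + m + 1) storage
    void spmv_crs( const std::vector< std::size_t > &i_start, const std::vector< std::size_t > &j,
                   const std::vector< double > &v, const std::vector< double > &x,
                   std::vector< double > &y )
    {
        for( std::size_t i = 0; i + 1 < i_start.size(); ++i )
            for( std::size_t k = i_start[ i ]; k < i_start[ i + 1 ]; ++k )
                y[ i ] += v[ k ] * x[ j[ k ] ];
    }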
Inefficient cache use
Visualisation of the SpMV multiplication Ax = y with nonzeroes processed in row-major order (CRS):
Accesses on the input vector are completely unpredictable.
Two orthogonal solution classes:
I reordering matrix rows and/or columns
I reordering matrix nonzeroes (not compatible with CRS!)
Matrix permutations
Goes back to the 70s, linked to cache reuse for SpMVs by the 90s:
I Das et al. (1994). The design and implementation of a parallel unstructured Euler solver;
I Sivan Toledo. (1997). Improving the memory-system performance of sparse-matrix vector multiplication;
I Pinar & Heath. (1999). Improving performance of sparse matrix-vector multiplication.
Network data movement during distributed-memory SpMV and cache misses during shared-memory SpMV are both bounded by

    Σ_i (λ_i − 1),

the (λ − 1)-metric (row-net model and zig-zag storage; Y & B, '09).
I Lengauer, T. (1990). Combinatorial algorithms for integrated circuit layout. Springer Science & Business Media.
I Catalyurek, U. V., & Aykanat, C. (1999). Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7), 673-693.
I Catalyurek, U. V., & Aykanat, C. (2001). A Fine-Grain Hypergraph Model for 2D Decomposition of Sparse Matrices. In IPDPS (Vol. 1, p. 118).
I Vastenhouw, B., & Bisseling, R. H. (2005). A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review, 47(1), 67-95.
I Bisseling, R. H., & Meesen, W. (2005). Communication balancing in parallel sparse matrix-vector multiplication. Electronic Transactions on Numerical Analysis, 21, 47-65.
Matrix permutations
Row-net model:
I columns correspond to vertices, rows to hyperedges.
[Figure: a small 4×4 sparse matrix and its row-net hypergraph, with columns 1–4 as vertices and rows 1–4 as hyperedges.]

Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, A. N. Yzelman & Rob H. Bisseling, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).
Matrix permutations
Can be done in 2D (medium- & fine-grain models) too. In practice:
Figure: the Stanford link matrix (left) and its 20-part reordering (right).
Sequential execution using CRS on Stanford:
18.99 (original), 9.92 (1D), 9.35 (2D) ms/mul.
Ref.: A Fine-Grain Hypergraph Model for 2D Decomposition of Sparse Matrices by Catalyurek & Aykanat (2001);
Two-dimensional cache-oblivious sparse matrix-vector multiplication by Yzelman & Bisseling (2011);
A medium-grain method for fast 2D bipartitioning of sparse matrices by Pelt & Bisseling (2014);
A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric SpMV by Alappat et al. (2019).
Nonzero reorderings

2D permutations cannot rely on CRS:
I vertical separators pollute cache; thus
I need to store nonzeroes in specific order.

This makes sense by itself as well; prior approaches include:
I Blocking combined with cache-oblivious traversals
I Hilbert on blocked dense matrix storage: Lorton & Wise, 2007
I Hilbert with COO: Haase, Liebmann, & Plank, 2007
I Hilbert with compression: Yzelman & Bisseling, ECMI '09
I Blocking with Morton inside blocks, COO+CRS: Buluc et al., 2009
I Morton-ordered blocks, quadtree store: Martone et al., 2010
I Hilbert-ordered blocks, fully compressed: Y & Bisseling, 2011
I Blocking with Hilbert-ordered blocks: Y & Roose, 2014

Sequential SpMV multiplication on the Wikipedia '07 link matrix:
345 (CRS), 203 (Hilbert), 245 (blocked Hilbert) ms/mul.
Cache-efficiency and bandwidth
Need to consider the whole picture; good cache efficiency but no compression? Compression but no cache optimisation? No gain!

A =
    4 1 3 0
    0 0 2 3
    0 0 0 2
    7 5 1 1

Bi-directional incremental CRS (BICRS):
    V  = [7 5 4 1 2 3 3 2 1 1]
    ΔJ = [0 1 3 1 5 4 5 4 3 1]
    ΔI = [3 -3 -2 1 -1 1 1 1]
Allows arbitrary traversals. Storage: Θ(2nz + row jumps + 1).
Vectorisation: no use, right?
Much faster with vectorisation on Xeon Phi 7120A (KNC). Why?
Latency-bound, not bandwidth-bound! Gather/scatter is critical.
Ref.: Yzelman. Generalised vectorisation for sparse matrix–vector multiplication. (2015)
NUMA
Each socket has local main memory where access is fast.
[Figure: two sockets, each with its CPUs attached to local memory.]

Memory access between sockets is slower, leading to non-uniform memory access (NUMA): access different sockets, different speed.
NUMA
I Access to only one socket: limited bandwidth
I Interleave memory pages across sockets: emulate uniform access
I Explicit data placement on sockets: best performance
One-dimensional data placement
Coarse-grain row-wise distribution, compressed, cache-optimised:
I explicit allocation of separate matrix parts per core,
I explicit allocation of the output vector on the various sockets,
I interleaved allocation of the input vector.
Ref.: Yzelman and Roose, "High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication", IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31 (2014).
Two-dimensional data placement
Distribute row- and column-wise (individual nonzeroes):
I most work touches only local data,
I inter-process communication minimised by partitioning;
I incurs cost of partitioning.
Ref.: Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication,IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
Ref.: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Some kernels illustrated
Fine-grained CRS, using OpenMP:
I no pre-processing required (vs. sequential)
I use numactl --interleave=all on NUMA system

#pragma omp parallel for private( i, k ) schedule( dynamic, 8 )
for i = 0 to m − 1 do
    for k = I_i to I_{i+1} − 1 do
        add V_k · x_{J_k} to y_i
Compressed Sparse Blocks, using Cilk:
I Block A into β × β submatrices,
I use numactl --interleave=all on NUMA system,
I omitted: need buffer if multiple threads on a row.

cilk_for each row of blocks
    for each block
        do SpMV with nonzeroes in Z-curve order
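As a sketch, the fine-grained OpenMP kernel above in C++11 over the CRS arrays I, J, V (names as in the pseudo-code; this is an illustration, not the paper's exact code):

    #include <cstddef>
    #include <vector>

    // fine-grained 1D CRS SpMV; threads grab chunks of 8 rows dynamically,
    // mirroring schedule( dynamic, 8 ) above
    void spmv_crs_omp( const std::vector< std::size_t > &I, const std::vector< std::size_t > &J,
                       const std::vector< double > &V, const std::vector< double > &x,
                       std::vector< double > &y )
    {
        const std::ptrdiff_t m = static_cast< std::ptrdiff_t >( I.size() ) - 1;
        #pragma omp parallel for schedule( dynamic, 8 )
        for( std::ptrdiff_t i = 0; i < m; ++i )
            for( std::size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                y[ i ] += V[ k ] * x[ J[ k ] ];
    }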
Some kernels illustrated
1D SpMV multiplication:
I when loading in A, distribute rows over threads
perform local SpMV

(That's really it!)

Caveat:
I take care of vector operations – they're distributed!
Some kernels illustrated
2D SpMV multiplication:
I partition and reorder A
for each j s.t. ∃ a_ij local to s while x_j is not local do
    bsp_get x_j from remote process
bsp_sync()

Perform SpMV y = Ax, using only those a_ij local to s

for each i s.t. ∃ a_ij local to s while y_i is not local do
    bsp_send (y_i, i) to the owner of y_i
bsp_sync()

while bsp_qsize() > 0 do
    (α, i) = bsp_move()
    add α to y_i
Results
Sequential CRS on Wikipedia ’07: 472 ms/mul. 40 threads BICRS:
21.3 (1D), 20.7 (2D) ms/mul. Speedup: ≈ 22x.
4 sockets, 10 core Intel Xeon E7-4870
Results
Average speedup on six large matrices:

                                   2 x 6    4 x 10    8 x 8
  –, 1D fine-grained, CRS∗           4.6       6.8      6.2
  Blocking, Morton, 1D FG, CSB       7.9      24.3     26.3
  Hilbert, Blocking, 1D, BICRS∗      5.4      19.2     24.6
  Hilbert, Blocking, 2D, BICRS†       −       21.3     30.8

†: uses an updated test set. (Added for reference versus a good 2D algorithm.)
As NUMA scales up, 1D algorithms lose efficiency.
Results
Cross-platform:
                      Structured   Unstructured   Average
  Intel Xeon Phi            21.6            8.7      15.2
  2x Ivy Bridge CPU         23.5           14.6      19.0
  NVIDIA K20X GPU           16.7           13.3      15.0
∗: Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication,IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
†: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memoryparallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Application to GraphBLAS
In summary: we know how to do high performance parallel SpMV.
I what about sparse matrix–sparse vector multiply (SpMSpV)?
I what about masks?
Suppose y = Ax , x sparse, no mask.
I Best data structure? Column-major!
Suppose y = Ax , x dense, but masked (y(find(m == 0)) = 0).
I Best data structure? Row-major!
Suppose y = Ax , x sparse, and masked?
I ...
Application to GraphBLAS
Sequential mode, four different SpM(Sp)V kernels:
I y = Ax , loop over nonzero indices of x : scatter, column-major
I y = Ax , loop over mask indices: gather, row-major
I y = xA, loop over nonzero indices of x : scatter, row-major
I y = xA, loop over mask indices: gather, column-major
Store matrix twice (row- and column-major). At runtime:
I choose variant with smallest loop-size.
Vector data structure, Θ(1) ops required for:
I checking if the ith vector entry is (non)zero
I jump to the next nonzero (iteration);
Use both an array & a stack to maintain vector nonzero indices.
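A minimal sketch (plain C++11; the class and member names are this sketch's own) of the 'array & stack' bookkeeping just mentioned: a dense assigned[] array gives Θ(1) membership tests, while the stack of indices gives Θ(1) iteration to the next nonzero and clearing proportional to the number of nonzeroes only.

    #include <cstddef>
    #include <vector>

    // sparse vector of length n: values[] and assigned[] are dense arrays,
    // nonzeroes is a stack of the indices that are currently assigned
    struct SparseVector {
        std::vector< double >      values;
        std::vector< bool >        assigned;
        std::vector< std::size_t > nonzeroes;

        explicit SparseVector( std::size_t n ) : values( n ), assigned( n, false ) {}

        bool isNonzero( std::size_t i ) const { return assigned[ i ]; }   // Θ(1)

        void setElement( std::size_t i, double v ) {                      // Θ(1)
            if( !assigned[ i ] ) { assigned[ i ] = true; nonzeroes.push_back( i ); }
            values[ i ] = v;
        }

        // clearing touches only the current nonzeroes, not all n entries
        void clear() {
            for( std::size_t i : nonzeroes ) { assigned[ i ] = false; }
            nonzeroes.clear();
        }
    };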
Application to GraphBLAS
We start simple. For shared-memory parallel:
I use OpenMP-based parallelisation
For distributed-memory parallel:
I 1D row-wise block-cyclic distribution of matrices (index arithmetic sketched below)
I matching block-cyclic distribution of vectors
I rely on OpenMP (or sequential) backend
Due to row-wise 1D distribution over p processes:
I y = Ax SpM(Sp)V requires globally synchronised/replicated x
I y = xA SpM(Sp)V results in p vectors y_k, with y = ∑_k y_k
Use stack-based synchronisation/reduction when useful:
I choose variant with lowest communication cost,
I requires standard collectives: allreduce, allgather, alltoall(v).
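For concreteness, a minimal sketch of the index arithmetic behind such a 1D row-wise block-cyclic map, assuming block size b and p processes (the actual layout in the implementation may differ): global row i is owned by process (i / b) mod p, and vectors follow the same map.

#include <cstddef>

struct BlockCyclic1D {
    size_t b; // block size
    size_t p; // number of processes
    // process that owns global row (or vector index) i
    size_t owner( size_t i ) const { return ( i / b ) % p; }
    // position of global index i within its owner's local storage
    size_t localIndex( size_t i ) const { return ( ( i / b ) / p ) * b + i % b; }
};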
In relation to graph frameworks
Since Google’s Pregel, a plethora of graph frameworks:
I CombBLAS, Gunrock, GraphX, Giraph, Ligra, PowerGraph, PowerLyra, Galois, GraphChi, TigerGraph, Venus, Neo4j, ArangoDB, Titan, ...
One of the main Ligra features:
I automatic push/pull ‘direction-optimisation’
I also done by us: scatter/gather & row/col kernel selection
In relation to other frameworks:
I MapReduce maps to BLAS level 1 (only vector ops)
I Pregel’s vertex-centric programs map directly to SpM(Sp)V (sketched below)
I See, e.g., GraphMAT by Sundaram et al. (2015).
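To make the Pregel-to-SpM(Sp)V mapping concrete, a hypothetical sketch in plain C++ (not any specific framework's API), over a column-major (CCS) adjacency structure in which column j lists the targets that vertex j sends to: one superstep, where every frontier vertex sends along its out-edges and receivers combine incoming messages, is exactly the scatter-style SpMSpV from the kernel slides.

#include <cstddef>
#include <vector>

void superstep( const std::vector< size_t > &colStart,  // CCS column pointers
                const std::vector< size_t > &rowIndex,  // CCS row indices (targets)
                const std::vector< double > &value,     // edge values
                const std::vector< size_t > &frontier,  // nonzero indices of x
                const std::vector< double > &x,         // outgoing messages
                std::vector< double > &y,               // combined incoming messages
                double (*combine)( double, double ) ) { // semiring 'addition'
    for( const size_t j : frontier ) {                  // loop over nonzeroes of x
        for( size_t k = colStart[ j ]; k < colStart[ j + 1 ]; ++k ) {
            // scatter: y_i ⊕= a_ij ⊗ x_j
            y[ rowIndex[ k ] ] = combine( y[ rowIndex[ k ] ], value[ k ] * x[ j ] );
        }
    }
}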
Interoperability
One way of dealing with many auxiliary frameworks:
I be compatible with most of them
I matches LPF’s philosophy (Suijlen & Y, 2019)
I distributed-memory GraphBLAS built on top of LPF
I allows interfacing with any framework over TCP/IP
I LPF processes reside in the same process space as the host
Applications:
I Spark ML acceleration using LPF
I Graph Analytics acceleration using LPF + GraphBLAS
I GraphBLAS on Spark
I Graph DB on Edge
Future work & Outlook
Implementation:
I use of Suijlen’s new BSP collectives (2019)
I compatibility layer with the C11 standard
Future Work:
I incorporate more advanced matrix partitioning methods?
I how to extend to multi-linear algebra (tensor computations)?
I applicable to hypergraph computations?
I graph- and sparse neural networks
I for which other (graph) algorithms is GraphBLAS applicable?
I which additions would enlarge suitable areas significantly?
Backup slides
GraphBLAS
A working example:
#include <graphblas.hpp>
int main() {
const size_t num_cities = ... //some input matrix size
grb::init();
grb::Matrix< double > distances( num_cities, num_cities );
grb::build( distances, ... ); //input data from file
//or memory
grb::Vector< double > x( num_cities ), y( num_cities );
grb::set( x, 0.0, 4 ); //set city number 4 to
//have distance 0.0
...
GraphBLAS
A working example (continued):
...
//declare an alternative semiring on doubles:
grb::Semiring< double, double, double, double,
grb::operators::min, //‘plus’
grb::operators::add, //‘multiply’
grb::identities::infinity, //‘0’
grb::identities::zero //‘1’
> ring;
//calculate the shortest distances from all cities to
//city #4, allowing only a single path
grb::mxv( y, distances, x, ring );
...
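For reference: with this semiring, ‘addition’ is min (identity ∞) and ‘multiplication’ is + (identity 0), so the mxv call above computes, entrywise, y_i = min_j ( a_ij + x_j ), where the minimum effectively runs over stored entries a_ij and assigned entries x_j only; everything unassigned acts as the additive identity ∞.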
GraphBLAS
A working example (continued):
...
//calculate the shortest distances from all cities to
//city #4, allowing only a single path
grb::mxv( y, distances, x, ring );
//calculate the shortest distances from all cities to
//city #4, allowing two ‘hops’
grb::mxv( x, distances, y, ring );
//example output via iterators and exit:
writeResult( x.cbegin(), x.cend(), ... );
grb::finalize();
return 0;
}
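Note that each further mxv under this semiring extends the paths by exactly one edge, x_i ← min_j ( a_ij + y_j ), so k applications yield shortest distances over exactly k hops; for at-most-k-hop (Bellman–Ford style) distances one would additionally take the element-wise min with the previous iterate, or give A an explicit zero diagonal.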
Vectorised BICRS
Vectorised BICRS: l = p × q = 2 × 2
A =
4 1 3 0
0 0 2 3
1 0 0 2
7 0 1 1
while there are nonzero blocks do
load next l row indices into r5
r5 = (0, 1, 2, 3)
gather output vector elements into r6 (using r5)
r6 = (y0, y1, y2, y3)
for offset = 0 to l − 1 step p
set r0 to all zero
handle all nonzero blocks sharing these rows
...
Vectorised BICRS: 2 × 2
A =
4 1 3 0
0 0 2 3
1 0 0 2
7 0 1 1
for each nonzero block do
load nonzeroes into r3, load nonzero column indices in r4
r3 = (4, 1, 2, 3), r4 = (0, 1, 2, 3)
gather corresponding elements from x into r2 (using r4)
r2 = (x0, x1, x2, x3)
do vectorised multiply-add
r1 = r1 + r2 · r3
Vectorised BICRS: 2 × 2
A =
4 1 3 0
0 0 2 3
1 0 0 2
7 0 1 1
while there are nonzero blocks do
...
set r0 to all zero
handle all nonzero blocks sharing these rows
reduce output r1 into r6
r6 = (y_offset + (r1)_1 + (r1)_2, y_(offset+1) + (r1)_3 + (r1)_4, . . .)
end for
scatter r6 onto y (using r5)
end while
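To make the inner block step concrete, a hypothetical scalar emulation of the l = 4 kernel (register names mirror the pseudocode above; a real implementation would use SIMD gather and fused multiply-add intrinsics instead of plain loops):

#include <cstddef>

static const size_t l = 4; // vector length, p × q = 2 × 2 blocks

// r4: column indices of the block's nonzeroes, r3: their values,
// r1: running accumulator for the block's rows (zeroed before the block loop)
void blockMultiplyAdd( const double *x, const size_t *r4, const double *r3,
                       double *r1 ) {
    double r2[ l ];
    for( size_t k = 0; k < l; ++k ) { r2[ k ] = x[ r4[ k ] ]; }       // gather from x
    for( size_t k = 0; k < l; ++k ) { r1[ k ] += r2[ k ] * r3[ k ]; } // multiply-add
}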
Sequential SpMV: reordering
Input
Sequential SpMV: reordering
Column partitioning
Sequential SpMV: reordering
Column permutation
Permuting to SBD form
Mixed row detection
Permuting to SBD form
Row permutation
Sequential SpMV: reordering
[figure: per-block cache-miss counts: no cache misses; 1 cache miss per row; 1 cache miss per row; 3 cache misses per row]
Sequential SpMV: reordering
[figure: per-block cache-miss counts: no cache misses; 1 cache miss per row; 3 cache misses; 1 cache miss per row; 7 cache misses per row; 1 cache miss per row; 3 cache misses per row; 1 cache miss per row]
Sequential SpMV: reordering
[figure panels: 1D (p = 20, ε = 0.1); Finegrain (p = 100, ε = 0.1)]