A foundational communication layer and a linear algebraic programming methodology
A. N. Yzelman
Computing Technology Lab
Huawei Technologies Zurich
30th of October, 2019
Contents
Lightweight Parallel Foundations
  Communication
  Performance model
  Execution, interoperability, and more

GraphBLAS
  How to achieve high performance: past lessons learned
  Lessons learned applied to GraphBLAS
Most work done while at the Paris R&D centre.
Huawei
Huawei is a large company with expanding business areas.
I around 180 000 employees worldwide
I around 70 000 engage in R&D
I 23 research centres in Europe alone
I 100+ billion USD yearly revenue
I 25–30 percent year-over-year revenue increases
Computing Technologies Lab in Zurich (est. late 2019):
I long-term fundamental research
I hardware, architecture, software, tools, theory, and models:
  I microarchitectures, systems, interconnects, storage, memory, ...
  I compilers, HW/SW co-design, HW design tools, SW tools, ...
  I model-driven design, lower bounds, programming models, ...
I Unique (in Huawei)
I novel lab to foster open research.
Lightweight Parallel Foundations
LPF is a communications layer akin to BSPlib.
What is LPF not about:
I politics
I making parallel programming easier
I replacing BSPlib, MPI, MapReduce, Spark, . . .
What it is about:
I ‘close to the metal’ performance
I guaranteed performance
I enhanced robustness
I formal semantics
I interoperability
Joint work with Wijnand Suijlen (Huawei Technologies France).
I introductory paper: https://arxiv.org/abs/1906.03196
Application Programming Interface
Communication:
I lpf_get, lpf_put: one-sided, non-buffered, non-blocking
I lpf_sync: block until local comms (in & out) completed
I lpf_msg_attr_t, lpf_sync_attr_t: allows for extensions
Akin to bsp_hpget, bsp_hpput, and bsp_sync.
No other comms primitives!
As in BSPlib, memory registration is required:
I lpf_memslot_t: LPF registers slots instead of addresses
I lpf_register_local, lpf_register_global: local vs. remote reference
I lpf_deregister: takes both local & global slots
Changes are valid immediately.
Overlapping writes
Suppose p > 1 processes all execute this BSPlib code:
I bsp_push_reg( &l_err, sizeof(l_err) );
I bsp_push_reg( &g_err, sizeof(g_err) );
I bsp_sync();
I ...
I bsp_hpput( 0, &l_err, 0, &g_err, 0, sizeof(l_err) );
I bsp_sync();
Then g_err at process 0 becomes undefined.
LPF has clear semantics for such cases:
I for each memory region subject to overlapping writes
I consider all incoming communication requests M
I LPF must guarantee the result is a serialisation of M
I Overlapping reads and writes remain undefined(!)
Communication example
Suppose p > 1 processes, k of which have a local error l_err.
Each process executes:
I lpf_err_t l_err, g_err; lpf_memslot_t s_lerr, s_gerr;
I lpf_register_local( ctx, &l_err, sizeof(l_err), &s_lerr );
I lpf_register_global( ctx, &g_err, sizeof(g_err), &s_gerr );
I ...
I if l_err != LPF_SUCCESS then
    lpf_put( ctx, s_lerr, 0, 0, s_gerr, offset, sizeof(l_err), LPF_MSG_DEFAULT );
I lpf_sync( ctx, LPF_SYNC_DEFAULT );
By conflict resolution, g_err at PID 0 will be a valid error code.
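To make this concrete, below is a hedged sketch of the same error broadcast as a complete SPMD function. It only uses the primitives and the put signature quoted on this slide; the header name is an assumption, and it presumes the context's slot and message capacities were already reserved (buffer control is covered later in this deck).

    #include <lpf/core.h> /* assumed header name for the LPF core API */
    #include <stdio.h>

    /* every process reports a local error flag; erroneous processes overwrite
     * g_err at PID 0, and overlapping writes are serialised, so PID 0 ends up
     * with one of the written (valid) error codes */
    void report_errors( lpf_t ctx, lpf_pid_t s, lpf_pid_t p, lpf_args_t args )
    {
        (void) p; (void) args;
        lpf_err_t l_err = LPF_SUCCESS; /* ... outcome of some local computation ... */
        lpf_err_t g_err = LPF_SUCCESS;
        lpf_memslot_t s_lerr, s_gerr;

        /* assumes enough slot and message capacity was reserved beforehand */
        lpf_register_local(  ctx, &l_err, sizeof(l_err), &s_lerr );
        lpf_register_global( ctx, &g_err, sizeof(g_err), &s_gerr );

        if( l_err != LPF_SUCCESS ) {
            /* write my error code into g_err at PID 0, at offset 0 */
            lpf_put( ctx, s_lerr, 0, 0, s_gerr, 0, sizeof(l_err), LPF_MSG_DEFAULT );
        }
        lpf_sync( ctx, LPF_SYNC_DEFAULT );

        if( s == 0 && g_err != LPF_SUCCESS ) {
            (void) printf( "some process reported an error\n" );
        }
    }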
Performance model
Communication is not free.
Bulk Synchronous Parallel (BSP, Valiant 1990):
I The current superstep i is ended by an lpf_sync
I Let t_i^s be the number of bytes transmitted by PID s at step i
I Let r_i^s be the number of bytes received by PID s at step i
I Each call to lpf_put at process s to k increases t_i^s and r_i^k
I Each call to lpf_get at process s from k increases r_i^s and t_i^k
I The subsequent lpf_sync takes at most h_i·g + l time,
  I with h_i = max_s max(t_i^s, r_i^s) the superstep's h-relation, and
  I with g, l machine-specific parameters.
I Majority of effort goes into ensuring compliance to model.
Some questions:
I How does an algorithm know the value for g and l?
I Can we guarantee anything about the other LPF primitives?
Immortal algorithms: reduction
Every process has a number α to be reduced at process 0:
I direct all-to-one: (p − 1)(g + 1) + l flops.
I binary tree: ⌈log_2 p⌉(g + l + 1) flops.
I optimal: min_{b ∈ {2,3,...,p}} ⌈log_b p⌉((b − 1)(g + 1) + l) flops.
I.e., optimal algorithm depends on p, g, and l. Examples:
I p = 100, g = 250, l = 10 000, cost in flops:
  I 71 757 for b = 2 (binary tree),
  I 24 518 for b = 10,
  I 34 849 for b = 100 (direct).
I p = 100, g = 25, l = 10 000, direct is optimal at 12 574 flops.
I p = 10 000, g = 250, l = 100 000, b = 100 is optimal.
Rationale: an immortal algorithm requires run-time introspection of the performance model's parameters.
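The choice of b can be made at run time. A minimal sketch (plain C, not part of the LPF API; optimal_fanout is a name invented for this illustration) that evaluates the cost formula above for all fan-outs:

    #include <stddef.h>

    /* given p processes and the machine parameters g and l (e.g. as obtained
     * via lpf_probe), return the fan-out b that minimises
     *   ceil(log_b p) * ((b - 1) * (g + 1) + l)                              */
    static size_t optimal_fanout( size_t p, double g, double l )
    {
        size_t best_b = 2;
        double best_cost = -1.0;
        for( size_t b = 2; b <= p; ++b ) {
            /* depth = ceil( log_b p ), computed without floating point */
            size_t depth = 0;
            for( size_t reach = 1; reach < p; reach *= b ) { ++depth; }
            const double cost = (double)depth * ( (double)(b - 1) * (g + 1.0) + l );
            if( best_cost < 0.0 || cost < best_cost ) { best_cost = cost; best_b = b; }
        }
        return best_b;
    }

    /* e.g. optimal_fanout( 100, 250.0, 10000.0 ) == 10, at 24 518 flops,
     * matching the first example above                                       */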
Machine and performance introspection
lpf_probe returns the number of parallel processes the machine supports, how many are currently free, as well as two functions:
I double g( bsp_pid_t p, size_t wordsize, bsp_sync_attr_t );
I double l( bsp_pid_t p, size_t wordsize, bsp_sync_attr_t );
Other performance guarantees
Time spent in communication (lpf_sync) defined by:
I algorithm’s puts/gets (source, destination, and size);
I machine parameters captured by g , l ;
I BSP’s performance model.
Time spent in computation:
I user code;
I calls to LPF primitives(!).
Some performance guarantees:
I lpf_register_local, lpf_register_global: O(size)
I lpf_get, lpf_put, lpf_deregister: Θ(1)
I lpf_probe: Ω(1)
Rationale: asymptotic behaviour must be defined or process-local complexity analysis becomes impossible.
Memory guarantees & buffer control
Memory may be constrained:
I small manycore device, or
I core counts that grow faster than available memory.
Even taking Θ(p) memory per process may be too much.
LPF’s internal buffers are explicitly controlled at run-time:
I lpf_resize_memory_register( lpf_t context, size_t max_regs )
I lpf_resize_message_queue( lpf_t context, size_t max_msgs )
Compare these to calling ahead to make a reservation.
Initial values (fresh contexts): 0 memslots, 0 messages(!)
I first calls in an LPF program are likely these two primitives.
LPF doesn’t forgive:
I registering without reservation: undefined behaviour (UB).
I sending (or receiving!) a message without reservation: UB.
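A minimal sketch of what this looks like in practice; the header name and the assumption that new capacities take effect at the next lpf_sync are marked in the comments:

    #include <lpf/core.h> /* assumed header name */

    void spmd( lpf_t ctx, lpf_pid_t s, lpf_pid_t p, lpf_args_t args )
    {
        (void) s; (void) args;
        /* reserve capacity before any registration or communication;
         * skipping this is undefined behaviour */
        lpf_resize_memory_register( ctx, 2 );   /* up to two memory slots         */
        lpf_resize_message_queue( ctx, p );     /* up to p messages per superstep */
        lpf_sync( ctx, LPF_SYNC_DEFAULT );      /* assumed: new capacities activate here */
        /* ... registrations, puts/gets, and further supersteps follow ... */
    }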
Some performance results
We have a POSIX Threads LPF implementation:

[Plot: effective GFLOP/s (5n log(n)/T) versus vector length log(n), comparing FFTW, MKL, and LPF - Pthread.]

HPBSP FFT compared to FFTW3 and Intel MKL
I 8-socket shared-memory Intel Xeon E7-8890 v2
I Local memory: 2.8 Gbyte/s/core.
I immortal FFT algorithm: Valiant (’90), Inda & Bisseling (’01).
Some performance results
Ibverbs LPF implementation + PThreads LPF = Hybrid LPF:

[Plot: effective GFLOP/s (5n log(n)/T) versus vector length log(n), comparing FFTW, MKL, and LPF - Hybrid - RB.]

HPBSP FFT compared to FFTW3 and Intel MKL
I 8 nodes times 2 sockets Intel Xeon E5-2650
I Local memory: 4.3 Gbyte/s/core, FDR IB: 0.4 Gbyte/s/core.
I immortal FFT algorithm: Valiant (’90), Inda & Bisseling (’01).
Execution types
An LPF SPMD program has the following signature:
I void (*f)( lpf_t context, lpf_pid_t s, lpf_pid_t P, lpf_args_t args );
An LPF SPMD program is started through one of:
I lpf_exec( lpf_t context, lpf_pid_t P, lpf_spmd_t program, lpf_args_t args );
I lpf_hook( lpf_init_t init, lpf_spmd_t program, lpf_args_t args );
I lpf_rehook( lpf_t ctx, lpf_spmd_t program, lpf_args_t args );
These primitives block until the LPF program completes.
lpf_args_t consists of six fields:
I const void * input, size_t input_size,
I void * output, size_t output_size,
I const lpf_func_t f_symbols, size_t f_size.
Hello World
void hello_world( lpf_t ctx, lpf_pid_t s, lpf_pid_t P, lpf_args_t args ) {
    (void) ctx; (void) args;
    (void) printf( "Hello world from PID %d / %d\n", s, P );
}

int main( int argc, char ** argv ) {
    const lpf_err_t err = lpf_exec(
        LPF_ROOT, LPF_MAX_P,
        &hello_world, LPF_NO_ARGS
    );
    return err == LPF_SUCCESS ? 0 : 255;
}
Encapsulation
Libraries that implement algorithms using LPF should
I define an lpf_spmd_t entry function, and
I perform I/O using lpf_args_t.
The library should be aware of exec vs. hook:
I in exec, should distribute input or read input in parallel
I in (re)hook, transfer input 1:1 between host and slave
Example:
I double * x, * y; size_t n;
I ... // initialise x, y, n and retrieve or compute x
I lpf_args_t fft_args; fft_args.input = x; fft_args.output = y;
I fft_args.input_size = fft_args.output_size = 2 * n * sizeof(double);
I lpf_rehook( context, &lpf_fft, fft_args );
I ... // use the FFT of x, now stored in y
Interoperability
Over the past decade, a proliferation of parallel programming frameworks:
I Big Data: MapReduce, Spark, Giraph, Flink, Ligra, PowerGraph, Storm, Hive, Presto, Samza, Heron, Kudu, . . .
I HPC: MPI, OpenMP, BSPlib, PThreads, Cilk, TBB, Legion, OCR, Chapel, Charm++, PaRSec, OmpSs, StarPU, HPX, . . .
LPF does not aim to make all of these obsolete.
Big Data's success is due to programmer efficiency,
I not due to performance efficiency.
LPF instead enables interoperability: immortal algorithms should
I transparently integrate with any parallel framework, thus
I be enabled for use as widely as possible.
Rationale: users should use tools best suited for the job at hand.
Interoperability
The lpf_hook is like lpf_rehook,
I but from arbitrary parallel contexts, not only LPF contexts;
I requires a valid lpf_init_t instance;
I creating an lpf_init_t is implementation-defined.
In our LPF implementation, one can be retrieved using TCP/IP:
I char * hostname = NULL, * portname = NULL;
I lpf_pid_t process_id = 0, nprocs = 0;
I lpf_args_t args = LPF_NO_ARGS;
I ... // user code decides the above variables
I lpf_init_t init = LPF_INIT_NONE;
I lpf_mpi_initialize_over_tcp( hostname, portname, 30000, process_id, nprocs, &init );
I lpf_hook( init, &spmd, args );
I ... // user code continues making use of args.output
Interoperability
A bridge between Big Data and HPC (Y., PMAA ’16):
I Spark I/O via native RDDs and native Scala interfaces;
I Rely on serialisation and the JNI, switch to C;
I Intercept Spark's exec. model via lpf_hook, switch to SPMD;
I Set up and enable inter-process RDMA communications.
Test matrix: cage15, n = 5 154 859, nz = 99 199 551.
Robustness
LPF allows for the following return codes (lpf_err_t):
I LPF_SUCCESS
I LPF_OUT_OF_MEMORY
I LPF_ERR_FATAL
All errors except LPF_ERR_FATAL:
I must have no side-effects
I must be mitigable
On encountering LPF_ERR_FATAL:
I any subsequent call to any LPF primitive results in UB;
I user code can only clean up and return.
All errors are local;
I no globally consistent error codes, even for lpf_sync(!)
Formal semantics
LPF’s 12 primitives have their semantics formally defined.
Enables use of formal methods in parallel software. E.g.,
I verification of (correct) use of LPF primitives
I automatic cost analysis
Combined with sequential verification tools:
I verification of user programs
I verified parallel code generation
State of an LPF machine is a vector of sequential states;
I only the lpf_sync induces global change,
I calling any other primitive incurs local changes only.
BSPlib: Tesson & Loulergue, 2007; Gava & Fortin, 2008; Jakobsson et al., 2017; Jakobsson, 2018. For LPF:
I modified to work with unbuffered DRMA
Extensibility
Functional and performance semantics extended through attributes:
I lpf_msg_attr_t: an attribute attached to a put or get
I lpf_sync_attr_t: an attribute attached to a superstep
Example of a change in functional semantics:
I Message attribute changing a put into an accumulate
Example of a change in performance semantics:
I optimisations for pre-defined communication patterns
I zero-cost synchronisation (Alpert & Philbin, ’97)
Example of a change in both:
I message attribute introducing staleness (Xing et al., ’15)
Summary
LPF is a communication layer for immortal algorithms:
1. allows implementing portable immortal algorithms
2. allows interoperable use of immortal algorithms
LPF is robust:
I formal specification of primitives
I error codes and mitigation
LPF supports easier-to-use, higher-level libraries:
I Encapsulation (lpf_rehook) aids library design (FFT, e.g.)
I Bulk-Synchronous Message Passing (BSMP, send/move)
I Collectives library (Suijlen, 2019)
I BSPlib interface
LPF is extensible.
GraphBLAS
Motivation
What is GraphBLAS not about:
I politics
I replacing BSPlib, MPI, MapReduce, Spark, . . .
What it is about:
I making parallel programming easier
I close-to-the-metal performance
I guaranteed performance
I interoperability
Community:
http://www.graphblas.org
Note that the standard API is C11, ours is C++11.
I joint work with Jonathan M. Nash & Daniel Di Nardo
History
APIs & Software:
I Basic Linear Algebra Subprograms, Lawson et al., 1979
I Sparse BLAS, Remington & Pozo, 1996
I Combinatorial BLAS (CombBLAS), Buluc & Gilbert, 2011
I GraphBLAS, GraphBLAS.org, 2015 onwards
I GraphBLAS Template Library (GBTL), C++, CPU + GPU; McMillan et al., CMU and others, 2015 onwards
I This work, C++11, CPU + cluster + mobile; Y. et al., Huawei Paris, 2016 onwards
I SuiteSparse::GraphBLAS, C11, CPU; Davis, TAMU, 2018 onwards
I GraphBLAST, C11 (+Gunrock), GPU; Yang et al., U. Illinois and others, 2019 onwards.
History
Conceptually, the idea of using math concepts in programming, orgeneralised linear algebraic concepts, seems somewhat recurrent:
I The Design and Analysis of Computer Algorithms,Aho, Hopcroft, Ullman (1974)
I Introduction to Algorithms (first edition?),Cormen, Leiserson, Rivest (1990)
I Elements of Programming,Alexander Stepanov & Paul McJones (2009)
I Graph Algorithms in the Language of Linear Algebra,Jeremy Kepner & John Gilbert (2011)
I From Mathematics to Generic Programming,Alexander Stepanov & Daniel Rose (2015)
Core concepts
I scalars, α ∈ D: standard C++ (POD) types
I vectors, x ∈ Dn, sparse and dense: grb::Vector< D > x;
I matrices, A ∈ Dm×n, sparse: grb::Matrix< D > A;
I In the above templates, D may be void for pattern-only data.
I unary operators, f : D1 → D2
I binary operators, g : D1 × D2 → D3
I monoid, < D1, D2, D3, ⊕, 0 >
  I ⊕ : D1 × D2 → D3
  I ∀a ∈ D1, b ∈ D2 : ⊕(a, 0) = a, ⊕(0, b) = b
  I ⊕ must be associative
I semiring, < D1, D2, D3, D4, ⊕, ⊗, 0, 1 >
  I ⊗ : D1 × D2 → D3, forms a monoid with 1
  I ⊕ : D3 × D4 → D4, forms a commutative monoid with 0
  I Left-distributive: ⊗(m, ⊕(a, b)) = ⊕(⊗(m, a), ⊗(m, b))
  I Right-distributive: ⊗(⊕(a, b), m) = ⊕(⊗(a, m), ⊗(b, m))
  I ∀a ∈ D1, b ∈ D2 : ⊗(a, 0) = 0, ⊗(0, b) = 0
GraphBLAS
Graph algorithms in the language of linear algebra:
I a graph (V, E) is a sparse matrix A ∈ D_E^{|V|×|V|}
I an edge is a nonzero (i, j) coordinate in A
I an edge weight is a nonzero a_ij ∈ A
I a vertex is a coordinate i, 0 ≤ i < |V|
I a vertex weight is a vector nonzero x_i, x ∈ D_V^{|V|}
Graph algos as (sparse) linear algebra, parametrised in semirings:
I different from C++ overloaded '+', '*' operators
I same sparse container may be subject to different semirings, reinterpreting computation for statically typed A and x:
  I grb::mxv( y, A, x, ring1 );
  I grb::mxv( y, A, x, ring2 );
Kepner & Gilbert: GA in the Language of LA, DOI: 10.1137/1.9780898719918 (SIAM, 2011)
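A minimal, self-contained sketch of this idea in plain C++11 (deliberately not the grb:: API quoted above; all names below are the sketch's own): the same compressed container, multiplied under two different semirings.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // a semiring here is just an "add", a "multiply", and the additive identity
    template< typename D >
    struct MySemiring {
        D (*add)( D, D );
        D (*mul)( D, D );
        D zero;
    };

    // CRS storage: row pointers, column indices, values
    struct MyCRS {
        std::vector< std::size_t > row_start, col;
        std::vector< double > val;
    };

    // y = A x over an arbitrary semiring: the container does not change,
    // only the interpretation of "+" and "*" does
    std::vector< double > my_mxv( const MyCRS &A, const std::vector< double > &x,
                                  const MySemiring< double > &ring )
    {
        std::vector< double > y( A.row_start.size() - 1, ring.zero );
        for( std::size_t i = 0; i + 1 < A.row_start.size(); ++i )
            for( std::size_t k = A.row_start[ i ]; k < A.row_start[ i + 1 ]; ++k )
                y[ i ] = ring.add( y[ i ], ring.mul( A.val[ k ], x[ A.col[ k ] ] ) );
        return y;
    }

    // ring1: the usual (+, *) semiring -- a standard SpMV
    const MySemiring< double > ring1 = {
        []( double a, double b ) { return a + b; },
        []( double a, double b ) { return a * b; },
        0.0 };
    // ring2: the (min, +) semiring -- one relaxation step of shortest paths
    const MySemiring< double > ring2 = {
        []( double a, double b ) { return a < b ? a : b; },
        []( double a, double b ) { return a + b; },
        std::numeric_limits< double >::infinity() };

Calling my_mxv( A, x, ring1 ) performs the usual multiplication, while my_mxv( A, x, ring2 ) reinterprets the same A as distances, mirroring the two grb::mxv calls above.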
Examples
(1) PageRank
I canonical example of a graph LA algo
I 'regular' LA (R, +, ∗, 0, 1): dot products, sparse matrix–dense vector multiplication, element-wise vector ops, ...
I With L the link matrix and row_sum its row sums, the inner loop only, in pseudo-code with the standard and GraphBLAS C++ primitives used:

for iter = 0 to max_iter while r > tol do
    δ = (pr, row_sum)                        grb::dot
    pr_next = pr .* row_sum                  grb::apply
    δ = 1/n (α ∗ δ + 1 − α)                  standard C++ scalar ops
    pr_next = pr_next · L                    grb::vxm
    pr_next = pr_next .+ δ                   grb::foldl
    r = ||pr − pr_next||_1                   grb::norm
    pr = pr_next                             std::swap
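As an illustration only, the following plain C++11 sketch transliterates the pseudo-code above step by step, with dense std::vectors and a CRS link matrix; the roles of row_sum, α, and tol are taken as on the slide, and all names belong to this sketch rather than to any GraphBLAS API.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct LinkCRS { std::vector< std::size_t > row_start, col; std::vector< double > val; };

    // one PageRank solve following the slide's inner loop; pr and row_sum are
    // dense vectors of length n, L is the sparse link matrix
    void pagerank( const LinkCRS &L, const std::vector< double > &row_sum,
                   std::vector< double > &pr, double alpha, double tol,
                   std::size_t max_iter )
    {
        const std::size_t n = pr.size();
        double r = tol + 1.0;
        for( std::size_t iter = 0; iter < max_iter && r > tol; ++iter ) {
            double delta = 0.0;                                       // grb::dot
            for( std::size_t i = 0; i < n; ++i ) delta += pr[ i ] * row_sum[ i ];
            std::vector< double > pr_next( n );                       // grb::apply
            for( std::size_t i = 0; i < n; ++i ) pr_next[ i ] = pr[ i ] * row_sum[ i ];
            delta = ( alpha * delta + 1.0 - alpha ) / n;              // scalar ops
            std::vector< double > t( n, 0.0 );                        // grb::vxm
            for( std::size_t i = 0; i + 1 < L.row_start.size(); ++i )
                for( std::size_t k = L.row_start[ i ]; k < L.row_start[ i + 1 ]; ++k )
                    t[ L.col[ k ] ] += pr_next[ i ] * L.val[ k ];
            for( std::size_t i = 0; i < n; ++i ) t[ i ] += delta;     // grb::foldl
            r = 0.0;                                                  // grb::norm
            for( std::size_t i = 0; i < n; ++i ) r += std::fabs( pr[ i ] - t[ i ] );
            pr.swap( t );                                             // std::swap
        }
    }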
Examples

(2) Nearest-neighbours
I a canonical graph algorithm example
I modified semiring: (N0, min, +, ∞, 0)
I starting from vertex d:

[Figure: a 7-vertex weighted example graph on vertices a–g and its adjacency matrix A; the vector x, initialised to 0 at d and to the semiring 'zero' ∞ elsewhere, is repeatedly updated as x += A x under the (min, +) semiring. Graph illustration in TikZ from http://www.texample.net/tikz/examples/prims-algorithm/]

Walkthrough of the first iteration, nonzero by nonzero:
I x_d^in = 0, a_ad = 5, x_a^in = 'zero' = ∞, so x_a^out = min{0 + 5, ∞} = 5.
I Now move to the next row and see how vertex b interacts with d: w(d) = 0, w(d, b) = 9, x_b = ∞ ('zero'), so y_b = min{0 + 9, ∞} = 9.
I Finishing the iteration in one go: x now holds the shortest 1-hop distances from d.

Second iteration (x += A x again):
I From a to b: min{5 + 7, 9} = 9. From a to d: min{5 + 5, 0} = 0.
I From b to a, c, d, e, followed by d to a, b, e, f.
I From e to b, c, f, g; now only f remains to do in this iteration.
I From f to e, g. x now holds the shortest distances from d within two hops.

Stop condition: keep iterating until x does not change.
Examples
(2) Single-source shortest paths
I Modify the semiring: (N0, min, +, ∞, 0) and start from vertex d
I In C++, with uint an unsigned int:
grb::semiring< uint, grb::operators::min, grb::operators::add,
    grb::identities::infinity, grb::identities::zero > spR;
n = grb::nrows( A ); grb::Vector< bool > mask( n );
grb::Vector< uint > x( n ), y( n );
grb::set( x, d, 0 ); grb::set( mask, x ); grb::set( y, x );
grb::Vector< bool > eWiseEq( n ); bool eq = false;
while !eq do
    grb::mxv< descriptors::invert_mask | descriptors::no_casting
        | grb::descriptors::in_place >( y, mask, A, x, spR );
    grb::apply( eWiseEq, x, y, operators::is_equal< uint >() );
    eq = true; std::swap( x, y );
    grb::foldl( eq, eWiseEq, operators::is_equal< bool >() );
return x
Examples
Gabor Szarnyas recently produced an overview of algos:
I Breadth-First Search, Θ(|E|)
I Single-source Shortest Paths, Bellman-Ford, Θ(|V||E|)
I All-pairs shortest paths, Θ(|V|³)
I Minimum spanning tree, Boruvka, |E| log |V|
I Maximum flow, Θ(|V||E|²)
Some algorithms not at best known complexities:
I Dijkstra (SSSP), Prim (MST), max. independent sets, ...
Some recent work:
I Linear Algebraic Depth-First Search, Spampinato et al., 2019
I DFS previously a canonical ‘counter-example’ to GraphBLAS
I Adds permutation-based stack semantics to GraphBLAS
Summary
GraphBLAS is:
I a way to express graph algorithms in linear algebra
I a C11 standard with two compliant implementations
  I Tim Davis' SuiteSparse::GraphBLAS
  I IBM's GPI with a C11 GraphBLAS wrapper (Moreira et al.)
I two native C++ implementations (CMU, Huawei)
I increasingly many algorithms: LAGraph
Debate:
I LA is the ‘right way’ to look at graph algorithms
I one alternative: ‘think like a vertex’, a lot of adoption
Less debatable:
I Sparse LA software techniques can speed up graph algos
Central obstacles for SpMV multiplication performance
Shared-memory:
I limited memory throughput,
I inefficient cache use, and
I non-uniform memory access (NUMA) issues.
Distributed-memory:
I inefficient network use.
Initial analysis of Ax = b (SpMV) by roofline:
I read input matrix once
I two flops per nonzero
Arithmetic intensity is very low:
I minimise footprint of A
I vectorisation shouldn't help
(Image taken from da Silva et al., DOI 10.1155/2013/428078, Creative Commons Attribution License)
Bandwidth
Compression leads to better performance:
I Coordinate format storage (COO):
i = (0, 0, 1, 1, 2, 2, 2, 3)
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a_00, a_04, . . . , a_32)

for k = 0 to nz − 1
    y_{i_k} := y_{i_k} + v_k · x_{j_k}
The coordinate (COO) format: two flops versus five data words.
Θ(3nz) storage.
Bandwidth
Compression leads to better performance:
I Compressed Row Storage (CRS):
i_start = (0, 2, 4, 7, 8)    (replaces i = (0, 0, 1, 1, 2, 2, 2, 3))
j = (0, 4, 2, 4, 1, 3, 5, 2)
v = (a_00, a_04, . . . , a_32)

for i = 0 to m − 1
    for k = i_start_i to i_start_{i+1} − 1
        y_i := y_i + v_k · x_{j_k}
The CRS format:
From Θ(3nz) storage to Θ(2nz + m + 1).
Can do the same column-wise, leading to CCS.
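For reference, the two kernels above written out in plain C++11 (a sketch; the vector and index types are this sketch's own):

    #include <cstddef>
    #include <vector>

    // COO kernel: streams i, j, v per nonzero -- Θ(3 nz) storage
    void spmv_coo( const std::vector< std::size_t > &i, const std::vector< std::size_t > &j,
                   const std::vector< double > &v, const std::vector< double > &x,
                   std::vector< double > &y )
    {
        for( std::size_t k = 0; k < v.size(); ++k )
            y[ i[ k ] ] += v[ k ] * x[ j[ k ] ];
    }

    // CRS kernel: the row index is implicit in i_start -- Θ(2 nz + m + 1) storage
    void spmv_crs( const std::vector< std::size_t > &i_start, const std::vector< std::size_t > &j,
                   const std::vector< double > &v, const std::vector< double > &x,
                   std::vector< double > &y )
    {
        for( std::size_t i = 0; i + 1 < i_start.size(); ++i )
            for( std::size_t k = i_start[ i ]; k < i_start[ i + 1 ]; ++k )
                y[ i ] += v[ k ] * x[ j[ k ] ];
    }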
Inefficient cache use
Visualisation of the SpMV multiplication Ax = y with nonzeroes processed in row-major order (CRS):
Accesses on the input vector are completely unpredictable.
Two orthogonal solution classes:
I reordering matrix rows and/or columns
I reordering matrix nonzeroes (not compatible with CRS!)
Matrix permutations
Goes back to the 70s, linked to cache reuse for SpMVs by the 90s:
I Das et al. (1994). The design and implementation of a parallel unstructured Euler solver;
I Sivan Toledo. (1997). Improving the memory-system performance of sparse-matrix vector multiplication;
I Pinar & Heath. (1999). Improving performance of sparse matrix-vector multiplication.
Network data movement during distributed-memory SpMV and cache misses during shared-memory SpMV are both bounded by

    Σ_i (λ_i − 1),

the (λ − 1)-metric (row-net model and zig-zag storage; Y & B, '09).
I Lengauer, T. (1990). Combinatorial algorithms for integrated circuit layout. Springer Science & Business Media.
I Catalyurek, U. V., & Aykanat, C. (1999). Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7), 673-693.
I Catalyurek, U. V., & Aykanat, C. (2001). A Fine-Grain Hypergraph Model for 2D Decomposition of Sparse Matrices. In IPDPS (Vol. 1, p. 118).
I Vastenhouw, B., & Bisseling, R. H. (2005). A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review, 47(1), 67-95.
I Bisseling, R. H., & Meesen, W. (2005). Communication balancing in parallel sparse matrix-vector multiplication. Electronic Transactions on Numerical Analysis, 21, 47-65.
Matrix permutations
Row-net model:
I columns correspond to vertices, rows to hyperedges.
[Figure: a small 4×4 sparse matrix and its row-net hypergraph, with columns 1–4 as vertices and rows 1–4 as hyperedges.]

Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods, A. N. Yzelman & Rob H. Bisseling, SIAM Journal on Scientific Computing 31(4), pp. 3128-3154 (2009).
Matrix permutations
Can be done in 2D (medium- & fine-grain models) too. In practice:
Figure: the Stanford link matrix (left) and its 20-part reordering (right).
Sequential execution using CRS on Stanford:
18.99 (original), 9.92 (1D), 9.35 (2D) ms/mul.
Ref.: A Fine-Grain Hypergraph Model for 2D Decomposition of Sparse Matrices by Catalyurek & Aykanat (2001);
Two-dimensional cache-oblivious sparse matrix-vector multiplication by Yzelman & Bisseling (2011);
A medium-grain method for fast 2D bipartitioning of sparse matrices by Pelt & Bisseling (2014);
A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric SpMV by Alappat et al. (2019).
Nonzero reorderings

2D permutations cannot rely on CRS:
I vertical separators pollute cache; thus
I need to store nonzeroes in specific order.

This makes sense by itself as well; prior approaches include:
I Blocking combined with cache-oblivious traversals
I Hilbert on blocked dense matrix storage: Lorton & Wise, 2007
I Hilbert with COO: Haase, Liebmann, & Plank, 2007
I Hilbert with compression: Yzelman & Bisseling, ECMI '09
I Blocking with Morton inside blocks, COO+CRS: Buluc et al., 2009
I Morton-ordered blocks, quadtree store: Martone et al., 2010
I Hilbert-ordered blocks, fully compressed: Y & Bisseling, 2011
I Blocking with Hilbert-ordered blocks: Y & Roose, 2014

Sequential SpMV multiplication on the Wikipedia '07 link matrix:
345 (CRS), 203 (Hilbert), 245 (blocked Hilbert) ms/mul.
Cache-efficiency and bandwidth
Need to consider the whole picture; good cache efficiency but no compression? Compression but no cache optimisation? No gain!

A =
    4 1 3 0
    0 0 2 3
    0 0 0 2
    7 5 1 1

Bi-directional incremental CRS (BICRS):
    V  = [7 5 4 1 2 3 3 2 1 1]
    ΔJ = [0 1 3 1 5 4 5 4 3 1]
    ΔI = [3 -3 -2 1 -1 1 1 1]
Allows arbitrary traversals. Storage: Θ(2nz + row jumps + 1).
Vectorisation: no use, right?
Much faster with vectorisation on Xeon Phi 7120A (KNC). Why?
Latency-bound, not bandwidth-bound! Gather/scatter is critical.
Ref.: Yzelman. Generalised vectorisation for sparse matrix–vector multiplication. (2015)
NUMA
Each socket has local main memory where access is fast.
[Figure: two sockets, each with its CPUs attached to local memory.]

Memory access between sockets is slower, leading to non-uniform memory access (NUMA): access different sockets, different speed.
NUMA
I Access to only one socket: limited bandwidth
I Interleave memory pages across sockets: emulate uniform access
I Explicit data placement on sockets: best performance
One-dimensional data placement
Coarse-grain row-wise distribution, compressed, cache-optimised:
I explicit allocation of separate matrix parts per core,
I explicit allocation of the output vector on the various sockets,
I interleaved allocation of the input vector.
Ref.: Yzelman and Roose, "High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication", IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2013.31 (2014).
Two-dimensional data placement
Distribute row- and column-wise (individual nonzeroes):
I most work touches only local data,
I inter-process communication minimised by partitioning;
I incurs cost of partitioning.
Ref.: Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication,IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
Ref.: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memory parallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Some kernels illustrated
Fine-grained CRS, using OpenMP:
I no pre-processing required (vs. sequential)
I use numactl --interleave=all on NUMA system

#pragma omp parallel for private( i, k ) schedule( dynamic, 8 )
for i = 0 to m − 1 do
    for k = I_i to I_{i+1} − 1 do
        add V_k · x_{J_k} to y_i
Compressed Sparse Blocks, using Cilk:
I Block A into β × β submatrices,
I use numactl --interleave=all on NUMA system,
I omitted: need buffer if multiple threads on a row.

cilk_for each row of blocks
    for each block
        do SpMV with nonzeroes in Z-curve order
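As a sketch, the fine-grained OpenMP kernel above in C++11 over the CRS arrays I, J, V (names as in the pseudo-code; this is an illustration, not the paper's exact code):

    #include <cstddef>
    #include <vector>

    // fine-grained 1D CRS SpMV; threads grab chunks of 8 rows dynamically,
    // mirroring schedule( dynamic, 8 ) above
    void spmv_crs_omp( const std::vector< std::size_t > &I, const std::vector< std::size_t > &J,
                       const std::vector< double > &V, const std::vector< double > &x,
                       std::vector< double > &y )
    {
        const std::ptrdiff_t m = static_cast< std::ptrdiff_t >( I.size() ) - 1;
        #pragma omp parallel for schedule( dynamic, 8 )
        for( std::ptrdiff_t i = 0; i < m; ++i )
            for( std::size_t k = I[ i ]; k < I[ i + 1 ]; ++k )
                y[ i ] += V[ k ] * x[ J[ k ] ];
    }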
Some kernels illustrated
1D SpMV multiplication:
I when loading in A, distribute rows over threads
perform local SpMV

(That's really it!)

Caveat:
I take care of vector operations – they're distributed!
Some kernels illustrated
2D SpMV multiplication:
I partition and reorder A
for each j s.t. ∃ a_ij local to s while x_j is not local do
    bsp_get x_j from remote process
bsp_sync()

Perform SpMV y = Ax, using only those a_ij local to s

for each i s.t. ∃ a_ij local to s while y_i is not local do
    bsp_send (y_i, i) to the owner of y_i
bsp_sync()

while bsp_qsize() > 0 do
    (α, i) = bsp_move()
    add α to y_i
Results
Sequential CRS on Wikipedia ’07: 472 ms/mul. 40 threads BICRS:
21.3 (1D), 20.7 (2D) ms/mul. Speedup: ≈ 22x.
4 sockets, 10 core Intel Xeon E7-4870
Results
Average speedup on six large matrices:

                                   2 x 6    4 x 10    8 x 8
  –, 1D fine-grained, CRS∗           4.6       6.8      6.2
  Blocking, Morton, 1D FG, CSB       7.9      24.3     26.3
  Hilbert, Blocking, 1D, BICRS∗      5.4      19.2     24.6
  Hilbert, Blocking, 2D, BICRS†       −       21.3     30.8

†: uses an updated test set. (Added for reference versus a good 2D algorithm.)
As NUMA scales up, 1D algorithms lose efficiency.
Results
Cross-platform:
                      Structured   Unstructured   Average
  Intel Xeon Phi            21.6            8.7      15.2
  2x Ivy Bridge CPU         23.5           14.6      19.0
  NVIDIA K20X GPU           16.7           13.3      15.0
∗: Yzelman and Roose, High-Level Strategies for Parallel Shared-Memory Sparse Matrix–Vector Multiplication,IEEE Trans. Parallel and Distributed Systems, doi:10.1109/TPDS.2013.31 (2014).
†: Yzelman, Bisseling, Roose, and Meerbergen, MulticoreBSP for C: a high-performance library for shared-memoryparallel programming, Intl. J. Parallel Programming, doi:10.1007/s10766-013-0262-9 (2014).
Application to GraphBLAS
In summary: we know how to do high performance parallel SpMV.
I what about sparse matrix–sparse vector multiply (SpMSpV)?
I what about masks?
Suppose y = Ax , x sparse, no mask.
I Best data structure? Column-major!
Suppose y = Ax , x dense, but masked (y(find(m == 0)) = 0).
I Best data structure? Row-major!
Suppose y = Ax , x sparse, and masked?
I ...
Application to GraphBLAS
Sequential mode, four different SpM(Sp)V kernels:
I y = Ax , loop over nonzero indices of x : scatter, column-major
I y = Ax , loop over mask indices: gather, row-major
I y = xA, loop over nonzero indices of x : scatter, row-major
I y = xA, loop over mask indices: gather, column-major
Store matrix twice (row- and column-major). At runtime:
I choose variant with smallest loop-size.
Vector data structure, Θ(1) ops required for:
I checking if the ith vector entry is (non)zero
I jump to the next nonzero (iteration);
Use both an array & a stack to maintain vector nonzero indices.
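A minimal sketch (plain C++11; the class and member names are this sketch's own) of the 'array & stack' bookkeeping just mentioned: a dense assigned[] array gives Θ(1) membership tests, while the stack of indices gives Θ(1) iteration to the next nonzero and clearing proportional to the number of nonzeroes only.

    #include <cstddef>
    #include <vector>

    // sparse vector of length n: values[] and assigned[] are dense arrays,
    // nonzeroes is a stack of the indices that are currently assigned
    struct SparseVector {
        std::vector< double >      values;
        std::vector< bool >        assigned;
        std::vector< std::size_t > nonzeroes;

        explicit SparseVector( std::size_t n ) : values( n ), assigned( n, false ) {}

        bool isNonzero( std::size_t i ) const { return assigned[ i ]; }   // Θ(1)

        void setElement( std::size_t i, double v ) {                      // Θ(1)
            if( !assigned[ i ] ) { assigned[ i ] = true; nonzeroes.push_back( i ); }
            values[ i ] = v;
        }

        // clearing touches only the current nonzeroes, not all n entries
        void clear() {
            for( std::size_t i : nonzeroes ) { assigned[ i ] = false; }
            nonzeroes.clear();
        }
    };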
Application to GraphBLAS
We start simple. For shared-memory parallel:
I use OpenMP-based parallelisation
For distributed-memory parallel:
I 1D row-wise block-cyclic distribution of matrices (index arithmetic sketched below)
I matching block-cyclic distribution of vectors
I rely on OpenMP (or sequential) backend
Due to row-wise 1D distribution over p processes:
I y = Ax SpM(Sp)V requires globally synchronised/replicated x
I y = xA SpM(Sp)V results in p vectors y_k, with y = ∑_k y_k
Use stack-based synchronisation/reduction when useful:
I choose variant with lowest communication cost,
I requires standard collectives: allreduce, allgather, alltoall(v).
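For concreteness, a minimal sketch of the index arithmetic behind such a 1D row-wise block-cyclic map, assuming block size b and p processes (the actual layout in the implementation may differ): global row i is owned by process (i / b) mod p, and vectors follow the same map.

#include <cstddef>

struct BlockCyclic1D {
    size_t b; // block size
    size_t p; // number of processes
    // process that owns global row (or vector index) i
    size_t owner( size_t i ) const { return ( i / b ) % p; }
    // position of global index i within its owner's local storage
    size_t localIndex( size_t i ) const { return ( ( i / b ) / p ) * b + i % b; }
};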
In relation to graph frameworks
Since Google’s Pregel, a plethora of graph frameworks:
I CombBLAS, Gunrock, GraphX, Giraph, Ligra, PowerGraph, PowerLyra, Galois, GraphChi, TigerGraph, Venus, Neo4j, ArangoDB, Titan, ...
One of the main Ligra features:
I automatic push/pull ‘direction-optimisation’
I also done by us: scatter/gather & row/col kernel selection
In relation to other frameworks:
I MapReduce maps to BLAS level 1 (only vector ops)
I Pregel’s vertex-centric programs map directly to SpM(Sp)V (sketched below)
I See, e.g., GraphMAT by Sundaram et al. (2015).
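To make the Pregel-to-SpM(Sp)V mapping concrete, a hypothetical sketch in plain C++ (not any specific framework's API), over a column-major (CCS) adjacency structure in which column j lists the targets that vertex j sends to: one superstep, where every frontier vertex sends along its out-edges and receivers combine incoming messages, is exactly the scatter-style SpMSpV from the kernel slides.

#include <cstddef>
#include <vector>

void superstep( const std::vector< size_t > &colStart,  // CCS column pointers
                const std::vector< size_t > &rowIndex,  // CCS row indices (targets)
                const std::vector< double > &value,     // edge values
                const std::vector< size_t > &frontier,  // nonzero indices of x
                const std::vector< double > &x,         // outgoing messages
                std::vector< double > &y,               // combined incoming messages
                double (*combine)( double, double ) ) { // semiring 'addition'
    for( const size_t j : frontier ) {                  // loop over nonzeroes of x
        for( size_t k = colStart[ j ]; k < colStart[ j + 1 ]; ++k ) {
            // scatter: y_i ⊕= a_ij ⊗ x_j
            y[ rowIndex[ k ] ] = combine( y[ rowIndex[ k ] ], value[ k ] * x[ j ] );
        }
    }
}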
Interoperability
One way of dealing with many auxiliary frameworks:
I be compatible with most of them
I matches LPF’s philosophy (Suijlen & Y, 2019)
I distributed-memory GraphBLAS built on top of LPF
I allows interfacing with any framework over TCP/IP
I LPF processes reside in the same process space as the host
Applications:
I Spark ML acceleration using LPF
I Graph Analytics acceleration using LPF + GraphBLAS
I GraphBLAS on Spark
I Graph DB on Edge
Future work & Outlook
Implementation:
I use of Suijlen’s new BSP collectives (2019)
I compatibility layer with the C11 standard
Future Work:
I incorporate more advanced matrix partitioning methods?
I how to extend to multi-linear algebra (tensor computations)?
I applicable to hypergraph computations?
I graph- and sparse neural networks
I for which other (graph) algorithms is GraphBLAS applicable?
I which additions would enlarge suitable areas significantly?
Backup slides
GraphBLAS
A working example:
#include <graphblas.hpp>
int main() {
const size_t num_cities = ... //some input matrix size
grb::init();
grb::Matrix< double > distances( num_cities, num_cities );
grb::build( distances, ... ); //input data from file
//or memory
grb::Vector< double > x( num_cities ), y( num_cities );
grb::set( x, 0.0, 4 ); //set city number 4 to
//have distance 0.0
...
GraphBLAS
A working example (continued):
...
//declare an alternative semiring on doubles:
grb::Semiring< double, double, double, double,
grb::operators::min, //‘plus’
grb::operators::add, //‘multiply’
grb::identities::infinity, //‘0’
grb::identities::zero //‘1’
> ring;
//calculate the shortest distances from all cities to
//city #4, allowing only a single path
grb::mxv( y, distances, x, ring );
...
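For reference: with this semiring, ‘addition’ is min (identity ∞) and ‘multiplication’ is + (identity 0), so the mxv call above computes, entrywise, y_i = min_j ( a_ij + x_j ), where the minimum effectively runs over stored entries a_ij and assigned entries x_j only; everything unassigned acts as the additive identity ∞.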
GraphBLAS
A working example (continued):
...
//calculate the shortest distances from all cities to
//city #4, allowing only a single path
grb::mxv( y, distances, x, ring );
//calculate the shortest distances from all cities to
//city #4, allowing two ‘hops’
grb::mxv( x, distances, y, ring );
//example output via iterators and exit:
writeResult( x.cbegin(), x.cend(), ... );
grb::finalize();
return 0;
}
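Note that each further mxv under this semiring extends the paths by exactly one edge, x_i ← min_j ( a_ij + y_j ), so k applications yield shortest distances over exactly k hops; for at-most-k-hop (Bellman–Ford style) distances one would additionally take the element-wise min with the previous iterate, or give A an explicit zero diagonal.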
Vectorised BICRS
Vectorised BICRS: l = p × q = 2 × 2
A =
4 1 3 0
0 0 2 3
1 0 0 2
7 0 1 1
while there are nonzero blocks do
load next l row indices into r5
r5 = (0, 1, 2, 3)
gather output vector elements into r6 (using r5)
r6 = (y0, y1, y2, y3)
for offset = 0 to l − 1 step p
set r0 to all zero
handle all nonzero blocks sharing these rows
...
Vectorised BICRS: 2 × 2
A =
4 1 3 0
0 0 2 3
1 0 0 2
7 0 1 1
for each nonzero block do
load nonzeroes into r3, load nonzero column indices in r4
r3 = (4, 1, 2, 3), r4 = (0, 1, 2, 3)
gather corresponding elements from x into r2 (using r4)
r2 = (x0, x1, x2, x3)
do vectorised multiply-add
r1 = r1 + r2 · r3
Vectorised BICRS: 2 × 2
A =
4 1 3 0
0 0 2 3
1 0 0 2
7 0 1 1
while there are nonzero blocks do
...
set r0 to all zero
handle all nonzero blocks sharing these rows
reduce output r1 into r6
r6 = (y_offset + (r1)_1 + (r1)_2, y_(offset+1) + (r1)_3 + (r1)_4, . . .)
end for
scatter r6 onto y (using r5)
end while
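To make the inner block step concrete, a hypothetical scalar emulation of the l = 4 kernel (register names mirror the pseudocode above; a real implementation would use SIMD gather and fused multiply-add intrinsics instead of plain loops):

#include <cstddef>

static const size_t l = 4; // vector length, p × q = 2 × 2 blocks

// r4: column indices of the block's nonzeroes, r3: their values,
// r1: running accumulator for the block's rows (zeroed before the block loop)
void blockMultiplyAdd( const double *x, const size_t *r4, const double *r3,
                       double *r1 ) {
    double r2[ l ];
    for( size_t k = 0; k < l; ++k ) { r2[ k ] = x[ r4[ k ] ]; }       // gather from x
    for( size_t k = 0; k < l; ++k ) { r1[ k ] += r2[ k ] * r3[ k ]; } // multiply-add
}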
Sequential SpMV: reordering
Input
Sequential SpMV: reordering
Column partitioning
Sequential SpMV: reordering
Column permutation
Permuting to SBD form
Mixed row detection
Permuting to SBD form
Row permutation
Sequential SpMV: reordering
[figure: per-block cache-miss counts: no cache misses; 1 cache miss per row; 1 cache miss per row; 3 cache misses per row]
Sequential SpMV: reordering
[figure: per-block cache-miss counts: no cache misses; 1 cache miss per row; 3 cache misses; 1 cache miss per row; 7 cache misses per row; 1 cache miss per row; 3 cache misses per row; 1 cache miss per row]
Sequential SpMV: reordering
[figure panels: 1D (p = 20, ε = 0.1); Finegrain (p = 100, ε = 0.1)]