
Page 1:

High Performance Computing: Advanced MPI implementations

Jesper Larsson Träff, traff@par....

Parallel Computing, 184-5, Favoritenstrasse 16, 3rd floor

Office hours: by email appointment

Page 2:

High Performance Computing: MPI topics and algorithms

• “Advanced” features of MPI (Message-Passing Interface): Collective operations, non-blocking collectives, datatypes, one-sided communication, process topologies, MPI I/O, …

• Efficient implementation of MPI collectives: Algorithms under (simplified) network assumptions

Page 3:

MPI: the “Message-Passing Interface”

• Library with C and Fortran bindings that implements a message-passing model

• Current de facto standard in HPC and distributed memory parallel computing

PGAS competitors: UPC, CAF, OpenSHMEM, …

Page 4:

MPI: the “Message-Passing Interface”

• Library with C and Fortran bindings that implements a message-passing model

• Current de facto standard in HPC and distributed memory parallel computing

Open source implementations:

• mpich from Argonne National Laboratory (ANL)
• mvapich from Ohio State
• Open MPI, a community effort

from which many special purpose/vendor implementations are derived

Page 5:

Message-passing model

• Processors (in MPI: processes – something executed by a physical processor/core) execute program on local data

• Processors exchange data and synchronize by explicit communication; only way to exchange information (“shared nothing”)

Communication:
• Asynchronous (non-blocking) or synchronous (blocking)
• In-order, out-of-order
• One-to-one, one-to-many, many-to-one, many-to-many

Program: MIMD, SPMD

Page 6:

Strict message-passing (theoretical model):

Synchronous, in-order, one-to-one communication: Enforces event order, easier to reason about correctness, mostly deterministic execution, no race-conditions

Practical advantages:

• Enforces locality (no cache sharing, less memory per process)
• Synchronous communication can be implemented space-efficiently (no intermediate buffering)

Hoare: CSP

May: OCCAM

C. A. R. Hoare: Communicating Sequential Processes. Comm. ACM 21(8): 666-677 (1978)

Page 7:

MPI message-passing model

Three main communication models:

• Point-to-point: Two processes explicitly involved (MPI_Send, MPI_Recv)

• One-sided: One process explicitly involved (MPI_Put/Get/Accumulate)

• Collective: p≥1 processes involved (e.g., MPI_Bcast)

Page 8:

MPI message-passing model

All communication between named processes in same communication domain:

Communication domain (“communicator”): Ordered set of processes that can communicate (e.g., MPI_COMM_WORLD)

Name: For domain of p processes, rank between 0 and p-1


Page 9:

Note: MPI is a specification, not an implementation (the term "MPI library" refers to a specific implementation of the standard)

MPI specification/standard prescribes (almost) nothing about the implementation, in particular:

• No performance model (thus no assumptions on communication network)

• No performance guarantees
• No algorithms (for collective communication, synchronization, datatypes, etc.)
• Some assumptions about a "high quality implementation"

Page 10:

Point-to-point communication

[Figure: process i calls MPI_Send, process j calls MPI_Recv]

• Both sender and receiver explicitly involved in communication

• Communication is reliable (implementation and underlying communication infrastructure must guarantee this)

• Communication is ordered: Messages with same destination and same tag arrive in order sent (implementation must guarantee this)

i, j: name (“rank”) of process

Page 11:

Point-to-point communication


• Largely asynchronous model: MPI_Send has semi-local completion semantics, when call completes buffer can be reused (*)

• Blocking and non-blocking (“immediate”) semantics

i, j: name (“rank”) of process

Blocking MPI call: Returns when the operation is locally complete and the buffer can be reused; may or may not depend on actions of other processes; no guarantee regarding the state of other processes

(*) Whether MPI_Send can complete may depend on the MPI implementation

Page 12:

Point-to-point communication


• Largely asynchronous model: MPI_Send has semi-local completion semantics, when call completes buffer can be reused

• Blocking and non-blocking (“immediate”) semantics

i, j: name (“rank”) of process

Non-blocking MPI call: Returns immediately, independent of actions of other processes; the buffer may be in use. Completion as in the blocking call must be enforced (MPI_Wait, MPI_Test, …)

Page 13:

Point-to-point communication


• Largely asynchronous model: MPI_Send has semi-local completion semantics, when call completes buffer can be reused

• Blocking and non-blocking (“immediate”) semantics

i, j: name (“rank”) of process

Terminology note: Non-blocking in MPI is not a non-blocking progress condition, but means that the calling process can continue; progress of the communication eventually depends on progress of the other process

Page 14:

Point-to-point communication


• Largely asynchronous model: MPI_Send has semi-local completion semantics, when call completes buffer can be reused

• Blocking and non-blocking ("immediate") semantics
• Many send "modes": MPI_Ssend (MPI_Bsend, MPI_Rsend)

i, j: name (“rank”) of process

int x[count];

MPI_Send(x,count,MPI_INT,dest,tag,comm);

Page 15:

Point-to-point communication


• Largely asynchronous model: MPI_Send has semi-local completion semantics, when call completes buffer can be reused

• Blocking and non-blocking ("immediate") semantics
• Many send "modes": MPI_Ssend (MPI_Bsend, MPI_Rsend)

i, j: name (“rank”) of process

int x[count];

MPI_Issend(x,count,MPI_INT,dest,tag,comm,&request);

Page 16:

Point-to-point communication


• Largely asynchronous model: MPI_Send has semi-local completion semantics, when call completes buffer can be reused

• Blocking and non-blocking ("immediate") semantics
• Non-determinism only through "wildcards" in MPI_Recv

i, j: name (“rank”) of process

int x[count];

MPI_Recv(x,count,MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,comm,&status);
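
To make the calling conventions concrete, a minimal, self-contained sketch (not from the slides; message size, tag and buffer contents are illustrative) pairing a non-blocking synchronous send on rank 0 with a wildcard receive on rank 1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, x[4] = {1,2,3,4};
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  if (rank==0) {
    MPI_Request request;
    // non-blocking synchronous send: returns immediately; x must not be
    // reused before completion has been enforced
    MPI_Issend(x,4,MPI_INT,1,0,MPI_COMM_WORLD,&request);
    // ... unrelated computation could overlap with the communication here ...
    MPI_Wait(&request,MPI_STATUS_IGNORE); // enforce completion
  } else if (rank==1) {
    MPI_Status status;
    // wildcard receive: accept any source and any tag
    MPI_Recv(x,4,MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
    printf("received from rank %d with tag %d\n",status.MPI_SOURCE,status.MPI_TAG);
  }
  MPI_Finalize();
  return 0;
}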

Page 17:

One-sided communication

[Figure: MPI_Put/Get/Accumulate from origin process i to target process j]

• Only one process explicitly involved in communication; origin supplies all information

• Communication is reliable, but not ordered
• Non-blocking communication, completion at synchronization point
• Different types of synchronization (active/passive; collective/localized)

• Target memory in exposed/accessible window

Introduced with MPI 2.0; news in MPI 3.0

Page 18:

Collective communication


• All processes explicitly involved in data exchange or compute operation (one-to-many, many-to-one, many-to-many)

• All collective operations blocking in the MPI sense: Return when operation is locally complete, buffers can be reused

• Operations are reliable
• Collectives (except for MPI_Barrier) are not synchronizing

News in MPI 3.0

float *buffer = malloc(count*sizeof(float));

MPI_Bcast(buffer,count,MPI_FLOAT,root,comm);

Page 19:

Collective communication

Synchronizing: MPI_Barrier

Exchange, regular, rooted: MPI_Bcast, MPI_Gather/MPI_Scatter
Reduction, regular, rooted: MPI_Reduce

Exchange, non-rooted (symmetric): MPI_Allgather, MPI_Alltoall
Reduction, non-rooted (symmetric): MPI_Allreduce, MPI_Reduce_scatter_block, MPI_Scan/MPI_Exscan

Page 20:

Collective communication

Exchange, irregular (vector): MPI_Gatherv/MPI_Scatterv, MPI_Allgatherv, MPI_Alltoallv/MPI_Alltoallw
Reduction, irregular (vector): MPI_Reduce_scatter

Irregular collective: Size and basetype (technically: type signature) of buffers must match pairwise between processes, but can differ between different pairs
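
As an illustration of an irregular collective (a sketch, not from the slides; it assumes <stdlib.h> and an existing communicator comm, and the per-rank counts are made up): gather a different number of integers from every process at a root, with counts and displacements describing the pairwise-matching buffer sizes:

int size, rank, root = 0;
MPI_Comm_size(comm,&size);
MPI_Comm_rank(comm,&rank);
int sendcount = rank+1; // each rank contributes a different amount
int *sendbuf = malloc(sendcount*sizeof(int));
int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;
if (rank==root) {
  recvcounts = malloc(size*sizeof(int));
  displs = malloc(size*sizeof(int));
  int total = 0;
  for (int i=0; i<size; i++) {
    recvcounts[i] = i+1; // must match the corresponding sender's count
    displs[i] = total;   // where rank i's block starts in recvbuf
    total += recvcounts[i];
  }
  recvbuf = malloc(total*sizeof(int));
}
MPI_Gatherv(sendbuf,sendcount,MPI_INT,
            recvbuf,recvcounts,displs,MPI_INT,root,comm);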

Page 21:

All models

Communication is wrt. communicator: Ordered set of processes, mapped to physical nodes/processors/cores

In one-sided model, communicator is part of memory-window

[Figure: MPI process with rank i in communicator comm, running on core k]

MPI processes (mostly) statically bound to processors/cores

Processes have rank relative to communicator; the same process can have different ranks in different communicators

MPI_Comm_rank(comm,&rank);

MPI_Comm_size(comm,&size);

Page 22:

[Figure: the same MPI process on core k has rank i in comm1 and rank j in comm2]

Processes have rank relative to communicator; the same process can have different ranks in different communicators

Process with rank j in comm1 may have to send state/data to the process with rank j in comm2

MPI has multi-purpose (collective) operations for creating new communicators out of old ones. If a different mapping is needed, new communicator must be created (MPI objects are static)

Page 23:

All models

MPI terms: Derived datatypes, user-defined datatypes, … (MPI 3.1 standard, Chapter 4)

• Data in buffers can be arbitrarily structured, not necessarily only consecutive elements

• Data structure/layout communicated to MPI implementation by datatype handle

• Communication buffers: Buffer start (address), element count, element datatype

e.g. indexed datatype
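
A small sketch (not from the slides) of how a non-consecutive layout can be described by an indexed datatype and used directly in a send; the block lengths, displacements and the variables x, dest, tag, comm are illustrative:

int blocklens[3] = {1, 2, 1};
int displs[3]    = {0, 3, 7}; // displacements in elements, not bytes
MPI_Datatype seltype;
MPI_Type_indexed(3,blocklens,displs,MPI_INT,&seltype);
MPI_Type_commit(&seltype);
MPI_Send(x,1,seltype,dest,tag,comm); // sends x[0], x[3], x[4], x[7] in one message
MPI_Type_free(&seltype);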

Page 24:

Beyond message-passing MPI features

• MPI-IO: Serial and parallel IO, heavy use of derived datatypes in specification

• MPI process management (important concept: intercommunicators): Coarse grained management (spawn, connect) of additional processes

• Topologies: Mapping application communication patterns to communication system

• Tools support: Profiling interface and library information

Page 25:

Example: three algorithms for matrix-vector multiplication

m x n matrix A and n-element vector x, distributed evenly across p MPI processes: compute

y = Ax

with y the m-element result vector

Even distribution:
• Each of the p processes has an mn/p-element submatrix, an n/p-element subvector, and computes an m/p-element part of the result vector
• Algorithms should respect/preserve the distribution

Page 26:

Algorithm 1:
• Row-wise matrix distribution: A is split into blocks A0, A1, …, each Ai an (m/p) x n matrix
• Each process needs the full vector x: MPI_Allgather(v)
• Compute the block yi = Ai·x of the result vector locally

Conjecture: T(m,n,p) = O((m/p)n + n + log p), for p ≤ m. Why?
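
A sketch of Algorithm 1 (not from the slides; it assumes <mpi.h> and <stdlib.h>, m and n divisible by p, the local (m/p) x n block Alocal stored row-major, xlocal holding the n/p local input elements and ylocal the m/p local result elements):

void matvec_rowwise(const double *Alocal, const double *xlocal,
                    double *ylocal, int m, int n, MPI_Comm comm)
{
  int p;
  MPI_Comm_size(comm,&p);
  double *x = malloc(n*sizeof(double));
  // assemble the full input vector on every process
  MPI_Allgather(xlocal,n/p,MPI_DOUBLE,x,n/p,MPI_DOUBLE,comm);
  // local (m/p) x n times n multiplication
  for (int i=0; i<m/p; i++) {
    double s = 0.0;
    for (int j=0; j<n; j++) s += Alocal[i*n+j]*x[j];
    ylocal[i] = s;
  }
  free(x);
}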

Page 27:

Algorithm 2:
• Column-wise matrix distribution: A = (A0 | A1 | A2 | …), each Ai an m x (n/p) matrix
• Compute the local partial result vector Ai·xi
• MPI_Reduce_scatter to sum the partial results (y = A0x0 + A1x1 + A2x2 + …) and distribute the blocks of y

Conjecture: T(m,n,p) = O(m(n/p) + m + log p), for p ≤ n. Why?
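
A corresponding sketch of Algorithm 2 (same assumptions; Alocal is now the local m x (n/p) block stored row-major):

void matvec_colwise(const double *Alocal, const double *xlocal,
                    double *ylocal, int m, int n, MPI_Comm comm)
{
  int p;
  MPI_Comm_size(comm,&p);
  double *ypartial = malloc(m*sizeof(double));
  // local m x (n/p) times (n/p) multiplication: full-length partial result
  for (int i=0; i<m; i++) {
    double s = 0.0;
    for (int j=0; j<n/p; j++) s += Alocal[i*(n/p)+j]*xlocal[j];
    ypartial[i] = s;
  }
  // sum the p partial vectors and scatter m/p-element blocks of the sum
  MPI_Reduce_scatter_block(ypartial,ylocal,m/p,MPI_DOUBLE,MPI_SUM,comm);
  free(ypartial);
}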

Page 28:

Conjecture:

• MPI_Allgatherv(n) can be done in O(n+log p) communication steps

• MPI_Reduce_scatter(n) can be done in O(n+log p) communication steps

Achievable for certain types of networks; and no better is possible (regardless of network)

This lecture: Will substantiate the conjectures (proofs and constructions)

Page 29:

Algorithm 3:
• Matrix distribution into blocks of (m/r) x (n/c) elements
• Algorithm 1 on columns
• Algorithm 2 on rows

Factor p = rc; each process has an (m/r) x (n/c) submatrix and an n/(rc) = n/p subvector

Step 1: Simultaneous MPI_Allgather(v) on all c process columns

Page 30:

Algorithm 3 (continued):
• Matrix distribution into blocks of (m/r) x (n/c) elements
• Algorithm 1 on columns
• Algorithm 2 on rows

Factor p = rc; each process has an (m/r) x (n/c) submatrix and an n/(rc) = n/p subvector

Step 4: Simultaneous MPI_Reduce_scatter(_block) on the process row communicators

Page 31:

Algorithm 3 (continued):
• Matrix distribution into blocks of (m/r) x (n/c) elements
• Algorithm 1 on columns
• Algorithm 2 on rows

Factor p = rc; each process has an (m/r) x (n/c) submatrix and an n/(rc) = n/p subvector

Conjecture: T(m,n,p) = O(mn/p + n/c + m/r + log p), for p ≤ min(mc, nr)

Page 32:

Algorithm 3 (continued):
• Matrix distribution into blocks of (m/r) x (n/c) elements
• Algorithm 1 on columns
• Algorithm 2 on rows

Algorithm 3 is more scalable. To implement the algorithm, it is essential that independent, simultaneous, collective communication on (column and row) subsets of processes is possible

Note:Interfaces that do not support collectives on subsets of processes cannot express Algorithm 3 (e.g., UPC, CaF)


Page 33:

MPI Process topologies (MPI 3.1, Chapter 7)

Observation:For Algorithm 3, row and column subcommunicators are needed; must support concurrent collective operations

Convenient to organize the processes into a Cartesian r x c mesh with MPI processes in row-major order. Use this Cartesian (x,y) naming to create subcommunicators for the matrix-vector example

Cartesian naming useful for torus/mesh algorithms, e.g., for d-dimensional stencil computations (Jacobi, Life, …)

Page 34:

rcdim[0] = c; rcdim[1] = r;

period[0] = 0; period[1] = 0; // no wrap around

reorder = 0; // no attempt to reorder

MPI_Cart_create(comm,2,rcdim,period,reorder,&rccomm);

[Figure: the processes arranged in an r x c grid with Cartesian coordinates; e.g., process 0 at (0,0), process c-1 at (0,c-1), process c at (1,0)]

Same processes in the rccomm communicator as in comm. But in rccomm, processes also have a d-dimensional coordinate as "name".

But: Ranks may not be bound to the same processors (if reorder=1)

Page 35:

rcdim[0] = c; rcdim[1] = r;

period[0] = 0; period[1] = 0; // no wrap around

reorder = 0; // no attempt to reorder

MPI_Cart_create(comm,2,rcdim,period,reorder,&rccomm);


Translation functions

int rccoord[2];

MPI_Cart_coords(rccomm,rcrank,2,rccoord);

MPI_Cart_rank(rccomm,rccoord,&rcrank);

rank = ∑0≤i<d (coord[i]·∏i<j<d dims[j]) (row-major order); e.g., rank(2,c-2) = 2c + (c-2) = 3c-2

Page 36:

rcdim[0] = c; rcdim[1] = r;

period[0] = 0; period[1] = 0; // no wrap around

reorder = 0; // no attempt to reorder

MPI_Cart_create(comm,2,rcdim,period,reorder,&rccomm);


Communication in rccomm uses ranks (not coordinates). Communication between any ranks is possible/allowed. Processes are neighbors if they are adjacent in the Cartesian mesh/torus

Page 37:

int dimensions[d] = {0, …, 0}; // all dimensions free

MPI_Dims_create(p,d,dimensions);

Note: Seemingly innocent interface entails integer factorization. Complexity of factorization is unknown (probably high)

Note: This algorithm for 2-dimensional balancing is exponential:

f = (int)sqrt((double)p); // start from floor(sqrt(p))

while (p%f!=0) f--; // decrease until f divides p: largest factor <= sqrt(p)

Factor p into d dimensions that are as close to each other as possible… (MPI 3.0, p.292)

Pseudo-polynomial: Polynomial in the magnitude, but not the size of p

Page 38:

Jesper Larsson Träff, Felix Donatus Lübbe:Specification Guideline Violations by MPI_Dims_create. EuroMPI2015: 19:1-19:2

But… factorizations of small numbers (p on the order of millions) that fulfill the MPI specification (dimension sizes in increasing order) are not expensive:

Page 39:

MPI_Cart_create operation

• Defines Cartesian naming, creates new communicator, and caches information with this (number and sizes of dimensions …)

• Row-major ordering of processes (dimension order 0, 1, 2, …)

• Neighbors: The 2d adjacent processes, two in each dimension

• Defines order of neighbors (dimension-wise), important for neighborhood collectives

• May (reorder=1) attempt to remap processes, such that processes that are neighbors in virtual application topology are neighbors in physical system

Page 40:

Reordering processes via new communicators

[Figure: virtual communication topology over some ranks of oldcomm]

MPI assumption: The application programmer knows which processes will communicate (heaviest communication, most frequent communication): virtual topology

Idea: Convey this information to MPI library; library can suggest a good mapping of MPI processes to processors that fits communication system

Page 41:

MPI_Cart_create:

Implicit assumption: The process with rank i in the old communicator will likely communicate with the neighboring processes in the grid/torus, +1/-1 in all dimensions

[Figure: ranks of oldcomm with their assumed grid neighbors]

Page 42:

[Figure: the ranks of oldcomm arranged as a 4 x 4 grid, ranks 0…15 in row-major order, forming newcomm; wrap-around edges in dimension 1 if period[1]=1]

Page 43:

[Figure: 4 x 4 grid of newcomm ranks 0…15; wrap-around in dimension 1 if period[1]=1]

Ranks in newcomm are organized in a (virtual) d-dimensional mesh/torus

Good order: Ranks correspond to processor id’s in physical torus system

Physical torus systems: 3-dim, 5-dim, 6-dim; BlueGene, Cray, Fujitsu K


Page 44:

[Figure: oldcomm ranks in a "bad order" next to the 4 x 4 grid order of newcomm]

Process ranks in oldcomm and newcomm may differ

Topology creation with reorder=1: MPI library can attempt to map virtual topology to physical topology: newcomm

Page 45:

[Figure: mapping between oldcomm ranks, processors, and newcomm ranks]

Processes in oldcomm may have been mapped to other processes in newcomm

Example: Rank 1 in oldcomm mapped to processor 2 (rank 2 in newcomm), mapped to processor 1 in newcomm (which was rank 10 in oldcomm)


Page 46:

If reordering has been done (ranks in newcomm ≠ ranks in oldcomm)

• Has reordering been done? All processes check whether their ranks in the two communicators differ; an allreduce informs all (see the sketch below)

• Application may have to transfer data to same rank in new communicator

• Need to be able to translate between ranks in different communicators
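
A sketch of this check (assuming oldcomm and a reordered newcomm containing the same processes):

int oldrank, newrank, differs, any_differs;
MPI_Comm_rank(oldcomm,&oldrank);
MPI_Comm_rank(newcomm,&newrank);
differs = (oldrank!=newrank);
// logical OR over all processes: has any rank changed?
MPI_Allreduce(&differs,&any_differs,1,MPI_INT,MPI_LOR,newcomm);
if (any_differs) {
  // data redistribution from old ranks to new ranks is needed
}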

Page 47:


Fully general application communication patterns

• Weighted, directed (multi)graph to specify application communication pattern (whom with whom, how much)

• Hints can be provided (info)

• Distributed description, any MPI process can specify any edges

An edge weight of w=1000 could indicate communication volume/frequency/…

Page 48:

MPI_Dist_graph_create(comm,

n,sources,degrees,destinations,

weights,

info,reorder,&graphcomm)

Example: Process i may specify (what it knows):

sources: [0,5,7,2,…]
(out)degrees: [2,2,2,2,…]
destinations: [1,2,18,36,4,5,8,117,…]
weights: [100,1000,10,10,1000,1000,100,100,…]

MPI (3.0) uses graph structure for descriptive purposes (neighborhood collectives). MPI library can attempt to remap communication graph to fit target communication system. Other optimizations are possible.

Page 49:

int reorder = 1;

MPI_Dist_graph_create(comm,

n,sources,degrees,destinations,

weights,

info,reorder,&graphcomm)

If reordering is requested, rank i in new graphcomm may be bound to different processor than rank i in comm

Note:It is legal for an MPI library to return a graphcomm with the same mapping as comm

Data redistribution (from rank i in comm to rank i in graphcomm) can be necessary; must be done explicitly by application (first: Check if reordering has taken place)

Page 50:

MPI_Dist_graph_create_adjacent(comm,

indeg,sources,sourceweights,

outdeg,destinations,destweights,

info,reorder,&graphcomm)

If each process knows its incoming and outgoing communication edges, this variant may be more convenient and efficient

Furthermore, the operation does not require communication in order to support the neighborhood query functions:

MPI_Dist_graph_neighbors_count(graphcomm, …)

MPI_Dist_graph_neighbors(graphcomm, …)
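
A small sketch (not from the slides) of the adjacent variant for a directed ring, where each process knows exactly one incoming and one outgoing neighbor; weights and info are kept trivial:

int p, rank;
MPI_Comm_size(comm,&p);
MPI_Comm_rank(comm,&rank);
int src = (rank-1+p)%p, dst = (rank+1)%p;
int srcweight = 1, dstweight = 1;
MPI_Comm ringcomm;
MPI_Dist_graph_create_adjacent(comm,
    1,&src,&srcweight,          // one incoming edge
    1,&dst,&dstweight,          // one outgoing edge
    MPI_INFO_NULL,1,&ringcomm); // reorder allowed
int indeg, outdeg, weighted;
MPI_Dist_graph_neighbors_count(ringcomm,&indeg,&outdeg,&weighted);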

Page 51:

MPI 3.1 Virtual topologies

• “info” can be provided; its interpretation is not specified, implementation dependent

• Interpretation of weights unspecified (when? How much? How often? …)

• Differences between Cartesian and distributed graph topologies: Cartesian creation function (MPI_Cart_create) takes no weights, no info

Do MPI libraries perform non-trivial mapping? Try it

Page 52:

T. Hoefler, R. Rabenseifner, H. Ritzdorf, B. R. de Supinski, R. Thakur, Jesper Larsson Träff: The scalable process topology interface of MPI 2.2. Concurrency and Computation: Practice and Experience 23(4): 293-310 (2011)

Distributed graph interface since MPI 2.2.

Old MPI 1.0 interface is badly non-scalable: Full graph required at each process

Definition: An MPI construct is non-scalable if it entails Ω(p) memory or time overhead (*)

(*): that cannot be accounted for/amortized in the application

Page 53:

Implementing topology mapping

Step 1: The application specifies its communication pattern as a weighted, directed (guest) graph (implicitly: an unweighted, d-dimensional Cartesian pattern)

Step 2: The MPI library knows the physical communication structure; model the subgraph of the system spanned by the communicator as a weighted, directed (host) graph

Step 3: Map/embed the communication (guest) graph onto the system (host) graph, subject to… The map (embedding) is an injective function Γ from guest graph vertices to host graph vertices

Page 54:

Some optimization criteria:
• As many heavy communication edges on physical edges as possible
• Minimize communication costs (volume)
• Small congestion (bandwidth)
• Small dilation (latency)
• …

Meta-Theorem: Most graph embedding problems are NP-hard

Depending on physical system (e.g., clustered, hierarchical), problem can sometimes be formulated as a graph partitioning problem (e.g., clusters, partition into nodes, minimize weight of inter-node edges), …

Mapping function can be injective, surjective, bijective

Page 55:

Guest graph G=(V,E), let w(u,v) denote the volume of data from process (vertex) u to process (vertex) v

Host graph H=(V’,E’), let c(u’,v’) denote the network capacity between processors u’ and v’. Let R(u’,v’) be a function determining the path from processor u’ to v’ in the system (routing algorithm)

Let Γ: V -> V‘ be an injective mapping (embedding). The congestion of link e in E’ wrt. Γ is defined as

Cong(e) = ( ∑ of w(u,v) over all (u,v) in E with e in R(Γ(u),Γ(v)) ) / c(e)

The congestion of Γ is defined as the maximum congestion of any link e

Call the problem of deciding whether an embedding with congestion at most C exists CONG
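
For reference, the same definitions restated compactly in standard (LaTeX) notation; this is only a restatement of the formulas above:

\mathrm{Cong}(e) = \frac{1}{c(e)} \sum_{(u,v) \in E,\; e \in R(\Gamma(u),\Gamma(v))} w(u,v),
\qquad
\mathrm{Cong}(\Gamma) = \max_{e \in E'} \mathrm{Cong}(e)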

Page 56:

Concrete Theorem: For given guest and host graphs G and H, determining whether an embedding exists that has congestion at most C is NP-complete.

Proof: CONG is in NP. Non-deterministically choose an embedding, and check whether its congestion is at most C; this can be done in polynomial time

Completeness is by reduction from MINIMUM CUT INTO BOUNDED SETS (Garey&Johnson ND17): Given a graph G=(V,E), two vertices s,t in V, integer L, determine whether there is a partition of V into subsets V1 and V2 with s in V1, t in V2, |V1|=|V2|, such that the number of edges between V1 and V2 is at most L

Page 57:

Let (G,s,t,L) be an instance of MINIMUM CUT INTO BOUNDED SETS. Let the host graph H be constructed as follows:

Two complete graphs K_{|V|/2} on vertex sets U1 and U2, connected by a single edge (u1,u2) with u1 in U1, u2 in U2

c(u1,u2) = |V|, c(e) = 1 for all other edges of H. The routing function R selects for u,v in V' a shortest path from u to v

Let the guest graph be G with w(e) = 1 for e in E, except for the edge (s,t), for which w(s,t) = |V|^4 if (s,t) is not in E, and w(s,t) = |V|^4+1 if (s,t) is in E. That is, the guest graph may have one edge not in G, namely (s,t), and each edge in G contributes a volume of at least 1

Page 58:

Any injective mapping Γ from G to H determines a partition of V into V1 and V2 with |V1|=|V2|, namely Vi = {v | Γ(v) in Ui}

Any mapping that minimizes the congestion must map (s,t) to (u1,u2). This gives a congestion of at most |V|^3+|V|; any other mapping has congestion at least |V|^4.

In such a mapping, the most congested edge is (u1,u2), with congestion

|V|^3 + |{(v1,v2) in E | v1 in V1 and v2 in V2}| / |V|

Thus, a solution to CONG with congestion at most |V|^3 + L/|V| gives a solution to the MINIMUM CUT INTO BOUNDED SETS instance

Page 59:

Conversely, from a solution to the MINIMUM CUT INTO BOUNDED SETS an embedding with this congestion can easily be constructed (check!)

Proof from: T. Hoefler, M. Snir: Generic topology mapping strategies for large-scale parallel architectures. ICS 2011: 75-84

ND17, see: M. R. Garey, D. S. Johnson: Computers and Intractability. A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979 (update 1991)

Page 60:

CONG is a special case of the simple graph embedding problem

For given guest graph G=(V,E) with edge weights w(u,v), and host graph H=(V’,E’) with edge costs c(u’,v’), find an injective mapping (embedding) Γ: V -> V‘ that minimizes

∑ over (u,v) in E of w(u,v)·c(Γ(u),Γ(v))

This is the quadratic assignment problem (QAP), which is NP-complete (Garey&Johnson ND43)

w(u,v): volume of data to be transferred between processes u and v
c(u,v): cost per unit; could model pairwise communication costs (without congestion) in the network

Page 61:

• S. H. Bokhari: On the Mapping Problem. IEEE Trans. Computers (TC) 30(3): 207-214 (1981)
• T. Hatazaki: Rank Reordering Strategy for MPI Topology Creation Functions. PVM/MPI 1998: 188-195
• J. L. Träff: Implementing the MPI process topology mechanism. SC 2002: 1-14
• H. Yu, I-Hsin Chung, J. E. Moreira: Blue Gene system software - Topology mapping for Blue Gene/L supercomputer. SC 2006: 116
• A. Bhatele: Topology Aware Task Mapping. Encyclopedia of Parallel Computing 2011: 2057-2062
• A. Bhatele, L. V. Kalé: Benefits of Topology Aware Mapping for Mesh Interconnects. Parallel Processing Letters 18(4): 549-566 (2008)
• Emmanuel Jeannot, Guillaume Mercier, Francois Tessier: Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques. IEEE Trans. Parallel Distrib. Syst. 25(4): 993-1002 (2014)

Some MPI relevant solutions to the topology mapping problem

Page 62:

KaHIP, Karlsruhe High Quality Partitioning: recent, state-of-the-art graph partitioning (heuristic, multi-level) for large graphs; see http://algo2.iti.kit.edu/documents/kahip/

Graph partitioning software for process topology embedding

VieM, Vienna Mapping and Quadratic Assignment: solves the embedding problem for hierarchical systems as a special quadratic assignment problem; see http://viem.taa.univie.ac.at/

Christian Schulz, Jesper Larsson Träff: Better Process Mapping and Sparse Quadratic Assignment. CoRR abs/1702.04164 (2017), to appear at SEA 2017

Peter Sanders, Christian Schulz:Think Locally, Act Globally: Highly Balanced Graph Partitioning. SEA 2013: 164-175

Page 63:

METIS, ParMETIS, … : Well-known (early) software packages for (hyper)graph and mesh partitioning; see http://glaros.dtc.umn.edu/gkhome/projects/gp/products?q=views/metis

Scotch, a “software package and libraries for sequential and parallel graph partitioning, static mapping and clustering, sequential mesh and hypergraph partitioning, and sequential and parallel sparse matrix block ordering“; see http://www.labri.fr/perso/pelegrin/scotch/

Cédric Chevalier, François Pellegrini: PT-Scotch. Parallel Computing 34(6-8): 318-331 (2008)

George Karypis: METIS and ParMETIS. Encyclopedia of Parallel Computing 2011: 1117-1124

Graph partitioning software for process topology embedding

Page 64:

Jostle, originally for partitioning and load balancing of unstructured meshes; no longer open source (2017); see http://staffweb.cms.gre.ac.uk/~wc06/jostle/

Graph partitioning software for process topology embedding

C. Walshaw and M. Cross. JOSTLE: Parallel Multilevel Graph-Partitioning Software - An Overview. In F. Magoules, editor, Mesh Partitioning Techniques and Domain Decomposition Techniques, pages 27-58. Civil-Comp Ltd., 2007. (Invited chapter)

Page 65:

Topology mapping: Discussion

• For scalability, distributed specification of communication graph

• Requires distributed algorithm (heuristic) to solve/approximate hard optimization problem

• Can the mapping overhead be amortized?

• Does it make sense to look for an optimum? Is the MPI interface sufficiently rich to provide all relevant information? Does the application have this information?

• Is the application programmer prepared to use the complex interface?

Page 66:

Creating the communicators (Method 1)

int rcrank;

int rccoord[2];

MPI_Comm_rank(rccomm,&rcrank);

MPI_Cart_coords(rccomm,rcrank,2,rccoord);

MPI_Comm_split(rccomm,rccoord[0],rccoord[1],&ccomm);

MPI_Comm_split(rccomm,rccoord[1],rccoord[0],&rcomm);

“color”: all processes calling with the same color will belong to the same communicator

“key”: determines the relative process ordering in the new communicator

Page 67:

[Figure: the r x c process grid; MPI_Comm_split with color=rccoord[0] yields ccomm (one communicator per process column), MPI_Comm_split with color=rccoord[1] yields rcomm (one per process row)]
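
With ccomm and rcomm at hand, a sketch of Algorithm 3 for the matrix-vector example could look as follows (not from the slides; it assumes <mpi.h> and <stdlib.h>, p = rc, m divisible by r, n divisible by c and by p, the local (m/r) x (n/c) block Alocal stored row-major, xlocal holding the n/p local input elements, ylocal the m/p local result elements, ccomm grouping the r processes of a block column and rcomm the c processes of a block row):

void matvec_blocked(const double *Alocal, const double *xlocal, double *ylocal,
                    int m, int n, int r, int c, MPI_Comm ccomm, MPI_Comm rcomm)
{
  int p = r*c;
  // Step 1: assemble the n/c-element segment of x needed by this block column
  double *xcol = malloc((n/c)*sizeof(double));
  MPI_Allgather(xlocal,n/p,MPI_DOUBLE,xcol,n/p,MPI_DOUBLE,ccomm);
  // Steps 2-3: local (m/r) x (n/c) multiplication, m/r-element partial result
  double *ypartial = malloc((m/r)*sizeof(double));
  for (int i=0; i<m/r; i++) {
    double s = 0.0;
    for (int j=0; j<n/c; j++) s += Alocal[i*(n/c)+j]*xcol[j];
    ypartial[i] = s;
  }
  // Step 4: sum the c partial vectors in each process row and scatter
  // m/p-element blocks of the final result
  MPI_Reduce_scatter_block(ypartial,ylocal,m/p,MPI_DOUBLE,MPI_SUM,rcomm);
  free(xcol); free(ypartial);
}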

Page 68:

Creating the communicators (Method 2)

MPI_Comm_group(rccomm,&rcgroup); // get the group out of the communicator
MPI_Comm_rank(rccomm,&rcrank);
MPI_Cart_coords(rccomm,rcrank,2,rccoord);
c = 0; r = 0; // counters for the collected ranks
for (i=0; i<rcsize; i++) {
  MPI_Cart_coords(rccomm,i,2,coord);
  if (coord[0]==rccoord[0]) cranks[c++] = i;
  if (coord[1]==rccoord[1]) rranks[r++] = i;
}
MPI_Group_incl(rcgroup,c,cranks,&cgroup);
MPI_Group_incl(rcgroup,r,rranks,&rgroup);
MPI_Comm_create(rccomm,cgroup,&ccomm);
MPI_Comm_create(rccomm,rgroup,&rcomm);

MPI_Comm_split() is an expensive collective operation. MPI process groups can be used for possibly more efficient communicator creation

Process group operations

Better?

Page 69:

MPI process groups (MPI 3.1, Chapter 6)

Process local MPI objects representing ordered sets of processes

MPI_Comm_group: Get group of processes associated with communicator

Operations:
• Comparing groups (same processes, in same order)
• Group union, intersection, difference, …
• Inclusion and exclusion of groups
• Range groups
• Translation of ranks between groups

Processes in group ranked from 0 to size of group-1, a process can locally determine if it’s a member of some group, and its rank in this group

Page 70:

Creating the communicators (Method 2’)

MPI_Comm_group(rccomm,&rcgroup); // get the group out of the communicator
MPI_Comm_rank(rccomm,&rcrank);
MPI_Cart_coords(rccomm,rcrank,2,rccoord);
c = 0; r = 0; // counters for the collected ranks
for (i=0; i<rcsize; i++) {
  MPI_Cart_coords(rccomm,i,2,coord);
  if (coord[0]==rccoord[0]) cranks[c++] = i;
  if (coord[1]==rccoord[1]) rranks[r++] = i;
}
MPI_Group_incl(rcgroup,c,cranks,&cgroup);
MPI_Group_incl(rcgroup,r,rranks,&rgroup);
MPI_Comm_create_group(rccomm,cgroup,tag,&ccomm);
MPI_Comm_create_group(rccomm,rgroup,tag,&rcomm);

MPI_Comm_split() is an expensive collective operation. MPI process groups can be used for possibly more efficient communicator creation

Even better?

Page 71:

MPI 3.1 standard: Many variations, many subtleties: Read it

• MPI_Comm_create(comm, …): Collective over all processes in comm

• MPI_Comm_create_group(comm,group, …): Collective over all processes in group; the processes in group must be a subset of the processes in the group of comm

MPI_Comm_create_group: Smaller groups can work independently to create their communicator. Tag argument for multi-threaded uses (one of the many novelties in MPI 3.1)

Page 72:

All group operations can be implemented in O(s) operations, where s is the size of the largest group in the operation. Space consumption is O(U), where U is the maximum number of MPI processes in the system (size of MPI_COMM_WORLD)

Questions:
• Group (and communicator) operations permit construction of unrestricted rank->process mappings; this is costly in space. Possible to define a useful set of more restricted operations?

• Which group operations can be implemented in o(s) operations?

Jesper Larsson Träff: Compact and Efficient Implementation of the MPI Group Operations. EuroMPI 2010: 170-178

Page 73:

Example: Finding out where data from some rank (in oldcomm) has to be sent: where is that rank in newcomm?

MPI_Group_translate_ranks(MPI_Group group1, int n, const int ranks1[],
                          MPI_Group group2, int ranks2[])

Find fromrank, torank such that a process (in oldcomm) can receive data from fromrank and send its own data to torank

Recall: Cannot send from rank in oldcomm to rank in newcomm; communication is always in the same communicator

Page 74:

int torank, fromrank; // to be computed
int oldrank, newrank;
MPI_Group oldgroup, newgroup;

MPI_Comm_group(oldcomm,&oldgroup);
MPI_Comm_group(newcomm,&newgroup);

// where has this rank been mapped to?
MPI_Comm_rank(oldcomm,&oldrank); // rank in oldcomm
MPI_Group_translate_ranks(newgroup,1,&oldrank,oldgroup,&torank);
// torank may be MPI_UNDEFINED if newcomm is smaller than oldcomm

MPI_Comm_rank(newcomm,&newrank);
MPI_Group_translate_ranks(newgroup,1,&newrank,oldgroup,&fromrank);

Page 75:

Scalability of MPI communicator concept

• A communicator must support translation from MPI rank to physical core in system

• Storing translations explicitly as p-element lookup array (per core) is a non-scalable implementation. But still the most common

Example: 250,000 cores, 4-byte integers ≈ 1 MByte of space per communicator (per core). For applications on systems with limited memory, many communicators can be trouble

Note:MPI specification puts no restrictions on the number and kind of communicators that can be created

Page 76:

Experiment (1) at the Argonne National Lab BlueGene system: max. 8K communicators; the NEK5000 application actually failed, ran out of memory – only 264 communicators possible

Page 77:

Experiment (2) at the Argonne National Lab BlueGene system: more than 20% of memory used by communicator internals

Page 78:

Topological awareness

MPI_Comm_split_type(comm,MPI_COMM_TYPE_SHARED,key,info,&newcomm);

New in MPI 3.0, some support for better exploiting hierarchical systems/shared memory nodes

MPI_COMM_TYPE_SHARED: Creates communicator for processes where a shared memory region can be created (processes on same shared-memory node)

Orthogonal idea: Instead of letting the user specify communication requirements (communication topology), let the system suggest good communicators subject to specified constraints
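
A minimal usage sketch (the variable names are illustrative):

MPI_Comm nodecomm;
MPI_Comm_split_type(MPI_COMM_WORLD,MPI_COMM_TYPE_SHARED,0,
                    MPI_INFO_NULL,&nodecomm);
int noderank, nodesize;
MPI_Comm_rank(nodecomm,&noderank); // rank within the shared-memory node
MPI_Comm_size(nodecomm,&nodesize); // number of processes on this node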

Page 79:

P. Balaji, D. Buntinas, D. Goodell, W. Gropp, T. Hoefler, S. Kumar, E. L. Lusk, R. Thakur, Jesper Larsson Träff: MPI on Millions of Cores. Parallel Processing Letters 21(1): 45-60 (2011)

H. Kamal, S. M. Mirtaheri, A. Wagner: Scalability of communicators and groups in MPI. HPDC 2010: 264-275

P. Barnes, C. Carothers, D. Jefferson, J. LaPre: Warp speed: executing Time Warp on 1,966,080 cores. ACM SIGSIM-PADS, 2013

On scalability of MPI, especially communicators and interface

A recent (not so recent anymore, by now routine) result with many, many MPI processes (little detail given): MPI has so far been able to scale

Page 80:

Implementing MPI collective operations (MPI 3.1, Chapter 5)

Ideally, MPI library

• implements best known/possible algorithms for given communication network

• gives smooth performance for each operation in problem size, data layout, number of processes, process mapping (given by communicator), …

• has predictable performance (concrete performance model for concrete library?)

• has consistent performance between related operations

Page 81:

Questions (empirical and theoretical):

• How good are actual MPI libraries?

• What are realistic/possible expectations for a “high quality” MPI library?

• How to judge?

There is usually no single "one size fits all" algorithm; for most collectives, MPI libraries use a mix of different algorithms (depending on problem size, number of processes, placement in the network, …)

Possible answer:Benchmark with expectations

But, there are some recurring, common ideas in all these algorithms and implementations

Page 82:

Performance models, definitions, notations:

p: Number of (physical) processors ≈ number of MPI processes in the communicator
m: Total size of message, total size of problem in the collective

For hierarchical systems (SMP clusters, see later):
N: Number of nodes
n: Number of processes per node, p = nN (regular cluster)
ni: Number of processes at node Ni, p = ∑ni

Note: log p denotes base 2 logarithm, log2 p

Page 83:

Linear transmission cost model

First approximation:Model physical point-to-point communication time, time to transmit message between any two processors (processes)

t(m) = α+βm

α: start-up latency (unit: seconds)
β: time per unit, inverse bandwidth (unit: seconds/byte)


Page 84:

First approximation:Model physical point-to-point communication time, time to transmit message between any two processors (processes)

Note:
• Models transfer time; both processors are involved in the transfer (synchronous point-to-point)
• Assumes a homogeneous network: same transfer time for any pair of processes

Second assumption not realistic: SMP clusters, specific, high-diameter mesh/torus networks, …

t(m) = α+βm

Page 85:

Roger W. Hockney: Parametrization of computer performance. Parallel Computing 5(1-2): 97-103 (1987)
Roger W. Hockney: Performance parameters and benchmarking of supercomputers. Parallel Computing 17(10-11): 1111-1130 (1991)
Roger W. Hockney: The Communication Challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing 20(3): 389-398 (1994)

Linear transmission cost model is sometimes called Hockney-model (misnomer in MPI community)

Page 86:

[Figure: processor pi sending an m-unit message to pj; the transfer takes α+βm time, a start-up of α followed by βm transmission]

Page 87:

[Figure: an MPI_Send call by MPI process i on processor core k, passing through the MPI library and the NIC to the physical transfer]

MPI_Send(&x,c,datatype,dest,tag,comm);

Software and algorithmic(*) latency:
• decode arguments (datatype, comm, …)
• (optional) check arguments
• MPIR_Send(&x, …); // library-internal ADI
• Decide fabric, build envelope: communication context, source rank, size, mode info, … (Note: no datatype)

Hardware latency: setup, initiate transfer

Software α: 100-10000 instructions

(*) a factor for collective operations

Page 88:

Linear cost model on Jupiter (mpicroscope benchmark)

MPI processes on same SMP node: intra

MPI processes on different SMP nodes: inter

Jesper Larsson Träff: mpicroscope: Towards an MPI Benchmark Tool for Performance Guideline Verification. EuroMPI 2012: 100-109

Page 89:

Linear cost model on Jupiter (mpicroscope benchmark)

MPI processes on same SMP node: intra

MPI processes on different SMP nodes: inter

Not linear

Jesper Larsson Träff: mpicroscope: Towards an MPI Benchmark Tool for Performance Guideline Verification. EuroMPI 2012: 100-109

Page 90:

Linear cost model on Jupiter (mpicroscope benchmark)

Larger messages: approximately linear(?)

[Plots: intra-node and inter-node]

Page 91:

Linear cost model on Jupiter (mpicroscope benchmark)

Intra-node: α ≈ 1 μs, β ≈ 0.002 μs/Byte
Inter-node: α ≈ 2 μs, β ≈ 0.002 μs/Byte

Page 92:

Second refinement: For non-homogeneous systems, different processor pairs (i,j) may have different latencies and costs per unit:

tij(m) = αij + βij·m

Piecewise-linear model, short and long messages:

t(m) = α1+β1m, if 0≤m<M1; α2+β2m, if M1≤m<M2; α3+β3m, if M2≤m; etc.

Cannot model congestion

Figure: regular processor hierarchy with levels 0, …, k; level i is described by (ni, αi, βi).

Regular processor hierarchy: all nodes at level i have the same number of subnodes ni. Communication between processors at level i is modeled linearly by αi, βi; the system is described by the sequence (ni, αi, βi), i=0,…,k. The total number of processors is p=Πni.

A regular, hierarchical system can represent the αij, βij matrices more compactly, with lookup via a suitable tree structure


A more detailed model: LogGP

L: Latency, the time for a (small) message to travel through the network

o: overhead, for processor to start/complete message transfer (both send and receive)

g: gap, between injection of subsequent messages

G: Gap per byte, between injection of subsequent bytes for large messages

P: number of processors (homogeneous communication)

Account for time processor is busy with communication, permit overlapping communication/computation

Figure: LogGP time diagram of processor pi sending a small message to processor pj: send overhead o, network latency L, receive overhead o; the next message can be injected after the gap g.

Sending small message: o+L+o

Sender and receiver only involved for o seconds; overlap possible

Figure: LogGP time diagram of processor pi sending k small messages to processor pj; successive injections are separated by the gap g.

Sending k small messages: o+(k-1)g+L+o

Next message can be sent after g seconds (assume g≥o)

Figure: LogGP time diagram of processor pi sending one large, m-byte message to processor pj; after the first byte, successive bytes are injected every G time units.

Sending large message: o+(m-1)G+L+o

Figure: LogGP time diagram of processor pi sending k large, m-byte messages to processor pj.

Sending k large messages: o+k(m-1)G+(k-1)g+L+o

Figure: LogGP time diagram of a large-message transfer from processor pi to processor pj, for comparison with the linear cost model.

Compared to linear cost model (in which sender and receiver are occupied for the whole transfer, no overlap):α ≈ 2o+L, β ≈ G
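The four point-to-point costs above can be collected into a few helper functions for quick estimates. A minimal sketch; the struct and function names are illustrative assumptions, and the parameter values in main are invented, not measured:

#include <stdio.h>

/* LogGP point-to-point estimates matching the four cases above (sketch only). */
typedef struct { double L, o, g, G; } loggp;

static double t_small(loggp m)                 { return m.o + m.L + m.o; }
static double t_k_small(loggp m, int k)        { return m.o + (k-1)*m.g + m.L + m.o; }
static double t_large(loggp m, double bytes)   { return m.o + (bytes-1)*m.G + m.L + m.o; }
static double t_k_large(loggp m, int k, double bytes)
{
    return m.o + k*(bytes-1)*m.G + (k-1)*m.g + m.L + m.o;
}

int main(void)
{
    loggp m = { 1.0, 0.5, 0.8, 0.001 };  /* microseconds, assumed values */
    printf("1 small: %.2f us, 10 small: %.2f us, 1x1KiB: %.2f us, 10x1KiB: %.2f us\n",
           t_small(m), t_k_small(m, 10), t_large(m, 1024), t_k_large(m, 10, 1024));
    return 0;
}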

Starting point for LogGP, see:
• Alexandrov, M. F. Ionescu, Klaus E. Schauser, C. J. Scheiman: LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation. J. Parallel Distrib. Comput. 44(1): 71-79 (1997)
• D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, E. E. Santos, K. E. Schauser, R. Subramonian, T. von Eicken: LogP: A Practical Model of Parallel Computation. Commun. ACM 39(11): 78-85 (1996)

Many variations, some (few) results (optimal tree shapes):
• Eunice E. Santos: Optimal and Near-Optimal Algorithms for k-Item Broadcast. J. Parallel Distrib. Comput. 57(2): 121-139 (1999)
• Eunice E. Santos: Optimal and Efficient Algorithms for Summing and Prefix Summing on Parallel Machines. J. Parallel Distrib. Comput. 62(4): 517-543 (2002)


Many, more recent papers in MPI community use some variations of the LogP model (see papers by Hoefler and others)


Network assumptions

A communication network graph G=(V,E) describes the structure of the communication network (“topology”)

Processors i and j with (i,j) in E are neighbors and can communicate directly with each other

Linear cost for neighbor communication, all pairs of neighbors have same cost: Network is homogeneous

A processor has k, k≥1, communication ports, and can at any instant be involved in at most 2k communication operations (at most k sends and k receives)

• k=1: Single-ported communication

Most networks G=(V,E) are undirected: a communication edge (i,j) allows communication from i to j and from j to i

• Diameter of network G=(V,E): the maximum, over all processor pairs i,j in V, of the length of a shortest path between i and j

• Degree of network G=(V,E): the maximum, over all processors i, of the number of (outgoing and incoming) edges of i

• Bisection width: number of edges in a smallest cut (V1,V2) with |V1| ≈ |V2|

Single-ported, fully bidirectional, send-receive communication:

Processor i can simultaneously send to processor j and receive from processor k, with k=j (telephone model) or k≠j.

For sparse networks (mesh/torus) k-ported, bidirectional (telephone) communication is often possible and assumed (e.g., k=2d for a d-dimensional torus)

Some communication networks (actual, or as design vehicles):

Linear processor array (0, 1, 2, …): diameter p-1

Linear processor ring: diameter p-1, but now strongly connected

Fully connected network: complete graph, all (u,v) in E for u≠v in V; each processor can communicate directly with any other processor; diameter 1, bisection width p²/4

(r×c×…) d-dimensional mesh/torus: each processor has 2d neighbors, diameter (r-1)+(c-1)+…

Figure: two-dimensional mesh with row-order (or some other) numbering; the torus adds wrap-around edges.

Hypercube: (log2 p)-dimensional torus, size 2 in each dimension

Figure: hypercubes H0, H1, H2, H3.

Processors 0x (binary) and 1x (binary) are neighbors in Hi, for x in H(i-1), i≥1

Figure: hypercube H3 with processors 0 to 7.

Diameter: log p
Degree: log p, each processor has log p neighbors
Naming: the k'th neighbor of processor i is obtained by flipping the k'th bit of i

Remark: This is a particular, convenient naming of the processors in the hypercube
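The naming rule can be made concrete with one bit operation. A minimal sketch; the function name is an illustrative assumption, and p is assumed to be a power of two:

#include <stdio.h>

/* k'th neighbor of processor i in the hypercube naming above: flip bit k of i. */
static int hypercube_neighbor(int i, int k) { return i ^ (1 << k); }

int main(void)
{
    int i = 5; /* binary 101 in H3 */
    for (int k = 0; k < 3; k++)
        printf("neighbor %d of %d: %d\n", k, i, hypercube_neighbor(i, k));
    return 0;
}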


Communication algorithms in networks

Independent pairs consisting of a sending and a receiving processor in a network can communicate independently, concurrently, in parallel.


In a k-ported, bidirectional communication system, each processor belongs to at most 2k pairs (as sending or receiving).

Communication step with possible transmissions


Algorithm communication complexity:

In each step of the given algorithm there is a (maximal) set of such processor pairs, in each of which a message of size mi is transmitted. A step in which all these processor pairs communicate is called a communication round. The cost of a communication round is α + β·max(0≤i<k) mi

Assume all possible processor pairs communicate in each round. The communication complexity of the algorithm is the sum of the round costs in the worst case

Figure: independent processor pairs communicating in one round; each pair costs α+β·mi, the round costs α+β·max mi.


Algorithm communication complexity:

Alternatively: Communication takes place in synchronized rounds, in each of which as many processor pairs as possible communicate. The complexity is the number (and cost) of such rounds

Sometimes computation between rounds is not accounted for (too fast; not relevant; …)

Algorithm design goals:

• Fewest possible communication rounds
• High network utilization
• Balanced communication rounds (all mi≈mj)


For a synthesis with more networks:

Pierre Fraigniaud, Emmanuel Lazard: Methods and problems of communication in usual networks. Discrete Applied Mathematics 53(1-3): 79-133 (1994)


Abstract, network independent (“bridging”) round model: BSP

Parallel computation in synchronized rounds (“supersteps”), processors working on local data, followed by exchange of data (h-relation) and synchronization.

Claim: Any reasonable, real, parallel computer/network can realize the BSP model

h-relation: data exchange operation in which each processor sends or receives at most h units of data (and at least one processor sends or receives h data units); the data is visible after synchronization


A BSP(p,g,l,L) computer(*) consists of p processors with local memory, a router (network+algorithm) that can route arbitrary h-relations with a cost of g per data unit, a synchronization algorithm with cost l(p), and a minimum superstep duration L

Cost of routing h-relation: gh+l(p)

(*)There are different variants of the BSP computer/model; also other related models (CGM). Note that the parameters have some similarity to LogP


Superstep of BSP algorithm: Local computation, h-relation, synchronization

Figure: a superstep of duration ≥L: local computation, then an h-relation (here h=2), then synchronization of cost l.


BSP algorithm with S supersteps:

Each superstep either computation step or communication step.

• Cost per superstep: at least L
• Cost of computation step: W = max(max(0≤i<p) wi, L)
• Cost of communication step (h-relation): H = max(gh, L)
• Cost of synchronization after superstep: l

Total cost of algorithm: S·l + ∑s Ws + ∑s Hs (summing over the S supersteps)

Leslie G. Valiant: A Bridging Model for Parallel Computation. Commun. ACM 33(8): 103-111 (1990)


Rob H. Bisseling: Parallel Scientific Computation. A Structured Approach Using BSP and MPI. Oxford University Press, 2004 (reprinted 2010)

Implementing a BSP computer (library):
• Efficient h-relation (e.g., sparse, irregular MPI alltoall)
• Efficient synchronization

Designing good BSP algorithms:
• Small number of supersteps
• Small h-relations


Lower bounds on communication complexity

Broadcast operation (MPI_Bcast): One “root” process has data of size m to be communicated to all other processes

Diameter lower bound: In a 1-ported network with diameter d, broadcast takes at least d communication rounds, and time

Tbcast(m) ≥ αd + βm

Assume (for proofs) that communication takes place in synchronized rounds

Proof: A processor at distance d from the root must receive the data, and the distance of the data to that processor can decrease by at most one per round; this gives the αd term. All m data must eventually be transferred from the root, which gives the βm term.


Fully connected lower bound: In a fully connected, 1-ported network, broadcast takes at least ceil(log2 p) communication rounds, and time

Tbcast(m) ≥ α ceil(log2 p) + βm

Proof: The number of processors that have (some of) the data can at most double from one round to the next. By induction, after round k, k=0,1,…, at most 2^k processors have (some of) the data; therefore ceil(log2 p) rounds are required.


Multiple-message/pipelining lower bound: The number of communication rounds required to broadcast M blocks of data (in a fully connected, 1-ported network) is at least

M-1+ceil(log2 p)

Proof: The root can send at most one block per round, so the first M-1 blocks occupy at least M-1 rounds; the last block then still requires at least ceil(log2 p) rounds to reach all processors.


Observation:Assume the m data are arbitrarily divisible into M blocks (MPI: could be difficult for structured data described by derived datatypes). The best possible broadcast (also: reduction) time in the linear cost model on fully connected network is

T(m) = (ceil(log p)-1)α + 2√[(ceil(log p)-1)αβm] + βm

Proof: See pipeline lemma (but also next slide)

The βm term gives “full, optimal bandwidth”; the remaining terms contribute only o(m) extra latency.


By lower bound, best time for M blocks is

T(m,M) = (M-1+log p)(α+βm/M) = (log p - 1)α + Mα + (log p - 1)βm/M + βm

Balancing the Mα and (log p - 1)βm/M terms yields

• Best M: √[(log p-1)βm/α]
• Best blocksize m/M: √[αm/((log p-1)β)]

Question: Can the (log p)α+o(m)+βm bound be achieved?


Before/after semantics of the (MPI) collectives

Broadcast (root 0): before, only the root has the data x; after, every process 0,1,2,3 has x.

Scatter (root 0): before, the root has the blocks x0,x1,x2,x3; after, process i has block xi.

Gather (root 0): before, process i has block xi; after, the root has x0,x1,x2,x3.

Allgather: before, process i has block xi; after, every process has all blocks x0,x1,x2,x3 (in rank order).

Alltoall: before, process i has blocks 0xi,1xi,2xi,3xi (block jxi destined for process j); after, process i has blocks ix0,ix1,ix2,ix3 (block ixj received from process j).

Reduce (root 0): before, process i has vector xi; after, the root has ∑xi.

Allreduce: before, process i has xi; after, every process has ∑xi.

Reduce-scatter: before, every process has a full vector of blocks x0,x1,x2,x3; after, process i has the reduced block ∑xi (block i summed over all processes).

Scan/Exscan: before, process i has xi; after, process i has yi, where
Scan: yi = ∑(0≤j≤i) xj
Exscan: yi = ∑(0≤j<i) xj, no y0

Observations

• Gather and Scatter are “dual” operations
• Broadcast and Reduce are “dual” operations
• Alltoall ≈ p×p matrix transpose
• Allgather ≈ Gather + Broadcast
• Allreduce ≈ Reduce + Broadcast
• Reducescatter ≈ Reduce + Scatter
• Broadcast ≈ Scatter + Allgather
• Allreduce ≈ Reducescatter + Allgather
• Reduce ≈ Reducescatter + Gather
• Allgather ≈ ||(0≤i<p) Broadcast(i): p concurrent broadcasts, one rooted at each process i
• Alltoall ≈ ||(0≤i<p) Scatter(i): p concurrent scatters, one rooted at each process i

(each composition can be read off from the before/after pictures above)


MPI collectives

• Many specific requirements (arbitrary roots, datatypes, mapping in network, …)

• Any MPI communicator allows (sendrecv) communication between any pair of processes: Virtually fully connected

• Underlying (routing) system (software and hardware) should ensure good (homogeneous) performance between any process pair, under any communication pattern

Approach:
• Implement algorithms with send-receive operations, assume a fully connected network, use a virtual network structure as a design vehicle, and use the actual network for analysis and refinement

Figure: 4×4 torus, processes 0 to 15.

MPI is (too) powerful

Even if underlying network (hardware) is known, network specific algorithms may still not be useful

MPI comm (e.g., MPI_COMM_WORLD): Must support all collectives

Torus algorithms

Even for a subcommunicator (subcomm), all collectives must be supported. Torus/mesh algorithms may still be usable, possibly with a different, virtual processor numbering; for other subcommunicators the torus assumption simply does not hold.

Different, disjoint communicators (comm1, comm2) may share network resources (edges, ports). Analysis can hardly account for this: there is no way in MPI for one communicator to know what is done in other communicators.


MPI (send-receive) communication guarantees:

• Any correct communication algorithm, designed under any network assumption (structure, ports) will work when implemented in MPI (for any communicator)

• Performance depends on how communicator mapping and traffic fits the assumptions of the algorithm

• Fully bidirectional, send-receive communication: MPI_Sendrecv()

• Multi-ported communication: MPI_Isend()/MPI_Irecv() + MPI_Waitall()
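As an illustration of the last point, a k-ported exchange can be expressed by posting k nonblocking sends and k nonblocking receives and completing them together. A minimal sketch; the buffer layout, tag, datatype, and function name are assumptions for illustration:

#include <mpi.h>

/* Exchange one block of count doubles with each of k neighbors at once. */
void exchange_with_k_neighbors(const int *to, const int *from, int k,
                               double *sendbuf, double *recvbuf, int count,
                               MPI_Comm comm)
{
    MPI_Request req[2 * k]; /* C99 variable-length array for brevity */
    for (int i = 0; i < k; i++) {
        MPI_Irecv(recvbuf + i * count, count, MPI_DOUBLE, from[i], 0, comm, &req[i]);
        MPI_Isend(sendbuf + i * count, count, MPI_DOUBLE, to[i], 0, comm, &req[k + i]);
    }
    MPI_Waitall(2 * k, req, MPI_STATUSES_IGNORE);
}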


The “Van de Geijn” implementations

• Linear-array algorithms for large problems
• Fully-connected network algorithms (“Minimum Spanning Tree”) for small problems
• Heavy use of the Broadcast = Scatter + Allgather observation

Ignores many MPI specific problems: Buffer placement, datatypes, non-commutativity, …

Mostly concerned with collectives relevant for linear algebra; no scan/exscan, alltoall

E. Chan, M. Heimlich, A. Purkayastha, Robert A. van de Geijn: Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19(13): 1749-1783 (2007)


Geoffrey C. Fox, Mark Johnson, Gregory Lyzenga, Steve Otto, John Salmon, and David Walker: Solving Problems on Concurrent Processors. Volume 1: General Techniques and Regular Problems. Prentice-Hall, 1988

See also this interesting, quietly but highly influential (on MPI and other things), yet no longer very well known book

Linear-array scatter

Figure: linear processor array 0,1,2,3 (root 0); the root's send buffer holds m elements, divided into p blocks of m/p elements.

MPI_Scatter(sendbuf,scount,stype,

recvbuf,rcount,rtype,root,comm);

scount/rcount: number of elements in one block (m/p), root scatters p-1 blocks to other processes, copies own block


Root (0):

if (rank==root) for (i=p-1; i>0; i--) {
  MPI_Send(sendbuf[i],…,root+1,comm);
}

Non-root, 0<rank<size:

MPI_Recv(recvbuf,…,rank-1,…,comm);
for (i=rank; i<p-1; i++) {
  MPI_Sendrecv_replace(recvbuf,…,
                       rank-1,rank+1,…,comm);
}

1-ported, bidirectional send-receive communication

Tscatter(m) = (p-1)(α+βm/p) = (p-1)α + ((p-1)/p)βm

Optimal in β-term in a linear processor array with 1-ported, bidirectional, send-receive communication
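For concreteness, the linear-array scatter above can be written out as a small, self-contained routine. This is a sketch under simplifying assumptions (root fixed to process 0, blocks of blocksize MPI_DOUBLEs, forwarding towards higher ranks, illustrative function name), not the MPI library's implementation:

#include <mpi.h>

void array_scatter(const double *sendbuf, double *block, int blocksize, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* Root: push blocks p-1, p-2, ..., 1 towards process 1, keep block 0. */
        for (int i = size - 1; i > 0; i--)
            MPI_Send(sendbuf + i * blocksize, blocksize, MPI_DOUBLE, 1, 0, comm);
        for (int j = 0; j < blocksize; j++) block[j] = sendbuf[j];
    } else {
        /* Non-root: receive p-rank blocks from the left; forward all but the
           last one (which is this process's own block) to the right. */
        MPI_Recv(block, blocksize, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        for (int i = rank; i < size - 1; i++)
            MPI_Sendrecv_replace(block, blocksize, MPI_DOUBLE,
                                 rank + 1, 0,   /* send current block right */
                                 rank - 1, 0,   /* receive next block from left */
                                 comm, MPI_STATUS_IGNORE);
    }
}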

Linear-ring allgather

Figure: ring of processes 0,1,2,3; each process contributes one block, and the blocks travel around the ring.

MPI_Allgather(sendbuf,scount,stype,

recvbuf,rcount,rtype,comm);

scount/rcount: number of elements in one block, each process contributes one block and gathers p-1 blocks

MPI: all processes store the gathered blocks in rank order; each block has m/p elements

for (i=0; i<p-1; i++) {
  MPI_Sendrecv(recvbuf[(rank-i+size)%size],…,
               recvbuf[(rank-i-1+size)%size],…,comm);
}



Tallgather(m) = (p-1)(α+βm/p) = (p-1)α + ((p-1)/p)βm

Optimal in β-term; non-optimal in α-term
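The ring allgather above can likewise be written out concretely. A sketch under simplifying assumptions (blocks of blocksize MPI_DOUBLEs, blocks stored in rank order in recvbuf, illustrative function name):

#include <mpi.h>
#include <string.h>

void ring_allgather(const double *myblock, double *recvbuf, int blocksize,
                    MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;

    /* Own block starts in its final (rank-ordered) position. */
    memcpy(recvbuf + rank * blocksize, myblock, blocksize * sizeof(double));

    for (int i = 0; i < size - 1; i++) {
        int sendidx = (rank - i + size) % size;      /* block sent this round */
        int recvidx = (rank - i - 1 + size) % size;  /* block received this round */
        MPI_Sendrecv(recvbuf + sendidx * blocksize, blocksize, MPI_DOUBLE, right, 0,
                     recvbuf + recvidx * blocksize, blocksize, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}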


Tbroadcast(m) = Tscatter(m) + Tallgather(m) = 2(p-1)α + 2((p-1)/p)βm

Factor 2 from optimal in β-term

MPI_Bcast(buffer,count,datatype,root,comm);

Linear-ring reduce-scatter

MPI_Reduce_scatter_block(sendbuf,

resultbuf,count,datatype,

op,comm);

All processes have their own m-vector; each receives one reduced block of m/p elements


Aside: Requirements for MPI reduction collectives

• op one of MPI_SUM (+), MPI_PROD (*), MPI_BAND, MPI_LAND, MPI_MAX, MPI_MIN, …

• Special: MPI_MINLOC, MPI_MAXLOC
• Work on vectors of specific basetypes
• User-defined operations on any type
• All operations assumed to be associative (note: floating-point addition etc. is not)
• Built-in operations also commutative

• Reduction in (canonical) rank order
• Result should be the same irrespective of process placement (communicator)
• Preferably the same order for all vector elements
• sendbuf must not be destroyed (cannot be used as a temporary buffer)

“High quality” requirements


Let Xi be the vector contributed by MPI process i

Order and bracketing: Chosen bracketing should be same for all vector elements (careful with pipelined or blocked, shared-memory implementations), e.g.,

((X0+X1)+(X2+X3))+((X4+X5)+(X6+X7))

And same, for any communicator of same size (careful with mix of algorithms for hierarchical systems)


for (i=1; i<p; i++) {
  MPI_Sendrecv(recvbuf[(rank-i+size)%size],…,tmp,…,comm);
  recvbuf[(rank-i-1+size)%size] =
    tmp OP recvbuf[(rank-i-1+size)%size]; // do MPI op
}


Note: Technically, it was not possible to implement MPI_Reduce_scatter algorithms on top of MPI before MPI 2.2


Need for (MPI 2.2 function):MPI_Reduce_local(in,inout,count,datatype,op);



for (i=1; i<p; i++) {
  MPI_Sendrecv(recvbuf[(rank-i+size)%size],…,tmp,…,comm);
  MPI_Reduce_local(tmp,&recvbuf[(rank-i-1+size)%size],…,
                   op);
}



p-1 rounds i=1,2,…,p-1; after round i, process rank has computed ∑(rank-i≤j≤rank): x[rank-1-i]

But: this exploits commutativity of +. MPI (user-defined) operations may not be commutative; not suited for all MPI ops/datatypes


Recall: MPI assumes that all operations for MPI_Reduce etc. are associative; but floating point operations are not associative, e.g., (large+small)+small ≠ large+(small+small)
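A tiny, non-MPI example of this non-associativity in double precision:

#include <stdio.h>

/* (large+small)+small loses both small contributions to rounding, while
   large+(small+small) keeps their sum. */
int main(void)
{
    double large = 1.0e16, small = 1.0;
    printf("(large+small)+small = %.1f\n", (large + small) + small);
    printf("large+(small+small) = %.1f\n", large + (small + small));
    return 0;
}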



Treducescatter(m) = (p-1)(α+βm/p) = (p-1)α + ((p-1)/p)βm

Optimal in β-term; non-optimal in α-term

Ignoring time to perform p-1 m/p block reductions


Combining reduce-scatter, allgather/gather immediately gives

Tallreduce(m) = 2(p-1)α + 2(p-1)/p βm

Treduce(m) = 2(p-1)α + 2(p-1)/p βm

Factor 2 from optimal in β-term

MPI_Allreduce(sendbuf,recvbuf,count,datatype,op,

comm);

MPI_Reduce(sendbuf,recvbuf,count,datatype,op,root,

comm);


The power of pipelining

Assume m large, m>p

Assume m can be arbitrarily divided into smaller, roughly equal sized blocks

Choose M, the number of rounds, and send blocks of size m/M one after the other

MPI technicality: If datatype (structure of data element in buffer of count elements) describes a large element, dividing into blocks of m/M units require special capabilities from MPI library implementation

MPI_Bcast(buffer,count,datatype,root,comm);

Pipelined broadcast

Figure: linear array of processes 0,1,2,3,4 (root 0); the m-element buffer is divided into M blocks of m/M elements each.

Root (0):

for (b=0; b<M; b++) {
  MPI_Send(buffer[b],…,rank+1,…,comm);
}

Non-root, rank<size-1:

MPI_Recv(buffer[0],…,comm);
for (b=1; b<M; b++) {
  MPI_Sendrecv(buffer[b-1],…,buffer[b],…,comm);
}
MPI_Send(buffer[M-1],…,comm);

Figure (successive rounds): the blocks move down the array one process per round; in steady state every intermediate process forwards block b-1 while receiving block b.

Non-root, rank=size-1:

for (b=0; b<M; b++) {
  MPI_Recv(buffer[b],…,comm);
}

• Last processor receives the first block after p-1 rounds
• Last processor completes after M+p-1 rounds
• Processor i receives its first block after i rounds, and a new block in every following round
• Root completes after M rounds
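Putting the three cases together, the pipelined broadcast can be sketched as one routine. Assumptions for illustration: root 0, M blocks of blocksize MPI_DOUBLEs (so m = M*blocksize divides evenly), illustrative function name:

#include <mpi.h>

void pipelined_bcast(double *buffer, int M, int blocksize, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (size == 1) return;

    if (rank == 0) {                    /* root: send block after block */
        for (int b = 0; b < M; b++)
            MPI_Send(buffer + b * blocksize, blocksize, MPI_DOUBLE, 1, 0, comm);
    } else if (rank < size - 1) {       /* middle: forward previous block while receiving the next */
        MPI_Recv(buffer, blocksize, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        for (int b = 1; b < M; b++)
            MPI_Sendrecv(buffer + (b - 1) * blocksize, blocksize, MPI_DOUBLE, rank + 1, 0,
                         buffer + b * blocksize, blocksize, MPI_DOUBLE, rank - 1, 0,
                         comm, MPI_STATUS_IGNORE);
        MPI_Send(buffer + (M - 1) * blocksize, blocksize, MPI_DOUBLE, rank + 1, 0, comm);
    } else {                            /* last process: only receive */
        for (int b = 0; b < M; b++)
            MPI_Recv(buffer + b * blocksize, blocksize, MPI_DOUBLE, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);
    }
}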


Observation:

Tbroadcast(m) = (M+p-1)(α+βm/M)

A best possible number of blocks and a best possible block size can easily be found: Pipelining lemma

Similar linear pipelined algorithms for reduction, scan/exscan

MPI_Reduce needs special care for root≠0 and root≠p-1

Pipelining lemma: With a latency of k rounds for the first block, and a new block every s rounds, the best possible time (in the linear cost model) for a pipelined algorithm that divides m into blocks is

(k-s)α + 2√[s(k-s)αβm] + sβm

Proof: Pipelining with M blocks takes (k+s(M-1))(α+βm/M) = (k-s)α + sMα + (k-s)βm/M + sβm. Balancing the sMα and (k-s)βm/M terms gives the best M of

M = √[(k-s)βm/(sα)]

Substitution yields the claim.


Corollary:Best possible time for linear pipeline broadcast is

(p-2)α+ 2√[(p-2)αβm] + βm

since k=p-1 and s=1. Optimal in β-term

Practical relevance: Extremely simple, good when m>>p
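The corollary is easy to evaluate numerically. A small sketch computing the best M, the block size, and the resulting time; the α, β, m, p values are invented assumptions for illustration:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double alpha = 2.0e-6;   /* s, assumed */
    double beta  = 2.0e-9;   /* s/Byte, assumed */
    double m     = 1.0e8;    /* bytes, assumed */
    int    p     = 64;

    /* k = p-1, s = 1, so best M = sqrt((p-2)*beta*m/alpha). */
    double M = sqrt((p - 2) * beta * m / alpha);
    double T = (p - 2) * alpha + 2.0 * sqrt((p - 2) * alpha * beta * m) + beta * m;

    printf("best M ~ %.0f blocks, block size ~ %.0f bytes, T ~ %.4f s\n",
           M, m / M, T);
    return 0;
}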

MPI difficulty with pipelined algorithms: structured buffers

Figure: the buffer is count repetitions of the datatype, divided into pipelining blocks 0 to 4.

MPI library needs internal functionality to access parts of structured buffers. MPI specification does not expose any such functionality

Hierarchical communication systems

Many/most HPC systems have a (two-level) hierarchical communication system, e.g.:

Figure: two shared-memory nodes (processes 0,1,2,3 and 4,5,6,7) connected by a communication network (torus/tree/…); "shared-memory" communication inside each node.

All processors (cores) inside a node share the inter-node communication bandwidth to the network.


SMP/multi-core cluster:

• Intra-node communication via shared memory: Somewhat like fully connected, 1-ported, bidirectional

• Inter-node communication via network: k-ported (with small fixed k, often k=1), bidirectional communication with network characteristics, e.g., torus, fat tree (hierarchical), fully connected (rare), …


Linear-array/ring algorithms for MPI collectives can often be embedded efficiently into hierarchical network

No conflicts on inter-node communication, so all linear algorithms can be used

Example:

Jesper Larsson Träff, Andreas Ripke, Christian Siebert, Pavan Balaji, Rajeev Thakur, William Gropp: A Pipelined Algorithm for Large, Irregular All-Gather Problems. IJHPCA 24(1): 58-68 (2010)


But: MPI allows creation of arbitrary subsets of processes (communicators), can lead to contention/serialization on inter-node network


MPI_Comm_split(comm,color=0,key=random,&newcomm);


For “commutative” collectives (Broadcast, Allgather, Gather/Scatter): Use linear, virtual process numbering

Reduction: If operator is commutative, use linear, virtual ordering. Scan/Exscan: Reordering not possible


Complication: Data reordering may be required at different steps (e.g., MPI_Gather must receive blocks in rank order)


Observations, hierarchical collectives

c: communicator

• c0, c1, c2, …, cN-1: partition of c into subcommunicators, each ci fully on one node, size(ci) = ni (= n for a homogeneous communicator)
• C: subcommunicator of c with one process from each ci

Figure: communicator c over the nodes, partitioned into per-node subcommunicators c0, c1, …, cN-1.

MPI_Comm_split_type(c,MPI_COMM_TYPE_SHARED,key,

info,&ci);
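A minimal sketch of building both communicators; the convention of taking local rank 0 as the node representative, and the key choices, are assumptions for illustration, not the only possibility:

#include <mpi.h>

void build_hierarchy(MPI_Comm c, MPI_Comm *ci, MPI_Comm *C)
{
    int rank, noderank;
    MPI_Comm_rank(c, &rank);

    /* Processes on the same shared-memory node end up in the same ci. */
    MPI_Comm_split_type(c, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, ci);

    MPI_Comm_rank(*ci, &noderank);
    /* One process per node (local rank 0) forms C; all others get MPI_COMM_NULL. */
    MPI_Comm_split(c, (noderank == 0) ? 0 : MPI_UNDEFINED, rank, C);
}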


Bcast(c):
1. Bcast(C)
2. In parallel: Bcast(ci)

Reduce(c):
1. In parallel: Reduce(ci)
2. Reduce(C)

If all the processes in ci are in order of c

Note:Time for Bcast(ci) may differ since size(ci) may be different from size(cj)

Note: number of rounds at least ceil(log n)+ceil(log N) ≥ ceil(log n + log N) = ceil(log p), since n=p/N; but also ceil(log n)+ceil(log N) ≤ ceil(log p)+1


Allreduce(c):
1. In parallel: Reduce(ci)
2. Allreduce(C)
3. In parallel: Bcast(ci)

If all the processes in ci are in order of c

Allgather(c):
1. In parallel: Gather(ci)
2. Allgatherv(C)
3. Bcast(ci)

Note: to solve the regular problem, an algorithm for an irregular problem is needed

Similar observations for Gather/Scatter, Scan/Exscan


Alltoall(c):
1. In parallel: Gather(ci)
2. Alltoallv(C)
3. In parallel: Scatter(ci)

Scan(c):
1. In parallel: Reduce(ci)
2. Exscan(C)
3. In parallel: Scan(ci)

If all the processes in ci are in order of c

Note: To solve regular problem, algorithm for irregular problem is needed


Jesper Larsson Träff, Antoine Rougier: MPI Collectives and Datatypes for Hierarchical All-to-all Communication. EuroMPI/ASIA 2014: 27
Jesper Larsson Träff, Antoine Rougier: Zero-copy, Hierarchical Gather is not possible with MPI Datatypes and Collectives. EuroMPI/ASIA 2014: 39

Writing own, hierarchical collectives, avoiding explicit data reordering

A good MPI library internally implements collective operations in a hierarchical fashion (shared memory part, network part, …)


The power of trees: Binomial tree gather

Assume (some convenient) tree can be embedded in communication network: Each tree edge is mapped to a network edge. Many processors in tree receive blocks from subtrees in parallel

For simplicity, continue to assume root=0

Case root≠0 can be handled by “shifting towards 0” (set rank’ = (rank-root+size)%size) for broadcast, gather/scatter. For reduction to root, same works for commutative MPI operations, otherwise, modifications are necessary

Figure: binomial tree on seven processes 0 to 6, rooted at process 0.

Binomial tree for gathering to root=0

Processors numbered by pre-order traversal

Binomial tree gather in rounds

Figure: binomial tree gather on processes 0 to 6; edge labels show the transfer costs α+βm/p, α+β·2m/p, α+β·3m/p, growing towards the root over rounds 0, 1, 2.

• Three (3) communication rounds (= ceil(log 7))
• Each child can be found in O(1) steps, followed by the parent in O(1) steps
• At most 2^i·m/p data per round, i=0,1,…,ceil(log p)-1
• Total time in the linear transmission cost model: 3α+β·6m/p

Figure: processes 0 to 6; the root (0) gathers into recvbuf, while intermediate processes (here 2 and 4) collect their subtree's blocks of m/p elements into a temporary buffer tmpbuf.

MPI_Gather(sendbuf,scount,stype,

recvbuf,rcount,rtype,root,comm);




• ceil(log p) communication rounds
• Root process 0 is active in each communication round
• Non-root processes send only once: in round k, where k (=0,1,…,log p-1) is the first set bit of the process rank
• The amount of data gathered doubles in each round


d = 1;
while ((rank&d)!=d && d<size) {
  if (rank+d<size) MPI_Recv(tmpbuf,…,rank+d,…,comm);
  d <<= 1;
}
if (rank!=root) MPI_Send(tmpbuf,…,rank-d,…,comm);


Tgather(m) = ceil(log p)α + βm(p-1)/p

Optimal in α-term, optimal in β-term

Note: In each round, a non-idle processor either only sends or only receives. Each processor can determine in O(1) steps what to do
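The gather loop can be made concrete once the buffer sizes are worked out: the child at distance d contributes min(d, size-(rank+d)) blocks. A sketch under simplifying assumptions (root 0, blocks of blocksize MPI_DOUBLEs, tmpbuf large enough for p blocks, illustrative function name):

#include <mpi.h>
#include <string.h>

void binomial_gather(const double *myblock, double *tmpbuf, double *recvbuf,
                     int blocksize, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *buf = (rank == 0) ? recvbuf : tmpbuf;
    memcpy(buf, myblock, blocksize * sizeof(double));
    int have = 1;                     /* number of blocks gathered so far */

    int d = 1;
    while ((rank & d) != d && d < size) {
        if (rank + d < size) {
            /* child rank+d contributes blocks for ranks rank+d .. min(rank+2d,size)-1 */
            int nblocks = (rank + 2 * d <= size) ? d : size - (rank + d);
            MPI_Recv(buf + have * blocksize, nblocks * blocksize, MPI_DOUBLE,
                     rank + d, 0, comm, MPI_STATUS_IGNORE);
            have += nblocks;
        }
        d <<= 1;
    }
    if (rank != 0)                    /* send own and gathered blocks to the parent */
        MPI_Send(buf, have * blocksize, MPI_DOUBLE, rank - d, 0, comm);
}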

Binomial tree reduction

All processes contribute a (typed) vector in sendbuf. The root process computes the result in recvbuf (significant at the root only).

MPI_Reduce(sendbuf,recvbuf,count,datatype,op,root,
           comm);

if (rank==root) result = recvbuf;
else result = malloc(count*…); // datatype
memcpy(result,sendbuf,…); // need typed memcpy


There is no typed, local memcpy function in MPI. But the library might implement a typed copy efficiently?:

if (rank==root) result = recvbuf;
else result = malloc(count*…); // datatype
MPI_Sendrecv(sendbuf,count,datatype,0,TAG,
             result,count,datatype,0,TAG,
             MPI_COMM_SELF,MPI_STATUS_IGNORE);

Special, singleton communicator; the process has rank 0 in MPI_COMM_SELF


d = 1;
while ((rank&d)!=d && d<size) {
  if (rank+d<size) {
    MPI_Recv(tmpbuf,…,rank+d,…,comm);
    MPI_Reduce_local(tmpbuf,result,…);
  }
  d <<= 1;
}
if (rank!=root) MPI_Send(result,…,rank-d,…,comm);


Treduce(m) = ceil(log p)(α+βm) = ceil(log p)α+ceil(log p)βm

Optimal in α-term Non-optimal in β-term

No algorithmic latency: Each process can start receiving immediately, that is after O(1) operations (same for gather)

Ignoring time for ceil(log p) m-element reductions


[Figure: binomial tree over processes 0–6 in pre-order numbering; the first subtree of the root is the largest]

Binomial tree broadcast

Pre-order traversal numbering is unsuited for broadcast: algorithmic latency. Processor 0 can start sending to processor 2^i only after i steps (for 2^i < p), an undesirable broadcast latency


d = 1;

while ((rank&d)!=d&&d<size) d <<= 1;

if (rank!=root) MPI_Recv(buffer,…,rank-d,…,comm);

while (d>1) {

d >>= 1;

if (rank+d<size) MPI_Send(buffer,…,rank+d,…,comm);

}

Binomial tree broadcast with harmful algorithmic latency

[Figure: binomial broadcast on p=7 processes; the round in which each process receives the data: one process in round 0, two in round 1, three in round 2]


d = 1;

while (rank>=d) { dd = d; d <<= 1; }

if (rank!=root) MPI_Recv(buffer,…,rank-dd,…,comm);

while (rank+d<size) {

MPI_Send(buffer,…,rank+d,…,comm);

d <<= 1;

}

[Figure: broadcast with this numbering; processes 1–6 receive the data in rounds 0, 1, 1, 2, 2, 2]

Binomial tree broadcast without harmful algorithmic latency


[Figure: bit-reversed pre-order numbering: root 0 with children 4, 2, 1; node 2 has child 6; node 1 has children 5 and 3]

Better suited for broadcast: bit-reversed pre-order traversal numbering. The parent needed in round k can be found in O(k) steps


First lecture: Parent can be found in O(log k) steps, or O(1) steps with hardware support

• Find the first set bit k (lsb)
• Parent: flip bit k (subtract 2^k)
• Children: flip bits k+1, k+2, …



Tbroadcast(m) = ceil(log p)(α+βm) = ceil(log p)α + ceil(log p)βm

Optimal in the α-term, not in the β-term

No harmful algorithmic latency: the process that starts in round k executes a while loop of only k iterations


Algorithmic latency in tree algorithms

Latency for node i to decide its position in the tree, where k is the number of preceding communication operations:
• O(1): no algorithmic latency
• O(k): no harmful algorithmic latency
• ω(k): possibly harmful algorithmic latency


Binomial tree: Structure

[Figure: binomial trees B3, B2, B1, B0]


Binomial tree: Structure

[Figure: Bi with subtrees B(i-1), B(i-2), …, B0 attached to the root]

Properties:
• Bi has 2^i nodes, i≥0
• Bi has i+1 levels, 0, …, i
• Bi has choose(i,k) = i!/((i-k)!k!) nodes at level k

Home exercise: Prove by induction


Binomial tree: Naming (ranks)

[Figure: Bi with subtrees B(i-1), B(i-2), …, B0 rooted at ranks 2^(i-1), 2^(i-2), …, 2, 1; the root of Bi has rank 0]

Recursively: add the rank of the parent to the names in subtree B(i-2)


In binary (root = 0): let rank i = Y10…0 (binary) for some prefix Y, and let k be the position of the first 1 (counted from the least significant bit, k≥1). Then
• i’s k-1 children are Y10…01, Y10…10, …, Y110…0
• i’s parent is Y00…00
• all of i’s descendants are Y1X for a (k-1)-bit suffix X


[Figure: example of this naming in a binomial tree B5 with subtree B4]

For any i: the parent can be found in O(k) steps (or O(log k) steps, or O(1) if lsb(x) is available in hardware); each child can be found in O(1) steps

This naming corresponds to a pre-order traversal numbering of the binomial tree.

Other namings: post-order numbering, level numbering (BFS), …
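A small sketch of O(1)-per-child navigation in this pre-order naming (the helper names bt_parent and bt_children are my assumptions): the parent clears the lowest set bit, the children set one of the bits below it.

#include <stdio.h>

static int bt_parent(int rank)                 /* clear lowest set bit: Y10…0 -> Y00…0 */
{
  return rank & (rank - 1);
}

static int bt_children(int rank, int p, int child[])
{
  int lsb = (rank == 0) ? p : (rank & -rank);  /* for the root: all powers of two below p */
  int n = 0;
  for (int d = 1; d < lsb && rank + d < p; d <<= 1)
    child[n++] = rank + d;                     /* Y10…01, Y10…10, …, Y110…0 */
  return n;
}

int main(void)                                 /* tiny demo for p = 7 */
{
  int c[8], p = 7;
  for (int r = 0; r < p; r++)
    printf("rank %d: parent %d, %d children\n", r, r ? bt_parent(r) : -1, bt_children(r, p, c));
  return 0;
}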


Binomial tree: Alternative definition (and proof of 3rd property)

[Figure: alternative definition: Bi consists of two copies of B(i-1), the root of one attached as a child of the root of the other]

Properties:
• Bi has 2^i nodes, i≥0
• Bi has i+1 levels, 0, …, i
• Bi has choose(i,k) = i!/((i-k)!k!) nodes at level k

Level k: choose(i-1,k) + choose(i-1,k-1) = choose(i,k)


Binomial tree gather/broadcast/reduction

[Figure: Bi as two copies of B(i-1); the transfer between the two subtree roots costs α+βS(i-1)]

T(i): processing time, S(i): data volume

Optimality in linear cost model: Since the processing time of the two subtrees is the same, transfer between root and subtree root incurs no delay


Gather:
• T(i) = T(i-1) + α + βS(i-1)
• T(0) = 0, S(0) = m/p, S(i) = S(i-1) + S(i-1)
• closed form: T(log p) = (log p)α + β(p-1)/p m

Scatter:
• T(i) = α + βS(i-1) + T(i-1)
• closed form: T(log p) = (log p)α + β(p-1)/p m

Broadcast:
• T(i) = α + βS(i-1) + T(i-1), S(i) = m
• closed form: T(log p) = (log p)(α + βm)

Reduction:
• T(i) = T(i-1) + α + βS(i-1) + γS(i-1)
• closed form: T(log p) = (log p)(α + βm + γm)

Very different recursions in the LogP model


Binomial tree broadcast in hypercube

[Figure: binomial broadcast in the hypercube H3]

Round 0: if a processor has the data, flip bit 0 and send the block (afterwards the data is at processes 0 and 1)
Round 1: if a processor has the data, flip bit 1 and send the block (afterwards at processes 0–3)
Round 2: if a processor has the data, flip bit 2 and send the block (afterwards at all 8 processes)
In general, round i: if a processor has the data, flip bit i and send the block

Theorem: Binomial tree Bi can be optimally embedded into hypercube Hi; each edge in Bi corresponds to an edge in Hi:
• Dilation 1: every edge in Bi is mapped to an edge in Hi (not a path)
• Congestion 1: there is at most one edge in Bi mapping to any edge in Hi

Does not contradict NP-completeness of general embedding problem
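A minimal sketch of the “flip bit i and send” broadcast from the rounds above (assumptions made here: root 0, a payload of count ints, and the name hypercube_bcast; the code also works when p is not a power of two):

#include <mpi.h>

void hypercube_bcast(int *buffer, int count, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  for (int bit = 1; bit < size; bit <<= 1) {   /* round i: bit = 2^i */
    if (rank < bit) {                          /* already has the data: send */
      if (rank + bit < size)
        MPI_Send(buffer, count, MPI_INT, rank + bit, 0, comm);
    } else if (rank < 2 * bit) {               /* receives exactly in this round */
      MPI_Recv(buffer, count, MPI_INT, rank - bit, 0, comm, MPI_STATUS_IGNORE);
    }
  }
}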


Network design vehicle approach

Many results on embedding different, fixed networks into other networks (e.g., binomial tree into hypercube, trees into meshes, …); active research in the 1980s and 1990s

1. Choose a convenient network for the design of the algorithm
2. Prove properties
3. Embed into the actual system network
4. Use the embedding results to prove properties


Binomial tree scatter in hypercube

[Figure: scatter in H3 over d = 3 rounds]

Round 0: if a processor has data, flip bit d-1 and send p/2 blocks (afterwards data at processes 0 and 4)
Round 1: if a processor has data, flip bit d-2 and send p/4 blocks (afterwards at processes 0, 2, 4, 6)
Round 2: if a processor has data, flip bit d-3 and send p/8 blocks (afterwards at all processes)

Result: Different embedding of binomial tree into hypercube.

Question: What if p is not a power of two? Is it possible to achieve ceil(log2 p) rounds?


Binomial tree-like, divide-and-conquer scatter

[Figure: processors s..m-1 (containing the root) and m..e-1 (containing the subroot), with m = s + (e-s)/2]

Idea: Divide the processors into two parts; the root sends the blocks for the other half to a subroot; recurse

MPI_Scatter(sendbuf,sendcount,sendtype,
            recvbuf,recvcount,recvtype,root,comm);

sendbuf is significant only at the root; data blocks are in rank order


Scatter(void *sendbuf, …, void *recvbuf, …,

int s, int e, int root, MPI_Comm comm)

{

// base case: only one process

if (s+1==e) {

memcpy(recvbuf,sendbuf,…); // should be typed

// can (should) be avoided for non-roots

return;

}

// …

}

MPI_Scatter calls the recursive Scatter with s=0 and e=p; sendbuf is NULL unless rank==root


n = (e-s)/2; m = s+n;

if (root<m) {

subroot = m; blocks = e-s-n;

if (rank<m) {

if (rank==root) sendbuf += m;

e = m; newroot = root;

} else {

s = m; newroot = subroot;

}

} else {

subroot = s; blocks = n;

if (rank>=m) {

s = m; newroot = root;

} else {

e = m; newroot = subroot;

}

}



// root sends the data blocks for the other half to subroot
if (rank==root) {
  MPI_Send(sendbuf,blocks,…,subroot,…,comm);
} else if (rank==subroot) {
  sendbuf = (void*)malloc(blocks*…);
  MPI_Recv(sendbuf,blocks,…,root,…,comm);
}

// recurse on own half
Scatter(sendbuf,…,recvbuf,…,s,e,newroot,comm);

if (rank!=root && sendbuf!=NULL) free(sendbuf);

return;

• Algorithmic latency is constant, root/subroot can start immediately

• Contiguous blocks from sendbuf
• Temporary buffers needed at the subroots (see the self-contained sketch below)
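A self-contained sketch of the divide-and-conquer scatter (my own reconstruction, not the slide code; assumptions: one block of bsize MPI_INT elements per process and the invented names dc_scatter, passbuf, recvblock). Convention: at the (sub)root, buf holds the blocks for processes s..e-1, block for process s first.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void dc_scatter(int *buf, int bsize, int *recvblock,
                       int s, int e, int root, int rank, MPI_Comm comm)
{
  if (e - s == 1) {                             /* base case: single process */
    memcpy(recvblock, buf, bsize * sizeof(int));
    return;
  }
  int n = (e - s) / 2, m = s + n;               /* halves [s,m) and [m,e) */
  int subroot, blocks, off;                     /* off: first block to forward */

  if (root < m) { subroot = m; blocks = e - m; off = n; } /* root keeps the lower half */
  else          { subroot = s; blocks = n;     off = 0; } /* root keeps the upper half */

  int *passbuf = NULL;                          /* buffer used in the recursion */
  if (rank == root) {
    MPI_Send(buf + (size_t)off * bsize, blocks * bsize, MPI_INT, subroot, 0, comm);
    passbuf = (root < m) ? buf : buf + (size_t)n * bsize;   /* root's own half */
  } else if (rank == subroot) {
    passbuf = malloc((size_t)blocks * bsize * sizeof(int));
    MPI_Recv(passbuf, blocks * bsize, MPI_INT, root, 0, comm, MPI_STATUS_IGNORE);
  }

  int ns    = (rank < m) ? s : m;               /* recurse in own half */
  int ne    = (rank < m) ? m : e;
  int nroot = ((rank < m) == (root < m)) ? root : subroot;
  dc_scatter(passbuf, bsize, recvblock, ns, ne, nroot, rank, comm);

  if (rank == subroot) free(passbuf);
}

At the top level this would be called as dc_scatter(rank==root ? sendbuf : NULL, bsize, myblock, 0, p, root, rank, comm).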


Remark: Van de Geijn and others call these algorithms “MST (Minimum Spanning Tree)” algorithms… A misnomer:
• Minimum with respect to what?
• Any algorithm for broadcast etc. must use one (or more) spanning trees, since all processors must be reached

Tscatter(m) = ceil(log p)α + (p-1)/p βm

Optimal in both α and β terms

Note: The same scheme for reduction/gather would have O(log p) harmful algorithmic latency


Binomial gather/scatter with root≠0

[Figure: binomial tree over the virtual ranks 0’–6’]

Standard solution: Use virtual rank’ = (rank-root+size)%size

[Figure: the same tree labeled with the actual ranks]

Drawback: Result not in rank order (shifted), requires local reordering at root, further algorithmic latency

[Figure: the gathered blocks arrive at the root in shifted order and must be reordered locally]

Divide-and-conquer gather does not have this drawback, received buffers always in rank order, still ceil(log p) rounds


Irregular collectives: Gather and scatter

• Irregular (v-) collectives are algorithmically much more difficult than the regular collectives seen so far.

• Not that many good results

• MPI libraries most often have trivial implementations: Same algorithm as corresponding, regular collective

What to do?


“Well-behaved”, irregular collectives: Gather, scatter, allgather, reduce_scatter

Jesper Larsson Träff, Andreas Ripke, Christian Siebert, Pavan Balaji, Rajeev Thakur, William Gropp: A Pipelined Algorithm for Large, Irregular All-Gather Problems. IJHPCA 24(1): 58-68 (2010)

J. L. Träff: An Improved Algorithm for (Non-commutative) Reduce-Scatter with an Application. PVM/MPI 2005: 129-137

Jesper Larsson Träff: Practical, Linear-time, Fully Distributed Algorithms for Irregular Gather and Scatter. CoRR abs/1702.05967 (2017)

And others…


Linear time irregular gather

Let m0, m1, m2, …, mi, …, mp-1 be the (sizes of the) blocks contributed by the p processors. Assume blocks are to be gathered in rank order at some processor r, 0≤r<p

m0 | m1 | m2 | … | mi | … | m(p-1)

MPI_Gatherv(sendbuf,scount,stype,
            recvbuf,rcounts,rdispl,rtype,root,comm);

MPI specifics:
• Only the root knows all mi, in the rcounts vector (of size p)
• The root provides a displacement rdispl[i] for each mi


Problem with static (rank-structured) binomial trees

[Figure: Bi composed of two B(i-1) subtrees]

S(i-1) = ∑mi of the right subtree might be larger than S(i-1) of the left subtree, and the processing times T(i-1) of the two subtrees differ

Gather delay δ = Tright - Tleft: the delay waiting for the right tree to finish gathering



Idea: Avoid gather delays by letting faster-constructed trees send to the slower ones


Extended hypercube: Processors are organized in a hypercube H, and one extra non-hypercube edge between two processors is allowed.

An extended, d-dimensional hypercube H consists of two (d-1)-dimensional, extended subcubes H’ and H’’, such that the processors in H form a consecutive range: the processors of H’ followed by the processors of H’’

Remark: The algorithm works also for incomplete hypercubes, where p is not a power of 2


Lemma: There exists an extended hypercube algorithm that gathers the p blocks in rank order at some (unspecified) root r optimally (in the α and β terms) in ceil(log p)α + β∑i≠rmi time units

Let m0, m1, m2, …, mi, …, mp-1 be the (sizes of the) blocks contributed by the p processors. Assume blocks are to be gathered in rank order at some processor r, 0≤r<p

m0 | m1 | m2 | … | mi | … | m(p-1)


Proof: Induction on the dimension of the hypercube

Base case, d=0: Process i has its block mi, nothing to gather

Induction step, assume claim holds for hypercubes of dimension d-1: Hypercube H of dimension d is formed by gathering data from one d-1 dimensional cube H’ to the root of the other d-1 dimensional cube H’’. Let r’ and r’’ be the roots of H’ and H’’.

[Figure: subcube H’ with root r’ and subcube H’’ with root r’’]


Assume that ∑i in H‘,i≠r‘mi ≤ ∑i in H‘‘,i≠r‘‘mi (with ties broken arbitrarily, but consistently).

Send the data gathered at r‘ to r‘‘.

Sending the data gathered at r‘ takes α+β∑i in H‘ mi time units. By the induction hypothesis, gathering the data at the other root r‘‘ has taken (d-1)α + β∑i in H’’,i≠r’’ mi time units (which is not smaller than the time of gathering in H’: no gather delay), for a total of dα + β∑i in H,i≠r’’ mi time units. The root in H becomes r’’, and the claim follows.

In H, there is communication along (r’,r’’) which may not be a hypercube edge, but is allowed in the extended hypercube


Data can be gathered in order: Assume H’ contains the processors 0, …, 2^(d-1)-1, and H’’ the processors 2^(d-1), …, 2^d-1. The buffer in H is allocated such that the H’ blocks come before the H’’ blocks

Example (linear cost, root r): p = 7 processes with block sizes m0=7, m1=4, m2=2, m3=3, m4=6, m5=8, m6=9


Proposition: There exists a hypercube algorithm that gathers the p blocks in rank order at a given root r in linear time, at most ceil(log p)α + β∑i≠rmi + β∑i in H’,i≠r’mi time units for some subcube H’

Proof: Break all ties in favor of given root r. Processor r receives from a sequence of optimal, binomial trees B(i), i=0,1,…,ceil(log p)-1

[Figure: root r receives from a sequence of binomial subtrees B(1), B(2), …, B(i), B(i+1), …]



Let H’ be the last tree causing a delay. The construction cost of H’ is therefore larger than the cost of receiving from all previous subcubes (including delays). The cost of receiving from all ceil(log p) subcubes is ceil(log p)α + β∑i≠r mi. The delay is at most iα + β∑i in H’,i≠r’ mi - (iα + β∑i in earlier subcubes,i≠r mi) ≤ β∑i in H’,i≠r’ mi


Question: How does r’ know r’’ and vice versa?

Algorithm: Maintain a fixed binomial tree B with root b in each hypercube H with current root r. Maintain the invariant: the fixed root b knows r, r knows b, and both r and b know ∑i≠rmi and ∑mi

[Figure: hypercube H with fixed tree B, fixed root b and current root r]

The invariant holds for the 0-dimensional hypercube with r=b


[Figure: H’ (tree B’, fixed root b’, current root r’) and H’’ (tree B’’, fixed root b’’, current root r’’)]

Step 1: b’ and b’’ exchange (r’,∑i≠r’mi,∑mi) and (r’’,∑i≠r’’mi,∑mi) by a sendrecv



Step 2’: Fixed root b’ sends (r’’,∑i≠r’’mi,∑mi) to r’ (if b’≠r’)



Step 2’’: Fixed root b’’ sends (r’,∑i≠r’mi,∑mi) to r’’ (if b’’≠r’’)



Step 3: Both hypercube roots r’ and r’’ compare ∑i≠r’mi and ∑i≠r’’mi, and decide which will be the new root in H. The new fixed root b’’ likewise decides



By the three steps, the invariant is maintained for H formed from H’ and H’’. The construction takes ceil(log p) communication rounds, each consisting of 2 steps


Theorem: The irregular gather problem with block sizes m0, m1, …, mp-1, and given root r can be solved in at most 3ceil(log p)α + β∑i≠rmi + β∑i in H’,i≠r’mi time units

Jesper Larsson Träff: Practical, Linear-time, Fully Distributed Algorithms for Irregular Gather and Scatter. CoRR abs/1702.05967 (2017)

Question: What is the minimum possible delay?

Claim: An ordered gather/scatter tree with the minimum possible delay can be constructed in polynomial time
Claim: If the order restriction is dropped, finding a minimum-delay tree is NP-hard

Proof: Master’s thesis


The power of the hypercube/butterfly: Allgather

[Figure: allgather on p = 8 processes by recursive doubling, three rounds]

Round k partner: rank XOR 2^k (flip the k’th bit), exchange distance 2^k

Butterfly/FFT communication pattern


[Figure: hypercube H3; embedding of the butterfly: rounds 0, 1, 2 use the three hypercube dimensions]


The power of the hypercube/butterfly

Tallgather(m) = (log p)α + (p-1)/p βm

since the amount of data exchanged in round k is m/2^(log p - k), k = 0, 1, …, log p - 1

Drawback:Butterfly (hypercube) algorithms do not extend nicely to the case where p is not a power of 2

Optimal in both terms
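A minimal sketch of this recursive-doubling allgather (assumptions: p a power of two, one block of bsize ints per process, and the invented name butterfly_allgather):

#include <mpi.h>
#include <string.h>

void butterfly_allgather(const int *myblock, int bsize, int *result, MPI_Comm comm)
{
  int rank, p;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);

  memcpy(result + (size_t)rank * bsize, myblock, bsize * sizeof(int));

  for (int d = 1; d < p; d <<= 1) {             /* round k: d = 2^k */
    int partner = rank ^ d;                     /* flip bit k */
    int myoff   = (rank    & ~(d - 1)) * bsize; /* group of blocks held so far */
    int hisoff  = (partner & ~(d - 1)) * bsize; /* group received from the partner */
    MPI_Sendrecv(result + myoff,  d * bsize, MPI_INT, partner, 0,
                 result + hisoff, d * bsize, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
  }
}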


Fully connected networks/multiple binomial trees: Allgather

p (incomplete) binomial trees, each rooted at a different processor, can be embedded into a fully connected network; equivalently, each processor gathers data from all other processors. Perform p simultaneous broadcasts


Round k: Each process sends a block of size 2^k to processor (i+2^k) mod p and receives one from (i-2^k) mod p

Except for the last round: block of size p-2^k


“Bruck”/”Dissemination” allgather algorithm (sketch):

Tallgather(m) = ceil(log p)α + (p-1)/p mβ

d = 1;
floor(log p) rounds:
• Receive block[rank-2d+1 : rank-d] from rank-d
• Send block[rank-d+1 : rank] to rank+d
• d = 2d
if d < p (one final, partial round):
• Receive block[rank-d-r+1 : rank-d] from rank-d
• Send block[rank-r+1 : rank] to rank+d
where r = p-d, and all rank±x index operations are modulo p

Optimal in both terms



J. Bruck, Ching-Tien Ho, S. Kipnis, E. Upfal, D. Weathersby: Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems. IEEE Trans. Parallel Distrib. Syst. 8(11): 1143-1156 (1997)


Fundamental paper to study



For MPI: Blocks are received in a rotated order, e.g. [4,5,6,0,1,2,3] at rank 3, so internal buffering and copying may be necessary in some steps(*); the MPI standard prescribes the order [0,1,2,3,4,5,6] at all ranks


(*) or use MPI derived datatypes
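A minimal sketch of the dissemination allgather for arbitrary p (my own illustration, with directions mirrored relative to the sketch above: it sends to rank-d and receives from rank+d, which does not change the cost). It keeps the blocks rotated during the rounds so that every window is contiguous, and reorders locally at the end instead of using derived datatypes; names are invented:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void bruck_allgather(const int *myblock, int bsize, int *result, MPI_Comm comm)
{
  int rank, p;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);

  int *tmp = malloc((size_t)p * bsize * sizeof(int)); /* tmp[j] = block of (rank+j) mod p */
  memcpy(tmp, myblock, bsize * sizeof(int));          /* own block at position 0 */

  for (int d = 1; d < p; d <<= 1) {
    int cnt  = (2 * d <= p) ? d : p - d;              /* last round may be partial */
    int to   = (rank - d + p) % p;
    int from = (rank + d) % p;
    MPI_Sendrecv(tmp,                     cnt * bsize, MPI_INT, to,   0,
                 tmp + (size_t)d * bsize, cnt * bsize, MPI_INT, from, 0,
                 comm, MPI_STATUS_IGNORE);
  }
  for (int j = 0; j < p; j++)                         /* rotate into rank order */
    memcpy(result + (size_t)j * bsize,
           tmp + (size_t)((j - rank + p) % p) * bsize, bsize * sizeof(int));
  free(tmp);
}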


From Wolfram MathWorld, www.mathworld.wolfram.com/CirculantGraph.html

The communication pattern in the dissemination allgather is a circulant graph: processor i communicates with processors (i+d) mod p and (i-d) mod p, for d = 1, 2, 4, …, 2^(ceil(log p)-1)


Scan/Exscan in fully connected networks

Hillis-Steele (reminder from Bachelor lecture, sketch):

Processor i, round k, k=0, 1, …, ceil(log p)-1:
1. Receive the partial result from processor i-2^k (if i-2^k≥0)
2. Send the own partial result to processor i+2^k (if i+2^k<p)
3. Compute the partial result for the next round by summing the own and the received partial results

Partial result at the start of round k: ∑ xj over max(0, i-2^k) < j ≤ i

Tscan(m) = ceil(log p)(α+βm)

Note: The attribution is not correct; the algorithm was known before Hillis and Steele. Not optimal in the β-term
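A minimal sketch of the doubling scan above (assumptions: one int per process, MPI_SUM, and the invented name hillis_steele_scan):

#include <mpi.h>

int hillis_steele_scan(int myval, MPI_Comm comm)
{
  int rank, p, partial = myval;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);

  for (int d = 1; d < p; d <<= 1) {
    int recvd = 0, dst = rank + d, src = rank - d;
    if (dst < p && src >= 0)                   /* send own partial, receive upstream partial */
      MPI_Sendrecv(&partial, 1, MPI_INT, dst, 0,
                   &recvd,   1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
    else if (dst < p)
      MPI_Send(&partial, 1, MPI_INT, dst, 0, comm);
    else if (src >= 0)
      MPI_Recv(&recvd, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
    if (src >= 0) partial += recvd;            /* new partial result for the next round */
  }
  return partial;                              /* = x_0 + … + x_rank */
}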


Round 0: Sendrecv(rank-1, rank+1); add the received partial result to the own vector
Round 1: Sendrecv(rank-2, rank+2); add the received partial result to the own vector

W. Daniel Hillis, Guy L. Steele Jr.: Data Parallel Algorithms. Commun. ACM 29(12): 1170-1183 (1986)

Round 2: Sendrecv(rank-4, rank+4); add the received partial result to the own vector


The power of the hypercube/butterfly

Building on the observation that allreduce = reduce-scatter + allgather, the butterfly can be used for a good (best possible?) allreduce algorithm

Note: In the MPI community this algorithm is sometimes called Rabenseifner’s algorithm, although it has been known for longer in the parallel/distributed computing field (incorrect attribution)

[Figures: reduce-scatter phase on the butterfly, round by round]

Treducescatter’(m) = (log p)α + (p-1)/p βm


[Figure: after the reduce-scatter phase, each process holds one fully reduced block Σ]

Tallreduce(m) = 2[(log p)α + (p-1)/p βm]

Reverse the process to allgather the result:

[Figure: the allgather phase distributes all reduced blocks Σ to every process]
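A minimal sketch of allreduce as reduce-scatter plus allgather (my own illustration; assumptions: p a power of two, MPI_SUM on p*bsize ints per process, invented name butterfly_allreduce; the reduce-scatter here halves the exchange distance, so block i ends up at process i and the reordering issue discussed below does not arise):

#include <mpi.h>
#include <stdlib.h>

void butterfly_allreduce(int *data, int bsize, MPI_Comm comm)
{
  int rank, p;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);
  int *tmp = malloc((size_t)(p / 2) * bsize * sizeof(int));

  /* Phase 1: reduce-scatter by recursive halving; window of block indices [lo, lo+2d) */
  int lo = 0;
  for (int d = p / 2; d >= 1; d /= 2) {
    int partner = rank ^ d;
    int keeplo  = ((rank & d) == 0) ? lo : lo + d;   /* half this process keeps */
    int sendlo  = ((rank & d) == 0) ? lo + d : lo;   /* half sent to the partner */
    MPI_Sendrecv(data + (size_t)sendlo * bsize, d * bsize, MPI_INT, partner, 0,
                 tmp,                           d * bsize, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    for (int j = 0; j < d * bsize; j++)
      data[(size_t)keeplo * bsize + j] += tmp[j];
    lo = keeplo;
  }
  /* here lo == rank: this process holds the fully reduced block rank */

  /* Phase 2: allgather by recursive doubling (as in butterfly_allgather above) */
  for (int d = 1; d < p; d <<= 1) {
    int partner = rank ^ d;
    int myoff   = (rank    & ~(d - 1)) * bsize;
    int hisoff  = (partner & ~(d - 1)) * bsize;
    MPI_Sendrecv(data + myoff,  d * bsize, MPI_INT, partner, 0,
                 data + hisoff, d * bsize, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
  }
  free(tmp);
}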


Reduce-scatter part, MPI note: The blocks are not scattered as prescribed by MPI (i’th block at the i’th process)

Simple trick: Reorder the blocks (FFT permutation) before starting, at the cost of an O(m) algorithmic latency

Again: The butterfly algorithm does not extend nicely to the case where p is not a power of two. Better-than-trivial algorithms in:

R. Rabenseifner, J. L. Träff: More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems. PVM/MPI 2004: 36-46

J. L. Träff: An Improved Algorithm for (Non-commutative) Reduce-Scatter with an Application. PVM/MPI 2005: 129-137


The power of fixed-degree trees

Binomial trees cannot be pipelined: for the same block size m/M, different nodes in the tree would have different amounts of work per round

Fixed-degree trees, e.g., a linear pipeline or binary trees, admit pipelining

[Figure: linear pipeline over p processes]


Note: The extreme case of a non-constant-degree star (tree) can be useful, e.g., for gather with very large problems (MPI: no need for intermediate buffering); pipelining can be employed for each child:

[Figure: star Sp]


Complete, balanced, binary tree: Structure

[Figure: T3; Ti consists of a root with two subtrees T(i-1)]

Properties:
• Ti has 2^(i+1) - 1 nodes, i≥0
• Ti has 2^i - 1 interior nodes
• Ti has 2^i leaves
• Ti has i+1 levels


Complete, balanced, binary tree: Naming (BFS)

[Figure: binary tree with BFS numbering 0–9: root 0, children 1 and 2, and so on level by level]

Navigation in O(1): the parent of rank is (rank-1)/2, the children are 2·rank+1 and 2·rank+2

For MPI: Not always convenient, processor ranks of subtrees do not form a consecutive range (e.g., in gather/scatter block reordering necessary; reduction not in rank-order, …)


Complete, balanced, binary tree: Naming (in-order)

[Figure: binary tree with in-order numbering 0–9]

Property: The processor ranks of a subtree form a consecutive range [j, …, j+2^k-1] for some j and k


Pipelined binary tree

[Figure: in-order numbering; binary tree on processes 1–7 with root 4, children 2 and 6, leaves 1, 3, 5, 7]

Can be used for broadcast, reduction, scan/exscan; for the latter, in-order numbering of the processors is required (unless the MPI operation is commutative)


[Figure: pipelined broadcast of M blocks in the in-order binary tree on processes 1–7, shown round by round]

Broadcast: M+x rounds
• Root: send blocks
• Leaf: receive blocks
• Interior: receive a new block from the parent, send the previous block to the left subtree, send the previous block to the right subtree

Observe:
• Non-root nodes get a new block every second round
• The last leaf gets the first block after x = 2(log(p+1)-1) rounds
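A simplified sketch of the pipelined tree broadcast (my own illustration; assumptions: the in-order parent and children of the calling process are passed in as ranks, -1 meaning none, blocks are bsize ints, and the name pipelined_tree_bcast is invented). It sends each block to both children before receiving the next one, rather than following the exact round schedule above, but the blocks flow down the tree in the same pipelined fashion:

#include <mpi.h>

void pipelined_tree_bcast(int *buf, int M, int bsize,
                          int parent, int lchild, int rchild, MPI_Comm comm)
{
  for (int b = 0; b < M; b++) {
    int *block = buf + b * bsize;
    if (parent >= 0)                            /* non-root: wait for block b */
      MPI_Recv(block, bsize, MPI_INT, parent, b, comm, MPI_STATUS_IGNORE);
    if (lchild >= 0)
      MPI_Send(block, bsize, MPI_INT, lchild, b, comm);
    if (rchild >= 0)
      MPI_Send(block, bsize, MPI_INT, rchild, b, comm);
  }
}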


Corollary (of the pipelining lemma):

The best possible time for pipelined, binary tree broadcast is

Tbroadcast(m) = 2ceil(log((p+1)/4))α + 2√[4·ceil(log((p+1)/4))αβm] + 2βm
             = 2ceil(log((p+1)/4))α + 4√[ceil(log((p+1)/4))αβm] + 2βm

since k = 2(ceil(log(p+1))-1) and s = 2, giving k-s = 2(ceil(log(p+1))-2) = 2ceil(log((p+1)/4))

Logarithmic in the α-term, non-optimal in the β-term (factor 2 off)

General form from the pipelining lemma: (k-s)α + 2√[s(k-s)αβm] + sβm


[Figure: in-order binary tree on processes 1–7]

Reduction: M+2(log(p+1)-1) rounds
• Leaf: send blocks
• Interior and root: receive from the left subtree, add to the partial result, receive from the right subtree, add to the partial result, send to the parent (the root does not send)



Scan/Exscan: Two phases, up and down

Up-phase: Interior node receives from left subtree, adds to own value, stores partial result, receives from right subtree, computes temporary partial result and sends upwards

[Figure: up-phase; a node receives ∑Tleft and ∑Tright from its subtrees, stores the partial result ∑Tleft+val(i) = ∑(0≤j≤i) val(j), and sends the subtree sum ∑T upwards]



Down-phase: Interior node receives result from parent, sends to left subtree, adds to own partial result, sends complete result to right subtree

[Figure: down-phase; prefixes such as ∑Tleft+val(root) and ∑(0≤j≤i) val(j) are passed down to the subtrees]



Both phases can be pipelined, best time is twice the best time for reduction (some extra overlap of up- and down phase possible with bidirectional communication)

Peter Sanders, Jesper Larsson Träff: Parallel Prefix (Scan) Algorithms for MPI. PVM/MPI 2006: 49-57


Problem with balanced binary tree

Broadcast in Ti: The last leaf receives the data only after 2i rounds (the latency k in the pipelining lemma)

Possible solution: Imbalanced binary tree, left subtree deeper than right subtree


An imbalanced binary tree: Fibonacci tree

[Figure: Fibonacci trees F0, F1, F2, F3; Fi consists of a root with subtrees F(i-1) and F(i-2)]

Properties:
• Fi has Fib(i+3)-1 nodes
• Fi has depth i, i≥0
• Fi has i+1 levels


Recall:

Fib(0) = 0, Fib(1) = 1, Fib(i) = Fib(i-1)+Fib(i-2) so Fib(2) = 1, Fib(3) = 2, Fib(4) = 3, Fib(5) = 5, Fib(6) = 8, …

The i’th Fibonacci number is Fib(i) = (φ^i - φ’^i)/√5 where φ = (1+√5)/2 and φ’ = (1-√5)/2

Fib(i+3)-1 ≥ p if (1/√5)φ^(i+3) ≥ p, i.e., i+3 ≥ log_φ p, i.e., i ≥ log_φ p - 3

Exercise: Prove by induction

The Fibonacci tree may have been introduced for broadcast in:

Jehoshua Bruck, Robert Cypher, Ching-Tien Ho: Multiple Message Broadcasting with Generalized Fibonacci Trees. SPDP 1992: 424-431


Broadcast in Fibonacci tree

[Figure: pipelined broadcast in a Fibonacci tree, shown round by round]

• All leaves receive the data at the same time
• Every node receives a new block every second round


Broadcast (reduction) trees: Properties

• Binomial tree Bi: All leaf nodes receive data after i rounds
• Fibonacci tree Fi: All leaf nodes receive data after i rounds

•Binary tree Ti: First leaf receives data after i rounds, last leaf after 2i rounds

For small data: Binomial tree is round optimal, Fibonacci almost round optimal, binary tree factor 2 off

Binary tree and Fibonacci tree can be pipelined (binomial tree not)

Exercise: Prove by induction

Fixed-degree trees can be pipelined; all nodes (except root and leaves) have the same amount of work


Two (pipelined) binary trees

Binary tree drawbacks:
• Nodes receive a new block only every second round
• Leaves only receive

The capabilities of the communication model are not fully exploited

Idea: Use two binary trees instead of one. An interior node of one tree is a leaf of the other, and vice versa; in each round a processor receives a new block from its parent in either tree and sends a previous block to one of its children

Could perhaps work for incomplete binary trees with even number of nodes?


Peter Sanders, Jochen Speck, Jesper Larsson Träff: Two-treealgorithms for full bandwidth broadcast, reduction and scan. Parallel Computing 35(12): 581-594 (2009)


Each processor is associated with two nodes: a leaf in one tree, an interior node in the other tree


Incomplete binary tree with even number of nodes: Same number of interior nodes and leaf nodes.

Tree in in-order numbering, processors numbered from 1, …, p

Example: Reduction

[Figure: incomplete in-order binary tree over processes 1–10]



Construction: Given p, remove processor 0 (root), construct in-order tree over remaining processors, if p-1 is even. (If p-1 is odd, remove one more processor, add as virtual root)



[Figure: duplicate the tree; each process appears in two copies of the in-order tree]


[Figure: rotate (mirror) the second tree]

Each processor is in two trees, once with rank, once with rank’ = p - rank


[Figure: the duplicated and mirrored trees; the same processor appears in both]

Each processor is either a leaf in the upper tree and an interior node in the lower tree, or vice versa


[Figure: the two trees with the reduction root 0 added]

Add the reduction root 0, which shall receive from both trees




Both trees can work simultaneously, with an m/2 data reduction in either by pipelining. The pipelining lemma with s=2 and k=2(log(p+1)-1) gives

Treduce(m) = 2ceil(log((p+1)/4))α + 2√[4·ceil(log((p+1)/4))αβm/2] + 2βm/2
           = 2ceil(log((p+1)/4))α + 2√[2·ceil(log((p+1)/4))αβm] + βm

Optimal in the β-term, logarithmic in the α-term

Remark: Currently the best known reduction time with logarithmic latency

Problem: How to schedule the communication such that in each round each processor sends to a parent (in either tree) and receives from a child?


[Figure: edge 2-coloring (red/black) of the two trees]

Answer: By colors. In even rounds processors send and receive on red edges, in odd rounds on black edges


Coloring (2-tree scheduling) lemma:There is an edge 2-coloring of the 2-tree such that for each node• The colors of the edges to parents in upper and lower trees

are different• The colors from the (up to two) children are different

Proof: Construct a schedule graph B as follows: Let {s1,…,sp} be the set of sending and {r1,…,rp} the set of receiving nodes; there is an edge between si and rj iff si is a child of rj in either the upper or the lower tree. Graph B is clearly bipartite and each node si or rj has degree at most two. B is therefore edge 2-colorable.

Page 339:

Note (recall?): A bipartite edge 2-coloring can be computed in O(n+m) time (Cole, Ost, Schirra, 2001)

Too expensive (algorithmic latency), not parallel

Richard Cole, Kirstin Ost, Stefan Schirra: Edge-Coloring Bipartite Multigraphs in O(E log D) Time. Combinatorica 21(1): 5-12 (2001)

Page 340:

Main theorem: Using the mirroring construction for 2-trees, the color of the edge to the parent of an interior tree node v, 1≤v≤p, in the upper tree is computed in O(log p) steps by calling EdgeColor(p,root,v,1):

EdgeColor(p, root, v, H) {
  if (v == root) return 1;
  while ((v & H) == 0) H <<= 1;
  u = ((v & (H << 1)) != 0 || v + H > p) ? v - H : v + H;  // find parent of v
  c = (u > v) ? 1 : 0;
  return EdgeColor(p, root, u, H) ^ ((p/2) % 2) ^ c;
}

Proof: In paper (room for improvement)
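To make the recursion concrete, here is a small, self-contained C transcription of EdgeColor with a driver for the example tree of the next slide (p=10, root=8); that the interior nodes are exactly the even ranks is read off that example and is our assumption, not a statement from the slides.

#include <stdio.h>

/* Color (0 or 1) of the edge from node v to its parent in the upper tree;
   a direct transcription of the EdgeColor routine above. */
static int edge_color(int p, int root, int v, int H) {
  if (v == root) return 1;
  while ((v & H) == 0) H <<= 1;                    /* H = lowest set bit of v */
  int u = ((v & (H << 1)) != 0 || v + H > p) ? v - H : v + H;   /* parent */
  int c = (u > v) ? 1 : 0;
  return edge_color(p, root, u, H) ^ ((p / 2) % 2) ^ c;
}

int main(void) {
  int p = 10, root = 8;                      /* the example on the next slide */
  for (int v = 2; v <= p; v += 2)            /* interior nodes: the even ranks */
    printf("EdgeColor(%d,%d,%d,1) = %d\n", p, root, v, edge_color(p, root, v, 1));
  return 0;
}

Its output (1, 1, 0, 1, 0 for v = 2, 4, 6, 8, 10) matches the values worked out on the next slide.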

Page 341:

EdgeColor(10,8,2,1) = EdgeColor(10,8,4,2) = EdgeColor(10,8,8,4) = 1

EdgeColor(10,8,4,1) = EdgeColor(10,8,8,4) = 1

EdgeColor(10,8,6,1) = EdgeColor(10,8,4,2) XOR 1 = EdgeColor(10,8,8,4) XOR 1 = 0

EdgeColor(10,8,8,1) = 1

EdgeColor(10,8,10,1) = EdgeColor(10,8,8,2) XOR 1 = 0

EdgeColor(p, root, v, H) {
  if (v == root) return 1;
  while ((v & H) == 0) H <<= 1;
  u = ((v & (H << 1)) != 0 || v + H > p) ? v - H : v + H;  // find parent of v
  c = (u > v) ? 1 : 0;
  return EdgeColor(p, root, u, H) ^ ((p/2) % 2) ^ c;
}

Page 342:

Final remark: Latency can be improved by using two Fibonacci trees

Details … (Peter Sanders and students, personal communication)

Page 343:

The power of the hypercube: Optimal Broadcast

Goal: Broadcast M blocks in M-1+ceil(log p) rounds

Idea: Use an allgather-like algorithm on the hypercube (or circulant graph); each processor broadcasts some of the blocks

Bin Jia: Process cooperation in multiple message broadcast. Parallel Computing 35(12): 572-580 (2009)

Blocks block[i], 0≤i<M to be broadcast

root

Page 344:

MPI_Comm_rank(comm,&rank);
MPI_Comm_size(comm,&size);

for (j=0; j<M-1+q; j++) {
  MPI_Sendrecv(buffer[s(rank,j)],…,partner(rank,j),…,
               buffer[t(rank,j)],…,partner(rank,j),…,comm);
}

Partner in j’th round is (j mod q)’th hypercube neighbor:

partner(rank,j) = rank XOR 2^(j mod q)

where for now q = log2 p

The M-1+log p round hypercube algorithm
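As a minimal illustration of the partner computation (a sketch under the assumption that p is a power of two, so q = log2 p and every rank has a neighbor in every dimension):

#include <stdio.h>

/* partner of `rank` in round j: flip bit (j mod q) */
static int partner(int rank, int j, int q) { return rank ^ (1 << (j % q)); }

int main(void) {
  int q = 3;                                  /* p = 8 as in the later example */
  for (int j = 0; j < 7; j++)                 /* M-1+q = 7 rounds for M = 5 */
    printf("round %d: partner of 0 is %d\n", j, partner(0, j, q));
  return 0;
}

For rank 0 this prints 1, 2, 4, 1, 2, 4, 1, i.e. the root cycles through its hypercube neighbors, as in the example slides below.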

Page 345:

The algorithm is correct if

1. The block s(rank,j) that processor rank sends in round j has been received in a previous round j’<j

2. The block s(rank,j) that a processor sends in round j is the block that its partner expects to receive in round j, t(partner(rank,j),j)

3. The block t(rank,j) that a processor receives in round j is the block that its partner sends in round j, s(partner(rank,j),j)

4. All blocks have been received after M-1+log2 p rounds by each processor

Page 346:

From now on (wlog) root = 0 (otherwise: shift towards 0)

Root sends blocks out one after the other to partner(0,j):

s(0,j) = min(j,M-1)

The root never receives; if partner(i,j) = 0 for processor i in round j, processor i does not send (in MPI: Send to MPI_PROC_NULL)

Page 347:

[Figure: hypercube broadcast example, p = 8 processors, M = 5 blocks 0-4 at the root]

Page 348:

Round 0: Root sends to partner 0: flip bit 0

[Figure: block 0 now also at processor 1]

Page 349:

Round 1: Root sends to partner 1: flip bit 1

[Figure: block 1 sent from the root]

Page 350:

Round 1: Root sends to partner 1: flip bit 1; processor 1 sends to its partner 1: flip bit 1

[Figure: blocks after round 1]

Page 351:

Round 2: Root sends to partner 2: flip bit 2

[Figure: block 2 sent from the root]

Page 352:

Round 2: Root sends to partner 2: flip bit 2; processors 1, 2, 3 send to their partners: flip bit 2

[Figure: blocks after round 2]

Page 353:

Set q = log p, and define

bit(i,j): (j mod q)’th bit of i (from least significant bit)

s(i,j) = j - q + (1-bit(i,j)) NEXTBIT(i)[j mod q]
t(i,j) = j - q + bit(i,j) NEXTBIT(i)[j mod q]

Observation (by definition of partner and NEXTBIT):

s(i,j) = t(partner(i,j),j)
t(i,j) = s(partner(i,j),j)

Also: Blocks with s(i,j)<0 or t(i,j)<0 are not sent/received

Remedy: if s(i,j)≥M send instead block M-1

Page 354:

NEXTBIT(i) is a precomputed function for each processor i with

NEXTBIT(i)[j]: distance from bit j of i to next left (more significant) 1-bit in i, with wrap-around (for technical convenience NEXTBIT(0)[j] = q)

Examples:
NEXTBIT(010011)[0] = 1
NEXTBIT(010011)[1] = 3
NEXTBIT(010011)[2] = 2
NEXTBIT(010011)[3] = 1
NEXTBIT(010011)[4] = 2
NEXTBIT(010011)[5] = 1

Lemma:For any i, 0≤i<p, NEXTBIT(i) can be computed in O(log p) steps

Proof: Exercise (or read the paper)

Observation: NEXTBIT(i)[j] = NEXTBIT(partner(i,j))[j]
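The following self-contained C sketch computes NEXTBIT by a simple O(q²) scan (the O(log p) construction of the lemma is in the paper) together with s and t as defined above, and reproduces the values used in the round-3 example below; the helper names are ours.

#include <stdio.h>

/* NEXTBIT(i)[j]: distance from bit j of i to the next more significant 1-bit
   of i, with wrap-around over q bits; NEXTBIT(0)[j] = q. Simple O(q^2) scan. */
static void nextbit(unsigned i, int q, int nb[]) {
  for (int j = 0; j < q; j++) {
    if (i == 0) { nb[j] = q; continue; }
    int d = 1;
    while (((i >> ((j + d) % q)) & 1u) == 0) d++;
    nb[j] = d;
  }
}

static int bit(unsigned i, int j, int q) { return (i >> (j % q)) & 1u; }

/* block index sent (s) / received (t) by processor i in round j; < 0: none */
static int s(unsigned i, int j, int q, const int nb[]) {
  return j - q + (1 - bit(i, j, q)) * nb[j % q];
}
static int t(unsigned i, int j, int q, const int nb[]) {
  return j - q + bit(i, j, q) * nb[j % q];
}

int main(void) {
  int q = 6, nb[8];
  nextbit(0x13, q, nb);                       /* 010011, the example above */
  for (int j = 0; j < q; j++)
    printf("NEXTBIT(010011)[%d] = %d\n", j, nb[j]);

  q = 3;                                      /* p = 8 example, round 3 */
  nextbit(4, q, nb);
  printf("s(4,3) = %d, t(4,3) = %d\n", s(4, 3, q, nb), t(4, 3, q, nb));
  nextbit(5, q, nb);
  printf("s(5,3) = %d, t(5,3) = %d\n", s(5, 3, q, nb), t(5, 3, q, nb));
  return 0;
}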

Page 355:

Proposition: s(i,j+1) = (1-bit(i,j)) s(i,j) + bit(i,j) t(i,j)

The block that processor i sends in round j+1 is either the same block as sent in round j, or the block just received in round j.

A processor therefore does not attempt to send a block that it has not received!

Page 356:

Proposition: s(i,j+1) = (1-bit(i,j)) s(i,j) + bit(i,j) t(i,j)

Proof: First note that if bit(i,j+1)=0, then NEXTBIT(i)[(j+1) mod q] = NEXTBIT(i)[j mod q] - 1; and if bit(i,j+1)=1, then NEXTBIT(i)[j mod q] = 1

It follows that
s(i,j+1) = j + 1 - q + (1-bit(i,j+1)) NEXTBIT(i)[(j+1) mod q]
         = j - q + NEXTBIT(i)[j mod q]
         = (1-bit(i,j)) s(i,j) + bit(i,j) t(i,j)

(definition)

Page 357:

Round 2 (repeated): Root sends to partner 2: flip bit 2; processors 1, 2, 3 send to their partners: flip bit 2

[Figure: blocks after round 2]

Page 358:

Round 3: Flip bit (3 mod 3) = 0

Processor 4: NEXTBIT(100)[0] = 2, so s(4,3) = 2, t(4,3) = 0
Processor 5: NEXTBIT(101)[0] = 2, so s(5,3) = 0, t(5,3) = 2

[Figure: blocks after round 3]

Page 359:

Round 4: Flip bit (4 mod 3) = 1

Processor 4: NEXTBIT(100)[1] = 1, so s(4,4) = 2, t(4,4) = 1
Processor 6: NEXTBIT(110)[1] = 1, so s(6,4) = 1, t(6,4) = 2

[Figure: blocks after round 4]

Page 360:

Round 5: Flip bit (5 mod 3) = 2

Processor 2: NEXTBIT(010)[2] = 2, so s(2,5) = 4, t(2,5) = 2
Processor 6: NEXTBIT(110)[2] = 2, so s(6,5) = 2, t(6,5) = 4

[Figure: blocks after round 5]

Page 361:

Round 6: Flip bit (6 mod 3) = 0

Processor 2: NEXTBIT(010)[0] = 1, so s(2,6) = 4, t(2,6) = 3
Processor 3: NEXTBIT(011)[0] = 1, so s(3,6) = 3, t(3,6) = 4

[Figure: blocks after round 6]

Page 362:

Round 6: Flip bit (6 mod 3) = 0

Done! 7 = 5+3-1 = M-1+log p rounds

[Figure: final state; all processors have received blocks 0-4]

Page 363:

Processor 7:

NEXTBIT(111)[0] = NEXTBIT(111)[1] = NEXTBIT(111)[2] = 1 (the same in every round)

s(7,0) = -1
s(7,3) = 0
s(7,4) = 1
s(7,5) = 2
s(7,6) = 3

t(7,0) = -1
t(7,2) = 0
t(7,3) = 1
t(7,4) = 2
t(7,5) = 3
t(7,6) = 4

Page 364:

Processor 5:

NEXTBIT(101)[0] = 2, NEXTBIT(101)[1] = 1, NEXTBIT(101)[2] = 1 (repeating over the rounds)

s(5,0) = -1
s(5,3) = 0
s(5,4) = 2
s(5,5) = 2
s(5,6) = 3

t(5,0) = -1
t(5,2) = 0
t(5,3) = 2
t(5,4) = 1
t(5,5) = 3
t(5,6) = 4

Page 365:

Proposition: Suppose there is an infinite number of blocks. For round j≥0 define G(j,k) = {i|NEXTBIT(i)[j mod q]=k} for 0<k≤q, and G(j,0) = {i|0≤i<p}.

After round j, processors of G(j,k) have received all blocks M[j-q+k], for j-q+k≥0

In particular, after round j=M-2+q all processors (in G(j,0)) have received blocks 0, …,M-2; half the processors G(j,1) have received M-1 and the other half a block after M-1, but this was sent as M-1

Proof: Induction on the number of rounds j

Base: For j=0, k=q. Since G(0,q) = {0,1} and processor 1 has message M[0] = M[j-q+k] after round 0, the claim holds

Page 366:

Induction step: Assume claim holds for round j-1; recall G(j,k) = {i|NEXTBIT(i)[j mod q]=k}

Case 0≤k<q:

In round j each processor i in G(j-1,k+1) sends message s(i,j) = j-q+(1-bit(i,j)) NEXTBIT(i)[j mod q] = j-q+k (case analysis, k=0 and k>0) to partner(i,j), and by induction hypothesis i has message M[j-1-q+k+1] = M[j-q+k].

Since G(j,k) = G(j-1,k+1) union {i|partner(i,j) in G(j-1,k+1)}, processors in G(j,k) have M[j-q+k] in round j

Case k=q: G(j,q) = {0, 2^(j mod q)}, and processor 0 sends M[j] to processor 2^(j mod q) in round j

Page 367:

Beyond hypercube

The hypercube result was known long before Bin Jia, similar to the ideas (“Edge Disjoint Spanning Trees”) in the fundamental paper by Johnsson and Ho (1989).

S. Lennart Johnsson, Ching-Tien Ho: Optimum Broadcasting and Personalized Communication in Hypercubes. IEEE Trans. Computers 38(9): 1249-1268 (1989)

Idea: When p is not a power of two, let q = floor(log p), pair each excess process i ≥ 2^q with processor i - 2^q + 1, and use the hypercube algorithm on the processor pairs

Page 368:

[Figure: p = 5 processes; excess process 4 is paired with process 1]

Each processor i has a co(i) processor (cooperating processor). Together i and co(i) play the role of one processor in the hypercube algorithm

Each pair has a representative rep(i)

co(i) =
  0            for i = 0
  2^q - 1 + i  for 0 < i ≤ p - 2^q
  i            for p - 2^q < i < 2^q
  i - 2^q + 1  for 2^q ≤ i < p

rep(i) = i for i < 2^q, co(i) otherwise

unit(i) = {i,co(i)}

Page 369:

unit(i)

t(i,j) from processor in unit(partner(i,j))

s(i,j) to processor in unit(partner(i,j))

Previous message sent from in(i,j) to out(i,j)

Round j: in(i,j): the processor of unit(i) that receives a new block in round j

out(i,j): the processor of unit(i) that sends a block in round j

Note: in(i,j) and out(i,j) may switch from round j to round j+1

Page 370:

out(i,j+1) = (1-bit(rep(i),j)) out(i,j) + bit(rep(i),j) in(i,j)

in(i,0) = rep(i)
out(i,0) = co(rep(i))

in(i,j) = co(out(i,j))

Rationale: out(i,j) sends block s(rep(i),j) to in(partner(rep(i),j),j), and in(i,j) receives block t(rep(i),j) in round j. By the proposition, s(rep(i),j+1) equals s(rep(i),j) if bit(rep(i),j)=0, so processor out(i,j) can continue as out(i,j+1). If bit(rep(i),j)=1, then s(rep(i),j+1) = t(rep(i),j), which was received by in(i,j), so out(i,j+1) switches to in(i,j)

Page 371:

Thus, the role of each processor in unit(i) can be computed in O(1) time per round.

It remains to determine the role of the processors in unit(partner(i,j))

out(i,j+1) = (1-bit(rep(i),j)) out(i,j) + bit(rep(i),j) in(i,j)

in(i,0) = rep(i)
out(i,0) = co(rep(i))

in(i,j) = co(out(i,j))

Page 372:

for (j=0; j<M-1+q; j++) {
  if (co(rank)==rank) { // singleton unit
    MPI_Sendrecv(buffer[s(rank,j)],…,in(partner(rank,j),j),…,
                 buffer[t(rank,j)],…,out(partner(rank,j),j),…,comm);
  } else if (rank==out(rank,j)) { // out processor
    MPI_Sendrecv(buffer[s(rep(rank),j)],…,in(partner(rep(rank),j),j),…,
                 buffer[j-q-1],…,in(rank,j),…,comm);
  } else { // in processor
    MPI_Sendrecv(buffer[j-q-1],…,out(rank,j),…,
                 buffer[t(rep(rank),j)],…,out(partner(rep(rank),j),j),…,comm);
  }
}

Page 373:

if (co(rank)!=rank) { // non-trivial unit
  if (rank==out(rank,j)) { // out processor
    MPI_Sendrecv(buffer[M-1],…,co(rank),…,
                 buffer[M-2],…,co(rank),…,comm);
  } else { // in processor
    MPI_Sendrecv(buffer[M-2],…,co(rank),…,
                 buffer[M-1],…,co(rank),…,comm);
  }
}

One extra round

Page 374:

To determine the role of the processors in the partner unit, it is necessary to compute the number of role switches in constant time:

SWITCH(i) is a q-element array that stores, for each bit position j of i, the number of 1-bits of i at positions 0 to j (inclusive)

u(i,j) = SWITCH(i)[q-1] (j-1)/q + SWITCH(i)[(j-1) mod q]

Lemma:For any i, 0≤i<p, SWITCH(i) can be computed in O(log p) steps
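A minimal sketch of SWITCH and u as just defined (a straightforward O(q) prefix count; the O(log p) computation claimed by the lemma is in the paper; function names and the example value are ours):

#include <stdio.h>

/* SWITCH(i)[j]: number of 1-bits of i at positions 0..j (inclusive) */
static void switch_counts(unsigned i, int q, int sw[]) {
  int ones = 0;
  for (int j = 0; j < q; j++) {
    ones += (i >> j) & 1u;
    sw[j] = ones;
  }
}

/* u(i,j) = SWITCH(i)[q-1]*((j-1)/q) + SWITCH(i)[(j-1) mod q], for j >= 1 */
static int u_switches(int j, int q, const int sw[]) {
  return sw[q - 1] * ((j - 1) / q) + sw[(j - 1) % q];
}

int main(void) {
  int q = 3, sw[3];
  switch_counts(6u, q, sw);                       /* i = 110 */
  for (int j = 0; j < q; j++)
    printf("SWITCH(110)[%d] = %d\n", j, sw[j]);
  printf("u(110,4) = %d\n", u_switches(4, q, sw));
  return 0;
}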

Page 375:

u(i,j) = SWITCH(i)[q-1] (j-1)/q + SWITCH(i)[(j-1) mod q]

out(partner(rep(i),j),j) = (1-bit(v(i,j),0)) co(partner(rep(i),j)) + bit(v(i,j),0) partner(rep(i),j)

v(i,j) = u(i,j) + j/q

out(i,j) = (1-bit(u(i,j),0)) co(rep(i)) + bit(u(i,j),0) rep(i)

Page 376:

Proposition: Before round j, processor in(i,j) has received M[j-q-1] if j-q-1≥0, and processor out(i,j) has received M[s(rep(i),j)] if s(rep(i),j)≥0

Proof: Induction on j, see paper

This shows correctness of the final, extra exchange step. Together with the previous proposition, the main theorem follows

Page 377:

Main Theorem: In the fully connected, 1-ported communication model, broadcast of M blocks can be done optimally in M-1+ceil(log p) communication rounds. In the linear cost model

Tbroadcast(m) = (ceil(log p)-1)α + 2√[(ceil(log p)-1)αβm] + βm

Note: There are at least three other algorithms in the literature achieving the optimal bound; Bin Jia's is arguably the most elegant and easiest to implement.

Page 378:

Another algorithm achieving the same result:

Jesper Larsson Träff, Andreas Ripke: Optimal broadcast for fully connected processor-node networks. J. Parallel Distrib. Comput. 68(7): 887-901 (2008)

Not as elegant; needs an O(p log p)-step precomputation to determine a schedule of ceil(log p) entries for all processes. Challenge: process-local, O(log p)-step schedule computation for each process i, 0≤i<p

Advantage: Can be used for allgather as well, exploiting allgather ≈ p bcasts. Can be used for optimal allgatherv

Master thesis?

Page 379:

Yet more algorithms with round- and bandwidth-optimal broadcast:

Amotz Bar-Noy, Shlomo Kipnis, Baruch Schieber: Optimal multiple message broadcasting in telephone-like communication systems. Discrete Applied Mathematics 100(1-2): 1-15 (2000)

Oh-Heum Kwon, Kyung-Yong Chwa: Multiple message broadcasting in communication networks. Networks 26(4): 253-261 (1995)

Page 380:

Alltoall communication

Each MPI process has an individual block of data ("sendbuf[i]") for each other process in comm (including itself)
Each MPI process receives an individual block of data ("recvbuf[i]") from each other process in comm (including itself)

MPI_Alltoall(sendbuf,scount,stype,recvbuf,rcount,rtype,comm);

Alltoall, total exchange, transpose, …
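For reference, a minimal complete call (one int per block is an assumption for illustration; note that scount/rcount are per block, not totals):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int *sendbuf = malloc(size * sizeof(int));   /* block i goes to process i */
  int *recvbuf = malloc(size * sizeof(int));   /* block i comes from process i */
  for (int i = 0; i < size; i++) sendbuf[i] = rank * 100 + i;

  MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

  free(sendbuf); free(recvbuf);
  MPI_Finalize();
  return 0;
}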

Page 381:

Bisection width lower bound for alltoall communication

Definition: The bisection width of a network is the minimum number of edges that have to be removed to partition the network into two parts with floor(p/2) and ceil(p/2) nodes, respectively.

Observation: Let the bisection width of a k-ported, bidirectional network be w. Any alltoall algorithm that sends/receives each block separately requires at least

floor(p/2)·ceil(p/2) / (w·k)

communication rounds (and, trivially, at least p-1 rounds)

Page 382:

[Figure: the network cut into two parts of floor(p/2) and ceil(p/2) nodes; bisection width w]

Argument: In any partition of the p processors into two roughly equal-sized parts, each of the floor(p/2) processors on one side has ceil(p/2) blocks that must cross the cut. In each round, the minimum cut can accommodate at most w simultaneous communication operations on each of the k ports, so at most w·k blocks cross per round. Thus the number of rounds is at least floor(p/2)·ceil(p/2)/(w·k)

Page 383:

Examples:

Bisection width of fully connected network: w = floor(p/2)·ceil(p/2)

Bisection width of linear array (ring): 1 (2); alltoall takes at least p²/4 rounds (p²/8 rounds)

Bisection width of 2-d torus: w = 2√p; alltoall takes at least (p√p)/8 rounds

Bisection width of 3-d torus: w = 2(p^(1/3))²; alltoall takes at least p·p^(1/3)/8 rounds

Page 384:

for (i=1; i<=size; i++) {
  prev = (rank-i+size)%size;
  next = (rank+i)%size;
  MPI_Sendrecv(sendbuf[next],…,next,…,
               recvbuf[prev],…,prev,…,comm);
}

Direct algorithm for fully connected networks: Each processor sends to and receives from each other processor, including itself

MPI_Alltoall(sendbuf,scount,stype,recvbuf,rcount,rtype,comm);

Direct alltoall, fully connected network
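A fleshed-out sketch of the direct algorithm above, with the elided arguments filled in under the assumption of scount elements of type MPI_DOUBLE per block (tag 0, no error handling):

#include <mpi.h>

void alltoall_direct(const double *sendbuf, double *recvbuf,
                     int scount, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  for (int i = 1; i <= size; i++) {
    int prev = (rank - i + size) % size;        /* receive block from prev */
    int next = (rank + i) % size;               /* send block to next */
    MPI_Sendrecv(sendbuf + (size_t)next * scount, scount, MPI_DOUBLE, next, 0,
                 recvbuf + (size_t)prev * scount, scount, MPI_DOUBLE, prev, 0,
                 comm, MPI_STATUS_IGNORE);
  }
}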

Page 385:

Talltoall(m) = (p-1)α + (p-1)/p βm

Direct algorithm, p-1 communication rounds (no communication in last round):
• High latency, (p-1)α
• Optimal in the bandwidth term, since (p-1)/p m units of data have to be sent and received by each processor
• Exploits fully bidirectional, send-receive communication

for (i=1; i<=size; i++) {
  prev = (rank-i+size)%size;
  next = (rank+i)%size;
  MPI_Sendrecv(sendbuf[next],…,next,…,
               recvbuf[prev],…,prev,…,comm);
}

Page 386:

Less tightly coupled alltoall: All scheduling decisions left to MPI library and communication subsystem

MPI_Request request[2*size];
MPI_Status status[2*size];

for (i=1; i<=size; i++) {
  prev = (rank-i+size)%size;
  next = (rank+i)%size;
  MPI_Irecv(recvbuf[prev],…,prev,…,comm,&request[2*(i-1)]);
  MPI_Isend(sendbuf[next],…,next,…,comm,&request[2*i-1]);
}
MPI_Waitall(2*size,request,status);
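The same loop with the arguments filled in (again an illustrative sketch assuming blocks of scount MPI_DOUBLE; the request array is heap-allocated instead of the stack arrays on the slide, and MPI_STATUSES_IGNORE replaces the explicit status array):

#include <mpi.h>
#include <stdlib.h>

void alltoall_nonblocking(const double *sendbuf, double *recvbuf,
                          int scount, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  MPI_Request *request = malloc(2 * size * sizeof(MPI_Request));
  for (int i = 1; i <= size; i++) {
    int prev = (rank - i + size) % size;
    int next = (rank + i) % size;
    MPI_Irecv(recvbuf + (size_t)prev * scount, scount, MPI_DOUBLE, prev, 0,
              comm, &request[2 * (i - 1)]);
    MPI_Isend(sendbuf + (size_t)next * scount, scount, MPI_DOUBLE, next, 0,
              comm, &request[2 * i - 1]);
  }
  MPI_Waitall(2 * size, request, MPI_STATUSES_IGNORE);
  free(request);
}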

Page 387:

Telephone exchange algorithm (1-factoring)

Lemma: The fully connected network (complete graph with self-loops) can be partitioned into p subgraphs in each of which each node has degree exactly 1 (allowing self-loops): 1-factors

Proof: Define for each node u the i'th partner as vi(u) = (i-u+p) mod p. Since vi(vi(u)) = (i-(i-u+p) mod p) mod p = u (*), each node has degree 1, and the i'th factor can be defined as the set of edges (u,vi(u)) for i=0,…,p-1

(*) if i-u≥0, then (i-u+p) mod p = i-u, and u mod p = u; if i-u<0, then (i-u+p) mod p = i-u+p

Page 388:

for (i=0; i<size; i++) {
  int partner = (i-rank+size)%size;
  MPI_Sendrecv(sendbuf[partner],…,partner,
               recvbuf[partner],…,partner,
               comm,MPI_STATUS_IGNORE);
}

1-factor self-loop algorithm for alltoall

Works for all p (even, odd, not power-of-two, …)

Talltoall(m) = pα + βm

Page 389:

Example, p=4

[Figure: complete graph with self-loops on processors 0, 1, 2, 3]

Page 390:

Example, p=4, Round 0:

v0(0) = (0-0) mod p = 0
v0(1) = (0-1) mod p = 3
v0(2) = (0-2) mod p = 2
v0(3) = (0-3) mod p = 1

[Figure: the corresponding 1-factor]

Page 391:

Example, p=4, Round 1:

v1(0) = (1-0) mod p = 1
v1(1) = (1-1) mod p = 0
v1(2) = (1-2) mod p = 3
v1(3) = (1-3) mod p = 2

[Figure: the corresponding 1-factor]

Page 392:

Example, p=4, Round 2:

v2(0) = (2-0) mod p = 2
v2(1) = (2-1) mod p = 1
v2(2) = (2-2) mod p = 0
v2(3) = (2-3) mod p = 3

[Figure: the corresponding 1-factor]

Page 393:

Example, p=4, Round 3:

v3(0) = (3-0) mod p = 3
v3(1) = (3-1) mod p = 2
v3(2) = (3-2) mod p = 1
v3(3) = (3-3) mod p = 0

[Figure: the corresponding 1-factor]

Page 394:

Example, p=5

[Figure: complete graph with self-loops on processors 0-4]

Page 395:

Example, p=5, Round 0:

v0(0) = (0-0) mod p = 0
v0(1) = (0-1) mod p = 4
v0(2) = (0-2) mod p = 3
v0(3) = (0-3) mod p = 2
v0(4) = (0-4) mod p = 1

[Figure: the corresponding 1-factor]

Page 396:

Example, p=5, Round 1:

v1(0) = (1-0) mod p = 1
v1(1) = (1-1) mod p = 0
v1(2) = (1-2) mod p = 4
v1(3) = (1-3) mod p = 3
v1(4) = (1-4) mod p = 2

[Figure: the corresponding 1-factor]

Page 397:

Example, p=5, Round 2:

v2(0) = (2-0) mod p = 2
v2(1) = (2-1) mod p = 1
v2(2) = (2-2) mod p = 0
v2(3) = (2-3) mod p = 4
v2(4) = (2-4) mod p = 3

[Figure: the corresponding 1-factor]

Page 398:

Example, p=5, Round 3:

v3(0) = (3-0) mod p = 3
v3(1) = (3-1) mod p = 2
v3(2) = (3-2) mod p = 1
v3(3) = (3-3) mod p = 0
v3(4) = (3-4) mod p = 4

[Figure: the corresponding 1-factor]

Page 399:

Example, p=5, Round 4:

v4(0) = (4-0) mod p = 4
v4(1) = (4-1) mod p = 3
v4(2) = (4-2) mod p = 2
v4(3) = (4-3) mod p = 1
v4(4) = (4-4) mod p = 0

[Figure: the corresponding 1-factor]

Page 400:

Improvements:
• p even: p-1 communication rounds needed
• p odd: p communication rounds needed

Lemma: For even p, there exists a 1-factorization of the fully connected network into p-1 1-factors. For odd p, there exists an almost-1-factorization into p almost-1-factors (each factor has one unpaired node)

Proof: Folklore, see for instance

Eric Mendelsohn, Alexander Rosa: One-factorizations of the complete graph - A survey. Journal of Graph Theory 9(1): 43-65 (1985)

Page 401:

Even p construction: The i'th 1-factor has the edges (i,p-1) and ((j+i) mod (p-1), (p-1-j+i) mod (p-1)) for j=1,…,p/2-1. Each node clearly has degree 1, and each edge of the network is in exactly one factor

Round i, 0≤i<p-1, for each node u:

[Figure: nodes 0,…,p-2 on a circle with node p-1 in the center]

Rotate along the circle; in round i, node u communicates with node v:

if (u==p-1) v = i;
else {
  uu = (u-i+(p-1))%(p-1);
  if (uu==0) v = p-1;
  else v = ((p-1-uu)+i)%(p-1);
}
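A small check of the even-p construction (a sketch; it only verifies for p = 6 that every round is a perfect matching, i.e. v(v(u)) = u and v(u) ≠ u; the helper name is ours):

#include <assert.h>
#include <stdio.h>

/* partner of node u in round i of the even-p 1-factorization above */
static int factor_partner(int u, int i, int p) {
  if (u == p - 1) return i;
  int uu = (u - i + (p - 1)) % (p - 1);
  if (uu == 0) return p - 1;
  return ((p - 1 - uu) + i) % (p - 1);
}

int main(void) {
  int p = 6;
  for (int i = 0; i < p - 1; i++) {
    printf("round %d:", i);
    for (int u = 0; u < p; u++) {
      int v = factor_partner(u, i, p);
      assert(v != u && factor_partner(v, i, p) == u);   /* perfect matching */
      printf(" (%d,%d)", u, v);
    }
    printf("\n");
  }
  return 0;
}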

Page 402:

Example, p=6

[Figure: complete graph on processors 0-5]

Page 403:

Example, p=6, Round 0:

[Figure: the 1-factor of round 0]

Page 404:

Example, p=6, Round 1:

[Figure: the 1-factor of round 1]

Page 405:

Example, p=6, Round 2:

[Figure: the 1-factor of round 2]

Page 406:

Example, p=6, Round 3:

[Figure: the 1-factor of round 3]

Page 407:

Example, p=6, Round 4:

[Figure: the 1-factor of round 4]

Talltoall(m) = (p-1)α + (p-1)/p βm

Page 408:

Odd p construction: Use the even p construction for p' = p+1 and remove in each round the edge to the virtual node p'-1 (or use the previous construction)

[Figure: nodes 0,…,p'-2 on a circle with the virtual node p'-1 in the center]

Folklore: For p a power of two, use v = u XOR i, i=0,…,p-2

Page 409:

The power of the hypercube: Fewer communication rounds

Message combining:

In round k, 0≤k<d, each processor sends/receives 2^k blocks per processor of the 2^(d-k-1)-dimensional neighboring hypercube

Neighboring hypercube of processor i in round k: flip bit d-1-k

Total amount of data per processor per round: 2^k · 2^(d-k-1) · m/p = 2^(d-1) · m/p = m/2

Page 410:

Combine messages: In round k, d>k≥0, each processor sends/receives 2^k blocks per processor of the 2^(d-k-1)-dimensional neighboring hypercube

[Figure: processor 110 (= 6) with its blocks for destinations 0-7]

Page 411:

Round 2: processor 110 exchanges combined blocks with processor 010

[Figure: blocks sent and received by processor 110 in round 2]

Page 412:

Round 1: processor 110 exchanges combined blocks with processor 100

[Figure: blocks sent and received by processor 110 in round 1]

Page 413:

Round 0: processor 110 exchanges combined blocks with processor 111

[Figure: blocks sent and received by processor 110 in round 0]

Page 414:

Talltoall(m) = (log p)(α+βm/2)

This tradeoff is optimal (see later)

Page 415:

The power of the circulant graph

J. Bruck, Ching-Tien Ho, S. Kipnis, E. Upfal, D. Weathersby: Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems. IEEE Trans. Parallel Distrib. Syst. 8(11): 1143-1156 (1997)

The hypercube result does not generalize to fully connected networks with a non-power-of-2 number of processors.

But the pattern (circulant graph) used in the dissemination allgather algorithm does!

Page 416:

Three (3) steps:

1. Processor-local reordering of send blocks to get a symmetric situation for all processors
2. Routing in ceil(log p) rounds with message combining
3. Local reordering to get blocks into rank order

Page 417:

ij = block from processor i to processor j

[Figure: for p=7, the blocks i0,…,i6 held by each processor i]

Page 418:

Step 1:

Local rotate upwards; processor i by i positions

Page 419:

[Figure: blocks after Step 1; each processor i has rotated its blocks upwards by i positions. Highlighted: a block already at its right destination processor]

Page 420:

After step 1, symmetric situation:

For each process i: the block in row j has to be sent to processor (i+j) mod p ("shift")

Idea: Write the row index j as a binary number (e.g. j = 5 = (101)₂ = 2² + 2⁰), and shift according to the 1-bits

Page 421:

Step 2: ceil(log p) rounds

Round k: For process i, all blocks destined to a processor where bit k = 1 are combined and sent together to processor (i+2^k) mod p

Pages 422-430:

[Figures: the block matrix during the rounds of Step 2; in each round the selected blocks are combined and shifted]

Page 431:

Step 3:

Local reverse and rotate downwards, processor i by i+1 positions
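To fix the three steps in code, here is a compact C/MPI sketch of the whole algorithm. It is the plain copying variant described above (not the zero-copy variant discussed later); blocks of bs doubles and caller-provided scratch arrays tmp and pack (each p*bs doubles) are assumptions for illustration.

#include <mpi.h>
#include <string.h>

void alltoall_bruck(const double *sendbuf, double *recvbuf, int bs,
                    double *tmp, double *pack, MPI_Comm comm) {
  int rank, p;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);

  /* Step 1: rotate so that position j holds the block for (rank+j) mod p */
  for (int j = 0; j < p; j++)
    memcpy(tmp + (size_t)j * bs, sendbuf + (size_t)((rank + j) % p) * bs,
           bs * sizeof(double));

  /* Step 2: ceil(log2 p) rounds of combined sends */
  for (int k = 1; k < p; k <<= 1) {
    int to = (rank + k) % p, from = (rank - k + p) % p;
    int n = 0;
    for (int j = 0; j < p; j++)             /* pack the blocks with bit k set */
      if (j & k) memcpy(pack + (size_t)(n++) * bs, tmp + (size_t)j * bs,
                        bs * sizeof(double));
    MPI_Sendrecv_replace(pack, n * bs, MPI_DOUBLE, to, 0, from, 0,
                         comm, MPI_STATUS_IGNORE);
    n = 0;
    for (int j = 0; j < p; j++)             /* unpack into the same positions */
      if (j & k) memcpy(tmp + (size_t)j * bs, pack + (size_t)(n++) * bs,
                        bs * sizeof(double));
  }

  /* Step 3: position j now holds the block from source (rank-j) mod p, so
     the final reordering is a reverse-and-rotate copy into recvbuf */
  for (int s = 0; s < p; s++)
    memcpy(recvbuf + (size_t)s * bs, tmp + (size_t)((rank - s + p) % p) * bs,
           bs * sizeof(double));
}

A block at position j is forwarded exactly in the rounds where bit k of j is set, advancing by 2^k each time, so it ends up j ranks away from its origin, i.e. at its destination.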

Pages 432-435:

[Figures: the block matrix during and after Step 3 (final local reordering); at the end each processor j holds the blocks ij from all processors i]

Page 436:

Talltoall(m) = ceil(log p)(α + β floor(m/2)) + O(m)

Total time: ceil(log p) communication rounds (Step 2), O(m) copy-shift (Step 1), O(m) reverse-shift (Step 3)

The algorithm is used in some MPI libraries (mpich, mvapich, OpenMPI) for small m.

Drawback: Many copy/pack-unpack operations (Steps 1 and 3)

Page 437:

Jupiter: 36 nodes, one MPI process/node

Page 438:

Jupiter: 36 nodes, 16 MPI processes/node

The NEC MPI library does different things for MPI_Alltoall and MPI_Alltoallv

Page 439:

Jesper Larsson Träff, Antoine Rougier, Sascha Hunold:Implementing a classic: zero-copy all-to-all communication with mpi datatypes. ICS 2014: 135-144

Note: Step 1 and Step 3 can be eliminated: Reverse the communication direction, shift implicitly with each communication step; use MPI derived datatypes to avoid explicit packing/unpacking

Next example shows how to eliminate Step 3

Page 440:

MPI sendbuf

[Figure: for p=7, the send blocks ij held by each processor i]

Page 441:

Step 1:

Copy sendbuf blocks into recvbuf in reverse order

Pages 442-443:

[Figure: recvbuf contents after Step 1 (send blocks copied in reverse order)]

Page 444:

Step 2: ceil(log p) rounds

Round k: For process i, all blocks destined to a processor with bit k = 1 are sent to processor (i-2^k) mod p

Block j is in row (i+j) mod p; block j is sent if bit k of j is 1

Pages 445-452:

[Figures: recvbuf contents after each round of Step 2; at the end each processor i holds exactly the blocks ji destined to it]

Page 453:

Done! (no step 3)

Step 1 and all intermediate pack/unpack eliminated by using MPI derived datatype and double buffering per block (see paper)

Page 454:

Bandwidth/Latency trade-offs for alltoall communication

In fully connected, k-ported, bidirectional networks:

Theorem:
• Any allgather algorithm requires at least ceil(log_{k+1} p) communication rounds
• Any alltoall algorithm requires at least ceil(log_{k+1} p) communication rounds

Theorem:
• Any alltoall algorithm that transfers exactly (p-1)/p m Bytes requires at least p-1 communication rounds
• Any alltoall algorithm that uses ceil(log_{k+1} p) communication rounds must transfer at least m/(k+1) log_{k+1} p Bytes

Page 455:

Trade-off results from:

J. Bruck, Ching-Tien Ho, S. Kipnis, E. Upfal, D. Weathersby: Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems. IEEE Trans. Parallel Distrib. Syst. 8(11): 1143-1156 (1997)

Page 456:

Another lecture: Algorithms for 2d-ported, d-dimensional tori

Lower bounds to meet:
• Diameter
• Bisection
• Edge congestion

Algorithms look different, tricky, still open problems (alltoall)

First approximation (van de Geijn): Use combinations of linear ring algorithms along the dimensions

Page 457:

Algorithms summary (communication costs only)

p-processor linear array:

Collective      T(m) in linear cost model
Gather/Scatter  (p-1)α + (p-1)/p βm
Allgather       same
Reduce-scatter  same
Bcast/Reduce    (p-2)α + o(pm) + βm   (pipelining)
Scan/Exscan     same
All-to-all      not covered

Page 458:

Fully connected (tree, hypercube):

Collective      T(m) in linear cost model              Algorithm
Gather/Scatter  ceil(log p)α + (p-1)/p βm              binomial tree
Allgather       same                                   circulant graph
Reduce-scatter
Bcast           2ceil(log p)α + o(pm) + 2βm            binary tree, pipelined
                2ceil(log p)α + 2o(pm) + βm            2-trees
                ceil(log p)α + o(pm) + βm              pipelining, Bin Jia
Scan/Exscan     ceil(log p)(α + βm)                    Hillis-Steele
                4ceil(log p)α + 2o(pm) + 2βm           2-trees
All-to-all      (p-1)(α + 1/p βm)                      direct, fully connected, 1-factor
                ceil(log p)(α + β floor(m/2))          circulant (Bruck)

Page 459:

Back to MPI (example: changing data distributions)

[Figure: data redistributed between processes Proc 0, …, Proc p-1]

Page 460:

The MPI collective interfaces

MPI_Bcast(buffer,count,type,root,comm);

buffer: address of data (address)

count: number of elements (int)
type: MPI datatype describing the layout (= static structure) of an element

... Contiguous buffer of consecutive elements

... Noncontiguous buffer (homogeneous)

Page 461:

The MPI collective interfaces

MPI_Bcast(buffer,count,type,root,comm);

buffer: address of data (address)

count: number of elements (int)
type: MPI datatype describing the layout (= static structure) of an element

… contiguous buffer of consecutive elements
… noncontiguous buffer (heterogeneous: different types of elements)

Page 462:

MPI typed data communication

Order, position, and number of elements are determined by the count and datatype arguments:

count repetitions of the layout described by datatype; the i'th repetition, 0≤i<count, is at relative offset i*extent(datatype) in buffer

Basic elements (MPI predefined datatypes): int (MPI_INT), char (MPI_CHAR), double (MPI_DOUBLE), …

All communication operations: a sequence of typed, basic elements is sent and received in a predefined order

Page 463:

[Figure: a datatype repeated in a buffer; successive repetitions are extent(datatype) apart]

Important (very): extent(datatype) is just an arbitrary (almost…) unit associated with the datatype, used for calculating offsets.

Datatypes always have an extent. There are default rules for how the extent is set, but it can also be set explicitly. Sometimes a very powerful tool

Page 464:

MPI datatypes (layouts), formally

Data layout: finite sequence of (basic element, offset) pairs
Type map: finite sequence of (basic type, offset) pairs
Type signature: finite sequence of (basic type)

Sequence: the basic elements in a layout have an order; elements are communicated in that order

[Figure: example layout with elements numbered 0 2 1 4 3 in memory order]

Page 465:

[Figure: a type map (layout) of basetype elements numbered 0…8 at their offsets, and the corresponding type signature listed in element order]

True extent: difference between the offset of the first and the last element

Basetype elements correspond to C (Fortran) int, double, float, char, …

Page 466:

MPI typed data communication, again

Sequences of elements described by the type signature of the count and datatype arguments are communicated.

MPI communication is correct if the signature of the sent elements matches the signature of the received elements.

Matching:
• In collective operations, send and receive signatures must be identical
• In point-to-point and one-sided communication, the send signature must be a prefix of the receive signature

These requirements are not checked; it is the user's responsibility to ensure that types are used correctly. No type casting is performed in communication

Page 467:

MPI_Scatter(sendbuf,sendcount,sendtype,
            recvbuf,recvcount,recvtype,root,comm);

sendbuf,sendcount,sendtype: describes one block of data elements to be sent to one other processor
recvbuf,recvcount,recvtype: describes one block of data elements to be received from some other processor

Block of data elements: count repetitions of type, with the j'th repetition of the i'th block at offset (j+i*count)*extent(type) from buf, 0≤i<size, 0≤j<count
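A sketch of this block convention for MPI_Scatter, with a hypothetical block size and buffer names; assumes MPI is initialized and comm is valid:

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch: the root scatters size blocks of blocksize doubles, block i to rank i */
    void scatter_example(MPI_Comm comm, int blocksize)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double *sendblocks = NULL;
        if (rank == 0)   /* only the root needs the full send buffer */
            sendblocks = malloc((size_t)size * blocksize * sizeof(double));
        double *recvblock = malloc((size_t)blocksize * sizeof(double));

        MPI_Scatter(sendblocks, blocksize, MPI_DOUBLE,
                    recvblock, blocksize, MPI_DOUBLE, 0, comm);

        free(recvblock);
        free(sendblocks);   /* free(NULL) is a no-op on non-roots */
    }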

Page 468:

MPI_Gather(sendbuf,sendcount,sendtype,
           recvbuf,recvcount,recvtype,root,comm);

Example: different send/recv types, with the same signature

[Figure: sendbuf at rank i; at the root, the block from rank i is placed at offset i*recvcount*extent(recvtype) in recvbuf]

Root must allocate memory for size(comm) blocks of recvcount*[true]extent(recvtype) elements, otherwise BIG TROUBLE
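A sketch of the point above, with hypothetical sizes: every rank sends four doubles as basic elements, while the root receives each contribution as one block of a contiguous derived type; the two signatures match (four doubles per rank):

    #include <mpi.h>
    #include <stdlib.h>

    void gather_example(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double send[4] = { rank, rank, rank, rank };

        MPI_Datatype block4;
        MPI_Type_contiguous(4, MPI_DOUBLE, &block4);
        MPI_Type_commit(&block4);

        double *recv = NULL;
        if (rank == 0)   /* size blocks of recvcount*extent(recvtype) doubles */
            recv = malloc((size_t)size * 4 * sizeof(double));

        MPI_Gather(send, 4, MPI_DOUBLE,   /* four basic elements sent */
                   recv, 1, block4,       /* received as one 4-double block */
                   0, comm);

        MPI_Type_free(&block4);
        free(recv);
    }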

Page 469:

MPI_Scatter(sendbuf,sendcount,sendtype,
            recvbuf,recvcount,recvtype,root,comm);

Consistency rules:

• The signature of a send block must be exactly the same as the signature of the corresponding receive block (the sequence of basic elements sent by one process must be the same as the sequence expected by the receiving process)
• Consistent values for the other arguments, e.g. same root, same op, …

Page 470:

MPI_Scatter(sendbuf,sendcount,sendtype,
            recvbuf,recvcount,recvtype,root,comm);

Consistency rules:

• The signature of a send block must be exactly the same as the signature of the corresponding receive block (the sequence of basic elements sent by one process must be the same as the sequence expected by the receiving process)

Recall: the point-to-point and one-sided models are less strict: the send signature must be a prefix of the receive signature

Page 471:

MPI collective interfaces (in applications) and algorithms

• (Library-internal) functionality is needed to handle data layouts (e.g., pipelining, copying between temporary buffers, …)
• Algorithm analysis assumes that all processors participate at the same time: communication rounds and cost are determined by the communication cost model

MPI (applications): no guarantee, no requirement that processes call a collective operation at the same time

Page 472:

Research on the sensitivity of MPI (collective) operations to
• non-synchronized process arrival patterns
• (OS) "noise"
is still needed

Question: How bad can the algorithms be under non-synchronized process arrival (and progress) patterns?

Answer: Not much is known…

Page 473:

Torsten Hoefler, Timo Schneider, Andrew Lumsdaine: The Effect of Network Noise on Large-Scale Collective Communications. Parallel Processing Letters 19(4): 573-593 (2009)

Ahmad Faraj, Pitch Patarasuk, Xin Yuan: A Study of Process Arrival Patterns for MPI Collective Operations. International Journal of Parallel Programming 36(6): 543-570 (2008)

Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin: The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q. SC 2003: 55

Page 474:

MPI collective correctness requirement:

If some process in comm calls collective operation MPI_Coll, then eventually all other processes in comm must call MPI_Coll (with consistent arguments), and no process must call any other collective on comm before MPI_Coll (assumption: all collective calls before MPI_Coll have completed)

Rank i:

MPI_Bcast(…,comm);
MPI_Bcast(…,comm);
MPI_Bcast(…,comm);

… is legal

Page 475:

MPI collective correctness requirement:

If some process in comm calls collective operation MPI_Coll, then eventually all other processes in comm must call MPI_Coll (with consistent arguments), and no process must call any other collective on comm before MPI_Coll (assumption: all collective calls before MPI_Coll have completed)

Rank i:                 Some other rank:
MPI_Bcast(…,comm);      MPI_Gather(…,comm);
MPI_Gather(…,comm);     MPI_Bcast(…,comm);

… is not (collectives on comm are called in different orders by different processes)

Page 476:

MPI collective correctness requirement:

If some process in comm calls collective operation MPI_Coll, then eventually all other processes in comm must call MPI_Coll (with consistent arguments), and no process must call any other collective on comm before MPI_Coll (assumption: all collective calls before MPI_Coll have completed)

Rule: MPI processes must call collectives on each communicator in the same sequence

Page 477:

Change distribution: process i shall send a block of m/p x n/p elements to process j, j=0, …, p-1 — a transpose of submatrices, i.e., alltoall communication

MPI_Alltoall(rowmatrix,1,coltype,
             colmatrix,m/p*n/p,MPI_DOUBLE,comm);

[Figure: rowmatrix on process i (m/p rows of n elements) split into m/p x n/p blocks; block j is sent to process j and placed into colmatrix there]

Recall: matrices in C are stored in row-major order

Page 478:

MPI_Alltoall(rowmatrix,1,coltype,
             colmatrix,m/p*n/p,MPI_DOUBLE,comm);

[Figure: rowmatrix viewed as p consecutive m/p x n/p blocks, each described by coltype; the i'th block is sent to rank i]

Page 479:

Process i stores an m/p x n matrix (rowmatrix) in row-major order.

An m/p x n/p submatrix consists of m/p rows of n/p elements each, strided n elements apart.

This layout can be described as an MPI vector datatype

Page 480:

MPI_Alltoall(rowmatrix,1,coltype,
             colmatrix,m/p*n/p,MPI_DOUBLE,comm);

Legal use of datatypes: different send and receive types

Process root:   MPI_Bcast(buffer,1,coltype,root,comm);
Non-root:       MPI_Bcast(buffer,1,rowtype,root,comm);

The number of basic elements must be the same for all pairs of processes exchanging data; the sequence of basic datatypes (int, float, float, int, char, …) must be the same: the type signature

Page 481:

coltype: derived datatype that describes m/p rows of n/p elements each, taken from n-element rows

MPI_Type_vector(m/p,n/p,n,MPI_DOUBLE,&coltype);
MPI_Type_commit(&coltype);

… is wrong: the extent of the datatype is used to compute the offset of the next m/p x n/p block in MPI_Alltoall

Page 482:

[Figure: the extent needed so that successive m/p x n/p blocks in rowmatrix start at the right offsets]

coltype: derived datatype that describes m/p rows of n/p elements each, taken from n-element rows

MPI_Type_vector(m/p,n/p,n,MPI_DOUBLE,&coltype);
MPI_Type_commit(&coltype);

Correct extent for redistribution

Page 483:

coltype: derived datatype that describes m/p rows of n/p elements each, taken from n-element rows, resized to an extent of n/p doubles

MPI_Type_vector(m/p,n/p,n,MPI_DOUBLE,&cc);
MPI_Type_create_resized(cc,0,n/p*sizeof(double),&coltype);
MPI_Type_commit(&coltype);

Correct extent for redistribution
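Putting the pieces together, a sketch of the redistribution under the slides' assumptions (p divides m and n; rowmatrix holds m/p rows of n doubles in row-major order; colmatrix has room for m*n/p doubles); function and parameter names are illustrative:

    #include <mpi.h>

    void redistribute(double *rowmatrix, double *colmatrix,
                      int m, int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);

        MPI_Datatype cc, coltype;
        /* m/p rows of n/p doubles, successive rows n doubles apart */
        MPI_Type_vector(m / p, n / p, n, MPI_DOUBLE, &cc);
        /* resize so that block i starts n/p doubles after block i-1 */
        MPI_Type_create_resized(cc, 0, (MPI_Aint)((n / p) * sizeof(double)),
                                &coltype);
        MPI_Type_commit(&coltype);

        MPI_Alltoall(rowmatrix, 1, coltype,
                     colmatrix, m / p * n / p, MPI_DOUBLE, comm);

        MPI_Type_free(&coltype);
        MPI_Type_free(&cc);
    }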

Page 484:

Change distribution: process i shall send a block of m/p x n/p elements to process j, j=0, …, p-1

MPI_Alltoall(rowmatrix,1,coltype,
             colmatrix,m/p*n/p,MPI_DOUBLE,comm);

Each process sends rowmatrix+i*1*extent(coltype) to process i, and receives colmatrix+i*m/p*n/p*extent(MPI_DOUBLE) from process i

Page 485:

Example: 2d-stencil 5-point computation

Each process i communicates with 4 neighbors and exchanges the data at the border of its own submatrix:
• Vector datatype for the first and last column: a column can be described as an MPI vector type (see the sketch below)
• The communication pattern can be implemented with point-to-point, one-sided, and MPI 3.0 neighborhood collective communication
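A sketch of the column datatype, assuming a local rows x cols submatrix of doubles stored in row-major order (names are illustrative):

    #include <mpi.h>

    /* One column is rows doubles, strided cols elements apart */
    void make_column_type(int rows, int cols, MPI_Datatype *coltype)
    {
        MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, coltype);
        MPI_Type_commit(coltype);
        /* first column starts at &u[0], last column at &u[cols-1];
           usable in point-to-point, one-sided, or neighborhood collectives */
    }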

Page 486:

Example: Round k of the Bruck all-to-all algorithm

[Figure: the non-contiguous set of blocks exchanged in round k, communicated with MPI_Sendrecv]

The layout can be described as:
• an MPI vector followed by a contig followed by a contig followed by a vector
• an MPI indexed type
• an MPI struct
• …

More in:

J. Träff, A. Rougier, S. Hunold: Implementing a classic. ICS 2014

Page 487:

(Research) Questions:

• Are all descriptions of the same data layout equally good (performance)?
• Is there a best possible description?
• How expensive is it to compute a best description?

Robert Ganian, Martin Kalany, Stefan Szeider, Jesper Larsson Träff: Polynomial-Time Construction of Optimal MPI Derived Datatype Trees. IPDPS 2016: 638-647
Alexandra Carpen-Amarie, Sascha Hunold, Jesper Larsson Träff: On the Expected and Observed Communication Performance with MPI Derived Datatypes. EuroMPI 2016: 108-120

Master thesis

Page 488:

MPI derived datatypes to describe application data layouts

MPI_Type_contiguous(count,oldtype,&newtype);

Basic type: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR, … – or a previously defined, derived datatype

New derived datatypes are built from previously defined ones (which do not have to be committed) using the MPI type constructors

[Figure: count consecutive repetitions of oldtype; the constructor defines the type signature (order of elements) and the extent]

Page 489:

MPI_Type_vector(n,block,stride,oldtype,&newtype);

[Figure: n blocks of block elements each, successive blocks stride elements apart; the extent ends with the last element of the last block]

Note: the vector extent does not include the stride (full block) of the last block. Both block (count) and stride are in units of extent(oldtype). The constructor defines the type signature (order of elements)

MPI_Type_create_hvector allows the stride to be given in bytes
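A small sketch checking the extent rule stated above; the chosen numbers are illustrative:

    #include <mpi.h>
    #include <stdio.h>

    void vector_extent_demo(void)
    {
        MPI_Datatype vec;
        MPI_Aint lb, extent;

        MPI_Type_vector(3, 2, 4, MPI_DOUBLE, &vec);   /* n=3, block=2, stride=4 */
        MPI_Type_get_extent(vec, &lb, &extent);
        /* expected: ((n-1)*stride + block) * extent(oldtype)
           = (2*4 + 2) * 8 = 80 bytes; the trailing gap is not included */
        printf("lb=%ld extent=%ld\n", (long)lb, (long)extent);
        MPI_Type_free(&vec);
    }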

Page 490:

MPI_Type_indexed(n,block[],displacement[],oldtype,&newtype);

MPI_Type_create_indexed_block(n,block,displacement[],oldtype,&newtype);

[Figure: n blocks of varying (or fixed) length at the given element displacements]

There are also constructors with byte displacements
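A sketch of the fixed-blocklength variant, with illustrative displacements:

    #include <mpi.h>

    /* Pick out 3 blocks of 2 doubles at element displacements 0, 5, 9 */
    void make_indexed_type(MPI_Datatype *newtype)
    {
        int displacements[3] = { 0, 5, 9 };
        MPI_Type_create_indexed_block(3, 2, displacements, MPI_DOUBLE, newtype);
        MPI_Type_commit(newtype);
    }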

Page 491:

MPI_Type_create_struct(n,block[],displacement[],oldtype[],&newtype);

• Derived datatypes make it possible to (recursively) describe any layout of data in memory
• Derived datatypes can be used in all communication operations: point-to-point, one-sided, collective
• They are essential in MPI-IO

Before use in communication:  MPI_Type_commit(&type);
After use:                    MPI_Type_free(&type);
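A sketch of a struct type for a hypothetical C struct, using offsetof for the byte displacements and resizing the extent to sizeof the struct so that arrays of it can be communicated with a count:

    #include <mpi.h>
    #include <stddef.h>

    typedef struct { int tag; double coord[3]; } particle_t;

    void make_particle_type(MPI_Datatype *newtype)
    {
        int blocklens[2]         = { 1, 3 };
        MPI_Aint displs[2]       = { offsetof(particle_t, tag),
                                     offsetof(particle_t, coord) };
        MPI_Datatype oldtypes[2] = { MPI_INT, MPI_DOUBLE };

        MPI_Type_create_struct(2, blocklens, displs, oldtypes, newtype);
        /* resize so the extent covers any trailing padding of the C struct */
        MPI_Datatype tmp = *newtype;
        MPI_Type_create_resized(tmp, 0, (MPI_Aint)sizeof(particle_t), newtype);
        MPI_Type_free(&tmp);
        MPI_Type_commit(newtype);
    }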

Page 492:

Advanced datatype usage

MPI_Type_create_resized(oldtype,lb,extent,&newtype);

[Figure: the same layout before and after resizing; only the extent changes]

MPI_Type_size(type,&size);                  // number of bytes consumed
MPI_Type_get_extent(datatype,&lb,&extent);
MPI_Type_get_true_extent(type,&lb,&extent);

Page 493:

MPI_Type_get_true_extent(type,&lb,&extent);

True extent: the difference between the highest and the lowest offset of a basic datatype in the datatype. The true lower bound is the lowest offset of a basic datatype in the datatype

Use the true extent for (safer) buffer allocation

MPI_BOTTOM: special (null) buffer address that can be used as buffer argument; datatype offsets are then absolute addresses. Use with care
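A sketch of true-extent-based buffer allocation as suggested above, assuming non-negative lower bounds: count repetitions are placed extent apart, and the last repetition spans the true extent:

    #include <mpi.h>
    #include <stdlib.h>

    void *alloc_for_type(MPI_Datatype type, int count)
    {
        MPI_Aint lb, true_lb, extent, true_extent;
        MPI_Type_get_extent(type, &lb, &extent);
        MPI_Type_get_true_extent(type, &true_lb, &true_extent);
        /* (count-1) full extents plus the true extent of the last repetition */
        return malloc((size_t)((count - 1) * extent + true_extent));
    }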

Page 494:

Communication with datatypes

Question: overlapping elements?

sendbuf,sendcount,sendtype: layouts may overlap, also between different blocks of elements (some elements are then sent multiple times)

recvbuf,recvcount,recvtype: each element may occur only once in the layout of the total receive buffer; overlap is illegal

Reason: determinism

Page 495:

MPI_Pack/MPI_Unpack: functionality to pack/unpack a structured buffer to/from a consecutive sequence of elements (type signature)

Performance expectation:

MPI_Bcast(buffer,count,type,…);

should not be slower than (since they are semantically equivalent)

MPI_Pack(buffer,count,type,outbuf,outsize,…);
MPI_Bcast(outbuf,outsize,MPI_PACKED,…);

(plus MPI_Unpack at the non-roots)

But: pack/unpack are still useful for point-to-point communication of incomplete data (incremental un/packing) and for "type-safe unions"
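A sketch of the pack-then-broadcast variant used as the baseline in this comparison; MPI_Pack_size gives an upper bound on the packed size, the root packs and the non-roots unpack:

    #include <mpi.h>
    #include <stdlib.h>

    void bcast_packed(void *buffer, int count, MPI_Datatype type,
                      int root, MPI_Comm comm)
    {
        int rank, packsize, position = 0;
        MPI_Comm_rank(comm, &rank);

        MPI_Pack_size(count, type, comm, &packsize);
        char *outbuf = malloc(packsize);

        if (rank == root)
            MPI_Pack(buffer, count, type, outbuf, packsize, &position, comm);

        MPI_Bcast(outbuf, packsize, MPI_PACKED, root, comm);

        if (rank != root) {
            position = 0;
            MPI_Unpack(outbuf, packsize, &position, buffer, count, type, comm);
        }
        free(outbuf);
    }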

Page 496:

Note: MPI_Pack/Unpack work on units of datatypes. The functionality is not sufficient for accessing arbitrary sequences of elements in a layout described by derived datatypes.

Pipelining requires (unexposed) library-internal functionality

Tarun Prabhu, William Gropp: DAME: A Runtime-Compiled Engine for Derived Datatypes. EuroMPI 2015: 4:1-4:10
Timo Schneider, Fredrik Kjolstad, Torsten Hoefler: MPI datatype processing using runtime compilation. EuroMPI 2013: 19-24
Timo Schneider, Robert Gerstenberger, Torsten Hoefler: Micro-applications for Communication Data Access Patterns and MPI Datatypes. EuroMPI 2012: 121-131
Jesper Larsson Träff, Rolf Hempel, Hubert Ritzdorf, Falk Zimmermann: Flattening on the Fly: Efficient Handling of MPI Derived Datatypes. PVM/MPI 1999: 109-116

Page 497:

Is this true?

Check with different basic layouts. Example: an alternating layout of two blocks with A1 and A2 elements and strides B1 and B2, repeated over the extent

Page 498:

MPI_Bcast benchmark (root packs, non-roots unpack)

MPI library: mvapich2-2.1

[Plots: 16 processes on one node; 32 processes on 32 nodes]

Page 499:

MPI_Bcast benchmark (root packs, non-roots unpack)

MPI library: NECmpi 1.3.1

[Plots: 16 processes on one node; 32 processes on 32 nodes]

Note also the absolute performance difference between the two libraries

Page 500:

Benchmarking datatypes under performance expectations:

Alexandra Carpen-Amarie, Sascha Hunold, Jesper Larsson Träff: On the Expected and Observed Communication Performance with MPI Derived Datatypes. EuroMPI 2016: 108-120

On the limited support in the MPI specification for programming with datatypes:

Jesper Larsson Träff: A Library for Advanced Datatype Programming. EuroMPI 2016: 98-107

Master thesis

Page 501:

Example: Changing distributions (2)

[Figure: a row-block distribution over Proc 0 … Proc p-1 in comm is changed into a 2-d block distribution over an r x c process grid (Proc 0, …, Proc c-1; Proc c, …; Proc 2c, …) in rccomm]

Process i in comm shall send an m/p x n/c block to process i%c+i/r in the Cartesian communicator rccomm

Problem: two communicators are involved, but MPI communication is always relative to one communicator

Page 502:

Solution: translating ranks between groups

MPI_Comm_rank(comm,&rank);
MPI_Comm_group(comm,&group);
MPI_Comm_group(rccomm,&rcgroup);

int rcrank;
MPI_Group_translate_ranks(group,1,&rank,rcgroup,&rcrank);

Problem: each process sends blocks to only c other processes

Page 503:

[Figure: the row-block distribution over Proc 0 … Proc p-1 in comm and the 2-d block distribution over Proc 0 … Proc c-1, Proc c, …, Proc 2c, … in rccomm]

• Each rank in comm sends c blocks of m/p rows and n/c columns to c consecutive processes
• Each rank in rccomm receives m/p rows and n/c columns from c consecutive processes

Page 504:

Solution: irregular alltoall and datatypes

for (i=0; i<size; i++) {
  scount[i] = 0; sdisp[i] = 0;
  rcount[i] = 0; rdisp[i] = 0;
}

MPI_Comm_rank(comm,&rank);
for (i=0; i<c; i++) {
  int oldrank, colrank = rank/r+i;
  MPI_Group_translate_ranks(rcgroup,1,&colrank,group,&oldrank);
  scount[oldrank] = 1;
  sdisp[oldrank] = i;   /* send displacements are in units of extent(coltype) */
}

MPI_Alltoallv(rows,scount,sdisp,coltype,…,comm);

Page 505:

Solution: irregular alltoall and datatypes (continued)

MPI_Cart_coords(rccomm,rank,2,rccoord);
int rowrank = rccoord[0]*r;
for (i=0; i<c; i++) {
  rcount[rowrank] = m/p*n/c;
  rdisp[rowrank] = i*m/p*n/c;   /* receive displacements in units of extent(MPI_DOUBLE) */
  rowrank++;
}

MPI_Alltoallv(rows,scount,sdisp,coltype,
              rowscols,rcount,rdisp,MPI_DOUBLE,
              comm);

Home exercise: use a neighborhood collective instead

Page 506:

Performance problem (alltoallv abuse):

• Each process only sends and receives to/from c neighbors
• c is (mostly) O(√p) – a sparse neighborhood

Other problems:
• What if c and r do not divide n and m?
• What if p does not factor nicely into r and c?

Partial solutions:
• "Padding"
• Irregular collectives; but some datatype functionality seems to be missing (there is only the fully general, expensive MPI_Alltoallw)

Page 507:

Experiment: what is the cost of MPI_Alltoallv when no data are exchanged, i.e., scount[i]=0, rcount[i]=0?

Huge: 25 ms on the Argonne National Laboratory BlueGene/P system

P. Balaji et al.: MPI on Millions of Cores. Parallel Processing Letters 21(1): 45-60 (2011)

Page 508:

New collectives in MPI 3.0

Topological/sparse/neighborhood collectives can express collective patterns on sparse neighborhoods.

Neighborhoods are expressed by process topologies:

• Cartesian topology (MPI_Cart): each process has 2d outgoing edges and 2d incoming edges
• Distributed graph topology: each process has the outgoing and incoming edges described by the communication graph given to MPI_Dist_graph_create

Algorithmic flavor totally different from the previous, global collectives

Page 509:

MPI_Neighbor_allgather(v)

Neighborhood defined by a general communication graph.

sendbuf is sent to all destinations; an individual block is received from each source into recvbuf

[Figure: process i with destinations dj, dk, … and sources s0, s1, …, s(n-1); recvbuf holds one block per source]

Page 510:

MPI_Neighbor_allgather(v)

Cartesian neighborhood: neighbors in dimension order, distance -1, then distance +1.

sendbuf is sent to all destinations; an individual block is received from each source into recvbuf

[Figure: process i in a 2-d grid with neighbors 0, 1, 2, 3; recvbuf holds the blocks s0 s1 s2 s3 in that order]
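A sketch of the Cartesian case, assuming a 2-d periodic grid (so every process has exactly four neighbors); each process contributes a single double, names are illustrative:

    #include <mpi.h>

    void neighbor_exchange(MPI_Comm comm)
    {
        int dims[2] = { 0, 0 }, periods[2] = { 1, 1 }, p, rank;
        MPI_Comm cartcomm;

        MPI_Comm_size(comm, &p);
        MPI_Dims_create(p, 2, dims);
        MPI_Cart_create(comm, 2, dims, periods, 0, &cartcomm);
        MPI_Comm_rank(cartcomm, &rank);

        double sendval = (double)rank;  /* same block sent to all 4 neighbors */
        double recvvals[4];             /* one block received per neighbor,
                                           in dimension order: -1/+1 in x, -1/+1 in y */

        MPI_Neighbor_allgather(&sendval, 1, MPI_DOUBLE,
                               recvvals, 1, MPI_DOUBLE, cartcomm);

        MPI_Comm_free(&cartcomm);
    }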

Page 511:

MPI_Neighbor_alltoall(v,w)

An individual block from sendbuf is sent to each destination; an individual block is received from each source into recvbuf

[Figure: process i with destinations d0, d1, …, d(m-1) and sources s0, s1, …, s(n-1); sendbuf holds one block per destination, recvbuf one block per source]

Page 512:

Algorithms for neighborhood (sparse) collectives

Not much is known; the flavor is totally different from global, dense collectives, more difficult, with hard optimization problems behind optimality results; see, e.g.,

Torsten Hoefler, Timo Schneider: Optimization principles for collective neighborhood communications. SC 2012: 98

MPI design questions: Are the neighborhood collectives too powerful? Are they useful? Are better compromises possible?

Page 513:

Creation of virtual communication graphs/neighborhoods

MPI_Dist_graph_create(comm,…,&graphcomm);
MPI_Dist_graph_create_adjacent(comm,…,&graphcomm);

specify neighborhoods in a fully distributed fashion. The order of adjacent edges (in and out edges) is implementation dependent, but must be fixed, as returned by calls to

MPI_Dist_graph_neighbors_count(graphcomm,
                               &indegree,&outdegree,
                               &weighted);

MPI_Dist_graph_neighbors(graphcomm,
                         maxindeg,sources,sweights,
                         maxoutdeg,destinations,dweights);

This order of edges is used in the neighborhood collectives (see the sketch below)
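A sketch using the adjacent creation call for a simple ring neighborhood (assuming at least three processes, so that the two neighbors are distinct), followed by the queries that fix the edge order; names are illustrative:

    #include <mpi.h>

    void make_ring_neighborhood(MPI_Comm comm, MPI_Comm *graphcomm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int neighbors[2] = { (rank - 1 + size) % size, (rank + 1) % size };

        MPI_Dist_graph_create_adjacent(comm,
            2, neighbors, MPI_UNWEIGHTED,   /* incoming edges (sources) */
            2, neighbors, MPI_UNWEIGHTED,   /* outgoing edges (destinations) */
            MPI_INFO_NULL, 0, graphcomm);

        int indeg, outdeg, weighted;
        MPI_Dist_graph_neighbors_count(*graphcomm, &indeg, &outdeg, &weighted);

        int sources[2], destinations[2];
        MPI_Dist_graph_neighbors(*graphcomm, indeg, sources, MPI_UNWEIGHTED,
                                 outdeg, destinations, MPI_UNWEIGHTED);
        /* sources/destinations now give the edge order used by the
           neighborhood collectives on *graphcomm */
    }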

Page 514:

WARNING

Never use the “old” (MPI 1) graph topology interface, MPI_Graph_create() etc.

• Non-scalable• Inconsistent• May/will be deprecated

Page 515:

Example: 2d stencil 5-point computation

• Cartesian neighborhood
• MPI_Neighbor_alltoallw: why? Different datatypes in the x and y dimensions

Page 516:

Example: 2d stencil 9-point computation

• Non-Cartesian neighborhood: have to use distributed graphs
• MPI_Neighbor_alltoallw: why? Different datatypes in the x and y dimensions

Page 517:

Non-blocking collectives (from MPI 3.0)

In analogy with non-blocking point-to-point communication, all collective operations (also the sparse ones) have non-blocking counterparts:

MPI_Ibarrier
MPI_Ibcast
MPI_Iscatter/MPI_Igather
MPI_Iallgather
MPI_Ialltoall
…

MPI_Ineighbor_allgather
MPI_Ineighbor_allgatherv
MPI_Ineighbor_alltoall
MPI_Ineighbor_alltoallv
MPI_Ineighbor_alltoallw

Check for completion: MPI_Test (many variants)
Enforce completion: MPI_Wait (many variants)

Page 518:

Example:

MPI_Request req;    // MPI request object

MPI_Iallgather(sendbuf,sendcount,sendtype,
               recvbuf,recvcount,recvtype,comm,&req);

// compute: potential for overlap

MPI_Status status;  // MPI status object:
                    // most fields undefined for non-blocking collectives
MPI_Wait(&req,&status);

Semantics as for blocking collectives: locally complete (with the implications this has) after the wait

Page 519:

Non-blocking collective vs. point-to-point communication

[Figure: process i calls MPI_Send while process j calls MPI_Irecv — for point-to-point, all combinations of blocking and non-blocking calls are permitted. Process i calls MPI_Ibcast while process j calls MPI_Bcast — blocking and non-blocking collectives do not match]

Why?

Answer: a blocking and a non-blocking collective may use different algorithms; the specification should not forbid such implementations

Page 520:

Blocking collectives:
Assumption: used in a synchronized manner; the MPI process is busy until (local) completion
Objective: fast algorithms (latency, bandwidth)

Non-blocking collectives:
Assumption: used asynchronously; the MPI process can do sensible things concurrently, with a postponed check for completion
Objective: algorithms that permit overlap, tolerate skewed arrival patterns, and can exploit hardware offloading

Page 521:

MPI and collective communication algorithms summary

• MPI is the reference communication interface in HPC
• There is much to learn from the MPI design
• A good basis for studying concepts in interfaces and algorithms for large-scale, parallel systems
• Many interesting research problems
• Application programmers need to know and understand MPI well to program effectively
• Much is known about efficient algorithms for collective communication operations in various types of networks; but not everything
• A good synthesis for not-fully-connected networks is still needed