MPI
advanced usage of point-to-point operations
Edgar Gabriel
High Performance Computing Center Stuttgart (HLRS), [email protected]
Overview
• Point-to-point taxonomy and available functions
• What is the status of a message?
• Non-blocking operations
• Example: 2-D Laplace equation
• Concatenating independent elements into a single message
• Probing for messages
What you’ve learned so far
• Six MPI functions are sufficient for programming a distributed memory machine
MPI_Init (int *argc, char ***argv);
MPI_Finalize ();

MPI_Comm_rank (MPI_Comm comm, int *rank);
MPI_Comm_size (MPI_Comm comm, int *size);

MPI_Send (void *buf, int count, MPI_Datatype dat,
          int dest, int tag, MPI_Comm comm);
MPI_Recv (void *buf, int count, MPI_Datatype dat,
          int source, int tag, MPI_Comm comm, MPI_Status *status);
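For illustration, a complete minimal program built from only these six functions could look like the following sketch (buffer content and tag value are arbitrary choices):

#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, size, buf = 42;
    MPI_Status status;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        /* process 0 sends one integer to process 1, using tag 0 */
        MPI_Send (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv (&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf ("rank 1 received %d\n", buf);
    }

    MPI_Finalize ();
    return 0;
}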
So, why not stop here?
• Performance
  – need functions which can fully exploit the capabilities of the hardware
  – need functions to abstract typical communication patterns
• Usability
  – need functions to simplify often recurring tasks
  – need functions to simplify the management of parallel applications
So, why not stop here?
• Performance
  – asynchronous point-to-point operations
  – one-sided operations
  – collective operations
  – derived data-types
  – parallel I/O
  – hints
• Usability
  – process grouping functions
  – environmental and process management
  – error handling
  – object attributes
  – language bindings
Point-to-point operations
• Data exchange between two processes
  – both processes are actively participating in the data exchange (two-sided communication)
• Large set of functions defined in MPI-1 (50+):

                 Blocking      Non-blocking    Persistent
  Standard       MPI_Send      MPI_Isend       MPI_Send_init
  Synchronous    MPI_Ssend     MPI_Issend      MPI_Ssend_init
  Buffered       MPI_Bsend     MPI_Ibsend      MPI_Bsend_init
  Ready          MPI_Rsend     MPI_Irsend      MPI_Rsend_init
A message consists of…
• the data which is to be sent from the sender to the receiver, described by
  – the beginning of the buffer
  – a data-type
  – the number of elements of the data-type
• the message header (message envelope)
  – rank of the sender process
  – rank of the receiver process
  – the communicator
  – a tag
Rules for point-to-point operations
• Reliability: MPI guarantees that no message gets lost
• Non-overtaking rule: MPI guarantees that two messages posted from process A to process B arrive in the same order as they have been posted
• Message-based paradigm: MPI specifies that a single message cannot be received with more than one Recv operation (in contrast to sockets!)
[Figure: a single 4-element message cannot be split across two receive buffers]

if (rank == 0 ) {
    MPI_Send (buf, 4, …);
}
if ( rank == 1 ) {
    MPI_Recv (buf, 3, …);         /* erroneous: tries to receive only part of the message */
    MPI_Recv (&(buf[3]), 1, …);   /* the remaining element cannot be picked up by a second Recv */
}
Message matching (I)
• How does the receiver know whether the message which he just received is the message for which he was waiting?
  – the sender of the arriving message has to match the sender of the expected message
  – the tag of the arriving message has to match the tag of the expected message
  – the communicator of the arriving message has to match the communicator of the expected message
Message matching (II)
• What happens if the length of the arriving message does not match the length of the expected message?
  – the length of the message is not used for matching
  – if the received message is shorter than the expected message: no problem
  – if the received message is longer than the expected message:
    – an error code (MPI_ERR_TRUNC) will be returned
    – or your application will be aborted
    – or your application will deadlock
    – or your application writes a core-dump
Message matching (III)
• Example 1: correct example

if (rank == 0 ) {
    MPI_Send (buf, 3, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
else if ( rank == 1 ) {
    MPI_Recv (buf, 5, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}

[Figure: the 3-element message fills the first part of the 5-element receive buffer; the remaining elements of the receive buffer stay untouched]
Message matching (IV)
• Example 2: erroneous example
if (rank == 0 ) {
    MPI_Send (buf, 5, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
else if ( rank == 1 ) {
    MPI_Recv (buf, 3, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}

[Figure: the 5-element message does not fit into the 3-element receive buffer, potentially writing over the end of the receive buffer]
Deadlock (I)

• Question: how can two processes safely exchange data at the same time?
• Possibility 1:

  Process 0               Process 1
  MPI_Send (buf,…);       MPI_Send (buf,…);
  MPI_Recv (buf,…);       MPI_Recv (buf,…);

  – can deadlock, depending on the message length and the capability of the hardware/MPI library to buffer messages
Deadlock (II)

• Possibility 2: re-order MPI functions on one process

  Process 0               Process 1
  MPI_Recv (rbuf,…);      MPI_Send (buf,…);
  MPI_Send (buf,…);       MPI_Recv (rbuf,…);

• Other possibilities:
  – asynchronous communication – shown later
  – use buffered send (MPI_Bsend) – not shown here
  – use MPI_Sendrecv – not shown here
Example
• Implementation of a ring using Send/Recv
  – Rank 0 starts the ring

MPI_Comm_rank (comm, &rank);
MPI_Comm_size (comm, &size);

if (rank == 0 ) {
    MPI_Send (buf, 1, MPI_INT, rank+1, 1, comm);
    MPI_Recv (buf, 1, MPI_INT, size-1, 1, comm, &status);
}
else if ( rank == size-1 ) {
    MPI_Recv (buf, 1, MPI_INT, rank-1, 1, comm, &status);
    MPI_Send (buf, 1, MPI_INT, 0, 1, comm);
}
else {
    MPI_Recv (buf, 1, MPI_INT, rank-1, 1, comm, &status);
    MPI_Send (buf, 1, MPI_INT, rank+1, 1, comm);
}
Wildcards
• Question: can I use wildcards for the arguments in Send/Recv?
• Answer:
  – for Send: no
  – for Recv (see the sketch below):
    • tag: yes, MPI_ANY_TAG
    • source: yes, MPI_ANY_SOURCE
    • communicator: no
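As an illustrative sketch, a receive using both wildcards could look like this; the actual sender and tag can afterwards be read from the status (see the next slides):

MPI_Status status;
int buf[16], src, tag;

/* accept a message from any sender, with any tag, on MPI_COMM_WORLD */
MPI_Recv (buf, 16, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &status);

/* find out who actually sent the message and which tag was used */
src = status.MPI_SOURCE;
tag = status.MPI_TAG;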
Status of a message (I)
• the MPI status contains directly accessible information
  – who sent the message
  – what was the tag
  – what is the error-code of the message
• … and indirectly accessible information through function calls
  – how long is the message
  – has the message been cancelled
Status of a message (II) – usage in C
MPI_Status status;
MPI_Recv ( buf, cnt, MPI_INT, …, &status);
/* directly access source, tag, and error */
src = status.MPI_SOURCE;
tag = status.MPI_TAG;
err = status.MPI_ERROR;

/* determine message length and whether it has been cancelled */
MPI_Get_count (&status, MPI_INT, &rcnt);
MPI_Test_cancelled (&status, &flag);
Status of a message (III) – usage in Fortran
integer status(MPI_STATUS_SIZE)
call MPI_Recv(buf,cnt,MPI_INTEGER, …,status,ierr)
! directly access source, tag, and error
src = status(MPI_SOURCE)
tag = status(MPI_TAG)
err = status(MPI_ERROR)

! determine message length and whether it has been cancelled
call MPI_Get_count (status, MPI_INTEGER, rcnt, ierr)
call MPI_Test_cancelled (status, flag, ierr)
Status of a message (IV)
• If you are not interested in the status, you can pass
  – MPI_STATUS_IGNORE
  – MPI_STATUSES_IGNORE
  to MPI_Recv and all other MPI functions which return a status or an array of statuses
Non-blocking operations (I)
• A regular MPI_Send returns when '… the data is safely stored away'
• A regular MPI_Recv returns when the data is fully available in the receive buffer
• Non-blocking operations initiate the send and receive operations, but do not wait for their completion
• Functions which check or wait for the completion of an initiated communication have to be called explicitly
• Since these functions only initiate the communication and return immediately, their names carry an I prefix (for "immediate"), e.g. MPI_Isend or MPI_Irecv
Non-blocking operations (II)
MPI_Isend (void *buf, int cnt, MPI_Datatype dat, int dest,
           int tag, MPI_Comm comm, MPI_Request *req);

MPI_Irecv (void *buf, int cnt, MPI_Datatype dat, int src,
           int tag, MPI_Comm comm, MPI_Request *req);
Non-blocking operations (III)
• After initiating a non-blocking communication, it is not allowed to touch (= modify) the communication buffer until completion
  – you cannot make any assumptions about when the message will really be transferred
• All immediate functions take an additional argument, a request
• A request uniquely identifies an ongoing communication and has to be used if you want to check/wait for the completion of a posted communication
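A minimal sketch of the resulting usage pattern (dest, tag and comm are assumed to be set elsewhere):

MPI_Request req;
MPI_Status  status;
double sbuf[100];

/* ... fill sbuf ... */

MPI_Isend (sbuf, 100, MPI_DOUBLE, dest, tag, comm, &req);

/* sbuf must not be modified here -- the transfer may still be in progress */

MPI_Wait (&req, &status);   /* after MPI_Wait returns, sbuf may be reused */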
Completion functions (I)
• Functions waiting for completion:
  MPI_Wait     – wait for one communication to finish
  MPI_Waitall  – wait for all comm. of a list to finish
  MPI_Waitany  – wait for one comm. of a list to finish
  MPI_Waitsome – wait for at least one comm. of a list to finish
• Content of the status is not defined for send operations

MPI_Wait (MPI_Request *req, MPI_Status *stat);
MPI_Waitall (int cnt, MPI_Request *reqs, MPI_Status *stats);
MPI_Waitany (int cnt, MPI_Request *reqs, int *index, MPI_Status *stat);
MPI_Waitsome (int cnt, MPI_Request *reqs, int *outcount, int *indices,
              MPI_Status *stats);
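As an illustration, MPI_Waitany can be used to process completions in whatever order they occur (NREQ and the posted operations are assumed):

MPI_Request reqs[NREQ];
MPI_Status  status;
int i, index;

/* ... post NREQ non-blocking operations, storing the requests in reqs ... */

for (i = 0; i < NREQ; i++) {
    /* blocks until one of the still pending requests completes */
    MPI_Waitany (NREQ, reqs, &index, &status);
    /* reqs[index] has completed and was set to MPI_REQUEST_NULL;
       completed requests are ignored in the following iterations */
}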
Completion functions (II)
• Test functions verify whether a communication is complete:
  MPI_Test     – check whether a comm. has finished
  MPI_Testall  – check whether all comm. of a list have finished
  MPI_Testany  – check whether one of a list of comm. has finished
  MPI_Testsome – check how many of a list of comm. have finished

MPI_Test (MPI_Request *req, int *flag, MPI_Status *stat);
MPI_Testall (int cnt, MPI_Request *reqs, int *flag, MPI_Status *stats);
MPI_Testany (int cnt, MPI_Request *reqs, int *index, int *flag,
             MPI_Status *stat);
MPI_Testsome (int cnt, MPI_Request *reqs, int *outcount, int *indices,
              MPI_Status *stats);
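A sketch of how MPI_Test can be used to poll for completion while doing useful work (rbuf, cnt, src, tag, comm and do_some_work() are placeholders):

MPI_Request req;
MPI_Status  status;
int flag = 0;

MPI_Irecv (rbuf, cnt, MPI_DOUBLE, src, tag, comm, &req);

while (!flag) {
    do_some_work ();                  /* placeholder for useful computation */
    MPI_Test (&req, &flag, &status);  /* returns immediately, sets flag = 1 on completion */
}
/* here the receive has completed and rbuf may be read */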
Deadlock problem revisited

• Question: how can two processes safely exchange data at the same time?
• Possibility 3: usage of non-blocking operations

  Process 0                       Process 1
  MPI_Irecv (rbuf,…, &req);       MPI_Irecv (rbuf,…, &req);
  MPI_Send  (buf,…);              MPI_Send  (buf,…);
  MPI_Wait  (&req, &status);      MPI_Wait  (&req, &status);

• Note:
  – you have to use 2 separate buffers!
  – many different ways for formulating this scenario
  – identical code for both processes
Example – 2D Laplace equation (I)
• 2-D Laplace equation:  \Delta u = 0
• Central discretization leads to

  \frac{u_{i-1,j} - 2 u_{i,j} + u_{i+1,j}}{\Delta x^2} + \frac{u_{i,j-1} - 2 u_{i,j} + u_{i,j+1}}{\Delta y^2} = 0

[Figure: 5-point stencil – the update of point (i,j) uses its neighbors (i-1,j), (i+1,j), (i,j-1), and (i,j+1)]
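Solving the discretized system iteratively, e.g. with a Jacobi sweep, leads to an update of the following form; this loop is an illustrative sketch (unew, dx2 = Δx², dy2 = Δy² and the local extents nxlocal/nylocal, introduced on later slides, are assumed):

/* one Jacobi sweep over the local domain (halo cells excluded) */
for (i = 1; i <= nxlocal; i++) {
    for (j = 1; j <= nylocal; j++) {
        unew[i][j] = ( dy2 * (u[i-1][j] + u[i+1][j])
                     + dx2 * (u[i][j-1] + u[i][j+1]) )
                     / (2.0 * (dx2 + dy2));
    }
}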
Example – 2D Laplace equation (II)
• Parallel domain decomposition
• Data exchange at process boundaries required
  – not assuming periodic boundary conditions here
Example – 2D Laplace equation (III)
• Halo cells:
  – store a copy of data that is held by another process but is required for the computation of the local data
  – how can the communication of this scheme be implemented efficiently?
Example – 2-D Laplace equation (IV)
• Process mapping and determining neighbor processes (see the figure and formulas below)
• At boundaries: set the rank of the corresponding neighbor to MPI_PROC_NULL
  – a message sent to MPI_PROC_NULL will be ignored by the MPI library
• Hint: look at the Cartesian topology functions for another method to perform the same operations
[Figure: 4 x 3 process grid; ranks 0-3 form the bottom row (y = 0), ranks 4-7 the middle row (y = 1), ranks 8-11 the top row (y = 2); x points to the right, y upwards]

  left  = rank - 1
  right = rank + 1
  up    = rank + np_x
  down  = rank - np_x

  np_x : number of processes in x direction
  np_y : number of processes in y direction
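A sketch of this neighbor computation in C, assuming the row-wise rank numbering of the figure (the names npx, npy, nleft, nright, nup, ndown are illustrative and match the later code fragments):

int npx, npy;        /* number of processes in x and y direction, set elsewhere */
int rank, nleft, nright, nup, ndown;

MPI_Comm_rank (comm, &rank);

nleft  = rank - 1;
nright = rank + 1;
nup    = rank + npx;
ndown  = rank - npx;

/* at the domain boundaries there is no neighbor: use MPI_PROC_NULL,
   so that the corresponding send/recv operations become no-ops */
if (rank % npx == 0)        nleft  = MPI_PROC_NULL;   /* left boundary   */
if (rank % npx == npx - 1)  nright = MPI_PROC_NULL;   /* right boundary  */
if (rank / npx == 0)        ndown  = MPI_PROC_NULL;   /* bottom boundary */
if (rank / npx == npy - 1)  nup    = MPI_PROC_NULL;   /* top boundary    */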
Laplace equation – communication in y-direction
• u(i,j) is stored in a matrix  !! assuming C !!
• Dimension of u on an inner process (= not being at a boundary):

    u(nxlocal+2, nylocal+2)

  with

    u(1:nxlocal, 1:nylocal)

  containing the local data
• nxlocal : number of local points in x direction
  nylocal : number of local points in y direction
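One possible way (an assumption, not shown on the slides) to allocate such an array including the halo cells in C:

/* allocate (nxlocal+2) x (nylocal+2) doubles, including the halo cells,
   as one contiguous block plus an array of row pointers */
double **u  = malloc ((nxlocal+2) * sizeof(double *));
double *mem = malloc ((nxlocal+2) * (nylocal+2) * sizeof(double));
for (i = 0; i < nxlocal+2; i++) {
    u[i] = &mem[i * (nylocal+2)];
}
/* local data lives in u[1..nxlocal][1..nylocal];
   u[0][*], u[nxlocal+1][*], u[*][0], and u[*][nylocal+1] are halo cells */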
Laplace equation – communication in y-direction
MPI_Request req[4];

MPI_Irecv (&u[1][nylocal+1], nxlocal, MPI_DOUBLE, nup,   tag, comm, &req[0]);
MPI_Irecv (&u[1][0],         nxlocal, MPI_DOUBLE, ndown, tag, comm, &req[1]);
MPI_Isend (&u[1][nylocal],   nxlocal, MPI_DOUBLE, nup,   tag, comm, &req[2]);
MPI_Isend (&u[1][1],         nxlocal, MPI_DOUBLE, ndown, tag, comm, &req[3]);

MPI_Waitall (4, req, MPI_STATUSES_IGNORE);
Laplace equation – communication in x-direction
• Problem: the data which we have to send is not contiguous in memory

[Figure: logical view of the matrix]
[Figure: layout in memory of the same matrix (in C)]
Laplace equation – communication in x-direction
• How to implement the halo-cell exchange in x-direction?
  – Send/Recv every element in a separate message
    + works
    – very slow
  – copy the data into a separate vector/array and send this array
    + works
  – a more general interface is provided by MPI to pack data into a contiguous buffer before sending (MPI_Pack/MPI_Unpack, shown next)
  – an even more general interface – derived datatypes – is provided by MPI to avoid user-level packing and unpacking of messages
    – not handled here
Packing a message
• MPI_Pack copies incount elements of type dat from inbuf into the user-provided buffer outbuf
  – outbuf has to be large enough to hold the data; its size in bytes is passed as outsize
  – pos contains the position of the last packed byte in outbuf; it has to be initialized to zero before the first usage
  – can be called several times to pack independent pieces of data
• Send and receive a message which has been packed using the MPI datatype MPI_PACKED

MPI_Pack (void *inbuf, int incount, MPI_Datatype dat,
          void *outbuf, int outsize, int *pos, MPI_Comm comm);
Packing a message (II)
[Figure: state of outbuf while packing two values]

outbuf before the 1st pack:  pos = 0

MPI_Pack (inbuf1, 1, MPI_INT,   outbuf, outsize, &pos, comm);

outbuf after the 1st pack:   pos = 6  (data preceded by an MPI internal header)

MPI_Pack (inbuf2, 1, MPI_FLOAT, outbuf, outsize, &pos, comm);

outbuf after the 2nd pack:   pos = 10
Unpacking a message
• MPI_Unpack copies outcount elements of type dat from inbuf into the user-provided buffer outbuf
  – inbuf holds the whole packed message, insize its size in bytes
  – pos contains the position of the last unpacked byte in inbuf; it has to be initialized to zero before the first usage
  – can be called several times to unpack independent pieces of data

MPI_Unpack (void *inbuf, int insize, int *pos, void *outbuf,
            int outcount, MPI_Datatype dat, MPI_Comm comm);
Determining the size of the pack-buffer
• MPI_Pack_size returns in size the number of bytes required to pack incount elements of type dat using MPI_Pack
  – size might not be identical to incount * sizeof(original datatype)
  – several calls to MPI_Pack_size are required if you plan to pack more than one type of data
    • sum up the returned sizes
  – you can use size e.g. to malloc a buffer

MPI_Pack_size (int incount, MPI_Datatype dat, MPI_Comm comm, int *size);
Laplace equation – communication in x-direction (I)
double *sbufleft, *sbufright, *rbufleft, *rbufright;
int bufsize, posleft = 0, posright = 0;

/* determine the required buffer size and allocate the buffers */
MPI_Pack_size (nylocal, MPI_DOUBLE, comm, &bufsize);
sbufleft  = malloc (bufsize);
sbufright = malloc (bufsize);
rbufleft  = malloc (bufsize);
rbufright = malloc (bufsize);

/* pack the data before sending */
for (i = 1; i < nylocal+1; i++) {
    MPI_Pack (&u[nxlocal][i], 1, MPI_DOUBLE, sbufright, bufsize,
              &posright, comm);
    MPI_Pack (&u[1][i],       1, MPI_DOUBLE, sbufleft,  bufsize,
              &posleft,  comm);
}
Laplace equation – communication in x-direction (II)
/* execute now the real communication */
MPI_Irecv (rbufleft,  bufsize,  MPI_PACKED, nleft,  tag, comm, &req[0]);
MPI_Irecv (rbufright, bufsize,  MPI_PACKED, nright, tag, comm, &req[1]);
MPI_Isend (sbufleft,  posleft,  MPI_PACKED, nleft,  tag, comm, &req[2]);
MPI_Isend (sbufright, posright, MPI_PACKED, nright, tag, comm, &req[3]);
MPI_Waitall (4, req, MPI_STATUSES_IGNORE);

/* unpack the received data */
posright = posleft = 0;
for (i = 1; i < nylocal+1; i++) {
    MPI_Unpack (rbufright, bufsize, &posright, &u[nxlocal+1][i],
                1, MPI_DOUBLE, comm);
    MPI_Unpack (rbufleft,  bufsize, &posleft,  &u[0][i],
                1, MPI_DOUBLE, comm);
}
Overlapping communication and computation
Default algorithm:
• Data exchange
• Execute the calculation over the whole domain at once
Overlapping communication and computation (II)
Alternative algorithm (see the sketch below):
• Initiate communication (MPI_Isend/MPI_Irecv)
• Calculate the inner values, which do not need the halo cells
• Finish communication (MPI_Waitall)
• Calculate the boundary values
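A sketch of this algorithm for the y-direction exchange of the Laplace example; calc_inner() and calc_boundary() are placeholder functions:

MPI_Request req[4];

/* 1. initiate the halo exchange with the upper and lower neighbor */
MPI_Irecv (&u[1][nylocal+1], nxlocal, MPI_DOUBLE, nup,   tag, comm, &req[0]);
MPI_Irecv (&u[1][0],         nxlocal, MPI_DOUBLE, ndown, tag, comm, &req[1]);
MPI_Isend (&u[1][nylocal],   nxlocal, MPI_DOUBLE, nup,   tag, comm, &req[2]);
MPI_Isend (&u[1][1],         nxlocal, MPI_DOUBLE, ndown, tag, comm, &req[3]);

/* 2. update all inner points, which do not depend on the halo cells */
calc_inner (u, unew);

/* 3. wait until the halo cells have arrived (and the sends have completed) */
MPI_Waitall (4, req, MPI_STATUSES_IGNORE);

/* 4. update the points next to the process boundaries */
calc_boundary (u, unew);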
Using more than one ghostcell
• Use n ghost cells and communicate only every n iterations
• Example for n = 2

[Figure: first iteration, second iteration, communication]
What else is there?
• Various send modes
  – buffered send, synchronous send, ready send
• Persistent request operations
• Sendrecv functions
  – MPI_Sendrecv, MPI_Sendrecv_replace
• Probing for a message
  – MPI_Probe, MPI_Iprobe
• Cancelling a message
  – MPI_Cancel
• Derived datatypes
• One-sided communication