
Page 1: Chapter 6 pc

Chapter 6

Programming Using the Message Passing Paradigm

Dr. Muhammad Hanif Durad

Department of Computer and Information Sciences

Pakistan Institute of Engineering and Applied Sciences

[email protected]

Some slides have been adapted, with thanks, from other lectures available on the Internet.

Page 2: Chapter 6 pc


Lecture Outline

Principles of Message-Passing Programming
The Building Blocks: Send and Receive Operations
MPI: the Message Passing Interface
Topologies and Embedding
Overlapping Communication with Computation
Collective Communication and Computation Operations
Groups and Communicators

Page 3: Chapter 6 pc

6.1 Principles of Message-Passing Programming (1/3)

The message-passing paradigm is one of the oldest and most widely used approaches for programming parallel computers. Its roots can be traced back to the early days of parallel processing, and it has been widely adopted.

There are two key attributes that characterize the paradigm:

It assumes a partitioned address space.
It supports only explicit parallelization.


Page 4: Chapter 6 pc

Principles of Message-Passing Programming (2/3)

Each data element must belong to one of the partitions of the address space; data must be explicitly partitioned and placed. This encourages locality of access.

All interactions (read-only or read/write) require the cooperation of two processes: the process that has the data and the process that wants to access the data.


Page 5: Chapter 6 pc

Principles of Message-Passing Programming (3/3)

Advantages of explicit two-way interactions:

The programmer is fully aware of all the costs of non-local interactions, and is more likely to think about algorithms (and mappings) that minimize interactions.
The paradigm can be efficiently implemented on a wide variety of architectures.

Disadvantage: for dynamic and/or unstructured interactions, the complexity of the code written for this paradigm can be very high.

Page 6: Chapter 6 pc

Programming Issues (1/5)

Parallelism is coded explicitly by the programmer. The programmer is responsible for analyzing the underlying serial algorithm/application and identifying ways to decompose the computations and extract concurrency.

Programming using this paradigm tends to be hard and intellectually demanding.

Properly written message-passing programs can often achieve very high performance and scale to a very large number of processes.


Page 7: Chapter 6 pc

Programming Issues (2/5)

Programs are written using the asynchronous or the loosely synchronous paradigm.

In the asynchronous paradigm, all concurrent tasks execute asynchronously. Such programs can be harder to reason about and can have non-deterministic behavior.


Page 8: Chapter 6 pc

Programming Issues (3/5)

Loosely synchronous programs: tasks or subsets of tasks synchronize to perform interactions; between these interactions, tasks execute completely asynchronously.

Since the interactions happen synchronously, it is still quite easy to reason about the program.


Page 9: Chapter 6 pc

Programming Issues (4/5)

The paradigm supports execution of a different program on each of the p processes.

This provides the ultimate flexibility in parallel programming, but makes the job of writing parallel programs effectively unscalable.

Page 10: Chapter 6 pc

Programming Issues (5/5)

⇒ Most programs are written using the single program multiple data (SPMD) approach. In SPMD programs, the code executed by different processes is identical except for a small number of processes (e.g., the "root" process).

Page 11: Chapter 6 pc

6.2 The Building Blocks: Send and Receive Operations (1/3)

The prototypes of these operations are as follows:

send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)

sendbuf points to a buffer that stores the data to be sent.
recvbuf points to a buffer that stores the data to be received.
nelems is the number of data units to be sent and received.
dest is the identifier of the process that receives the data.
source is the identifier of the process that sends the data.

Page 12: Chapter 6 pc

The Building Blocks: Send and Receive Operations (2/3)

Example: the performance ramifications depend on how these functions are implemented.

P0:
    a = 100;
    send(&a, 1, 1);
    a = 0;

P1:
    receive(&a, 1, 0);
    printf("%d\n", a);

What is the problem with this code?

Page 13: Chapter 6 pc

The Building Blocks: Send and Receive Operations (3/3)

Most platforms have additional hardware support for sending and receiving messages:

Asynchronous message transfer using network interface hardware, which allows the transfer from buffer memory to the desired location without CPU intervention.
DMA (direct memory access), which allows copying of data from one memory location to another without CPU support.

If send returns before the communication operation has been accomplished, P1 might receive the value 0 in a instead of 100!

Page 14: Chapter 6 pc

6.2.1 Blocking Message Passing Operations

The sending operation blocks until it can guarantee that the semantics will not be violated on return irrespective of what happens in the program subsequently.

There are two mechanisms by which this can be achieved:

Blocking Non-Buffered Send/Receive
Blocking Buffered Send/Receive

Page 15: Chapter 6 pc

6.2.1.1 Blocking Non-Buffered Send/Receive (1/2)

The send operation does not return until the matching receive has been encountered at the receiving process.

Then the message is sent and the send operation returns upon completion of the communication operation.

Page 16: Chapter 6 pc

Blocking Non-Buffered Send/Receive (2/2)

Involves a handshake between the sending and receiving processes:

The sending process sends a request to communicate to the receiving process.
When the receiving process encounters the target receive, it responds to the request.
The sending process, upon receiving this response, initiates a transfer operation.

Since there are no buffers used at either the sending or the receiving end, this is also referred to as a non-buffered blocking operation.

Page 17: Chapter 6 pc

6.2.1.1.1 Idling Overheads in Blocking Non-Buffered Send/Recv (1/3)

3 scenarios:

a) the send is reached before the receive is posted,

b) the send and receive are posted around the same time,

c) the receive is posted before the send is reached.

Page 18: Chapter 6 pc

Figure 6.1: Handshake for a blocking non-buffered send/receive operation. It is easy to see that in cases where the sender and receiver do not reach the communication point at similar times, there can be considerable idling overheads. (2/3)

Page 19: Chapter 6 pc

Idling Overheads in Blocking Non-Buffered Send/Recv (3/3)

In cases (a) and (c): there is considerable idling at the sending and receiving process.

⇒ A blocking non-buffered protocol is suitable when the send and receive are posted at roughly the same time.

In an asynchronous environment, this may be impossible to predict.

Page 20: Chapter 6 pc

6.2.1.1.2 Deadlocks in Blocking Non-Buffered Send/Recv (1/2)

Consider the following simple exchange of messages that can lead to a deadlock:

P0:
    send(&a, 1, 1);
    receive(&b, 1, 1);

P1:
    send(&a, 1, 0);
    receive(&b, 1, 0);

Page 21: Chapter 6 pc

Deadlocks in Blocking Non-Buffered Send/Recv (2/2)

The send at P0 waits for the matching receive at P1, while the send at P1 waits for the corresponding receive at P0, resulting in an infinite wait.

Deadlocks are very easy to create in blocking protocols, and care must be taken to break cyclic waits of the nature outlined above.

In the above example, this can be corrected by replacing the operation sequence of one of the processes by a receive followed by a send, as opposed to the other way around (see the sketch below).

This often makes the code more cumbersome and buggy.
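A minimal sketch of the reordered exchange, written with this chapter's generic send/receive primitives (so it is illustrative pseudocode rather than a program against a real library API):

/* Sketch: breaking the cyclic wait by reordering one process. */

/* P0 (unchanged) */
send(&a, 1, 1);        /* blocks until P1 posts the matching receive */
receive(&b, 1, 1);

/* P1 (reordered: receive first, then send) */
receive(&b, 1, 0);     /* matches P0's send, so P0 can proceed */
send(&a, 1, 0);        /* now matches P0's receive */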

Page 22: Chapter 6 pc

6.2.1.2 Blocking Buffered Send/Receive (1/6)

The sender has a buffer pre-allocated for communicating messages. It copies the data into the designated buffer and returns after the copy operation has been completed. The sender can then continue with the program, knowing that any changes to the data will not impact program semantics.

Page 23: Chapter 6 pc

Blocking Buffered Send/Receive (2/6)

The actual communication can be accomplished in many ways, depending on the available hardware resources. If the hardware supports asynchronous communication (independent of the CPU), then a network transfer can be initiated after the message has been copied into the buffer.

Page 24: Chapter 6 pc

Blocking Buffered Send/Receive (3/6)

Receiving end: the data is copied into a buffer at the receiver as well. When the receiving process encounters a receive operation, it checks to see if the message is available in its receive buffer. If so, the data is copied into the target location.

Page 25: Chapter 6 pc

Blocking Buffered Send/Receive (4/6)

a) in the presence of communication hardware with buffers at send and receive ends;

b) in the absence of communication hardware.

Page 26: Chapter 6 pc

Figure 6.2: Blocking buffered transfer protocols: (a) in the presence of communication hardware with buffers at send and receive ends; and (b) in the absence of communication hardware, the sender interrupts the receiver and deposits data in a buffer at the receiver end.

(5/6)

Page 27: Chapter 6 pc

Blocking Buffered Send/Receive (6/6)

In the absence of communication hardware: the sender interrupts the receiver and deposits the data in a buffer at the receiver end. Both processes participate in the communication operation, and the message is deposited in a buffer at the receiver end. When the receiver eventually encounters a receive operation, the message is copied from the buffer into the target location.

Page 28: Chapter 6 pc

6.2.1.2.1 Impact of Finite Buffers in Message Passing (1/3)

Consider the following code fragment:

P0:
    for (i = 0; i < 1000; i++) {
        produce_data(&a);
        send(&a, 1, 1);
    }

P1:
    for (i = 0; i < 1000; i++) {
        receive(&a, 1, 0);
        consume_data(&a);
    }

Page 29: Chapter 6 pc

Impact of Finite Buffers in Message Passing (2/3)

If process P1 was slow getting to this loop, process P0 might have sent all of its data before P1 posts any receives.

If there is enough buffer space, then both processes can proceed.

However, if the buffer is not sufficient (i.e., buffer overflow), the sender would have to be blocked until some of the corresponding receive operations had been posted, thus freeing up buffer space.

This can often lead to unforeseen overheads and performance degradation.

Solution?

Page 30: Chapter 6 pc

Impact of Finite Buffers in Message Passing (3/3)

It is a good idea to write programs that have bounded buffer requirements.

Page 31: Chapter 6 pc

6.2.1.2.2 Deadlocks in Buffered Send/Recv

Deadlocks are still possible with buffering since receive operations block.

P0:
    send(&a, 1, 1);
    receive(&b, 1, 1);

P1:
    send(&a, 1, 0);
    receive(&b, 1, 0);

Once again, such circular waits have to be broken. However, in this case deadlocks are caused only by waits on receive operations.

Page 32: Chapter 6 pc

6.2.2 Non-Blocking Message Passing Operations (1/7)

In blocking protocols, the overhead of guaranteeing semantic correctness was paid in the form of idling (non-buffered) or buffer management (buffered).

Often it is possible to require the programmer to ensure semantic correctness & provide a fast send/receive op. that incurs little overhead.

Such an operation returns from the send/receive before it is semantically safe to do so.

The user must be careful not to alter data that may be potentially participating in a communication operation.

Page 33: Chapter 6 pc

Non-Blocking Message Passing Operations (2/7)

Non-blocking ops are accompanied by a check-status op. indicating whether the semantics of initiated transfer may be violated or not.

Upon return from a non-blocking send/recv op., the process is free to perform any comp. that does not depend upon the completion of the op.

Later in the program, the process can check whether or not the non-blocking op. has completed, and, if necessary, wait for its completion

Non-blocking operations can be buffered or non-buffered (Figure 6.3).

Page 34: Chapter 6 pc

Figure 6.3: Space of possible protocols for send and receive operations.

(3/7)

Page 35: Chapter 6 pc

Non-Blocking Message Passing Operations (4/7)

In the non-buffered case: a process wishing to send data to another simply posts a pending message and returns to the user program. The program can then do other useful work. When the corresponding receive is posted, the communication operation is initiated. When this operation is completed, the check-status operation indicates that it is safe for the programmer to touch this data. This transfer is indicated in Figure 6.4(a).


Page 36: Chapter 6 pc

Figure 6.4: Non-blocking non-buffered send and receive operations (a) in the absence of communication hardware; (b) in the presence of communication hardware.

(5/7)

Page 37: Chapter 6 pc

Non-Blocking Message Passing Operations (6/7)

The benefits of non-blocking non-buffered operations are further enhanced by the presence of dedicated communication hardware. This is illustrated in Figure 6.4(b).

In this case, the communication overhead can be almost entirely masked by non-blocking operations.

However, the data being received is unsafe for the duration of the receive operation.

Page 38: Chapter 6 pc

Non-Blocking Message Passing Operations (7/7)

In the buffered case: the sender initiates a DMA operation and returns immediately. The data becomes safe the moment the DMA operation has been completed.

At the receiving end, the receive operation initiates a transfer from the sender's buffer to the receiver's target location.

Using buffers with non-blocking operations has the effect of reducing the time during which the data is unsafe.


Page 39: Chapter 6 pc

Compare Figures 6.4(a) and 6.1(a) (1/2)

(Figure 6.1 and Figure 6.4 shown side by side for comparison.)

Page 40: Chapter 6 pc

Compare Figures 6.4(a) and 6.1(a) (2/2)


It is easy to see that the idling time when the process is waiting for the corresponding receive in a blocking operation can now be utilized for computation, provided it does not update the data being sent.

This alleviates the major bottleneck associated with the former, at the expense of some program restructuring.

Page 41: Chapter 6 pc

Message Passing Libraries

Message-passing libraries such as MPI and PVM implement both blocking and non-blocking operations.

Blocking operations facilitate safe and easier programming.
Non-blocking operations are useful for performance optimization by masking communication overhead.

One must, however, be careful when using non-blocking protocols, since errors can result from unsafe access to data that is in the process of being communicated.

Page 42: Chapter 6 pc

6.3 MPI: the Message Passing Interface (1/2)

A standard library for message passing that can be used to develop portable message-passing programs using either C or Fortran.

Defines both the syntax and the semantics of a core set of library routines.

Vendor implementations of MPI are available on almost all commercial parallel computers.

It is possible to write fully functional programs using only six of the more than 125 routines.

Page 43: Chapter 6 pc

MPI: the Message Passing Interface (2/2)

MPI_Init Initializes MPI.

MPI_Finalize Terminates MPI.

MPI_Comm_size Determines the number of processes.

MPI_Comm_rank Determines the label of calling process.

MPI_Send Sends a message.

MPI_Recv Receives a message.

The minimal set of MPI routines

Page 44: Chapter 6 pc

6.3.1 Starting and Terminating the MPI Library

MPI_Init is called prior to any other MPI routine. Its purpose is to initialize the MPI environment.

MPI_Finalize is called at the end of the computation; it performs various clean-up tasks to terminate the MPI environment.

The prototypes of these two functions are:

int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()

MPI_Init also strips off any MPI-related command-line arguments.

All MPI routines, data types, and constants are prefixed by "MPI_". The return code for successful completion is MPI_SUCCESS.

Page 45: Chapter 6 pc

6.3.2 Communicators (1/2)

A communicator defines a communication domain - a set of processes that are allowed to communicate with each other.

Information about communication domains is stored in variables of type MPI_Comm.

Communicators are used as arguments to all message transfer MPI routines.

Page 46: Chapter 6 pc

Communicators (2/2)

A process can belong to many different (possibly overlapping) communication domains.

MPI defines a default communicator called MPI_COMM_WORLD which includes all the processes.

Page 47: Chapter 6 pc

6.3.3 Getting Information (1/2)

The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label of the calling process, respectively.

The calling sequences of these routines are as follows:

int MPI_Comm_size(MPI_Comm comm, int *size)

int MPI_Comm_rank(MPI_Comm comm, int *rank)

Page 48: Chapter 6 pc

Getting Information (2/2)

The rank of a process is an integer that ranges from zero up to the size of the communicator minus one.

Page 49: Chapter 6 pc

Our First MPI Program (in C)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("From process %d out of %d, Hello World!\n", myrank, npes);
    MPI_Finalize();
    return 0;
}

Page 50: Chapter 6 pc

Our First MPI Program (in C++)

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    std::printf("From process %d out of %d, Hello World!\n", myrank, npes);
    MPI_Finalize();
    return 0;
}

Page 51: Chapter 6 pc

6.3.4 Sending and Receiving Messages (1/5)

The basic functions for sending and receiving messages in MPI are MPI_Send and MPI_Recv, respectively.

The calling sequences of these routines are as follows:

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)
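As a quick illustration (a minimal sketch, not part of the original slides), the following program sends one integer from rank 0 to rank 1 using these two calls; the tag value 7 and the variable names are arbitrary choices, and the program assumes it is run with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        value = 100;
        /* send one int to rank 1 with tag 7 */
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        /* receive one int from rank 0 with tag 7 */
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}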

Page 52: Chapter 6 pc

Sending and Receiving Messages (2/5)

MPI provides equivalent datatypes for all C datatypes. This is done for portability reasons.

The datatype MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data items that has been created by packing non-contiguous data.

The message-tag can take values ranging from zero up to the MPI defined constant MPI_TAG_UB.

Page 53: Chapter 6 pc

MPI Datatypes

MPI Datatype            C Datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE

MPI_PACKED Not in C

Page 54: Chapter 6 pc

Sending and Receiving Messages (3/5)

MPI allows specification of wildcard arguments for both source and tag.

If source is set to MPI_ANY_SOURCE, then any process of the communication domain can be the source of the message.

If tag is set to MPI_ANY_TAG, then messages with any tag are accepted.

On the receive side, the message must be of length equal to or less than the length field specified.

Page 55: Chapter 6 pc

Sending and Receiving Messages (4/5)

On the receiving end, the status variable can be used to get information about the MPI_Recv operation.

The corresponding data structure contains:

typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
};

Page 56: Chapter 6 pc

Sending and Receiving Messages (5/5)

The MPI_Get_count function returns the precise count of data items received:

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
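A hedged sketch (not from the slides) combining the wildcard arguments, the status fields, and MPI_Get_count: rank 0 accepts a message from any source and any tag, then inspects who sent it and how many elements actually arrived. The buffer sizes and the tag 3 are arbitrary, and at least two processes are assumed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, count;
    int buf[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* accept a message from any sender with any tag */
        MPI_Recv(buf, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        printf("Got %d ints from rank %d (tag %d)\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    } else if (myrank == 1) {
        int data[5] = {1, 2, 3, 4, 5};
        /* send fewer elements (5) than the receiver's capacity (10) */
        MPI_Send(data, 5, MPI_INT, 0, 3, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}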

Page 57: Chapter 6 pc

MPI_Recv Implementation

The MPI_Recv returns only after the requested message has been received and copied into the buffer.

That is, MPI_Recv is a blocking receive operation.

Page 58: Chapter 6 pc

MPI_Send Implementation (1/2)

MPI allows two different implementations for MPI_Send:

In the first implementation, MPI_Send returns only after the corresponding MPI_Recv has been issued and the message has been sent to the receiver.
In the second implementation, MPI_Send first copies the message into a buffer and then returns, without waiting for the corresponding MPI_Recv to be executed.

Page 59: Chapter 6 pc

MPI_Send Implementation (2/2)

In either implementation, once MPI_Send returns, the buffer pointed to by the buf argument can be safely reused and overwritten.

MPI programs must run correctly regardless of which of the two methods is used. Such programs are called safe.

In writing safe MPI programs, it is helpful to forget about the buffered alternative and just think of MPI_Send as a blocking send operation.

Page 60: Chapter 6 pc

Deadlocks Example-1

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking, there is a deadlock.

Page 61: Chapter 6 pc

Deadlocks Example-2 (1/2)

Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).

Page 62: Chapter 6 pc

Deadlocks Example-2 (2/2)

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...

Once again, we have a deadlock if MPI_Send is blocking.

Page 63: Chapter 6 pc

Avoiding Deadlocks Circular Wait

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1) {
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
}
else {
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
}
...

Page 64: Chapter 6 pc

Sending and Receiving Messages Simultaneously (1/2)

To exchange messages, MPI provides the following function:

int MPI_Sendrecv(void *sendbuf, int sendcount,
                 MPI_Datatype senddatatype, int dest, int sendtag,
                 void *recvbuf, int recvcount,
                 MPI_Datatype recvdatatype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

Page 65: Chapter 6 pc

Sending and Receiving Messages Simultaneously (2/2)

The arguments include the arguments to the send and receive functions. If we wish to use the same buffer for both send and receive, we can use:

int MPI_Sendrecv_replace(void *buf, int count,
                         MPI_Datatype datatype, int dest, int sendtag,
                         int source, int recvtag, MPI_Comm comm,
                         MPI_Status *status)
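As a hedged sketch (not part of the original slides), MPI_Sendrecv lets the ring exchange from the earlier deadlock example be written without manually splitting the processes into senders and receivers; the tag value 1 mirrors the earlier example.

/* Sketch: safe ring exchange with MPI_Sendrecv. */
int a[10], b[10], npes, myrank;
MPI_Status status;

MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

/* send a[] to the right neighbor, receive b[] from the left neighbor */
MPI_Sendrecv(a, 10, MPI_INT, (myrank + 1) % npes, 1,
             b, 10, MPI_INT, (myrank - 1 + npes) % npes, 1,
             MPI_COMM_WORLD, &status);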

Page 66: Chapter 6 pc

6.3.5 Example: Odd-Even Sort

Do yourself


Page 67: Chapter 6 pc

Topologies and Embeddings

MPI allows a programmer to organize processors into logical k-d meshes.

The processor ids in MPI_COMM_WORLD can be mapped to other communicators (corresponding to higher-dimensional meshes) in many ways.

The goodness of any such mapping is determined by the interaction pattern of the underlying program and the topology of the machine.

MPI does not provide the programmer any control over these mappings.

Page 68: Chapter 6 pc

Topologies and Embeddings

Different ways to map a set of processes to a two-dimensional grid. (a) and (b) show a row-wise and a column-wise mapping of these processes, (c) shows a mapping that follows a space-filling curve (dotted line), and (d) shows a mapping in which neighboring processes are directly connected in a hypercube.

Page 69: Chapter 6 pc

Creating and Using Cartesian Topologies

We can create Cartesian topologies using the function:

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims,
                    int *periods, int reorder, MPI_Comm *comm_cart)

This function takes the processes in the old communicator and creates a new communicator with ndims dimensions.

Each process can now be identified in this new Cartesian topology by a coordinate vector of length ndims.

Page 70: Chapter 6 pc

Creating and Using Cartesian Topologies

Since sending and receiving messages still require (one-dimensional) ranks, MPI provides routines to convert ranks to Cartesian coordinates and vice versa:

int MPI_Cart_coords(MPI_Comm comm_cart, int rank, int maxdims,
                    int *coords)

int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)

The most common operation on Cartesian topologies is a shift. To determine the ranks of the source and destination of such shifts, MPI provides the following function:

int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step,
                   int *rank_source, int *rank_dest)
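A hedged sketch (not from the slides) tying these calls together: build a periodic two-dimensional grid, look up this process's coordinates, and find its neighbors along dimension 0. MPI_Dims_create is used here only to pick a balanced grid shape for whatever process count is supplied; it is an extra call not discussed in the slides.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, my2drank;
    int dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    int left, right;
    MPI_Comm comm_2d;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    MPI_Dims_create(npes, 2, dims);                 /* choose a 2-D shape */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);

    MPI_Comm_rank(comm_2d, &my2drank);              /* rank may be reordered */
    MPI_Cart_coords(comm_2d, my2drank, 2, coords);
    MPI_Cart_shift(comm_2d, 0, 1, &left, &right);   /* neighbors in dim 0 */

    printf("rank %d is at (%d,%d); dim-0 neighbors: %d and %d\n",
           my2drank, coords[0], coords[1], left, right);

    MPI_Comm_free(&comm_2d);
    MPI_Finalize();
    return 0;
}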

Page 71: Chapter 6 pc

Overlapping Communication with Computation

In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations:

int MPI_Isend(void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request)

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Request *request)

These operations return before the operations have been completed. The function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished:

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

MPI_Wait waits for the operation to complete:

int MPI_Wait(MPI_Request *request, MPI_Status *status)
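A hedged sketch (not from the slides) of the overlap pattern: rank 1 posts a non-blocking receive, is free to do unrelated computation, and then waits before touching the received value. The tag 0 and variable names are arbitrary, and at least two processes are assumed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, value = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);

        /* ... computation that does not touch 'value' can go here ... */

        MPI_Wait(&request, &status);   /* only now is it safe to use value */
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}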

Page 72: Chapter 6 pc

Avoiding Deadlocks

Using non-blocking operations removes most deadlocks. Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

Replacing either the send or the receive operations with non-blocking counterparts fixes this deadlock.

Page 73: Chapter 6 pc

Collective Communication and Computation Operations

MPI provides an extensive set of functions for performing common collective communication operations.

Each of these operations is defined over a group corresponding to the communicator.

All processes in a communicator must call these operations.

Page 74: Chapter 6 pc

Collective Communication Operations

The barrier synchronization operation is performed in MPI using:

int MPI_Barrier(MPI_Comm comm)

The one-to-all broadcast operation is:

int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
              int source, MPI_Comm comm)

The all-to-one reduction operation is:

int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int target,
               MPI_Comm comm)
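A hedged sketch (not from the slides): rank 0 broadcasts a value to all processes, each process computes a local contribution, and MPI_Reduce sums the contributions back at rank 0. The particular values are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, n = 0, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0)
        n = 10;                                 /* value chosen by the root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    local = n * myrank;                         /* per-process contribution */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myrank == 0)
        printf("sum = %d\n", total);

    MPI_Finalize();
    return 0;
}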

Page 75: Chapter 6 pc

Predefined Reduction Operations

Operation        Meaning                  Datatypes

MPI_MAX Maximum C integers and floating point

MPI_MIN Minimum C integers and floating point

MPI_SUM Sum C integers and floating point

MPI_PROD Product C integers and floating point

MPI_LAND Logical AND C integers

MPI_BAND Bit-wise AND C integers and byte

MPI_LOR Logical OR C integers

MPI_BOR Bit-wise OR C integers and byte

MPI_LXOR Logical XOR C integers

MPI_BXOR Bit-wise XOR C integers and byte

MPI_MAXLOC max value and location Data-pairs

MPI_MINLOC min value and location Data-pairs

Page 76: Chapter 6 pc

Collective Communication Operations

The operation MPI_MAXLOC combines pairs of values (vi, li) and returns the pair (v, l) such that v is the maximum among all the vi's and l is the corresponding li (if there is more than one maximum, l is the smallest among these li's).

MPI_MINLOC does the same, except for the minimum value of vi.

An example use of the MPI_MINLOC and MPI_MAXLOC operators.
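A hedged sketch (not from the slides) of MPI_MAXLOC with the MPI_DOUBLE_INT pair type: each rank contributes a (value, rank) pair, and rank 0 learns the largest value and which rank owns it. The local value is just a placeholder computation.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank;
    struct { double value; int rank; } local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    local.value = 1.0 * myrank * myrank;   /* stand-in for a local result */
    local.rank  = myrank;                  /* the "location" field */

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0,
               MPI_COMM_WORLD);

    if (myrank == 0)
        printf("max %.1f found on rank %d\n", global.value, global.rank);

    MPI_Finalize();
    return 0;
}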

Page 77: Chapter 6 pc

Collective Communication Operations

MPI datatypes for data-pairs used with the MPI_MAXLOC and MPI_MINLOC reduction operations.

MPI Datatype C Datatype

MPI_2INT pair of ints

MPI_SHORT_INT short and int

MPI_LONG_INT long and int

MPI_LONG_DOUBLE_INT long double and int

MPI_FLOAT_INT float and int

MPI_DOUBLE_INT double and int

Page 78: Chapter 6 pc

Collective Communication Operations

If the result of the reduction operation is needed by all processes, MPI provides:

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

To compute prefix-sums, MPI provides:

int MPI_Scan(void *sendbuf, void *recvbuf, int count,
             MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
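A hedged sketch (not from the slides): each rank contributes its rank number, MPI_Scan leaves the inclusive prefix sum 0 + 1 + ... + myrank on each process, and MPI_Allreduce leaves the full sum everywhere.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, prefix, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Scan(&myrank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&myrank, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d, total = %d\n", myrank, prefix, total);

    MPI_Finalize();
    return 0;
}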

Page 79: Chapter 6 pc

Collective Communication Operations

The gather operation is performed in MPI using:

int MPI_Gather(void *sendbuf, int sendcount,
               MPI_Datatype senddatatype, void *recvbuf,
               int recvcount, MPI_Datatype recvdatatype,
               int target, MPI_Comm comm)

MPI also provides the MPI_Allgather function, in which the data are gathered at all the processes:

int MPI_Allgather(void *sendbuf, int sendcount,
                  MPI_Datatype senddatatype, void *recvbuf,
                  int recvcount, MPI_Datatype recvdatatype,
                  MPI_Comm comm)

The corresponding scatter operation is:

int MPI_Scatter(void *sendbuf, int sendcount,
                MPI_Datatype senddatatype, void *recvbuf,
                int recvcount, MPI_Datatype recvdatatype,
                int source, MPI_Comm comm)

Page 80: Chapter 6 pc

Collective Communication Operations

The all-to-all personalized communication operation is performed by:

int MPI_Alltoall(void *sendbuf, int sendcount,
                 MPI_Datatype senddatatype, void *recvbuf,
                 int recvcount, MPI_Datatype recvdatatype,
                 MPI_Comm comm)

Using this core set of collective operations, a number of programs can be greatly simplified.

Page 81: Chapter 6 pc

Groups and Communicators

In many parallel algorithms, communication operations need to be restricted to certain subsets of processes.

MPI provides mechanisms for partitioning the group of processes that belong to a communicator into subgroups each corresponding to a different communicator.

The simplest such mechanism is:

int MPI_Comm_split(MPI_Comm comm, int color, int key,
                   MPI_Comm *newcomm)

This operation groups processes by color and orders each resulting group by key.
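A hedged sketch (not from the slides): split MPI_COMM_WORLD into two subgroups (even and odd world ranks) and compute a sum within each subgroup. The color and key choices are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, subrank, subsum;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* color selects the subgroup; key (here the world rank) orders it */
    MPI_Comm_split(MPI_COMM_WORLD, myrank % 2, myrank, &subcomm);

    MPI_Comm_rank(subcomm, &subrank);
    MPI_Allreduce(&myrank, &subsum, 1, MPI_INT, MPI_SUM, subcomm);

    printf("world rank %d is subgroup rank %d; subgroup sum = %d\n",
           myrank, subrank, subsum);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}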

Page 82: Chapter 6 pc

Groups and Communicators

Using MPI_Comm_split to split a group of processes in a communicator into subgroups.

Page 83: Chapter 6 pc

Groups and Communicators

In many parallel algorithms, processes are arranged in a virtual grid, and in different steps of the algorithm, communication needs to be restricted to a different subset of the grid.

MPI provides a convenient way to partition a Cartesian topology to form lower-dimensional grids:

int MPI_Cart_sub(MPI_Comm comm_cart, int *keep_dims,
                 MPI_Comm *comm_subcart)

If keep_dims[i] is true (non-zero value in C) then the ith dimension is retained in the new sub-topology.

The coordinate of a process in a sub-topology created by MPI_Cart_sub can be obtained from its coordinate in the original topology by disregarding the coordinates that correspond to the dimensions that were not retained.
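A hedged sketch (not from the slides): build a two-dimensional Cartesian topology and split it into row communicators by keeping only dimension 1. MPI_Dims_create is again used only to choose a grid shape for the given process count.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int npes, dims[2] = {0, 0}, periods[2] = {0, 0}, keep_dims[2];
    MPI_Comm comm_2d, comm_row;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    MPI_Dims_create(npes, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);

    keep_dims[0] = 0;   /* drop dimension 0 */
    keep_dims[1] = 1;   /* keep dimension 1: one communicator per row */
    MPI_Cart_sub(comm_2d, keep_dims, &comm_row);

    /* comm_row can now be used for row-wise collectives, e.g. MPI_Bcast */

    MPI_Comm_free(&comm_row);
    MPI_Comm_free(&comm_2d);
    MPI_Finalize();
    return 0;
}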

Page 84: Chapter 6 pc

Groups and Communicators

Splitting a Cartesian topology of size 2 x 4 x 7 into (a) four subgroups of size 2 x 1 x 7, and (b) eight subgroups of size 1 x 1 x 7.

Page 85: Chapter 6 pc

One-to-All Broadcast on Rings

One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.

(3/3)

Page 86: Chapter 6 pc

Broadcast and Reduction on a Mesh: Example

One-to-all broadcast on a 16-node mesh.

(2/2)

Page 87: Chapter 6 pc

4.1.3 Broadcast and Reduction on a Hypercube (1/2)

There are d phases on a d-dimensional hypercube. In phase i, 2^(i-1) messages are sent in parallel along dimension i.

Example: source node = 000, d = 3
Phase 1: 000 -> 100
Phase 2: 000 -> 010, 100 -> 110
Phase 3: 000 -> 001, 100 -> 101, 010 -> 011, 110 -> 111
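A hedged sketch of this procedure written with MPI point-to-point calls (not part of the original slides), assuming the number of processes is a power of two and that ranks correspond directly to hypercube node labels:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, npes, d = 0, data = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    while ((1 << d) < npes) d++;             /* d = log2(npes), assumed exact */

    if (myrank == 0) data = 99;              /* value to broadcast */

    int mask = (1 << d) - 1;
    for (int i = d - 1; i >= 0; i--) {
        mask ^= (1 << i);                    /* clear bit i of the mask */
        if ((myrank & mask) == 0) {          /* active in this phase */
            int partner = myrank ^ (1 << i);
            if ((myrank & (1 << i)) == 0)
                MPI_Send(&data, 1, MPI_INT, partner, i, MPI_COMM_WORLD);
            else
                MPI_Recv(&data, 1, MPI_INT, partner, i,
                         MPI_COMM_WORLD, &status);
        }
    }

    printf("rank %d has data = %d\n", myrank, data);
    MPI_Finalize();
    return 0;
}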

Page 88: Chapter 6 pc

MPI Names of various Operations

Summary (2/3)