
2.1 Message-Passing Computing

Cluster Computing, UNC, B. Wilkinson, 2007.

2.2 Message-Passing Programming using User-level Message-Passing Libraries

Programming is done in a normal high-level language such as C, augmented with message-passing library calls that perform direct process-to-process message passing. Two things are required:

1. A method of creating separate processes for execution on different computers

2. A method of sending and receiving messages

2.3 Multiple program, multiple data (MPMD) model

[Figure: separate source files, each compiled to suit its processor, producing a different executable for each of Processor 0 through Processor p - 1.]

2.4 Multiple Program Multiple Data (MPMD) model with dynamic process creation

[Figure: Process 1 calls spawn(), which starts execution of Process 2 at a later point in time.]

Processes started from within master process - dynamic process creation.

Potentially very flexible, but incurs the overhead of dynamically starting processes.

PVM used this form. MPI-2 also has dynamic process creation, although we will not use it.

2.5 Single Program Multiple Data (SPMD) model

[Figure: a single source file compiled to suit each processor, producing one executable per processor, Processor 0 through Processor p - 1.]

The different processes are merged into one program. Control (if) statements select the parts each processor executes. All executables are started together: static process creation.

2.6 Single Program Multiple Data (SPMD) model (continued)

Each process selects its own code path:

if (processID == 0) {

… // do this code

} else if (processID == 1) {

… //do this code

} else …

Typically coded with one master and a set of identical slave processes, rather than a different program for each process.

MPI uses this form.

2.7 Advantages/disadvantages of MPMD and SPMD

• MPMD with dynamic process creation
– Flexible: can start up processes on demand during execution, for example when searching through a search space.
– Has process start-up overhead.

• SPMD (with static process creation)
– Easier to code: just one program to write.
– Collective message-passing routines are the same in each process (see later).
– Efficient process execution.

2.8 Basic "point-to-point" Send and Receive Routines

Passing a message between processes using send() and recv() library calls (generic syntax; actual formats later):

Process 1:  send(&x, 2);    /* send x to process 2 */
Process 2:  recv(&y, 1);    /* receive from process 1 into y */

[Figure: data moves from x in Process 1 to y in Process 2.]

2.9 Synchronous Message Passing

Routines that return only when the message transfer has completed.

Synchronous send routine
• Waits until the complete message has been accepted by the receiving process before returning.

Synchronous receive routine
• Waits until the message it is expecting arrives.

Synchronous routines intrinsically perform two actions: they transfer data and they synchronize processes. Neither process can proceed until the message has been passed from the source to the destination, so no message buffer storage is needed.
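As an illustrative MPI sketch (looking ahead to the MPI routines introduced later; not part of the original slide), a synchronous send pairs MPI_Ssend() with an ordinary receive. Run with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
   int rank, x = 42, y = 0;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 0) {
      /* MPI_Ssend() does not return until the matching receive has started */
      MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
   } else if (rank == 1) {
      MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf("Process 1 received %d\n", y);
   }
   MPI_Finalize();
}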

2.10 Synchronous send() and recv() using a 3-way protocol

[Figure (a), send() occurs before recv(): Process 1 issues send() and a "request to send", then suspends; when Process 2 reaches recv(), it returns an acknowledgment, the message is transferred, and both processes continue.]

[Figure (b), recv() occurs before send(): Process 2 issues recv() and suspends; when Process 1 reaches send(), it issues a "request to send", Process 2 acknowledges, the message is transferred, and both processes continue.]

2.11 Asynchronous Message Passing

• Blocking has been used to describe routines that do not return until the transfer is completed.
– The routines are "blocked" from continuing.

• Non-blocking has been used to describe routines that return whether or not the message has been received. They usually require local storage for messages.
– In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care.

• In that sense, the terms synchronous and blocking were synonymous.

2.12 MPI Definitions of Blocking and Non-Blocking

• Locally Blocking - return after their local actions complete, though the message transfer may not have been completed.

• Non-blocking - return immediately.

It is assumed that the data storage used for the transfer is not modified by subsequent statements before the transfer completes, and it is left to the programmer to ensure this. These terms may have different interpretations in other systems.

2.13 How message-passing routines can return before the message transfer has completed

A message buffer is needed between source and destination to hold the message:

[Figure: Process 1 calls send(), the message is copied into a message buffer, and Process 1 continues; later, Process 2 calls recv() and reads the message from the buffer.]

2.14 Message Buffer

• For a receive routine, the message has to have been received if we want the message.

• If recv() is reached before send(), the message buffer will be empty and recv() waits for the message.

• For a send routine, once the local actions have been completed and the message is safely on its way, the process can continue with subsequent work.

• In this way, using such send routines can decrease the overall execution time.

• In practice, buffers can only be of finite length and a point could be reached when the send routine is held up because all the available buffer space has been exhausted.

• It may be necessary to know at some point if the message has actually been received, which will require additional message passing.


2.15 Message Tag

• Used to differentiate between different types of messages being sent.

• Message tag is carried within message.

• If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().
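Looking ahead to MPI (introduced later in these slides), an illustrative fragment of a wildcard-tag receive; the status object reports which tag actually arrived:

int y;
MPI_Status status;

/* match a message from process 1 regardless of its tag */
MPI_Recv(&y, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("Received value %d with tag %d\n", y, status.MPI_TAG);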

2.16 Message Tag Example

To send a message, x, with message tag 5 from source process 1 to destination process 2, and assign it to y:

Process 1:  send(&x, 2, 5);    /* send x to process 2 with tag 5 */
Process 2:  recv(&y, 1, 5);    /* wait for a message from process 1 with a tag of 5 */

[Figure: data moves from x in Process 1 to y in Process 2.]

2.17 "Group" message passing routines

Routines that send message(s) to a group of processes or receive message(s) from a group of processes.

They offer higher efficiency than separate point-to-point routines, although they are not absolutely necessary.

2.18 Broadcast

Broadcast: sending the same message to all processes concerned with the problem.
Multicast: sending the same message to a defined group of processes.

[Figure: each of Process 0, Process 1, ..., Process p - 1 executes bcast(); the data in the root's buffer buf is copied to the data buffers of all processes.]

SPMD (MPI) form. The broadcast action does not occur until all the processes have executed their broadcast routine, and the broadcast operation will have the effect of synchronizing the processes.

2.19 Scatter

Sending each element of an array in the root process to a separate process. The contents of the ith location of the array are sent to the ith process. More than one element can be sent to each process.

[Figure: each process executes scatter(); successive portions of the root's buffer buf are distributed to the data buffers of Process 0, Process 1, ..., Process p - 1.]

SPMD (MPI) form.

2.20 Gather

Having one process collect individual values from a set of processes.

[Figure: each process executes gather(); the data buffers of Process 0, Process 1, ..., Process p - 1 are collected into the root's buffer buf.]

SPMD (MPI) form.

2.21 Reduce

A gather operation combined with a specified arithmetic/logical operation to produce a single value. Example: values could be gathered and then added together by the root.

[Figure: each process executes reduce(); the data values from Process 0, Process 1, ..., Process p - 1 are combined (here with +) into the root's buffer buf.]

SPMD (MPI) form.

2.22 Software Tools for Clusters

Late 1980s: Parallel Virtual Machine (PVM) developed. Became very popular.

Mid 1990s: Message-Passing Interface (MPI) standard defined.

Both are based upon the message-passing parallel programming model.

Both provide a set of user-level libraries for message passing, for use with sequential programming languages (C, C++, ...).

2.23 PVM (Parallel Virtual Machine)

Perhaps the first widely adopted attempt at using a workstation cluster as a multicomputer platform, developed by Oak Ridge National Laboratory. Available at no charge.

Programmer decomposes problem into separate programs (usually master and group of identical slave programs).

Programs compiled to execute on specific types of computers.

The set of computers used on a problem must be defined prior to executing the programs (in a hostfile).

2.24

Message routing between computers is done by PVM daemon processes installed by PVM on the computers that form the virtual machine.

[Figure: three workstations, each running a PVM daemon and one or more application programs (executables); messages between application programs are sent through the network via the daemons.]

The MPI implementation we use is similar. There can be more than one process running on each computer.

2.25 MPI (Message Passing Interface)

• Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability.

• Defines routines, not implementation.

• Several free implementations exist.

2.26 MPI Process Creation and Execution

• Purposely not defined; will depend upon the implementation.

• Only static process creation supported in MPI version 1. All processes must be defined prior to execution and started together.

• Originally SPMD model of computation.

• MPMD also possible with static creation.

2.27 Using the SPMD Computational Model

main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);
   .
   .
   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find process rank */
   if (myrank == 0)
      master();
   else
      slave();
   .
   .
   MPI_Finalize();
}

where master() and slave() are to be executed by master process and slave process, respectively.

2.28 Communicators

• Defines the scope of a communication operation.

• Processes have ranks associated with communicator.

• Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes.

• Other communicators can be established for groups of processes.

• Two types of communicator:
– Intracommunicators, for communication within a defined group
– Intercommunicators, for communication between defined groups

2.29 Reasoning for Communicators

• Provides a solution to unsafe message passing.
– Message tags alone are not sufficient.

• Enables basic error checking of message passing code by allowing programmer to define communication domains.

– Messages cannot be sent to destinations outside defined communication domain

2.30 Unsafe message passing: Example

[Figure (a), intended behavior: in each process, a call to a library routine lib() sits between the user's own send(...,1,...) in Process 0 and recv(...,0,...) in Process 1; the user's send matches the user's recv, and the library's internal sends and receives match each other.]

[Figure (b), possible behavior: without separate communication domains, the user's send(...,1,...) in Process 0 can instead be matched by a receive inside lib() in Process 1, so a message intended for the application is consumed by the library.]
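A common remedy, sketched here as an illustration (not taken from the slides), is for a library routine to duplicate the communicator it is given, so that its internal messages can never match the caller's:

/* Illustrative library routine with its own private communication domain */
void lib(MPI_Comm user_comm)
{
   MPI_Comm private_comm;

   MPI_Comm_dup(user_comm, &private_comm);   /* collective call over user_comm */
   /* ... all internal MPI_Send()/MPI_Recv() calls use private_comm ... */
   MPI_Comm_free(&private_comm);
}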

2.31 MPI Blocking Routines

• Return when "locally complete": when the location used to hold the message can be used again or altered without affecting the message being sent.

• A blocking send will send the message and return. This does not mean that the message has been received, just that the process is free to move on without adversely affecting the message.

2.32 Parameters of blocking send

MPI_Send(buf, count, datatype, dest, tag, comm)

buf       address of send buffer
count     number of items to send
datatype  datatype of each item
dest      rank of destination process
tag       message tag
comm      communicator

2.33 Parameters of blocking receive

MPI_Recv(buf, count, datatype, src, tag, comm, status)

buf       address of receive buffer
count     maximum number of items to receive
datatype  datatype of each item
src       rank of source process
tag       message tag
comm      communicator
status    status after operation

2.34 Example

To send an integer x from process 0 to process 1:

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
   int x;
   MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
   int x;
   MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

2.35 MPI Nonblocking Routines

• Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered.

• Nonblocking receive - MPI_Irecv() - will return even if no message to accept.

2.36 Nonblocking Routine Formats

MPI_Isend(buf,count,datatype,dest,tag,comm,request)

MPI_Irecv(buf,count,datatype,source,tag,comm, request)

Completion is detected by MPI_Wait() and MPI_Test(), which identify the operation through the request parameter.

MPI_Wait() waits until the operation has completed and then returns.

MPI_Test() returns immediately, with a flag set indicating whether the operation had completed at that time.
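A minimal sketch (assumed, not from the slides) of overlapping computation with a nonblocking send: do_some_work() is a hypothetical helper, and MPI_Test() is polled until the operation completes:

int x = 42, flag = 0;
MPI_Request req;
MPI_Status status;

MPI_Isend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
while (!flag) {
   do_some_work();                   /* hypothetical: useful computation */
   MPI_Test(&req, &flag, &status);   /* has the send completed yet? */
}
/* x may now be safely reused */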

2.37 Example

To send an integer x from process 0 to process 1 and allow process 0 to continue,

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
   int x;
   MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
   compute();
   MPI_Wait(&req1, &status);
} else if (myrank == 1) {
   int x;
   MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

2.38 Send Communication Modes

• Standard Mode Send - Not assumed that corresponding receive routine has started. Amount of buffering not defined by MPI. If buffering provided, send could complete before receive reached.

• Buffered Mode - Send may start and return before a matching receive. Necessary to specify buffer space via routine MPI_Buffer_attach().

• Synchronous Mode - Send and receive can start before each other but can only complete together.

• Ready Mode - Send can only start if matching receive already reached, otherwise error. Use with care.
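An illustrative buffered-mode fragment (buffer size chosen for a single int; not from the slides), showing the MPI_Buffer_attach() requirement mentioned above:

int x = 42;
int bufsize = MPI_BSEND_OVERHEAD + sizeof(int);
char *buffer = (char *)malloc(bufsize);

MPI_Buffer_attach(buffer, bufsize);
MPI_Bsend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* returns once message is copied to the buffer */
MPI_Buffer_detach(&buffer, &bufsize);              /* blocks until buffered messages are delivered */
free(buffer);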

2.39 Parameters of synchronous send (same as blocking send)

MPI_Ssend(buf, count, datatype, dest, tag, comm)

buf       address of send buffer
count     number of items to send
datatype  datatype of each item
dest      rank of destination process
tag       message tag
comm      communicator

2.40 Collective Communication

Involves a set of processes, defined by an intracommunicator. Message tags are not present. Principal collective operations:

• MPI_Bcast() - broadcast from root to all other processes
• MPI_Gather() - gather values from a group of processes
• MPI_Scatter() - scatter a buffer in parts to a group of processes
• MPI_Alltoall() - send data from all processes to all processes
• MPI_Reduce() - combine values on all processes to a single value
• MPI_Reduce_scatter() - combine values and scatter the results
• MPI_Scan() - compute prefix reductions of data on processes
• MPI_Barrier() - a means of synchronizing processes by stopping each one until they all have reached a specific "barrier" call
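Most of these routines appear on later slides; MPI_Scan() does not, so here is an illustrative prefix-sum sketch (values chosen arbitrarily):

int myrank, value, prefix;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
value = myrank + 1;
MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* with 4 processes: prefix is 1, 3, 6, 10 on ranks 0, 1, 2, 3 */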

2.41 Barrier: blocks the calling process until all processes have called it

MPI_Barrier(comm)

comm      communicator

2.42 Broadcast a message from the root process to all processes in comm, including itself

MPI_Bcast(*buf, count, datatype, root, comm)

Parameters:
*buf      message buffer (loaded)
count     number of entries in buffer
datatype  data type of buffer
root      rank of root
comm      communicator
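A minimal usage sketch (illustrative; array size arbitrary): the root loads the buffer, every process calls MPI_Bcast(), and afterwards all processes hold the same data:

int myrank, data[4];

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {                 /* root fills the buffer */
   data[0] = 10; data[1] = 20; data[2] = 30; data[3] = 40;
}
MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);   /* called by every process */
/* data[] is now identical on all processes */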

2.43 Gather values from a group of processes

MPI_Gather(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)

Parameters:
*sendbuf   send buffer
sendcount  number of send buffer elements
sendtype   data type of send elements
*recvbuf   receive buffer (loaded)
recvcount  number of elements received from each process
recvtype   data type of receive elements
root       rank of receiving process
comm       communicator

2.44 Example

To gather 10 items from each process of a group into process 0, using dynamically allocated memory in the root process:

int data[10];            /* data to be gathered from each process */
int *buf, grp_size;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);              /* find rank */
if (myrank == 0) {
   MPI_Comm_size(MPI_COMM_WORLD, &grp_size);         /* find group size */
   buf = (int *)malloc(grp_size*10*sizeof(int));     /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

Note that recvcount is the number of items received from each process (10 here), not the total.

MPI_Gather() gathers from all processes, including root.

2.45 Scatter a buffer from the root, in parts, to a group of processes

MPI_Scatter(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)

Parameters:
*sendbuf   send buffer
sendcount  number of elements sent to each process
sendtype   data type of elements
*recvbuf   receive buffer (loaded)
recvcount  number of receive buffer elements
recvtype   data type of receive elements
root       root process rank
comm       communicator
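A minimal usage sketch (illustrative; assumes each process is to receive 10 integers and that only the root's send buffer matters):

int myrank, numprocs, i;
int recvbuf[10];
int *sendbuf = NULL;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
if (myrank == 0) {                                   /* only root needs the send buffer */
   sendbuf = (int *)malloc(numprocs * 10 * sizeof(int));
   for (i = 0; i < numprocs * 10; i++) sendbuf[i] = i;
}
MPI_Scatter(sendbuf, 10, MPI_INT, recvbuf, 10, MPI_INT, 0, MPI_COMM_WORLD);
/* process k now holds elements 10*k .. 10*k + 9 in recvbuf */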

2.46 Combine values on all processes to a single value

MPI_Reduce(*sendbuf, *recvbuf, count, datatype, op, root, comm)

Parameters:
*sendbuf   send buffer address
*recvbuf   receive buffer address
count      number of send buffer elements
datatype   data type of send elements
op         reduce operation; several operations, including
           MPI_MAX   maximum
           MPI_MIN   minimum
           MPI_SUM   sum
           MPI_PROD  product
root       root process rank for result
comm       communicator

2.47 Hello World

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int size, rank;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   printf("Hello MPI! Process %d of %d\n", rank, size);
   MPI_Finalize();
}

2.48 Sample MPI program

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
   int myid, numprocs;
   int data[MAXSIZE], i, x, low, high, myresult = 0, result;
   char fn[255];
   FILE *fp;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   if (myid == 0) {                       /* open input file and initialize data */
      strcpy(fn, getenv("HOME"));
      strcat(fn, "/MPI/rand_data.txt");
      if ((fp = fopen(fn, "r")) == NULL) {
         printf("Can't open the input file: %s\n\n", fn);
         exit(1);
      }
      for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
   }

   MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */

   x = MAXSIZE/numprocs;                  /* add my portion of the data */
   low = myid * x;
   high = low + x;
   for (i = low; i < high; i++)
      myresult += data[i];
   printf("I got %d from %d\n", myresult, myid);

   /* compute global sum */
   MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
   if (myid == 0) printf("The sum is %d.\n", result);

   MPI_Finalize();
}

2.49 MPI Groups, Communicators

• Collective communication may need to be performed by subsets of the processes in the computation.

• MPI provides routines for:
– defining new process groups from subsets of existing process groups such as MPI_COMM_WORLD
– creating a new communicator for a new process group
– performing collective communication within that process group

2.50 Groups and communicators

[Figure: groups and communicators.]

2.51 Facts about groups and communicators

• Group:
  – ordered set of processes
  – each process in a group has a unique integer id called its rank within that group
  – a process can belong to more than one group
    • rank is always relative to a group
  – groups are "opaque objects"
    • use only MPI-provided routines for manipulating groups

• Communicators:
  – all communication must specify a communicator
  – from the programming viewpoint, groups and communicators are equivalent
  – communicators are also "opaque objects"

• Groups and communicators are dynamic objects and can be created and destroyed during the execution of the program.

2.52 Typical usage

1. Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group

2. Form new group as a subset of global group using MPI_Group_incl or MPI_Group_excl

3. Create new communicator for new group using MPI_Comm_create

4. Determine new rank in new communicator using MPI_Comm_rank

5. Conduct communications using any MPI message passing routine

6. When finished, free up new communicator and group (optional) using MPI_Comm_free and MPI_Group_free

2.53

main(int argc, char **argv)
{
   int me, count, count2;
   void *send_buf, *recv_buf, *send_buf2, *recv_buf2;
   MPI_Group MPI_GROUP_WORLD, grprem;
   MPI_Comm commslave;
   static int ranks[] = {0};

   MPI_Init(&argc, &argv);
   MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);

   MPI_Group_excl(MPI_GROUP_WORLD, 1, ranks, &grprem);
   MPI_Comm_create(MPI_COMM_WORLD, grprem, &commslave);

   if (me != 0) {
      /* compute on slaves */
      MPI_Reduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, 1, commslave);
   }
   /* zero falls through immediately to this reduce, others do it later... */
   MPI_Reduce(send_buf2, recv_buf2, count2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

   MPI_Comm_free(&commslave);
   MPI_Group_free(&MPI_GROUP_WORLD);
   MPI_Group_free(&grprem);
   MPI_Finalize();
}

2.54

#include "mpi.h"
#include <stdio.h>
#define NPROCS 8

int main(int argc, char *argv[])
{
   int rank, new_rank, sendbuf, recvbuf, numtasks;
   int ranks1[4] = {0,1,2,3}, ranks2[4] = {4,5,6,7};
   MPI_Group orig_group, new_group;
   MPI_Comm new_comm;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   sendbuf = rank;

   /* Extract the original group handle */
   MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

   /* Divide tasks into two distinct groups based upon rank */
   if (rank < NPROCS/2) {
      MPI_Group_incl(orig_group, NPROCS/2, ranks1, &new_group);
   } else {
      MPI_Group_incl(orig_group, NPROCS/2, ranks2, &new_group);
   }

   /* Create new communicator and then perform collective communications */
   MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
   MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);

   MPI_Group_rank(new_group, &new_rank);
   printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);

   MPI_Finalize();
}

Sample output:

rank= 7 newrank= 3 recvbuf= 22
rank= 0 newrank= 0 recvbuf= 6
rank= 1 newrank= 1 recvbuf= 6
rank= 2 newrank= 2 recvbuf= 6
rank= 6 newrank= 2 recvbuf= 22
rank= 3 newrank= 3 recvbuf= 6
rank= 4 newrank= 0 recvbuf= 22
rank= 5 newrank= 1 recvbuf= 22

2.55 Evaluating Programs Empirically: Measuring Execution Time

To measure the execution time between point L1 and point L2 in the code, we might have a construction such as:

L1:  time(&t1);   /* start timer */
     .
     .
L2:  time(&t2);   /* stop timer */

     elapsed_time = difftime(t2, t1);   /* elapsed time = t2 - t1 */
     printf("Elapsed time = %5.2f seconds", elapsed_time);

2.56

MPI provides the routine MPI_Wtime() for returning time (in seconds):

double start_time, end_time, exe_time;

start_time = MPI_Wtime();
   .
   .
end_time = MPI_Wtime();
exe_time = end_time - start_time;

2.57 Average/least/most execution time spent by individual processes

int myrank, numprocs;
double mytime, maxtime, mintime, avgtime;   /* variables used for gathering timing statistics */

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

MPI_Barrier(MPI_COMM_WORLD);      /* synchronize all processes */
mytime = MPI_Wtime();             /* get time just before work section */
work();
mytime = MPI_Wtime() - mytime;    /* get time just after work section */

/* compute max, min, and average timing statistics */
MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &mintime, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &avgtime, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

if (myrank == 0) {
   avgtime /= numprocs;
   printf("Min: %lf Max: %lf Avg: %lf\n", mintime, maxtime, avgtime);
}

2.58 Compiling/Executing MPI Programs

• Minor differences in the command lines required depending upon MPI implementation.

• For the assignments, we will use MPICH-2.

• Generally, a file needs to be present that lists all the computers to be used. MPI then uses the computers listed.

2.59 Set MPICH2 Path

Add the following at the end of your ~/.bashrc file and source it with the command source ~/.bashrc (or log in again):

#----------------------------------------------------------------------
# MPICH2 setup
export PATH=/opt/MPICH2/bin:$PATH
export MANPATH=/opt/MPICH2/bin:$MANPATH
#----------------------------------------------------------------------

Some logging and visualization help: you can link with the libraries -llmpe -lmpe to enable logging and the MPE environment. Then run the program as usual and a log file will be produced. The log file can be visualized using the jumpshot program that comes bundled with MPICH2.

2.60 Defining the Computers to Use

Generally, you need to create a file containing the list of machines to be used.

Sample machines file (or hostfile):

athena.cs.siu.edu
oscarnode1.cs.siu.edu
..........
oscarnode8.cs.siu.edu

In MPICH, if just using one computer, do not need this file.

2.61 MPICH Commands

Two basic commands:

• mpicc, a script to compile MPI programs

• mpiexec (or mpirun), the command to execute an MPI program.

2.62 Compiling/executing an (SPMD) MPI program

For MPICH. At a command line:

To start MPI: Nothing special.

To compile MPI programs:

for C mpicc -o prog prog.c

for C++ mpiCC -o prog prog.cpp

To execute MPI program:

mpiexec -n no_procs prog

where no_procs is a positive integer (the number of processes).

2.63 Executing an MPICH program on multiple computers

Create a file called, say, "machines" containing the list of machines:

athena.cs.siu.edu
oscarnode1.cs.siu.edu
..........
oscarnode8.cs.siu.edu

Establish the network environment:

mpdboot -n 9 -f machines
mpdtrace
mpdallexit

2.64

mpirun -machinefile machines -np 4 prog

would run prog with four processes.

Each process would execute on one of the machines in the list. MPI would cycle through the list of machines, giving processes to machines.

You can also specify the number of processes on a particular machine by adding that number after the machine name.

The "MPI standard" command mpiexec is now the replacement for mpirun, although mpirun still exists.

2.65 Reference Tutorial Materials

• http://www-unix.mcs.anl.gov/mpi/tutorial/index.html
• http://www.cs.utexas.edu/users/pingali/CS378/2008sp/lectureschedule.html