
2.1 Message-Passing Computing

Cluster Computing, UNC, B. Wilkinson, 2007.

2.2 Message-Passing Programming using User-level Message-Passing Libraries

Programming is done in a normal high-level language such as C, augmented with message-passing library calls that perform direct process-to-process message passing. Two things are required:

1. A method of creating separate processes for execution on different computers

2. A method of sending and receiving messages

2.3 Multiple program, multiple data (MPMD) model

[Figure: separate source files, each compiled to suit its processor, producing a different executable for each of Processor 0 through Processor p - 1.]

2.4 Multiple Program Multiple Data (MPMD) model with dynamic process creation

[Figure: Process 1 calls spawn(), which starts execution of Process 2 at a later point in time.]

Processes started from within master process - dynamic process creation.

Potentially very flexible, but incurs the overhead of dynamically starting processes.

PVM used this form. MPI-2 also has dynamic process creation, although we will not use it.

2.5 Single Program Multiple Data (SPMD) model

[Figure: a single source file compiled to suit each processor, producing one executable per processor, Processor 0 through Processor p - 1.]

The different processes are merged into one program. Control (if) statements select the parts each processor executes. All executables are started together: static process creation.

2.6 Single Program Multiple Data (SPMD) model (continued)

Each process selects its own code path:

if (processID == 0) {

… // do this code

} else if (processID == 1) {

… //do this code

} else …

Typically coded with one master and a set of identical slave processes, rather than a different program for each process.

MPI uses this form.

2.7 Advantages/disadvantages of MPMD and SPMD

• MPMD with dynamic process creation
– Flexible: can start up processes on demand during execution, for example when searching through a search space.
– Has process start-up overhead.

• SPMD (with static process creation)
– Easier to code: just one program to write.
– Collective message-passing routines are the same in each process (see later).
– Efficient process execution.

2.8 Basic "point-to-point" Send and Receive Routines

Passing a message between processes using send() and recv() library calls (generic syntax; actual formats later):

Process 1:  send(&x, 2);    /* send x to process 2 */
Process 2:  recv(&y, 1);    /* receive from process 1 into y */

[Figure: data moves from x in Process 1 to y in Process 2.]

2.9 Synchronous Message Passing

Routines that return only when the message transfer has completed.

Synchronous send routine
• Waits until the complete message has been accepted by the receiving process before returning.

Synchronous receive routine
• Waits until the message it is expecting arrives.

Synchronous routines intrinsically perform two actions: they transfer data and they synchronize processes. Neither process can proceed until the message has been passed from the source to the destination, so no message buffer storage is needed.
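As an illustrative MPI sketch (looking ahead to the MPI routines introduced later; not part of the original slide), a synchronous send pairs MPI_Ssend() with an ordinary receive. Run with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
   int rank, x = 42, y = 0;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 0) {
      /* MPI_Ssend() does not return until the matching receive has started */
      MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
   } else if (rank == 1) {
      MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      printf("Process 1 received %d\n", y);
   }
   MPI_Finalize();
}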

2.10 Synchronous send() and recv() using a 3-way protocol

[Figure (a), send() occurs before recv(): Process 1 issues send() and a "request to send", then suspends; when Process 2 reaches recv(), it returns an acknowledgment, the message is transferred, and both processes continue.]

[Figure (b), recv() occurs before send(): Process 2 issues recv() and suspends; when Process 1 reaches send(), it issues a "request to send", Process 2 acknowledges, the message is transferred, and both processes continue.]

2.11 Asynchronous Message Passing

• Blocking has been used to describe routines that do not return until the transfer is completed.
– The routines are "blocked" from continuing.

• Non-blocking has been used to describe routines that return whether or not the message has been received. They usually require local storage for messages.
– In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care.

• In that sense, the terms synchronous and blocking were synonymous.

2.12 MPI Definitions of Blocking and Non-Blocking

• Locally Blocking - return after their local actions complete, though the message transfer may not have been completed.

• Non-blocking - return immediately.

It is assumed that the data storage used for the transfer is not modified by subsequent statements before the transfer completes, and it is left to the programmer to ensure this. These terms may have different interpretations in other systems.

2.13 How message-passing routines can return before the message transfer has completed

A message buffer is needed between source and destination to hold the message:

[Figure: Process 1 calls send(), the message is copied into a message buffer, and Process 1 continues; later, Process 2 calls recv() and reads the message from the buffer.]

2.14 Message Buffer

• For a receive routine, the message has to have been received if we want the message.

• If recv() is reached before send(), the message buffer will be empty and recv() waits for the message.

• For a send routine, once the local actions have been completed and the message is safely on its way, the process can continue with subsequent work.

• In this way, using such send routines can decrease the overall execution time.

• In practice, buffers can only be of finite length and a point could be reached when the send routine is held up because all the available buffer space has been exhausted.

• It may be necessary to know at some point if the message has actually been received, which will require additional message passing.


2.15 Message Tag

• Used to differentiate between different types of messages being sent.

• Message tag is carried within message.

• If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().
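Looking ahead to MPI (introduced later in these slides), an illustrative fragment of a wildcard-tag receive; the status object reports which tag actually arrived:

int y;
MPI_Status status;

/* match a message from process 1 regardless of its tag */
MPI_Recv(&y, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("Received value %d with tag %d\n", y, status.MPI_TAG);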

2.16 Message Tag Example

To send a message, x, with message tag 5 from source process 1 to destination process 2, and assign it to y:

Process 1:  send(&x, 2, 5);    /* send x to process 2 with tag 5 */
Process 2:  recv(&y, 1, 5);    /* wait for a message from process 1 with a tag of 5 */

[Figure: data moves from x in Process 1 to y in Process 2.]

2.17 "Group" message passing routines

Routines that send message(s) to a group of processes or receive message(s) from a group of processes.

They offer higher efficiency than separate point-to-point routines, although they are not absolutely necessary.

2.18 Broadcast

Broadcast: sending the same message to all processes concerned with the problem.
Multicast: sending the same message to a defined group of processes.

[Figure: each of Process 0, Process 1, ..., Process p - 1 executes bcast(); the data in the root's buffer buf is copied to the data buffers of all processes.]

SPMD (MPI) form. The broadcast action does not occur until all the processes have executed their broadcast routine, and the broadcast operation will have the effect of synchronizing the processes.

2.19 Scatter

Sending each element of an array in the root process to a separate process. The contents of the ith location of the array are sent to the ith process. More than one element can be sent to each process.

[Figure: each process executes scatter(); successive portions of the root's buffer buf are distributed to the data buffers of Process 0, Process 1, ..., Process p - 1.]

SPMD (MPI) form.

2.20 Gather

Having one process collect individual values from a set of processes.

[Figure: each process executes gather(); the data buffers of Process 0, Process 1, ..., Process p - 1 are collected into the root's buffer buf.]

SPMD (MPI) form.

2.21 Reduce

A gather operation combined with a specified arithmetic/logical operation to produce a single value. Example: values could be gathered and then added together by the root.

[Figure: each process executes reduce(); the data values from Process 0, Process 1, ..., Process p - 1 are combined (here with +) into the root's buffer buf.]

SPMD (MPI) form.

2.22 Software Tools for Clusters

Late 1980s: Parallel Virtual Machine (PVM) developed. Became very popular.

Mid 1990s: Message-Passing Interface (MPI) standard defined.

Both are based upon the message-passing parallel programming model.

Both provide a set of user-level libraries for message passing, for use with sequential programming languages (C, C++, ...).

2.23 PVM (Parallel Virtual Machine)

Perhaps the first widely adopted attempt at using a workstation cluster as a multicomputer platform, developed by Oak Ridge National Laboratory. Available at no charge.

Programmer decomposes problem into separate programs (usually master and group of identical slave programs).

Programs compiled to execute on specific types of computers.

The set of computers used on a problem must be defined prior to executing the programs (in a hostfile).

2.24

Message routing between computers is done by PVM daemon processes installed by PVM on the computers that form the virtual machine.

[Figure: three workstations, each running a PVM daemon and one or more application programs (executables); messages between application programs are sent through the network via the daemons.]

The MPI implementation we use is similar. There can be more than one process running on each computer.

2.25 MPI (Message Passing Interface)

• Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability.

• Defines routines, not implementation.

• Several free implementations exist.

2.26 MPI Process Creation and Execution

• Purposely not defined; will depend upon the implementation.

• Only static process creation supported in MPI version 1. All processes must be defined prior to execution and started together.

• Originally SPMD model of computation.

• MPMD also possible with static creation.

2.27 Using the SPMD Computational Model

main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);
   .
   .
   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find process rank */
   if (myrank == 0)
      master();
   else
      slave();
   .
   .
   MPI_Finalize();
}

where master() and slave() are to be executed by master process and slave process, respectively.

2.28 Communicators

• Defines the scope of a communication operation.

• Processes have ranks associated with communicator.

• Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes.

• Other communicators can be established for groups of processes.

• Two types of communicator:
– Intracommunicators, for communication within a defined group
– Intercommunicators, for communication between defined groups

2.29 Reasoning for Communicators

• Provides a solution to unsafe message passing.
– Message tags alone are not sufficient.

• Enables basic error checking of message passing code by allowing programmer to define communication domains.

– Messages cannot be sent to destinations outside defined communication domain

2.30 Unsafe message passing: Example

[Figure (a), intended behavior: in each process, a call to a library routine lib() sits between the user's own send(...,1,...) in Process 0 and recv(...,0,...) in Process 1; the user's send matches the user's recv, and the library's internal sends and receives match each other.]

[Figure (b), possible behavior: without separate communication domains, the user's send(...,1,...) in Process 0 can instead be matched by a receive inside lib() in Process 1, so a message intended for the application is consumed by the library.]
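A common remedy, sketched here as an illustration (not taken from the slides), is for a library routine to duplicate the communicator it is given, so that its internal messages can never match the caller's:

/* Illustrative library routine with its own private communication domain */
void lib(MPI_Comm user_comm)
{
   MPI_Comm private_comm;

   MPI_Comm_dup(user_comm, &private_comm);   /* collective call over user_comm */
   /* ... all internal MPI_Send()/MPI_Recv() calls use private_comm ... */
   MPI_Comm_free(&private_comm);
}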

2.31 MPI Blocking Routines

• Return when "locally complete": when the location used to hold the message can be used again or altered without affecting the message being sent.

• A blocking send will send the message and return. This does not mean that the message has been received, just that the process is free to move on without adversely affecting the message.

2.32 Parameters of blocking send

MPI_Send(buf, count, datatype, dest, tag, comm)

buf       address of send buffer
count     number of items to send
datatype  datatype of each item
dest      rank of destination process
tag       message tag
comm      communicator

2.33 Parameters of blocking receive

MPI_Recv(buf, count, datatype, src, tag, comm, status)

buf       address of receive buffer
count     maximum number of items to receive
datatype  datatype of each item
src       rank of source process
tag       message tag
comm      communicator
status    status after operation

2.34 Example

To send an integer x from process 0 to process 1:

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
   int x;
   MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
   int x;
   MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

2.35 MPI Nonblocking Routines

• Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered.

• Nonblocking receive - MPI_Irecv() - will return even if no message to accept.

2.36 Nonblocking Routine Formats

MPI_Isend(buf,count,datatype,dest,tag,comm,request)

MPI_Irecv(buf,count,datatype,source,tag,comm, request)

Completion is detected by MPI_Wait() and MPI_Test(), which identify the operation through the request parameter.

MPI_Wait() waits until the operation has completed and then returns.

MPI_Test() returns immediately, with a flag set indicating whether the operation had completed at that time.
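A minimal sketch (assumed, not from the slides) of overlapping computation with a nonblocking send: do_some_work() is a hypothetical helper, and MPI_Test() is polled until the operation completes:

int x = 42, flag = 0;
MPI_Request req;
MPI_Status status;

MPI_Isend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
while (!flag) {
   do_some_work();                   /* hypothetical: useful computation */
   MPI_Test(&req, &flag, &status);   /* has the send completed yet? */
}
/* x may now be safely reused */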

2.37 Example

To send an integer x from process 0 to process 1 and allow process 0 to continue,

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */

if (myrank == 0) {
   int x;
   MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
   compute();
   MPI_Wait(&req1, &status);
} else if (myrank == 1) {
   int x;
   MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

2.38 Send Communication Modes

• Standard Mode Send - Not assumed that corresponding receive routine has started. Amount of buffering not defined by MPI. If buffering provided, send could complete before receive reached.

• Buffered Mode - Send may start and return before a matching receive. Necessary to specify buffer space via routine MPI_Buffer_attach().

• Synchronous Mode - Send and receive can start before each other but can only complete together.

• Ready Mode - Send can only start if matching receive already reached, otherwise error. Use with care.
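An illustrative buffered-mode fragment (buffer size chosen for a single int; not from the slides), showing the MPI_Buffer_attach() requirement mentioned above:

int x = 42;
int bufsize = MPI_BSEND_OVERHEAD + sizeof(int);
char *buffer = (char *)malloc(bufsize);

MPI_Buffer_attach(buffer, bufsize);
MPI_Bsend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* returns once message is copied to the buffer */
MPI_Buffer_detach(&buffer, &bufsize);              /* blocks until buffered messages are delivered */
free(buffer);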

2.39 Parameters of synchronous send (same as blocking send)

MPI_Ssend(buf, count, datatype, dest, tag, comm)

buf       address of send buffer
count     number of items to send
datatype  datatype of each item
dest      rank of destination process
tag       message tag
comm      communicator

2.40 Collective Communication

Involves a set of processes, defined by an intracommunicator. Message tags are not present. Principal collective operations:

• MPI_Bcast() - broadcast from root to all other processes
• MPI_Gather() - gather values from a group of processes
• MPI_Scatter() - scatter a buffer in parts to a group of processes
• MPI_Alltoall() - send data from all processes to all processes
• MPI_Reduce() - combine values on all processes to a single value
• MPI_Reduce_scatter() - combine values and scatter the results
• MPI_Scan() - compute prefix reductions of data on processes
• MPI_Barrier() - a means of synchronizing processes by stopping each one until they all have reached a specific "barrier" call
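Most of these routines appear on later slides; MPI_Scan() does not, so here is an illustrative prefix-sum sketch (values chosen arbitrarily):

int myrank, value, prefix;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
value = myrank + 1;
MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* with 4 processes: prefix is 1, 3, 6, 10 on ranks 0, 1, 2, 3 */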

2.41 Barrier: blocks the calling process until all processes have called it

MPI_Barrier(comm)

comm      communicator

2.42 Broadcast a message from the root process to all processes in comm, including itself

MPI_Bcast(*buf, count, datatype, root, comm)

Parameters:
*buf      message buffer (loaded)
count     number of entries in buffer
datatype  data type of buffer
root      rank of root
comm      communicator
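A minimal usage sketch (illustrative; array size arbitrary): the root loads the buffer, every process calls MPI_Bcast(), and afterwards all processes hold the same data:

int myrank, data[4];

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {                 /* root fills the buffer */
   data[0] = 10; data[1] = 20; data[2] = 30; data[3] = 40;
}
MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);   /* called by every process */
/* data[] is now identical on all processes */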

2.43 Gather values from a group of processes

MPI_Gather(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)

Parameters:
*sendbuf   send buffer
sendcount  number of send buffer elements
sendtype   data type of send elements
*recvbuf   receive buffer (loaded)
recvcount  number of elements received from each process
recvtype   data type of receive elements
root       rank of receiving process
comm       communicator

2.44 Example

To gather 10 items from each process of a group into process 0, using dynamically allocated memory in the root process:

int data[10];            /* data to be gathered from each process */
int *buf, grp_size;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);              /* find rank */
if (myrank == 0) {
   MPI_Comm_size(MPI_COMM_WORLD, &grp_size);         /* find group size */
   buf = (int *)malloc(grp_size*10*sizeof(int));     /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

Note that recvcount is the number of items received from each process (10 here), not the total.

MPI_Gather() gathers from all processes, including root.

2.45 Scatter a buffer from the root, in parts, to a group of processes

MPI_Scatter(*sendbuf, sendcount, sendtype, *recvbuf, recvcount, recvtype, root, comm)

Parameters:
*sendbuf   send buffer
sendcount  number of elements sent to each process
sendtype   data type of elements
*recvbuf   receive buffer (loaded)
recvcount  number of receive buffer elements
recvtype   data type of receive elements
root       root process rank
comm       communicator
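A minimal usage sketch (illustrative; assumes each process is to receive 10 integers and that only the root's send buffer matters):

int myrank, numprocs, i;
int recvbuf[10];
int *sendbuf = NULL;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
if (myrank == 0) {                                   /* only root needs the send buffer */
   sendbuf = (int *)malloc(numprocs * 10 * sizeof(int));
   for (i = 0; i < numprocs * 10; i++) sendbuf[i] = i;
}
MPI_Scatter(sendbuf, 10, MPI_INT, recvbuf, 10, MPI_INT, 0, MPI_COMM_WORLD);
/* process k now holds elements 10*k .. 10*k + 9 in recvbuf */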

2.46 Combine values on all processes to a single value

MPI_Reduce(*sendbuf, *recvbuf, count, datatype, op, root, comm)

Parameters:
*sendbuf   send buffer address
*recvbuf   receive buffer address
count      number of send buffer elements
datatype   data type of send elements
op         reduce operation; several operations, including
           MPI_MAX   maximum
           MPI_MIN   minimum
           MPI_SUM   sum
           MPI_PROD  product
root       root process rank for result
comm       communicator

2.47 Hello World

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int size, rank;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   printf("Hello MPI! Process %d of %d\n", rank, size);
   MPI_Finalize();
}

2.48 Sample MPI program

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
   int myid, numprocs;
   int data[MAXSIZE], i, x, low, high, myresult = 0, result;
   char fn[255];
   FILE *fp;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   if (myid == 0) {                       /* open input file and initialize data */
      strcpy(fn, getenv("HOME"));
      strcat(fn, "/MPI/rand_data.txt");
      if ((fp = fopen(fn, "r")) == NULL) {
         printf("Can't open the input file: %s\n\n", fn);
         exit(1);
      }
      for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
   }

   MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */

   x = MAXSIZE/numprocs;                  /* add my portion of the data */
   low = myid * x;
   high = low + x;
   for (i = low; i < high; i++)
      myresult += data[i];
   printf("I got %d from %d\n", myresult, myid);

   /* compute global sum */
   MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
   if (myid == 0) printf("The sum is %d.\n", result);

   MPI_Finalize();
}

2.49 MPI Groups, Communicators

• Collective communication may need to be performed by subsets of the processes in the computation.

• MPI provides routines for:
– defining new process groups from subsets of existing process groups such as MPI_COMM_WORLD
– creating a new communicator for a new process group
– performing collective communication within that process group

2.50 Groups and communicators

[Figure: groups and communicators.]

2.51 Facts about groups and communicators

• Group:
  – ordered set of processes
  – each process in a group has a unique integer id called its rank within that group
  – a process can belong to more than one group
    • rank is always relative to a group
  – groups are "opaque objects"
    • use only MPI-provided routines for manipulating groups

• Communicators:
  – all communication must specify a communicator
  – from the programming viewpoint, groups and communicators are equivalent
  – communicators are also "opaque objects"

• Groups and communicators are dynamic objects and can be created and destroyed during the execution of the program.

2.52 Typical usage

1. Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group

2. Form new group as a subset of global group using MPI_Group_incl or MPI_Group_excl

3. Create new communicator for new group using MPI_Comm_create

4. Determine new rank in new communicator using MPI_Comm_rank

5. Conduct communications using any MPI message passing routine

6. When finished, free up new communicator and group (optional) using MPI_Comm_free and MPI_Group_free

2.53

main(int argc, char **argv)
{
   int me, count, count2;
   void *send_buf, *recv_buf, *send_buf2, *recv_buf2;
   MPI_Group MPI_GROUP_WORLD, grprem;
   MPI_Comm commslave;
   static int ranks[] = {0};

   MPI_Init(&argc, &argv);
   MPI_Comm_group(MPI_COMM_WORLD, &MPI_GROUP_WORLD);
   MPI_Comm_rank(MPI_COMM_WORLD, &me);

   MPI_Group_excl(MPI_GROUP_WORLD, 1, ranks, &grprem);
   MPI_Comm_create(MPI_COMM_WORLD, grprem, &commslave);

   if (me != 0) {
      /* compute on slaves */
      MPI_Reduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, 1, commslave);
   }
   /* zero falls through immediately to this reduce, others do it later... */
   MPI_Reduce(send_buf2, recv_buf2, count2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

   MPI_Comm_free(&commslave);
   MPI_Group_free(&MPI_GROUP_WORLD);
   MPI_Group_free(&grprem);
   MPI_Finalize();
}

2.54

#include "mpi.h"
#include <stdio.h>
#define NPROCS 8

int main(int argc, char *argv[])
{
   int rank, new_rank, sendbuf, recvbuf, numtasks;
   int ranks1[4] = {0,1,2,3}, ranks2[4] = {4,5,6,7};
   MPI_Group orig_group, new_group;
   MPI_Comm new_comm;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   sendbuf = rank;

   /* Extract the original group handle */
   MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

   /* Divide tasks into two distinct groups based upon rank */
   if (rank < NPROCS/2) {
      MPI_Group_incl(orig_group, NPROCS/2, ranks1, &new_group);
   } else {
      MPI_Group_incl(orig_group, NPROCS/2, ranks2, &new_group);
   }

   /* Create new communicator and then perform collective communications */
   MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
   MPI_Allreduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);

   MPI_Group_rank(new_group, &new_rank);
   printf("rank= %d newrank= %d recvbuf= %d\n", rank, new_rank, recvbuf);

   MPI_Finalize();
}

Sample output:

rank= 7 newrank= 3 recvbuf= 22
rank= 0 newrank= 0 recvbuf= 6
rank= 1 newrank= 1 recvbuf= 6
rank= 2 newrank= 2 recvbuf= 6
rank= 6 newrank= 2 recvbuf= 22
rank= 3 newrank= 3 recvbuf= 6
rank= 4 newrank= 0 recvbuf= 22
rank= 5 newrank= 1 recvbuf= 22

2.55 Evaluating Programs Empirically: Measuring Execution Time

To measure the execution time between point L1 and point L2 in the code, we might have a construction such as:

L1:  time(&t1);   /* start timer */
     .
     .
L2:  time(&t2);   /* stop timer */

     elapsed_time = difftime(t2, t1);   /* elapsed time = t2 - t1 */
     printf("Elapsed time = %5.2f seconds", elapsed_time);

2.56

MPI provides the routine MPI_Wtime() for returning time (in seconds):

double start_time, end_time, exe_time;

start_time = MPI_Wtime();
   .
   .
end_time = MPI_Wtime();
exe_time = end_time - start_time;

2.57 Average/least/most execution time spent by individual processes

int myrank, numprocs;
double mytime, maxtime, mintime, avgtime;   /* variables used for gathering timing statistics */

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

MPI_Barrier(MPI_COMM_WORLD);      /* synchronize all processes */
mytime = MPI_Wtime();             /* get time just before work section */
work();
mytime = MPI_Wtime() - mytime;    /* get time just after work section */

/* compute max, min, and average timing statistics */
MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &mintime, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
MPI_Reduce(&mytime, &avgtime, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

if (myrank == 0) {
   avgtime /= numprocs;
   printf("Min: %lf Max: %lf Avg: %lf\n", mintime, maxtime, avgtime);
}

2.58 Compiling/Executing MPI Programs

• Minor differences in the command lines required depending upon MPI implementation.

• For the assignments, we will use MPICH-2.

• Generally, a file needs to be present that lists all the computers to be used. MPI then uses the computers listed.

2.59 Set MPICH2 Path

Add the following at the end of your ~/.bashrc file and source it with the command source ~/.bashrc (or log in again):

#----------------------------------------------------------------------
# MPICH2 setup
export PATH=/opt/MPICH2/bin:$PATH
export MANPATH=/opt/MPICH2/bin:$MANPATH
#----------------------------------------------------------------------

Some logging and visualization help: you can link with the libraries -llmpe -lmpe to enable logging and the MPE environment. Then run the program as usual and a log file will be produced. The log file can be visualized using the jumpshot program that comes bundled with MPICH2.

2.60 Defining the Computers to Use

Generally, you need to create a file containing the list of machines to be used.

Sample machines file (or hostfile):

athena.cs.siu.edu
oscarnode1.cs.siu.edu
..........
oscarnode8.cs.siu.edu

In MPICH, if just using one computer, do not need this file.

2.61 MPICH Commands

Two basic commands:

• mpicc, a script to compile MPI programs

• mpiexec (or mpirun), the command to execute an MPI program.

2.62 Compiling/executing an (SPMD) MPI program

For MPICH. At a command line:

To start MPI: Nothing special.

To compile MPI programs:

for C mpicc -o prog prog.c

for C++ mpiCC -o prog prog.cpp

To execute MPI program:

mpiexec -n no_procs prog

where no_procs is a positive integer (the number of processes).

2.63 Executing an MPICH program on multiple computers

Create a file called, say, "machines" containing the list of machines:

athena.cs.siu.edu
oscarnode1.cs.siu.edu
..........
oscarnode8.cs.siu.edu

Establish the network environment:

mpdboot -n 9 -f machines
mpdtrace
mpdallexit

2.64

mpirun -machinefile machines -np 4 prog

would run prog with four processes.

Each process would execute on one of the machines in the list. MPI would cycle through the list of machines, giving processes to machines.

You can also specify the number of processes on a particular machine by adding that number after the machine name.

The "MPI standard" command mpiexec is now the replacement for mpirun, although mpirun still exists.

2.65 Reference Tutorial Materials

• http://www-unix.mcs.anl.gov/mpi/tutorial/index.html
• http://www.cs.utexas.edu/users/pingali/CS378/2008sp/lectureschedule.html