
Page 1:

QosCosGrid - Barcelona, 25 October 2006

MPICH-V Project

http://www.mpich-v.net

Pierre Lemarinier, pierre.lemarinier@lri.fr
Laboratoire de Recherche en Informatique, University Paris South / INRIA Futurs

Message Passing Interface

Standard, Description, Performance

Page 2:

Contents
- Introduction to MPI: message passing, different types of communication, MPI functionalities
- MPI structures: basic functions, data types, contexts and tags, groups and communication domains
- Communication functions: point-to-point communications, asynchronous communications, global communications
- MPI-2: one-sided communications, I/O, dynamicity

Page 3:

Message passing (1)
Problem:
- We have N nodes, all connected by a network
- How do we use the "global computer" formed by gathering the N nodes?

[Figure: N nodes, each with its own CPU and RAM, connected by a network]

Page 4:

Message passing (2)
One answer: message passing
- Execute one process per processor
- Exchange data explicitly between processors
- Synchronize the different processes explicitly

Two types of data transfer:
- Only one process initiates the communication: 'one-sided'
- The two processes cooperate in the communication: 'cooperative'

Page 5:

Two types of data transfer

'One-sided' communications:
- No rendez-vous protocol
- No warning about reading or writing actions inside a process's local memory
- Costly synchronization
- Function prototypes: put(remote_process, data), get(remote_process, data)

Cooperative communications:
- The communication involves the two processes
- Implicit synchronization in the simple case
- Function prototypes: send(destination, data), recv(source, data)

[Figure: one CPU issuing put()/get() on a remote CPU's memory, versus a matched send()/recv() pair between two CPUs]

Page 6:

MPI (Message Passing Interface)
- Standard developed by academic and industrial partners
- Objective: to specify a portable message passing library
- Implies an execution environment for launching and connecting all the processes together
- Allows: synchronous and asynchronous communications, global communications, separated communication domains

Page 7:

Contents
- Introduction to MPI: message passing, different types of communication, MPI functionalities
- MPI structures: basic functions (example: HelloWorld_MPI.c), data types, contexts and tags, groups and communication domains
- Communication functions: point-to-point communications, asynchronous communications, global communications
- MPI-2: one-sided communications, I/O, dynamicity

Page 8:

MPI Programming Structure
Follows the SPMD programming model:
- All processes are launched at the same time
- The same program runs on every processor
- Processor roles are differentiated by a rank number

Program layout:
- Sequential section (non-parallel)
- Parallel section initialization: MPI initialization
- Multinode parallel section (MPI): parallel initialization, computation, communications, synchronization
- Parallel section termination: end of parallel section
- Sequential section (non-parallel). Remark: most implementations advise limiting this final part to the exit call

Page 9:

Basic functions
MPI environment initialization:
- C: MPI_Init(&argc, &argv);
- Fortran: call MPI_Init(ierror)

MPI environment termination (programs are recommended to exit right after this call):
- C: MPI_Finalize();
- Fortran: call MPI_Finalize(ierror)

Getting the process rank:
- C: MPI_Comm_rank(MPI_COMM_WORLD, &rank);
- Fortran: call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)

Getting the total number of processes:
- C: MPI_Comm_size(MPI_COMM_WORLD, &size);
- Fortran: call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)

Page 10:

HelloWorld_MPI.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rang, nprocs;    /* rang: this process's rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rang);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("hello, I am %d (of %d processes)\n", rang, nprocs);
    MPI_Finalize();
    return 0;
}

Page 11:

MPI data types

MPI type             | C type
---------------------+--------------------
MPI_CHAR             | signed char
MPI_SHORT            | signed short int
MPI_INT              | signed int
MPI_LONG             | signed long int
MPI_UNSIGNED_CHAR    | unsigned char
MPI_UNSIGNED_SHORT   | unsigned short int
MPI_UNSIGNED         | unsigned int
MPI_UNSIGNED_LONG    | unsigned long int
MPI_FLOAT            | float
MPI_DOUBLE           | double
MPI_LONG_DOUBLE      | long double
MPI_BYTE             | (no equivalent)
MPI_PACKED           | (no equivalent)

MPI type             | Fortran type
---------------------+--------------------
MPI_INTEGER          | INTEGER
MPI_REAL             | REAL
MPI_DOUBLE_PRECISION | DOUBLE PRECISION
MPI_COMPLEX          | COMPLEX
MPI_LOGICAL          | LOGICAL
MPI_CHARACTER        | CHARACTER(1)
MPI_BYTE             | (no equivalent)
MPI_PACKED           | (no equivalent)

Page 12:

User data types
- By default, MPI exchanges data as vectors of a basic MPI data type
- It is possible to create user data types to simplify communication operations (avoiding manual buffering and linearization)
- User data types replace the obsolete MPI_PACK mechanism
- A user type consists of a sequence of basic types and a sequence of offsets describing its memory layout
- Commit before use: MPI_Type_commit(&type); destruction: MPI_Type_free(&type);
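For illustration, a minimal sketch (not from the original slides) using one of the type constructors, MPI_Type_contiguous; others such as MPI_Type_vector or MPI_Type_create_struct handle strided and heterogeneous layouts.

    #include <mpi.h>

    /* Build a derived type covering 4 contiguous doubles, commit it,
       use it as a single send unit, then free it. */
    void send_block(double block[4], int dest) {
        MPI_Datatype block_t;
        MPI_Type_contiguous(4, MPI_DOUBLE, &block_t); /* 4 doubles back to back */
        MPI_Type_commit(&block_t);                    /* required before use    */
        MPI_Send(block, 1, block_t, dest, 0, MPI_COMM_WORLD);
        MPI_Type_free(&block_t);                      /* release the type       */
    }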

Page 13:

Contexts and tags
- Messages need to be distinguished at reception
- The context distinguishes a point-to-point communication from a global communication
- Every message is sent within a context and must be received in the same context
- Contexts are managed automatically by MPI
- Communication tags identify one communication among several
- When communications are made asynchronously, tags allow sorting them
- For reception, the next message regardless of its tag can be received by specifying MPI_ANY_TAG
- Tag management is up to the MPI programmer

Page 14:

Communication domains
- Processes can be grouped in a communication domain called a communicator
- Every process has a rank number in each group it belongs to
- MPI_COMM_WORLD is the default communication domain; it gathers all processes and is created at initialization
- More generally, every operation applies to a single set of processes specified by its communicator
- Each domain constitutes a distinct context for communications

Page 15:

Split a communicator (1/2): groups
To create a new domain, first create a new group of processes:
- int MPI_Comm_group(MPI_Comm comm, MPI_Group *group);
- int MPI_Group_incl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup);
- int MPI_Group_excl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup);

Set operations on groups:
- int MPI_Group_union(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
- int MPI_Group_intersection(MPI_Group g1, MPI_Group g2, MPI_Group *gr);
- int MPI_Group_difference(MPI_Group g1, MPI_Group g2, MPI_Group *gr);

Destruction of a group:
- int MPI_Group_free(MPI_Group *group);
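For instance, a minimal sketch (not from the original slides) that derives, from MPI_COMM_WORLD, a group containing only ranks 0 and 1:

    #include <mpi.h>

    /* Derive a subgroup containing ranks 0 and 1 of MPI_COMM_WORLD. */
    void make_pair_group(MPI_Group *pair) {
        MPI_Group world;
        int ranks[2] = {0, 1};
        MPI_Comm_group(MPI_COMM_WORLD, &world);  /* group behind the communicator */
        MPI_Group_incl(world, 2, ranks, pair);   /* keep the listed ranks only    */
        MPI_Group_free(&world);                  /* the subgroup remains valid    */
    }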

Page 16:

Split a communicator (2/2): communicators
Associating a communicator to a group:
- int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm);

Dividing a domain into sub-domains:
- int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm);
- MPI_Comm_split is a collective operation on the initial communicator comm
- Every process gives its color; all processes of the same color end up in the same newcomm
- The MPI_UNDEFINED color allows a process not to be part of any new communicator
- Every process gives its key; processes of the same color are ranked by their keys
- A group is implicitly created for each new communicator created this way

Communicator destruction:
- int MPI_Comm_free(MPI_Comm *comm);
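A common use of MPI_Comm_split, shown here as a minimal sketch (not from the original slides), is separating even and odd ranks into two sub-communicators:

    #include <mpi.h>

    /* Split MPI_COMM_WORLD by rank parity; keep world rank order in each half. */
    void split_by_parity(MPI_Comm *half) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* color selects the sub-domain, key orders the ranks within it */
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, half);
    }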

Page 17:

Contents
- Introduction to MPI: message passing, different types of communication, MPI functionalities
- MPI structures: basic functions, data types, contexts and tags, groups and communication domains
- Communication functions: point-to-point communications (example: Jeton.c), asynchronous communications, global communications (example: trace.c)
- MPI-2: one-sided communications, I/O, dynamicity

Page 18:

Point-to-point communications
- Send and receive data between a pair of processes
- Both processes take part in the communication: one sends the data, the other posts the reception
- Communications are identified by tags
- The type and the size of the data must be specified

Page 19:

Basic communication functions
Synchronous send (synchronizing the computation with the action of sending):
- int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
- The tag allows unique identification of messages

Synchronous data reception:
- int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
- The tag must match the tag that was sent
- MPI_ANY_SOURCE can be specified to receive from any process

Page 20:

Jeton.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int me, prec, suiv, np;   /* prec: previous rank, suiv: next rank */
    int jeton = 0;            /* jeton: the token passed around the ring */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (me == 0)
        prec = np - 1;
    else
        prec = me - 1;
    if (me == np - 1)
        suiv = 0;
    else
        suiv = me + 1;

    if (me == 0)   /* process 0 injects the token */
        MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);

    while (1) {    /* the token circulates around the ring forever */
        MPI_Recv(&jeton, 1, MPI_INT, prec, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();  /* never reached in this example */
    return 0;
}

[Figure: the token circulating around a ring of processes 0, 1, 2, ..., np-1]

Page 21:

Synchronism and asynchronism (1)
To solve some deadlocks, and to allow overlapping communications with computation, one can use non-blocking functions.

In this case, the communication scheme is the following:
- Initialization of the non-blocking communication (by one or both of the processes)
- The matching communication (blocking or non-blocking) is called by the other process
- ... computation ...
- Termination of the communication (a blocking operation until the communication is complete)

Page 22:

Synchronism and asynchronism (2)
Non-blocking functions:
- int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request);
- int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request);

The request field is used to track the state of a non-blocking communication. To wait for its termination, one can call:
- int MPI_Wait(MPI_Request *request, MPI_Status *status);
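As an illustration (not from the original slides), a minimal sketch of the overlap pattern: post both non-blocking calls, compute, then wait for completion. MPI_Waitall is the multi-request variant of MPI_Wait.

    #include <mpi.h>

    /* Exchange one int with a partner while doing useful work in between. */
    void exchange_overlap(int partner, int *sendval, int *recvval) {
        MPI_Request reqs[2];

        MPI_Irecv(recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... computation overlapping the transfer goes here ... */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* block until both complete */
    }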

Page 23:

Synchronism and asynchronism (3)
Data can be exchanged by blocking or non-blocking functions. Several functions manage how the send and the receive operations are coupled.

The communication mode is selected by a prefix (MPI_[*]send):
- Synchronous send (S): finishes when the corresponding receive is posted (hard-coupled to the reception, no buffering)
- Buffered send (B): a buffer is created; the send ends when the user buffer has been copied to the system buffer (not coupled to the reception)
- Standard send (no prefix): the send ends when the emission buffer is empty (the MPI implementation decides between buffering and coupling to the reception)
- Ready send (R): the user guarantees that the reception request is already posted when calling this function (coupled to the reception, no buffering)
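As a sketch (not from the original slides) of the buffered mode: MPI_Bsend requires the user to attach buffer space first, and returns as soon as the message has been copied there.

    #include <stdlib.h>
    #include <mpi.h>

    /* Buffered send of one int: MPI copies the message into attached space. */
    void bsend_example(int value, int dest) {
        int size;
        void *buf;

        MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &size);  /* payload size       */
        size += MPI_BSEND_OVERHEAD;                        /* per-message margin */
        buf = malloc(size);
        MPI_Buffer_attach(buf, size);

        MPI_Bsend(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);

        MPI_Buffer_detach(&buf, &size);  /* blocks until buffered data is sent */
        free(buf);
    }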

Page 24:

Collective or global operations
To simplify communication operations involving multiple processes, one can use collective operations on a communicator.

Typical operations:
- Reductions
- Data exchange: broadcast, scatter, gather, all-to-all
- Explicit synchronization

Page 25:

Reductions (1)
A reduction is an arithmetic operation performed on data distributed over a set of processes.

Prototypes:
- C: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator);
- Fortran: MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, communicator, ierror)

With MPI_Reduce(), only the root process gets the result.
With MPI_Allreduce(), all processes get the result.

Page 26:

Reductions (2)
Available operations:

MPI_Op     | Operation
-----------+-------------------------------
MPI_MIN    | Minimum
MPI_MAX    | Maximum
MPI_SUM    | Sum
MPI_PROD   | Element-by-element product
MPI_LAND   | Logical and
MPI_BAND   | Bitwise and
MPI_LOR    | Logical or
MPI_BOR    | Bitwise or
MPI_LXOR   | Logical exclusive or
MPI_BXOR   | Bitwise exclusive or
MPI_MINLOC | Minimum and its location
MPI_MAXLOC | Maximum and its location

Page 27:

Broadcast
A broadcast operation distributes the same data to all processes.

One-to-all communication, from a specified 'root' process to all processes of a communicator.

Prototypes:
- C: int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
- Fortran: MPI_Bcast(buffer, count, datatype, root, communicator, ierror)

[Figure: the buffer of root = 1 is copied to processes 0, 1, 2, 3, ..., np-1]
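A minimal usage sketch (not from the original slides): every process calls MPI_Bcast with the same root, and afterwards all processes hold the root's data.

    #include <mpi.h>

    /* Rank 0 fills the array; after MPI_Bcast every rank holds the same values. */
    void broadcast_params(double params[3]) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            params[0] = 1.0; params[1] = 2.0; params[2] = 3.0;
        }
        MPI_Bcast(params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }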

Page 28:

Scatter
One-to-all operation; different data are sent to each receiver process according to its rank.

Prototypes:
- C: int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator);
- Fortran: MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)

The 'send' parameters are used only by the sender process.

[Figure: the sendbuf of root = 2 is split into np slices, one delivered to each process's recvbuf]

Page 29:

Gather
All-to-one operation; different data are received by a single receiver process.

Prototypes:
- C: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator);
- Fortran: MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)

The 'receive' parameters are used only by the receiver process.

[Figure: each process's sendbuf is collected, in rank order, into the recvbuf of root = 3]

Page 30:

All-to-All
All-to-all operation; each process sends different data to every process, according to their ranks. There is no root: every process both sends and receives.

Prototypes:
- C: int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm communicator);
- Fortran: MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, communicator, ierror)

[Figure: slice j of process i's sendbuf ends up as slice i of process j's recvbuf, for all processes 0, 1, 2, 3, ..., np-1]

Page 31:

Explicit synchronization
Synchronization barrier: all processes of a communicator wait for the last process to enter the barrier before continuing their execution.

On computers with a hardware barrier available (such as SGI machines or the Cray T3E), the MPI barrier is slower than the hardware barrier.

Prototypes:
- C: int MPI_Barrier(MPI_Comm communicator);
- Fortran: MPI_Barrier(communicator, ierror)

Page 32:

Matrix trace (1)
Computing the trace of an n x n matrix A.

The trace of a square matrix is the sum of its diagonal elements. One can easily see that the partial sums can be computed on multiple processes, ending with a reduction to obtain the complete trace:

$\mathrm{Trace}(A) = \sum_{k=1}^{n} A_{k,k}$

Page 33:

Matrix trace (2.1)

#include <stdio.h>
#include <mpi.h>

#define N 1024                 /* matrix size; suppose N = m * np */

int main(int argc, char **argv) {
    int me, np, root = 0;
    int i, tranche;            /* tranche: slice size per process */
    double A[N][N];
    double buffer[N], diag[N];
    double traceA, trace_loc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    tranche = N / np;

    /* Initialization of A, done by process 0 */
    /* ... */

    /* Buffer the diagonal elements on the root process */
    if (me == root) {
        for (i = 0; i < N; i++)
            buffer[i] = A[i][i];
    }

    /* The scatter operation distributes the buffered elements
       among the processes */
    MPI_Scatter(buffer, tranche, MPI_DOUBLE,
                diag, tranche, MPI_DOUBLE,
                root, MPI_COMM_WORLD);

Page 34:

Matrix trace (2.2)

    /* Each process computes its partial trace */
    trace_loc = 0;
    for (i = 0; i < tranche; i++)
        trace_loc += diag[i];

    /* Then we do the reduction */
    MPI_Reduce(&trace_loc, &traceA, 1, MPI_DOUBLE, MPI_SUM,
               root, MPI_COMM_WORLD);

    if (me == root)
        printf("The trace of A is: %f\n", traceA);

    MPI_Finalize();
    return 0;
}

Page 35:

Contents
- Introduction to MPI: message passing, different types of communication, MPI functionalities
- MPI structures: basic functions, data types, contexts and tags, groups and communication domains
- Communication functions: point-to-point communications, asynchronous communications, global communications
- MPI-2: one-sided communications, I/O, dynamicity

Page 36:

One-sided communications (1/2)
- No synchronization between the two processes during communications
- Allows implementing simulated shared memory (Remote Memory Access)

Defining the part of memory other processes can access:
- MPI_Win_create(), MPI_Win_free()

One-sided communication functions:
- MPI_Put(), MPI_Get(), MPI_Accumulate()
- Accumulate operations: MPI_SUM, MPI_LAND, MPI_REPLACE, ...

Page 37:

One-sided communications (2/2)
Active synchronization: MPI_Win_fence()
- Takes a memory window win as parameter
- Collective operation (barrier) on all processes of the window's group (MPI_Win_get_group(win))
- Acts as a synchronization barrier that completes every RMA transfer using the window win

Passive synchronization: MPI_Win_lock() and MPI_Win_unlock()
- Classical mutex-style functions
- The initiator of the communications is solely responsible for the synchronization
- When MPI_Win_unlock() returns, every transfer operation is finished
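A fence-based sketch (not from the original slides): each process exposes one int in a window and writes its rank into its right neighbour's window.

    #include <mpi.h>

    /* Each rank puts its rank number into its right neighbour's exposed int. */
    void rma_ring_put(void) {
        int rank, np, local = -1;
        MPI_Win win;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        /* Expose one int of local memory to all processes of the communicator */
        MPI_Win_create(&local, (MPI_Aint)sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                       /* open the access epoch */
        MPI_Put(&rank, 1, MPI_INT, (rank + 1) % np,  /* target rank           */
                0, 1, MPI_INT, win);                 /* displacement 0, 1 int */
        MPI_Win_fence(0, win);                       /* all puts are now done */

        /* local now holds the rank of the left neighbour */
        MPI_Win_free(&win);
    }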

Page 38:

Parallel Input/Output
- Intelligent management of I/O is mandatory for parallel applications
- MPI-IO is a set of functions for optimised I/O, extending the classical file access functions:
  - Collective synchronization for file access
  - Shared or individual file offsets
  - Blocking or non-blocking reads
  - Views (for accessing non-sequential memory zones)
  - Syntax similar to the MPI communication functions
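A minimal MPI-IO sketch (not from the original slides): each rank writes its slice of an array at a rank-dependent offset of a shared file.

    #include <mpi.h>

    /* Each rank writes `count` doubles at a rank-dependent offset of one file. */
    void write_slices(const char *path, double *data, int count) {
        int rank;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at(fh, offset, data, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }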

Page 39:

Dynamic allocation of processes
Dynamic change of the number of processes: spawning new processes during execution
- The MPI_Comm_spawn() function creates a new set of processes, possibly on other processors
- An inter-communicator links the parent domain to the new domain gathering the spawned processes
- The MPI_Intercomm_merge() function merges an inter-communicator into a single communicator
- MPI-2 allows a dynamic MPMD style using MPI_Comm_spawn_multiple()
- MPI_Comm_get_attr(MPI_UNIVERSE_SIZE) is used to query the maximum possible number of MPI processes

Process destruction:
- MPI has no explicit exit() function for a process
- For an MPI process to exit, its communicator MPI_COMM_WORLD must contain only finalizing processes
- All inter-communicators must be closed before finalization
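A sketch (not from the original slides) of the spawn-and-merge pattern; the child binary name "worker" is a placeholder.

    #include <mpi.h>

    /* Spawn 4 copies of a child binary and merge parents + children
       into a single intra-communicator. */
    void spawn_workers(MPI_Comm *everyone) {
        MPI_Comm children;   /* inter-communicator: parents <-> children */

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0 /* root */, MPI_COMM_WORLD,
                       &children, MPI_ERRCODES_IGNORE);

        /* high = 0 here; if the children merge with high = 1,
           the parents get the low ranks in the merged communicator */
        MPI_Intercomm_merge(children, 0, everyone);
    }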

Page 40:

Remarks and conclusion
- MPI has become, thanks to the distributed computing community, the standard library for message passing
- MPI-2 breaks the classic SPMD message passing model of MPI-1
- Numerous implementations exist, on most architectures
- Lots of documentation and publications are available

Page 41:

Some pointers
- MPI standard official site: http://www-unix.mcs.anl.gov/mpi/
- The MPI forum: http://www.mpi-forum.org/
- Book: MPI, The Complete Reference (Marc Snir et al.), http://www.netlib.org/utk/papers/mpi-book/mpi-book.html

Page 42:

QosCosGrid - Barcelona, 25 October 2006

MPICH-V Project

http://www.mpich-v.net

Pierre Lemarinier, pierre.lemarinier@lri.fr
Laboratoire de Recherche en Informatique, University Paris South / INRIA Futurs

Message Passing Interface

Standard, Description, Performance

Page 43:

Contents
- MPI implementations
- Performance metrics
- High performance networks
- Communication types / 0-copy

Page 44:

MPI implementations
- LAM-MPI: optimised for collective operations
- MPICH: easy writing of new low-level drivers
- Open MPI: tries to combine the performance and ease of the two prior ones; conforms to MPI-2
- IBM / NEC / Fujitsu ...: complete, high-performance implementations of MPI-2, targeting specific architectures

Page 45:

Performance metrics
Comparison criteria:
- Latency and bandwidth
- Collective operations
- Overlapping capabilities
- Real applications

Measuring tools:
- Round Trip Time (ping-pong)
- NetPIPE
- NAS benchmarks: CG, LU, BT, FT
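A minimal Round Trip Time sketch (not from the original slides): rank 0 bounces a one-byte message off rank 1 and derives the one-way latency from the average round trip.

    #include <stdio.h>
    #include <mpi.h>

    /* Ping-pong between ranks 0 and 1; half the mean round trip
       approximates the one-way latency. */
    void pingpong(int iters) {
        int rank, i;
        char byte = 0;
        double t0, t1;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency: %g us\n",
                   (t1 - t0) / iters / 2 * 1e6);
    }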

Page 46:

High performance networks (1/3): technologies
- Myrinet: connectionless reliable API; registered buffers; fully programmable DMA NIC processor; up to full-duplex 2 Gb/s bandwidth with Myrinet 2000
- SCINet: torus-topology network with static routing; no need to register buffers; very small latency (suitable for RMA); up to 2 Gb/s
- Gigabit Ethernet: no need to register buffers; DMA operations; high latency; up to 1 Gb/s and 10 Gb/s bandwidth
- Infiniband: reliable connection mode and unreliable datagram mode; registered buffers; queued DMA operations; up to 10 Gb/s bandwidth

Page 47:

High performance networks (2/3): technologies
- Myrinet: Socket-GM, MPICH-GM
- SCINet: no functional socket API; SCI-MPICH
- Gigabit Ethernet: has to use the socket interface
- Infiniband: IPoIB; LAM-MPI, MPICH, MPI/Pro, etc.

Page 48:

High performance networks (3/3) Technologies

Page 49:

Eager vs Rendez-vous (1/2)
Eager protocol:
- The message is sent without any control: better latency
- It is copied into a buffer if the receiver has not posted the reception yet: memory-consuming for long messages
- Used only for short messages (typically < 64 KB)

Rendez-vous protocol:
- Sender and receiver are synchronized: higher latency
- 0-copy: better bandwidth and reduced memory consumption

Page 50:

Eager vs Rendez-vous (2/2)

Page 51:

Communication types

Page 52:

High performance networks and 0-copy
- Myrinet latency: 8 µs
- MPICH-GM latency: 33 µs
- MPICH-Vdummy latency: 94 µs

Page 53:

Conclusion
- Many MPI implementations with similar performance
- Multiple measurement criteria and multiple tools: latency and bandwidth; benchmarks and microbenchmarks; real applications
- High performance networks force attention to small performance details: network bandwidth now matches memory bandwidth, latency is smaller than some OS operations, and performance relies on good programming
- Performance results can vary a lot according to the type of communication employed: asynchronism is mandatory, bad programming results in bad performance, and 0-copy can be mandatory