
Page 1: Lecture 4: Part 2: MPI Point-to-Point Communication

Page 2: Realizing Message Passing

Separate the network from the processor; separate user memory from system memory.

[Figure: two nodes, each with user and system memory, a processing element (PE), and a network interface (NI), connected by the network.]

Page 3: Communication Modes for "Send"

Blocking/Non-blocking: timing regarding the use of the user message buffer
Ready: timing regarding the invocation of send and receive
Buffered: user/system buffer allocation

Page 4: Communication Modes for "Send"

Synchronous/Asynchronous: timing regarding the invocation of send and receive plus the execution of the receive operation
Local/Non-local: completion is independent of / depends on the execution of another user process

Page 5: Messaging Semantics

[Figure: sender and receiver, each split into user space and system space, with annotations for blocking/nonblocking, synchronous/asynchronous, and ready/not-ready semantics along the message path.]

Page 6: Blocking/Non-blocking Send

Blocking send: the messaging call does not return until the message data have been safely stored away, so that the sender is free to access and overwrite the send buffer.
The message might be copied directly into the matching receive buffer, or it may be copied into a temporary system buffer even if no matching receive has been invoked.
Local (completion does not depend on the execution of another user process).

Page 7: Blocking Receive -- MPI_Recv

Returns when the receive is locally complete.
The message buffer can be read after the call returns.
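
A minimal C sketch of the blocking pair described on the last two slides, assuming two ranks in MPI_COMM_WORLD; the buffer name, message size (100 doubles), and tag 0 are illustrative choices, not taken from the slides:

    /* Blocking MPI_Send / MPI_Recv between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double msg[100];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 100; i++) msg[i] = i;
            /* MPI_Send returns once msg may safely be reused; the data may
               already be at the receiver or sit in a system buffer. */
            MPI_Send(msg, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* MPI_Recv returns when the receive is locally complete,
               so msg can be read immediately afterwards. */
            MPI_Recv(msg, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received first element %f\n", msg[0]);
        }

        MPI_Finalize();
        return 0;
    }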

Page 8: Nonblocking Send -- MPI_Isend

Non-blocking, asynchronous: does not block waiting for the receive (returns "immediately").
Check for completion with MPI_Wait() before reusing the send buffer.
MPI_Wait() returns when the message has been safely sent, not when it has been received.

Page 9: Non-blocking Receive -- MPI_Irecv

Returns "immediately"; the message buffer should not be read after the call returns.
Must check for local completion:
MPI_Wait(...): blocks until the communication is complete.
MPI_Waitall(...): blocks until all communication operations in a given list have completed.

Page 10: Non-blocking Receive -- MPI_Irecv

MPI_Irecv(buf, count, datatype, source, tag, comm, request): request can be used to query the status of the communication.
MPI_Wait(request, status): returns only when the request is complete.
MPI_Waitall(count, array_of_requests, ...): waits for the completion of all requests in the array.
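
A minimal C sketch of the nonblocking calls above, completed with MPI_Wait; the buffer names, the count of 100 doubles, and tag 0 are illustrative:

    /* Nonblocking MPI_Isend / MPI_Irecv completed with MPI_Wait. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double sendbuf[100], recvbuf[100];
        MPI_Request req;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 100; i++) sendbuf[i] = i;
            MPI_Isend(sendbuf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... computation that does not touch sendbuf ... */
            MPI_Wait(&req, &status);   /* safe to reuse sendbuf from here on */
        } else if (rank == 1) {
            MPI_Irecv(recvbuf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            /* ... computation that does not touch recvbuf ... */
            MPI_Wait(&req, &status);   /* recvbuf may be read from here on */
        }

        MPI_Finalize();
        return 0;
    }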

Page 11: Nonblocking Communication

Improve performance by overlapping communication and computation.
You need an intelligent communication interface (messaging co-processor, as used in SP2, Paragon, CS-2, Myrinet, ATM).

[Figure: timeline of startup and transfer phases, with computation added to overlap the communication.]

Page 12: Ready Send -- MPI_Rsend()

The receive must be posted before the message arrives; otherwise the operation is erroneous and its outcome is undefined.
Non-local (completion depends on the starting time of the receiving process).
Incurs overhead for synchronization.
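
A minimal sketch of one way to make MPI_Rsend legal: the receiver posts MPI_Irecv before a barrier, so after the barrier the sender knows the receive is posted. The synchronization scheme, buffer size, and tag are illustrative choices, not from the slides:

    /* Ready send: guarantee the receive is posted before MPI_Rsend. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[64];
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1)
            MPI_Irecv(buf, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        MPI_Barrier(MPI_COMM_WORLD);   /* receive is now known to be posted */

        if (rank == 0) {
            for (int i = 0; i < 64; i++) buf[i] = i;
            MPI_Rsend(buf, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }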

Page 13: Buffered Send -- MPI_Bsend()

Explicitly buffers messages on the sending side.
The user allocates the buffer (MPI_Buffer_attach()).
Useful when the programmer wants to control buffer usage, e.g. when writing new communication libraries.

Page 14: Buffered Send -- MPI_Bsend()

[Figure: a node with user and system memory, a processing element (PE), a network interface (NI), and a user-allocated buffer for outgoing messages.]
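
A minimal C sketch of a buffered send with a user-attached buffer; the buffer size (which must include MPI_BSEND_OVERHEAD per the MPI standard) and the message layout are illustrative:

    /* User-attached buffer plus MPI_Bsend. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        double msg[100];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int bufsize = 100 * sizeof(double) + MPI_BSEND_OVERHEAD;
            void *buffer = malloc(bufsize);
            MPI_Buffer_attach(buffer, bufsize);

            for (int i = 0; i < 100; i++) msg[i] = i;
            /* MPI_Bsend copies msg into the attached buffer and returns;
               delivery proceeds from that buffer. */
            MPI_Bsend(msg, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

            /* Detach blocks until buffered messages have been transmitted. */
            MPI_Buffer_detach(&buffer, &bufsize);
            free(buffer);
        } else if (rank == 1) {
            MPI_Recv(msg, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }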

Page 15: Synchronous Send -- MPI_Ssend()

Does not return until the message is actually being received.
The send buffer can be reused once the send operation has completed.
Non-local (the receiver must have received the message).

Page 16: Standard Send -- MPI_Send()

Standard send: behavior depends on the implementation (usually synchronous, blocking, and non-local).
It is safe to reuse the buffer when MPI_Send() returns.
May block until the message is received (depends on the implementation).

Page 17: Standard Send -- MPI_Send()

A good implementation:
Short messages: send immediately, buffering if no receive is posted. The goal is to reduce latency; the buffering cost is unimportant.
Large messages: use a rendezvous protocol (request, reply, then send; wait for the matching receive before sending the data).

Page 18: How to Exchange Data

Simple version (code in node 0):

    sid = MPI_Isend(buf1, node1)
    rid = MPI_Irecv(buf2, node1)
    ..... computation .....
    call MPI_Wait(sid)
    call MPI_Wait(rid)

For maximum performance:

    ids(1) = MPI_Isend(buf1, node1)
    ids(2) = MPI_Irecv(buf2, node1)
    ..... computation .....
    call MPI_Waitall(2, ids)
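
A hedged C rendering of the pseudocode above, assuming the peer is rank 1 of MPI_COMM_WORLD and the buffers hold n doubles; the function name exchange_with_rank1 is invented for illustration:

    /* Exchange data with rank 1 using one MPI_Waitall for both operations. */
    #include <mpi.h>

    void exchange_with_rank1(double *buf1, double *buf2, int n)
    {
        MPI_Request ids[2];
        MPI_Status  stats[2];

        MPI_Isend(buf1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &ids[0]);
        MPI_Irecv(buf2, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &ids[1]);

        /* ... computation that touches neither buf1 nor buf2 ... */

        MPI_Waitall(2, ids, stats);   /* completes both the send and the receive */
    }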

Page 19: Model and Measure p2p Communication in MPI

data transfer time = latency + message size / bandwidth, i.e. T(n) = T0 + n/B

Latency (T0) is the startup time; it is independent of message size (but depends on the communication mode/protocol).
Bandwidth (B) is the number of bytes transferred per second (limited by the memory access rate and the network transmission rate).

Page 20: Latency and Bandwidth

For short messages, latency dominates the transfer time.
For long messages, the bandwidth term dominates the transfer time.
Critical message size: n_1/2 = latency x bandwidth (obtained by setting latency = message size / bandwidth).
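
A small worked sketch of the model, using the T3D figures from the results table later in this lecture (T0 = 54 microseconds, B = 120 MB/s) purely as sample inputs:

    /* Evaluate T(n) = T0 + n/B and n_1/2 = T0 * B for sample numbers. */
    #include <stdio.h>

    int main(void)
    {
        double T0 = 54e-6;      /* startup latency in seconds (T3D) */
        double B  = 120e6;      /* bandwidth in bytes per second (T3D) */
        double n  = 1e6;        /* example message size: 1 MB */

        double transfer_time = T0 + n / B;   /* seconds */
        double n_half        = T0 * B;       /* bytes where latency term equals bandwidth term */

        printf("T(1 MB) = %.1f us, n_1/2 = %.0f bytes\n",
               transfer_time * 1e6, n_half);
        return 0;
    }

With these numbers the critical message size comes out at 6480 bytes.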

Page 21: Measure p2p Performance

One-way time = round-trip (ping-pong) time / 2.

[Figure: one process does send then recv; the other does recv then send.]
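
A minimal C sketch of the ping-pong measurement; the message size, repetition count, and use of MPI_Wtime for timing are illustrative choices:

    /* Ping-pong between ranks 0 and 1; one-way time = round-trip / 2. */
    #include <mpi.h>
    #include <stdio.h>

    #define NREPS  1000
    #define NBYTES 1024

    int main(int argc, char **argv)
    {
        int rank;
        char buf[NBYTES];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* divide round-trip time by 2 for the one-way time */
            printf("one-way time: %.2f us\n",
                   (t1 - t0) / NREPS / 2 * 1e6);

        MPI_Finalize();
        return 0;
    }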

Page 22: Some MPI Performance Results

Machine          T0 (microsec)   B (MB/s)
T3D              54              120
SP2              61              33
Paragon          75              36
PowerChallenge   15              61

Page 23: Protocols

Rendezvous
Eager
Mixed
Pull (get)

Page 24: Rendezvous

Algorithm: the sender sends a request-to-send; the receiver acknowledges; the sender sends the data.
No buffering required.
High latency (three steps).
High bandwidth (no extra buffer copy).

Page 25: Eager

Algorithm: the sender sends the data immediately; the data usually must be buffered, but may be transferred directly if the receive is already posted.
Features: low latency; low bandwidth (buffer copy).

Page 26: Mixed

Algorithm: eager for short messages; rendezvous for long messages; switch protocols near n_1/2.

Page 27: Mixed

Features: low latency for latency-dominated (short) messages; high bandwidth for bandwidth-dominated (long) messages; reasonable memory management; non-ideal performance for some messages near n_1/2.

Page 28: Pull (Get) Protocol

One-sided communication.
Used in shared-memory machines.

Page 29: MPICH p2p on SGI

[Figure: ping-pong wall-clock time (us, 0-10) vs. packet size (0-2304 bytes, sampled every 128 bytes) on the SGI Power Challenge (configuration: -arch=IRIX64 -device=ch_lfshmem -comm=shared); minimum and average curves shown.]

Default protocol thresholds: 0-1024 bytes: Short; 1024 bytes-128 KB: Eager; > 128 KB: Rendezvous (MPID_PKT_MAX_DATA_SIZE = 256).
Short: the data are filled into the message header.

Page 30: Let MPID_PKT_MAX_DATA_SIZE = 256

[Figure: ping-pong wall-clock time (us, 0-14) vs. packet size (0-2304 bytes, sampled every 128 bytes) on the SGI Power Challenge (configuration: -arch=IRIX64 -device=ch_lfshmem -comm=shared), with MPID_PKT_MAX_DATA_SIZE set to 256 and long_len < 1024; minimum and average curves shown, with the Short, Eager, and Rendezvous protocol regions marked.]

Page 31: MPI-FM (HPVM: Fast Messages) Performance

[Figure: bar charts of one-way latency (us, 0-250, lower is better) and bandwidth (MB/s, 0-300, higher is better) for HPVM, Power Challenge, SP-2, T3E, Origin 2K, and Beowulf.]

Note: supercomputer measurements taken by NAS, JPL, and HLRS (Germany).

Page 32: MPI Collective Operations

Page 33: MPI_Alltoall(v)

MPI_Alltoall is an extension of MPI_Allgather to the case where each process sends distinct data to each of the receivers: the j-th block of data sent from process i is received by process j and is placed in the i-th block of the receive buffer of process j.
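
A minimal C sketch of MPI_Alltoall with one int per process pair; the 10*i + j block encoding is invented here so the shuffle is easy to see in the output:

    /* Each rank sends one int to every rank; recvbuf[i] on rank j ends up
       holding the value rank i sent to rank j. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));

        /* Block j of rank i is encoded as 10*i + j. */
        for (int j = 0; j < size; j++)
            sendbuf[j] = 10 * rank + j;

        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        /* recvbuf[i] now equals 10*i + rank: the j-th block sent by rank i
           landed in the i-th block of rank j's receive buffer. */
        printf("rank %d: recvbuf[0] = %d\n", rank, recvbuf[0]);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }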

Page 34: MPI_Alltoall(v)

Define (i,j) to be the i-th block of data of process j.

[Figure: alltoall data movement among 8 processes; each process's send buffer holds 8 blocks, and after the alltoall the j-th block of process i sits in the i-th block of process j's receive buffer.]

Page 35: MPI_Alltoall(v)

Current implementation: process j sends block (i,j) directly to process i.

[Figure: send and receive buffers of processes 0 through 7 during the direct exchange.]

Page 36: MPI_Alltoall(v)

Current implementation: process j sends block (i,j) directly to process i.

[Figure: send and receive buffers of processes 0 through 7 after the exchange has completed.]