
Page 1: Lecture 4: Part 2: MPI Point-to-Point Communication

Page 2: Realizing Message Passing

Separate the network from the processor; separate user memory from system memory.

[Figure: two nodes, each with user and system memory, a processing element (PE), and a network interface (NI), connected by the network.]

Page 3: Communication Modes for "Send"

Blocking/Non-blocking: timing regarding the use of the user message buffer
Ready: timing regarding the invocation of send and receive
Buffered: user/system buffer allocation

Page 4: Communication Modes for "Send"

Synchronous/Asynchronous: timing regarding the invocation of send and receive plus the execution of the receive operation
Local/Non-local: completion is independent of / depends on the execution of another user process

Page 5: Messaging Semantics

[Figure: sender and receiver, each split into user space and system space, with annotations for blocking/nonblocking, synchronous/asynchronous, and ready/not-ready semantics along the message path.]

Page 6: Blocking/Non-blocking Send

Blocking send: the messaging call does not return until the message data have been safely stored away, so that the sender is free to access and overwrite the send buffer.
The message might be copied directly into the matching receive buffer, or it may be copied into a temporary system buffer even if no matching receive has been invoked.
Local (completion does not depend on the execution of another user process).

Page 7: Blocking Receive -- MPI_Recv

Returns when the receive is locally complete.
The message buffer can be read after the call returns.
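
A minimal C sketch of the blocking pair described on the last two slides, assuming two ranks in MPI_COMM_WORLD; the buffer name, message size (100 doubles), and tag 0 are illustrative choices, not taken from the slides:

    /* Blocking MPI_Send / MPI_Recv between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double msg[100];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 100; i++) msg[i] = i;
            /* MPI_Send returns once msg may safely be reused; the data may
               already be at the receiver or sit in a system buffer. */
            MPI_Send(msg, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* MPI_Recv returns when the receive is locally complete,
               so msg can be read immediately afterwards. */
            MPI_Recv(msg, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received first element %f\n", msg[0]);
        }

        MPI_Finalize();
        return 0;
    }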

Page 8: Nonblocking Send -- MPI_Isend

Non-blocking, asynchronous: does not block waiting for the receive (returns "immediately").
Check for completion with MPI_Wait() before reusing the send buffer.
MPI_Wait() returns when the message has been safely sent, not when it has been received.

Page 9: Non-blocking Receive -- MPI_Irecv

Returns "immediately"; the message buffer should not be read after the call returns.
Must check for local completion:
MPI_Wait(...): blocks until the communication is complete.
MPI_Waitall(...): blocks until all communication operations in a given list have completed.

Page 10: Non-blocking Receive -- MPI_Irecv

MPI_Irecv(buf, count, datatype, source, tag, comm, request): request can be used to query the status of the communication.
MPI_Wait(request, status): returns only when the request is complete.
MPI_Waitall(count, array_of_requests, ...): waits for the completion of all requests in the array.
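
A minimal C sketch of the nonblocking calls above, completed with MPI_Wait; the buffer names, the count of 100 doubles, and tag 0 are illustrative:

    /* Nonblocking MPI_Isend / MPI_Irecv completed with MPI_Wait. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double sendbuf[100], recvbuf[100];
        MPI_Request req;
        MPI_Status  status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 100; i++) sendbuf[i] = i;
            MPI_Isend(sendbuf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... computation that does not touch sendbuf ... */
            MPI_Wait(&req, &status);   /* safe to reuse sendbuf from here on */
        } else if (rank == 1) {
            MPI_Irecv(recvbuf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            /* ... computation that does not touch recvbuf ... */
            MPI_Wait(&req, &status);   /* recvbuf may be read from here on */
        }

        MPI_Finalize();
        return 0;
    }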

Page 11: Nonblocking Communication

Improve performance by overlapping communication and computation.
You need an intelligent communication interface (messaging co-processor, as used in SP2, Paragon, CS-2, Myrinet, ATM).

[Figure: timeline of startup and transfer phases, with computation added to overlap the communication.]

Page 12: Ready Send -- MPI_Rsend()

The receive must be posted before the message arrives; otherwise the operation is erroneous and its outcome is undefined.
Non-local (completion depends on the starting time of the receiving process).
Incurs overhead for synchronization.
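
A minimal sketch of one way to make MPI_Rsend legal: the receiver posts MPI_Irecv before a barrier, so after the barrier the sender knows the receive is posted. The synchronization scheme, buffer size, and tag are illustrative choices, not from the slides:

    /* Ready send: guarantee the receive is posted before MPI_Rsend. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[64];
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1)
            MPI_Irecv(buf, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        MPI_Barrier(MPI_COMM_WORLD);   /* receive is now known to be posted */

        if (rank == 0) {
            for (int i = 0; i < 64; i++) buf[i] = i;
            MPI_Rsend(buf, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }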

Page 13: Buffered Send -- MPI_Bsend()

Explicitly buffers messages on the sending side.
The user allocates the buffer (MPI_Buffer_attach()).
Useful when the programmer wants to control buffer usage, e.g. when writing new communication libraries.

Page 14: Buffered Send -- MPI_Bsend()

[Figure: a node with user and system memory, a processing element (PE), a network interface (NI), and a user-allocated buffer for outgoing messages.]
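
A minimal C sketch of a buffered send with a user-attached buffer; the buffer size (which must include MPI_BSEND_OVERHEAD per the MPI standard) and the message layout are illustrative:

    /* User-attached buffer plus MPI_Bsend. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        double msg[100];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int bufsize = 100 * sizeof(double) + MPI_BSEND_OVERHEAD;
            void *buffer = malloc(bufsize);
            MPI_Buffer_attach(buffer, bufsize);

            for (int i = 0; i < 100; i++) msg[i] = i;
            /* MPI_Bsend copies msg into the attached buffer and returns;
               delivery proceeds from that buffer. */
            MPI_Bsend(msg, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

            /* Detach blocks until buffered messages have been transmitted. */
            MPI_Buffer_detach(&buffer, &bufsize);
            free(buffer);
        } else if (rank == 1) {
            MPI_Recv(msg, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }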

Page 15: Synchronous Send -- MPI_Ssend()

Does not return until the message is actually being received.
The send buffer can be reused once the send operation has completed.
Non-local (the receiver must have received the message).

Page 16: Standard Send -- MPI_Send()

Standard send: behavior depends on the implementation (usually synchronous, blocking, and non-local).
It is safe to reuse the buffer when MPI_Send() returns.
May block until the message is received (depends on the implementation).

Page 17: Standard Send -- MPI_Send()

A good implementation:
Short messages: send immediately, buffering if no receive is posted. The goal is to reduce latency; the buffering cost is unimportant.
Large messages: use a rendezvous protocol (request, reply, then send; wait for the matching receive before sending the data).

Page 18: How to Exchange Data

Simple version (code in node 0):

    sid = MPI_Isend(buf1, node1)
    rid = MPI_Irecv(buf2, node1)
    ..... computation .....
    call MPI_Wait(sid)
    call MPI_Wait(rid)

For maximum performance:

    ids(1) = MPI_Isend(buf1, node1)
    ids(2) = MPI_Irecv(buf2, node1)
    ..... computation .....
    call MPI_Waitall(2, ids)
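
A hedged C rendering of the pseudocode above, assuming the peer is rank 1 of MPI_COMM_WORLD and the buffers hold n doubles; the function name exchange_with_rank1 is invented for illustration:

    /* Exchange data with rank 1 using one MPI_Waitall for both operations. */
    #include <mpi.h>

    void exchange_with_rank1(double *buf1, double *buf2, int n)
    {
        MPI_Request ids[2];
        MPI_Status  stats[2];

        MPI_Isend(buf1, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &ids[0]);
        MPI_Irecv(buf2, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &ids[1]);

        /* ... computation that touches neither buf1 nor buf2 ... */

        MPI_Waitall(2, ids, stats);   /* completes both the send and the receive */
    }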

Page 19: Model and Measure p2p Communication in MPI

data transfer time = latency + message size / bandwidth, i.e. T(n) = T0 + n/B

Latency (T0) is the startup time; it is independent of message size (but depends on the communication mode/protocol).
Bandwidth (B) is the number of bytes transferred per second (limited by the memory access rate and the network transmission rate).

Page 20: Latency and Bandwidth

For short messages, latency dominates the transfer time.
For long messages, the bandwidth term dominates the transfer time.
Critical message size: n_1/2 = latency x bandwidth (obtained by setting latency = message size / bandwidth).
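
A small worked sketch of the model, using the T3D figures from the results table later in this lecture (T0 = 54 microseconds, B = 120 MB/s) purely as sample inputs:

    /* Evaluate T(n) = T0 + n/B and n_1/2 = T0 * B for sample numbers. */
    #include <stdio.h>

    int main(void)
    {
        double T0 = 54e-6;      /* startup latency in seconds (T3D) */
        double B  = 120e6;      /* bandwidth in bytes per second (T3D) */
        double n  = 1e6;        /* example message size: 1 MB */

        double transfer_time = T0 + n / B;   /* seconds */
        double n_half        = T0 * B;       /* bytes where latency term equals bandwidth term */

        printf("T(1 MB) = %.1f us, n_1/2 = %.0f bytes\n",
               transfer_time * 1e6, n_half);
        return 0;
    }

With these numbers the critical message size comes out at 6480 bytes.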

Page 21: Measure p2p Performance

One-way time = round-trip (ping-pong) time / 2.

[Figure: one process does send then recv; the other does recv then send.]
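
A minimal C sketch of the ping-pong measurement; the message size, repetition count, and use of MPI_Wtime for timing are illustrative choices:

    /* Ping-pong between ranks 0 and 1; one-way time = round-trip / 2. */
    #include <mpi.h>
    #include <stdio.h>

    #define NREPS  1000
    #define NBYTES 1024

    int main(int argc, char **argv)
    {
        int rank;
        char buf[NBYTES];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* divide round-trip time by 2 for the one-way time */
            printf("one-way time: %.2f us\n",
                   (t1 - t0) / NREPS / 2 * 1e6);

        MPI_Finalize();
        return 0;
    }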

Page 22: Some MPI Performance Results

Machine          T0 (microsec)   B (MB/s)
T3D              54              120
SP2              61              33
Paragon          75              36
PowerChallenge   15              61

Page 23: Protocols

Rendezvous
Eager
Mixed
Pull (get)

Page 24: Rendezvous

Algorithm: the sender sends a request-to-send; the receiver acknowledges; the sender sends the data.
No buffering required.
High latency (three steps).
High bandwidth (no extra buffer copy).

Page 25: Eager

Algorithm: the sender sends the data immediately; the data usually must be buffered, but may be transferred directly if the receive is already posted.
Features: low latency; low bandwidth (buffer copy).

Page 26: Mixed

Algorithm: eager for short messages; rendezvous for long messages; switch protocols near n_1/2.

Page 27: Mixed

Features: low latency for latency-dominated (short) messages; high bandwidth for bandwidth-dominated (long) messages; reasonable memory management; non-ideal performance for some messages near n_1/2.

Page 28: Pull (Get) Protocol

One-sided communication.
Used in shared-memory machines.

Page 29: MPICH p2p on SGI

[Figure: ping-pong wall-clock time (us, 0-10) vs. packet size (0-2304 bytes, sampled every 128 bytes) on the SGI Power Challenge (configuration: -arch=IRIX64 -device=ch_lfshmem -comm=shared); minimum and average curves shown.]

Default protocol thresholds: 0-1024 bytes: Short; 1024 bytes-128 KB: Eager; > 128 KB: Rendezvous (MPID_PKT_MAX_DATA_SIZE = 256).
Short: the data are filled into the message header.

Page 30: Let MPID_PKT_MAX_DATA_SIZE = 256

[Figure: ping-pong wall-clock time (us, 0-14) vs. packet size (0-2304 bytes, sampled every 128 bytes) on the SGI Power Challenge (configuration: -arch=IRIX64 -device=ch_lfshmem -comm=shared), with MPID_PKT_MAX_DATA_SIZE set to 256 and long_len < 1024; minimum and average curves shown, with the Short, Eager, and Rendezvous protocol regions marked.]

Page 31: MPI-FM (HPVM: Fast Messages) Performance

[Figure: bar charts of one-way latency (us, 0-250, lower is better) and bandwidth (MB/s, 0-300, higher is better) for HPVM, Power Challenge, SP-2, T3E, Origin 2K, and Beowulf.]

Note: supercomputer measurements taken by NAS, JPL, and HLRS (Germany).

Page 32: MPI Collective Operations

Page 33: MPI_Alltoall(v)

MPI_Alltoall is an extension of MPI_Allgather to the case where each process sends distinct data to each of the receivers: the j-th block of data sent from process i is received by process j and is placed in the i-th block of the receive buffer of process j.
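
A minimal C sketch of MPI_Alltoall with one int per process pair; the 10*i + j block encoding is invented here so the shuffle is easy to see in the output:

    /* Each rank sends one int to every rank; recvbuf[i] on rank j ends up
       holding the value rank i sent to rank j. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));

        /* Block j of rank i is encoded as 10*i + j. */
        for (int j = 0; j < size; j++)
            sendbuf[j] = 10 * rank + j;

        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        /* recvbuf[i] now equals 10*i + rank: the j-th block sent by rank i
           landed in the i-th block of rank j's receive buffer. */
        printf("rank %d: recvbuf[0] = %d\n", rank, recvbuf[0]);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }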

Page 34: MPI_Alltoall(v)

Define (i,j) to be the i-th block of data of process j.

[Figure: alltoall data movement among 8 processes; each process's send buffer holds 8 blocks, and after the alltoall the j-th block of process i sits in the i-th block of process j's receive buffer.]

Page 35: MPI_Alltoall(v)

Current implementation: process j sends block (i,j) directly to process i.

[Figure: send and receive buffers of processes 0 through 7 during the direct exchange.]

Page 36: MPI_Alltoall(v)

Current implementation: process j sends block (i,j) directly to process i.

[Figure: send and receive buffers of processes 0 through 7 after the exchange has completed.]