Introduction to the Message Passing Interface (MPI)
TRANSCRIPT
Introduction to the Message Passing Interface (MPI)
Jan Fostier
May 3rd, 2017
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Moore's Law
"Transistor count doubles every two years"
Illustration from Wikipedia
Moore's Law
[Chart: transistor counts, 2000 vs. 2011; illustration from Wikipedia]
Evolution of top 500 supercomputers over time
Image taken from www.top500.org
17.6 PFlops/s (#1 ranked); 76 TFlops/s (#500 ranked)
Exponential increase of supercomputer peak performance over time
Application area – performance share
Image taken from www.top500.org
[Chart with categories including "research" and "unknown"]
Operating system family
[Chart with categories "Linux", "Unix", "Mixed"]
Motivation for parallel computing
• Want to run the same program faster
  § It depends on the application what is considered an acceptable runtime
    o SETI@Home, Folding@Home, GIMPS: years may be acceptable
    o For R&D applications: days or even weeks are acceptable
      – CFD, CEM, bioinformatics, cheminformatics, and many more
    o Prediction of tomorrow's weather should take less than a day of computation time
    o Some applications require real-time behavior
      – Computer games, algorithmic trading
• Want to run bigger datasets
• Want to reduce financial cost and/or power consumption
Distributed-memory architecture
[Diagram: several machines, each with a CPU, memory and a network interface (NI), connected through an interconnection network (e.g. Gigabit Ethernet, Infiniband, Myrinet, …)]
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Message Passing Interface (MPI)
• MPI = library specification, not an implementation
• Most important implementations
  § Open MPI (http://www.open-mpi.org, MPI-3 standard)
  § Intel MPI (proprietary, MPI-3 standard)
• Specifies routines for (among others)
  § Point-to-point communication (between 2 processes)
  § Collective communication (> 2 processes)
  § Topology setup
  § Parallel I/O
• Bindings for C/C++ and Fortran
MPI reference works
• MPI standards: http://www.mpi-forum.org/docs/
• MPI: The Complete Reference (M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra). Available from http://switzernet.com/people/emin-gabrielyan/060708-thesis-ref/papers/Snir96.pdf
• Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd ed. (W. Gropp, E. Lusk, A. Skjellum)
MPI standard
• Started in 1992 (Workshop on Standards for Message-Passing in a Distributed Memory Environment) with support from vendors, library writers and academia
• MPI version 1.0 (May 1994)
  • Final pre-draft in 1993 (Supercomputing '93 conference)
  • Final version June 1994
• MPI version 2.0 (July 1997)
  • Support for one-sided communication
  • Support for process management
  • Support for parallel I/O
• MPI version 3.0 (September 2012)
  • Support for non-blocking collective communication
  • Fortran 2008 bindings
  • New one-sided communication routines
Hello world example in MPI

    #include <mpi.h>
    #include <iostream>
    #include <cstdlib>

    using namespace std;

    int main(int argc, char* argv[]) {
        int rank, size;
        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );

        cout << "Hello World from process " << rank << "/" << size << endl;

        MPI_Finalize();
        return EXIT_SUCCESS;
    }

    [john@doe ~]$ mpirun -np 4 ./helloWorld
    Hello World from process 2/4
    Hello World from process 3/4
    Hello World from process 0/4
    Hello World from process 1/4

Output order is random
Basic MPI routines
• int MPI_Init(int *argc, char ***argv)
  § Initialization: all processes must call this prior to any other MPI routine
  § Strips off (possible) arguments provided by "mpirun"
• int MPI_Finalize(void)
  § Cleanup: all processes must call this routine at the end of the program
  § All pending communication should have finished before calling this
• int MPI_Comm_size(MPI_Comm comm, int *size)
  § Returns the size of the communicator associated with "comm"
  § Communicator = user-defined subset of processes
  § MPI_COMM_WORLD = communicator that involves all processes
• int MPI_Comm_rank(MPI_Comm comm, int *rank)
  § Returns the rank of the process in the communicator
  § Range: [0 ... size - 1]
Message Passing Interface mechanisms
[Diagram: interconnection network (e.g. Gigabit Ethernet, Infiniband, Myrinet, …) connecting several machines, each running one "Hello World" process; P processes (or instances) of the "Hello World" program]
• mpirun launches P independent processes across the different machines
• Each process is an instance of the same program
• Stack variables: rank = 0, 1, 2, ..., P-1 respectively; size = P in every process
Terminology
• Computer program = passive collection of instructions
• Process = instance of a computer program that is being executed
• Multitasking = running multiple processes on a CPU
• Thread = smallest stream of instructions that can be managed by an OS scheduler (= light-weight process)
• Distributed-memory system = multi-processor system where each processor has direct (fast) access to its own private memory and relies on inter-processor communication to access another processor's memory (typically slower)
Multithreading versus multiprocessing

Multi-threading:
• Single process
• Shared memory address space
• Protect data against simultaneous writing
• Limited to a single machine
• E.g. Pthreads, CILK, OpenMP, etc.

Message passing:
• Multiple processes
• Separate memory address spaces
• Explicitly communicate everything
• Multiple machines possible
• E.g. MPI, Unified Parallel C, PVM

[Diagram: sequential sections Seq I and Seq II with parallel parts A, B, C, D in between; thread spawning and thread joining in the multi-threading model vs. separate processes in the message-passing model]
Message Passing Interface (MPI)
• MPI mechanisms (depends on implementation)
  • Compiling an MPI program from source
    • mpicc -O3 main.cpp -o main   (C)
      gcc -O3 main.cpp -o main -L<IncludeDir> -l<mpiLibs>
    • Also mpic++ (or mpicxx), mpif77, mpif90, etc.   (C++, Fortran)
    • mpicc / mpic++ default to gcc
  • Running MPI applications (manually)
    • mpirun -np <number of program instances> <your program>
    • List of worker nodes specified in some config file
• Using MPI on the UGent HPC cluster
  • Load the appropriate module first, e.g.
    • module load intel/2017a
  • Compiling an MPI program from source
    • mpigcc (uses the GNU "gcc" C compiler)
    • mpiicc (uses the Intel "icc" C compiler)
    • mpigxx (uses the GNU "g++" C++ compiler)
    • mpiicpc (uses the Intel "icpc" C++ compiler)
  • Submit job using a job script (see further)
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Basic MPI point-to-point communication

    ...
    int rank, size, count;
    char b[40];
    MPI_Status status;

    ... // init MPI and rank and size variables

    if (rank != 0) {                  // branching on rank
        char *str = "Hello World";
        MPI_Send(str, 12, MPI_CHAR, 0, 123, MPI_COMM_WORLD);
    } else {
        for (int i = 1; i < size; i++) {
            MPI_Recv(b, 40, MPI_CHAR, i, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_CHAR, &count);
            printf("I received %s from process %d with size %d and tag %d\n",
                   b, status.MPI_SOURCE, count, status.MPI_TAG);
        }
    }
    ...

    [john@doe ~]$ mpirun -np 4 ./ptpcomm
    I received Hello World from process 1 with size 12 and tag 123
    I received Hello World from process 2 with size 12 and tag 123
    I received Hello World from process 3 with size 12 and tag 123
Message Passing Interface mechanisms
[Diagram: interconnection network connecting the machines; rank 0 performs Recv three times, ranks 1, 2, ..., P-1 each Send a "Hello World" message to rank 0]
• mpirun launches P independent processes across the different machines
• Each process is an instance of the same program
Blocking send and receive

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
• buf: pointer to the message to send
• count: number of items to send
• datatype: datatype of each item
  – number of bytes sent: count * sizeof(datatype)
• dest: rank of destination process
• tag: value to identify the message [0 ... at least 32767]
• comm: communicator specification (e.g. MPI_COMM_WORLD)

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
• buf: pointer to the buffer to store received data
• count: upper bound (!) of the number of items to receive
• datatype: datatype of each item
• source: rank of source process (or MPI_ANY_SOURCE)
• tag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }
Sending and receiving
Two-sided communication:
• Both the sender and the receiver are involved in the data transfer, as opposed to one-sided communication
• A posted send must match a receive
When do MPI_Send and MPI_Recv match?
• 1. Rank of the receiving process
• 2. Rank of the sending process
• 3. Tag
  – custom value to distinguish messages from the same sender
• 4. Communicator
Rationale for communicators:
• Used to create subsets of processes
• Transparent use of tags
  – modules can be written in isolation
  – communication within a module through its own communicator
  – communication between modules through a shared communicator
MPI datatypes

MPI_Datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              no conversion, bit pattern transferred as-is
MPI_PACKED            grouped messages
Querying for information
• MPI_Status
  • Stores information about the MPI_Recv operation

    typedef struct MPI_Status {
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
    };

  • Does not contain the size of the received message
• int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
  • returns the number of data items received in the count variable
  • not directly accessible from the status variable
Blocking send and receive
[Timing diagrams of the req-to-send / ok-to-send / data handshake:
 a) sender comes first: idling at the sender (no buffering of the message)
 b) sending/receiving at about the same time: idling minimized
 c) receiver comes first: idling at the receiver]
Slide reproduced from slides by John Mellor-Crummey
Deadlocks

    int a[10], b[10], myRank;
    MPI_Status s1, s2;
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

    if ( myRank == 0 ) {
        MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
        MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
    } else if ( myRank == 1 ) {
        MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1 );
        MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2 );
    }

Slide credit: John Mellor-Crummey

If MPI_Send is blocking (handshake protocol), this program will deadlock
• If the 'eager' protocol is used, it may run
• Depends on the message size
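One way to break this deadlock (a minimal sketch, not part of the original slide) is to let rank 1 post its receives in the same order as the sends, so each blocking send finds its matching receive; another option is MPI_Sendrecv (see further).

    // Hypothetical fix: rank 1 receives the messages in the order they were sent,
    // so each blocking MPI_Send can complete regardless of the protocol used.
    if ( myRank == 0 ) {
        MPI_Send( a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD );
        MPI_Send( b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD );
    } else if ( myRank == 1 ) {
        MPI_Recv( a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2 );
        MPI_Recv( b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1 );
    }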
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Network cost modeling
• Various choices of interconnection network
  § Gigabit Ethernet: cheap, but far too slow for HPC applications
  § Infiniband / Myrinet: high-speed interconnect
• Simple performance model for point-to-point communication:
  Tcomm = α + β·n
  § α = latency
  § B = 1/β = saturation (asymptotic) bandwidth (bytes/s)
  § n = number of bytes to transmit
  § Effective bandwidth: Beff = n / (α + β·n) = n / (α + n/B)
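As a quick illustration (the numbers are hypothetical, not from the slides): with a latency α = 50 µs and an asymptotic bandwidth B = 1 GB/s (β = 1 ns/byte), a 1 kB message takes Tcomm = 50 µs + 1 µs = 51 µs, so Beff = 1000 B / 51 µs ≈ 20 MB/s, only about 2% of the saturation bandwidth; for a 1 MB message, Beff ≈ 1 MB / 1.05 ms ≈ 950 MB/s.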
The case of Gengar (UGent cluster): bandwidth and latency
[Diagram: a node with two quad-core CPUs (cores 1-4, each with an L1 cache, pairs of cores sharing 6 MByte of L2 cache), 16 GByte of RAM, and a non-blocking Infiniband (switched fabric) connection to the 193 other machines. Indicative figures on the slide: bandwidths of ~100 Gbit/s and ~20 Gbit/s; latencies of 2 ns, 50 ns, and ~1 µs to the 193 other machines]
Measure effective bandwidth: ring test
Idea: send a single message of size N in a circle (0 → 1 → 2 → ... → P-1 → 0)
• Increase the message size N exponentially
  • 1 byte, 2 bytes, 4 bytes, ..., 1024 bytes, 2048 bytes, 4096 bytes, ...
• Benchmark the results (measure wall clock time T)
• Bandwidth = N * P / T
Hands-on: ring test in MPI

    void sendRing( char *buffer, int length ) {
        /* send message in a ring here */
    }

    int main( int argc, char *argv[] ) {
        ...
        char *buffer = (char*) calloc( 1048576, sizeof(char) );
        int msgLen = 8;
        for (int i = 0; i < 18; i++, msgLen *= 2) {
            double startTime = MPI_Wtime();
            sendRing( buffer, msgLen );
            double stopTime = MPI_Wtime();
            double elapsedSec = stopTime - startTime;
            if (rank == 0)
                printf( "Bandwidth for size %d is: %f\n", ... );
        }
        ...
    }
Job script example

mpijob.sh job script example:

    #!/bin/sh
    #PBS -o output.file
    #PBS -e error.file
    #PBS -l nodes=2:ppn=all
    #PBS -l walltime=00:02:00
    #PBS -m n

    cd $VSC_SCRATCH/<your directory>
    module load intel/2017a
    module load scripts
    mympirun ./<program name> <program arguments>

qsub mpijob.sh
qstat / qdel / etc. remain the same
Hands-on: ring test in MPI (solution)

    void sendRing( char *buffer, int msgLen ) {
        int myRank, numProc;
        MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
        MPI_Comm_size( MPI_COMM_WORLD, &numProc );
        MPI_Status status;

        int prevR = (myRank - 1 + numProc) % numProc;
        int nextR = (myRank + 1) % numProc;

        if (myRank == 0) { // send first, then receive
            MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
            MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
        } else {           // receive first, then send
            MPI_Recv( buffer, msgLen, MPI_CHAR, prevR, 0, MPI_COMM_WORLD, &status );
            MPI_Send( buffer, msgLen, MPI_CHAR, nextR, 0, MPI_COMM_WORLD );
        }
    }
Basic MPI routines
Timing routines in MPI
• double MPI_Wtime( void )
  • returns the time in seconds relative to "some time" in the past
  • "some time" in the past is fixed during the process
• double MPI_Wtick( void )
  • Returns the resolution of MPI_Wtime() in seconds
  • e.g. 10^-3 = millisecond resolution
Bandwidth on Gengar (UGent cluster)
[Graph: effective bandwidth Beff (Mbyte/s, 0 to 1200) versus message size (byte, 1E+00 to 1E+08) for Gigabit Ethernet and Infiniband]
Effective bandwidth increases for larger messages
Benchmark results: comparison of CPU load
[Charts: CPU load for Infiniband vs. Gigabit Ethernet]
Exchanging messages in MPI

int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status )
• sendbuf: pointer to the message to send
• sendcount: number of elements to transmit
• sendtype: datatype of the items to send
• dest: rank of destination process
• sendtag: identifier for the message
• recvbuf: pointer to the buffer to store the message (disjoint with sendbuf)
• recvcount: upper bound (!) to the number of elements to receive
• recvtype: datatype of the items to receive
• source: rank of the source process (or MPI_ANY_SOURCE)
• recvtag: value to identify the message (or MPI_ANY_TAG)
• comm: communicator specification (e.g. MPI_COMM_WORLD)
• status: structure that contains { MPI_SOURCE, MPI_TAG, MPI_ERROR }

int MPI_Sendrecv_replace( ... )
• Buffer is replaced by the received data
Basic MPI routines: Sendrecv example

    const int len = 10000;
    int a[len], b[len];
    if ( myRank == 0 ) {
        MPI_Send( a, len, MPI_INT, 1, 0, MPI_COMM_WORLD );
        MPI_Recv( b, len, MPI_INT, 2, 2, MPI_COMM_WORLD, &status );
    } else if ( myRank == 1 ) {
        MPI_Sendrecv( a, len, MPI_INT, 2, 1, b, len, MPI_INT, 0,
                      0, MPI_COMM_WORLD, &status );
    } else if ( myRank == 2 ) {
        MPI_Sendrecv( a, len, MPI_INT, 0, 2, b, len, MPI_INT, 1,
                      1, MPI_COMM_WORLD, &status );
    }

• Compatibility between Sendrecv and 'normal' send and recv
• Sendrecv can help to prevent deadlocks
[Diagram: three processes 0, 1, 2 exchanging messages in a ring; safe to exchange!]
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Non-blocking communication
Idea:
• Do something useful while waiting for communications to finish
• Try to overlap communications and computations
How?
• Replace blocking communication by non-blocking variants
  MPI_Send(...)  →  MPI_Isend(..., MPI_Request *request)
  MPI_Recv(...)  →  MPI_Irecv(..., MPI_Request *request)
• I = immediate functions
• MPI_Isend and MPI_Irecv routines return immediately
• Need polling routines to verify progress
  • the request handle is used to identify communications
  • the status field moved to the polling routines (see further)
Non-blockingcommunications
Asynchronousprogress• =abilitytoprogresscommunicationswhileperformingcalculations• Dependsonhardware
• GigabitEthernet=verylimited• Infiniband=muchmorepossibilities
• DependsonMPIimplementation• MultithreadedimplementationsofMPI(e.g.OpenMPI)• Daemonforasynchronousprogress(e.g.LAMMPI)
• Dependsonprotocol• Eagerprotocol• Handshakeprotocol
• Stillthesubjectofongoingresearch
Non-blocking sending and receiving
[Timing diagrams of Isend / Irecv and the req-to-send / ok-to-send / data exchange:
 a) network interface supports overlapping computations and communications
 b) network interface has no such support]
Slide reproduced from slides by John Mellor-Crummey
Non-blocking communications: polling / waiting routines

int MPI_Wait( MPI_Request *request, MPI_Status *status )
  request: handle to identify the communication
  status: status information (cf. 'normal' MPI_Recv)

int MPI_Test( MPI_Request *request, int *flag, MPI_Status *status )
  Returns immediately. Sets flag = true if the communication has completed.

int MPI_Waitany( int count, MPI_Request *array_of_requests, int *index, MPI_Status *status )
  Waits for exactly one communication to complete.
  If more than one communication has completed, it picks a random one.
  index returns the index of the completed communication.

int MPI_Testany( int count, MPI_Request *array_of_requests, int *index, int *flag, MPI_Status *status )
  Returns immediately. Sets flag = true if at least one communication completed.
  If more than one communication has completed, it picks a random one.
  index returns the index of the completed communication.
  If flag = false, index returns MPI_UNDEFINED.
Example: client-server code

    if ( rank != 0 ) { // client code
        while ( true ) { // generate requests and send to the server
            generate_request( data, &size );
            MPI_Send( data, size, MPI_CHAR, 0, tag, MPI_COMM_WORLD );
        }
    } else { // server code (rank == 0)
        MPI_Request *reqList = new MPI_Request[nProc];
        for ( int i = 0; i < nProc - 1; i++ )
            MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                       MPI_COMM_WORLD, &reqList[i] );
        while ( true ) { // main consumer loop
            MPI_Status status;
            int reqIndex, recvSize;
            MPI_Waitany( nProc-1, reqList, &reqIndex, &status );
            MPI_Get_count( &status, MPI_CHAR, &recvSize );
            do_service( buffer[reqIndex].data, recvSize );
            MPI_Irecv( buffer[reqIndex].data, MAX_LEN, MPI_CHAR,
                       status.MPI_SOURCE, tag, MPI_COMM_WORLD, &reqList[reqIndex] );
        }
    }
Non-blocking communications: polling / waiting routines (cont'd)

int MPI_Waitall( int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses )
  Waits for all communications to complete.

int MPI_Testall( int count, MPI_Request *array_of_requests, int *flag, MPI_Status *array_of_statuses )
  Returns immediately. Sets flag = true if all communications have completed.

int MPI_Waitsome( int incount, MPI_Request *array_of_requests, int *outcount, int *array_of_indices, MPI_Status *array_of_statuses )
  Waits for at least one communication to complete.
  outcount contains the number of communications that have completed.
  Completed requests are set to MPI_REQUEST_NULL.

int MPI_Testsome( int incount, MPI_Request *array_of_requests, int *outcount, int *array_of_indices, MPI_Status *array_of_statuses )
  Same as Waitsome, but returns immediately.
  The flag field is no longer needed; it returns outcount = 0 if there are no completed communications.
Example: improved client-server code

    if ( rank != 0 ) { // same client code
        ...
    } else { // server code (rank == 0)
        MPI_Request *reqList = new MPI_Request[nProc-1];
        MPI_Status *status = new MPI_Status[nProc-1];
        int *reqIndex = new int[nProc-1];
        for ( int i = 0; i < nProc - 1; i++ )
            MPI_Irecv( buffer[i].data, MAX_LEN, MPI_CHAR, i+1, tag,
                       MPI_COMM_WORLD, &reqList[i] );
        while ( true ) { // main consumer loop
            int numMsg;
            MPI_Waitsome( nProc-1, reqList, &numMsg, reqIndex, status );
            for ( int i = 0; i < numMsg; i++ ) {
                MPI_Get_count( &status[i], MPI_CHAR, &recvSize );
                do_service( buffer[reqIndex[i]].data, recvSize );
                MPI_Irecv( buffer[reqIndex[i]].data, MAX_LEN, MPI_CHAR,
                           status[i].MPI_SOURCE, tag, MPI_COMM_WORLD,
                           &reqList[reqIndex[i]] );
            }
        }
    }
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Barrier synchronization

MPI_Barrier(MPI_Comm comm)
This function does not return until all processes in comm have called it.

[Diagram: processes (by rank) computing, calling MPI_Barrier at different times, waiting (idle time, process dependent) until all have arrived, then continuing]
Broadcast

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
MPI_Bcast broadcasts count elements of type datatype stored in buffer at the root process to all other processes in comm, where this data is stored in buffer.

[Diagram: before the call, only the buffer at the root process (count elements) contains useful data; the buffer at the non-root processes p0 ... p4 will be overwritten. After MPI_Bcast, the buffer at all processes contains the same data]
Broadcastexample...int rank, size;
... // init MPI and rank and size variables
int root = 0;char buffer[12];
if (rank == root) sprintf(buffer, “Hello world”);
MPI_Bcast(buffer, 12, MPI_CHAR, root, MPI_COMM_WORLD);
printf(“Process %d has %s stored in the buffer.\n”, buffer, rank);
...
john@doe ~]$ mpirun –np 4 ./broadcastProcess 1 has Hello World stored in the buffer.Process 0 has Hello World stored in the buffer.Process 3 has Hello World stored in the buffer.Process 2 has Hello World stored in the buffer.
fillthebufferattherootprocessonly
allprocessesmustcallMPI_Bcast
Broadcast algorithm
[Diagram: root process p0 distributing a message of n bytes to p1 ... p7]
• Linear algorithm: subsequent sending of n bytes from the root process to the P-1 other processes takes (α + βn)(P - 1) time.
• Binary tree algorithm takes only (α + βn)⌈log2 P⌉ time.
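To illustrate the tree idea, here is a minimal sketch built from point-to-point calls (an illustration only; it assumes root rank 0 and is not the algorithm an MPI library is guaranteed to use):

    void tree_bcast( char *buf, int n, MPI_Comm comm ) {
        int rank, size;
        MPI_Comm_rank( comm, &rank );
        MPI_Comm_size( comm, &size );
        // In round k (mask = 1, 2, 4, ...), every process that already holds the
        // data sends it to the process 'mask' ranks away; the number of holders
        // doubles each round, so ceil(log2 P) rounds suffice.
        for (int mask = 1; mask < size; mask <<= 1) {
            if (rank < mask) {
                if (rank + mask < size)
                    MPI_Send( buf, n, MPI_CHAR, rank + mask, 0, comm );
            } else if (rank < 2 * mask) {
                MPI_Recv( buf, n, MPI_CHAR, rank - mask, 0, comm, MPI_STATUS_IGNORE );
            }
        }
    }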
Scatter

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendType, void *recvbuf, int recvcount, MPI_Datatype recvType, int root, MPI_Comm comm)
MPI_Scatter partitions a sendbuf at the root process into P equal parts of size sendcount and sends each process in comm (including root) a portion in rank order.

[Diagram: root = p1 holds the send buffer d0 d1 d2 d3 d4 (sendcount elements per block; the send buffer only matters at the root process); after MPI_Scatter, process pi has di in its receive buffer (recvcount elements)]
Scatter example

    int root = 0;
    char recvBuf[7];
    if (rank == root) {
        char sendBuf[25];                                     // fill the send buffer at the
        sprintf(sendBuf, "This is the source data.");         // root process only
        MPI_Scatter(sendBuf, 6, MPI_CHAR, recvBuf, 6, MPI_CHAR,
                    root, MPI_COMM_WORLD);
    } else {
        MPI_Scatter(NULL, 0, MPI_CHAR, recvBuf, 6, MPI_CHAR,  // first three parameters are
                    root, MPI_COMM_WORLD);                    // ignored on non-root processes
    }
    recvBuf[6] = '\0';
    printf("Process %d has %s in receive buffer\n", rank, recvBuf);
    ...

    [john@doe ~]$ mpirun -np 4 ./scatter
    Process 1 has s the  in receive buffer
    Process 0 has This i in receive buffer
    Process 3 has  data. in receive buffer
    Process 2 has source in receive buffer
Scatteralgorithm
p1
p0
p2
p4
p3
p5
p6
p7
processrank
• Linearalgorithm,subsequentsendingofnbytesfromrootprocesstoP-1otherprocessestakes(a +bn)(P- 1)time.
• Binaryalgorithm takesonlya élog2Pù +bn(P-1)time(reducednumberofcommunicationrounds!)
oneblock=nbytes
Scatter (vector variant)

MPI_Scatterv(void *sendbuf, int *sendcnts, int *displs, MPI_Datatype sendType, void *recvbuf, int recvcnt, MPI_Datatype recvType, int root, MPI_Comm comm)
The partitions of sendbuf don't need to be of equal size and are specified per receiving process: their first index by the displs array, their size by the sendcnts array.

[Diagram: root = p1 holds the send buffer (only matters at the root process); each process pi receives sendcnts[i] elements starting at offset displs[i] into its receive buffer (recvcount elements)]
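A minimal sketch (not from the slides, and assuming MPI has already been initialized) that uses MPI_Scatterv to give process i exactly i+1 integers out of a buffer prepared at the root:

    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    int *sendcnts = NULL, *displs = NULL, *sendbuf = NULL;
    if (rank == 0) {                          // root prepares counts, displacements and data
        sendcnts = (int*) malloc( size * sizeof(int) );
        displs   = (int*) malloc( size * sizeof(int) );
        int total = 0;
        for (int i = 0; i < size; i++) {
            sendcnts[i] = i + 1;              // process i gets i+1 elements
            displs[i]   = total;              // starting offset of its block
            total      += sendcnts[i];
        }
        sendbuf = (int*) malloc( total * sizeof(int) );
        for (int i = 0; i < total; i++) sendbuf[i] = i;
    }

    int recvcnt = rank + 1;
    int *recvbuf = (int*) malloc( recvcnt * sizeof(int) );
    MPI_Scatterv( sendbuf, sendcnts, displs, MPI_INT,
                  recvbuf, recvcnt, MPI_INT, 0, MPI_COMM_WORLD );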
Gather

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendType, void *recvbuf, int recvcount, MPI_Datatype recvType, int root, MPI_Comm comm)
MPI_Gather gathers equal partitions of size recvcount from each of the P processes in comm (including root) and stores them in recvbuf at the root process in rank order.

[Diagram: each process pi holds di in its send buffer (sendcount elements); after MPI_Gather, the receive buffer at root = p1 (which only matters at the root process) contains d0 d1 d2 d3 d4 (recvcount elements per block)]

Gather is the opposite of Scatter.
A vector variant, MPI_Gatherv, exists: a similar generalization as MPI_Scatterv.
Gather example

    int root = 0;
    int sendBuf = rank;
    if (rank == root) {
        int *recvBuf = new int[size];                        // receive buffer exists at the
        MPI_Gather(&sendBuf, 1, MPI_INT, recvBuf, 1, MPI_INT, // root process only
                   root, MPI_COMM_WORLD);
        cout << "Receive buffer at root process: " << endl;
        for (size_t i = 0; i < size; i++)
            cout << recvBuf[i] << " ";
        cout << endl;
        delete [] recvBuf;
    } else {
        MPI_Gather(&sendBuf, 1, MPI_INT, NULL, 1, MPI_INT,   // receive parameters are ignored
                   root, MPI_COMM_WORLD);                    // on non-root processes
    }

    [john@doe ~]$ mpirun -np 4 ./gather
    Receive buffer at root process:
    0 1 2 3
Gather algorithm
[Diagram: processes p1 ... p7 sending their blocks (one block = n bytes) to root process p0]
• Linear algorithm: subsequent sending of n bytes from the P-1 processes to the root takes (α + βn)(P - 1) time.
• Binary algorithm takes only α⌈log2 P⌉ + βn(P - 1) time (reduced number of communication rounds!)
Allgather

MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendType, void *recvbuf, int recvcnt, MPI_Datatype recvType, MPI_Comm comm)
MPI_Allgather is a generalization of MPI_Gather, in the sense that the data is gathered by all processes, instead of just the root process.

[Diagram: each process pi holds di in its send buffer (sendcount elements); after MPI_Allgather, the receive buffer at every process contains d0 d1 d2 d3 d4 (recvcount elements per block)]
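A minimal sketch (not from the slides): every process contributes its rank and afterwards holds the full list of ranks.

    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    int sendBuf = rank;
    int *recvBuf = new int[size];   // every process needs a full-size receive buffer

    // each process sends one int; afterwards recvBuf = {0, 1, ..., size-1} everywhere
    MPI_Allgather( &sendBuf, 1, MPI_INT, recvBuf, 1, MPI_INT, MPI_COMM_WORLD );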
Allgather algorithm
[Diagram: butterfly exchange among p0 ... p7; blocks of n bytes, 2n bytes and 4n bytes are exchanged in successive rounds (one block = n bytes)]
• P calls to gather take P·[α⌈log2 P⌉ + βn(P - 1)] time (using the best gather algorithm)
• Gather followed by broadcast takes 2α⌈log2 P⌉ + βn(P⌈log2 P⌉ + P - 1) time
• The "butterfly" algorithm takes only α·log2 P + βn(P - 1) time (in case P is a power of two)
All-to-all communication

MPI_Alltoall(void *sendbuf, int sendcnt, MPI_Datatype sendType, void *recvbuf, int recvcnt, MPI_Datatype recvType, MPI_Comm comm)
Using MPI_Alltoall, every process sends a distinct message to every other process.

[Diagram: before the call, p0 holds a0 a1 a2 a3 a4, p1 holds b0 ... b4, ..., p4 holds e0 ... e4 in its send buffer (sendcount elements per block); after MPI_Alltoall, p0 holds a0 b0 c0 d0 e0, p1 holds a1 b1 c1 d1 e1, ..., p4 holds a4 b4 c4 d4 e4 in its receive buffer (recvcount elements per block)]

A vector variant, MPI_Alltoallv, exists, allowing for different sizes for each process.
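A minimal sketch (not from the slides): each process sends one int to every other process; element j of sendBuf goes to process j, and element i of recvBuf comes from process i.

    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    int *sendBuf = new int[size];
    int *recvBuf = new int[size];
    for (int j = 0; j < size; j++)
        sendBuf[j] = 100 * rank + j;   // value destined for process j

    // after the call, recvBuf[i] == 100 * i + rank on every process
    MPI_Alltoall( sendBuf, 1, MPI_INT, recvBuf, 1, MPI_INT, MPI_COMM_WORLD );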
Reduce

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype dataType, MPI_Op op, int root, MPI_Comm comm)
The reduce operation aggregates ("reduces") scattered data at the root process.

[Diagram: each process pi holds di in its send buffer (count elements); after MPI_Reduce, the receive buffer at root = p0 contains d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4 (count elements), where ◊ = an operation such as sum, product, maximum, etc.]
Reduceoperations
MPI_MAX maximum
MPI_MIN minimum
MPI_SUM sum
MPI_PROD product
MPI_LAND logical AND
MPI_BAND bitwise AND
MPI_LOR logicalOR
MPI_BOR bitwiseOR
MPI_LXOR logicalexclusiveOR
MPI_BXOR bitwise exclusiveOR
MPI_MAXLOC maximumanditslocation
MPI_MINLOC maximumanditslocation
Availablereduceoperations(associativeandcommutative)Userdefinedoperationsarealsopossible
Allreduce operation

MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)
Similar to the reduce operation, but the result is available on every process.

[Diagram: each process pi holds di in its send buffer (count elements); after MPI_Allreduce, the receive buffer at every process contains d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4]
Scan operation

MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype dataType, MPI_Op op, MPI_Comm comm)
A scan performs a partial reduction of the data; every process has a distinct result.

[Diagram: each process pi holds di in its send buffer (count elements); after MPI_Scan, p0 holds d0, p1 holds d0 ◊ d1, p2 holds d0 ◊ d1 ◊ d2, p3 holds d0 ◊ d1 ◊ d2 ◊ d3, and p4 holds d0 ◊ d1 ◊ d2 ◊ d3 ◊ d4 in its receive buffer]
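A minimal sketch (not from the slides): an inclusive prefix sum over per-process counts, e.g. useful to compute the offset at which each process should write its part of a result.

    int rank;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    int localCount = rank + 1;   // e.g. number of items this process produced
    int prefixSum = 0;

    // inclusive prefix sum: process i receives localCount_0 + ... + localCount_i
    MPI_Scan( &localCount, &prefixSum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );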
Hands-on: matrix-vector multiplication

Master: coordinates the work of the others
Slave: does a bit of work

Task: compute A.b
A: double-precision (m x n) matrix
b: double-precision (n x 1) column matrix

Master algorithm:
1. Broadcast b to each slave
2. Send 1 row of A to each slave
3. while ( not all m results received ) {
     Receive result from any slave s
     if ( not all m rows sent )
       Send new row to slave s
     else
       Send termination message to s
   }
4. continue

Slave algorithm:
1. Broadcast b (in fact receive b)
2. do {
     Receive message m
     if ( m != termination )
       compute result
       send result to master
   } while ( m != termination )
3. slave terminates
Matrix-vector multiplication

    int main( int argc, char** argv ) {
        int rows = 100, cols = 100;   // dimensions of a
        double **a;
        double *b, *c;
        int master = 0;               // rank of master
        int myid;                     // rank of this process
        int numprocs;                 // number of processes
        // allocate memory for a, b and c
        a = (double**) malloc( rows * sizeof(double*) );
        for( int i = 0; i < rows; i++ )
            a[i] = (double*) malloc( cols * sizeof(double) );
        b = (double*) malloc( cols * sizeof(double) );
        c = (double*) malloc( rows * sizeof(double) );
        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &myid );
        MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
        if( myid == master )
            // execute master code
        else
            // execute slave code
        MPI_Finalize();
    }
Matrix-vector multiplication

    // initialize a and b
    for( int j = 0; j < cols; j++ ) {
        b[j] = 1.0;
        for( int i = 0; i < rows; i++ ) a[i][j] = i;
    }
    // broadcast b to each slave
    MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );
    // send a row of a to each slave, tag = row number
    int numsent = 0;
    for( int i = 0; (i < numprocs-1) && (i < rows); i++ ) {
        MPI_Send( a[i], cols, MPI_DOUBLE_PRECISION, i+1, i, MPI_COMM_WORLD );
        numsent++;
    }
    for( int i = 0; i < rows; i++ ) {
        MPI_Status status; double ans; int sender;
        MPI_Recv( &ans, 1, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &status );
        c[status.MPI_TAG] = ans;
        sender = status.MPI_SOURCE;
        if ( numsent < rows ) { // send more work if any
            MPI_Send( a[numsent], cols, MPI_DOUBLE_PRECISION,
                      sender, numsent, MPI_COMM_WORLD );
            numsent++;
        } else // send termination message
            MPI_Send( MPI_BOTTOM, 0, MPI_DOUBLE_PRECISION, sender,
                      rows, MPI_COMM_WORLD );
    }
Matrix-vector multiplication

    // broadcast b to each slave (receive here)
    MPI_Bcast( b, cols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD );
    // receive rows of a from the master, tag = row number
    if( myid <= rows ) {
        double *buffer = (double*) malloc( cols * sizeof(double) );
        while (true) {
            MPI_Status status;
            MPI_Recv( buffer, cols, MPI_DOUBLE_PRECISION, master,
                      MPI_ANY_TAG, MPI_COMM_WORLD, &status );
            if( status.MPI_TAG != rows ) { // not a termination message
                double ans = 0.0;
                for( int i = 0; i < cols; i++ )
                    ans += buffer[i] * b[i];
                MPI_Send( &ans, 1, MPI_DOUBLE_PRECISION, master,
                          status.MPI_TAG, MPI_COMM_WORLD );
            } else
                break;
        }
    } // more processes than rows => no work for some nodes
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Network cost modeling
• Simple performance model for multiple communications
  – Bisection bandwidth: sum of the bandwidths of the (average number of) links to cut to partition the network into two halves
  – Diameter: maximum number of hops to connect any two devices
[Diagram: network split into two halves by a bisection]
Bus topology
Bus = communication channel that is shared by all connected devices
• No more than two devices can communicate at any given time
• Hardware controls which devices have access
• High risk of contention when multiple devices try to access the bus simultaneously. The bus is a "blocking" interconnect.
[Diagram: devices attached to a shared bus, with the bisection indicated]
Bus network properties:
• Bisection bandwidth = point-to-point bandwidth (independent of the number of devices)
• Diameter = 1 (single hop)
Crossbar switch
[Diagram: input devices i1 ... i4 and output devices o1 ... o4 connected through a grid of switches]
A crossbar switch can connect any combination of input/output pairs at full bandwidth = fully non-blocking.
Static routing in a crossbar switch
[Diagram: input devices i1 ... i4 and output devices o1 ... o4; each switch is either in state A (pass-through) or state B (turn)]
Switches at the crossing of paired inputs/outputs should be in state B; other switches should be in state A. Path selection is fixed = static routing.

Input-output pairings:
i1 → o2, i2 → o4, i3 → o1, i4 → o3
Crossbar switch for distributed-memory systems
• Crossbar implementation of a switch with P = 4 machines
• Fully non-blocking
• Expensive: P² switches and O(P²) wires
• Usually implemented in a switch
[Diagram: machines M1 ... M4 connected through a 4-port non-blocking switch]
Clos switching networks
[Diagram: n inputs and n outputs connected through an input stage, a middle stage and an output stage, each built from 4x4 crossbar (CB) switches]
• Built in three stages, using crossbar switches as components
• Every output of an input-stage switch is connected to a different middle-stage switch
• Every input can be connected to every output, using any one of the middle-stage switches
• All input/output pairs can communicate simultaneously (non-blocking)
• Requires far less wiring than conventional CB switches
Clos switching networks
[Diagram, three animation steps: the three-stage Clos network with connections A, B, C, D already routed and a new request E]
• Built in three stages, using crossbar switches as components
• Every output of an input-stage switch is connected to a different middle-stage switch
• Every input can be connected to every output, using any one of the middle-stage switches
• All input/output pairs can communicate simultaneously (non-blocking)
• Requires far less wiring than conventional CB switches
• Adaptive routing may be necessary: at first no connection can be found for E; after rerouting an existing path, E can be connected (shown in three steps on the slides)
Switch examples
[Photos: an Infiniband switch (24 ports) and a Gigabit Ethernet switch (24 ports), both non-blocking switches]
Mesh networks
[Diagram: 2D mesh of nodes, with the bisection indicated]
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
Basic performance terminology
• Runtime ("How long does it take to run my program?")
  § In practice, only wall clock time matters
  § Depends on the number of parallel processes P
  § TP = runtime using P processes
• Speedup ("How much faster does my program run in parallel?")
  § SP = T1 / TP
  § In the ideal case, SP = P
  § Superlinear speedups are usually due to cache effects
• Parallel efficiency ("How well is the parallel infrastructure used?")
  § ηP = SP / P   (0 ≤ ηP ≤ 1)
  § In the ideal case, ηP = 1 = 100%
  § It depends on the application what is considered acceptable efficiency
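A quick worked example (the numbers are hypothetical): if a program takes T1 = 120 s on one process and T8 = 20 s on P = 8 processes, then S8 = 120 / 20 = 6 and η8 = 6 / 8 = 75%.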
Strong scaling: Amdahl's law
• Strong scaling = increasing the number of parallel processes for a fixed-size problem
• Simple model: partition the sequential runtime T1 into a parallelizable fraction (1-s) and an inherently sequential fraction (s)
  § T1 = s·T1 + (1-s)·T1   (0 ≤ s ≤ 1)
  § Therefore, TP = s·T1 + (1-s)·T1 / P
[Diagram: bars for T1, T2, T3, T4, ..., T∞, each split into a sequential part s·T1 and a parallel part (1-s)·T1 that shrinks as P grows]
Strong scaling: Amdahl's law
• Consequently, SP = T1 / TP = T1 / (s·T1 + (1-s)·T1/P) = 1 / (s + (1-s)/P)   (Amdahl's law)
• The speedup is bounded: S∞ = 1/s
  § e.g. s = 1%, S∞ = 100
  § e.g. s = 5%, S∞ = 20
• Sources of the sequential fraction s·T1:
  § Process startup overhead
  § Inherently sequential portions of the code
  § Dependencies between subtasks
  § Communication overhead
  § Function calling overhead
  § Load imbalance
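As a worked example (values chosen for illustration): with s = 5% and P = 16, SP = 1 / (0.05 + 0.95/16) ≈ 9.1, i.e. a parallel efficiency of only about 57%, even though the code is 95% parallelizable.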
Amdahl's law
[Graph: parallel speedup (0 to 30) versus number of parallel processes P (0 to 30) for s = 0%, 1%, 2%, 5%; for small P the curves stay close to the linear-speedup line]
Amdahl's law
[Same graph as on the previous slide, but with a logarithmic x-axis up to P = 65536: the speedup saturates at S∞ = 1/s]
Amdahl's law
[Same graph as on the previous slide, but expressed as parallel efficiency (0 to 1) versus the number of parallel processes P, for s = 0%, 1%, 2%, 5%]
Weak scaling: Gustafson's law
• Weak scaling: increasing both the number of parallel processes P and the problem size N
• Simple model: assume that the sequential part is constant, and that the parallelizable part is proportional to N
  § T1(N) = T1(1)·[s + (1-s)·N]   (0 ≤ s ≤ 1)
[Diagram: bars for T1(1), T1(2), T1(3), T1(4), each consisting of a constant sequential part s·T1(1) and a parallelizable part that grows proportionally to N]
Gustafson's law
• Solve increasingly larger problems using a proportionally higher number of processes (N = P)
• Therefore, TP(N) = T1(1)·[s + (1-s)·N/P], and hence TP(P) = T1(1)
• Speedup SP(P) = s + (1-s)·P = P - s·(P-1)   (Gustafson's law)
[Diagram: bars for T1(1), T2(2), T3(3), T4(4), each with the same sequential part s·T1(1) and a parallelizable part (1-s)·T1(1) executed in parallel]
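As a worked example (values chosen for illustration): with s = 5% and N = P = 100, SP = 100 - 0.05·99 ≈ 95, i.e. a parallel efficiency of about 95%, in contrast to the strong-scaling case above.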
Gustafson's law
[Graph: parallel speedup (0 to 250) versus number of parallel processes P = problem size N (0 to 250) for s = 0%, 1%, 2%, 5%]
![Page 97: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/97.jpg)
Gustafson’slaw
[Plot: same graph as on the previous slide, but expressed as parallel efficiency on a logarithmic x-axis; the efficiency tends to η_∞ = 1 - s]
![Page 98: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/98.jpg)
Outline
• Distributed-memory architecture: general considerations
• Programming model: Message Passing Interface (MPI)
  § Point-to-point communication
    o Blocking communication
    o Point-to-point network performance
    o Non-blocking communication
  § Collective communication
    o Collective communication algorithms
    o Global network performance
• Parallel program performance evaluation
  § Amdahl's law
  § Gustafson's law
• Parallel program development: case studies
![Page 99: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/99.jpg)
Case Study 1: Parallel matrix-matrix product
![Page 100: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/100.jpg)
Case study: parallel matrix multiplication
• Matrix-matrix multiplication: C = α·A·B + β·C (BLAS xgemm)
  § Assume C = m x n; A = m x k; B = k x n matrix.
  § α, β = scalars; assume α = β = 1 in what follows, i.e. C = C + A·B
• Initially, matrix elements are distributed among P processes
  § Assume the same scheme for each matrix A, B and C
• Each process computes the values of C that are local to that process
  § Required data from A and B that is not local needs to be communicated
  § Performance modeling assuming the α, β, γ model:
    o α = latency
    o β = per-element transfer time
    o γ = time for a single floating point computation
Case study reproduced from J. Demmel
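As a side note (not on the slides), the α, β, γ model boils down to two simple cost functions that are reused in the analysis below; a hedged C sketch:

/* alpha-beta-gamma performance model:
   sending a message of m elements costs alpha + beta*m,
   performing nflops floating point operations costs gamma*nflops */
double msg_time(double alpha, double beta, long m) { return alpha + beta * m; }
double comp_time(double gamma, long nflops)        { return gamma * nflops; }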
![Page 101: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/101.jpg)
Case study: parallel matrix-matrix product
• Two-dimensional partitioning
  § Partition the matrices in 2D in an r x c mesh (P = r·c)
  § X(I,J) refers to block (I,J) of matrix X (X = {A, B, C})
  § Process p_n is also denoted by p_i,j (n = i·c + j) and holds X(I,J)
[Diagram: the k x n matrix B partitioned over the r x c process mesh p_0,0 ... p_r-1,c-1; process p_i,j holds block B(I,J)]
Case study reproduced from J. Demmel
![Page 102: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/102.jpg)
Case study: parallel matrix-matrix product
• Second approach: two-dimensional partitioning (SUMMA)
  § Process p_i,j needs to compute C(I,J):
C(I,J) = C(I,J) + Σ_{i=0..k-1} A(I,i) · B(i,J)    ∀I = 0…r-1, ∀J = 0…c-1
[Diagram: C(I,J) += A(I,i)·B(i,J) for the full matrices C, A and B; the blocks C(I,J), A(I,J) and B(I,J) are local to p_i,j, the remaining blocks of process row I of A and of process column J of B need to be communicated to p_i,j, all other data is not needed by p_i,j; index i refers to a single column of A or a single row of B (k elements); all blocks (I,J) are computed in parallel]
Case study reproduced from J. Demmel
![Page 103: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/103.jpg)
Case study: parallel matrix-matrix product
• Second approach: two-dimensional partitioning
  § SUMMA algorithm (all I and J in parallel)
  § Cost of the inner loop (executed k times):
    o log₂c·(α + β·m/r) + log₂r·(α + β·n/c) + 2mnγ/P
  § Total T_P = 2kmnγ/P + kα·(log₂c + log₂r) + kβ·((m/r)·log₂c + (n/c)·log₂r)
  § For n = m = k and r = c = √P we find:
    Parallel efficiency η_P = 1 / (1 + (α/γ)·(P·log₂P)/(2n²) + (β/γ)·(√P·log₂P)/n)
    Isoefficiency when n grows as √P (constant memory per node!)
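To make the cost model concrete, a hedged C helper (not from the slides) that evaluates the predicted runtime for the square case n = m = k on a √P x √P grid:

#include <math.h>

/* Predicted SUMMA runtime for n = m = k on an r x c = sqrt(P) x sqrt(P) grid:
   T_P = 2*n^3*gamma/P + n*alpha*log2(P) + n^2*beta*log2(P)/sqrt(P)           */
double summa_runtime(double alpha, double beta, double gamma, double n, double P) {
    double logP = log2(P);
    return 2.0 * n * n * n * gamma / P          /* local computation      */
         + n * alpha * logP                     /* broadcast latency term */
         + n * n * beta * logP / sqrt(P);       /* broadcast volume term  */
}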
Case study reproduced from J. Demmel
for i = 0 to k-1
    broadcast A(I, i) within process row
    broadcast B(i, J) within process column
    C(I,J) += A(I, i) * B(i, J)
endfor
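A minimal MPI sketch of this loop, assuming for simplicity that every process owns one square nb x nb block of A, B and C on a q x q grid, and that the broadcasts use row/column communicators created with MPI_Comm_split (function and variable names are illustrative, not from the slides; the sketch broadcasts whole blocks, i.e. the blocked variant on the next slide with step b = nb):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* One SUMMA step per iteration: broadcast A(I,i) along the process row and
   B(i,J) along the process column, then accumulate into the local C block. */
void summa(const double *A, const double *B, double *C, int nb, int q,
           MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int myRow = rank / q, myCol = rank % q;

    MPI_Comm rowComm, colComm;
    MPI_Comm_split(comm, myRow, myCol, &rowComm);   /* processes in my row    */
    MPI_Comm_split(comm, myCol, myRow, &colComm);   /* processes in my column */

    double *Abuf = malloc(nb * nb * sizeof(double));
    double *Bbuf = malloc(nb * nb * sizeof(double));

    for (int i = 0; i < q; i++) {                   /* block index along k */
        if (myCol == i) memcpy(Abuf, A, nb * nb * sizeof(double));
        if (myRow == i) memcpy(Bbuf, B, nb * nb * sizeof(double));
        MPI_Bcast(Abuf, nb * nb, MPI_DOUBLE, i, rowComm);
        MPI_Bcast(Bbuf, nb * nb, MPI_DOUBLE, i, colComm);

        /* local C(I,J) += A(I,i) * B(i,J); in practice use a Level-3 BLAS gemm */
        for (int r = 0; r < nb; r++)
            for (int c = 0; c < nb; c++)
                for (int k = 0; k < nb; k++)
                    C[r * nb + c] += Abuf[r * nb + k] * Bbuf[k * nb + c];
    }
    free(Abuf); free(Bbuf);
    MPI_Comm_free(&rowComm);
    MPI_Comm_free(&colComm);
}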
![Page 104: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/104.jpg)
Case study: parallel matrix-matrix product
• Even more efficient: use the "blocked" algorithm
• The SUMMA algorithm is implemented in PBLAS = Parallel BLAS
• The algorithm can be extended to a block-cyclic layout (see further)

for i = 0 to k-1 step b
    end = min(i+b-1, k-1)
    broadcast A(I, i:end) within process row
    broadcast B(i:end, J) within process column
    C(I,J) += A(I, i:end) * B(i:end, J)    <- perform this product using Level-3 BLAS
endfor
![Page 105: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/105.jpg)
Case study: parallel Gaussian elimination
Image taken from J. Demmel
![Page 106: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/106.jpg)
Case Study 2: Parallel Sorting
![Page 107: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/107.jpg)
Case study: parallel sorting algorithm
• Sequential sorting of n keys (1st Bachelor)
  § Bubble sort: O(n²)
  § Merge sort: O(n log n), even in the worst case
  § Quicksort: O(n log n) expected, O(n²) worst case, fast in practice
• Parallel sorting of n keys, using P processes
  § Initially, each process holds n/P keys (unsorted)
  § Eventually, each process holds n/P keys (sorted)
    o Keys per process are sorted
    o If q < r, each key assigned to process q is less than or equal to every key assigned to process r (sort keys in rank order)
![Page 108: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/108.jpg)
Case study: parallel sorting algorithm

void Bubble_sort(int *a, int n) {
    for (int listLen = n; listLen >= 2; listLen--)
        for (int i = 0; i < listLen-1; i++)
            if (a[i] > a[i+1]) {            // "compare-swap" operation
                int temp = a[i];
                a[i] = a[i+1];
                a[i+1] = temp;
            }
}

Bubble sort algorithm: O(n²)
Example: 5 2 4 8 1 → 2 4 5 1 8 → 2 4 1 5 8 → 2 1 4 5 8 → 1 2 4 5 8
(state after iterations 1 through 4; the sorted tail grows by one element per iteration)
Algorithm reproduced from P. Pacheco
![Page 109: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/109.jpg)
Case study: parallel sorting algorithm
• Bubble sort
  § The result of the current step (a[i] > a[i+1]) depends on the previous step
    o The value of a[i] is determined by the previous step
    o The algorithm is "inherently serial"
    o Not much point in trying to parallelize this algorithm
• Odd-even transposition sort
  § Decouple the algorithm into two phases: even and odd
    o Even phase: compare-swap on the following elements: (a[0],a[1]), (a[2],a[3]), (a[4],a[5]), …
    o Odd phase: compare-swap operations on the following elements: (a[1],a[2]), (a[3],a[4]), (a[5],a[6]), …
![Page 110: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/110.jpg)
Case study: parallel sorting algorithm

void Even_odd_sort(int *a, int n) {
    for (int phase = 0; phase < n; phase++)
        if (phase % 2 == 0) {                    // even phase
            for (int i = 0; i < n-1; i += 2)
                if (a[i] > a[i+1]) {             // "compare-swap" operation
                    int temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
                }
        } else {                                 // odd phase
            for (int i = 1; i < n-1; i += 2)
                if (a[i] > a[i+1]) {             // "compare-swap" operation
                    int temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
                }
        }
}

Even-odd transposition sort algorithm: O(n²)
Algorithm reproduced from P. Pacheco
![Page 111: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/111.jpg)
Case study: parallel sorting algorithm
Example: 5 2 4 8 1 → 2 5 4 8 1 → 2 4 5 1 8 → 2 4 1 5 8 → 2 1 4 5 8 → 1 2 4 5 8
(even phase, odd phase, even phase, odd phase, even phase)
Even-odd transposition sort algorithm: O(n²)
Parallelism within each even or odd phase is now obvious: the compare-swap between (a[i], a[i+1]) is independent from (a[i+2], a[i+3])
![Page 112: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/112.jpg)
Case study: parallel sorting algorithm
Parallel algorithm
• First, assume P == n (one element per process)
• Communicate the value with the neighbor:
  o the right process (highest rank) keeps the largest value
  o the left process (lowest rank) keeps the smallest value
[Diagram: elements ..., a[i-2], a[i-1], a[i], a[i+1], ...; during phase j the compare-swaps on one set of neighbor pairs execute in parallel, during phase j+1 those on the other set]
Image reproduced from P. Pacheco
![Page 113: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/113.jpg)
Case study: parallel sorting algorithm
Parallel algorithm
• Now, assume n/P >> 1 (as is typically the case)
• Example: P = 4; n = 16

                  process 0     process 1     process 2     process 3
initial values    15,11,9,16    3,14,8,7      4,6,12,10     5,2,13,1
local sorting     9,11,15,16    3,7,8,14      4,6,10,12     1,2,5,13
phase 0 (even)    3,7,8,9       11,14,15,16   1,2,4,5       6,10,12,13
phase 1 (odd)     3,7,8,9       1,2,4,5       11,14,15,16   6,10,12,13
phase 2 (even)    1,2,3,4       5,7,8,9       6,10,11,12    13,14,15,16
phase 3 (odd)     1,2,3,4       5,6,7,8       9,10,11,12    13,14,15,16

Algorithm reproduced from P. Pacheco
![Page 114: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/114.jpg)
Case study: parallel sorting algorithm

sort local keys
for (int phase = 0; phase < P; phase++) {
    neighbor = computeNeighbor(phase, myRank);
    if (I'm not idle) {          // first and/or last process may be idle
        send all my keys to neighbor
        receive all keys from neighbor
        if (myRank < neighbor)
            keep smaller keys
        else
            keep larger keys
    }
}

Parallel even-odd transposition sort pseudocode
Algorithm reproduced from P. Pacheco
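The "keep smaller/larger keys" step can be implemented as a partial merge of the two sorted key arrays; a hedged C sketch (the helper names are illustrative and do not appear in Pacheco's code):

/* After the exchange, myKeys and recvKeys each hold m = n/P sorted keys.
   The lower-ranked process keeps the m smallest of the 2m keys,
   the higher-ranked process keeps the m largest.                    */
void keepSmallerKeys(int *myKeys, const int *recvKeys, int m, int *tmp) {
    int i = 0, j = 0;
    for (int k = 0; k < m; k++)                  /* merge from the front */
        tmp[k] = (j >= m || (i < m && myKeys[i] <= recvKeys[j]))
                 ? myKeys[i++] : recvKeys[j++];
    for (int k = 0; k < m; k++) myKeys[k] = tmp[k];
}

void keepLargerKeys(int *myKeys, const int *recvKeys, int m, int *tmp) {
    int i = m - 1, j = m - 1;
    for (int k = m - 1; k >= 0; k--)             /* merge from the back */
        tmp[k] = (j < 0 || (i >= 0 && myKeys[i] >= recvKeys[j]))
                 ? myKeys[i--] : recvKeys[j--];
    for (int k = 0; k < m; k++) myKeys[k] = tmp[k];
}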
Theorem: the parallel odd-even transposition sort algorithm will sort the input list after P (= number of processes) phases.
![Page 115: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/115.jpg)
Case study: parallel sorting algorithm
int computeNeighbor(int phase, int myRank) {
    int neighbor;
    if (phase % 2 == 0) {                     // even phase: pairs (0,1), (2,3), ...
        if (myRank % 2 == 0) neighbor = myRank + 1;
        else                 neighbor = myRank - 1;
    } else {                                  // odd phase: pairs (1,2), (3,4), ...
        if (myRank % 2 == 0) neighbor = myRank - 1;
        else                 neighbor = myRank + 1;
    }
    if (neighbor == -1 || neighbor == P)      // no partner in this phase
        neighbor = MPI_PROC_NULL;
    return neighbor;
}
Implementation of computeNeighbor (MPI)
Algorithm reproduced from P. Pacheco
When MPI_PROC_NULL is used as the destination or source rank in MPI_Send or MPI_Recv, no communication takes place.
![Page 116: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/116.jpg)
Case study: parallel sorting algorithm

if (myRank % 2 == 0) {
    MPI_Send(...);
    MPI_Recv(...);
} else {
    MPI_Recv(...);
    MPI_Send(...);
}

Implementation of the data exchange in MPI
• Be careful of deadlocks
• In both even and odd phases, communication always takes place between a process with an even rank and a process with an odd rank
• Exchange the order to prevent deadlocks!
OR use MPI_Sendrecv(...)
Algorithm reproduced from P. Pacheco
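A hedged sketch of the exchange step using MPI_Sendrecv, which avoids the deadlock issue altogether (buffer and variable names are illustrative and tie in with the merge helpers sketched earlier):

if (neighbor != MPI_PROC_NULL) {                 /* idle processes skip this phase */
    /* exchange all n/P local keys with the phase partner in a single call;
       MPI_Sendrecv cannot deadlock on matching exchanges                     */
    MPI_Sendrecv(myKeys,   nLocal, MPI_INT, neighbor, 0,
                 recvKeys, nLocal, MPI_INT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (myRank < neighbor)
        keepSmallerKeys(myKeys, recvKeys, nLocal, tmp);
    else
        keepLargerKeys(myKeys, recvKeys, nLocal, tmp);
}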
![Page 117: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/117.jpg)
Case study: parallel sorting algorithm
Parallel odd-even transposition sort algorithm analysis
• Initial sorting: O((n/P)·log(n/P)) time
  • Use an efficient sequential sorting algorithm, e.g. quicksort or merge sort
• Per phase: 2·(α + (n/P)·β) + γ·n/P
• Total runtime T_P(n) = O((n/P)·log(n/P)) + 2·(α·P + n·β) + γ·n
                       = (1/P)·O(n·log n) + O(n)
• Linear speedup when P is small and n is large
• However, bad asymptotic behaviour
  • When n and P increase proportionally, the runtime per process is O(n)
  • What we really want: O(log n)
  • Difficult! (but possible!)
Algorithm reproduced from P. Pacheco
![Page 118: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/118.jpg)
Case study: parallel sorting algorithm
• Sorting networks (= graphical depiction of sorting algorithms)
  § A number of horizontal "wires" (= elements to sort)
  § Connected by vertical "comparators" (= compare and swap)
[Diagram: example sorting network for 4 elements; the horizontal wires carry the n elements to sort, time runs from left to right; each comparator takes inputs x and y and outputs min(x,y) on one wire and max(x,y) on the other]
![Page 119: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/119.jpg)
Case study: parallel sorting algorithm
[Diagram: the bubble sort algorithm drawn as a sorting network (sequential), and the same bubble sort network with independent comparators executed in parallel]
Sequential runtime = number of comparators = "size" of the sorting network = n·(n-1)/2
Parallel runtime (assume P == n processes) = "depth" of the sorting network = 2n - 3
![Page 120: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/120.jpg)
Case study: parallel sorting algorithm
Definition: depth of a sorting network (= parallel runtime)
• Zero at the inputs of each wire
• For a comparator with inputs of depth d1 and d2, the depth of its outputs is 1 + max(d1, d2)
• Depth of the sorting network = maximum depth of the outputs
[Diagram: the bubble sort network with depth labels on the wires; the inputs have depth 0 and the maximum output depth (= depth of the network) is 9 = 2n - 3 for n = 6]
![Page 121: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/121.jpg)
Case study: parallel sorting algorithm
• Sorting network of odd-even transposition sort
• Parallel runtime = "depth" of the sorting network = n
• … or P (we assume n == P)
Case study: parallel sorting algorithm
• Can we do better?
• Sequential sorting algorithms are O(n log n)
• Ideally, P and n can scale proportionally: P = O(n)
• That means we want to sort n numbers in O(log n) time
• This is possible(!), but with a big constant pre-factor
• We will describe an algorithm that can sort n numbers in O(log² n) parallel time (using P = O(n) processes)
• This algorithm has the best performance in practice
• … unless n becomes huge (n > 2^2000)
• Nobody wants to sort that many numbers
![Page 123: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/123.jpg)
Case study: parallel sorting algorithm
• Theorem: if a sorting network with n inputs sorts all 2^n binary strings of length n correctly, then it sorts all sequences correctly (proof: see references).
[Example diagram: a sorting network applied to a binary (0/1) input sequence]
We will design an algorithm that can sort binary sequences in O(log² n) time.
![Page 124: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/124.jpg)
Case study: parallel sorting algorithm
• Step 1: create a sorting network that sorts bitonic sequences
• Definition: a bitonic sequence is a sequence which is first increasing and then decreasing, or can be circularly shifted to become so.
  § (1,2,3,3.14,5,4,3,2,1) is bitonic
  § (4,5,4,3,2,1,1,2,3) is bitonic
  § (1,2,1,2) is not bitonic
• Over zeros and ones, a bitonic sequence is of the form
  § 0^i 1^j 0^k or 1^i 0^j 1^k (with e.g. 0^i = 000…0 = i consecutive zeros)
  § i, j or k can be zero
![Page 125: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/125.jpg)
Case study: parallel sorting algorithm
• Now, let's create a sorting network that sorts a bitonic sequence
• A half-cleaner network connects line i with line i + n/2
• If the input is a binary bitonic sequence, then for the output:
  § elements in the top half are smaller than the corresponding elements in the bottom half, i.e. the halves are relatively sorted
  § one of the halves of the output consists of only zeros or only ones (i.e. is "clean"), the other half is bitonic
Half-cleaner example (n = 8)
![Page 126: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/126.jpg)
Case study: parallel sorting algorithm
• Example of a half-cleaner network
[Diagram: half-cleaner applied to a binary bitonic input sequence (n = 8); in the output the upper half is element-wise smaller than the lower half, the upper half is "clean" and the lower half is bitonic]
Case study: parallel sorting algorithm
• Therefore, a bitonic-sorter[n] (i.e. a network that sorts a bitonic sequence of length n) is obtained as:
[Diagram: bitonic-sorter[n] = half-cleaner[n] followed by two bitonic-sorter[n/2] networks, one on each half]
A bitonic sorter sorts a bitonic sequence of length n = 2^k using
• size = n·k/2 = (n/2)·log₂n comparators (= sequential time)
• depth = k = log₂n (= parallel time)
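A minimal recursive C sketch of this bitonic sorter (not from the slides): the half-cleaner compares element i with element i + n/2, after which both halves are sorted recursively.

/* Sorts a[0..n-1] in place, assuming a[] is a bitonic sequence and n a power of 2.
   The recursion has depth log2(n): one parallel compare-swap stage per level.   */
void bitonic_merge(int *a, int n) {
    if (n <= 1) return;
    for (int i = 0; i < n / 2; i++)              /* half-cleaner[n] */
        if (a[i] > a[i + n / 2]) {
            int tmp = a[i]; a[i] = a[i + n / 2]; a[i + n / 2] = tmp;
        }
    bitonic_merge(a,         n / 2);             /* bitonic-sorter[n/2], top half    */
    bitonic_merge(a + n / 2, n / 2);             /* bitonic-sorter[n/2], bottom half */
}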
![Page 128: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/128.jpg)
Case study: parallel sorting algorithm
• Step 2: build a network merger[n] that merges two sorted sequences of length n/2 so that the output is sorted
  § Flip the second sequence and concatenate the first and the flipped second
  § The concatenated sequence is bitonic, sort it using Step 1
[Diagram: merger[n] example on two sorted binary sequences; the second sequence is flipped, after which a half-cleaner stage feeds two bitonic-sorter[n/2] networks]
![Page 129: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/129.jpg)
Case study: parallel sorting algorithm
• Step 3: build a sorter[n] network that sorts arbitrary sequences
  § Do this recursively from merger[n] building blocks
  § Depth: D(1) = 0 and D(n) = D(n/2) + log₂n = O(log² n)
[Diagram: sorter[n] = two sorter[n/2] networks in parallel, followed by a merger[n] (from Step 2)]
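Putting the pieces together, a hedged sequential C sketch of merger[n] and sorter[n] that reuses the bitonic_merge function sketched above (names are illustrative):

/* merger[n]: merge the two sorted halves of a[0..n-1] into one sorted sequence.
   Flipping (reversing) the second half makes the whole array bitonic (Step 2). */
void merger(int *a, int n) {
    for (int i = 0; i < n / 4; i++) {
        int tmp = a[n / 2 + i]; a[n / 2 + i] = a[n - 1 - i]; a[n - 1 - i] = tmp;
    }
    bitonic_merge(a, n);                         /* Step 1: sort the bitonic sequence */
}

/* sorter[n]: recursively sort both halves, then merge them (Step 3).
   Depth D(n) = D(n/2) + log2(n) = O(log^2 n) parallel stages.        */
void sorter(int *a, int n) {
    if (n <= 1) return;
    sorter(a,         n / 2);
    sorter(a + n / 2, n / 2);
    merger(a, n);
}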
![Page 130: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/130.jpg)
Case study: parallel sorting algorithm
Example: sorter[16]
[Diagram: sorter[16] built recursively from merger[2], merger[4], merger[8] and merger[16] stages; the merger[16] consists of a flip-cleaner[16] followed by half-cleaner[8] stages]
Case study: parallel sorting algorithm
• Further reading on bitonic networks:
  § http://valis.cs.uiuc.edu/~sariel/teach/2004/b/webpage/lec/14_sortnet_notes.pdf
• In case n is not a power of two:
  § http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/oddn.htm
![Page 132: Introduction to the Message Passing Interface (MPI)](https://reader031.vdocuments.net/reader031/viewer/2022013015/61d062406bec87705b2fb4af/html5/thumbnails/132.jpg)