Introduction to MPI (teaching.csse.uwa.edu.au/units/cits3402/lectures/2pointtopoint.pdf)
TRANSCRIPT
Introduction to MPI Continued
Remainder of the Course
1. Why bother with HPC
2. What is MPI
3. Point to point communication (today)
4. User-defined datatypes / Writing parallel code / How to use a super-computer (today)
5. Collective communication
6. Communicators
7. Process topologies
8. File I/O and parallel profiling
9. Hadoop/Spark
10. More Hadoop / Spark
11. Alternatives to MPI
12. The largest computations in history / general interest / exam prep
Last Time
• Introduced distributed parallelism
• Introduced MPI
• General setup for MPI-based programming
Very brief, I hope you have enjoyed the not-break-break
Today
• Deriving MPI
• Blocking / Non-blocking send and receive
• Datatypes in MPI
• How to use a super-computer
Get the basics right and everything else should fall into place
Deriving MPI
Advantages of Message Passing
• Why would we want to use a message passing model?
  • And sometimes you don’t (Map/Reduce, ZeroMQ etc.)
• Universality
• Expressivity
• Ease of Debugging
• Performance
Advantages of Message Passing
• Why would we want to use a message passing model?
  • And sometimes you don’t (Map/Reduce, ZeroMQ etc.)
• Universality
  • No special hardware requirements
  • Works with special hardware
• Expressivity
  • Message passing is a complete model for parallel algorithms
  • Useful for automatic load balancing
• Ease of Debugging
  • Still difficult
  • Thinking w.r.t. messages fairly intuitive
• Performance
  • Associates data with a processor
  • Allows compiler and cache manager to work well together
Introduction to MPI (for real)
• Simple goals
  • Portability
  • Efficiency
  • Functionality
• So we want a message passing model
  • Each process has a separate address space
• A message is one process copying some of its address space to another
  • Send
  • Receive
Minimal MPI
• What does the sender send?
  • Data: starting address + length (bytes)
  • Destination: destination address (an int will do)
• What does the receiver receive?
  • Data: starting address + length (bytes)
  • Source: source address (filled in when received)
Minimal MPI
• So we can send and receive messages
• Might be enough for some applications but there’s something missing
Message selection
• Currently all processes receive all messages
• If we add a tag field processes will be able to ignore messages not intended for them
Minimal MPI
• Our model now becomes
• Send(address, length, destination, tag)
• Receive (address, length, source, tag, actual length)
• We can make the source and tag arguments wildcards to go back to our original model
• This is a complete model for Message-Passing HPC
• Most exotic MPI functions are built by combining these two
Minimal MPI – Problems
• There are still some issues that MPI solves
1. Describing message buffers
2. Separating families of messages
3. Naming processes
4. Communicators
MPI – Describing Buffers
• (address, length) is not sufficient for two main reasons
• Assumes data is contiguous
  • Often not the case
  • E.g. sending the row of a matrix stored column-wise
• Assumes the data representation is always known
  • Does not handle heterogeneous clusters
  • E.g. CPU + GPU machines, for example
• MPI’s solution
  • MPI datatypes → abstract the layout up one level → allow users to specify their own
  • (address, length, datatype)
MPI – Separating families of messages
• Consider using a library written with MPI
  • It will have its own naming of tags and such
  • Your code may interact with this
• MPI’s solution
  • Contexts → think of these as super-tags
  • Provides one more layer of separation between codes running in one application
MPI – Naming Processes
• Processes belong to groups
• A rank is associated with each process in a group
• Using an int is actually sufficient in this case
MPI – Communicators
• Combines contexts and groups into a single structure
• Destination and source ranks are specified relative to a communicator
MPI_Send(start, count, datatype, dest, tag, comm)
• Message buffer described by
  • Start
  • Count
  • Datatype
• Target process given by
  • Dest
  • Comm
• Tag can be used to create different ‘types’ of messages
MPI_Recv(start, count, datatype, source, tag, comm, status)
• Waits until a matching (source, tag) message is available
• Reads into the buffer
  • Start
  • Count
  • Datatype
• Target process specified by
  • Source
  • Comm
• Status contains more information
• Receiving fewer than count occurrences of datatype is okay, more is an error
MPI – Other Interesting Features
• Collective communication
  • Get all of your friends involved – light up the group chat
• Two flavours
  • Data movement – e.g. broadcast
  • Collective computation – min, max, average, logical OR etc.
MPI – Other Interesting Features
• Virtual topologies
  • Allow graph and grid connections to be imposed on processes
  • ‘Send to my neighbours’
• Debugging and profiling help
  • MPI requires ‘hooks’ to be available for debugging the implementation
• Communication modes
  • Blocking vs. non-blocking
• Support for libraries
  • Communicators allow libraries to exist in their own space
• Support for heterogeneous networks
  • MPI_Send/Recv is implementation independent
MPI – Other Interesting Features
• Processes vs. processors
  • A process is a software concept
  • A processor is a rock we tricked into thinking
• Some implementations limit you to one process per processor
Is MPI Large?
• There are many functions available and many strange idiosyncrasies
• The core is rather tight however
• A full MPI specification (fundamentally) requires
  • Init
  • Comm_rank
  • Comm_size
  • Send
  • Recv
  • Finalize
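To make this concrete, here is a minimal skeleton using only those calls. This is a generic sketch rather than the lecture's code, and the printf simply stands in for real work:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there? */

    /* MPI_Send / MPI_Recv calls would go here */
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut MPI down */
    return 0;
}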
Point-to-Point Communication
What’s the point?
• The fundamental mechanism in MPI is the transmission of data between a pair of processes
  • One sender
  • One receiver
• Almost all other MPI constructs are short-hand versions of tasks you could achieve with point-to-point methods
• We will linger on this topic a little longer than may first seem necessary
  • Many idiosyncrasies in MPI come from how point-to-point communication is achieved in code
What’s the point?
• Remember
  • Rank → ID of each process in a communicator
  • Communicator → collection of processes
  • MPI_COMM_WORLD → the communicator for all processes
Example: Knock-Knock
What we want to do:
• Find our rank
• If process 0
  • Send “Knock knock”
• If process 1
  • Receive a string from process 0
• Otherwise
  • Do nothing
Example: Knock-Knock
char msg[20];
int myrank, tag = 99;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}

This code sends a single string from process 0 to process 1.
Some things we’ll discuss in detail:
• MPI_Send
• MPI_Recv
• Tags
MPI_Send
1. Msg (in)
  • The buffer (location in memory) to send from
2. Strlen(msg)+1 (in)
  • The number of items to send
  • +1 to include the null byte ‘\0’ → only relevant when sending strings
3. MPI_CHAR (in)
  • The MPI datatype (more on this later); indicates the size of each element in your buffer

if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
MPI_Send
4. 1 (in)
  • The rank of the destination process
5. Tag (in)
  • The ‘topic’ of the message (will only be received if process 1 Recv’s on tag 99)
6. MPI_COMM_WORLD (in)
  • The communicator through which we are sending
  • Each communicator (with two processes) has a rank 0 process and a rank 1 process

if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
MPI_Recv
1. Msg (out)
  • The buffer to receive into
2. 20 (in)
  • The maximum number of elements we want
3. MPI_CHAR (in)
  • The size of each element

} else if (myrank == 1) {
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}
MPI_Recv
4. 0 (in)
  • The rank of the process we want to receive from
5. Tag (in)
  • The ‘topic’ we want to receive on (more on this later)
6. MPI_COMM_WORLD (in)
  • The communicator we are communicating on
7. &status (out)
  • A status object filled in by the call: it records the actual source and tag, and can be queried for how many elements were received (the function’s return value is the error code)

} else if (myrank == 1) {
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}
Tags
• A rather good idea at the time
• Rarely used in practice
• Allow processes to provide ‘topics’ for communication
  • E.g. ‘42’ refers to all communication for a particular sub-task etc.
• MPI_ANY_TAG matches any tag, which makes specifying tags largely redundant
Generally, we write our own code and we assume we know what we’re doing.
MPI Datatypes
• Back to the MPI_CHAR
• Generally, data in a message (sent or received) is described as a triple
  • Address
  • Count
  • Datatype
MPI Datatypes
• Back to the MPI_CHAR
• Generally, data in a message (sent or received) is described as a triple
  • Address
  • Count
  • Datatype
• An MPI datatype can be
  • Predefined, corresponding to a language primitive (MPI_INT, MPI_CHAR)
  • A contiguous array of MPI datatypes
  • A strided block of datatypes
  • An indexed array of blocks of datatypes
  • An arbitrary structure of datatypes
• There are MPI functions to construct custom datatypes – (int, float) tuples, for example
MPI Datatypes
• Using MPI datatypes specifies messages as data points, not bytes
  • Machine independent
  • Implementation independent
  • Portable between machines
  • Portable between languages
We have some more information
Some really, really important points
• Communication requires cooperation; you need to know:
  • Who you are sending to / receiving from
  • What you are sending/receiving
  • When you want to send/receive
• Very specific, requires careful reasoning about algorithms
• All nodes (in general) will run the same executable
Some really, really important points
• Communication requires cooperation; you need to know:
  • Who you are sending to / receiving from
  • What you are sending/receiving
  • When you want to send/receive
• Very specific, requires careful reasoning about algorithms
• All nodes (in general) will run the same executable
  • Very different style of programming
• The ‘root’ (usually rank 0) may have very different tasks to all other nodes
• Rank becomes very important to dividing the bounds of a problem
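For instance, a common pattern is for every process to derive its own slice of the problem from its rank. This is a hypothetical sketch (rank and size are assumed to come from MPI_Comm_rank and MPI_Comm_size, and n is the problem size):

/* Hypothetical sketch: divide n items across size processes by rank.
   rank and size obtained earlier from MPI_Comm_rank / MPI_Comm_size. */
int n = 1000;
int chunk = n / size;                        /* base number of items per process */
int extra = n % size;                        /* the first `extra` ranks take one more */
int start = rank * chunk + (rank < extra ? rank : extra);
int end   = start + chunk + (rank < extra ? 1 : 0);
for (int i = start; i < end; ++i) {
    /* work on item i */
}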
Example 2: Knock-knock, who’s there
What we want to do
• Find our rank
• If process 0
  • Send ‘Knock knock’
  • Receive from process 1
• If process 1
  • Send ‘Who’s there?’
  • Receive from process 0
• Else
  • Do nothing
Example 2: Knock-knock, who’s there

char msg[20];
int myrank, tag = 99;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(msg, 20, MPI_CHAR, 1, tag, MPI_COMM_WORLD, &status);
} else if (myrank == 1) {
    strcpy(msg, "Who's there?");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}
There may be a problem here
Blocking vs. Non-blocking
• Depending on the implementation you use, this may cause a deadlock
  • If you have enough buffer space it might be okay (but don’t rely on this)
• We’ve been using the blocking send/receive functions
  • Halt execution until completed
• There exist non-blocking versions of send/recv
  • MPI_Isend – same arguments, plus an MPI_Request handle
  • MPI_Irecv – same arguments, but replace the MPI_Status with an MPI_Request
  • Return immediately and continue with computation
When to use Non-blocking
• Should only be used where performance improves
  • E.g. sending a large amount of data when a large amount of compute is also available
• Using non-blocking communication will parallelise a little more
• To check for a communication’s success, you need to use
  • MPI_Wait()
  • MPI_Test()
• An alternate interpretation
  • MPI_Send/Recv is just MPI_Isend/Irecv + MPI_Wait()
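Conceptually, that interpretation looks like the following sketch (not how implementations are literally written; buf, count, type, dest, tag and comm stand in for the usual arguments):

/* MPI_Send(buf, count, type, dest, tag, comm) behaves roughly like: */
MPI_Request req;
MPI_Status  status;
MPI_Isend(buf, count, type, dest, tag, comm, &req);  /* post the send and return immediately */
MPI_Wait(&req, &status);                             /* then block until it has completed */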
Example 3: Knock-Who’s Knock-There
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int myrank, numprocs;
    char msg[20], msg2[20];
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    /* ... continued on the next slide */
Example 3: Knock-Who’s Knock-There

    if (myrank == 0) {
        strcpy(msg, "Knock knock");
        MPI_Irecv(msg2, 20, MPI_CHAR, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
        MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Wait(&request, &status);
    } else if (myrank == 1) {
        strcpy(msg, "Who's there?");
        MPI_Irecv(msg2, 20, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
        MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        MPI_Wait(&request, &status);
    }
    MPI_Finalize();
    return 0;
}
On Multi-threading + MPI
• MPI does not specify the interaction of blocking communication calls and a thread scheduler
• A good implementation would block only the sending thread
• It is the user’s (your) responsibility to ensure other threads do not edit the communicating (possibly shared) buffer
On Message Ordering
• Messages are non-overtaking
• The order a process sends messages is the order another process receives them
• The order multiple processes send messages in does not matter
[Diagram: P0 sends M0¹ then M0² to P2; P1 sends M1¹ to P2]

Can be received at P2 as:
• M1¹, M0¹, M0²
• M0¹, M1¹, M0²
• M0¹, M0², M1¹

But not:
• M1¹, M0², M0¹
• M0², M1¹, M0¹
• M0², M0¹, M1¹
On Message Ordering
• Another important note: ordering is not transitive
  • Sounds goofy, but easy to make this mistake
• Be careful when using MPI_ANY_SOURCE

P0: Send(P1), Send(P2)
P1: Recv(P0), Send(P2)
P2: Recv(P?), Recv(P?)

Even though P0 sends to P1 before it sends to P2, and P1 only sends to P2 after hearing from P0, P2’s two wildcard receives may still see P1’s message before P0’s.
On Message Ordering
• One goal of MPI is to encourage deterministic communication patterns
• Using exact addresses, exact buffer sizes, enforced ordering etc.
• Makes code predictable
• Sources of non-determinism
  • MPI_ANY_SOURCE as source argument
  • MPI_CANCEL()
  • MPI_WAITANY()
  • Threading
Extended Example : Computing Pi (*groans)
• Your favourite example is back again
• This example is ‘perfect’ computationally
  • Automatic load balancing → all processes do as much work as possible
  • Verifiable answer
  • Minimal communication
• This time, we use numerical integration
∫₀¹ 1/(1 + x²) dx = [arctan x]₀¹ = arctan 1 − arctan 0 = arctan 1 = π/4
Extended Example – Computing PI
• So we integrate the function f(x) = 4/(1 + x²)
• Our approach is very simple
  • Divide [0, 1] into n intervals
  • Each interval forms a rectangle of height f(xᵢ) (the function evaluated in that interval) and width 1/n
  • Add up all the rectangles to get a (not very good) approximation of the integral
• This gives us an approximation to π
Extended Example – Computing PI
• Our parallelism will also be quite simple
  • One process (the root, rank 0) will obtain n from the user and broadcast this value to all others
  • All other processes will determine how many points they each compute
  • All other processes will compute their sub-approximations
  • All other processes will send back their approximations
  • The root will display the final result
Code Time
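The code shown in the lecture lives in the unit's git repo (linked a little further on); what follows is only a rough sketch of the scheme just described. The midpoint rule and the variable names are my assumptions, and the broadcast and summing steps use the collectives MPI_Bcast and MPI_Reduce, which are covered properly in a later lecture:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, n = 0;
    double h, local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                  /* root obtains n from the user */
        printf("Number of intervals: ");
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);     /* everyone learns n */

    h = 1.0 / n;
    for (int i = rank; i < n; i += size) {            /* each rank takes every size-th rectangle */
        double x = h * (i + 0.5);
        local += h * 4.0 / (1.0 + x * x);
    }

    /* sum the partial approximations onto the root */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}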
Extended Example 2 – Matrix-Vector Multiplication
• Previous example does not have any ‘actual’ message passing
• We introduce one of the most common structures for a parallel program
  • Self-scheduling
  • Formerly ‘master-slave’
  • I will use ‘master-node’
• Matrix-vector multiplication is a good example, but there are many scenarios where this approach is applicable
  • The nodes do not need to communicate with each other
  • The amount of work for each node is difficult to predict
Matrix-Vector Multiplication
[Diagram: the matrix-vector product c = A b]
Matrix-Vector Multiplication
• All processes receive a copy of the vector b
• Each unit of work is a dot-product of one row of matrix A
• The root sends rows to each of the nodes
• When the result is sent back, another row is sent if available
• After all rows are processed, termination messages are sent
Code Time II
There’s a git repo with all in-lecture code examples:
https://github.com/pritchardn/teachingCITS3402
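For reference, a rough sketch of the self-scheduling loop described above (the actual in-lecture code is in the repo; the names, the tag convention, and the assumption that A, b, c, n, rank and size are already set up are all mine):

/* Sketch of the master/node loop. The row index is carried in the message tag,
   with tag 0 meaning "stop". Assumes A is an n x n matrix at the root, b is known
   to every process, and n >= size - 1. Needs <stdlib.h> for malloc/free. */
if (rank == 0) {
    int row = 0, done = 0;
    for (int p = 1; p < size && row < n; ++p, ++row)        /* one row to each node to start */
        MPI_Send(A[row], n, MPI_DOUBLE, p, row + 1, MPI_COMM_WORLD);
    while (done < n) {
        double ans;
        MPI_Status st;
        MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        c[st.MPI_TAG - 1] = ans;                             /* the tag says which row this answer is */
        ++done;
        if (row < n) {                                       /* more work: send the next row */
            MPI_Send(A[row], n, MPI_DOUBLE, st.MPI_SOURCE, row + 1, MPI_COMM_WORLD);
            ++row;
        } else {                                             /* no rows left: send a stop message */
            MPI_Send(NULL, 0, MPI_DOUBLE, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    }
} else {
    double *rowbuf = malloc(n * sizeof(double));
    double ans;
    MPI_Status st;
    while (1) {
        MPI_Recv(rowbuf, n, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == 0) break;                          /* told to stop */
        ans = 0.0;
        for (int j = 0; j < n; ++j) ans += rowbuf[j] * b[j]; /* dot product with b */
        MPI_Send(&ans, 1, MPI_DOUBLE, 0, st.MPI_TAG, MPI_COMM_WORLD);
    }
    free(rowbuf);
}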
Studying Parallel Performance
Studying Parallel Performance
• We can do a little better than timing programs
• Goal is to estimate
  • Computation
  • Communication
  • Scaling w.r.t. problem size
Studying Parallel Performance
• We can do a little better than timing programs
• Goal is to estimate
  • Computation
  • Communication
  • Scaling w.r.t. problem size
• Consider matrix-vector multiplication
  • Square, dense matrix n × n
  • Each element of c requires n multiplications and n − 1 additions
  • There are n elements in c, so our FLOP requirement is
    n(n + n − 1) = 2n² − n
Studying Parallel Performance
• We also consider communication costs
• We assume all processes have the original vector already
• Need to send n + 1 values per row (n values out and 1 result back)
• n times (once for each row)
  n(n + 1) = n² + n
Studying Parallel Performance
• A ratio of communication to computation is
  (n² + n) / (2n² − n) × (T_comm / T_calc)
• Computation is usually cheaper than communication
  • We try to minimise this ratio
• Often making the problem larger makes communication overhead insignificant
• Here, this is not the case
  • For large n the ratio gets closer to 1/2
Studying Parallel Performance
• We could easily adapt our approach for matrix-matrix multiplication
• Instead of a vector b we have another square matrix B
• Each round sees a vector sent back instead of a single value
Studying Parallel Performance
• Computation requirements
  • The operations for each element of C are n multiplications and n − 1 adds
  • Now n² elements to compute
    n²(2n − 1) = 2n³ − n²
• Communication requirements
  • n (to send each row) + n (to send a row back), and there are n rows
    n × 2n = 2n²
• Comm/calc ratio
  2n² / (2n³ − n²) × (T_comm / T_calc)
• Which scales as 1/n for large n
User-Defined Datatypes
Writing Parallel Code
How to use a super-computer
Today
• Introduction to user-defined datatypes
  • Datatype constructors
  • Use of derived datatypes
  • Addressing
  • Packing and unpacking
• Writing parallel code
  • Some guidance on getting started
  • Reasoning with parallel algorithms
User Defined Datatypes
• We’ve seen MPI communicate a sequence of identical elements, contiguous in memory
• We sometimes want to send other things
  • Structures containing setup parameters
  • Non-contiguous array subsections
• We’d like to send this data with a single communication call rather than several smaller ones
  • Communication speed is network bound
• MPI provides two mechanisms to help

We’ll look at how this works (relatively) briefly; most applications can be handled with primitive communication.
User Defined Datatypes
• Derived datatypes
  • Specifying your own data layouts
  • Useful for sending structs, for instance
• Data packing
  • An extra routine before/after sending/receiving to compress non-contiguous data into a dense block
• Often, we can achieve the same data transfer with either method
  • Obviously, there are pros and cons to either
Derived Datatypes
• All MPI communication functions take a datatype argument
• MPI provides functions to construct ‘types’ of our own (only relevant to MPI of course) to provide to these functions
• They describe layouts of memory of complex-data structures or non-contiguous memory
Derived Datatypes
• Constructed from basic datatypes
• The writers of MPI recursively defined datatypes – allowing us to use them for our own nefarious tasks
• A derived datatype is an opaque object (we can’t edit it after construction) specifying two things
  • A sequence of primitive datatypes – the type signature
  • A sequence of integer (byte) displacements – the type map
Type Signature
• A list of datatypes
• E.g. Typesig = {MPI_CHAR, MPI_INT}
• This describes some datatype which has one or more MPI_CHARs followed by one or more MPI_INTs
Type map
• A list of pairs
• The first elements are your type signature
• The second elements are the displacement in memory (in bytes) from the first location
• E.g. MPI_INT has the type map {(int, 0)} – a single integer beginning at displacement 0
• Type maps define the size of the buffer you are sending
This will make more sense once we introduce datatype-constructors
Some utility functions (for reference)
• MPI_TYPE_EXTENT(datatype, extent)
  • datatype IN datatype (e.g. MPI_CHAR, MPI_MYTHINGY)
  • extent OUT the extent of the datatype, i.e. the span of memory it occupies once alignment requirements are factored in
• MPI_TYPE_SIZE(datatype, size)
  • datatype IN datatype
  • size OUT the total size (in bytes) of the data described by the type signature
Some utility functions (for reference)
• MPI_TYPE_COMMIT(datatype)
  • ‘Compiles’ or flattens the datatype into an internal representation for MPI to use
  • Must be done before using a derived datatype
Importantly, all processes in a parallel program must have the same datatypes committed
MPI_TYPE_CONTIGUOUS(count, oldtype, newtype)
• count IN replication count
• oldtype IN the old datatype
• newtype OUT the new datatype
• Constructs a typemap of count copies of oldtype in contiguous memory
• Example, oldtype = {(double, 0), (char, 8)}
• MPI_TYPE_CONTIGUOUS(3, oldtype, newtype) yields the datatype
  • {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40)}
Constructors – Contiguous
• MPI_TYPE_CONTIGUOUS(3, oldtype, newtype) yields the datatype
  • {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40)}
[Diagram: with count = 3, newtype is three copies of oldtype laid end to end]
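As a hypothetical example of putting this to use, a fixed block of three doubles can be committed as its own type and then sent as a single element; dest and tag here are placeholders:

/* Hypothetical example: treat 3 consecutive doubles as one element */
MPI_Datatype triple;
MPI_Type_contiguous(3, MPI_DOUBLE, &triple);
MPI_Type_commit(&triple);                                /* must commit before use */

double coords[3] = {1.0, 2.0, 3.0};
MPI_Send(coords, 1, triple, dest, tag, MPI_COMM_WORLD);  /* one 'triple', not three doubles */

MPI_Type_free(&triple);                                  /* release the type when finished */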
MPI_TYPE_INDEXED(count, array_of_blocklengths, array_of_displacements, oldtype, newtype)
• Allows one to specify a non-contiguous data layout
• The displacements of each block can differ (and can be repeated)
• count IN the number of blocks
• array_of_blocklengths IN # of elements for each block
• array_of_displacements IN displacement for each block (measured in number of elements)
• oldtype IN the old datatype
• newtype OUT the new datatype
MPI_TYPE_INDEXED(count, array_of_blocklengths, array_of_displacements, oldtype, newtype)
• E.g. oldtype = {(double, 0), (char, 8)}
  • Blocklengths B = (3, 1)
  • Displacements D = (4, 0)
• MPI_TYPE_INDEXED(2, B, D, oldtype, newtype) yields
[Diagram: blocks of oldtype placed into newtype; the figure’s example uses count = 3, blocklengths = (2, 3, 1), displacements = (0, 3, 8)]
Example – Upper Triangle of a Matrix
[0][0] [0][1] [0][2] …
[1][1] [1][2] …
[2][2] [2][3] …
[3][3] [3][4] …
Example – Upper Triangle of a Matrix
double values[100][100];
int disp[100];
int blocklen[100];
MPI_Datatype upper;

/* find the start and length of each row's upper-triangular part */
for (int i = 0; i < 100; ++i) {
    disp[i] = 100 * i + i;      /* the i-th row's part starts at element (i, i) */
    blocklen[i] = 100 - i;      /* and contains 100 - i elements */
}
/* Create datatype */
MPI_Type_indexed(100, blocklen, disp, MPI_DOUBLE, &upper);
MPI_Type_commit(&upper);
/* Send it */
MPI_Send(values, 1, upper, dest, tag, MPI_COMM_WORLD);
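On the receiving side, which the slide does not show, the same committed datatype describes where the incoming values land; this is a sketch with assumed src and tag:

/* Receiver: assumes `upper` was built and committed identically on this process */
double values[100][100];
MPI_Status status;
MPI_Recv(values, 1, upper, src, tag, MPI_COMM_WORLD, &status);
/* only the upper-triangular entries of values are written; the rest are untouched */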
Packing and Unpacking
• Based on how previous libraries achieved this
• In most cases one can avoid packing and unpacking in favour of derived datatypes
  • More portable
  • More descriptive
  • Generally simpler
• Sometimes, for simple use cases, it can be desirable
• This is where we make use of the MPI_PACKED datatype mentioned last time
MPI_PACK(inbuf, incount, datatype, outbuf, outsize, position, comm)
• inbuf IN input buffer
• incount IN number of elements in the buffer
• datatype IN the type of each element
• outbuf OUT output buffer
• outsize IN output buffer size (bytes)
• position INOUT current position in buffer (bytes)
• comm IN communicator for packed message

Used by repeatedly calling MPI_PACK with updated inbuf and position values
MPI_UNPACK(inbuf, insize, position, outbuf, outcount, datatype, comm)
• inbuf IN input buffer
• insize IN size of input buffer (bytes)
• position INOUT current position (bytes)
• outbuf OUT output buffer
• outcount IN number of components to be unpacked
• datatype IN datatype of each output component
• comm IN communicator
The exact inverse of MPI_PACK. Used by repeatedly calling unpack, extracting each subsequent element
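A small sketch of the round trip, packing an int and a double into one MPI_PACKED message (dest, src and tag are placeholders):

/* Sender: pack an int and a double into one contiguous buffer, then send it */
char buffer[100];
int  position = 0;
int  count = 10;
double step = 0.5;
MPI_Pack(&count, 1, MPI_INT,    buffer, 100, &position, MPI_COMM_WORLD);
MPI_Pack(&step,  1, MPI_DOUBLE, buffer, 100, &position, MPI_COMM_WORLD);
MPI_Send(buffer, position, MPI_PACKED, dest, tag, MPI_COMM_WORLD);

/* Receiver: unpack in exactly the order the sender packed */
MPI_Status status;
MPI_Recv(buffer, 100, MPI_PACKED, src, tag, MPI_COMM_WORLD, &status);
position = 0;
MPI_Unpack(buffer, 100, &position, &count, 1, MPI_INT,    MPI_COMM_WORLD);
MPI_Unpack(buffer, 100, &position, &step,  1, MPI_DOUBLE, MPI_COMM_WORLD);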
Finally
There are many other functions that are helpful. Please have a look at the documentation and MPI spec when you come across a bizarre communication requirement.
How to write Parallel Code
Why bother with this section
• Reasoning with parallel algorithms is bizarre relative to serial algorithms. Consequently, writing and reading parallel code can be painful if you are not prepared
• This is not helped by running code on a remote machine using a job system.

This section is simply some guidance on where to start; everyone finds some things more useful than others.
Writing parallel code – Quick tips
• Install MPI locally
  • Cannot be stressed enough
  • You can simulate running multiple nodes on a single machine with mpiexec/mpirun calls
• Write a serial version of your solution first
  • Tight and clean serial code is vastly easier to parallelise
  • Especially if you don’t know how to parallelise the code in the first place
  • I would not advise trying to do both simultaneously
Writing parallel code – Quick tips
• Write a ‘dummy’ parallelised version of your code first
  • E.g. decomposing a matrix
  • Have each process compute its bounds first and print them out
  • Test your scheme for many different configurations and check you’re correct
  • Then start adding actual computation
• Read documentation
  • Most of the time, there will be a function to help make your life easier
• Test your parallel code in serial
  • A parallelised piece of code should still work if only one node is available
Writing parallel code – Quick tips
• Use the sysadmins
  • If you end up using a commercial HPC system (e.g. Pawsey Supercomputing Centre) use the helpdesk – they want to help
• Invest in some tests
  • Writing a few small examples and making them easy to run will make testing changes easier
• Good printouts are invaluable
  • They help you quickly find out where your code may be bugged and what values are changing
Writing parallel code – Quick tips
• Make writing code easy for you
  • Many IDEs/editors (CLion, VSCode etc.) allow for a full remote mode. Write code directly on the supercomputer using your own editor
• Otherwise
  • Write code locally
  • Test locally
  • Commit with git
  • Pull the edits on the supercomputer and run
Write and run code – you’ll only improve with practice
Common Errors Writing Parallel Code
• Expecting argc and argv to be passed to all processes
• Doing things before MPI_Init or after MPI_Finalize
• Matching MPI_Bcast with MPI_Recv
• Assuming your MPI implementation is thread-safe
Summary
• Looked at point to point communication
• MPI_Send
• MPI_Recv
• MPI_Isend
• MPI_Irecv
• MPI_Wait
Bonus material: Tour-de-force of most useful point-to-point function descriptions (might be helpful later)
Quick reminder: https://www.mpi-forum.org/docs/ → explanation of all function calls
MPI Datatypes
MPI Datatypes
• Since all data is given an MPI type, an MPI implementation can communicate between very different machines
• Specifying application-oriented data layout
  • Reduces memory-to-memory copies in the implementation
  • Allows the use of special hardware where available
A list of MPI Datatypes (C)
MPI Datatype          C Datatype
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE n/a
MPI_PACKED n/a
MPI_BYTE / MPI_PACKED
• MPI_BYTE is precisely a byte (eight bits)
• Un-interpreted and may be different to a character
  • Some machines may use two bytes for a character, for instance
• MPI_PACKED is a much more complicated beastie (we’ll get to it later)
  • For now, it’s just ‘any data’
• Used to send structs through MPI
Point-to-Point-Pointers
MPI_Send(start, count, datatype, dest, tag, comm)
• start (IN) initial address of send buffer
• count (IN) number of elements in buffer
• datatype (IN) datatype of each entry
• dest (IN) rank of destination
• tag (IN) message tag
• comm (IN) communicator
Performs a blocking send; returns an error code (int)
MPI_Recv(start, count, datatype, source, tag, comm, status)
• start
• count
• datatype
• source
• tag
• comm
• status
Performs a blocking receive; returns an error flag. Status can be inspected for more information
MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)
• sendbuf (IN) start of send buffer
• sendcount (IN) number of entries to send
• sendtype (IN) datatype of each entry
• dest (IN) rank of destination
• sendtag (IN) message tag
• recvbuf (OUT) start of receive buffer (must be a different buffer)
• recvcount (IN) number of entries to receive
• recvtype (IN) datatype of receive entries
• source (IN) rank of source
• recvtag(IN) message tag
• comm (IN) communicator
• status (OUT) return status
Performs a standard send and receive as if executed on two separate threads
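A sketch of a common use: two ranks swapping a value without the send-before-receive deadlock risk from earlier (assumes exactly two participating ranks; the names are mine):

int partner = 1 - rank;                  /* the other of the two ranks */
double mine = (double)rank;              /* some local value to exchange */
double theirs;
MPI_Status status;
MPI_Sendrecv(&mine,   1, MPI_DOUBLE, partner, 0,
             &theirs, 1, MPI_DOUBLE, partner, 0,
             MPI_COMM_WORLD, &status);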
MPI_Isend(buf, count, datatype, dest, tag, comm, request)
• buf (IN) initial address of send buffer
• count (IN) number of entries to send
• datatype (IN) datatype of each entry
• dest (IN) rank of destination in comm
• tag (IN) message tag
• comm (IN) communicator
• request (OUT) request handle (MPI_REQUEST)
Posts a standard nonblocking send
MPI_Irecv(buf, count, datatype, source, tag, comm, request)
• buf (OUT) start of receive buffer
• count (IN) number of entries to receive
• datatype (IN) type of each entry
• source (IN) rank of source
• tag (IN) message tag
• comm (IN) communicator
• request (OUT) request handle
Posts a nonblocking receive
MPI_Wait(request, status)
• request (INOUT) request handle
• status (OUT) status object
Returns when the operation in request is complete
MPI_Test(request, flag, status)
• request (INOUT) request handle
• flag (OUT) true if operation completed
• status (OUT) status object
Sets flag to true if the operation defined by request is complete. This function allows single-threaded applications to schedule alternate tasks while waiting for communication to complete.
MPI_Waitall(count, requests, statuses)
• count (IN) list length
• requests (INOUT) array of request handles
• statuses (OUT) array of status objects
Blocks until all communications associated with the array of requests have resolved.
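For instance, posting a nonblocking receive from every other rank and then waiting for all of them at once (a sketch with assumed names and an assumed upper bound on the number of processes):

double      values[64];                  /* assumes size <= 64 */
MPI_Request requests[64];
MPI_Status  statuses[64];
int nreq = 0;
for (int p = 0; p < size; ++p) {
    if (p == rank) continue;             /* no receive from ourselves */
    MPI_Irecv(&values[p], 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &requests[nreq++]);
}
MPI_Waitall(nreq, requests, statuses);   /* returns once every posted receive has completed */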
MPI_Testall(count, requests, flag, statuses)
• count (IN) list length
• requests (INOUT) array of request handles
• flag (OUT) true only if all operations have completed
• statuses (OUT) array of statuses
Tests for completion of all communications specified in the array of requests; flag is set to true only if all of them have completed.
MPI Constants
Communicators
• MPI_ANY_TAG When passed to a receive, matches a message with any tag
• MPI_ANY_SOURCE When passed to a receive, matches a message from any source
• MPI_COMM_NULL A return value occurring when processes are not in a given communicator
• MPI_COMM_SELF A communicator containing only the calling process
• MPI_COMM_WORLD Communicator containing all processes
Error codes
• MPI_SUCCESS Status code of a successful call
• MPI_ERR Can be used to check if any error has occurred
• MPI_ERR_ARG Indicates an error with a passed argument
• MPI_ERR_COMM Indicates an invalid communicator
• MPI_ERR_IN_STATUS Indicates that the error code is held in the status object
• MPI_ERR_ROOT Indicates an invalid root node argument
• MPI_ERR_TAG Indicates an invalid argument for tag
• MPI_ERR_UNKNOWN An error code not matching anything