Post on 13-Jan-2016
Advanced MPI programming
Julien Langou
George Bosilca
Outline
• Point-to-point communications, groups and communicators, data-types, collective communications
• Four real-life applications to play with
Point-to-point Communications
• Data transfer from one process to another
– The communication requires a sender and a receiver, i.e. there must be a way to identify processes
– There must be a way to describe the data
– There must be a way to identify messages
Who are you ?
• Processes are identified by a pair:
– Communicator: a “stream-like” safe place to pass messages
– Rank: the rank of the remote process in the group of processes attached to the corresponding communicator
• Messages are identified by a tag: an integer that allows you to differentiate messages in the same communicator
What data to exchange ?
• The data is described by a triple:
– The address of the buffer where the data is located in the memory of the current process
– A count: how many elements make up the message
– A data-type: a description of the memory layout involved in the message
The basic send
• Once this function returns, the data has been copied out of the user memory and the buffer can be reused
• This operation may require system buffers, in which case it blocks until enough space is available
• The completion of the send says NOTHING about the reception state of the message
MPI_Send(buf, count, type, dest, tag, communicator)
The basic receive
• When this call returns, the data is located in the user memory
• Receiving fewer than count elements is permitted, but receiving more is an error
• The status contains information about the received message (useful with MPI_ANY_SOURCE, MPI_ANY_TAG)
MPI_Recv(buf, count, type, source, tag, communicator, status)
The MPI_Status
• Structure in C (MPI_Status), array in Fortran (integer status(MPI_STATUS_SIZE))
• Contains at least 3 fields (integers):
– MPI_TAG: the tag of the received message
– MPI_SOURCE: the rank of the source process in the corresponding communicator
– MPI_ERROR: the error raised by the receive (usually MPI_SUCCESS)
Non Blocking Versions
• The memory pointed to by buf (and described by count and type) must not be touched until the communication is completed
• req is a handle to an MPI_Request which can be used to check the status of the communication
MPI_Isend(buf, count, type, dest, tag, communicator, req)
MPI_Irecv(buf, count, type, source, tag, communicator, req)
Testing for Completion
• Test or wait for the completion of the request
• If the request is completed:
– Fill the status with the related information
– Release the request and replace it by MPI_REQUEST_NULL
Blocking: MPI_Wait(req, status)
Non-blocking: MPI_Test(req, flag, status)
Multiple Completions
• MPI_Waitany( count, array_of_requests, index, status )
• MPI_Testany( count, array_of_requests, index, flag, status )
• MPI_Waitall( count, array_of_requests, array_of_statuses )
• MPI_Testall( count, array_of_requests, flag, array_of_statuses )
• MPI_Waitsome( count, array_of_requests, outcount, array_of_indices, array_of_statuses )
Synchronous Communications
• MPI_Ssend, MPI_Issend
– No restrictions
– Does not complete until the corresponding receive has been posted and the operation has been started
• This does not mean the receive operation is completed
– Can be used instead of sending an ack
– Provides a simple way to avoid unexpected messages (i.e. no buffering on the receiver side)
Buffered Communications
• MPI_Bsend, MPI_Ibsend
• MPI_Buffer_attach, MPI_Buffer_detach
– Buffered sends do not rely on system buffers
– The buffer is managed by MPI
– The buffer must be large enough to contain the largest message sent by one send
– The user cannot use the buffer as long as it is attached to the MPI library
Ready Communications
• MPI_Rsend, MPI_Irsend
– Can ONLY be used if the matching receive has already been posted
– It is the user’s responsibility to ensure the previous condition; otherwise the outcome is undefined
– Can be efficient in some situations as it avoids the rendezvous handshake
– Should be used with extreme caution
Persistent Communications
• MPI_Send_init (and its buffered/ready variants MPI_Bsend_init, MPI_Rsend_init), MPI_Recv_init
• MPI_Start, MPI_Startall
– Reuse the same requests for multiple communications
– Provide only a half-binding: the send is not connected to the receive
– A persistent send can be matched by any kind of receive
• Point-to-point communication: 01-gs (simple), 02-nanopse (tuning), 03-summa (tuning)
• Groups and communicators: 03-summa
• Data-types: 04-lila
• Collectives (MPI_Op): 04-lila
01-gs
Gram-Schmidt algorithm
The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications.
[Figure: A = (A1; A2; A3), block distributed by rows, factors into Q = (Q1; Q2; Q3), block distributed by rows, and a global R.]
Input: A is block distributed by rows
Output: Q is block distributed by rows, R is global
Example of applications: iterative methods.
a) in iterative methods with multiple right-hand sides (block iterative methods):
1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk).
2) BlockGMRES, BlockGCR, BlockCG, BlockQMR, …
b) in iterative methods with a single right-hand side
1) s-step methods for linear systems of equations (e.g. A. Chronopoulos),
2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc,
3) Recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).
c) in iterative eigenvalue solvers,
1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),
2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,
3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
4) PRIMME (A. Stathopoulos, Coll. William & Mary ),
5) And also TRLAN, BLZPACK, IRBLEIGS.
Example of applications:
a) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers),
b) in large, nearly square dense QR factorizations, where they are used as the panel factorization step, or more simply
c) in linear least squares problems in which the number of equations is much larger than the number of unknowns.
The main characteristics of those three examples are that
a) there is only one column of processors involved, but several processor rows,
b) all the data is known from the beginning,
c) and the matrix is dense.
Various methods already exist to perform the QR factorization of such matrices:
a) Gram-Schmidt (mgs(row), cgs),
b) Householder (qr2, qrf),
c) or CholeskyQR.
We present a new method: Allreduce Householder (rhh_qr3, rhh_qrf).
An example with modified Gram-Schmidt on a nonsingular m-by-3 matrix: A = QR with Q^T Q = I3.
Q = A
r11 = || Q1 ||_2 ;  Q1 = Q1 / r11
r12 = Q1^T Q2 ;  Q2 = Q2 – Q1 r12
r22 = || Q2 ||_2 ;  Q2 = Q2 / r22
r13 = Q1^T Q3 ;  Q3 = Q3 – Q1 r13
r23 = Q2^T Q3 ;  Q3 = Q3 – Q2 r23
r33 = || Q3 ||_2 ;  Q3 = Q3 / r33
Look at the codes
The CholeskyQR Algorithm
[Figure: Ci = Ai^T Ai, R = chol(C), Qi = Ai R^-1.]
SYRK:  C := A^T A      ( mn^2 )
CHOL:  R := chol( C )  ( n^3/3 )
TRSM:  Q := A R^-1     ( mn^2 )
Bibliography
• A. Stathopoulos and K. Wu, A block orthogonalization procedure with constant synchronization requirements, SIAM Journal on Scientific Computing, 23(6):2165-2182, 2002.
• Popularized by iterative eigensolver libraries:
1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),
2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,
3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
4) PRIMME (A. Stathopoulos, Coll. William & Mary ).
Parallel distributed CholeskyQR
The CholeskyQR method in the parallel distributed context can be described as follows:
[Figure: each processor forms Ci = Ai^T Ai; C = C1 + C2 + C3 + C4 is reduced, R = chol(C) is broadcast, and each processor computes Qi = Ai R^-1.]
1: SYRK:  C := A^T A  ( mn^2 )
2: MPI_Reduce:  C := sum over the processors of C  (on proc 0)
3: CHOL:  R := chol( C )  ( n^3/3 )
4: MPI_Bcast:  broadcast the R factor from proc 0 to all the other processors
5: TRSM:  Q := A R^-1  ( mn^2 )
This method is extremely fast, for two reasons: first, there are only one or two communication phases; second, the local computations are performed with fast operations. Another advantage of this method is that the resulting code is exactly four lines, so the method is simple and relies heavily on other libraries. Despite all those advantages, this method is highly unstable.
CholeskyQR - code
int choleskyqr_A_v0( int mloc, int n, double *A, int lda,
                     double *R, int ldr, MPI_Comm mpi_comm )
{
    int info;
    cblas_dsyrk( CblasColMajor, CblasUpper, CblasTrans, n, mloc,
                 1.0e+00, A, lda, 0e+00, R, ldr );
    MPI_Allreduce( MPI_IN_PLACE, R, n*n, MPI_DOUBLE, MPI_SUM, mpi_comm );
    lapack_dpotrf( lapack_upper, n, R, ldr, &info );
    cblas_dtrsm( CblasColMajor, CblasRight, CblasUpper, CblasNoTrans,
                 CblasNonUnit, mloc, n, 1.0e+00, R, ldr, A, lda );
    return 0;
}
(And OK, you might want to add an MPI user defined datatype to send only the upper part of R)
Let us look at the codes in 01-gs
Efficient enough?
In this experiment, we fix the problem: m = 100,000 and n = 50.
MFLOP/sec/proc (time in seconds in parentheses):

# of procs | cholqr       | cgs          | mgs(row)    | qrf          | mgs
1          | 489.2 (1.02) | 134.1 (3.73) | 73.5 (6.81) | 39.1 (12.78) | 56.18 (8.90)
2          | 467.3 (0.54) | 78.9 (3.17)  | 39.0 (6.41) | 22.3 (11.21) | 31.21 (8.01)
4          | 466.4 (0.27) | 71.3 (1.75)  | 38.7 (3.23) | 22.2 (5.63)  | 29.58 (4.23)
8          | 434.0 (0.14) | 67.4 (0.93)  | 36.7 (1.70) | 20.8 (3.01)  | 21.15 (2.96)
16         | 359.2 (0.09) | 54.2 (0.58)  | 31.6 (0.99) | 18.3 (1.71)  | 14.44 (2.16)
32         | 197.8 (0.08) | 41.9 (0.37)  | 29.0 (0.54) | 15.8 (0.99)  | 8.38 (1.87)

[Figure: performance (MFLOP/sec/proc) and time (sec) vs. number of processors for cholqr, cgs, mgs(row), qrf, mgs.]
Stability
[Figure: || A – QR ||_2 / || A ||_2 and || I – Q^T Q ||_2 as functions of kappa_2(A), for m = 100, n = 50.]
02-nanopse
• Performance analysis tools in PESCAN
• Optimization of codes through performance tools
• Portability of codes while keeping good performance
after (nonblocking sends and receives, completed by a single waitall, no barriers):

      ! pack the send buffer
      idum = 1
      do i = 1, nnodes
         do j = 1, ivunpn2(i)
            combuf1(idum) = psi(ivunp2(idum))
            idum = idum + 1
         enddo
      enddo
      !------------------------------------------------------------
      idum1 = 1
      idum2 = 1
      do i = 1, nnodes
         call mpi_isend(combuf1(idum1), ivunpn2(i), mpi_double_complex, &
                        i-1, inode, mpi_comm_world, ireq(2*i-1), ierr)
         idum1 = idum1 + ivunpn2(i)
         call mpi_irecv(combuf2(idum2), ivpacn2(i), mpi_double_complex, &
                        i-1, i, mpi_comm_world, ireq(2*i), ierr)
         idum2 = idum2 + ivpacn2(i)
      enddo
      call mpi_waitall( 2*nnodes, ireq, MPI_STATUSES_IGNORE )
      !------------------------------------------------------------
      ! unpack the receive buffer
      idum = 1
      do i = 1, nnodes
         do j = 1, ivpacn2(i)
            psiy(ivpac2(idum)) = combuf2(idum)
            idum = idum + 1
         enddo
      enddo

before (nonblocking sends, blocking receives, barriers around the exchange):

      call mpi_barrier(mpi_comm_world, ierr)
      idum = 1
      do i = 1, nnodes
         call mpi_isend(combuf1(idum), ivunpn2(i), mpi_double_complex, i-1, &
                        inode, mpi_comm_world, ireq(i), ierr)
         idum = idum + ivunpn2(i)
      enddo
      idum = 1
      do i = 1, nnodes
         call mpi_recv(combuf2(idum), ivpacn2(i), mpi_double_complex, i-1, i, &
                       mpi_comm_world, mpistatus, ierr)
         idum = idum + ivpacn2(i)
      enddo
      do i = 1, nnodes
         call mpi_wait(ireq(i), mpistatus, ierr)
      end do
      call mpi_barrier(mpi_comm_world, ierr)
Original code: 13.2% of the overall time is spent at the barrier.
Modified code: removing most of the barriers shifts part of the problem onto other synchronization points, but we observe a 6% improvement.
• Lots of self-communication. After investigation, some copies can be avoided.
• Try different variants of asynchronous communication, removing barriers, …
Cd 83 – Se 81 on 16 processors:

         | Original code | Without me2me | Async receive, no barrier | Both modifications
Time (s) | 18.23         | 15.42         | 15.29                     | 13.12
ratio    | 1.00          | 0.85          | 0.83                      | 0.72

• In this example, we see that both modifications are useful and that they work well together for small numbers of processors
[Figure: ZB4096_Ecut30 (order = 2,156,241) – time for 50 matrix-vector products on 16, 32, 64 and 128 processors, for the original code, the me2me modification, the without-barrier + asynchronous modification, and all three modifications combined.]
– me2me: 10% improvement on the matrix-vector products, and fewer buffers used
– asynchronous communications are not a good idea here
– barriers are useful for large matrices (this is why the original code has them)
Comparison of different matrix-vector product implementations
General results:

Problem                                  | Original (s) | Modified (s) | Gain
83 Cd – 81 Se (~34K), 16 processors      | 18.23        | 13.23        | 28%
232 Cd – 235 Se (~75K), 16 processors    | 36.00        | 26.16        | 27%
534 Cd – 527 Se, 2x16 processors         | 50.50        | 38.78        | 23%
83 Cd – 81 Se – Ecut 30, 16 processors   | 156.33       | 123.89       | 21%
83 Cd – 81 Se – Ecut 30, 2x16 processors | 103.50       | 69.22        | 34%
This is just one way of doing it: the cleaner solution is to use (and to have) tuned MPI global communications (here MPI_Alltoallv).
03-summa
First rule in linear algebra: have an efficient DGEMM
• All dense linear algebra operations rely on an efficient DGEMM (matrix-matrix multiply)
• This is by far the easiest operation, O(n^3), in dense linear algebra
– So if we cannot implement DGEMM correctly (at peak performance), we will not be able to do much for the other operations
Blocked LU and QR algorithms (LAPACK)
[Figure: LAPACK block LU (right-looking, dgetrf): panel factorization dgetf2, then dtrsm (+ dswp), then dgemm update of the trailing submatrix. LAPACK block QR (right-looking, dgeqrf): panel factorization dgeqf2 + dlarft, then dlarfb update of the trailing submatrix.]
• Panel factorization: latency bounded, more than nb AllReduce operations for n*nb^2 ops
• Update of the trailing submatrix: CPU/bandwidth bounded, the bulk of the computation, n*n*nb ops, highly parallelizable, efficient and scalable
Parallel Distributed MM algorithms
[Figure: C = beta*C + alpha*A*B on a 3x3 processor grid; each processor owns blocks Aij, Bij, Cij of the three matrices.]
Parallel Distributed MM algorithms
• Use the outer-product version of the matrix-matrix multiply algorithm: C is updated with a sum of outer products of block columns of A and block rows of B
PDGEMM:
For k = 1:nb:n,
    C = C + A(:, k:k+nb-1) * B(k:k+nb-1, :)
End For
Parallel Distributed MM algorithms
1. Broadcast of size nb*nloc along the columns, root is active_row.
2. Broadcast of size nloc*nb along the rows, root is active_col.
3. Perform the matrix-matrix multiply: the number of FLOPS is nloc*nloc*nb.
Model
• gamma is the time for one operation, beta is the time to send one entry
• Various algorithms/models exist depending on the broadcast algorithm used (pipeline = SUMMA, tree = PUMMA, etc.)
• Time for computation = (# of ops) * (time for one op) / (# of procs) = 2n^3 gamma / p
• Time for communication = 2 broadcasts on sqrt(p) processors of n^2/sqrt(p) entries each, i.e. about 2 log(sqrt(p)) (n^2/sqrt(p)) beta
• Blocking communication: time = time comp + time comm, so
  perf = 2n^3 / ( 2n^3 gamma/p + 2 log(sqrt(p)) n^2 beta/sqrt(p) )
• Nonblocking communication: time = max( time comp, time comm ), so
  perf = 2n^3 / max( 2n^3 gamma/p, 2 log(sqrt(p)) n^2 beta/sqrt(p) )
Model
• gamma is the time for one operation, beta is the time to send one entry; various algorithms/models exist depending on the broadcast algorithm used (pipeline = SUMMA, tree = PUMMA, etc.)
• For the TN cluster of PS3s: bandwidth = 600 MB/sec (with GigaBit Ethernet), flop rate = 149.85 GFLOPs/sec
• Blocking: perf = 2n^3 / ( 2n^3 gamma/p + 2 log(sqrt(p)) n^2 beta/sqrt(p) )
• Nonblocking: perf = 2n^3 / max( 2n^3 gamma/p, 2 log(sqrt(p)) n^2 beta/sqrt(p) )
jacquard.nersc.gov
• Processor type: Opteron 2.2 GHz
• Processor theoretical peak: 4.4 GFlops/sec
• Number of application processors: 712
• System theoretical peak (computational nodes): 3.13 TFlops/sec
• Number of shared-memory application nodes: 356
• Processors per node: 2
• Physical memory per node: 6 GBytes
• Usable memory per node: 3-5 GBytes
• Switch interconnect: InfiniBand
• Switch MPI unidirectional latency: 4.5 usec
• Switch MPI unidirectional bandwidth (peak): 620 MB/s
• Global shared disk: GPFS, usable disk space 30 TBytes
• Batch system: PBS Pro
Mvapich vs FTMPI
[Figure: GFLOPs/sec/proc vs. number of processors for the two MPI implementations.]
Modeling
Performance model: to get 90% efficiency with nonblocking operations, one needs to work on matrices of size 14,848; to get 90% efficiency with blocking operations, one needs matrices of size 146,949.
And in practice things are really worse, even with nonblocking communication.
Take for instance the TN cluster of four PS3s: bandwidth = 600 Mb/sec (with GigaBit Ethernet), flop rate = 149.85 GFLOPs/sec (the theoretical peak is 153.6 GFLOPs/sec).
Alfredo Buttari, Jakub Kurzak, and Jack Dongarra. Limitations of the playstation 3 for high performance cluster computing. Technical Report UT-CS-07-597, Innovative Computing Laboratory, University of Tennessee Knoxville, April 2007.
Oops… the memory of a PS3 is 256 MB.
There is no way to hide the n^2 term (communication) behind the n^3 term (computation): n cannot get big enough. Actually, it is the computation that is hidden by the communication. Dense linear algebra is stuck; there is nothing to be done.
Three Solutions
• Increase the memory of the nodes
• Increase the bandwidth of the network
• Decrease the computational power of the nodes
Instead of 600 Mb/sec – 256 MB – 6 SPEs:
• 600 Mb/sec – 3.3 GB – 6 SPEs
• 1.39 Gb/sec – 256 MB – 6 SPEs
• 600 Mb/sec – 256 MB – 2 SPEs
Three ways to complain:
• The network is too slow (complain to GigE)
• There is not enough memory on the nodes (complain to Sony)
• The nodes are too fast (complain to IBM)
Groups and Communicators
• Break MPI_COMM_WORLD into smaller sets of processes that have a specific relationship
• Each communicator has a group of processes attached, which are all the processes that can be contacted using this communicator
– The processes are indexed in the group by their rank, which is contiguous and starts from 0
Groups vs. Communicators
• The BIG difference
– Groups are local entities while communicators are global
[Figure: processes 0-7 of MPI_COMM_WORLD, partitioned into Communicator_1 and Communicator_2.]
Operations on Groups
• Retrieve the rank and the size
• Translate ranks from one group to another, compare 2 groups
• Constructors: create one group from another based on a defined relationship
– Creating a group is a local operation
– Once created, the group is not attached to any communicator (i.e. no communication is possible)
Group Constructors
• Extract the group from a communicator (MPI_Comm_group)
• Union, intersection or difference of two other groups:
– Union: returns a group with all processes from group1 followed by the processes of group2 not already in group1
– Intersection: contains all processes that are in both groups, ordered as in group1
– Difference: contains all processes that are in group1 but not in group2, ordered as in group1
Example
• Let group1 = {a,b,c,d,e,f,g} and group2 = {d,g,a,c,h,i}
• Union(group1, group2): newgroup = {a,b,c,d,e,f,g,h,i}
• Intersection(group1, group2): newgroup = {a,c,d,g}
• Difference(group1, group2): newgroup = {b,e,f}
• Union(group2, group1): newgroup = {d,g,a,c,h,i,b,e,f}
• Intersection(group2, group1): newgroup = {d,g,a,c}
• Difference(group2, group1): newgroup = {h,i}
Group Constructors
• Inclusion and exclusion, by element or by range: MPI_Group_*(group, n, ranks, newgroup)
• When by element, ranks is a simple array of integers indicating which ranks from the old group go into the new group
• When by range, ranks is an array of triples (int ranks[][3]) containing the start rank, the end rank and the stride
• For exclusion, the order of the processes in the resulting group is that of the original group; for inclusion, it follows the order given in ranks
Operations on Communicators
• Retrieve the size and the rank
• Compare 2 communicators:
– MPI_IDENT: they are handles to the same object
– MPI_CONGRUENT: they have the same group attributes
– MPI_SIMILAR: they contain the same processes, in a different order
– MPI_UNEQUAL: in all other cases
Communicator Constructors
• MPI_Comm_dup(oldcomm, newcomm)
– The basic communicator duplication
– The newcomm has the same attributes as oldcomm
• MPI_Comm_create(oldcomm, group, newcomm)
– The newcomm contains only the processes in the group
– MPI_COMM_NULL is returned to all other processes
– 2 requirements: group must contain a subset of the processes of oldcomm, and all processes must use the same group
Communicator Constructors
• MPI_Comm_split(oldcomm, color, key, newcomm)
– Creates as many groups and communicators as there are distinct values of color
– The rank in the new group is determined by the value of key; ties are broken according to the ranking in oldcomm
– MPI_UNDEFINED can be used as the color for processes which will not be included in any of the new communicators
Example
• MPI_Comm_split

rank   0  1  2  3  4  5  6  7  8  9  10
proc   a  b  c  d  e  f  g  h  i  j  k
color  U  3  1  1  3  7  3  3  1  U  3
key    0  1  2  3  1  9  3  8  1  0  0

• 3 new communicators are created (U = MPI_UNDEFINED gives MPI_COMM_NULL)
– Color = 3 : {b:1, e:1, g:3, h:8, k:0} -> {k, b, e, g, h}
– Color = 1 : {c:2, d:3, i:1} -> {i, c, d}
– Color = 7 : {f:9} -> {f}
03 - summa (pumma)
Data-types
Data Representation
• Different across different machines
– Length: 32 vs. 64 bits (vs. …?)
– Endianness: big vs. little
• Problems
– No standard for the data lengths in the programming languages (C/C++)
– No standard floating point data representation
• IEEE Standard 754 floating point numbers
– Subnormals, infinities, NaNs …
• Same representation, but possibly different lengths
MPI Datatypes
• MPI uses “datatypes” to:
– Efficiently represent and transfer data
– Minimize memory usage, even between heterogeneous systems
• Used in most communication functions (MPI_SEND, MPI_RECV, etc.) and in file operations
• MPI contains a large number of pre-defined datatypes
Some of MPI’s Pre-Defined Datatypes
MPI_Datatype        | C datatype        | Fortran datatype
MPI_CHAR            | signed char       | CHARACTER
MPI_SHORT           | signed short int  | INTEGER*2
MPI_INT             | signed int        | INTEGER
MPI_LONG            | signed long int   |
MPI_UNSIGNED_CHAR   | unsigned char     |
MPI_UNSIGNED_SHORT  | unsigned short    |
MPI_UNSIGNED        | unsigned int      |
MPI_UNSIGNED_LONG   | unsigned long int |
MPI_FLOAT           | float             | REAL
MPI_DOUBLE          | double            | DOUBLE PRECISION
MPI_LONG_DOUBLE     | long double       | DOUBLE PRECISION*8
Datatype Conversion
• “Data sent = data received”
• 2 types of conversions:
– Representation conversion: change the binary representation (e.g., hex floating point to IEEE floating point)
– Type conversion: convert between different types (e.g., int to float)
Only representation conversion is allowed
Datatype Conversion

Correct (matching types, representation conversion only):
    if (my_rank == root)
        MPI_Send(msg, 1, MPI_INT, …)
    else
        MPI_Recv(msg, 1, MPI_INT, …)

Erroneous (type conversion int -> float is not allowed):
    if (my_rank == root)
        MPI_Send(msg, 1, MPI_INT, …)
    else
        MPI_Recv(msg, 1, MPI_FLOAT, …)
Datatype Specifications
• Type signature– Used for message matching
{ type0, type1, …, typen }
• Type map– Used for local operations
{ (type0, disp0), (type1, disp1),…, (typen, dispn) }
It’s all about the memory layout
User-Defined Datatypes
• Applications can define unique datatypes
– Composition of other datatypes
– MPI functions provided for common patterns
• Contiguous
• Vector
• Indexed
• …
Always reduces to a type map of pre-defined datatypes
Contiguous Blocks
• Replication of the datatype into contiguous locations
MPI_Type_contiguous( 3, oldtype, newtype )

MPI_TYPE_CONTIGUOUS( count, oldtype, newtype )
  IN  count     replication count (positive integer)
  IN  oldtype   old datatype (MPI_Datatype handle)
  OUT newtype   new datatype (MPI_Datatype handle)
Vectors
• Replication of a datatype into locations that consist of equally spaced blocks
MPI_Type_vector( 7, 2, 3, oldtype, newtype )

MPI_TYPE_VECTOR( count, blocklength, stride, oldtype, newtype )
  IN  count        number of blocks (positive integer)
  IN  blocklength  number of elements in each block (positive integer)
  IN  stride       number of elements between the starts of consecutive blocks (integer)
  IN  oldtype      old datatype (MPI_Datatype handle)
  OUT newtype      new datatype (MPI_Datatype handle)
Indexed Blocks
• Replication of an old datatype into a sequence of blocks, where each block can contain a different number of copies and have a different displacement
MPI_TYPE_INDEXED( count, array_of_blocks, array_of_displs, oldtype, newtype )
  IN  count    number of blocks (positive integer)
  IN  a_of_b   number of elements per block (array of positive integers)
  IN  a_of_d   displacement of each block from the beginning, in multiples of oldtype (array of integers)
  IN  oldtype  old datatype (MPI_Datatype handle)
  OUT newtype  new datatype (MPI_Datatype handle)
Indexed Blocks
array_of_blocklengths[] = { 2, 3, 1, 2, 2, 2 }
array_of_displs[] = { 0, 3, 10, 13, 16, 19 }
MPI_Type_indexed( 6, array_of_blocklengths,
array_of_displs, oldtype, newtype )
[Figure: blocks B[0]–B[5] of the given lengths placed at displacements D[0]–D[5].]
Datatype Composition
• Each of these constructors is a superset of the previous one:
CONTIGUOUS < VECTOR < INDEXED
• Extend the description of the datatype by allowing more complex memory layouts
– Not all data structures fit the common patterns
– Not all data structures can be described as compositions of others
“H” Functions
• Displacements are not given in multiples of another datatype
• Instead, displacements are in bytes
– MPI_TYPE_HVECTOR
– MPI_TYPE_HINDEXED
• Otherwise, similar to their non-“H” counterparts
Arbitrary Structures
• The most general datatype constructor
• Allows each block to consist of replication of different datatypes
MPI_TYPE_CREATE_STRUCT( count, array_of_blocklength, array_of_displs, array_of_types, newtype )
  IN  count    number of entries in each array (positive integer)
  IN  a_of_b   number of elements in each block (array of integers)
  IN  a_of_d   byte displacement of each block (array of Aint)
  IN  a_of_t   type of elements in each block (array of MPI_Datatype handles)
  OUT newtype  new datatype (MPI_Datatype handle)
Arbitrary Structures
struct {
    int   i[3];
    float f[2];
} array[100];

array_of_lengths[] = { 2, 1 };
array_of_displs[]  = { 0, 3*sizeof(int) };
array_of_types[]   = { MPI_INT, MPI_FLOAT };

MPI_Type_struct( 2, array_of_lengths, array_of_displs, array_of_types, newtype );

[Figure: memory layout int int int float float; length[0]/displs[0] describe the int block, length[1]/displs[1] the float block.]
MPI_GET_ADDRESS
• Allows all languages to compute displacements
– Necessary in Fortran
– Usually unnecessary in C (e.g., “&foo”)

MPI_GET_ADDRESS( location, address )
  IN  location  location in the caller's memory (choice)
  OUT address   address of the location (address integer)
And Now the Dark Side…
• Sometimes more complex memory layouts have to be expressed
[Figure: a datatype ddt whose extent covers more than its interesting part; MPI_Send( buf, 3, ddt, … ) sends three consecutive extents, each contributing one datatype.]
Typemap = { (type0, disp0), …, (typen, dispn) }

lb(Typemap) = min_j disp_j                              if no entry has type lb
              min_j { disp_j such that type_j = lb }    otherwise

ub(Typemap) = max_j ( disp_j + sizeof(type_j) ) + align if no entry has type ub
              max_j { disp_j such that type_j = ub }    otherwise
Lower-Bound and Upper-Bound Markers
• Define datatypes with “holes” at the beginning or end
• 2 pseudo-types: MPI_LB and MPI_UB
– Used with MPI_TYPE_STRUCT
MPI_LB and MPI_UB
displs = ( -3, 0, 6 )
blocklengths = ( 1, 1, 1 )
types = ( MPI_LB, MPI_INT, MPI_UB )
MPI_Type_struct( 3, blocklengths, displs, types, type1 )
MPI_Type_contiguous( 3, type1, type2 )
Typemap= { (lb, -3), (int, 0), (ub, 6) }
Typemap= { (lb, -3), (int, 0), (int, 9), (int, 18), (ub, 24) }
Typemap = { (type0, disp0), …, (typen, dispn) }
true_lb(Typemap) = min_j { disp_j : type_j != lb }
true_ub(Typemap) = max_j { disp_j + sizeof(type_j) : type_j != ub }
True Lower-Bound andTrue Upper-Bound Markers
• Define the real extent of the datatype: the amount of memory needed to copy the datatype
• TRUE_LB defines the lower bound ignoring all the MPI_LB markers
Information About Datatypes
MPI_TYPE_GET_{TRUE_}EXTENT( datatype, {true_}lb, {true_}extent )
IN datatype the datatype (MPI_Datatype handle)
OUT {true_}lb {true} lower-bound of datatype (MPI_Aint)
OUT {true_}extent {true} extent of datatype (MPI_Aint)
MPI_TYPE_SIZE( datatype, size )
IN datatype the datatype (MPI_Datatype handle)
OUT size datatype size (integer)
(Diagram: the extent spans from lb to ub including the markers; the true extent spans only the real data; the size counts the bytes actually transferred.)
Test your Data-type skills
• Imagine the following architecture:
  – Integer size is 4 bytes
  – Cache line is 16 bytes
• We want to create a datatype containing the second integer from each cache line, repeated three times
• How many ways are there?
Solution 1

MPI_Datatype array_of_types[] = { MPI_INT, MPI_INT, MPI_INT, MPI_UB };
MPI_Aint start, array_of_displs[4];
int array_of_lengths[] = { 1, 1, 1, 1 };
struct one_by_cacheline c[4];

MPI_Get_address( &c[0], &start );
MPI_Get_address( &c[0].i[1], &array_of_displs[0] );
MPI_Get_address( &c[1].i[1], &array_of_displs[1] );
MPI_Get_address( &c[2].i[1], &array_of_displs[2] );
MPI_Get_address( &c[3], &array_of_displs[3] );

for( i = 0; i < 4; i++ )
    array_of_displs[i] -= start;

MPI_Type_create_struct( 4, array_of_lengths, array_of_displs, array_of_types, &newtype );
newtype
Solution 2

MPI_Datatype array_of_types[] = { MPI_INT, MPI_UB };
MPI_Aint start, array_of_displs[2];
int array_of_lengths[] = { 1, 1 };
struct one_by_cacheline c[2];

MPI_Get_address( &c[0], &start );
MPI_Get_address( &c[0].i[1], &array_of_displs[0] );
MPI_Get_address( &c[1], &array_of_displs[1] );

array_of_displs[0] -= start;
array_of_displs[1] -= start;
MPI_Type_create_struct( 2, array_of_lengths, array_of_displs, array_of_types, &temp_type );
MPI_Type_contiguous( 3, temp_type, &newtype );
temp_type
newtype
Data-type for triangular matrices
Application Cholesky QR
EXAMPLE OF CONSTRUCTION OF DATA_TYPE FOR TRIANGULAR MATRICES, EXAMPLE OF
MPI_OP ON TRIANGULAR MATRICES
• See:– choleskyqr_A_v1.c– choleskyqr_B_v1.c– LILA_mpiop_sum_upper.c
• Starting from choleskyqr_A_v0.c, this
  – shows how to construct a datatype for a triangular matrix
  – shows how to use an MPI_Op on the datatype for an AllReduce operation
  – here we simply want to sum the upper triangular matrices together
TRICK FOR TRIANGULAR MATRICES DATATYPES
• See:– check_orthogonality_RFP.c– choleskyqr_A_v2.c– choleskyqr_A_v3.c– choleskyqr_B_v2.c– choleskyqr_B_v3.c
• A trick that uses the RFP format to do a fast allreduce on P triangular matrices without datatypes. The trick is at the user level.
Idea behind RFP
• Rectangular full packed format
• Just be careful to distinguish the odd and even matrix-size cases
Collective Communications
Collective Communications
• Involves all processes in a communicator
  – All processes must participate
  – May be a subset of all running processes
  – May be more than what you started with
• Blocking (logical) semantics only
  – BUT: some processes may not block
  – Except barrier: all processes block until each process has reached the barrier
Operations

• MPI defines several collective operations
  – Some are rooted (e.g., broadcast)
  – Others are rootless (e.g., barrier)
• “Collectives” generally refers to data-passing collective operations
  – Although technically it also refers to any action in MPI where all processes in a communicator must participate
  – Example: communicator maintenance
Barrier Synchronization
• Logical operation
  – All processes block until each has reached the barrier
  – The official MPI synchronization call
(Diagram: processes A, B, and C reach the barrier at different times; all continue only after the last one arrives.)
MPI_Barrier( comm )
Broadcast
• Logical operation
  – Send data from one node (the “root”) to all the rest
• A broadcast is not a synchronization (!!!)
MPI_Bcast(buffer, cnt, type, root, comm )
Gather
• Logical operation
  – Obtain data from each node and assemble it in a root process
  – Receive arguments are only meaningful at the root
  – Each process sends the same amount of data
  – Root can use MPI_IN_PLACE
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm)
Gatherv
• The vector variant of the gather operation
• Each process participates with a different amount of data
• Allow the root to specify where the data goes
• No overwrite (!!!)
MPI_Gatherv(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, displs, recvtype, root, comm)
Scatter
• Logical operation
  – Opposite of gather
  – Send a portion of the root’s buffer to each process
  – Root can use MPI_IN_PLACE for the receive buffer
MPI_Scatter(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm)
Scatterv
• The logical extension of the scatter
  – No portion of the sendbuf can be sent more than once
– Root can use MPI_IN_PLACE for the receive buffer
MPI_Scatterv(sendbuf, sendcnt, displs, sendtype, recvbuf, recvcnt, recvtype, root, comm)
All Gather

• Same as gather except
all processes get the full result
• MPI_IN_PLACE can be used on all processes instead of the sendbuf
• Equivalent to a gather followed by a broadcast
• There is a v version
MPI_Allgather(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, comm)
All to All
• Logical operation
  – Combined scatter and gather
  – Not an all-broadcast
• Uniform and vector versions defined
MPI_Alltoall(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, comm)
Global Reduction

• Logical operation
  – Mathematical reduction
• Pre-defined MPI operations
  – Min, max, sum, …
  – Always commutative and associative
• User-defined operations
MPI_Reduce(sendbuf, recvbuf, count, type, op, root, comm)
All Reduce
• Logical operation
  – Reduce where all processes get the result
  – Similar to a reduce followed by a broadcast
MPI_Allreduce(sendbuf, recvbuf, count, type, op, comm)
Reduce and Scatter
• Logical operation
  – Global reduction
  – Scatterv the result
MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, type, op, comm)
Scan
• Logical operation
  – Mathematical scan
  – Both inclusive (MPI_Scan) and exclusive (MPI_Exscan) versions
• All processes get a result
  – Except process 0 in an exclusive scan, whose recvbuf is undefined
MPI_Scan(sendbuf, recvbuf, count, type, op, comm)
User defined MPI_Op
• MPI_Op_create( function, commute, op )
• MPI_Op_free( op )
• If commute is true, the operation is assumed to be commutative
• Function is a user-defined function with 4 arguments:
  – invec: the input vector
  – inoutvec: the input and output vector
  – count: the number of elements
  – datatype: the data-type description
• Result
  – inoutvec[i] = invec[i] op inoutvec[i] for i in [0..count-1]
User defined MPI_Op
MPI_OP to compute || x || (for Gram-Schmidt)
Weirdest MPI_OP ever: motivation & results
Weirdest MPI_OP ever: how to attach attributes to a datatype