CSCI 317 Mike Heroux 1
MPI-izing Your Program
CSCI 317, Mike Heroux
Simple Example
• Example: Find the max of n positive numbers.
  – Way 1: Single processor (SISD, for comparison).
  – Way 2: Multiple processors, single memory space (SPMD/SMP).
  – Way 3: Multiple processors, multiple memory spaces (SPMD/DMP).
SISD Case

maxval = 0; /* Initialize */
for (i = 0; i < n; i++)
  maxval = (val[i] > maxval) ? val[i] : maxval;

[Figure: one processor attached to one memory holding val[0] … val[n-1].]
SPMD/SMP Case

maxval = 0;
#pragma omp parallel default(none) shared(maxval, val, n)
{
  int localmax = 0;
#pragma omp for
  for (int i = 0; i < n; ++i)
    localmax = (val[i] > localmax) ? val[i] : localmax;
#pragma omp critical
  {
    maxval = (maxval > localmax) ? maxval : localmax;
  }
}

[Figure: four processors (0-3) sharing a single memory that holds val[0…n-1].]
SPMD/DMP Case (np=4, n=16)

maxval = 0;
localmax = 0;
for (i = 0; i < 4; i++)
  localmax = (localmax > val[i]) ? localmax : val[i];
MPI_Allreduce(&localmax, &maxval, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

[Figure: four processors (p = 0…3), each with its own memory holding a local val[0…3] that corresponds to global val[0…3], val[4…7], val[8…11], and val[12…15], respectively; the processors are connected by a network.]
Shared Memory Model Overview
• All processes share the same memory image.
• Parallelism is often achieved by having processors take iterations of a for-loop that can be executed in parallel.
• Examples: OpenMP, Intel TBB.
Message Passing Overview
• SPMD/DMP programming requires "message passing".
• Traditional two-sided message passing:
  – Node p sends a message.
  – Node q receives it.
  – p and q are both involved in the transfer of data.
  – Data are sent/received by calling library routines.
• One-sided message passing (mentioned only here):
  – Node p puts data into the memory of node q, or
  – Node p gets data from the memory of node q.
  – Node q is not involved in the transfer.
  – Puts and gets are done by library calls.
MPI - Message Passing Interface
• The most commonly used message passing standard.
• The focus of intense optimization by computer system vendors.
• MPI-2 includes I/O support and one-sided message passing.
• The vast majority of today's scalable applications run on top of MPI.
• Supports derived data types and communicators.
Hybrid DMP/SMP Models
• Many applications exhibit a coarse-grain parallel structure and a simultaneous fine-grain parallel structure nested within the coarse.
• Many parallel computers are essentially clusters of SMP nodes.
  – SMP parallelism is possible within a node.
  – DMP is required across nodes.
• This compels us to consider programming models where, for example, MPI runs across nodes and OpenMP runs within nodes.
First MPI Program
• A simple program to measure:
  – Asymptotic bandwidth (send big messages).
  – Latency (send zero-length messages).
• Works with exactly two processors.
SimpleCommTest.cpp
• Go to SimpleCommTest.cpp.
• Download on a Linux system.
• Setup:
  – module avail (locate an MPI environment, GCC or Intel).
  – module load …
• Compile/run:
  – mpicxx SimpleCommTest.cpp
  – mpirun -np 2 a.out
  – Try: mpirun -np 4 a.out. Why does it fail? How?
Going from Serial to MPI
• One of the most difficult aspects of DMP is that there is no incremental way to parallelize your existing full-featured code.
• Either a code runs in DMP mode or it doesn't.
• One way to address this problem is to:
  – Start with a stripped-down version of your code.
  – Parallelize it and incrementally introduce features into the code.
• We will take this approach.
Parallelizing CG
• To have a parallel CG solver we need to:
  – Introduce MPI_Init/MPI_Finalize into main.cc.
  – Provide parallel implementations of:
    • waxpby.cpp, compute_residual.cpp, ddot.cpp (easy).
    • HPCCG.cpp (also easy).
    • HPC_sparsemv.cpp (hard).
• Approach:
  – Do the easy stuff.
  – Replace (temporarily) the hard stuff with easy.
Parallelizing waxpby
• How do we parallelize waxpby?
• Easy: You are already done!!
Parallelizing ddot
• Parallelizing ddot is very straightforward given MPI:

// Reduce what you own on a processor.
ddot(my_nrow, x, y, &my_result);

// Use MPI's reduce function to collect all partial sums.
MPI_Allreduce(&my_result, &result, 1, MPI_DOUBLE, MPI_SUM,
              MPI_COMM_WORLD);

• Note: The same approach works for compute_residual; replace MPI_SUM with MPI_MAX.
• Note: There is a bug in the current version!
Distributed Memory Sparse Matrix-vector Multiplication

Overview
• Distributed sparse MV is the most challenging kernel of parallel CG.
• Communication is determined by:
  – The sparsity pattern.
  – The distribution of equations.
• Thus, the communication pattern must be determined dynamically, i.e., at run time.
Goals
• Computation should be local.
  – We want to use our best serial (or SMP) sparse MV kernels.
  – We must transform the matrices to make things look local.
• Speed (obvious). How:
  – Keep a balance of work across processors.
  – Minimize the number of off-processor elements needed by each processor.
  – Note: This goes back to the basic questions: "Who owns the work, who owns the data?"
Example

    w = A * x, where

    A = | 11  12   0  14 |     x = | x1 |     w = | w1 |
        | 21  22   0  24 |         | x2 |         | w2 |
        |  0   0  33  34 |         | x3 |         | w3 |
        | 41  42  43  44 |         | x4 |         | w4 |

Rows 1-2 of A, x1-x2, and w1-w2 are on PE 0; rows 3-4, x3-x4, and w3-w4 are on PE 1.

Need to:
• Transform A on each processor (localize).
• Communicate x4 from PE 1 to PE 0.
• Communicate x1, x2 from PE 0 to PE 1.
On PE 0

    local A = | 11  12  14 |     local x = | x1 |     local w = | w1 |
              | 21  22  24 |                | x2 |               | w2 |
                                            | x4 |  (copy of PE 1's value)

Note: A is now 2x3. Prior to calling sparse MV, PE 0 must get x4. Special note: global variable x4 is local x2 on PE 1 and is stored as local x3 on PE 0.
On PE 1

    local A = | 33  34   0   0 |     local x = | x3 |     local w = | w3 |
              | 43  44  41  42 |                | x4 |               | w4 |
                                                | x1 |  (copies of PE 0's
                                                | x2 |   values)

Note: A is now 2x4. Prior to calling sparse MV, PE 1 must get x1, x2. Special note: global variables get remapped to local positions: x3 -> x1, x4 -> x2, x1 -> x3, x2 -> x4.
To Compute w = Ax
• Once the global matrix is transformed, computing sparse MV is:
  – Step one: copy the needed elements of x.
    • Send x4 from PE 1 to PE 0.
      – NOTE: x4 is stored as x2 on PE 1 and will be in x3 on PE 0!
    • Send x1 and x2 from PE 0 to PE 1.
      – NOTE: They will be stored as x3 and x4, respectively, on PE 1!
  – Step two: call sparsemv to compute w.
    • PE 0 will compute w1 and w2.
    • PE 1 will compute w3 and w4.
    • NOTE: The call of sparsemv on each processor has no knowledge that it is running in parallel!
Observations
• This approach to computing sparse MV keeps all computation local.
  – It achieves the first goal.
• We still need to look at:
  – Balancing work.
  – Minimizing communication (minimizing the number of transfers of x entries).
HPCCG with MPI
• Edit Makefile:
  – Uncomment USE_MPI = -DUSING_MPI.
  – Switch to CXX and LINKER = mpicxx.
  – DON'T uncomment MPI_INC (mpicxx handles this).
• To run:
  – module avail (locate an MPI environment, GCC or Intel).
  – module load …
  – mpirun -np 4 test_HPCCG 100 100 100
• This will run on four processors, each with a 100-cubed local problem.
• The global size is 100 by 100 by 400.
Computational Complexity of Sparse_MV

for (i = 0; i < nrow; i++) {
  double sum = 0.0;
  const double * const cur_vals = ptr_to_vals_in_row[i];
  const int    * const cur_inds = ptr_to_inds_in_row[i];
  const int cur_nnz = nnz_in_row[i];
  for (j = 0; j < cur_nnz; j++)
    sum += cur_vals[j] * x[cur_inds[j]];
  y[i] = sum;
}

How many adds/multiplies?
Balancing Work
• The complexity of sparse MV is 2*nz.
  – nz is the number of nonzero terms.
  – We have nz adds and nz multiplies.
• To balance the work we should have the same nz on each processor.
• Note:
  – There are other factors, such as cache hits, that affect sparse MV performance.
  – Addressing these is an area of research.
Example: y = Ax
Pattern of A (X = nonzero)
X X 0 0 0 0 0 0
X X 0 0 0 0 0 0
0 0 X X 0 0 0 0
0 0 X X 0 0 0 0
0 0 0 0 X X 0 0
0 0 0 0 X X 0 0
0 0 0 0 0 0 X X
0 0 0 0 0 0 X X
Example 2: y = Ax
Pattern of A (X = nonzero)
X X 0 0 X X 0 0
X X 0 0 X 0 0 0
0 0 X X 0 0 0 0
0 X X X 0 0 0 0
0 X 0 0 X X 0 0
0 0 0 0 X X 0 0
0 0 0 0 X 0 X X
X 0 0 0 0 0 X X
Example 3: y = Ax
Pattern of A (X = nonzero)
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
Matrices and Graphs
• There is a close connection between sparse matrices and graphs.
• A graph is defined to be:
  – A set of vertices,
  – With a corresponding set of edges.
  – An edge exists if there is a connection between two vertices.
• Example: an electric power grid.
  – Substations are vertices.
  – Power lines are edges.
The Graph of a Matrix
• Let the equations of a matrix be considered as vertices.
• An edge exists between two vertices j and k if there is a nonzero value ajk or akj.
• Let's see an example...
6x6 Matrix and Graph

    A = | a11  0    0    0    0    a16 |
        | 0    a22  a23  0    0    0   |
        | 0    a32  a33  a34  a35  0   |
        | 0    0    a43  a44  0    0   |
        | 0    0    a53  0    a55  a56 |
        | a61  0    0    0    a65  a66 |

[Figure: the corresponding graph on vertices 1-6, with edges {1,6}, {2,3}, {3,4}, {3,5}, {5,6}.]
“Tapir” Matrix (John Gilbert)
Corresponding Graph
2-way Partitioned Matrix and Graph

    A = | a11  0    0    0    0    a16 |
        | 0    a22  a23  0    0    0   |
        | 0    a32  a33  a34  a35  0   |
        | 0    0    a43  a44  0    0   |
        | 0    0    a53  0    a55  a56 |
        | a61  0    0    0    a65  a66 |

[Figure: the graph on vertices 1-6 split into two partitions.]

Questions:
• How many elements must go from PE 0 to 1 and from 1 to 0?
• Can we reduce this number? Yes! Try:

[Figure: the same graph with a different 2-way partition.]
3-way Partitioned Matrix and Graph

    A = | a11  0    0    0    0    a16 |
        | 0    a22  a23  0    0    0   |
        | 0    a32  a33  a34  a35  0   |
        | 0    0    a43  a44  0    0   |
        | 0    0    a53  0    a55  a56 |
        | a61  0    0    0    a65  a66 |

[Figure: the graph on vertices 1-6 split into three partitions.]

Questions:
• How many elements must go from PE 1 to 0, 2 to 0, 0 to 1, 2 to 1, 0 to 2, and 1 to 2?
• Can we reduce these numbers? Yes!

[Figure: the same graph with a different 3-way partition.]
Permuting a Matrix and Graph

[Figure: the graph on vertices 1-6 before and after renumbering.]

This defines a permutation p where:
  p(1) = 1, p(2) = 3, p(3) = 4, p(4) = 6, p(5) = 5, p(6) = 2

p can be expressed as a matrix also:

    P = | 1 0 0 0 0 0 |
        | 0 0 0 0 0 1 |
        | 0 1 0 0 0 0 |
        | 0 0 1 0 0 0 |
        | 0 0 0 0 1 0 |
        | 0 0 0 1 0 0 |
Properties of P
• P is a "rearrangement" of the identity matrix.
• P^-1 = P^T, that is, the inverse is the transpose.
• Let B = PAP^T, y = Px, c = Pb.
• The solution of By = c is the same as the solution of (PAP^T)(Px) = (Pb), which is the same as the solution of Ax = b, because Px = y, so x = P^T Px = P^T y.
• Idea: Find a permutation P that minimizes communication.

    P = | 1 0 0 0 0 0 |
        | 0 0 0 0 0 1 |
        | 0 1 0 0 0 0 |
        | 0 0 1 0 0 0 |
        | 0 0 0 0 1 0 |
        | 0 0 0 1 0 0 |
Permuting a Matrix and Graph

    A = | a11  0    0    0    0    a16 |
        | 0    a22  a23  0    0    0   |
        | 0    a32  a33  a34  a35  0   |
        | 0    0    a43  a44  0    0   |
        | 0    0    a53  0    a55  a56 |
        | a61  0    0    0    a65  a66 |

    B = PAP^T = | a11  a16  0    0    0    0   |
                | a61  a66  0    0    a65  0   |
                | 0    0    a22  a23  0    0   |
                | 0    0    a32  a33  a35  a34 |
                | 0    a56  0    a53  a55  0   |
                | 0    0    0    a43  0    a44 |

    P = | 1 0 0 0 0 0 |
        | 0 0 0 0 0 1 |
        | 0 1 0 0 0 0 |
        | 0 0 1 0 0 0 |
        | 0 0 0 0 1 0 |
        | 0 0 0 1 0 0 |
Communication Costs and Edge Separators
• Note that the number of elements of x that we must transfer for sparse MV is related to the edge separator.
• Minimizing the edge separator is equivalent to minimizing communication.
• Goal: Find a permutation P that minimizes the edge separator.
• Let's look at a few examples…
32768 x 32768 Matrix on 8 Processors: "Natural Ordering"
32768 x 32768 Matrix on 8 Processors: Better Ordering
MFLOP Results

No. PEs   Natural Ordering   "Best" Ordering
      1               41.6              41.6
      2               77.3              77.3
      4              111.5             139.2
      8              201               217
     16              161               183
Edge Cuts

No. PEs   Natural Ordering   "Best" Ordering
      1                  0                 0
      2               1024              1024
      4               2048              1056
      8               2048               817
     16               2048               842
Message Passing Flexibility
• Message passing (specifically MPI):
  – Each process runs independently in separate memory.
  – Can run across multiple machines.
  – Portable across any processor configuration.
• Shared memory parallel:
  – Parallelism is restricted by what?
    • The number of shared memory processors.
    • The amount of memory.
    • Contention for shared resources. Which ones?
      – Memory and channels, I/O speed, disks, …
MPI-capable Machines
• Which machines are MPI-capable?
  – Beefy. How many processors, how much memory?
    • 8 processors, 48 GB.
  – Beast?
    • 48 processors, 64 GB.
  – PE212 machines. How many processors?
    • 24 machines x 4 cores = 96 cores!!! And x 4 GB = 96 GB!!!
pe212hostfile
• List of machines.
• Requirement: passwordless ssh access.

% cat pe212hostfile
lin2
lin3
…
lin24
lin1
mpirun on lab systems

% mpirun --machinefile pe212hosts --verbose -np 96 test_HPCCG 100 100 100
Initial Residual = 9898.82
Iteration = 15   Residual = 24.5534
Iteration = 30   Residual = 0.167899
Iteration = 45   Residual = 0.00115722
Iteration = 60   Residual = 7.97605e-06
Iteration = 75   Residual = 5.49743e-08
Iteration = 90   Residual = 3.78897e-10
Iteration = 105  Residual = 2.6115e-12
Iteration = 120  Residual = 1.79995e-14
Iteration = 135  Residual = 1.24059e-16
Iteration = 149  Residual = 1.19153e-18
Time spent in CG = 47.2836
Number of iterations = 149.
Final residual = 1.19153e-18.
Lab system performance (96 cores)

********** Performance Summary (times in sec) ***********
Total Time/FLOPS/MFLOPS = 47.2836/9.15456e+11/19360.9.
DDOT Time/FLOPS/MFLOPS = 22.6522/5.7216e+10/2525.84.
  Minimum DDOT MPI_Allreduce time (over all processors) = 4.43231
  Maximum DDOT MPI_Allreduce time (over all processors) = 22.0402
  Average DDOT MPI_Allreduce time (over all processors) = 12.7467
WAXPBY Time/FLOPS/MFLOPS = 4.31466/8.5824e+10/19891.3.
SPARSEMV Time/FLOPS/MFLOPS = 14.7636/7.72416e+11/52319.
SPARSEMV MFLOPS W OVRHEAD = 36522.8.
SPARSEMV PARALLEL OVERHEAD Time = 6.38525 ( 30.192 % ).
  SPARSEMV PARALLEL OVERHEAD (Setup) Time = 0.835297 ( 3.94961 % ).
  SPARSEMV PARALLEL OVERHEAD (Bdry Exchange) Time = 5.54995 ( 26.2424 % ).
Difference between computed and exact = 1.39888e-14.
Lab system performance (48 cores)

% mpirun --bynode --machinefile pe212hosts --verbose -np 48 test_HPCCG 100 100 100

********** Performance Summary (times in sec) ***********
Total Time/FLOPS/MFLOPS = 24.6534/4.57728e+11/18566.6.
DDOT Time/FLOPS/MFLOPS = 10.4561/2.8608e+10/2736.02.
  Minimum DDOT MPI_Allreduce time (over all processors) = 1.9588
  Maximum DDOT MPI_Allreduce time (over all processors) = 9.6901
  Average DDOT MPI_Allreduce time (over all processors) = 4.04539
WAXPBY Time/FLOPS/MFLOPS = 2.03719/4.2912e+10/21064.3.
SPARSEMV Time/FLOPS/MFLOPS = 9.85829/3.86208e+11/39176.
SPARSEMV MFLOPS W OVRHEAD = 31435.
SPARSEMV PARALLEL OVERHEAD Time = 2.42762 ( 19.7594 % ).
  SPARSEMV PARALLEL OVERHEAD (Setup) Time = 0.127991 ( 1.04177 % ).
  SPARSEMV PARALLEL OVERHEAD (Bdry Exchange) Time = 2.29963 ( 18.7176 % ).
Difference between computed and exact = 1.34337e-14.
Lab system performance (48 cores)

% mpirun --byboard --machinefile pe212hosts --verbose -np 48 test_HPCCG 100 100 100

********** Performance Summary (times in sec) ***********
Total Time/FLOPS/MFLOPS = 21.6507/4.57728e+11/21141.5.
DDOT Time/FLOPS/MFLOPS = 7.06463/2.8608e+10/4049.47.
  Minimum DDOT MPI_Allreduce time (over all processors) = 1.50379
  Maximum DDOT MPI_Allreduce time (over all processors) = 6.30749
  Average DDOT MPI_Allreduce time (over all processors) = 3.28042
WAXPBY Time/FLOPS/MFLOPS = 2.03486/4.2912e+10/21088.4.
SPARSEMV Time/FLOPS/MFLOPS = 9.87323/3.86208e+11/39116.7.
SPARSEMV MFLOPS W OVRHEAD = 30380.3.
SPARSEMV PARALLEL OVERHEAD Time = 2.8392 ( 22.334 % ).
  SPARSEMV PARALLEL OVERHEAD (Setup) Time = 0.164255 ( 1.29208 % ).
  SPARSEMV PARALLEL OVERHEAD (Bdry Exchange) Time = 2.67494 ( 21.0419 % ).
Difference between computed and exact = 1.34337e-14.
Lab system performance (48 cores)

% mpirun --byslot --machinefile pe212hosts --verbose -np 48 test_HPCCG 100 100 100

********** Performance Summary (times in sec) ***********
Total Time/FLOPS/MFLOPS = 22.3009/4.57728e+11/20525.1.
DDOT Time/FLOPS/MFLOPS = 7.32473/2.8608e+10/3905.67.
  Minimum DDOT MPI_Allreduce time (over all processors) = 2.94072
  Maximum DDOT MPI_Allreduce time (over all processors) = 6.5601
  Average DDOT MPI_Allreduce time (over all processors) = 4.0015
WAXPBY Time/FLOPS/MFLOPS = 2.09876/4.2912e+10/20446.3.
SPARSEMV Time/FLOPS/MFLOPS = 10.4333/3.86208e+11/37017.
SPARSEMV MFLOPS W OVRHEAD = 29658.2.
SPARSEMV PARALLEL OVERHEAD Time = 2.58873 ( 19.8797 % ).
  SPARSEMV PARALLEL OVERHEAD (Setup) Time = 0.147263 ( 1.13088 % ).
  SPARSEMV PARALLEL OVERHEAD (Bdry Exchange) Time = 2.44147 ( 18.7488 % ).
Difference between computed and exact = 1.34337e-14.
Lab system performance (24 cores)

% mpirun --byslot --machinefile pe212hosts --verbose -np 24 test_HPCCG 100 100 100

********** Performance Summary (times in sec) ***********
Total Time/FLOPS/MFLOPS = 11.8459/2.28864e+11/19320.1.
DDOT Time/FLOPS/MFLOPS = 3.30931/1.4304e+10/4322.35.
  Minimum DDOT MPI_Allreduce time (over all processors) = 0.809083
  Maximum DDOT MPI_Allreduce time (over all processors) = 2.85727
  Average DDOT MPI_Allreduce time (over all processors) = 1.51294
WAXPBY Time/FLOPS/MFLOPS = 1.04615/2.1456e+10/20509.4.
SPARSEMV Time/FLOPS/MFLOPS = 5.95526/1.93104e+11/32425.8.
SPARSEMV MFLOPS W OVRHEAD = 25391.4.
SPARSEMV PARALLEL OVERHEAD Time = 1.64983 ( 21.6938 % ).
  SPARSEMV PARALLEL OVERHEAD (Setup) Time = 0.11664 ( 1.53371 % ).
  SPARSEMV PARALLEL OVERHEAD (Bdry Exchange) Time = 1.53319 ( 20.1601 % ).
Difference between computed and exact = 9.99201e-15.