TRANSCRIPT
MPI for Exascale Systems
Pavan Balaji
Computer Scientist and Group Lead
Argonne National Laboratory
(How I Learned to Stop Worrying about
Exascale and Love MPI)
About Argonne
$675M operating budget
3,200 employees; 1,450 scientists and engineers; 750 Ph.D.s
Major Scientific User Facilities
– Advanced Photon Source
– Center for Nanoscale Materials
– Electron Microscopy Center
– Argonne Leadership Computing Facility
– Argonne Tandem Linear Accelerator System
Let’s talk MPI!
Standardizing Message-Passing Models
Early vendor systems (Intel’s NX, IBM’s EUI, TMC’s CMMD) were not portable (or very capable)
Early portable systems (PVM, p4, TCGMSG, Chameleon) were mainly research efforts
– Did not address the full spectrum of message-passing issues
– Lacked vendor support
– Were not implemented at the most efficient level
The MPI Forum was a collection of vendors, portability library writers, and users that wanted to standardize all these efforts
What is MPI?
MPI: Message Passing Interface
– The MPI Forum organized in 1992 with broad participation by:
• Vendors: IBM, Intel, TMC, SGI, Convex, Meiko
• Portability library writers: PVM, p4
• Users: application scientists and library writers
• MPI-1 finished in 18 months
– Incorporates the best ideas in a “standard” way
• Each function takes fixed arguments
• Each function has fixed semantics
– Standardizes what the MPI implementation provides and what the
application can and cannot expect
– Each system can implement it differently as long as the semantics match
MPI is not…
– a language or compiler specification
– a specific implementation or product
MPI Forum Organizational Structure
Steering committee
Organizers
– Martin Schulz (chair)
Bunch of working groups
– One-sided communication (chairs: Bill Gropp, Rajeev Thakur)
– Hybrid programming (chair: Pavan Balaji)
– Collective communication (chair: Torsten Hoefler)
– Communicators, Contexts and Groups (chair: Pavan Balaji)
– Tools (chair: Kathryn Mohror)
– Fortran (chair: Rolf Rabenseifner)
– Fault Tolerance (chairs: Wesley Bland, Aurelien Bouteiller)
MPI-1
MPI-1 supports the classical message-passing programming
model: basic point-to-point communication, collectives,
datatypes, C/Fortran bindings, etc.
MPI-1 was defined (1994) by a broadly based group of parallel
computer vendors, computer scientists, and applications
developers.
– 2-year intensive process
Implementations appeared quickly and now MPI is taken for
granted as vendor-supported software on any parallel
machine.
Free, portable implementations exist for clusters and other
environments (MPICH, Open MPI)
Following MPI Standards
MPI-2 was released in 1997
– Several additional features including MPI + threads, MPI-I/O, remote
memory access, dynamic processes, C++/F90 bindings and others
MPI-2.1 (2008) and MPI-2.2 (2009) were recently released
with some corrections to the standard and small features
MPI-3 (2012) added several new features to MPI
The Standard itself:
– at http://www.mpi-forum.org
– All MPI official releases, in both postscript and HTML
Other information on Web:
– at http://www.mcs.anl.gov/mpi
– pointers to lots of material including tutorials, a FAQ, other MPI pages
Overview of New Features in MPI-3
Major new features
– Nonblocking collectives
– Neighborhood collectives
– Improved one-sided communication interface
– Tools interface
– Fortran 2008 bindings
Other new features
– Matching Probe and Recv for thread-safe probe and receive
– Noncollective communicator creation function
– “const” correct C bindings
– Comm_split_type function
– Nonblocking Comm_dup
– Type_create_hindexed_block function
C++ bindings removed; previous deprecated functions removed
MPI in the Exascale Era
Under a lot of scrutiny (good!)
– Lots of myths floating around (bad!)
Push to get new programming models designed and
developed for exascale
The truth is that MPI today is a new programming model (compared to 2004), and MPI in 2020 will be a new programming model (compared to today)
Strengths of MPI
– Composability
• Ability to build tools and libraries above and around MPI
• No “do everything under the sun” attitude
– Continuous evolution
• The standard incorporates best research ideas
MPI: A Philosophical Perspective
Debate in the community that MPI is too hard to program
– We don’t disagree
– It’s meant to be a low-level portable runtime on top of which higher-level programming models should be developed
A programming model has to pick a tradeoff between
programmability, portability, and performance
– MPI has chosen to be a high-performance/portable programming model
– Focus has been on completeness and ability to help real and complex
applications meet their computational needs
MPI’s goal is not to make simple programs easy to write, but
to make complex programs possible to write
MPI Ecosystem
Concentrates on minimal required features (not minimal functions!) and aims to build a large ecosystem that provides easier/more-specific interfaces
– High-level libraries for tasks, global address space, math, visualization, I/O
[Diagram: MPI ecosystem: implementations (MPICH, MVAPICH, Intel MPI, IBM MPI, Cray MPI, Tianhe MPI) and libraries/tools built on MPI (MPE, PETSc, MathWorks MPI, HPCToolkit, TAU, Totalview, DDT, ADLB, ANSYS)]
Current Situation with Production Applications (1)
The vast majority of DOE’s production parallel scientific applications today
use MPI
– Increasing number use (MPI + OpenMP) hybrid
– Some exploring (MPI + accelerator) hybrid
Today’s largest systems in terms of number of regular cores (excluding
GPU cores)
Sequoia (LLNL) 1,572,864 cores
Mira (ANL) 786,432 cores
K computer 705,024 cores
Jülich BG/Q 393,216 cores
Blue Waters 386,816 cores
Titan (ORNL) 299,008 cores
MPI already runs in production on systems with up to 1.6 million cores
Current Situation with Production Applications (2)
IBM has successfully scaled the LAMMPS application to over 3 million MPI
ranks
Applications are running at scale on LLNL’s Sequoia and achieving 12 to 14
petaflops sustained performance
HACC cosmology code from Argonne (Salman Habib) achieved
14 petaflops on Sequoia
– Ran on full Sequoia system using MPI + OpenMP hybrid
– Used 16 MPI ranks * 4 OpenMP threads on each node, which matches the
hardware architecture: 16 cores per node with 4 hardware threads each
– http://www.hpcwire.com/hpcwire/2012-11-29/sequoia_supercomputer_runs_cosmology_code_at_14_petaflops.html
– SC12 Gordon Bell prize finalist
Current Situation with Production Applications (3)
Cardioid cardiac modeling code (IBM & LLNL) achieved 12
petaflops on Sequoia
– Models a beating human heart at near-cellular resolution
– Ran at scale on full system (96 racks)
– Used MPI + threads hybrid: 1 MPI rank per node and 64 threads
– OpenMP was used for thread creation only; all other thread
choreography and synchronization used custom code, not OpenMP
pragmas
– http://nnsa.energy.gov/mediaroom/pressreleases/sequoia112812
– SC12 Gordon Bell Prize finalist
And there are other applications running at similar scales…
On the path to Exascale (based on Bill Dally’s talk
at ISC 2013)
[Chart: Top500 and Green500 performance trends from 2011/06 to 2013/11 against the exaflop target]
Remaining gap to exascale: 11.1 to 26.3X, expected to come from:
– Device technology / fabrication process: 2.2X
– Logic circuit design: 3X
– Software improvements: 1.7 to 4X
New Features in MPI-3
Nonblocking Collectives
Nonblocking versions of all collective communication
functions have been added
– MPI_Ibcast, MPI_Ireduce, MPI_Iallreduce, etc.
– There is even a nonblocking barrier, MPI_Ibarrier
They return an MPI_Request object, similar to nonblocking
point-to-point operations
The user must call MPI_Test/MPI_Wait or their variants to
complete the operation
Multiple nonblocking collectives may be outstanding, but they
must be called in the same order on all processes
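As a concrete sketch of the calling pattern (not from the slides; overlap_computation() is a hypothetical placeholder for work that does not depend on the result):

MPI_Request req;
double local_sum = 1.0, global_sum;
/* start the reduction; it progresses in the background */
MPI_Iallreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               MPI_COMM_WORLD, &req);
overlap_computation();                /* independent work overlapped with the collective */
MPI_Wait(&req, MPI_STATUS_IGNORE);    /* global_sum is valid only after this completes */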
Nonblocking Collectives Overlap
Software pipelining
– More complex parameters
– Progression issues
– Not scale-invariant
Hoefler: Leveraging Non-blocking Collective Communication in High-performance Applications
Dynamic Sparse Data Exchange (DSDE)
Main Problem: metadata
– Determine who wants to send how much data to me
(I must post receive and reserve memory)
OR:
– Use MPI semantics:
• Unknown sender
– MPI_ANY_SOURCE
• Unknown message size
– MPI_PROBE
• Reduces problem to counting
the number of neighbors
• Allow faster implementation!
T. Hoefler et al.: Scalable Communication Protocols for Dynamic Sparse Data Exchange
MPI_Ibarrier (NBX)
Complexity of the census (barrier): O(log P)
– Combines metadata with actual transmission
– Point-to-point synchronization
– Continue receiving until the barrier completes
– Processes start collective synchronization (barrier) when the p2p phase has ended
• barrier = distributed marker!
– Better than PEX, PCX, RSX!
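A rough sketch of the NBX idea in MPI-3 terms (assuming the destination list dests[], buffers sendbuf[]/recvbuf, message sizes counts[], and the constants NUM_DESTS and TAG are set up elsewhere; error handling omitted):

int i, flag, done = 0, barrier_active = 0;
MPI_Request barrier_req, send_reqs[NUM_DESTS];
MPI_Status status;

/* synchronous sends: completion implies the receiver has the data */
for (i = 0; i < NUM_DESTS; i++)
    MPI_Issend(sendbuf[i], counts[i], MPI_INT, dests[i], TAG, comm, &send_reqs[i]);

while (!done) {
    MPI_Iprobe(MPI_ANY_SOURCE, TAG, comm, &flag, &status);      /* unknown sender */
    if (flag) {
        int count;
        MPI_Get_count(&status, MPI_INT, &count);                 /* unknown size */
        MPI_Recv(recvbuf, count, MPI_INT, status.MPI_SOURCE, TAG,
                 comm, MPI_STATUS_IGNORE);
        /* ... process the received data ... */
    }
    if (!barrier_active) {
        int sent_all;
        MPI_Testall(NUM_DESTS, send_reqs, &sent_all, MPI_STATUSES_IGNORE);
        if (sent_all) {
            MPI_Ibarrier(comm, &barrier_req);   /* the distributed marker */
            barrier_active = 1;
        }
    } else {
        MPI_Test(&barrier_req, &done, MPI_STATUS_IGNORE);
    }
}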
Parallel Breadth First Search
On a clustered Erdős-Rényi graph, weak scaling
– 6.75 million edges per node (filled 1 GiB)
HW barrier support is significant at large scale!
[Plots: BlueGene/P (with HW barrier) and Myrinet 2000 (with LibNBC)]
Neighborhood Collectives
New functions MPI_Neighbor_allgather,
MPI_Neighbor_alltoall, and their variants define collective
operations among a process and its neighbors
Neighbors are defined by an MPI Cartesian or graph virtual
process topology that must be previously set
These functions are useful, for example, in stencil
computations that require nearest-neighbor exchanges
They also represent sparse all-to-many communication
concisely, which is essential when running on many thousands
of processes.
– Do not require passing long vector arguments as in MPI_Alltoallv
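A minimal sketch for a 2D stencil (the halo buffers are illustrative; one value is exchanged per neighbor):

int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
double sendhalo[4], recvhalo[4];      /* fill sendhalo with boundary values first */
MPI_Comm cart;

MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Dims_create(nprocs, 2, dims);     /* pick a 2D process grid */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

/* exchange one double with each of the four Cartesian neighbors;
   buffer order follows the topology: -x, +x, -y, +y */
MPI_Neighbor_alltoall(sendhalo, 1, MPI_DOUBLE, recvhalo, 1, MPI_DOUBLE, cart);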
Cartesian Neighborhood Collectives
Buffer ordering example:
T. Hoefler and J. L. Traeff: Sparse Collective Operations for MPI
Improved RMA Interface
Substantial extensions to the MPI-2 RMA interface
New window creation routines:
– MPI_Win_allocate: MPI allocates the memory associated with the
window (instead of the user passing allocated memory)
– MPI_Win_create_dynamic: Creates a window without memory
attached. User can dynamically attach and detach memory to/from
the window by calling MPI_Win_attach and MPI_Win_detach
– MPI_Win_allocate_shared: Creates a window of shared memory (within a node) that can be accessed simultaneously by direct load/store accesses as well as RMA ops
New atomic read-modify-write operations
– MPI_Get_accumulate
– MPI_Fetch_and_op (simplified version of Get_accumulate)
– MPI_Compare_and_swap
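A small sketch of the new pieces together: a shared counter in window memory allocated by MPI, incremented atomically with a fetch-and-add (the passive-target lock_all/flush synchronization is just one possible choice):

MPI_Win win;
long *counter, one = 1, old;

MPI_Win_allocate(sizeof(long), sizeof(long), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &counter, &win);   /* MPI allocates the memory */
*counter = 0;
MPI_Barrier(MPI_COMM_WORLD);                        /* everyone has initialized */

MPI_Win_lock_all(0, win);                           /* passive-target access epoch */
MPI_Fetch_and_op(&one, &old, MPI_LONG, 0, 0, MPI_SUM, win);  /* atomic add at rank 0 */
MPI_Win_flush(0, win);                              /* old now holds the previous value */
MPI_Win_unlock_all(win);

MPI_Win_free(&win);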
One-sided Communication
The basic idea of one-sided communication models is to decouple data movement from process synchronization
– Should be able to move data without requiring that the remote process synchronize
– Each process exposes a part of its memory to other processes
– Other processes can directly read from or write to this memory
[Diagram: four processes (P0 to P3), each with private memory and a remotely accessible memory region; the remotely accessible regions together form a global address space]
Use Case: Distributed Shared Arrays
Quantum Monte Carlo: ensemble data
– Represents initial quantum state
– Spline representation, cubic basis functions
– Large (100+ GB), read-only table of coefficients
– Accesses are random
Coupled cluster simulations
– Evolving quantum state of the system
– Very large tables of coefficients
– Table at step t is read-only, table at step t+1 is accumulate-only
– Accesses are non-local/overlapping
Global Arrays PGAS programming model
– Can be supported with passive mode RMA [Dinan et al., IPDPS’12]
Case-study: NWChem over MPI-3
[Charts: CCSD performance (wall-clock time in seconds per iteration, MPI-2 vs. MPI-3) and CCSD(T) performance (average iteration time, MPI-2 vs. MPI-3)]
Courtesy Jeff Hammond, Argonne National Laboratory
Hybrid Programming with Shared Memory
MPI-3 allows different processes to allocate shared memory
through MPI
– MPI_Win_allocate_shared
Uses many of the concepts of one-sided communication
Applications can do hybrid programming using MPI or
load/store accesses on the shared memory window
Other MPI functions can be used to synchronize access to
shared memory regions
Can be simpler to program than threads
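A sketch of the typical pattern, splitting MPI_COMM_WORLD by node, letting rank 0 on each node allocate the region, and having everyone else map it (sizes and the fence synchronization are illustrative):

MPI_Comm shmcomm;
MPI_Win win;
double *base;
int noderank, qdisp;
MPI_Aint qsize;

/* one communicator per group of processes that can share memory */
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &shmcomm);
MPI_Comm_rank(shmcomm, &noderank);

/* rank 0 on the node allocates the shared segment; the others allocate 0 bytes */
MPI_Win_allocate_shared(noderank == 0 ? 1024 * sizeof(double) : 0,
                        sizeof(double), MPI_INFO_NULL, shmcomm, &base, &win);

/* everyone gets a direct load/store pointer to rank 0's segment */
MPI_Win_shared_query(win, 0, &qsize, &qdisp, &base);

MPI_Win_fence(0, win);
if (noderank == 0) base[0] = 42.0;    /* plain store into the shared region */
MPI_Win_fence(0, win);                /* now every process on the node can read base[0] */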
Creating Shared Memory Regions in MPI
[Diagram: MPI_COMM_WORLD is split with MPI_Comm_split_type (COMM_TYPE_SHARED) into per-node shared memory communicators; MPI_Win_allocate_shared then creates a shared memory window on each]
Regular RMA windows vs. Shared memory windows
Shared memory windows allow application processes to directly perform load/store accesses on all of the window memory
– E.g., x[100] = 10
All of the existing RMA functions can also be used on such memory for more advanced semantics such as atomic operations
Can be very useful when processes want to use threads only to get access to all of the memory on the node
– You can create a shared memory window and put your shared data in it
[Diagram: traditional RMA windows: each process accesses its local window memory with load/store and remote memory with PUT/GET; shared memory windows: P0 and P1 both access the entire window with load/store]
Case Study: Genome Assembly
Largest genome assembly to
date: 2.3TB dataset performed
with MPI-3 shared memory
capability
– First terascale genome assembly
Very simple optimization: place
all of the node dataset in
shared memory and access as
read-only data
Could not use threads because
all MPI calls face lock
overheads
Terabase Assembly on Cray XE6
Courtesy Fangfang Xia, Argonne National Laboratory
Tools Interface (1)
An extensive interface to allow tools (performance analyzers,
etc.) to portably extract information about MPI processes
Enables the setting of various control variables within an MPI
implementation, such as algorithmic cutoff parameters
– e.g., eager vs. rendezvous thresholds
– Switching between different algorithms for a collective communication
operation
Provides portable access to performance variables that can
provide insight into internal performance information of the
MPI implementation
– e.g., length of unexpected message queue
Note that each implementation defines its own performance
and control variables; MPI does not define them
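As a sketch, a tool might enumerate whatever control variables the implementation chooses to expose (the variable names, such as an eager threshold, are entirely implementation-specific; assumes <stdio.h>):

int i, ncvars, provided;
MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
MPI_T_cvar_get_num(&ncvars);
for (i = 0; i < ncvars; i++) {
    char name[256], desc[256];
    int namelen = sizeof(name), desclen = sizeof(desc);
    int verbosity, binding, scope;
    MPI_Datatype dtype;
    MPI_T_enum enumtype;
    MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &dtype, &enumtype,
                        desc, &desclen, &binding, &scope);
    printf("cvar %d: %s : %s\n", i, name, desc);   /* e.g., an eager/rendezvous cutoff */
}
MPI_T_finalize();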
Tools Interface (2)
Standardized model for debuggers to query for MPI library
information
Totalview, DDT, etc., can peek inside the MPI library and
provide information on queues and other operations
This has been supported by almost every MPI implementation
for many years; it was just blessed by the Forum now
(This is a companion standard, not included in MPI-3)
Fortran 2008 Bindings
An additional set of bindings for the latest Fortran specification
Supports full and better quality argument checking with
individual handles
Support for choice arguments, similar to (void *) in C
Enables passing array subsections to nonblocking functions
First parallel programming library to actually be Fortran
compliant
– Well, not exactly; requires some Fortran TS 29113 extensions
Support for Very Large Data
Ability to communicate very large data (more than 2GB)
Model is a little round-about for backward compatibility
reasons, but not too hard
– If < 2GB, use MPI communication directly
– If >= 2GB, create derived datatype that represents large data
Added a few extra functions that would typically be used by
high-level libraries to query for such information
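A sketch of the round-about model for a hypothetical 4 GiB message: wrap the data in a derived datatype so the count argument stays within int range (big_buffer and dest are placeholders):

MPI_Datatype chunk, bigtype;
MPI_Count nbytes;

MPI_Type_contiguous(1048576, MPI_CHAR, &chunk);    /* one 1 MiB chunk */
MPI_Type_contiguous(4096, chunk, &bigtype);        /* 4096 chunks = 4 GiB */
MPI_Type_commit(&bigtype);

MPI_Type_size_x(bigtype, &nbytes);                 /* MPI_Count-based query, new in MPI-3 */
MPI_Send(big_buffer, 1, bigtype, dest, 0, MPI_COMM_WORLD);   /* count of 1 fits an int */

MPI_Type_free(&bigtype);
MPI_Type_free(&chunk);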
Status of MPI-3 Implementations
(Columns, in order: MPICH, MVAPICH, Cray, Tianhe, Intel, IBM PE, IBM Platform, Open MPI, SGI, Fujitsu, MS)
– NB collectives: ✔ ✔ ✔ ✔ ✔ Q1 '14 ✔ ✔ ✔
– Neighborhood collectives: ✔ ✔ ✔ ✔ ✔ Q1 '14 Q3 '14 Q4 '13 (in nightly snapshots) Q1 '14
– RMA: ✔ ✔ ✔ ✔ ✔ Q1 '14 Q3 '14 Q1 '14 Q4 '13
– Shared memory: ✔ ✔ ✔ ✔ ✔ Q1 '14 Q2 '15 Q1 '14 Q4 '13
– Tools interface: ✔ Q1 '14 Q2 '14 Q1 '14 Q1 '14 Q1 '14 Q2 '15 ✔ Q1 '14
– Non-collective communicator creation: ✔ ✔ ✔ ✔ ✔ Q1 '14 Q2 '15 Q4 '13 (in nightly snapshots) ✔
– F08 bindings (MPI-3 + errata): (Q1 '14) (Q2 '14) (Q2 '14) (Q1 '14) (Q1 '14) (Q1 '14) (Q3 '14) (✔) (Q1 '14)
– New datatypes: ✔ ✔ ✔ ✔ ✔ Q1 '14 Q2 '15 ✔ ✔
– Large counts: ✔ ✔ ✔ ✔ ✔ Q1 '14 Q2 '15 ✔ Q1 '14
– Matched probe: ✔ ✔ ✔ ✔ ✔ Q1 '14 Q3 '14 ✔ ✔
Release dates are estimates and are subject to change at any time.
Empty cells indicate no publicly announced plan to implement/support that feature.
Beyond MPI-3
MPI Research Activities beyond MPI-3
Improved interoperability with Threads and Shared memory
– Some of this is being considered for MPI-4
Fault Tolerance
– Tolerating faults in processes (fatal crashes), memory, network
– Some of it is being considered for MPI-4
Tasking Models
– Message-driven execution, irregular computations
– Some of it (Active Messages) is being considered for MPI-4
Interacting with Heterogeneous Memory
– Non-DRAM memories (NVRAM, scratchpad, on-chip memory)
– Definitely beyond MPI-4
MPI+Threads Hybrid Programming
One of the most successful models in use today
Hybrid programming vs. a single unified programming model
– The number of models we program to should not be too large, but a
small collection of standardized programming models which
interoperate with each other is not a bad thing
– MPI+OpenMP has demonstrated this successfully
Four levels of MPI Thread Safety
MPI_THREAD_SINGLE
– MPI only, no threads
MPI_THREAD_FUNNELED
– Outside OpenMP parallel region,
or OpenMP master region
MPI_THREAD_SERIALIZED
– Outside OpenMP parallel region,
or OpenMP single region, or
critical region
MPI_THREAD_MULTIPLE
– Any thread is allowed to make
MPI calls at any time
#pragma omp parallel for
for (i = 0; i < N; i++) {
    uu[i] = (u[i] + u[i - 1] + u[i + 1]) / 5.0;
}
MPI_Function();

#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function();
}

#pragma omp parallel
{
    /* user computation */
    #pragma omp critical
    MPI_Function();
}
Problem: Idle Resources during MPI Calls
Threads are only active in the computation phase
Threads are IDLE during MPI calls
(a) Funneled mode:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    uu[i] = (u[i] + u[i - 1] + u[i + 1]) / 5.0;
}
MPI_Function();   /* only the master thread is inside the MPI call */

(b) Serialized mode:
#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function();   /* only one thread is inside the MPI call */
}
Our Approach:
Sharing Idle Threads with Application inside MPI
(a) Funneled mode:
#pragma omp parallel
{
    /* user computation */
}
MPI_Function() {
    #pragma omp parallel
    {
        /* MPI internal task */
    }
}

(b) Serialized mode:
#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function() {
        #pragma omp parallel
        {
            /* MPI internal task */
        }
    }
}
Challenges
Some parallel algorithms are not efficient with insufficient threads and need a tradeoff, but the number of available threads is UNKNOWN!
Nested parallelism
– Simply creates new Pthreads
– Offloads thread scheduling to the OS, causing a thread OVERRUNNING issue
(a) Unknown number of IDLE threads:
#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function() {
        ...
    }
}

(b) Threads overrunning (each nested parallel region creates N Pthreads):
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp parallel
        { ... }
    }
}
Derived Data Type Packing Processing
MPI_Pack / MPI_Unpack
Communication using derived datatypes
– Transfer non-contiguous data
– Pack / unpack data internally

#pragma omp parallel for
for (i = 0; i < count; i++) {
    dest[i] = src[i * stride];
}

[Diagram: vector datatype with count, blocklength, and stride]
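For reference, the strided layout in the diagram is what an application would describe with a vector datatype, leaving the pack/unpack to MPI (a sketch; src, count, blocklength, stride, and dest are placeholders):

MPI_Datatype vec;
MPI_Type_vector(count, blocklength, stride, MPI_DOUBLE, &vec);
MPI_Type_commit(&vec);
MPI_Send(src, 1, vec, dest, 0, MPI_COMM_WORLD);   /* MPI packs the non-contiguous data internally */
MPI_Type_free(&vec);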
[Chart: communication time and execution time speedup vs. number of threads (1 to 240); hybrid MPI+OpenMP NAS Parallel MG benchmark]
Parallel shared memory communication
Get as many available cells as we can
Parallelizing large data movement
[Diagram: (a) sequential pipelining vs. (b) parallel pipelining of a large message through the cells of a shared buffer between the sender and receiver user buffers]
Intra-node Large Message Communication
OSU MPI micro-benchmark
[Charts: P2P bandwidth speedup vs. number of threads (1 to 120) for 64 KB to 16 MB messages; similar results for latency and message rate]
Caused by a too-small eager/rendezvous communication threshold on Xeon Phi, not by MT-MPI! (Poor pipelining, but worse parallelism.)
Parallel InfiniBand Communication
Structures
– IB context
– Protection Domain
– Queue Pair (critical)
• 1 QP per connection
– Completion Queue (critical)
• Shared by 1 or more QPs
RDMA communication
– Post RDMA operation to QP
– Poll completion from CQ
Internally supports Multi-threading
OpenMP contention issue
[Diagram: processes P0 to P2 over the MPICH stack (ADI3, CH3, nemesis with SHM/TCP/IB netmods) and the IB HCA, with an IB context, protection domain, QPs, and a CQ]
One-sided Graph500 benchmark
Every process issues many MPI_Accumulate operations to
the other processes in every breadth first search iteration.
Scale 2^22, 16 edge factor, 64 MPI processes
[Chart: harmonic mean TEPS and improvement vs. number of threads (1 to 64)]
MPI_THREAD_MULTIPLE optimizations
Global
– Use a single global mutex, held from function entry to exit
– Existing implementation
Brief Global
– Use a single global mutex, but reduce the size of the critical section as much as possible
Per-Object
– Use one mutex per data object: lock data, not code sections
Lock-Free
– Use no mutexes
– Use lock-free algorithms
Several Optimizations Later…
– Reduction of lock granularities
– Thread-local pools to reduce sharing
– Per-object locks
– Some atomic operations (reference counts)
But the performance scaling was still suboptimal
From Pure-MPI to Multithreaded-MPI: Multi-Node
Communication Throughput
Example: circular communication pattern
One-sided communication
No intra-node communication
17 nodes x 16 cores/node
Global mutex
Throughput drops severely when moving to threads
Problems:
– Safer algorithms?
– Lock contention?
[Chart: throughput (Kbytes/s) vs. core count for process/thread mixes P16:T1, P8:T2, P4:T4, P2:T8, P1:T16]
Contention in a Multithreaded MPI Model
Multithreaded MPI
– Threads can make MPI calls
concurrently
– Thread-safety is necessary
MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...);
...
#pragma omp parallel
{
    /* Do Work */
    MPI_Put();
    /* Do Work */
}

[Diagram: two threads in one MPI process call MPI_Put() concurrently; each call is bracketed by ENTER_CS()/EXIT_CS()]
Thread safety can be ensured by:
– Critical sections (locks): possible contention!
– Lock-free algorithms: non-trivial! Still does memory barriers
Thread sleeping/polling: CONTENTION
Hidden Evil: Lock Monopolization (Starvation)
Implementing critical sections
with spin-locks or mutexes
Watch out: no fairness guarantee!
Starvation detection algorithm:
int waiting_threads = 0;
int last_holder;
acquire_lock(L) {
    bool lock_acquired = false;
    try_lock(L, lock_acquired);
    if (lock_acquired && (my_thread_id == last_holder) && (waiting_threads > 0))
        STARVATION_CASE;
    else if (!lock_acquired) {
        atomic_incr(waiting_threads);
        lock(L);
        atomic_decr(waiting_threads);
    }
    last_holder = my_thread_id;
    return;
}
[Chart: percentage of locks acquired vs. number of starving threads; starvation measurement with 16 processes and 16 threads/node, Circular_comm and Graph500]
How to fix Lock Monopolization?
Use locks that ensure fairness
Example: Ticket Spin-Lock
Basics:
– Get my ticket and Wait my turn
– Ensures FIFO acquisition
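A minimal C11 sketch of the ticket idea (not the implementation evaluated here):

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* ticket dispenser */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned int my = atomic_fetch_add(&l->next_ticket, 1);   /* take a ticket */
    while (atomic_load(&l->now_serving) != my)
        ;                                                      /* spin until it is my turn */
}

void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);                      /* admit the next ticket */
}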
Simplified execution flow of a thread-safe MPI implementation with critical sections
[Chart: speedup over mutex vs. number of nodes (16, 32, 64); 2D stencil, halo = 2 MB/direction, message size = 1 KB, 16 threads/node; mutex vs. ticket lock]
Priority Locking Scheme
3 basic locks:
– One for mutual exclusion in each priority level
– Another for high priority threads to block lower ones
Watch out: do not forget fairness in the same priority level
– Use exclusively FIFO locks (Ticket)
[Chart: speedup over mutex vs. number of nodes (16, 32, 64); 2D stencil, halo = 2 MB/direction, message size = 1 KB, 16 threads/node; mutex vs. ticket vs. priority lock]
Does Fixing Lock Contention
Solve the Problem?
Spin-lock based critical sections
Contention metric: Wasted Polls
Test scenarios:
– Micro-benchmarks
– HPC applications
#pragma omp parallel
{
    for (i = 0; i < NITER; i++) {
        MPI_Put();
        /* Delay X us */
        Delay(X);
    }
}
[Charts: percentage of locks acquired vs. number of polls in lock acquisition; (a) MPI_Put micro-benchmark on 32 nodes x 8 cores with delays of 0, 10, 20, 50, and 100 us; (b) Graph500 on 32 nodes x 8 cores with SCALE = 14, 16, 18]
“When you have eliminated the impossible, whatever remains, however improbable,
must be the truth.” – Sherlock Holmes, Sign of Four, Sir Arthur Conan Doyle
Hybrid MPI+OpenMP (or other threading models)
Thread execution model exposed to applications is too
simplistic
OpenMP threads can be pthreads (i.e., can execute
concurrently) or user-level threads such as qthreads (i.e.,
might or might not execute concurrently)
– Not exposed to users/MPI library
What does this mean to MPI?
– MPI runtime never knows when two threads can execute concurrently
and when they cannot
– Always has to perform locks and memory consistency calls (memory
barriers) even when switching between user-level threads
Argobots: Integrated Computation and Data Movement
with Lightweight Work Units
Execution Model
– Execution stream: a thread
executed on a hardware processing
element
– Work units: a user level thread or a
tasklet with a function pointer
Memory Model
– Memory Domains: A memory
consistency call on a big domain
also impacts all internal domains
– Synchronization: explicit & implicit
memory consistency calls
– Network: PUT/GET, atomic ops
[Diagram: work units running on an execution stream over nested memory domains: consistency domains (CD0, CD1) with CD-consistent memory, inside a noncoherent load/store (NCLS) domain with LS noncoherent memory, connected by Put/Get]
Argobots Ongoing Works: Fine-grained Context-
aware Thread Library
Two Levels of Threads
– execution stream: a normal thread
– work unit: a user level thread or a tasklet
with a function pointer
Avoid Unnecessary lock/unlock
– Case 1: switch the execution to another
work unit in the same execution stream
without unlock
– Case 2: switch to another execution stream,
call unlock
Scheduling Work Units in Batch Order
– work units in the same execution stream will
be batch executed
[Diagram: work units (w1, w3, w5) scheduled on execution streams (e1, e2)]
MPI + Shared memory with Partitioned Address Space
Partitioned virtual address space allows us to create different virtual address images instead of a single OS virtual address space
Each process has a virtual address space, but they can look into each other's virtual address space
All memory is semi-private
Courtesy Atsushi Hori, University of Tokyo
MPI + Shared memory with User-level processes
Extending the PVAS concept
to “user-level processes”
Multiple “MPI processes”
inside a single OS process
Each MPI process has its own address space, but the MPI implementation can look into other processes' memory and cooperatively yield to them
Collaborative work with Atsushi Hori, University of Tokyo
Fault Tolerance: What failures do we expect?
CPU (or full system)
– Total Failure
Memory
– ECC Errors (corrected/uncorrected)
– Silent Errors
Network
– Dropped packets
– Lost routes
– NIC Failure
GPU/Accelerator
– Memory Errors
– Total Failure
Disk
– R/W Errors
– Total Failure
CPU (Total System) Failures
Generally will result in a process failure from the perspective of other (off-node) processes
Need to recover/repair lots of parts of the system
– Communication library (MPI)
– Computational capacity (if necessary)
– Data
• C/R, ABFT, Natural Fault Tolerance, etc.
MPI-3 has the theoretical ability to deal with faults, but the user has to do a bunch of bookkeeping to make that happen
– New communicator has to be created
– All requests have to be kept track of and migrated to the new communicator
– Need to watch out for failure messages from other processes
MPIXFT: MPI-3 based library for FT bookkeeping
Lightweight virtualization infrastructure
– Gives users virtual communicators and requests and
internally manages the real ones
Automatically repairs MPI communicators as
failures occur
– Handles running in n-1 model
Virtualizes MPI Communicator
– User gets an MPIXFT wrapper communicator
– On failure, the underlying MPI communicator is
replaced with a new, working communicator
[Diagram: an MPIXFT_COMM wrapper communicator backed by a replaceable underlying MPI_COMM]
MPIXFT Design
Possible because of new MPI-3
capabilities
– Non-blocking equivalents for (almost)
everything
– MPI_COMM_CREATE_GROUP
[Diagram: MPI_BARRIER emulated with MPI_WAITANY over an MPI_IBARRIER and a failure-notification request; MPI_ISEND() notifications and COMM_CREATE_GROUP() rebuild the communicator from processes 0 through 5 to the survivors 0, 1, 2, 4, 5]
MPIXFT Results
MCCK Mini-app
– Domain decomposition
communication kernel
– Overhead within standard
deviation
[Charts: time (s) vs. number of processes (16 to 256) for the MCCK mini-app (MPICH vs. MPIXFT) and for halo exchange in 1D, 2D, and 3D (MPICH vs. MPIXFT)]
Halo Exchange (1D, 2D, 3D)
– Up to 6 outstanding requests at a time
– Very low overhead
User Level Failure Mitigation
Proposed change to MPI Standard for MPI-4
Repair MPI after process failure
– Enable more custom recovery than MPIXFT
Does not pick a particular recovery technique as better or
worse than others
Introduce minimal changes to MPI
Works around performance problems with MPIXFT
Treat process failure as fail-stop failures
– Transient failures are masked as fail-stop
Ability to notify remaining processes on errors
Recovery with only notification
Master/Worker Example
Post work to multiple processes
MPI_Recv returns an error due to the failure
– MPI_ERR_PROC_FAILED if named
– MPI_ERR_PROC_FAILED_PENDING if wildcard
Master discovers which process has failed with
ACK/GET_ACKED
Master reassigns work to worker 2
[Diagram: master and workers 1 to 3; Send/Recv, failure discovery, then Send to reassign work to worker 2]
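A sketch of the master's recovery path under this proposal (names follow the slides; current implementations typically expose them with an MPIX_ prefix, the communicator must use MPI_ERRORS_RETURN, and the buffers, TAG, and surviving_worker rank are placeholders):

int err = MPI_Recv(result, n, MPI_DOUBLE, MPI_ANY_SOURCE, TAG, comm, &status);
if (err == MPI_ERR_PROC_FAILED || err == MPI_ERR_PROC_FAILED_PENDING) {
    MPI_Group failed;
    MPI_Comm_failure_ack(comm);                  /* acknowledge the failures seen so far */
    MPI_Comm_failure_get_acked(comm, &failed);   /* discover which workers died */
    /* reassign the failed worker's task to a surviving worker */
    MPI_Send(task, tasklen, MPI_DOUBLE, surviving_worker, TAG, comm);
}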
Failure Propagation
When necessary, manual propagation is available.
– MPI_Comm_revoke(MPI_Comm comm)
• Interrupts all non-local MPI calls on all processes in comm.
• Once revoked, all non-local MPI calls on all processes in comm will return
MPI_ERR_REVOKED.
– Exceptions are MPI_COMM_SHRINK and MPI_COMM_AGREE (later)
– Necessary for deadlock prevention
Often unnecessary
– Let the application
discover the error as it
impacts correct completion
of an operation.
[Diagram: processes 0 to 3; Recv(1) fails on process 0, which calls Revoke; the pending Recv(3), Send(2), and Recv(0) calls on the other processes return Revoked]
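When the whole group has to move on, the proposed revoke/shrink pair looks roughly like this (again proposal names, usually MPIX_-prefixed in today's implementations):

if (err == MPI_ERR_PROC_FAILED)
    MPI_Comm_revoke(comm);            /* interrupt pending calls on comm everywhere */

MPI_Comm newcomm;
MPI_Comm_shrink(comm, &newcomm);      /* comm minus the failed processes */
/* continue the computation on newcomm */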
GVR (Global View Resilience) for Memory Errors
Multi-versioned, distributed memory
– Application commits “versions” which are stored by a backend
– Versions are coordinated across entire system
Different from C/R
– Don’t roll back full application stack, just the specific data.
Rollback & re-compute if uncorrected error
Parallel computation proceeds from phase to phase
Phases create new logical versions
App-semantics-based recovery
Memory Fault Tolerance
Memory stashing
Multi-versioned memory
Implementation-defined block-level retrieval across versions
Silent-data corruption handled through application hooks
(only works if applications can detect “silent errors”)
Nondeterministic methods can naturally deal with them
because of some common hardware properties
– Exponents are less error prone than mantissa, because of hardware
logic models (mantissa uses simpler hardware)
– Instruction caches are more reliable than data caches
MPI Memory Resilience
(Planned Work)
[Diagram: traditional MPI_PUT to a single target memory vs. replicated MPI_PUT to primary and backup memories]
Integrate memory stashing within the MPI stack
– Data can be replicated across different kinds of memories (or storage)
– On error, repair data from backup memory (or disk)
VOCL: Transparent Remote GPU Computing
[Diagram: traditional model: the application on a compute node uses the OpenCL API and the native OpenCL library to drive the local physical GPU; VOCL model: the application uses the OpenCL API through the VOCL library to drive virtual GPUs, which VOCL proxies map onto native OpenCL libraries and physical GPUs on local or remote compute nodes]
Transparent utilization of remote GPUs
Efficient GPU resource management:
– Migration (GPU / server)
– Power management: pVOCL
VOCL-FT (Fault Tolerant Virtual OpenCL)
[Diagram: synchronous vs. asynchronous detection models; the user application issues bufferWrite, launchKernel, bufferRead, and sync calls, and the VOCL-FT functionality (synchronously) or a separate VOCL-FT detecting thread (asynchronously) performs ECC queries and checkpointing]
Double-bit (uncorrected) and single-bit (corrected) error counters may be queried in both models
Minimum overhead, but double-bit errors will trash whole executions
VOCL-FT: Single- and Double-Bit Error Detection Overhead
[Charts: execution time (ms) for Matrix Transpose (512x512 to 4096x4096), Smith-Waterman (query size 1K to 6K), Nbody (15360 to 38400 bodies), and DGEMM (matrix blocks 512x512 to 4096x4096), comparing vocl, async, and sync]
Data-Intensive Applications
– Graph algorithms in social network analysis
– DNA sequence assembly
[Diagram: DNA sequence assembly: remote searches between local and remote nodes build a DNA consensus sequence from reads such as ACGCGATTCAG and GCGATTCAGTA]
Common Characteristics
– Organized around sparse structures
– Communication-to-computation ratio is high
– Irregular communication pattern
Tasking Models
Several tasking models built on top of MPI (ADLB, Scioto)
Provide a dynamic task model, potentially with dependencies, for applications to use
– MPI processes are static; the application view of computation/data is dynamic
Need to be taken with a grain of salt
– Asynchronous execution is not always great: it can screw up cache locality
– Synchronization does not necessarily mean idleness
– Idleness does not necessarily mean worse performance or energy usage
Big problem with these and other tasking models is MPI interoperability
– Libraries built on top of MPI are better in this regard, but still not perfect, because there is no support for it from the runtime system
– E.g., if I spawn off tasks to execute on a remote process while the remote process is waiting in an MPI_BARRIER, the tasking library does not have process control to execute the tasks
Message Passing Models
Current MPI is not well suited to data-intensive applications
[Diagram: two-sided communication (explicit sends and receives), one-sided (RMA) communication (explicit sends, implicit receives, simple remote operations: Put, Get, Acc), and active messages (origin sends a message, a handler runs at the target, a reply handler runs at the origin)]
Active Messages
– Sender explicitly sends message
– Upon message’s arrival, message handler is triggered, receiver is not explicitly involved
– User-defined operations on remote process
Generalized and MPI-Interoperable AM
MPI-AM: an MPI-interoperable framework that can dynamically manage data movement and user-defined remote computation
– Correctness semantics
• Memory consistency: the MPI runtime must ensure consistency of the window
• Three different types of ordering
• Concurrency: by default, the MPI runtime behaves "as if" AMs are executed in sequential order; the user can release concurrency by setting an MPI assert
– Streaming AMs
• Define "segment": the minimum number of elements for AM execution
• Achieve a pipeline effect and reduce temporary buffer requirements
– Explicit and implicit buffer management
• System buffers: eager protocol, not always enough
• User buffers: rendezvous protocol, guarantee correct execution
[Diagram: MPI-AM workflow: AM input/output data move between origin input/output buffers and target input/output and persistent buffers in the RMA window; the AM handler runs at the target, surrounded by memory barriers (with cache-line flushes under the SEPARATE window model); SEPARATE and UNIFIED window models shown]
[ICPADS 2013] X. Zhao, P. Balaji, W. Gropp, R. Thakur, "MPI-Interoperable and Generalized Active Messages", in proceedings of ICPADS '13
Asynchronous and MPI-Interoperable AM
– Supporting asynchronous AMs internally in the MPI library
• Inter-node messages: spawn a thread in the network module
– Block waiting for AMs
– Separate sockets for AMs and other MPI messages
• Intra-node messages: "origin computation"
– Processes on the same node allocate the window on a shared-memory region
– The origin process directly fetches data from the target process's memory, completes the computation locally, and writes the data back to the target process's memory
[Diagram: design of asynchronous AMs; rank 0 (target), rank 1 (origin), and rank 2 (origin) across NODE 0 and NODE 1, connected by shared memory within a node and the network between nodes]
[CCGrid 2013] X. Zhao, D. Buntinas, J. Zounmevo, J. Dinan, D. Goodell, P. Balaji, R. Thakur, A. Afsahi, W. Gropp, "Towards Asynchronous and MPI-Interoperable Active Messages", in proceedings of CCGrid '13
[Charts: Graph500 strong-scaling results gathered on the Fusion cluster at ANL (320 nodes, 2560 cores, QDR InfiniBand): TEPS (x1000) vs. number of processes (128, 256, 512) for vertex counts 2^15 and 2^20, comparing Default-g500, DDT-g500, and AM-g500]
Tasking Models based on MPI AM
Arbitrary graphs can be created where tasks have
dependencies on computation and communication
Spawn off an AM when some collection of MPI requests are
complete
– Each MPI request can correspond to communication operations or
other AMs
Data Movement for Heterogeneous Memory
DMEM: exploring efficient techniques for data movement on heterogeneous memory architectures
Research areas:
• Programming constructs
• Performance and power/energy
• Transparency through introspection tools
• Evaluation through an integrated test suite and various DOE applications
Relevance: miniMD default dataset: 7G data accesses, 6 sec execution on 1 core, 54 Gb of data accesses
[Diagrams: hierarchical memory view (core, cache, main memory, NVRAM, disk) vs. heterogeneous memory as first-class citizens (core, scratchpad memory, main memory, NVRAM, less reliable memory, accelerator memory, compute-capable memory)]
[Chart: loads and stores: accesses (x10^6) and amount (x10^9) vs. CPU instructions executed (x10^9)]
MPI-ACC: an integrated and extensible approach to data movement in accelerator-based systems
DMEM inside the MPICH MPI implementation
– Enables GPU pointers in MPI calls
• (CUDA & OpenCL); generic support for heterogeneous memory subsystems is coming
– Coding productivity + performance
– Generic approach, feature independent (UVA not needed)
– Datatype and communicator attrs
• Pointer location
• Streams / queues
[Figure: evaluating an epidemiology simulation with MPI-ACC]
AM Aji, LS Panwar, F Ji, M Chabbi, K Murthy, P Balaji, KR Bisset, JS Dinan, W Feng, J Mellor-Crummey, X Ma, and RS Thakur. "On the efficacy of GPU-integrated MPI for scientific applications". In HPDC 2013.
AM Aji, P Balaji, JS Dinan, W Feng, and RS Thakur. "Synchronization and ordering semantics in hybrid MPI+GPU programming". In AsHES 2013.
J Jenkins, JS Dinan, P Balaji, NF Samatova, and RS Thakur. "Enabling fast, noncontiguous GPU data movement in hybrid MPI+GPU environments". In Cluster 2012.
AM Aji, JS Dinan, DT Buntinas, P Balaji, W Feng, KR Bisset, and RS Thakur. "MPI-ACC: an integrated and extensible approach on data movement in accelerator-based systems". In HPCC 2012.
F Ji, AM Aji, JS Dinan, DT Buntinas, P Balaji, RS Thakur, W Feng, and X Ma. "DMA-assisted, intranode communication in GPU accelerated systems". In HPCC 2012.
F Ji, AM Aji, JS Dinan, DT Buntinas, P Balaji, W Feng, and X Ma. "Efficient Intranode Communication in GPU-accelerated systems". In IPDPSW 2012.
Understanding the Relevance of Heterogeneous
Memory
[Diagram: CPU with cache and scratchpad connected to NVRAM, DDR, LP-DDR, DDR ECC, DDR 2-ECC, and vector memory]
Per-memory object data
access pattern analysis
– Statically allocated
– Dynamically allocated
Strategy:
– HW emulator
• Instruction execution
• Per-object differentiation
– Cache simulator
– Memory simulator
Goals: – To estimate the number of stall cycles per object
– To propose the most suitable memory for them
Access Pattern Overview (w/o Cache Effects): Mini-Apps
[Figures: per-object access patterns for MiniMD, MCCK, CIAN Coupling, and Nekbone]
Take Away
MPI has a lot to offer for Exascale systems
– MPI-3 and MPI-4 incorporate some of the research ideas
– MPI implementations moving ahead with newer ideas for Exascale
– Several optimizations inside implementations, and new functionality
The work is not done, still a long way to go
– But a start-from-scratch approach is neither practical nor necessary
– Invest in orthogonal technologies that work with MPI (MPI+X)
I don’t know what tomorrow’s scientific computing language will look like, but I know it will be called Fortran
I don’t know what tomorrow’s social networking site will look like, but I know it will be called Facebook
I don’t know what tomorrow’s parallel programming model will look like, but I know it will be called MPI (+X)