
Page 1:

MPI for Exascale Systems
(How I Learned to Stop Worrying about Exascale and Love MPI)

Pavan Balaji
Computer Scientist and Group Lead
Argonne National Laboratory

Mardi Gras Conference (02/28/2014)

Page 2:

Page 3:

About Argonne
– $675M operating budget
– 3,200 employees
– 1,450 scientists and engineers
– 750 Ph.D.s

Page 4:

Major Scientific User Facilities
– Advanced Photon Source
– Center for Nanoscale Materials
– Electron Microscopy Center
– Argonne Leadership Computing Facility
– Argonne Tandem Linear Accelerator System

Page 5:

Let’s talk MPI!

Page 6:

Standardizing Message-Passing Models

Early vendor systems (Intel's NX, IBM's EUI, TMC's CMMD) were not portable (or very capable)

Early portable systems (PVM, p4, TCGMSG, Chameleon) were mainly research efforts
– Did not address the full spectrum of message-passing issues
– Lacked vendor support
– Were not implemented at the most efficient level

The MPI Forum was a collection of vendors, portability library writers, and users that wanted to standardize all these efforts

Page 7:

What is MPI?

MPI: Message Passing Interface
– The MPI Forum organized in 1992 with broad participation by:
  • Vendors: IBM, Intel, TMC, SGI, Convex, Meiko
  • Portability library writers: PVM, p4
  • Users: application scientists and library writers
  • MPI-1 finished in 18 months
– Incorporates the best ideas in a "standard" way
  • Each function takes fixed arguments
  • Each function has fixed semantics
– Standardizes what the MPI implementation provides and what the application can and cannot expect
– Each system can implement it differently as long as the semantics match

MPI is not…
– a language or compiler specification
– a specific implementation or product

Page 8:

MPI Forum Organizational Structure

Steering committee

Organizers
– Martin Schulz (chair)

Bunch of working groups
– One-sided communication (chairs: Bill Gropp, Rajeev Thakur)
– Hybrid programming (chair: Pavan Balaji)
– Collective communication (chair: Torsten Hoefler)
– Communicators, Contexts and Groups (chair: Pavan Balaji)
– Tools (chair: Kathryn Mohror)
– Fortran (chair: Rolf Rabenseifner)
– Fault Tolerance (chairs: Wesley Bland, Aurelien)

Page 9:

MPI-1

MPI-1 supports the classical message-passing programming model: basic point-to-point communication, collectives, datatypes, C/Fortran bindings, etc.

MPI-1 was defined (1994) by a broadly based group of parallel computer vendors, computer scientists, and applications developers
– 2-year intensive process

Implementations appeared quickly, and now MPI is taken for granted as vendor-supported software on any parallel machine

Free, portable implementations exist for clusters and other environments (MPICH, Open MPI)

Page 10:

Following MPI Standards

MPI-2 was released in 1997
– Several additional features including MPI + threads, MPI-I/O, remote memory access, dynamic processes, C++/F90 bindings, and others

MPI-2.1 (2008) and MPI-2.2 (2009) were more recently released with some corrections to the standard and small features

MPI-3 (2012) added several new features to MPI

The Standard itself:
– at http://www.mpi-forum.org
– All MPI official releases, in both postscript and HTML

Other information on the Web:
– at http://www.mcs.anl.gov/mpi
– pointers to lots of material including tutorials, a FAQ, other MPI pages

Page 11:

Overview of New Features in MPI-3

Major new features
– Nonblocking collectives
– Neighborhood collectives
– Improved one-sided communication interface
– Tools interface
– Fortran 2008 bindings

Other new features
– Matching Probe and Recv for thread-safe probe and receive
– Noncollective communicator creation function
– "const" correct C bindings
– Comm_split_type function
– Nonblocking Comm_dup
– Type_create_hindexed_block function

C++ bindings removed; previously deprecated functions removed

Page 12:

MPI in the Exascale Era

Under a lot of scrutiny (good!)
– Lots of myths floating around (bad!)

Push to get new programming models designed and developed for exascale

The truth is that MPI today is a new programming model (compared to 2004), and MPI in 2020 will be a new programming model (compared to today)

Strengths of MPI
– Composability
  • Ability to build tools and libraries above and around MPI
  • No "do everything under the sun" attitude
– Continuous evolution
  • The standard incorporates the best research ideas

Page 13:

MPI: A Philosophical Perspective

There is debate in the community that MPI is too hard to program
– We don't disagree
– It's meant to be a low-level portable runtime on top of which higher-level programming models should be developed

A programming model has to pick a tradeoff between programmability, portability, and performance
– MPI has chosen to be a high-performance/portable programming model
– The focus has been on completeness and the ability to help real and complex applications meet their computational needs

MPI's goal is not to make simple programs easy to write, but to make complex programs possible to write

Page 14:

MPI Ecosystem

MPI concentrates on minimal required features (not minimal functions!) and aims to build a large ecosystem that provides easier/more-specific interfaces
– High-level libraries for tasks, global address space, math, visualization, I/O

[Figure: ecosystem of MPI implementations (Intel MPI, IBM MPI, Cray MPI, MPICH, MVAPICH, Tianhe MPI, MathWorks MPI) surrounded by libraries and tools (MPE, PETSc, HPCToolkit, TAU, Totalview, DDT, ADLB, ANSYS)]

Page 15:

Current Situation with Production Applications (1)

The vast majority of DOE's production parallel scientific applications today use MPI
– An increasing number use the (MPI + OpenMP) hybrid
– Some are exploring the (MPI + accelerator) hybrid

Today's largest systems in terms of number of regular cores (excluding GPU cores):
– Sequoia (LLNL): 1,572,864 cores
– Mira (ANL): 786,432 cores
– K computer: 705,024 cores
– Jülich BG/Q: 393,216 cores
– Blue Waters: 386,816 cores
– Titan (ORNL): 299,008 cores

MPI already runs in production on systems with up to 1.6 million cores

Page 16:

Current Situation with Production Applications (2)

IBM has successfully scaled the LAMMPS application to over 3 million MPI ranks

Applications are running at scale on LLNL's Sequoia and achieving 12 to 14 petaflops sustained performance

The HACC cosmology code from Argonne (Salman Habib) achieved 14 petaflops on Sequoia
– Ran on the full Sequoia system using the MPI + OpenMP hybrid
– Used 16 MPI ranks * 4 OpenMP threads on each node, which matches the hardware architecture: 16 cores per node with 4 hardware threads each
– http://www.hpcwire.com/hpcwire/2012-11-29/sequoia_supercomputer_runs_cosmology_code_at_14_petaflops.html
– SC12 Gordon Bell Prize finalist

Page 17:

Current Situation with Production Applications (3)

The Cardioid cardiac modeling code (IBM & LLNL) achieved 12 petaflops on Sequoia
– Models a beating human heart at near-cellular resolution
– Ran at scale on the full system (96 racks)
– Used an MPI + threads hybrid: 1 MPI rank per node and 64 threads
– OpenMP was used for thread creation only; all other thread choreography and synchronization used custom code, not OpenMP pragmas
– http://nnsa.energy.gov/mediaroom/pressreleases/sequoia112812
– SC12 Gordon Bell Prize finalist

And there are other applications running at similar scales…

Page 18:

On the Path to Exascale (based on Bill Dally's talk at ISC 2013)

[Chart: Top500 and Green500 trends from 2011/06 to 2013/11 relative to an exaflop target; the remaining gap is 11.1–26.3X]
– Device technology / fabrication process: 2.2X
– Software improvements: 1.7–4X
– Logic circuit design: 3X

Page 19:

New Features in MPI-3

Page 20:

Nonblocking Collectives

Nonblocking versions of all collective communication functions have been added
– MPI_Ibcast, MPI_Ireduce, MPI_Iallreduce, etc.
– There is even a nonblocking barrier, MPI_Ibarrier

They return an MPI_Request object, similar to nonblocking point-to-point operations

The user must call MPI_Test/MPI_Wait or their variants to complete the operation

Multiple nonblocking collectives may be outstanding, but they must be called in the same order on all processes
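
For illustration, a minimal sketch of overlapping a nonblocking collective with computation; compute_local_work() is a hypothetical application routine:

    #include <mpi.h>

    extern void compute_local_work(void);   /* hypothetical application routine */

    void overlap_example(MPI_Comm comm, double *local, double *global, int n)
    {
        MPI_Request req;

        /* Start the reduction; it progresses in the background */
        MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

        /* Overlap with work that does not depend on 'global' */
        compute_local_work();

        /* Complete the collective before using 'global' */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }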

Page 21:

Nonblocking Collectives Overlap

Software pipelining

– More complex parameters

– Progression issues

– Not scale-invariant

Hoefler: Leveraging Non-blocking Collective Communication in High-performance Applications

Page 22:

Dynamic Sparse Data Exchange (DSDE)

Main problem: metadata
– Determine who wants to send how much data to me (I must post receives and reserve memory)

OR use MPI semantics:
– Unknown sender: MPI_ANY_SOURCE
– Unknown message size: MPI_PROBE
– Reduces the problem to counting the number of neighbors
– Allows a faster implementation!

T. Hoefler et al.: Scalable Communication Protocols for Dynamic Sparse Data Exchange

Page 23:

MPI_Ibarrier (NBX)

Complexity – census (barrier)
– Combines metadata with the actual transmission
– Point-to-point synchronization
– Continue receiving until the barrier completes
– Processes start the collective synchronization (barrier) when their point-to-point phase has ended
  • The barrier acts as a distributed marker!
– Better than PEX, PCX, RSX!

Page 24:

Parallel Breadth First Search

On a clustered Erdős-Rényi graph, weak scaling
– 6.75 million edges per node (filled 1 GiB)

HW barrier support is significant at large scale!

[Plots: BlueGene/P (with HW barrier) and Myrinet 2000 with LibNBC]

Page 25:

Neighborhood Collectives

New functions MPI_Neighbor_allgather, MPI_Neighbor_alltoall, and their variants define collective operations among a process and its neighbors

Neighbors are defined by an MPI Cartesian or graph virtual process topology that must be previously set

These functions are useful, for example, in stencil computations that require nearest-neighbor exchanges

They also represent sparse all-to-many communication concisely, which is essential when running on many thousands of processes
– Do not require passing long vector arguments as in MPI_Alltoallv
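
An illustrative sketch of a 2D Cartesian topology with a neighborhood exchange; the grid factorization, buffers, and counts are assumptions, not from the slides:

    #include <mpi.h>

    void stencil_exchange(MPI_Comm comm, double *sendcell, double *recvcells)
    {
        MPI_Comm cart;
        int dims[2] = {0, 0}, periods[2] = {1, 1}, nprocs;

        MPI_Comm_size(comm, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);                  /* factor processes into a 2D grid */
        MPI_Cart_create(comm, 2, dims, periods, 0, &cart); /* neighbors are now defined */

        /* Exchange one double with each of the 4 Cartesian neighbors;
           recvcells must have room for 4 entries, one per neighbor */
        MPI_Neighbor_allgather(sendcell, 1, MPI_DOUBLE,
                               recvcells, 1, MPI_DOUBLE, cart);

        MPI_Comm_free(&cart);
    }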

Page 26:

Cartesian Neighborhood Collectives

[Figure: buffer ordering example for Cartesian neighborhood collectives]

T. Hoefler and J. L. Traeff: Sparse Collective Operations for MPI

Page 27:

Improved RMA Interface

Substantial extensions to the MPI-2 RMA interface

New window creation routines:
– MPI_Win_allocate: MPI allocates the memory associated with the window (instead of the user passing allocated memory)
– MPI_Win_create_dynamic: creates a window without memory attached; the user can dynamically attach and detach memory to/from the window by calling MPI_Win_attach and MPI_Win_detach
– MPI_Win_allocate_shared: creates a window of shared memory (within a node) that can be accessed simultaneously by direct load/store accesses as well as RMA ops

New atomic read-modify-write operations:
– MPI_Get_accumulate
– MPI_Fetch_and_op (simplified version of Get_accumulate)
– MPI_Compare_and_swap
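
A small illustrative sketch combining MPI_Win_allocate with the new MPI_Fetch_and_op atomic; the "shared counter on rank 0" use case is an assumption for the example:

    #include <mpi.h>

    /* Every process atomically increments a counter that lives on rank 0 */
    long shared_counter_increment(MPI_Comm comm)
    {
        long *base, one = 1, old = 0;
        MPI_Win win;

        /* MPI allocates the window memory (new in MPI-3) */
        MPI_Win_allocate(sizeof(long), sizeof(long), MPI_INFO_NULL, comm, &base, &win);
        *base = 0;
        MPI_Barrier(comm);                  /* make sure the counter is initialized */

        MPI_Win_lock_all(0, win);           /* passive-target epoch */
        MPI_Fetch_and_op(&one, &old, MPI_LONG, 0, 0, MPI_SUM, win);
        MPI_Win_flush(0, win);              /* 'old' now holds the previous value */
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);                 /* collective; a real code would reuse the window */
        return old;
    }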

Page 28:

One-sided Communication

The basic idea of one-sided communication models is to decouple data movement from process synchronization
– Should be able to move data without requiring that the remote process synchronize
– Each process exposes a part of its memory to other processes
– Other processes can directly read from or write to this memory

[Figure: processes 0–3 each with private memory plus a remotely accessible region; the remotely accessible regions together form a global address space]

Page 29:

Use Case: Distributed Shared Arrays

Quantum Monte Carlo: ensemble data
– Represents the initial quantum state
– Spline representation, cubic basis functions
– Large (100+ GB), read-only table of coefficients
– Accesses are random

Coupled cluster simulations
– Evolving quantum state of the system
– Very large tables of coefficients
– Table at iteration t is read-only; the table at t+1 is accumulate-only
– Accesses are non-local/overlapping

Global Arrays PGAS programming model
– Can be supported with passive-mode RMA [Dinan et al., IPDPS'12]

Page 30:

Case-study: NWChem over MPI-3

[Charts: CCSD performance (wallclock time in seconds per iteration) and CCSD(T) performance (average iteration time in seconds), each comparing the MPI-2 and MPI-3 based versions of NWChem]

Courtesy Jeff Hammond, Argonne National Laboratory

Page 31:

Hybrid Programming with Shared Memory

MPI-3 allows different processes to allocate shared memory through MPI
– MPI_Win_allocate_shared

Uses many of the concepts of one-sided communication

Applications can do hybrid programming using MPI or load/store accesses on the shared memory window

Other MPI functions can be used to synchronize access to shared memory regions

Can be simpler to program than threads

Page 32:

Creating Shared Memory Regions in MPI

[Diagram: MPI_COMM_WORLD is split with MPI_Comm_split_type(MPI_COMM_TYPE_SHARED) into per-node shared memory communicators; MPI_Win_allocate_shared then creates a shared memory window on each of them]
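
An illustrative sketch of the flow in the diagram above; the 1 MB segment size and the use of rank 0's segment are assumptions for the example:

    #include <mpi.h>

    void node_shared_region(MPI_Comm comm)
    {
        MPI_Comm node_comm;
        MPI_Win win;
        char *mybase, *rank0_base;
        MPI_Aint sz;
        int disp_unit, noderank;

        /* One communicator per shared-memory node */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &noderank);

        /* Each process contributes 1 MB to a node-wide shared window */
        MPI_Win_allocate_shared(1 << 20, 1, MPI_INFO_NULL, node_comm, &mybase, &win);

        /* Pointer to rank 0's segment for direct load/store access */
        MPI_Win_shared_query(win, 0, &sz, &disp_unit, &rank0_base);

        MPI_Win_lock_all(0, win);
        if (noderank == 0) rank0_base[0] = 42;   /* plain store into shared memory */
        MPI_Win_sync(win);                       /* memory barrier */
        MPI_Barrier(node_comm);                  /* order the store before other ranks load */
        MPI_Win_sync(win);
        /* all ranks on the node may now read rank0_base[0] directly */
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
    }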

Page 33:

Regular RMA windows vs. Shared memory windows

Shared memory windows allow application processes to directly perform load/store accesses on all of the window memory
– E.g., x[100] = 10

All of the existing RMA functions can also be used on such memory for more advanced semantics such as atomic operations

Can be very useful when processes want to use threads only to get access to all of the memory on the node
– You can create a shared memory window and put your shared data in it

[Figure: traditional RMA windows use load/store on local memory and PUT/GET between processes; shared memory windows allow load/store directly on both processes' window memory]

Page 34:

Case Study: Genome Assembly

Largest genome assembly to date: a 2.3 TB dataset, performed with the MPI-3 shared memory capability
– First terascale genome assembly

Very simple optimization: place all of the node's dataset in shared memory and access it as read-only data

Could not use threads, because all MPI calls would face lock overheads

[Chart: Terabase Assembly on Cray XE6]

Courtesy Fangfang Xia, Argonne National Laboratory

Page 35:

Tools Interface (1)

An extensive interface to allow tools (performance analyzers, etc.) to portably extract information about MPI processes

Enables the setting of various control variables within an MPI implementation, such as algorithmic cutoff parameters
– e.g., eager vs. rendezvous thresholds
– Switching between different algorithms for a collective communication operation

Provides portable access to performance variables that can give insight into internal performance information of the MPI implementation
– e.g., length of the unexpected message queue

Note that each implementation defines its own performance and control variables; MPI does not define them
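
An illustrative sketch that lists the control variables an implementation exposes through the MPI_T tools interface; only standard MPI_T calls are used, but the variables actually printed depend entirely on the implementation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, ncvars;

        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);  /* MPI_T is usable before MPI_Init */
        MPI_T_cvar_get_num(&ncvars);

        for (int i = 0; i < ncvars; i++) {
            char name[256], desc[256];
            int namelen = sizeof(name), desclen = sizeof(desc);
            int verbosity, binding, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &dtype, &enumtype,
                                desc, &desclen, &binding, &scope);
            printf("cvar %d: %s - %s\n", i, name, desc);
        }

        MPI_T_finalize();
        return 0;
    }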

Page 36:

Tools Interface (2)

Standardized model for debuggers to query for MPI library information

Totalview, DDT, etc., can peek inside the MPI library and provide information on queues and other operations

This has been supported by almost every MPI implementation for many years; it has only now been blessed by the Forum

(This is a companion standard, not included in MPI-3)

Page 37:

Fortran 2008 Bindings

An additional set of bindings for the latest Fortran specification

Supports full and better-quality argument checking with individual handles

Support for choice arguments, similar to (void *) in C

Enables passing array subsections to nonblocking functions

First parallel programming library to actually be Fortran compliant
– Well, not exactly; requires some Fortran TS 29113 extensions

Page 38:

Support for Very Large Data

Ability to communicate very large data (more than 2 GB)

The model is a little round-about for backward compatibility reasons, but not too hard
– If < 2 GB, use MPI communication directly
– If >= 2 GB, create a derived datatype that represents the large data

Added a few extra functions that would typically be used by high-level libraries to query for such information
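
A sketch of the derived-datatype workaround described above for sending more than 2 GB; the 1 GB chunking and the multiple-of-chunk assumption are choices made for the example:

    #include <mpi.h>

    /* Send 'nbytes' (possibly > 2 GB) by wrapping fixed-size chunks in a datatype */
    void send_large(const char *buf, MPI_Count nbytes, int dest, MPI_Comm comm)
    {
        const MPI_Count chunk = 1 << 30;            /* 1 GB per chunk */
        MPI_Count nchunks = nbytes / chunk;         /* assumes nbytes is a multiple of chunk;
                                                       a real code also sends the remainder */
        MPI_Datatype chunk_t, big_t;

        MPI_Type_contiguous((int)chunk, MPI_BYTE, &chunk_t);
        MPI_Type_contiguous((int)nchunks, chunk_t, &big_t);  /* counts fit in an int again */
        MPI_Type_commit(&big_t);

        MPI_Send(buf, 1, big_t, dest, 0, comm);     /* one element of the big datatype */

        MPI_Type_free(&big_t);
        MPI_Type_free(&chunk_t);
    }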

Page 39:

Status of MPI-3 Implementations

[Table: MPI-3 feature support as of early 2014 across MPICH, MVAPICH, Cray, Tianhe, Intel, IBM PE, IBM Platform, Open MPI, SGI, Fujitsu, and MS MPI. Nonblocking collectives, neighborhood collectives, the new RMA interface, shared memory windows, noncollective communicator creation, new datatypes, large counts, and matched probe were already released (✔) in MPICH, MVAPICH, Cray, Tianhe, and Intel MPI; the tools interface and the Fortran 2008 bindings were available in only a few implementations; the remaining cells list target dates between Q4 2013 and Q2 2015.]

Release dates are estimates and are subject to change at any time.
Empty cells indicate no publicly announced plan to implement/support that feature.

Page 40:

Beyond MPI-3

Page 41:

MPI Research Activities beyond MPI-3

Improved interoperability with threads and shared memory
– Some of this is being considered for MPI-4

Fault tolerance
– Tolerating faults in processes (fatal crashes), memory, network
– Some of it is being considered for MPI-4

Tasking models
– Message-driven execution, irregular computations
– Some of it (Active Messages) is being considered for MPI-4

Interacting with heterogeneous memory
– Non-DRAM memories (NVRAM, scratchpad, on-chip memory)
– Definitely beyond MPI-4

Page 42:

MPI+Threads Hybrid Programming

One of the most successful models in use today

Hybrid programming vs. a single unified programming model
– The number of models we program to should not be too large, but a small collection of standardized programming models that interoperate with each other is not a bad thing
– MPI+OpenMP has demonstrated this successfully

[Figure: "Why is this better than this?" comparing a hybrid MPI+X stack with a single unified model]

Page 43:

Four levels of MPI Thread Safety

MPI_THREAD_SINGLE
– MPI only, no threads

MPI_THREAD_FUNNELED
– MPI calls outside the OpenMP parallel region, or in the OpenMP master region

MPI_THREAD_SERIALIZED
– MPI calls outside the OpenMP parallel region, or in an OpenMP single or critical region

MPI_THREAD_MULTIPLE
– Any thread is allowed to make MPI calls at any time

Funneled:

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        uu[i] = (u[i] + u[i - 1] + u[i + 1]) / 5.0;
    }
    MPI_Function();

Serialized (single):

    #pragma omp parallel
    {
        /* user computation */
        #pragma omp single
        MPI_Function();
    }

Serialized (critical):

    #pragma omp parallel
    {
        /* user computation */
        #pragma omp critical
        MPI_Function();
    }
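
For illustration, a minimal sketch of requesting the highest thread level and checking what the implementation actually provides:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Ask for full multithreading; the library may return a lower level */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE)
            printf("MPI_THREAD_MULTIPLE not available (got level %d)\n", provided);

        /* ... hybrid MPI+OpenMP work ... */

        MPI_Finalize();
        return 0;
    }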

Page 44:

Problem: Idle Resources during MPI Calls

Threads are only active in the computation phase; threads are IDLE during MPI calls

(a) Funneled mode:

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        uu[i] = (u[i] + u[i - 1] + u[i + 1]) / 5.0;
    }
    MPI_Function();    /* only the master thread is inside the MPI call */

(b) Serialized mode:

    #pragma omp parallel
    {
        /* user computation */
        #pragma omp single
        MPI_Function();    /* only one thread is inside the MPI call */
    }

Page 45:

Our Approach: Sharing Idle Threads with the Application inside MPI

(a) Funneled mode:

    #pragma omp parallel
    {
        /* user computation */
    }
    MPI_Function() {
        #pragma omp parallel
        {
            /* MPI internal task */
        }
    }

(b) Serialized mode:

    #pragma omp parallel
    {
        /* user computation */
        #pragma omp single
        MPI_Function() {
            #pragma omp parallel
            {
                /* MPI internal task */
            }
        }
    }

Page 46:

Challenges

Some parallel algorithms are not efficient with insufficient threads and need a tradeoff, but the number of available threads is UNKNOWN!

Nested parallelism
– Simply creates new Pthreads
– Offloads thread scheduling to the OS, causing a thread overrunning (oversubscription) issue

(a) Unknown number of IDLE threads:

    #pragma omp parallel
    {
        /* user computation */
        #pragma omp single
        MPI_Function() {
            ...
        }
    }

(b) Threads overrunning (each parallel region creates N Pthreads):

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp parallel
            { ... }
        }
    }

Page 47:

Derived Data Type Packing Processing

MPI_Pack / MPI_Unpack

Communication using derived datatypes
– Transfer non-contiguous data
– Pack / unpack data internally

Parallelized unpacking of a vector datatype (count blocks of blocklength elements, separated by a stride):

    #pragma omp parallel for
    for (i = 0; i < count; i++) {
        dest[i] = src[i * stride];
    }

[Chart: communication time speedup and execution time speedup vs. number of threads (1–240) for the hybrid MPI+OpenMP NAS Parallel MG benchmark]

Page 48:

Parallel shared memory communication

Get as many available cells as we can

Parallelizing large data movement

[Figure: (a) sequential pipelining vs. (b) parallel pipelining of a large message through a shared buffer of cells (Cell[0]–Cell[3]) between the sender's and receiver's user buffers]

Page 49:

Intra-node Large Message Communication

OSU MPI micro-benchmark

[Charts: P2P bandwidth speedup vs. number of threads (1–120) for message sizes of 64 KB, 256 KB, 1 MB, 4 MB, and 16 MB; similar results for latency and message rate]

– Caused by a too-small eager/rendezvous communication threshold on Xeon Phi, not by MT-MPI!
– Poor pipelining but worse parallelism

Page 50:

Parallel InfiniBand Communication

Structures
– IB context
– Protection Domain
– Queue Pair (critical): 1 QP per connection
– Completion Queue (critical): shared by 1 or more QPs

RDMA communication
– Post RDMA operation to a QP
– Poll completion from the CQ

Internally supports multi-threading

OpenMP contention issue

[Figure: MPICH stack (ADI3 / CH3 / nemesis with SHM, TCP, IB netmods) and processes P0–P2 sharing QPs, a CQ, and a PD over the HCA's IB context]

Page 51:

One-sided Graph500 benchmark

Every process issues many MPI_Accumulate operations to the other processes in every breadth-first search iteration

Scale 2^22, edge factor 16, 64 MPI processes

[Chart: harmonic mean TEPS and improvement vs. number of threads (1–64)]

Page 52:

MPI_THREAD_MULTIPLE optimizations

Global
– Use a single global mutex, held from function enter to exit
– Existing implementation

Brief Global
– Use a single global mutex, but reduce the size of the critical section as much as possible

Per-Object
– Use one mutex per data object: lock data, not code sections

Lock-Free
– Use no mutexes
– Use lock-free algorithms

[Chart: performance of processes vs. threads]

Page 53:

Several Optimizations Later…

Reduction of lock granularities

Thread-local pools to reduce sharing

Per-object locks

Some atomic operations (reference counts)

But the performance scaling was still suboptimal

Page 54:

From Pure-MPI to Multithreaded-MPI: Multi-Node Communication Throughput

Example: circular communication pattern
– One-sided communications
– No intra-node communication
– 17 nodes x 16 cores/node
– Global mutex

Throughput drops severely when moving to threads

Problems:
– Safer algorithms?
– Lock contention?

[Chart: throughput [Kbytes/s] vs. core count for process/thread mixes P16:T1, P8:T2, P4:T4, P2:T8, P1:T16]

Page 55:

Contention in a Multithreaded MPI Model

Multithreaded MPI
– Threads can make MPI calls concurrently
– Thread-safety is necessary

    MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...);
    ...
    #pragma omp parallel
    {
        /* Do Work */
        MPI_Put(...);
        /* Do Work */
    }

Thread-safety can be ensured by:
– Critical sections (locks): possible contention!
– Lock-free algorithms: non-trivial, and memory barriers are still needed
– Thread sleeping/polling: contention

[Figure: two threads of an MPI process entering and exiting the critical section around concurrent MPI_Put calls]

Page 56:

Hidden Evil: Lock Monopolization (Starvation)

Implementing critical sections with spin-locks or mutexes
– Watch out: no fairness guarantee!

Starvation detection algorithm:

    int waiting_threads = 0;
    int last_holder;

    acquire_lock(L) {
        bool lock_acquired = false;
        try_lock(L, lock_acquired);
        if (lock_acquired &&
            (my_thread_id == last_holder) &&
            (waiting_threads > 0))
            STARVATION_CASE;
        else if (!lock_acquired) {
            atomic_incr(waiting_threads);
            lock(L);
            atomic_decr(waiting_threads);
        }
        last_holder = my_thread_id;
        return;
    }

[Chart: starvation measurement with 16 processes and 16 threads/node — percentage of locks acquired vs. number of starving threads for Circular_comm and Graph500]

Page 57:

How to fix Lock Monopolization?

Use locks that ensure fairness

Example: Ticket Spin-Lock

Basics:
– Get my ticket and wait my turn
– Ensures FIFO acquisition

[Figure: simplified execution flow of a thread-safe MPI implementation with critical sections]

[Chart: speedup of the ticket lock over the mutex vs. number of nodes (16, 32, 64); 2D stencil, halo = 2 MB/direction, message size = 1 KB, 16 threads/node]
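
A minimal sketch of the ticket lock idea using C11 atomics (not the slides' implementation; the type and function names are assumptions):

    #include <stdatomic.h>

    /* FIFO ticket lock: each thread takes a ticket, then spins until it is served.
       Initialize both counters to 0 before use. */
    typedef struct {
        atomic_uint next_ticket;   /* ticket dispenser */
        atomic_uint now_serving;   /* ticket currently allowed to enter */
    } ticket_lock_t;

    static void ticket_lock(ticket_lock_t *l)
    {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* take a ticket */
        while (atomic_load(&l->now_serving) != my)
            ;                                                 /* spin until my turn */
    }

    static void ticket_unlock(ticket_lock_t *l)
    {
        atomic_fetch_add(&l->now_serving, 1);                 /* serve the next ticket */
    }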

Page 58:

Priority Locking Scheme

3 basic locks:
– One for mutual exclusion in each priority level
– Another for high-priority threads to block lower ones

Watch out: do not forget fairness in the same priority level
– Use exclusively FIFO locks (Ticket)

[Chart: speedup over the mutex for the ticket and priority locks vs. number of nodes (16, 32, 64); 2D stencil, halo = 2 MB/direction, message size = 1 KB, 16 threads/node]

Page 59:

Does Fixing Lock Contention Solve the Problem?

Spin-lock based critical sections

Contention metric: wasted polls

Test scenarios:
– Micro-benchmarks
– HPC applications

    #pragma omp parallel
    {
        for (i = 0; i < NITER; i++) {
            MPI_Put(...);
            /* Delay X us */
            Delay(X);
        }
    }

[Charts: percentage of locks acquired vs. number of polls in lock acquisition — (a) MPI_Put micro-benchmark on 32 nodes x 8 cores with delays of 0, 10, 20, 50, and 100 us; (b) Graph500 on 32 nodes x 8 cores with SCALE = 14, 16, 18]

"When you have eliminated the impossible, whatever remains, however improbable, must be the truth." – Sherlock Holmes, Sign of Four, Sir Arthur Conan Doyle

Page 60:

Hybrid MPI+OpenMP (or other threading models)

The thread execution model exposed to applications is too simplistic

OpenMP threads can be pthreads (i.e., can execute concurrently) or user-level threads such as qthreads (i.e., might or might not execute concurrently)
– Not exposed to users/the MPI library

What does this mean to MPI?
– The MPI runtime never knows when two threads can execute concurrently and when they cannot
– It always has to perform locks and memory consistency calls (memory barriers), even when switching between user-level threads

Page 61:

Argobots: Integrated Computation and Data Movement with Lightweight Work Units

Execution Model
– Execution stream: a thread executed on a hardware processing element
– Work units: a user-level thread or a tasklet with a function pointer

Memory Model
– Memory domains: a memory consistency call on a big domain also impacts all internal domains
– Synchronization: explicit & implicit memory consistency calls
– Network: PUT/GET, atomic ops

[Figure: execution streams running work units over nested memory domains — consistency domains (CD0, CD1) inside a load/store (LS) domain inside a noncoherent load/store (NCLS) domain, with PUT/GET between them]

Page 62:

Argobots Ongoing Work: Fine-grained Context-aware Thread Library

Two levels of threads
– Execution stream: a normal thread
– Work unit: a user-level thread or a tasklet with a function pointer

Avoid unnecessary lock/unlock
– Case 1: switch execution to another work unit in the same execution stream without unlocking
– Case 2: switch to another execution stream, call unlock

Scheduling work units in batch order
– Work units in the same execution stream will be batch executed

[Figure: work units w1, w3, w5 scheduled onto execution streams e1 and e2]

Page 63:

MPI + Shared memory with Partitioned Address Space

A partitioned virtual address space allows us to create different virtual address images instead of a single OS virtual address space

Each process has a virtual address space, but they can look into each other's virtual address space

All memory is semi-private

Courtesy Atsushi Hori, University of Tokyo

Page 64:

MPI + Shared memory with User-level processes

Extending the PVAS concept to "user-level processes"

Multiple "MPI processes" inside a single OS process

Each MPI process has its own address space, but the MPI implementation can look into other processes' memory and cooperatively yield to them

Collaborative work with Atsushi Hori, University of Tokyo

Page 65:

Fault Tolerance: What failures do we expect?

CPU (or full system)
– Total failure

Memory
– ECC errors (corrected/uncorrected)
– Silent errors

Network
– Dropped packets
– Lost routes
– NIC failure

GPU/Accelerator
– Memory errors
– Total failure

Disk
– R/W errors
– Total failure

Page 66:

CPU (Total System) Failures

Generally will result in a process failure from the perspective of other (off-node) processes

Need to recover/repair lots of parts of the system
– Communication library (MPI)
– Computational capacity (if necessary)
– Data
  • C/R, ABFT, natural fault tolerance, etc.

MPI-3 has the theoretical ability to deal with faults, but the user has to do a bunch of bookkeeping to make that happen
– A new communicator has to be created
– All requests have to be kept track of and migrated to the new communicator
– Need to watch out for failure messages from other processes

Page 67:

MPIXFT: MPI-3 based library for FT bookkeeping

Lightweight virtualization infrastructure
– Gives users virtual communicators and requests and internally manages the real ones

Automatically repairs MPI communicators as failures occur
– Handles running in an n-1 model

Virtualizes the MPI communicator
– The user gets an MPIXFT wrapper communicator
– On failure, the underlying MPI communicator is replaced with a new, working communicator

[Figure: an MPIXFT_COMM wrapping the underlying MPI_COMM, which is swapped out on failure]

Page 68:

MPIXFT Design

Possible because of new MPI-3 capabilities
– Nonblocking equivalents for (almost) everything
– MPI_COMM_CREATE_GROUP

MPI_BARRIER is replaced by MPI_IBARRIER plus MPI_WAITANY over the barrier request and a failure notification request

[Figure: among processes 0–5, a failure is reported with MPI_ISEND() and the survivors rebuild the communicator with MPI_COMM_CREATE_GROUP()]

Page 69:

MPIXFT Results

MCCK mini-app
– Domain decomposition communication kernel
– Overhead within standard deviation

Halo exchange (1D, 2D, 3D)
– Up to 6 outstanding requests at a time
– Very low overhead

[Charts: time (s) vs. number of processes (16–256) for the MCCK mini-app and for the 1D/2D/3D halo exchange, comparing MPICH with MPIXFT]

Page 70:

User Level Failure Mitigation

Proposed change to the MPI standard for MPI-4

Repairs MPI after process failure
– Enables more custom recovery than MPIXFT

Does not pick a particular recovery technique as better or worse than others

Introduces minimal changes to MPI

Works around performance problems with MPIXFT

Treats process failures as fail-stop failures
– Transient failures are masked as fail-stop

Ability to notify remaining processes on errors

Page 71:

Recovery with only notification

Master/Worker example

Post work to multiple processes

MPI_Recv returns an error due to the failure
– MPI_ERR_PROC_FAILED if named
– MPI_ERR_PROC_FAILED_PENDING if wildcard

The master discovers which process has failed with ACK/GET_ACKED

The master reassigns the work to worker 2

[Figure: the master sends work to workers 1–3; after worker 1 fails, the master discovers the failure and resends the work to worker 2]

Page 72:

Failure Propagation

When necessary, manual propagation is available
– MPI_Comm_revoke(MPI_Comm comm)
  • Interrupts all non-local MPI calls on all processes in comm
  • Once revoked, all non-local MPI calls on all processes in comm will return MPI_ERR_REVOKED
– Exceptions are MPI_COMM_SHRINK and MPI_COMM_AGREE (later)
– Necessary for deadlock prevention

Often unnecessary
– Let the application discover the error as it impacts correct completion of an operation

[Figure: process 1 fails; process 0's Recv(1) returns an error and process 0 revokes the communicator, so the pending Recv(3), Send(2), and Recv(0) on processes 2 and 3 return as revoked]

Page 73:

GVR (Global View Resilience) for Memory Errors

Multi-versioned, distributed memory
– The application commits "versions", which are stored by a backend
– Versions are coordinated across the entire system

Different from C/R
– Don't roll back the full application stack, just the specific data

Rollback & re-compute if there is an uncorrected error

Parallel computation proceeds from phase to phase; phases create new logical versions, enabling app-semantics-based recovery

Page 74:

Memory Fault Tolerance

Memory stashing

Multi-versioned memory

Implementation-defined block-level retrieval across versions

Silent data corruption handled through application hooks (only works if applications can detect "silent errors")

Nondeterministic methods can naturally deal with them because of some common hardware properties
– Exponents are less error-prone than mantissas, because of hardware logic models (the mantissa uses simpler hardware)
– Instruction caches are more reliable than data caches

Page 75:

MPI Memory Resilience (Planned Work)

Integrate memory stashing within the MPI stack
– Data can be replicated across different kinds of memories (or storage)
– On error, repair data from the backup memory (or disk)

[Figure: a traditional MPI_PUT updates a single copy; a replicated MPI_PUT updates both the primary and the backup copy]

Page 76:

VOCL: Transparent Remote GPU Computing

[Figure: in the traditional model, the application uses the native OpenCL library to drive the physical GPU on its own compute node; in the VOCL model, the application links against the VOCL library, which exposes virtual GPUs and forwards OpenCL API calls to VOCL proxies running on the compute nodes that host the physical GPUs]

Transparent utilization of remote GPUs

Efficient GPU resource management:
– Migration (GPU / server)
– Power management: pVOCL

Page 77:

VOCL-FT (Fault Tolerant Virtual OpenCL)

Double-bit (uncorrected) and single-bit (corrected) error counters may be queried in both models

Synchronous detection model
– The user application's bufferWrite / launchKernel / bufferRead / sync sequences are interleaved with ECC queries and checkpointing by the VOCL-FT functionality

Asynchronous detection model
– A separate VOCL-FT detecting thread issues the ECC queries while the user application runs
– Minimum overhead, but double-bit errors will trash whole executions

[Figure: timelines of the synchronous and asynchronous detection models]

Page 78:

VOCL-FT: Single and Double Bit Error Detection Overhead

[Charts: execution time (ms) for Matrix Transpose (512x512 to 4096x4096), Smith-Waterman (query sizes 1K–6K), Nbody (15,360–38,400 bodies), and DGEMM (matrix blocks 512x512 to 4096x4096), each comparing plain vocl with the async and sync detection modes]

Page 79:

Data-Intensive Applications
– Graph algorithms in social network analysis
– DNA sequence assembly

[Figure: DNA sequence assembly — a local node issues remote searches against fragments (e.g., ACGCGATTCAG, GCGATTCAGTA) held on remote nodes to build the DNA consensus sequence ACGCGATTCAGTA]

Common characteristics
– Organized around sparse structures
– Communication-to-computation ratio is high
– Irregular communication pattern

Page 80:

Tasking Models

Several tasking models built on top of MPI (ADLB, Scioto)

Provide a dynamic task model, potentially with dependencies, for applications to use

– MPI processes are static; the application's view of computation/data is dynamic

Need to be taken with a grain of salt

– Asynchronous execution is not always great: it can screw up cache locality

– Synchronization does not necessarily mean idleness

– Idleness does not necessarily mean worse performance or energy usage

The big problem with these and other tasking models is MPI interoperability

– Libraries built on top of MPI are better in this regard, but still not perfect, because there is no support for them from the runtime system

– E.g., if I spawn off tasks to execute on a remote process while that process is waiting in an MPI_BARRIER, the tasking library does not have control of the process to execute the tasks (a sketch of the issue and a common workaround follows)
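For illustration, one way to sidestep the MPI_BARRIER scenario above is to replace the blocking barrier with the nonblocking MPI-3 MPI_Ibarrier and interleave MPI progress with task execution. task_queue_run_one() is a hypothetical tasking-library call, not part of MPI; this is a sketch of the workaround, not how any particular library does it.

```c
#include <mpi.h>

extern int task_queue_run_one(void);   /* hypothetical: run one pending task */

/* Behaves like MPI_Barrier, but keeps servicing tasks that other ranks have
 * shipped to this process while the barrier is still outstanding. */
void barrier_with_task_progress(MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Ibarrier(comm, &req);            /* nonblocking barrier (MPI-3) */
    while (!done) {
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* drive MPI progress */
        task_queue_run_one();            /* keep executing incoming tasks */
    }
}
```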

Mardi Gras Conference (02/28/2014)

Page 81: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

Message Passing Models

Current MPI is not well suited to data-intensive applications

[Figure: two-sided communication (explicit sends and receives): Process 0 calls Send(data) and Process 1 must post a matching Receive(data). One-sided (RMA) communication (explicit sends, implicit receives, simple remote operations): Process 0 issues Put(data), Get(data), and Acc(data) += against Process 1 without a matching receive. Active messages: the origin sends messages to a message handler on the target, which may trigger a reply handler.]

Active Messages

– Sender explicitly sends message

– Upon the message's arrival, a message handler is triggered; the receiver is not explicitly involved

– User-defined operations on remote process
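The two-sided and one-sided models in the figure map directly onto standard MPI calls, as in the sketch below; active messages have no standard MPI equivalent, which is what motivates the MPI-AM work on the following slides. The window setup (exposing at least 2*n doubles on rank 1) is assumed to have been done elsewhere.

```c
#include <mpi.h>

/* Two-sided: the target must post a matching receive. */
void two_sided(int rank, double *buf, int n)
{
    if (rank == 0)
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* One-sided (RMA): the target only takes part in the synchronization. */
void one_sided(int rank, double *buf, int n, MPI_Win win)
{
    MPI_Win_fence(0, win);
    if (rank == 0) {
        MPI_Put(buf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);     /* write */
        MPI_Accumulate(buf, n, MPI_DOUBLE, 1, n, n, MPI_DOUBLE,
                       MPI_SUM, win);                               /* +=    */
    }
    MPI_Win_fence(0, win);
}
```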

Mardi Gras Conference (02/28/2014)

Page 82: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

– Correctness semantics

• Memory consistency

– MPI runtime must ensure consistency of window

• Three different types of ordering

• Concurrency: by default, the MPI runtime behaves "as if" AMs are executed in sequential order; the user can release concurrency by setting an MPI assert

– Streaming AMs

• Defines a "segment": the minimum number of elements for one AM execution

• Achieves a pipeline effect and reduces the temporary buffer requirement

Generalized and MPI-Interoperable AM

[ICPADS 2013] X. Zhao, P. Balaji, W. Gropp, R. Thakur, "MPI-Interoperable and Generalized Active Messages", in Proceedings of ICPADS 2013

– Explicit and implicit buffer management

• System buffers: eager protocol, not always enough

• User buffers: rendezvous protocol, guarantees correct execution

[Figure: MPI-AM workflow: AM input data moves from the origin input buffer into the target input buffer in the RMA window; the AM handler executes against the target's private memory and a target persistent buffer, and AM output data returns through the target and origin output buffers. Handler execution is bracketed by memory barriers; the SEPARATE window model additionally flushes cache lines back, while the UNIFIED window model does not.]

MPI-AM: an MPI-interoperable framework that can dynamically manage data movement and user-defined remote computation.
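Since MPI-AM is a research extension rather than part of the MPI standard, the sketch below uses a hypothetical MPIX_Am prototype and a made-up handler signature purely to illustrate the shape of the interface described above (input data, output data, a target-side handler, and a streaming segment size); it is not the actual API from the papers.

```c
#include <mpi.h>

/* Hypothetical prototype for a generalized AM operation on an RMA window. */
int MPIX_Am(const void *in, int in_count, MPI_Datatype in_type,
            void *out, int out_count, MPI_Datatype out_type,
            int target, MPI_Aint target_disp,
            void (*handler)(const void *, int, void *, int),
            int segment, MPI_Win win);

/* Hypothetical user-defined handler: runs on the target, consumes 'in',
 * produces 'out' (e.g., search an incoming DNA fragment against the local
 * portion of the consensus sequence). */
void search_handler(const void *in, int in_count, void *out, int out_count)
{
    /* user-defined remote computation goes here */
}

void issue_am(MPI_Win win, int target,
              const char *query, int qlen, char *result, int rlen)
{
    /* Stream the AM in segments of 1024 elements so the target can pipeline
     * execution and bound its temporary buffering.  Assumes a passive-target
     * epoch (e.g., MPI_Win_lock_all) is already open on 'win'. */
    MPIX_Am(query, qlen, MPI_CHAR,      /* AM input data       */
            result, rlen, MPI_CHAR,     /* AM output data      */
            target, 0,                  /* target displacement */
            search_handler,
            1024,                       /* streaming segment   */
            win);
    MPI_Win_flush(target, win);         /* complete the operation */
}
```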


Mardi Gras Conference (02/28/2014)

Page 83: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

Asynchronous and MPI-Interoperable AM

– Supporting asynchronous AMs internally, from within the MPI library

• Inter-node messages: spawn a thread in network module

– Block waiting for AM

– Separate sockets for AM and other MPI messages

• Intra-node messages: “origin computation”

– Processes on the same node allocate window on a shared-memory region

– Origin process directly fetches data from target process’s memory, completes computation locally and writes data back to target process’s memory
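The intra-node path above can be approximated with standard MPI-3 shared-memory windows; the sketch below is a simplification, not the actual MPICH internals. It shows an origin rank mapping a target rank's window segment, running the computation locally, and writing the result back. node_comm is assumed to come from MPI_Comm_split_type with MPI_COMM_TYPE_SHARED, and the placeholder update stands in for the AM handler.

```c
#include <mpi.h>

void origin_computation(MPI_Comm node_comm, int target_rank, int my_rank)
{
    MPI_Win win;
    double *mybase, *target_base;
    MPI_Aint mysize = 1024 * sizeof(double), tsize;
    int disp_unit;

    /* every rank on the node contributes a segment to one shared window */
    MPI_Win_allocate_shared(mysize, sizeof(double), MPI_INFO_NULL,
                            node_comm, &mybase, &win);

    /* the origin obtains a direct load/store pointer to the target's segment */
    MPI_Win_shared_query(win, target_rank, &tsize, &disp_unit, &target_base);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    if (my_rank != target_rank) {
        /* run the "AM handler" locally on the target's data; a real
           implementation would also synchronize with the target */
        for (MPI_Aint i = 0; i < (MPI_Aint)(tsize / sizeof(double)); i++)
            target_base[i] += 1.0;      /* placeholder computation */
    }
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
}
```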

[Figure: design of asynchronous AMs: a target rank and origin ranks spread across NODE 0 and NODE 1, communicating through shared memory within a node and over the network between nodes.]

[CCGrid 2013] X. Zhao, D. Buntinas, J. Zounmevo, J. Dinan, D. Goodell, P. Balaji, R. Thakur, A. Afsahi, W. Gropp, "Towards Asynchronous and MPI-Interoperable Active Messages", in Proceedings of CCGrid 2013

Graph500 results, strong scaling (gathered on the Fusion cluster at ANL, 320 nodes, 2560 cores, QDR InfiniBand)

[Figure: traversed edges per second, TEPS (x1000), vs. number of processes (128, 256, 512) for vertex counts of 2^15 and 2^20, comparing Default-g500, DDT-g500, and AM-g500.]

Mardi Gras Conference (02/28/2014)

Page 84: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

Tasking Models based on MPI AM

Arbitrary graphs can be created where tasks have dependencies on computation and communication

Spawn off an AM when some collection of MPI requests is complete

– Each MPI request can correspond to communication operations or other AMs
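As a concrete, if simplified, illustration of triggering work on request completion, the sketch below runs a task callback once a set of MPI requests has finished; a full AM-based tasking layer would instead ship the task to a remote handler. The task_fn type and progress_task function are assumptions for illustration, not an existing API.

```c
#include <mpi.h>

typedef void (*task_fn)(void *arg);

/* Poll the requests; when all have completed, run the dependent task.
 * Returns 1 once the task has fired, 0 otherwise. */
int progress_task(MPI_Request *reqs, int nreqs, task_fn task, void *arg)
{
    int done = 0;
    MPI_Testall(nreqs, reqs, &done, MPI_STATUSES_IGNORE);
    if (done)
        task(arg);          /* all dependencies satisfied: spawn the "AM" */
    return done;
}
```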

Mardi Gras Conference (02/28/2014)

Page 85: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

Data Movement for Heterogeneous Memory

DMEM: exploring efficient techniques for data movement on heterogeneous memory architectures

Research areas:

• Programming constructs

• Performance and power/energy

• Transparency through introspection tools

• Evaluation through an integrated test suite and various DOE applications

Relevance: the miniMD default dataset: 7G data accesses and 54 GB of data accessed in a 6 sec execution on 1 core

[Figure: Hierarchical Memory View (Core, Cache, Main Memory, NVRAM, Disk) versus Heterogeneous Memory as First-class Citizens (Core, Scratchpad Memory, Main Memory, NVRAM, Less Reliable Memory, Accelerator Memory, Compute-capable Memory).]

[Figure: miniMD memory behavior: accesses (x10^6) and amount (x10^9) for loads, stores, loaded, and stored, plotted against CPU instructions executed (x10^9).]

Mardi Gras Conference (02/28/2014)

Page 86: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

DMEM Inside the MPICH MPI implementation

– Enables GPU pointers in MPI calls

• (CUDA & OpenCL); generic support for heterogeneous memory subsystems is coming

– Coding productivity + performance

– Generic approach, feature independent (UVA not needed)

– Datatype and communicator attrs

• Pointer location

• Streams / queues

MPI-ACC: an integrated and extensible approach to data movement in accelerator-based systems
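A minimal sketch of the usage model that "GPU pointers in MPI calls" enables, shown with CUDA: the device buffer is handed directly to MPI_Send/MPI_Recv and the library performs the staging or DMA internally. It assumes a GPU-integrated MPI implementation; the datatype/communicator attributes that MPI-ACC uses to convey pointer location and streams are not shown here.

```c
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_device_buffer(int rank, int n)
{
    double *dbuf;
    cudaMalloc((void **)&dbuf, n * sizeof(double));

    /* the device pointer goes straight into MPI; a plain MPI build would
       instead require an explicit cudaMemcpy to a host buffer on each side */
    if (rank == 0)
        MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(dbuf);
}
```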

[HPDC 2013] A. M. Aji, L. S. Panwar, F. Ji, M. Chabbi, K. Murthy, P. Balaji, K. R. Bisset, J. S. Dinan, W. Feng, J. Mellor-Crummey, X. Ma, R. S. Thakur, "On the Efficacy of GPU-Integrated MPI for Scientific Applications", in HPDC 2013

[AsHES 2013] A. M. Aji, P. Balaji, J. S. Dinan, W. Feng, R. S. Thakur, "Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming", in AsHES 2013

[Cluster 2012] J. Jenkins, J. S. Dinan, P. Balaji, N. F. Samatova, R. S. Thakur, "Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments", in Cluster 2012

[HPCC 2012] A. M. Aji, J. S. Dinan, D. T. Buntinas, P. Balaji, W. Feng, K. R. Bisset, R. S. Thakur, "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems", in HPCC 2012

[HPCC 2012] F. Ji, A. M. Aji, J. S. Dinan, D. T. Buntinas, P. Balaji, R. S. Thakur, W. Feng, X. Ma, "DMA-Assisted, Intranode Communication in GPU-Accelerated Systems", in HPCC 2012

[IPDPSW 2012] F. Ji, A. M. Aji, J. S. Dinan, D. T. Buntinas, P. Balaji, W. Feng, X. Ma, "Efficient Intranode Communication in GPU-Accelerated Systems", in IPDPSW 2012

Evaluating Epidemiology Simulation with MPI-ACC

Mardi Gras Conference (02/28/2014)

Page 87: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

Understanding the Relevance of Heterogeneous Memory

[Figure: a CPU with cache and scratchpad attached to multiple memory technologies: NVRAM, DDR, LP-DDR, DDR ECC, DDR 2-ECC, and vector memory.]

Per-memory-object data access pattern analysis

– Statically allocated

– Dynamically allocated

Strategy:

– HW emulator

• Instruction execution

• Per-object differentiation

– Cache simulator

– Memory simulator

Goals:

– To estimate the number of stall cycles per object

– To propose the most suitable memory for them

Mardi Gras Conference (02/28/2014)

Page 88: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

Access Pattern Overview (w/o Cache Effects) – MiniApps

[Figure: per-object access pattern overview for MiniMD, MCCK, CIAN Coupling, and Nekbone.]

Mardi Gras Conference (02/28/2014)

Page 89: MPI for Exascale Systems (How I Learned to Stop Worrying ... · MPI in the Exascale Era Under a lot of scrutiny (good!) –Lots of myths floating around (bad!) Push to get new programming

Pavan Balaji, Argonne National Laboratory

Take Away

MPI has a lot to offer for Exascale systems

– MPI-3 and MPI-4 incorporate some of the research ideas

– MPI implementations moving ahead with newer ideas for Exascale

– Several optimizations inside implementations, and new functionality

The work is not done; there is still a long way to go

– But a start-from-scratch approach is neither practical nor necessary

– Invest in orthogonal technologies that work with MPI (MPI+X)

I don’t know what tomorrow’s scientific computing language will look like, but I know it will be called Fortran

I don’t know what tomorrow’s social networking site will look like, but I know it will be called Facebook

I don’t know what tomorrow’s parallel programming model will look like, but I know it will be called MPI (+X)

Mardi Gras Conference (02/28/2014)