Programming with Charm++ and AMPI: Introduction
Laxmikant (Sanjay) Kale & Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Motivation, Concepts and Benefits
Before we plunge into the details of programming, let us examine:
o Why we developed Charm++
o What some of the central ideas are
o What benefits you get by using Charm++, and so, in what contexts it is especially useful
Specifically:
o What object‐based over‐decomposition is
o A bit of a look under the hood: how it is implemented
o How Charm++ and AMPI embody this idea
o Capabilities of the system: what it automates, supports, …
Charm++ and AMPI: Session I, August 5th, 2009
Summarizing the State of the Art
Petascale
o Very powerful parallel computers are being built
o Application domains exist that need that kind of power
New generation of applications
o Use sophisticated algorithms
o Dynamic adaptive refinements
o Multi‐scale, multi‐physics
Challenge:
o Parallel programming is more complex than sequential
o Difficulty in achieving performance that scales to PetaFLOPs and beyond
o Difficulty in getting correct behavior from programs
Guiding Principles Behind Charm++
No magic
o Parallelizing compilers have achieved close to technical perfection, but are not enough
o Sequential programs obscure too much information
Seek an optimal division of labor between the system and the programmer
Design abstractions based solidly on use‐cases
o Application‐oriented yet computer‐science‐centered approach
Charm++ and CSE Applications
Enabling CS technology of parallel objects and intelligent runtime systems has led to several CSE collaborative applications
[Figure: applications developed in synergy with Charm++, including a well‐known biophysics molecular simulation app (Gordon Bell Award, 2002), computational astronomy, and nano‐materials.]
Object‐based Over‐decomposition
Let the programmer decompose computation into objects
Work units, data‐units, composites
Let an intelligent runtime system assign objects to processors
o The RTS can change this assignment (mapping) during execution
Locality of data references is a critical attribute for performance
o A parallel object can access only its own data
o Asynchronous method invocation is used to access other objects’ data
Object‐based Over‐decomposition: Charm++
User view
System implementation
• Multiple “indexed collections” of C++ objects
• Indices can be multi‐dimensional and/or sparse
• Programmer expresses communication between objects
– with no reference to processors
Object‐based Over‐decomposition: AMPI
Each MPI process is implemented as a user‐level thread
Threads are light‐weight and migratable!
o <1 microsecond context‐switch time; potentially >100k threads per core
Each thread is embedded in a Charm++ object (chare)
[Figure: MPI processes implemented as virtual processors (user‐level migratable threads), mapped onto real processors.]
Benefits of the Charm Model: Outline
Software engineering
o Number of virtual processors can be independently controlled
o Separate VPs for different modules
Message‐driven execution
o Adaptive overlap of communication and computation
o Predictability:
  Automatic out‐of‐core execution
  Prefetch to local stores
o Asynchronous reductions
Dynamic mapping
o Heterogeneous clusters: vacate, adjust to speed, share
o Automatic checkpointing
o Change the set of processors used
o Automatic dynamic load balancing
o Communication optimization
o Fault tolerance
Parallel Decomposition and Processors
MPI style encourages
o Decomposition into P pieces, where P is the number of physical processors available
o If your natural decomposition is a cube, then the number of processors must be a cube
o …
Charm++/AMPI style “virtual processors”
o Decompose into natural objects of the application
o Let the runtime map them to processors
o Decouple decomposition from load balancing
Decomposition Independent of Number of Cores
Rocket simulation example under traditional MPI vs. Charm++/AMPI framework
o Benefit: load balance, communication optimizations, modularity
[Figure: under traditional MPI, one Solid and one Fluid piece per processor (1..P); under Charm++/AMPI, n Solid and m Fluid virtual processors are decomposed independently and mapped by the runtime.]
Parallel Composition: A1; (B || C ); A2
Recall: different modules, written in different languages/paradigms, can overlap in time and on processors, without the programmer having to worry about this explicitly
Without message‐driven execution (and virtualization), you get either:
Space‐division
OR: Sequentialization
Parallelization Using Charm++
Bhatele, A., Kumar, S., Mei, C., Phillips, J. C., Zheng, G. & Kale, L. V. 2008 Overcoming Scaling Challenges in Biomolecular Simulations across Multiple Platforms. In Proceedings of IEEE International Parallel and Distributed Processing Symposium, Miami, FL, USA, April 2008.
Performance of NAMD
[Figure: NAMD time per step (ms) vs. number of cores (128–32768) for the STMV (~1 million atoms) and ApoA1 (~92K atoms) benchmarks on Blue Gene/L, Blue Gene/P, and Cray XT4.]
Blue Gene results based on work on the DCMF many‐to‐many pattern by Sameer Kumar, IBM Research
Message‐driven Execution
[Figure: per‐processor scheduler with a message queue.]
Adaptively and automatically overlaps communication and computation
Also, you can predict data access
o You can peek ahead at the queue
o This can be used to scale the memory wall:
  Prefetching of needed data, into scratch‐pad memories for example
  Automatic out‐of‐core execution
Automatic Dynamic Load Balancing
Measurement‐based load balancers
o Principle of persistence: in many CSE applications, computational loads and communication patterns tend to persist, even in dynamic computations
o So, the recent past is a good predictor of the near future
o Charm++ provides a suite of load balancers with periodic measurement and migration of objects
Seed balancers (for task parallelism)
o Useful for divide‐and‐conquer and state‐space‐search applications
o Seeds for Charm++ objects are moved around until they take root
Fault Tolerance
Automatic checkpointing
o Migrate objects to disk
o In‐memory checkpointing as an option
o Automatic fault detection and restart
“Impending fault” response
o Migrate objects to other processors
o Adjust processor‐level parallel data structures
Scalable fault tolerance
o Experimental feature
o When one processor out of 100,000 fails, the other 99,999 shouldn’t have to go back to their checkpoints!
o Sender‐side message logging
o Latency tolerance helps mitigate costs
o Restart can be sped up by spreading out objects from the failed processor
How to Tune Performance for a Future Machine?
For example, Blue Waters will arrive in 2011
o But we need to prepare applications for it starting now
Even for extant machines
o The full‐size machine may not be available as often as needed for tuning runs
Our approach: BigSim
o Full‐scale emulation, leveraging virtualization
o Trace‐driven simulation
The Rest of the Day
Before lunch
o AMPI Hands‐on Exercise
o Basic Charm++
o Charm++ Hands‐on Exercise
After lunch
o Load Balancing
o Hands‐on with Load Balancing
o Intermediate Charm++
o Overview of Advanced Charm++ Concepts
AMPI HANDS‐ON TUTORIAL
Eric J. Bohm
Converting MPI to AMPI
Jacobi 1D Starting Point: version1.cpp
This MPI Program already works as an AMPI program under certain limitations:
o The AMPI virtualization ratio must be 1 (data stored in global variables is not private to each MPI rank)
o AMPI migration is not used (no PUP routines)
Converting MPI to AMPI
We will fix these two problems!
Specifically:
o Non‐constant data stored in global variables must be privatized (one per AMPI thread)
o PUP (serialization) routines must be created to allow AMPI threads to migrate with user data
Fixing the First Problem
The first of these problems is fixed in version2.cpp
Data stored in global variables is privatized by allocating it on the heap and/or stack
version1.cpp:

    // in global scope
    chunk cp;

version2.cpp:

    // called by each AMPI rank in main()
    chunk *cp = new chunk;
Hands On: Enabling Migration
Create a new file called version3.cpp with the same contents as version2.cpp
We will add some code to create a functional AMPI version with automatic load balancing
Hands On: Enabling Migration
Pack‐Unpack (PUP) routines must be created to allow AMPI threads to migrate
PUP routines serialize the user data for an AMPI rank when it migrates
In this program all user data is stored in the heap‐allocated “chunk *cp”
Hands On: Enabling Migration
Step 1: Add this PUP routine
#ifdef AMPI
void chunk_pup(pup_er p, void *d) {
    chunk **cpp = (chunk **) d;
    if (pup_isUnpacking(p))
        *cpp = new chunk;
    chunk *cp = *cpp;
    pup_doubles(p, cp->t, (BLOCKSIZE+2)); // PUP an array of doubles
    pup_int(p, &cp->xp);                  // PUP an integer value
    pup_int(p, &cp->xm);                  // PUP an integer value
    if (pup_isDeleting(p))
        delete cp;
}
#endif
Hands On: Enabling Migration
Step 2: Register the newly created pup routine in main() after allocation of cp
All program state must be saved and recreated via this PUP routine. Normally this includes just program data, but open files, sockets, and other resources might need to be handled.
#ifdef AMPI
    MPI_Register((void*)&cp, (MPI_PupFn) chunk_pup);
#endif
Hands On: Enabling Migration
Step 3: Request load balancing every few iterations
The AMPI migration framework may migrate AMPI ranks at these points in the program
#ifdef AMPI
    if (iter % 10 == 0) MPI_Migrate();
#endif
Hands On: Enabling Migration
Step 4: Enable the desired load balancing module at link time
Turn on load balancing debug output at runtime to verify that everything is working
See the Charm++ manual for more load balancing options:
http://charm.cs.uiuc.edu/manuals/
-balancer GreedyLB
+LBDebug 1
Using Charm++ with Arrays
Laxmikant (Sanjay) Kale & Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
The Charm++ Model
Parallel objects (chares) communicate via asynchronous method invocations (entry methods)
The runtime system maps chares onto processors and schedules execution of entry methods
The Charm++ Model
[Figure: user view (an indexed collection of chares) vs. system view (chares mapped onto processors by the runtime).]
Proxy Objects
Entry methods for each chare are invoked using proxy objects: lightweight handles to potentially remote chares
Limitations of Plain Proxies
In a large program, keeping track of all the proxies is difficult
A simple proxy doesn’t tell you anything about the chare other than its type
Managing collective operations like broadcast and reduce is complicated
Example: Molecular Dynamics
Patches: blocks of space containing particles
Computes: chares which compute the interactions between particles in two patches
Example: Molecular Dynamics
We could write this program using just chares and plain proxies
But, it’ll be much simpler if we use Chare Arrays
Note: you can find complete MD code in examples/charm++/Molecular2D in the Charm distribution
Chare Arrays
Arrays organize chares into indexed collections
Each chare in the array has a proxy for the other array elements, accessible using simple syntax
sampleArray[i] // i’th proxy
Array Dimensions
Anything can be used as array indices
o integers
o tuples
o bit vectors
o user‐defined types
Arrays in MD
array [2D] Patch { ... }   // indexed by the x and y coordinates of the patch

array [4D] Compute { ... } // indexed by the x and y coordinates of both patches involved

note: arrays can be sparse
Managing Mapping
Arrays let the programmer control the mapping of array elements to PEs
o Round‐robin, block‐cyclic, etc.
o User defined mapping
Broadcasts
Simple way to invoke the same entry method on each array element
Syntax:

    // call on one element
    sampleArray[i].method()

    // broadcast to the whole array
    sampleArray.method()
Broadcasts in MD
Used to fire off the first timestep after initialization is complete
// All chares are created; time to
// start the simulation
patchArray.start();
Reductions
No global flow of control, so each chare must contribute data independently using contribute()
A user callback (created using CkCallback) is invoked when the reduction is complete
Reductions
Runtime system builds reduction tree
User specifies reduction operation
At root of tree, a callback is performed on a specified chare
Reduction Example
Sum local error estimators to determine global error
// function to call, and chare to handle the callback
CkCallback cb(CkIndex_Main::computeGlobalError(), mainProxy);

// local data and reduction operation
contribute(sizeof(myError), (void*)&myError, CkReduction::sum, cb);
CHARM++ HANDS‐ON TUTORIAL
Eric J. Bohm
Ring Example
Different chare elements pass a token around
ring.ci
mainmodule ring {
  readonly CProxy_Main mainProxy;
  readonly int numChares;

  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done(void);
  };

  array [1D] Ring {
    entry Ring(void);
    entry void passToken(int token);
  };
};
ring.C (Main Chare)
#include "ring.decl.h"

/*readonly*/ CProxy_Main mainProxy;
/*readonly*/ int numChares;

/*mainchare*/ class Main : public Chare {
public:
  Main(CkArgMsg* m) {
    numChares = 5;
    if (m->argc > 1)
      numChares = atoi(m->argv[1]);
    delete m;

    CkPrintf("Running Ring on %d processors for %d elements\n",
             CkNumPes(), numChares);
    mainProxy = thishandle;

    CProxy_Ring arr = CProxy_Ring::ckNew(numChares);
    arr[0].passToken(11);
  };

  void done(void) {
    CkPrintf("All done\n");
    CkExit();
  };
};
ring.C (1D array)
/*array [1D]*/ class Ring : public CBase_Ring {
public:
  Ring() {
    // CkPrintf("Ring %d created\n", thisIndex);
  }

  Ring(CkMigrateMessage *m) { }

  void passToken(int token) {
    CkPrintf("Token No [%d] received by chare element %d\n",
             token, thisIndex);
    if (thisIndex < numChares-1)
      // Pass the token
      thisProxy[thisIndex+1].passToken(token+1);
    else
      // We've been around once -- we're done.
      mainProxy.done();
  }
};

#include "ring.def.h"
Compiling a Charm++ Program
Steps in compiling a Charm++ program:
Running the program:
charmc ring.ci
charmc -O3 -c ring.C
charmc -O3 -language charm++ -o ring ring.o
./charmrun +p4 ./ring 13
Output of the Ring Program
[bhatele@zen] ring $ ./charmrun +p4 ./ring 13
Charm++: scheduler running in netpoll mode.
Charm++> cpu topology info is being gathered.
Charm++> 1 unique compute nodes detected.
Running Ring on 4 processors for 13 elements
Token No [11] received by chare element 0
Token No [12] received by chare element 1
Token No [13] received by chare element 2
Token No [14] received by chare element 3
Token No [15] received by chare element 4
Token No [16] received by chare element 5
Token No [17] received by chare element 6
Token No [18] received by chare element 7
Token No [19] received by chare element 8
Token No [20] received by chare element 9
Token No [21] received by chare element 10
Token No [22] received by chare element 11
Token No [23] received by chare element 12
All done
[bhatele@zen] ring $
Charm ++ and AMPI: Session I
Task
Change the entry method passToken(…) such that it passes an array of integers around
Every chare prints the entire array