Programming with Charm++ and AMPI: Introduction
Laxmikant (Sanjay) Kale & Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Motivation, Concepts and Benefits
Before we plunge into the details of programming, let us examine:
o Why we developed Charm++
o What some of the central ideas are
o What benefits you get by using Charm++, and so, in what contexts it is especially useful
Specifically:
o What object‐based over‐decomposition is
o A bit of a look under the hood: how it is implemented
o How Charm++ and AMPI embody this idea
o Capabilities of the system: what it automates, supports, …
Charm++ and AMPI: Session I, August 5th, 2009
Summarizing the State of the Art
Petascale
o Very powerful parallel computers are being built
o Application domains exist that need that kind of power
New generation of applications
o Use sophisticated algorithms
o Dynamic adaptive refinements
o Multi‐scale, multi‐physics
Challenge:
o Parallel programming is more complex than sequential
o Difficulty in achieving performance that scales to PetaFLOPs and beyond
o Difficulty in getting correct behavior from programs
Guiding Principles Behind Charm++
No magic
o Parallelizing compilers have achieved close to technical perfection, but are not enough
o Sequential programs obscure too much information
Seek an optimal division of labor between the system and the programmer
Design abstractions based solidly on use‐cases
o Application‐oriented yet computer‐science‐centered approach
Charm++ and CSE Applications
Enabling CS technology of parallel objects and intelligent runtime systems has led to several CSE collaborative applications
[Figure: applications developed in synergy with Charm++, including a well‐known biophysics molecular simulation app (Gordon Bell Award, 2002), computational astronomy, and nano‐materials.]
Object‐based Over‐decomposition
Let the programmer decompose computation into objects
Work units, data‐units, composites
Let an intelligent runtime system assign objects to processors
o The RTS can change this assignment (mapping) during execution
Locality of data references is a critical attribute for performance
o A parallel object can access only its own data
o Asynchronous method invocation is used to access other objects’ data
Object‐based Over‐decomposition: Charm++
User view
System implementation
• Multiple “indexed collections” of C++ objects
• Indices can be multi‐dimensional and/or sparse
• Programmer expresses communication between objects
– with no reference to processors
Object‐based Over‐decomposition: AMPI
Each MPI process is implemented as a user‐level thread
Threads are light‐weight and migratable!
o <1 microsecond context‐switch time; potentially >100k threads per core
Each thread is embedded in a Charm++ object (chare)
[Figure: MPI processes implemented as virtual processors (user‐level migratable threads), mapped onto real processors.]
Benefits of the Charm Model: Outline
Software engineering
o Number of virtual processors can be independently controlled
o Separate VPs for different modules
Message‐driven execution
o Adaptive overlap of communication and computation
o Predictability:
  Automatic out‐of‐core execution
  Prefetch to local stores
o Asynchronous reductions
Dynamic mapping
o Heterogeneous clusters: vacate, adjust to speed, share
o Automatic checkpointing
o Change the set of processors used
o Automatic dynamic load balancing
o Communication optimization
o Fault tolerance
Parallel Decomposition and Processors
MPI style encourages
o Decomposition into P pieces, where P is the number of physical processors available
o If your natural decomposition is a cube, then the number of processors must be a cube
o …
Charm++/AMPI style “virtual processors”
o Decompose into natural objects of the application
o Let the runtime map them to processors
o Decouple decomposition from load balancing
Decomposition Independent of Number of Cores
Rocket simulation example under traditional MPI vs. Charm++/AMPI framework
o Benefit: load balance, communication optimizations, modularity
[Figure: under traditional MPI, one Solid and one Fluid piece per processor (1..P); under Charm++/AMPI, n Solid and m Fluid virtual processors are decomposed independently and mapped by the runtime.]
Parallel Composition: A1; (B || C ); A2
Recall: different modules, written in different languages/paradigms, can overlap in time and on processors, without the programmer having to worry about this explicitly
Without message‐driven execution (and virtualization), you get either:
Space‐division
OR: Sequentialization
Parallelization Using Charm++
Bhatele, A., Kumar, S., Mei, C., Phillips, J. C., Zheng, G. & Kale, L. V. 2008 Overcoming Scaling Challenges in Biomolecular Simulations across Multiple Platforms. In Proceedings of IEEE International Parallel and Distributed Processing Symposium, Miami, FL, USA, April 2008.
Performance of NAMD
[Figure: NAMD time per step (ms) vs. number of cores (128–32768) for the STMV (~1 million atoms) and ApoA1 (~92K atoms) benchmarks on Blue Gene/L, Blue Gene/P, and Cray XT4.]
Blue Gene results based on work on the DCMF many‐to‐many pattern by Sameer Kumar, IBM Research
Message‐driven Execution
[Figure: per‐processor scheduler with a message queue.]
Adaptively and automatically overlaps communication and computation
Also, you can predict data access
o You can peek ahead at the queue
o This can be used to scale the memory wall:
  Prefetching of needed data, into scratch‐pad memories for example
  Automatic out‐of‐core execution
Automatic Dynamic Load Balancing
Measurement‐based load balancers
o Principle of persistence: in many CSE applications, computational loads and communication patterns tend to persist, even in dynamic computations
o So, the recent past is a good predictor of the near future
o Charm++ provides a suite of load balancers with periodic measurement and migration of objects
Seed balancers (for task parallelism)
o Useful for divide‐and‐conquer and state‐space‐search applications
o Seeds for Charm++ objects are moved around until they take root
Fault Tolerance
Automatic checkpointing
o Migrate objects to disk
o In‐memory checkpointing as an option
o Automatic fault detection and restart
“Impending fault” response
o Migrate objects to other processors
o Adjust processor‐level parallel data structures
Scalable fault tolerance
o Experimental feature
o When one processor out of 100,000 fails, the other 99,999 shouldn’t have to go back to their checkpoints!
o Sender‐side message logging
o Latency tolerance helps mitigate costs
o Restart can be sped up by spreading out objects from the failed processor
How to Tune Performance for a Future Machine?
For example, Blue Waters will arrive in 2011
o But we need to prepare applications for it starting now
Even for extant machines
o The full‐size machine may not be available as often as needed for tuning runs
Our approach: BigSim
o Full‐scale emulation, leveraging virtualization
o Trace‐driven simulation
The Rest of the Day
Before lunch
o AMPI Hands‐on Exercise
o Basic Charm++
o Charm++ Hands‐on Exercise
After lunch
o Load Balancing
o Hands‐on with Load Balancing
o Intermediate Charm++
o Overview of Advanced Charm++ Concepts
AMPI HANDS‐ON TUTORIAL
Eric J. Bohm
Converting MPI to AMPI
Jacobi 1D Starting Point: version1.cpp
This MPI Program already works as an AMPI program under certain limitations:
o The AMPI virtualization ratio must be 1 (data stored in global variables is not private to each MPI rank)
o AMPI migration is not used (no PUP routines)
Converting MPI to AMPI
We will fix these two problems!
Specifically:
o Non‐constant data stored in global variables must be privatized (one per AMPI thread)
o PUP (serialization) routines must be created to allow AMPI threads to migrate with user data
Fixing the First Problem
The first of these problems is fixed in version2.cpp
Data stored in global variables is privatized by allocating it on the heap and/or stack
version1.cpp:

    // in global scope
    chunk cp;

version2.cpp:

    // called by each AMPI rank in main()
    chunk *cp = new chunk;
Hands On: Enabling Migration
Create a new file called version3.cpp with the same contents as version2.cpp
We will add some code to create a functional AMPI version with automatic load balancing
Hands On: Enabling Migration
Pack‐Unpack (PUP) routines must be created to allow AMPI threads to migrate
PUP routines serialize the user data for an AMPI rank when it migrates
In this program all user data is stored in the heap‐allocated “chunk *cp”
Hands On: Enabling Migration
Step 1: Add this PUP routine
#ifdef AMPI
void chunk_pup(pup_er p, void *d) {
    chunk **cpp = (chunk **) d;
    if (pup_isUnpacking(p))
        *cpp = new chunk;
    chunk *cp = *cpp;
    pup_doubles(p, cp->t, (BLOCKSIZE+2)); // PUP an array of doubles
    pup_int(p, &cp->xp);                  // PUP an integer value
    pup_int(p, &cp->xm);                  // PUP an integer value
    if (pup_isDeleting(p))
        delete cp;
}
#endif
Hands On: Enabling Migration
Step 2: Register the newly created pup routine in main() after allocation of cp
All program state must be saved and recreated via this PUP routine. Normally this includes just program data, but open files, sockets, and other resources might need to be handled.
#ifdef AMPI
    MPI_Register((void*)&cp, (MPI_PupFn) chunk_pup);
#endif
Hands On: Enabling Migration
Step 3: Request load balancing every few iterations
The AMPI migration framework may migrate AMPI ranks at these points in the program
#ifdef AMPI
    if (iter % 10 == 0) MPI_Migrate();
#endif
Hands On: Enabling Migration
Step 4: Enable the desired load balancing module at link time
Turn on load balancing debug output at runtime to verify that everything is working
See the Charm++ manual for more load balancing options:
http://charm.cs.uiuc.edu/manuals/
-balancer GreedyLB
+LBDebug 1
Using Charm++ with Arrays
Laxmikant (Sanjay) Kale & Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
The Charm++ Model
Parallel objects (chares) communicate via asynchronous method invocations (entry methods)
The runtime system maps chares onto processors and schedules execution of entry methods
The Charm++ Model
[Figure: user view (an indexed collection of chares) vs. system view (chares mapped onto processors by the runtime).]
Proxy Objects
Entry methods for each chare are invoked using proxy objects: lightweight handles to potentially remote chares
Limitations of Plain Proxies
In a large program, keeping track of all the proxies is difficult
A simple proxy doesn’t tell you anything about the chare other than its type
Managing collective operations like broadcast and reduce is complicated
Example: Molecular Dynamics
Patches: blocks of space containing particles
Computes: chares which compute the interactions between particles in two patches
Example: Molecular Dynamics
We could write this program using just chares and plain proxies
But, it’ll be much simpler if we use Chare Arrays
Note: you can find complete MD code in examples/charm++/Molecular2D in the Charm distribution
Chare Arrays
Arrays organize chares into indexed collections
Each chare in the array has a proxy for the other array elements, accessible using simple syntax
sampleArray[i] // i’th proxy
Array Dimensions
Anything can be used as array indices
o integers
o tuples
o bit vectors
o user‐defined types
Arrays in MD
array [2D] Patch { ... }   // indexed by the x and y coordinates of the patch

array [4D] Compute { ... } // indexed by the x and y coordinates of both patches involved

note: arrays can be sparse
Managing Mapping
Arrays let the programmer control the mapping of array elements to PEs
o Round‐robin, block‐cyclic, etc.
o User defined mapping
Broadcasts
Simple way to invoke the same entry method on each array element
Syntax:

    // call on one element
    sampleArray[i].method()

    // broadcast to the whole array
    sampleArray.method()
Broadcasts in MD
Used to fire off the first timestep after initialization is complete
// All chares are created; time to
// start the simulation
patchArray.start();
Reductions
No global flow of control, so each chare must contribute data independently using contribute()
A user callback (created using CkCallback) is invoked when the reduction is complete
Reductions
Runtime system builds reduction tree
User specifies reduction operation
At root of tree, a callback is performed on a specified chare
Reduction Example
Sum local error estimators to determine global error
// function to call, and chare to handle the callback
CkCallback cb(CkIndex_Main::computeGlobalError(), mainProxy);

// local data and reduction operation
contribute(sizeof(myError), (void*)&myError, CkReduction::sum, cb);
CHARM++ HANDS‐ON TUTORIAL
Eric J. Bohm
Ring Example
Different chare elements pass a token around
ring.ci
mainmodule ring {
  readonly CProxy_Main mainProxy;
  readonly int numChares;

  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done(void);
  };

  array [1D] Ring {
    entry Ring(void);
    entry void passToken(int token);
  };
};
ring.C (Main Chare)
#include "ring.decl.h"

/*readonly*/ CProxy_Main mainProxy;
/*readonly*/ int numChares;

/*mainchare*/ class Main : public Chare {
public:
  Main(CkArgMsg* m) {
    numChares = 5;
    if (m->argc > 1)
      numChares = atoi(m->argv[1]);
    delete m;

    CkPrintf("Running Ring on %d processors for %d elements\n",
             CkNumPes(), numChares);
    mainProxy = thishandle;

    CProxy_Ring arr = CProxy_Ring::ckNew(numChares);
    arr[0].passToken(11);
  };

  void done(void) {
    CkPrintf("All done\n");
    CkExit();
  };
};
ring.C (1D array)
/*array [1D]*/ class Ring : public CBase_Ring {
public:
  Ring() {
    // CkPrintf("Ring %d created\n", thisIndex);
  }

  Ring(CkMigrateMessage *m) { }

  void passToken(int token) {
    CkPrintf("Token No [%d] received by chare element %d\n",
             token, thisIndex);
    if (thisIndex < numChares-1)
      // Pass the token
      thisProxy[thisIndex+1].passToken(token+1);
    else
      // We've been around once -- we're done.
      mainProxy.done();
  }
};

#include "ring.def.h"
Compiling a Charm++ Program
Steps in compiling a Charm++ program:
Running the program:
charmc ring.ci
charmc -O3 -c ring.C
charmc -O3 -language charm++ -o ring ring.o
./charmrun +p4 ./ring 13
Output of the Ring Program
[bhatele@zen] ring $ ./charmrun +p4 ./ring 13
Charm++: scheduler running in netpoll mode.
Charm++> cpu topology info is being gathered.
Charm++> 1 unique compute nodes detected.
Running Ring on 4 processors for 13 elements
Token No [11] received by chare element 0
Token No [12] received by chare element 1
Token No [13] received by chare element 2
Token No [14] received by chare element 3
Token No [15] received by chare element 4
Token No [16] received by chare element 5
Token No [17] received by chare element 6
Token No [18] received by chare element 7
Token No [19] received by chare element 8
Token No [20] received by chare element 9
Token No [21] received by chare element 10
Token No [22] received by chare element 11
Token No [23] received by chare element 12
All done
[bhatele@zen] ring $
Charm ++ and AMPI: Session I
Task
Change the entry method passToken(…) such that it passes an array of integers around
Every chare prints the entire array