Basic Charm++ and Load Balancing
![Page 1: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/1.jpg)
Basic Charm++ and Load Balancing
Gengbin Zheng
charm.cs.uiuc.edu
10/11/2005
![Page 2: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/2.jpg)
Charm++ Basics
![Page 3: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/3.jpg)
Charm++
• Parallel library for object-oriented C++ applications
• Invoke functions remotely
• Messaging via remote method calls (like CORBA)
  • Communication “proxy” objects
• Methods called by scheduler
  • System determines who runs next
• Multiple objects per processor
• Object migration fully supported
  • Even with broadcasts, reductions
![Page 4: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/4.jpg)
Virtualized Programming Model
User View
System implementation
User writes code in terms of communicating objects
System maps objects to processors
![Page 5: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/5.jpg)
Chares – Concurrent Objects
Can be dynamically created on any available processor
Can be accessed from remote processors
Send messages to each other asynchronously
Contain “entry methods”
![Page 6: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/6.jpg)
Charm++ Features: Object Arrays
Applications are written as a set of communicating objects
User’s view: A[0] A[1] A[2] A[3] A[n]
![Page 7: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/7.jpg)
Charm++ Features: Object Arrays
Charm++ maps those objects onto processors, routing messages as needed
User’s view: A[0] A[1] A[2] A[3] A[n]
System view: A[3] A[0]
![Page 8: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/8.jpg)
Charm++ Features: Object Arrays
Charm++ can re-map (migrate) objects for communication, load balance, fault tolerance, etc.
User’s view: A[0] A[1] A[2] A[3] A[n]
System view: A[3] A[0]
![Page 9: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/9.jpg)
Charm++ Array Definition
Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
In a .C file:
    class foo : public CBase_foo {
    public:
      // Remote calls
      foo(int problemNo) { ... }
      void bar(int x) { ... }
      // Migration support:
      foo(CkMigrateMessage *m) {}
      void pup(PUP::er &p) { ... }
    };
![Page 10: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/10.jpg)
Charm++ Remote Method Calls
To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo generated from the interface file:
Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
In a .C file:
    CProxy_foo someFoo = ...;   // CProxy_foo is the generated class
    someFoo[i].bar(17);         // i'th object, then method and parameters
This results in a network message, and eventually in a call to the real object’s method:
In another .C file:
    void foo::bar(int x) {
      ...
    }
![Page 11: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/11.jpg)
Charm++ Startup Process: Main
Interface (.ci) file:
    module myModule {
      array [1D] foo {
        entry foo(int problemNo);
        entry void bar(int x);
      };
      mainchare myMain {   // special startup object
        entry myMain(int argc, char **argv);
      };
    };
In a .C file:
    #include "myModule.decl.h"
    class myMain : public CBase_myMain {   // CBase_myMain is the generated class
    public:
      // Called at startup on PE 0
      myMain(int argc, char **argv) {
        int nElements = 7, i = nElements / 2;
        CProxy_foo f = CProxy_foo::ckNew(2, nElements);
        f[i].bar(3);
      }
    };
    #include "myModule.def.h"
![Page 12: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/12.jpg)
“Hello World!”
.ci file:
    mainmodule hello {
      mainchare mymain {
        entry mymain(CkArgMsg *m);
      };
    };
Generates hello.decl.h and hello.def.h
.C file:
    #include "hello.decl.h"
    class mymain : public CBase_mymain {
    public:
      mymain(CkArgMsg *m) {
        ckout << "Hello World" << endl;
        CkExit();
      }
    };
    #include "hello.def.h"
![Page 13: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/13.jpg)
Compile and run the program
Compiling
• charmc <options> <source file>
• -o, -g, -language, -module, -tracemode
    pgm: pgm.ci pgm.h pgm.C
        charmc pgm.ci
        charmc pgm.C
        charmc -o pgm pgm.o -language charm++
To run a Charm++ program named "pgm" on four processors, type:
    charmrun pgm +p4 <params>
Nodelist file (for network architecture)
• list of machines to run the program
• host <hostname> <qualifiers>
Example nodelist file:
    group main ++shell ssh
    host Host1
    host Host2
![Page 14: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/14.jpg)
Charm++: Portability
Runs on:
• Any machine with MPI, including IBM SP, Blue Gene/L, Cray XT3, Origin2000
• PSC’s Lemieux (Quadrics Elan)
• Clusters with Ethernet (UDP/TCP)
• Clusters with Myrinet (GM)
• Clusters with Ammasso cards
• Apple clusters
• Even Windows!
SMP-aware (pthreads)
![Page 15: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/15.jpg)
Build Charm++
• Download from website: http://charm.cs.uiuc.edu/download.html
• Build Charm++: ./build <target> <version> <options> [compile flags]
  • ./build charm++ net-linux gm -g
  • Parallel make (-j2)
• Compile code using charmc
  • Portable compiler wrapper
  • Link with “-language charm++”
• Run code using charmrun
![Page 16: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/16.jpg)
How Charmrun Works?
charmrun +p4 ./pgm
[Figure: Charmrun and the launched processes, with arrows labeled ssh, connect, and Acknowledge]
![Page 17: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/17.jpg)
Charmrun (batch mode)
charmrun ++batch 8
[Figure: Charmrun and the launched processes, with arrows labeled ssh, connect, and Acknowledge]
![Page 18: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/18.jpg)
Debugging Charm++ Applications
• printf
• gdb
  • Sequentially (standalone mode)
    • gdb ./pgm +vp16
  • Run debugger in xterm
    • charmrun +p4 pgm ++debug
    • charmrun +p4 pgm ++debug-no-pause
• Memory paranoid
• Parallel debugger
![Page 19: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/19.jpg)
Charm++ Features
![Page 20: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/20.jpg)
Message Driven Execution
[Figure: two processors, each with its own scheduler and message queue]
Virtualization leads to Message Driven Execution
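The scheduler-and-queue picture above can be modeled in a few lines of plain C++. This is only a sketch of the idea, not the Charm++ runtime: `Scheduler`, `Pinger`, and `runPingPong` are hypothetical names, and a single in-process queue stands in for one processor's message queue. Each "message" is a pending method invocation, and an object runs only when the scheduler delivers a message to it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for one processor's scheduler; not the Charm++ API.
class Scheduler {
public:
  // Enqueue a pending method invocation (a "message").
  void send(std::function<void()> msg) { queue_.push_back(std::move(msg)); }

  // Deliver messages one at a time until the queue drains.
  // Returns the number of messages delivered.
  std::size_t run() {
    std::size_t delivered = 0;
    while (!queue_.empty()) {
      std::function<void()> msg = std::move(queue_.front());
      queue_.pop_front();
      msg();  // the invoked method may enqueue further messages
      ++delivered;
    }
    return delivered;
  }

private:
  std::deque<std::function<void()>> queue_;
};

// Two objects sharing a scheduler pass a hop counter back and forth.
struct Pinger {
  explicit Pinger(Scheduler* s) : sched(s) {}

  void ping(int hops) {
    seen.push_back(hops);
    if (hops > 0) {
      Pinger* target = peer;
      // Asynchronous call: nothing runs until the scheduler picks it up.
      sched->send([target, hops] { target->ping(hops - 1); });
    }
  }

  Scheduler* sched;
  Pinger* peer = nullptr;
  std::vector<int> seen;
};

// Runs the exchange and returns how many messages were delivered.
inline std::size_t runPingPong(int hops) {
  Scheduler sched;
  Pinger a(&sched), b(&sched);
  a.peer = &b;
  b.peer = &a;
  sched.send([&a, hops] { a.ping(hops); });
  return sched.run();
}
```

Calling `runPingPong(3)` delivers the initial message plus one message per hop. No object ever blocks waiting for a reply, which is the property the slide attributes to message-driven execution.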
![Page 21: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/21.jpg)
Prioritized Messages
• Number of priority bits passed during message allocation
    FooMsg *msg = new (size, nbits) FooMsg;
• Priorities stored at the end of messages
• Signed integer priorities
    *CkPriorityPtr(msg) = -1;
    CkSetQueueing(msg, CK_QUEUEING_IFIFO);
• Unsigned bitvector priorities
    CkPriorityPtr(msg)[0] = 0x7fffffff;
    CkSetQueueing(msg, CK_QUEUEING_BFIFO);
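To illustrate what an integer-priority FIFO queue does, here is a small standalone sketch. It assumes that smaller integer values are served first and that messages of equal priority leave in arrival order. `PriorityMsgQueue`, `PendingMsg`, and `drainExample` are hypothetical names for illustration and are not part of Charm++.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

struct PendingMsg {
  int priority;       // smaller value = served earlier (assumption stated above)
  std::uint64_t seq;  // arrival order, used to break ties first-in first-out
  std::string label;
};

struct ServedLater {
  // std::priority_queue pops the "largest" element, so this returns true
  // when a should leave the queue after b.
  bool operator()(const PendingMsg& a, const PendingMsg& b) const {
    if (a.priority != b.priority) return a.priority > b.priority;
    return a.seq > b.seq;
  }
};

class PriorityMsgQueue {
public:
  void push(int priority, const std::string& label) {
    q_.push(PendingMsg{priority, next_++, label});
  }
  bool empty() const { return q_.empty(); }
  std::string pop() {
    std::string label = q_.top().label;
    q_.pop();
    return label;
  }

private:
  std::priority_queue<PendingMsg, std::vector<PendingMsg>, ServedLater> q_;
  std::uint64_t next_ = 0;
};

// Enqueue four messages out of priority order and return the service order.
inline std::vector<std::string> drainExample() {
  PriorityMsgQueue q;
  q.push(0, "normal-1");
  q.push(-1, "urgent");
  q.push(0, "normal-2");
  q.push(5, "background");
  std::vector<std::string> order;
  while (!q.empty()) order.push_back(q.pop());
  return order;
}
```

`drainExample()` yields urgent, normal-1, normal-2, background: the priority -1 message overtakes earlier arrivals, while the two priority-0 messages keep their relative order.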
![Page 22: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/22.jpg)
Advanced Message Features
• Expedited messages
  • Messages do not go through the Charm++ scheduler (faster)
  • Top priority messages
• Immediate messages
  • Entries are executed in an interrupt or the communication thread
  • Very fast, but tough to get right
![Page 23: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/23.jpg)
Object Migration
![Page 24: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/24.jpg)
How to Migrate a Virtual Processor?
Move all application state to new processor
• Stack data (threads)
  • Subroutine variables and calls
  • Managed by compiler
• Heap data
  • Allocated with malloc/free
  • Managed by user
• Global variables
• Open files, environment variables, etc. (not handled yet!)
![Page 25: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/25.jpg)
Migration Solutions
• Stack data (threads)
  • Automatic: isomalloc stacks
• Heap data
  • Use “-memory isomalloc”, or
  • Write pup routines
• Global variables
  • Use “-swapglobals”
    • Works on ELF platforms (Linux and Sun)
    • Just a pointer swap, no data copying
  • Or remove globals entirely
![Page 26: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/26.jpg)
Migrate Heap Data: PUP
Packing/unpacking user allocated data
• Basic contract: here is my data
  • Sizing: counts up data size
  • Packing: copies data into message
  • Unpacking: copies data back out
• Same call works for network, memory, disk I/O ...
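The "one routine, three passes" contract can be sketched in standalone C++. This is a miniature written for illustration and is not `pup.h`: `MiniPupper`, `Mesh`, and `migrate` are hypothetical names, and the sketch only handles trivially copyable element types. The same `pup` routine is run once to count bytes, once to copy data into a buffer, and once to copy it back out into a fresh object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical miniature of the sizing/packing/unpacking contract; not pup.h.
class MiniPupper {
public:
  enum Mode { Sizing, Packing, Unpacking };

  explicit MiniPupper(Mode mode, std::vector<char>* buf = nullptr)
      : mode_(mode), buf_(buf) {}

  bool isUnpacking() const { return mode_ == Unpacking; }
  std::size_t size() const { return offset_; }  // bytes visited so far

  // Visit n raw bytes at p: count them, copy them out, or copy them back in.
  void bytes(void* p, std::size_t n) {
    if (mode_ == Packing) {
      const char* src = static_cast<const char*>(p);
      buf_->insert(buf_->end(), src, src + n);
    } else if (mode_ == Unpacking) {
      std::memcpy(p, buf_->data() + offset_, n);
    }
    offset_ += n;
  }

  // Only valid for trivially copyable T (ints, floats, plain structs).
  template <class T>
  void value(T& v) { bytes(&v, sizeof(T)); }

  // A vector travels as its length followed by its elements.
  template <class T>
  void vec(std::vector<T>& v) {
    std::size_t n = v.size();
    value(n);                        // when unpacking, n is read from the buffer
    if (isUnpacking()) v.resize(n);  // allocate before copying elements in
    if (n > 0) bytes(v.data(), n * sizeof(T));
  }

private:
  Mode mode_;
  std::vector<char>* buf_;
  std::size_t offset_ = 0;
};

struct Mesh {
  std::vector<float> nodes;
  std::vector<int> elts;
  // One routine describes the data for all three passes.
  void pup(MiniPupper& p) {
    p.vec(nodes);
    p.vec(elts);
  }
};

// Simulate a migration: size, pack on the "old" side, unpack on the "new" side.
inline Mesh migrate(Mesh& src) {
  MiniPupper sizer(MiniPupper::Sizing);
  src.pup(sizer);

  std::vector<char> buffer;
  buffer.reserve(sizer.size());
  MiniPupper packer(MiniPupper::Packing, &buffer);
  src.pup(packer);

  Mesh dst;
  MiniPupper unpacker(MiniPupper::Unpacking, &buffer);
  dst.pup(unpacker);
  return dst;
}
```

Because the object describes its data only once, adding a field to `Mesh` means adding one line to `pup`, and all three passes stay in agreement.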
![Page 27: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/27.jpg)
Migrate Heap Data: PUP C++ Example
    #include "pup.h"
    #include "pup_stl.h"

    class myMesh {
      std::vector<float> nodes;
      std::vector<int> elts;
    public:
      ...
      void pup(PUP::er &p) {
        p|nodes;
        p|elts;
      }
    };
![Page 28: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/28.jpg)
Migrate Heap Data: PUP F90 Example
    TYPE myMesh
      INTEGER :: nn, ne
      REAL*4, ALLOCATABLE :: nodes(:)
      INTEGER, ALLOCATABLE :: elts(:)
    END TYPE

    SUBROUTINE pupMesh(p, mesh)
      USE MODULE ...
      INTEGER :: p
      TYPE(myMesh) :: mesh
      CALL fpup_int(p, mesh%nn)
      CALL fpup_int(p, mesh%ne)
      IF (fpup_isUnpacking(p)) THEN
        ALLOCATE(mesh%nodes(mesh%nn))
        ALLOCATE(mesh%elts(mesh%ne))
      END IF
      CALL fpup_floats(p, mesh%nodes, mesh%nn)
      CALL fpup_ints(p, mesh%elts, mesh%ne)
      IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
    END SUBROUTINE
![Page 29: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/29.jpg)
Automatic Load Balancing
![Page 30: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/30.jpg)
Motivation
• Irregular or dynamic applications
  • Initial static load balancing
  • Application behaviors change dynamically
  • Difficult to implement with good parallel efficiency
• Versatile, automatic load balancers
  • Application independent
  • Little or no user effort is needed for load balancing
  • Work for both Charm++ and Adaptive MPI
![Page 31: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/31.jpg)
Using Dynamic Mapping to Processors
• Migrate objects between processors
• Use that for dynamic (and static, initial) load balancing
• Two major approaches
  • No predictability of load patterns
    • Fully dynamic
    • Early work on State Space Search, Branch&Bound, ..
  • With certain predictability
    • Measurement-based load balancing strategy
    • CSE, molecular dynamics simulation
![Page 32: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/32.jpg)
Applications that lack predictability
• Flow of tasks: the application generates a continuous flow of tasks
• The goal of the load balancing strategies is to balance these tasks across the system for fast response time and better throughput
• Tasks are assigned at creation time, with no migration afterwards
![Page 33: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/33.jpg)
Seed Load Balancing
Neighborhood averaging with work-stealing when idle, using immediate messages
• Load balancing among neighboring processors
  • Load is represented by length of queue
• Work-stealing at idle time with interruption-based messages
  • Fast response to the request
[Figure: 80000 objects, 10% heavy objects]
![Page 34: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/34.jpg)
Link with a seed load balancer
• Use -balance <random|neighbor>
    charmc -o pgm pgm.o -balance neighbor
• Specify topology
    +LBTopo <ring|torus2d|…>
![Page 35: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/35.jpg)
Principle of Persistence
• Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
• In spite of dynamic behavior
  • Abrupt and large, but infrequent changes (e.g., AMR)
  • Slow and small changes (e.g., particle migration)
• Parallel analog of principle of locality
  • A heuristic that holds for most CSE applications
• Run-time instrumentation is possible
![Page 36: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/36.jpg)
Measurement Based Load Balancing
• Runtime instrumentation
  • Measures CPU load per object
  • Measures communication volume between objects
• Measurement based load balancers
  • Use the instrumented database periodically to make new decisions
  • A load balancing strategy takes the database as input and generates a new object-to-processor mapping
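As a concrete picture of "database in, mapping out", here is a minimal greedy strategy in standalone C++. It is a simplified sketch that ignores communication volume and is not the source of any Charm++ balancer: `greedyMap` and `peLoads` are hypothetical names, and `objLoad` stands in for the measured per-object CPU loads. Objects are placed heaviest first, each onto the currently least-loaded processor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// objLoad[i] is the measured CPU load of object i (hypothetical database).
// Returns mapping[i] = processor chosen for object i, or -1 if numPes <= 0.
inline std::vector<int> greedyMap(const std::vector<double>& objLoad, int numPes) {
  std::vector<int> mapping(objLoad.size(), -1);
  if (numPes <= 0) return mapping;

  // Visit objects heaviest first; stable sort keeps ties in index order.
  std::vector<std::size_t> order(objLoad.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return objLoad[a] > objLoad[b]; });

  // Min-heap of (current load, processor id): top() is the least-loaded PE.
  using Entry = std::pair<double, int>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
  for (int p = 0; p < numPes; ++p) pes.push(Entry(0.0, p));

  for (std::size_t obj : order) {
    Entry least = pes.top();
    pes.pop();
    mapping[obj] = least.second;
    least.first += objLoad[obj];
    pes.push(least);
  }
  return mapping;
}

// Sum the object loads placed on each processor under a given mapping.
inline std::vector<double> peLoads(const std::vector<double>& objLoad,
                                   const std::vector<int>& mapping, int numPes) {
  std::vector<double> load(numPes, 0.0);
  for (std::size_t i = 0; i < objLoad.size(); ++i) load[mapping[i]] += objLoad[i];
  return load;
}
```

For loads {5, 3, 3, 2, 2, 1} on two processors this produces a load of 8 on each. A greedy pass like this rebuilds the whole mapping from scratch, so it may move many objects even when the previous mapping was nearly balanced.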
![Page 37: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/37.jpg)
Load Balancing – graph partitioning
[Figure: mapping of objects onto Charm++ PEs, alongside the LB view: the weighted object graph as seen by the load balancer]
![Page 38: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/38.jpg)
Charm++ Load Balancer in Action
Automatic Load Balancing in Crack Propagation
![Page 39: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/39.jpg)
Load Balancer Categories
Centralized
• Object load data are sent to processor 0
• Integrate to a complete object graph
• Migration decision is broadcast from processor 0
• Global barrier
Distributed
• Load balancing among neighboring processors
• Build partial object graph
• Migration decision is sent to its neighbors
• No global barrier
![Page 40: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/40.jpg)
Main Centralized Load Balancing Strategies
• GreedyCommLB: a “greedy” load balancing strategy which uses the process load and communications graph to map the processes with the highest load onto the processors with the lowest load, while trying to keep communicating processes on the same processor
• RefineLB: incremental adjustment by moving objects off overloaded processors to under-utilized processors to reach average load
• MetisLB: uses the METIS graph partitioning library to partition the object-communication graph, with node (object) weights and communication loads on edges
• OrbLB: treats objects with spatial coordinates. It applies an orthogonal recursive bisection algorithm which attempts to provide a more balanced division of space.
• Others: the manual discusses several other load balancers which are not used as often, but may be useful in some cases; also, more are being developed
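The refinement idea (move objects off overloaded processors until every processor is near the average) can be sketched as follows. This is an illustrative simplification and not the RefineLB source: `refineMap` and the 5% `tolerance` default are assumptions made for the example. Unlike a from-scratch greedy pass, it starts from the existing mapping and returns how many objects it moved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjusts mapping in place and returns the number of migrations.
// Assumes every mapping[i] is already a valid processor id in [0, numPes).
inline int refineMap(const std::vector<double>& objLoad, std::vector<int>& mapping,
                     int numPes, double tolerance = 1.05) {
  if (numPes <= 0) return 0;

  std::vector<double> load(numPes, 0.0);
  double total = 0.0;
  for (std::size_t i = 0; i < objLoad.size(); ++i) {
    load[mapping[i]] += objLoad[i];
    total += objLoad[i];
  }
  // A processor counts as overloaded above tolerance * average load.
  const double threshold = tolerance * total / numPes;

  int migrations = 0;
  for (;;) {
    int heavy = 0, light = 0;
    for (int p = 1; p < numPes; ++p) {
      if (load[p] > load[heavy]) heavy = p;
      if (load[p] < load[light]) light = p;
    }
    if (load[heavy] <= threshold) break;  // nobody is overloaded any more

    // Largest object on the heaviest PE that fits on the lightest PE
    // without pushing that PE over the threshold.
    int best = -1;
    for (std::size_t i = 0; i < objLoad.size(); ++i) {
      if (mapping[i] != heavy || objLoad[i] <= 0.0) continue;
      if (load[light] + objLoad[i] > threshold) continue;
      if (best < 0 || objLoad[i] > objLoad[best]) best = static_cast<int>(i);
    }
    if (best < 0) break;  // no move helps; accept the remaining imbalance

    mapping[best] = light;
    load[heavy] -= objLoad[best];
    load[light] += objLoad[best];
    ++migrations;
  }
  return migrations;
}
```

With eight unit-load objects split 6 and 2 across two processors, two objects move and the loads end at 4 and 4. Every accepted move takes an object from a processor above the threshold to one that stays at or below it, so the loop cannot cycle.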
![Page 41: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/41.jpg)
Load Balancing Strategies
[Figure: class hierarchy rooted at BaseLB, with CentralLB and NborBaseLB beneath it. Strategies shown: OrbLB, DummyLB, MetisLB, RecBisectBfLB, GreedyLB, RandCentLB, RefineLB, GreedyCommLB, RandRefLB, RefineCommLB, GreedyRefLB, NeighborLB]
![Page 42: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/42.jpg)
Neighborhood Load Balancing Strategies
• NeighborLB: each processor tries to average out its load only among its neighbors
• WSLB: a load balancer for timeshared workstation clusters, which can detect load changes on desktops and adjust load without interfering with others' use of the desktop
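Neighborhood averaging can be illustrated with a toy model. The sketch below assumes a ring topology and treats load as a continuous quantity rather than as whole objects, so it shows only how local averaging spreads load and is not how NeighborLB itself is implemented. `averageWithNeighbors` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One balancing step on a ring: every processor ends up with the average of
// its own load and its two neighbors' loads. Each processor's old load is
// split equally three ways, so the total load is conserved.
inline std::vector<double> averageWithNeighbors(const std::vector<double>& load) {
  const std::size_t n = load.size();
  std::vector<double> next(n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    const double left = load[(i + n - 1) % n];
    const double right = load[(i + 1) % n];
    next[i] = (left + load[i] + right) / 3.0;
  }
  return next;
}
```

Starting from {12, 0, 0, 0}, one step gives {4, 4, 0, 4}, and repeated steps approach 3 on every processor. No processor needs global information and there is no global barrier, but load spreads only one hop per step.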
![Page 43: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/43.jpg)
Compiler Interface
Link time options
• -module: link load balancers as modules
  • -module EveryLB
• Link multiple modules into binary
  • -balancer GreedyCommLB -balancer RefineLB
  • -balancer ComboCentLB:GreedyLB,RefineLB
![Page 44: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/44.jpg)
Runtime Options
• Run-time options do the same thing, but override the compile time options
• +balancer: invoke a load balancer
• Can have multiple load balancers
  • +balancer GreedyCommLB +balancer RefineLB
![Page 45: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/45.jpg)
When to Re-balance Load?
Programmer control: ReadyLoadBalance()
• Enable load balancing at a specific point
  • Object ready to migrate
  • Re-balance if needed
• ReadyLoadBalance() is called when your chare is ready to be load balanced; load balancing may not start right away
• ResumeFromSync() is called when load balancing for this chare has finished
Default: load balancer is periodic
• Provide period as a runtime parameter (+LBPeriod)
![Page 46: Basic Charm++ and Load Balancing](https://reader035.vdocuments.net/reader035/viewer/2022062301/56814605550346895db31144/html5/thumbnails/46.jpg)
Thank You!
Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/
Parallel Programming Lab at University of Illinois