INRIA Grand-Large
Hybrid Preemptive Scheduling of MPI Applications
Aurélien Bouteiller, Hinde Lilia Bouziane, Thomas Hérault, Pierre Lemarinier, Franck Cappello
MPICH-V team, INRIA Grand-Large
LRI, University Paris South
Problem definition
• Context: clusters and grids (made of clusters) shared by many users (fewer available resources than required at a given time). In this study: finite sets of MPI applications.
• Time sharing of parallel applications is attractive to increase fairness between users, compared to batch scheduling.
• It is very likely that several applications will reside in virtual memory at the same time, exceeding the total physical memory: out-of-core scheduling of parallel applications on clusters! (Scheduling parallel applications on a cluster under memory constraints.)
• Most of the proposed approaches try to avoid this situation (by limiting job admission based on memory requirements, which delays some jobs unpredictably if the job execution times are not known).
Issue: a novel (out-of-core) approach that avoids delaying some jobs?
Constraint: no OS modification (no kernel patch)
Outline
• Introduction (related work)
• A Hybrid approach dedicated to out-of-core
• Evaluation
• Concluding remarks
Related work 1
Scheduling parallel applications on distributed-memory machines: a long history of research, still very active (5 papers in 2004 in the main conferences: IPDPS, Cluster, SC, Grid, Europar)!
• Co-scheduling: all processes of each application are scheduled independently (no coordination). Expected advantage: overlapping communication and computation.
• Gang scheduling (sometimes called "co-scheduling"): all processes of each application are executed simultaneously (coordination). Expected advantage: scheduling communicating processes together.
(Figure: per-processor timelines of applications 1-3 under both strategies, showing time slices, communication and scheduling overhead.)
Related work 2
Comparison between gang and co-scheduling:
Gang scheduling outperforms co-scheduling:
• D. G. Feitelson and L. Rudolph. "Gang Scheduling Performance Benefits for Fine-Grained Synchronization". Journal of Parallel and Distributed Computing, 16(4):306–318, December 1992.
Co-scheduling outperforms gang scheduling:
• Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini and Juan Fernandez, "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources", IPDPS 2003 (gang-schedules only the applications that take advantage of it, after classification).
• Gyu Sang Choi, Jin-Ha Kim, Deniz Ersoz, Andy B. Yoo, Chita R. Das, "Coscheduling in Clusters: Is It a Viable Alternative?", SC2004 (increases the priority of processes during communications).
• Peter Strazdins and John Uhlmann, "Local scheduling outperforms gang scheduling on a Beowulf cluster", Technical report, Department of Computer Science, Australian National University, January 2004; Cluster 2004 (Ethernet, SCore for gang, MatMul and Linpack).
A multiple-parameter problem: the conclusion depends on the assumptions!
Related work 3
Metrics for measuring performance: "Metrics and Benchmarking for Parallel Job Scheduling" [Fe98]
Performance:
• Makespan
• Throughput
• Response time
Fairness: not so much investigated, yet still very important:
• Standard deviation of the response times for a set of homogeneous applications
• The minimum standard deviation is the best fairness
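The performance and fairness metrics above can be sketched in a few lines of Python (an illustrative helper; the function and variable names are ours, not [Fe98]'s):

```python
import statistics

def schedule_metrics(response_times):
    """Metrics for one set of homogeneous jobs launched simultaneously.

    response_times: completion time of each application, measured from
    the common launch instant.
    """
    makespan = max(response_times)                # first launch -> last termination
    throughput = len(response_times) / makespan   # jobs finished per unit of time
    fairness = statistics.pstdev(response_times)  # lower std. dev. = fairer
    return makespan, throughput, fairness

# A perfectly fair schedule: every job has the same response time,
# so the standard deviation (fairness metric) is 0.
print(schedule_metrics([1210, 1210, 1210, 1210]))
```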
Outline
• Introduction (related work)
• A Hybrid approach dedicated to out-of-core
• Evaluation
• Concluding remarks
Our approach 1/2: Hybrid
Principle:
• A given set of parallel applications to schedule
• Application subsets: sets of applications fitting in memory
• Co-scheduling of the applications within a subset
• Gang scheduling of the subsets of applications
Example: 1 set of 6 applications with 2 subsets of 3 applications each; in-core co-scheduling inside each subset, out-of-core gang scheduling between subsets ("subset context" switch and communications at each time slice).
Expected benefits:
• Overlapping communication and I/O with computation within a subset
• No memory page miss/replacement during subset execution
• Known co-scheduling optimizations remain applicable within a subset
Potential limitation:
• High "subset context" switching overhead
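The subset construction can be sketched as a greedy first-fit packing of applications into memory-sized bins (a minimal sketch under our own simplified memory model, not the paper's actual admission policy):

```python
def make_subsets(apps, mem_capacity):
    """Greedy first-fit partition of applications into subsets that fit
    in physical memory. `apps` maps application name -> memory need;
    the names and the single-number memory model are illustrative."""
    subsets, current, used = [], [], 0
    for name, need in apps.items():
        if used + need > mem_capacity and current:
            subsets.append(current)   # close the current subset: it is full
            current, used = [], 0
        current.append(name)
        used += need
    if current:
        subsets.append(current)
    return subsets

# The slide's example: 6 applications, 2 subsets of 3 with 1 GB of RAM
apps = {f"app{i}": 300 for i in range(6)}   # 300 MB each
print(make_subsets(apps, 1000))
```

Each subset is then co-scheduled internally while the subsets themselves are gang scheduled against each other.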
Basic OS virtual memory management:
• Paging in pages on demand
• Paging out pages on replacement (LRU)
• Interaction with the OS scheduler (some OSes are deliberately unfair in the out-of-core situation)
Poor performance for HPC applications.
Our approach 2/2: Checkpointing
Adaptive Memory Paging for Gang-Scheduling [Ry04]: the best performance for gang scheduling is obtained by
1) selective paging out (swapping out only pages of descheduled processes),
2) aggressive paging out (evicting pages of descheduled processes at once),
3) adaptive paging in (swapping in pages of the scheduled process at once).
Good, but this requires deep kernel modifications.
(Figure: page movements between memory and disk, contrasting OS virtual memory management with checkpointing.)
Our approach: user-level application subset checkpointing.
• Checkpointing provides the same benefits as 1), 2) and 3).
• It works for co-scheduling as well as for gang scheduling.
• It does not require any kernel modification!
We need a parallel application (MPI) checkpoint mechanism.
Implementation using the MPICH-V framework
MPICH-V framework: a set of components. An MPICH-V protocol: a composition of a subset of these components.
(Figure: nodes running MPI processes and daemons, connected through the network to a dispatcher, a checkpoint scheduler, checkpoint servers, channel memories, event loggers and a fault detector.)
Checkpoint protocol selection: coordinated or uncoordinated?
6 protocols are implemented in MPICH-V:
• 1 coordinated (Chandy-Lamport)
• 2 uncoordinated + pessimistic message logging
• 3 uncoordinated + causal message logging
The coordinated one provides the best performance for fault-free execution.
Coordinated checkpoint: 2 ways
• Flushing the network (Chandy-Lamport): the checkpoint image of P1 = process state + in-transit messages; the restart may last longer (stored messages must be delivered).
• Checkpointing the communication stack (Parakeet, Meiosys, SCore): the checkpoint image of P1 = process state + communication buffers; the checkpoint may last longer.
(Figure: timelines for P0 and P1 under both approaches, showing the checkpoint signal, the network flush, message delivery and restart.)
1) We expect minor performance differences between the two approaches.
2) Checkpoint/restart of the communication stack requires OS modifications.
So we implemented the Chandy-Lamport approach.
MPICH-V/CL protocol
Coordinated checkpointing (Chandy-Lamport), the reference protocol for coordinated checkpointing:
1) When receiving a checkpoint tag, start a checkpoint and store any incoming message.
2) Store all in-transit incoming messages in the checkpoint image.
3) Send the checkpoint tag to all neighbors in the topology. The checkpoint is finished when a tag has been received from all neighbors.
4) After a crash, all nodes retrieve their checkpoint images.
5) The stored in-transit messages are delivered to the restarted processes.
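The tag-driven part of the protocol (steps 1-3) can be sketched as a toy per-process state machine (illustrative Python; the class, the message format and the `send` callback are our own simplifications of what the MPICH-V daemons actually do):

```python
class CLProcess:
    """Toy Chandy-Lamport participant for one MPI process."""
    def __init__(self, rank, neighbors):
        self.rank = rank
        self.neighbors = set(neighbors)
        self.checkpointing = False
        self.tags_seen = set()   # neighbors whose channel is already cut
        self.in_transit = []     # messages stored in the checkpoint image

    def start_checkpoint(self, send):
        self.checkpointing = True   # (the process image would be recorded here)
        for n in self.neighbors:    # step 3: tag every outgoing channel
            send(n, ("TAG", self.rank))

    def receive(self, msg, send):
        kind, payload = msg
        if kind == "TAG":
            if not self.checkpointing:   # step 1: first tag starts the checkpoint
                self.start_checkpoint(send)
            self.tags_seen.add(payload)
            return self.tags_seen == self.neighbors   # True: checkpoint finished
        src, data = payload              # a "DATA" message carries (source, data)
        if self.checkpointing and src not in self.tags_seen:
            self.in_transit.append(data) # step 2: in-transit on a still-open channel
        return False
```

For instance, a process with a single neighbor finishes its checkpoint as soon as it receives that neighbor's tag, having tagged the channel back; data messages arriving on a channel whose tag has already been seen are not logged.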
Implementation details: MPICH-V
Co-scheduling: several dispatchers (no master scheduler, no checkpoint scheduler).
Gang (and hybrid) scheduling: a master scheduler plus several checkpoint schedulers.
1) The master scheduler issues a checkpoint order to the checkpoint scheduler(s) of the running application(s).
2) When receiving this order, a checkpoint scheduler launches a coordinated checkpoint. Every running daemon computes the MPI process image and stores it on the local disk. All daemons send a completion message to the checkpoint scheduler.
3) All running daemons stop the MPI processes and their own execution.
4) The master scheduler selects the checkpoint scheduler(s) of other application(s) and sends a restart order. Every checkpoint scheduler receiving this order spawns new daemons that restart the MPI processes from the local images.
(Figure: the co-scheduling architecture with dispatchers only, versus the gang/hybrid architecture with a master scheduler, checkpoint schedulers, checkpoint servers, daemons and nodes on the network.)
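Steps 1-4 amount to a rotation loop inside the master scheduler, which can be sketched as follows (the two callbacks stand in for the orders sent to the per-application checkpoint schedulers; the function name and signature are ours, not the real MPICH-V interface):

```python
def gang_rotate(subsets, n_slices, checkpoint, restart):
    """Rotate execution among application subsets, one subset per time slice.

    checkpoint(subset) models steps 1-3: coordinated checkpoint, then stop.
    restart(subset)    models step 4: spawn daemons from the local images.
    """
    running = 0
    restart(subsets[running])            # initial launch of the first subset
    for _ in range(n_slices):
        checkpoint(subsets[running])     # steps 1-3
        running = (running + 1) % len(subsets)
        restart(subsets[running])        # step 4
    return subsets[running]

# Trace two time slices with two subsets:
order = []
gang_rotate([["A1", "A2"], ["B1", "B2"]], 2,
            checkpoint=lambda s: order.append(("ckpt", s)),
            restart=lambda s: order.append(("run", s)))
print(order)
```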
Outline
• Introduction (related work)
• A Hybrid approach dedicated to out-of-core
• Evaluation
• Concluding remarks
Methodology
• LRI cluster (Beowulf cluster):
– Athlon 1800+
– 1 GB memory
– IDE ATA100 disk
– 100 Mb/s Ethernet
– Linux 2.4.2
• Benchmarks (MPI):
– NAS BT (computation bound)
– NAS CG (communication bound)
• Time measurement:
– Homogeneous applications
– Simultaneous launch (scripts)
– Time is measured between the first launch and the last termination
– Fairness is measured by the standard deviation of the response times
• Gang scheduling time slice: 200 or 600 seconds
– Gang scheduling is also implemented by checkpointing (not OS signals)
Context switch overlap policy
We can imagine several policies for switching between subset contexts (execution, context storage and context load within a time slice). Which one is the best for in-core and out-of-core situations?
A) Sequential store and load: 1X (1 context in memory)
B) Store and load in parallel: 2X (2 contexts in memory)
C) Load prefetch: 2X (2 contexts in memory)
Measurements on NAS benchmark BT C 25 (in-core and near out-of-core) show differences below 3%:
1) The overlapping policies do not provide substantial improvements in the in-core situation.
2) They need 2x the memory capacity to stay in-core.
So the sequential policy is the best; we used it for the other experiments.
Co- vs. gang scheduling (checkpoint based)
Makespan: execution time of N applications with co- and gang scheduling (NAS benchmarks CG and BT).
• Which scheduling strategy is the best for communication-bound and computation-bound applications?
(Figure: makespan in seconds, from 0 to 8000, versus the number of CG-C-8 and BT-B-9 applications executed "simultaneously", covering the in-core and out-of-core regimes, for co-scheduling and checkpoint-based gang scheduling; out-of-core co-scheduling of CG exceeds 24,000 s.)
1) Co-scheduling is the best for in-core executions (but the advantage is small, due to the checkpoint overhead and the tiny communication/computation overlap).
2) Gang scheduling outperforms co-scheduling for out-of-core executions (checkpoint based): the memory constraint is managed by checkpointing, not by delaying jobs.
Checkpoint-based gang vs. checkpoint-based hybrid
Makespan: execution time of N applications with co-, gang and hybrid scheduling (subsets of 5).
(Figure: time in minutes versus the number of CG-C-8 and BT-B-9 applications executed "simultaneously", in-core and out-of-core, for co-scheduling, gang scheduling and hybrid scheduling; out-of-core co-scheduling exceeds 3000. Annotations mark the checkpoint overhead and the communication/computation overlap.)
• Gang and hybrid scheduling outperform co-scheduling for out-of-core executions.
• Hybrid scheduling compares favorably to gang scheduling on BT out-of-core, thanks to communication and computation overlap.
Overhead comparison
Relative slowdown: (total time / number of concurrent executions) / best sequential time.
• What is the performance degradation due to time sharing?
(Figure: relative slowdown versus the number of concurrent CG-C-8 executions (6 to 21) and BT-B-9 executions (12 to 24), for co-, gang and hybrid scheduling, with the sequential execution as reference 1.0.)
1) Gang and hybrid scheduling add no performance penalty to CG (and also no improvement).
2) Gang scheduling adds a 10% performance penalty to BT.
3) Hybrid scheduling improves BT performance by almost 10%.
4) The difference is mostly due to communication/computation overlap.
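The relative-slowdown definition above is direct to compute (a trivial helper; the numbers in the example are made up to illustrate a 10% penalty, not measurements from the paper):

```python
def relative_slowdown(total_time, n_concurrent, best_seq_time):
    """Relative slowdown as defined on the slide:
    (total time / number of concurrent executions) / best sequential time.
    A value of 1.0 means time sharing costs nothing."""
    return (total_time / n_concurrent) / best_seq_time

# 12 concurrent runs of an application that takes 600 s alone:
print(relative_slowdown(12 * 600 * 1.1, 12, 600))   # about 1.1: a 10% penalty
```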
Co-scheduling fairness (Linux)
• How fair is co-scheduling for in-core and out-of-core?
(Table: page-miss statistics for 7 and 9 concurrent BT C 25 executions (out-of-core): the number of page misses per minute experienced by each node of an application (mean per application, app0 to app8), the mean over all applications, and the standard deviation.)
(Figure: response time of BT 9 with modified memory sizes, per application rank. In-core: M=1210, Diff=8. Slightly out-of-core: M=2251, SD=298, Diff=961.)
1) The fairness deficiency in the slightly out-of-core case seems due to the virtual memory management.
2) Of course there should be some solution, but it would involve kernel modifications.
Co-scheduling is highly unfair in the out-of-core situation!
Outline
• Introduction (related work)
• A Hybrid approach dedicated to out-of-core
• Evaluation
• Concluding remarks
Concluding remarks
• Checkpoint-based gang scheduling outperforms co-scheduling, and certainly classical (OS signal based) gang scheduling, in out-of-core situations (thanks to better memory management).
• Compared to the known approaches based on job admission control, the benefit of checkpointing is that it avoids delaying some jobs.
• Hybrid scheduling, combining the two approaches plus checkpointing, outperforms gang scheduling on BT (presumably thanks to overlapping communications and computations).
• More generally, hybrid scheduling can take advantage of advanced co-scheduling approaches within a gang subset.
Work in progress:
• Test with other applications/benchmarks
• Compare with traditional gang scheduling based on OS signals
• Experiments with high-speed networks
• Experiments on hybrid scheduling with co-scheduling optimizations
References
[Ag03] S. Agarwal, G. Choi, C. R. Das, A. B. Yoo, and S. Nagar. "Co-ordinated Coscheduling in Time-Sharing Clusters through a Generic Framework". In Proceedings of the International Conference on Cluster Computing, December 2003.
[Ar98] A. C. Arpaci-Dusseau, D. E. Culler, and A. M. Mainwaring. "Implicit Scheduling with Implicit Information in Distributed Systems". In Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 233–243, June 1998.
[Ba00] Anat Batat and Dror G. Feitelson. "Gang Scheduling with Memory Considerations". In Proceedings of IPDPS 2000.
[Bo03] Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, and Franck Cappello. "Coordinated Checkpoint versus Message Log for Fault Tolerant MPI". In IEEE International Conference on Cluster Computing (Cluster 2003). IEEE CS Press, December 2003.
[Ch85] K. M. Chandy and L. Lamport. "Distributed Snapshots: Determining Global States of Distributed Systems". ACM Transactions on Computer Systems, 3(1):63–75, February 1985.
[Fe98] D. G. Feitelson and L. Rudolph. "Metrics and Benchmarking for Parallel Job Scheduling". In Job Scheduling Strategies for Parallel Processing, LNCS vol. 1495, pages 1–24, Springer-Verlag, March 1998.
[Fr03] Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini, and Juan Fernandez. "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources". IPDPS 2003.
[Ho98] Atsushi Hori, Hiroshi Tezuka, and Yutaka Ishikawa. "Overhead Analysis of Preemptive Gang Scheduling". Lecture Notes in Computer Science, 1459:217–230, April 1998.
[Ry04] Kyung Dong Ryu, Nimish Pachapurkar, and Liana L. Fong. "Adaptive Memory Paging for Efficient Gang Scheduling of Parallel Applications". In Proceedings of IPDPS 2004.
[Na99] S. Nagar, A. Banerjee, A. Sivasubramaniam, and C. R. Das. "Alternatives to Coscheduling a Network of Workstations". Journal of Parallel and Distributed Computing, 59(2):302–327, November 1999.
[Ni02] Dimitrios S. Nikolopoulos and Constantine D. Polychronopoulos. "Adaptive Scheduling under Memory Pressure on Multiprogrammed Clusters". CCGRID 2002.
[Sa04] Gyu Sang Choi, Jin-Ha Kim, Deniz Ersoz, Andy B. Yoo, and Chita R. Das. "Coscheduling in Clusters: Is It a Viable Alternative?". SC2004.
[Se99] S. Setia, M. S. Squillante, and V. K. Naik. "The Impact of Job Memory Requirements on Gang-Scheduling Performance". ACM SIGMETRICS Performance Evaluation Review, 26(4):30–39, 1999.
[So98] P. G. Sobalvarro, S. Pakin, W. E. Weihl, and A. A. Chien. "Dynamic Coscheduling on Workstation Clusters". In Proceedings of the IPPS Workshop on Job Scheduling Strategies for Parallel Processing, pages 231–256, March 1998.
[St04] Peter Strazdins and John Uhlmann. "Local Scheduling Outperforms Gang Scheduling on a Beowulf Cluster". Technical report, Department of Computer Science, Australian National University, January 2004; Cluster 2004.
[Wi03] Yair Wiseman and Dror G. Feitelson. "Paired Gang Scheduling". IEEE TPDS, June 2003.
Related work: optimizations
• Memory management (mainly based on job admission control):
– The Impact of Job Memory Requirements on Gang-Scheduling Performance [Se99] (cont. of multiprog.)
– Gang Scheduling with Memory Considerations [Ba00] (job admission control to avoid swapping)
– Memory-aware co-scheduling [Ch04] (job admission control to avoid swapping)
– Adaptive Memory Paging for Gang-Scheduling [Ry04] (improving memory paging in/out)
• Communications (concerns co-scheduling):
– ICS (Implicit Co-scheduling), SB (Spin Blocking), CC (Coordinated Co-scheduling): self-descheduling after a timeout on communication [Ar98][Na99]
– DCS (Dynamic Co-scheduling): an incoming message triggers the receiver's scheduler [So98]
– PB (Periodic Boost): schedule the receiver based on a periodic check of the receive buffer [Na99]
Is the in-core result kernel dependent (Linux)?
Kernel 2.4.2 was used in our experiments. How does time-sharing efficiency evolve with Linux kernel maturation (from 2.4 to 2.6)?
(Figure: computation, communication and execution times (s) versus the number of concurrent CG A 4 and BT A 9 executions (in-core), for kernels 2.4.2, 2.6.2 and 2.6.7.)
Yes, the performance of co-scheduling (in-core) depends on the kernel:
1) Kernel 2.6.2 is less efficient (much less for CG).
2) Kernels 2.6.7 and 2.4.2 provide overall similar performance.
Careful selection of the kernel version OR restriction (deactivation) of co-scheduling.