Optimization of Google Cloud Task Processing with Checkpoint-Restart Mechanism
Speaker: Sheng Di
Coauthors: Yves Robert, Frédéric Vivien, Derrick Kondo, Franck Cappello
Outline
- Background of Google Cloud Task Processing
- System Overview
- Research Formulation
- Optimization of Fault-tolerance
  - Optimization of the Number of Checkpoints
  - Adaptive Optimization of Fault Tolerance
  - Local disk vs. Shared disk
- Performance Evaluation
- Conclusion and Future Work
Background
Google trace (released in November 2011):
- 670,000 jobs, 2,500,000 tasks, 12,000 nodes
- One-month period (29 days)
- Various events, resource request/allocation, job/task length, various attributes, etc.
There are two types of jobs in Google trace: sequential-task jobs and Bag-of-Task jobs
4000 application types, such as map-reduce.
Failure events occur often for some tasks!
Most task lengths are short (a few minutes to dozens of minutes), so task execution is sensitive to checkpointing cost.
System Overview
[Figure: system layers: User Interface (Task Parser), Job/Task Scheduling Layer, Resource Allocation Layer, Virtual Machine Layer, Physical Infrastructure Layer, Service Layer, and Fault Tolerance]
- User Interface: receive tasks
- Task Scheduling: coordinate resource competition among hosts
- Resource Allocation: coordinate resource usage within a particular host
System Overview (Cont’d)
Task Processing Procedure
[Figure: task processing procedure. Labels: Job Submission; Cloud server; Job; Task; Queue; Resource Pool; Task scheduling notification; Physical node; Running VM; Failed VM or Service; Process Restart or Migration. Stages: Job scheduling & Resource Isolation; Task Execution & Checkpointing; Process Restarting & Migration]
Research Formulation
- Analysis of Google trace: task failure intervals, task length, job structure
- Equidistant checkpointing model: the checkpointing interval for a particular task is fixed
- Task execution model (suppose k failures): Tw(task) = Te(task) + C(x-1) + Σk{roll-back-loss} + Σk{restart-cost}
- Objective: minimizing E(Tw(task))
  - Random variable: K (# of task failure events)
  - Compute the optimal # of checkpoints for a Google task
[Figure: a task’s wall-clock time from Task Entry to Task Exit, divided into productive time, checkpoint cost, roll-back loss, and restart cost]
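The execution model above can be evaluated directly once the failures are known. The following is a minimal sketch, assuming a constant restart cost per failure and a caller-supplied list of roll-back losses (one entry per failure); the function and parameter names are illustrative and do not come from the talk.

```python
def wall_clock_time(te, c, x, rollback_losses, restart_cost):
    """Tw = Te + C*(x-1) + sum(roll-back losses) + sum(restart costs).

    te: productive length, c: cost of one checkpoint,
    x: number of checkpointing intervals (so x-1 checkpoints),
    rollback_losses: work lost at each failure (hypothetical input),
    restart_cost: assumed identical for every failure.
    """
    k = len(rollback_losses)  # number of failures
    return te + c * (x - 1) + sum(rollback_losses) + k * restart_cost


# Hypothetical numbers: 18 s of work, 3 intervals, two failures.
tw = wall_clock_time(te=18.0, c=2.0, x=3,
                     rollback_losses=[3.0, 1.5], restart_cost=1.0)
print(tw)  # 28.5
```

More checkpoints raise the C(x-1) term but shrink each roll-back loss, which is the trade-off that the objective E(Tw(task)) captures.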
Optimization of the Number of Checkpoints: New formula
Theorem 1:
- x*: the optimal number of checkpointing intervals
- Te: task execution length (productive length)
- E(Y): task’s expected # of failures (characterized by MNOF)
- C: checkpoint cost (time increment per checkpoint)
Formula (3): x* = sqrt(Te * E(Y) / (2 * C))
Example:
- A task’s productive length is 18 seconds, C = 2 sec, and the expected # of failures during its execution is 2
- Optimal # of checkpointing intervals = sqrt(18*2/(2*2)) = 3
- Optimal checkpointing interval = 18/3 = 6 seconds
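The worked example can be reproduced in a few lines. This is a sketch that assumes Formula (3) has the form x* = sqrt(Te * E(Y) / (2 * C)), which is the reading consistent with the arithmetic sqrt(18*2/(2*2)) on the slide; the function name is illustrative.

```python
import math


def optimal_intervals(te, expected_failures, c):
    """x* = sqrt(Te * E(Y) / (2 * C)), as read off the worked example."""
    return math.sqrt(te * expected_failures / (2.0 * c))


te, mnof, c = 18.0, 2.0, 2.0
x_star = optimal_intervals(te, mnof, c)
print(x_star)       # 3.0
print(te / x_star)  # 6.0
```

In general x* is not an integer; how it should be rounded is not stated on the slide, so the sketch leaves the value unrounded.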
Optimization of the Number of Checkpoints: Discussion
Formula (3) does not depend on the probability distribution of failures, unlike Young’s formula
Young’s formula (proposed in 1974)
- Optimal checkpoint interval: Tc = sqrt(2 * C * Tf)
- C: checkpointing cost
- Tf: mean time between failures (MTBF)
- Conditions:
  (1) Task failure intervals follow an exponential distribution
  (2) Checkpoint cost C is far smaller than the checkpoint interval Tc
  Due to Taylor series and second-order approximation
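For comparison, Young's interval Tc = sqrt(2 * C * Tf) is equally short to compute. The sketch below also checks condition (2); the 0.1 threshold used for "far smaller" is an arbitrary assumption made for illustration, as are the function names.

```python
import math


def young_interval(c, mtbf):
    """Young's first-order optimum: Tc = sqrt(2 * C * Tf)."""
    return math.sqrt(2.0 * c * mtbf)


def cost_is_small(c, tc, ratio=0.1):
    """Condition (2) check; the 0.1 threshold is an arbitrary assumption."""
    return c <= ratio * tc


tc = young_interval(c=2.0, mtbf=9.0)
print(tc)                      # 6.0
print(cost_is_small(2.0, tc))  # False
```

With these hypothetical numbers the checkpoint cost is a third of the interval, so condition (2) does not hold; short tasks with a non-negligible checkpoint cost are where the approximation is least comfortable.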
Optimization of the Number of Checkpoints: Discussion
The assumption of an exponential distribution makes Young’s formula unsuitable for Google task processing
[Figure: distribution of Google task failure intervals based on priority]
Optimization of the Number of Checkpoints: Discussion
Corollary 1: Young’s formula is a special case
Two important conditions:
- Task failure intervals follow an exponential distribution
- Checkpointing cost is small
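A short sketch of how the special case can arise. It assumes that Formula (3) reads x* = sqrt(Te * E(Y) / (2C)), as the worked example in this talk suggests, and that under the two conditions above the expected number of failures is approximately E(Y) ≈ Te / Tf. Both are assumptions of this sketch rather than statements taken from the slides.

```latex
\begin{align*}
x^* &= \sqrt{\frac{T_e \, E(Y)}{2C}}, \qquad E(Y) \approx \frac{T_e}{T_f} \\
T_c &= \frac{T_e}{x^*} = \sqrt{\frac{2C \, T_e}{E(Y)}} \approx \sqrt{2C \, T_f}
\end{align*}
```

The last expression is Young's optimal checkpoint interval with checkpointing cost C and MTBF Tf.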
Optimization of the Number of Checkpoints: Discussion
Our formula (3) is easier to apply than Young’s formula in practice
- Young’s formula depends on MTBF, while MTBF may not be easy to predict precisely:
  - Unsynchronized clocks across hosts
  - Inevitable influence of checkpointing cost
  - Significant delay of failure detection
- By contrast, MNOF is easy to record accurately
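The practical difference shows up in what has to be recorded. In the sketch below, MNOF needs only a failure counter per task, while MTBF needs failure timestamps, which is where clock skew and detection delay enter. The record layout and function names are hypothetical, not taken from the Google trace format.

```python
from statistics import mean


def mnof(failure_counts):
    """Mean number of failures per task, from per-task counters."""
    return mean(failure_counts)


def mtbf(failure_times):
    """Mean gap between consecutive failure timestamps, in seconds.

    Needs at least two timestamps; their accuracy depends on the
    clocks and on how quickly each failure was detected.
    """
    ts = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return mean(gaps)


# Hypothetical records: failure counts for four tasks, and failure
# timestamps (seconds) observed for one task.
print(mnof([2, 0, 1, 3]))          # 1.5
print(mtbf([10.0, 190.0, 400.0]))  # 195.0
```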
Adaptive Optimization of Checkpoint Positions
- Problem: what if the probability distribution of failure intervals (or failure rates) changes over time? This is possible due to changeable priority.
- Objective: to design an adaptive algorithm that dynamically suits the changing failure rates.
- Question: will the optimal checkpoint positions change as the remaining workload decreases over time?
- Solution: we just need to monitor MNOF, regardless of the decreasing remaining workload to process, because of Theorem 2
[Figure: timeline showing the kth and (k+1)th checkpoints, the current time, and the optimal checkpoint intervals later on]
Adaptive Optimization of Fault Tolerance (Cont’d)
Theorem 2:
[Equation labels: optimal # of checkpointing intervals computed at the (k+1)th checkpoint position; optimal # of checkpointing intervals computed at the kth checkpoint position]
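A numeric check can illustrate why the shrinking workload alone need not move the checkpoints. The sketch assumes the form x* = sqrt(Te * E(Y) / (2 * C)) and, as a further assumption of this sketch only, that the expected number of failures over the remaining work scales in proportion to the remaining productive length. It illustrates the idea and is not a restatement of the equation in Theorem 2.

```python
import math


def optimal_intervals(te, expected_failures, c):
    """x* = sqrt(Te * E(Y) / (2 * C)) (assumed form of the formula)."""
    return math.sqrt(te * expected_failures / (2.0 * c))


te, ey, c = 18.0, 2.0, 2.0
x_now = optimal_intervals(te, ey, c)
interval_now = te / x_now

# After the first checkpoint, one interval of work is done.
remaining = (x_now - 1.0) / x_now
te_rem = te * remaining   # remaining productive length
ey_rem = ey * remaining   # assumed proportional scaling of failures
x_next = optimal_intervals(te_rem, ey_rem, c)
interval_next = te_rem / x_next

# One interval fewer, same spacing between checkpoints.
print(math.isclose(x_next, x_now - 1.0))
print(math.isclose(interval_next, interval_now))
```

Under that assumption the spacing only needs recomputing when the monitored MNOF itself changes, for example after a priority change.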
Local disk vs. Shared disk checkpointing
Characterization based on BLCR
Operation time cost in setting a checkpoint
Performance Evaluation
Experimental Setting
- We build a testbed based on Google trace, in a cluster with hundreds of VM instances running across 16 nodes (16*8 cores, 16*16GB memory size, XEN4.0, BLCR)
- We call it GloudSim (Google based cloud simulation system) [under review by HiPC’13]
- We reproduce Google task execution as closely as possible to Google trace, e.g.,
  - Task arrivals are based on the trace or some distribution
  - Task’s memory is reproduced via Google trace
  - Task’s failure events are reproduced via Google trace
  - Each job is chosen from among all sample jobs in the trace
Performance Evaluation (Cont’d)
Experimental Results
- Job’s Workload-Processing Ratio (WPR)
- Checkpointing effect with precise prediction (on MNOF and MTBF)
Performance Evaluation (Cont’d)
Distribution of WPR with different C/R formulas
Performance Evaluation (Cont’d)
MNOF & MTBF w.r.t. Priority in Google trace
MNOF is stable with respect to task length, while MTBF is not stable (changing from 179 to 4199 secs)
Performance Evaluation (Cont’d)
Min/Avg/Max WPR with respect to different priorities
Our formula outperforms Young’s formula by 3-10%
Performance Evaluation (Cont’d)
Wall-clock lengths of 10,000 job executions
Conclusion: Job wall-clock lengths are often 50-100 seconds longer under Young’s formula than under ours.
Performance Evaluation (Cont’d)
Adaptive Algorithm vs. Static Algorithm
Conclusion and Future Work
Selected conclusions:
- Our formula (3) is better than Young’s formula by 3-10 percent, w.r.t. Google task processing
- Job wall-clock lengths are 50-100 seconds longer under Young’s formula than under ours.
- Worst WPR under the dynamic algorithm stays about 0.8, compared to 0.5 under the static algorithm.
Future work
- Port our theorems to more cases like MPI over Cloud platforms.