Resilient Distributed Concurrent Collections
Cédric Bassem
Promotor: Prof. Dr. Wolfgang De Meuter
Advisor: Dr. Yves Vandriessche
Evolution of Performance in High Performance Computing
(source: http://www.top500.org/statistics/perfdevel/)
Petascale = 10^15 Flop/s
Exascale = 10^18 Flop/s
Evolution of Failures in HPC
Main Source: Hardware Faults (~ 50%)
Source: Franck Cappello (2009)
In Exascale: SMTTI < 30 min
SMTTI = System Mean Time To Interrupt
Resilience
“The collection of techniques for keeping applications running to a correct solution in a timely and efficient manner despite underlying system faults” Snir et al. (2014)
Resilience = Fault Tolerance, Avizienis et al. (2004)
Coordinated Checkpoint/Restart
Asynchronous Checkpoint/Restart
Requirements for Asynchronous Checkpoint/Restart
Reasoning about state: Self-aware, execution frontier
Safe restart: Deterministic computation
Data race free: Monotonically increasing state
Resilience in CnC
Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master’s thesis). Rice University, Houston, Texas, USA.
CnC Properties:
● Dependency graph
● Provably deterministic computation
● Single assignment data
Focused on shared memory CnC runtimes
The Concurrent Collections Model
[Figure sequence, slides 9 to 16: a Fibonacci CnC graph with the labels env, Tags, Fibs, Results, and Checkpoint. Only the labels below survive in the transcript; which value belongs to Tags, Fibs, or Results is inferred from the order in which they appear.]
Slide 9: Tags 0, 1, 2. Checkpoint: 0, 1, 2.
Slide 10: Tags 0, 1, 2. Fibs 0. Results 0:0. Checkpoint: 0, 1, 2, 0:0.
Slide 11: Tags 0, 1, 2. Fibs 1. Results 0:0, 1:1. Checkpoint: 0, 1, 2, 0:0, 1:1.
Slide 12: Tags 0, 1, 2. Fibs 2. Results 0:0, 1:1. Checkpoint: 0, 1, 2, 0:0, 1:1.
Slides 13 and 14: Tags, Fibs, and Results are empty. Checkpoint: 0, 1, 2, 0:0, 1:1.
Slide 15: Tags 2. Fibs 2. Results 0:0, 1:1. Checkpoint: 0, 1, 2, 0:0, 1:1, 2:1.
Slide 16: env 2:1. Tags 2. Results 0:0, 1:1, 2:1. Checkpoint: 0, 1, 2, 0:0, 1:1.
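The slide sequence above can be mimicked with a small stand-alone sketch. It does not use the Intel CnC API: `ItemCollection`, `Checkpoint`, `fibStep`, and `restart` are hypothetical names chosen for illustration, and the code assumes C++17. Each step logs its result to the checkpoint, `clear()` simulates a failure that wipes the results, and the restart re-runs only the steps whose results are missing.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <set>
#include <stdexcept>

// Hypothetical stand-in for a CnC item collection: a tag may be written once.
class ItemCollection {
public:
    void put(int tag, long value) {
        auto [it, inserted] = items_.emplace(tag, value);
        if (!inserted && it->second != value)
            throw std::logic_error("single assignment violated");
    }
    std::optional<long> get(int tag) const {
        auto it = items_.find(tag);
        if (it == items_.end()) return std::nullopt;
        return it->second;
    }
    void clear() { items_.clear(); }  // simulates losing the node's memory
    std::size_t size() const { return items_.size(); }

private:
    std::map<int, long> items_;
};

// Hypothetical checkpoint: the prescribed tags and the items produced so far.
struct Checkpoint {
    std::set<int> tags;
    std::map<int, long> items;
};

// One Fibs step: reads its inputs from Results, writes its output, logs it.
// Returns false if an input is not available yet.
bool fibStep(int tag, ItemCollection& results, Checkpoint& cp) {
    long value = tag;  // fib(0) = 0, fib(1) = 1
    if (tag >= 2) {
        auto a = results.get(tag - 1);
        auto b = results.get(tag - 2);
        if (!a || !b) return false;
        value = *a + *b;
    }
    results.put(tag, value);
    cp.items.emplace(tag, value);
    return true;
}

// Restart: restore the logged items, then re-run only the missing steps.
// Tags are visited in ascending order, so the inputs of a re-run step are
// already present (assumes contiguous tags, as in the slides' example).
void restart(ItemCollection& results, Checkpoint& cp) {
    for (const auto& [tag, value] : cp.items) results.put(tag, value);
    for (int tag : cp.tags) {
        if (!results.get(tag)) {
            bool ok = fibStep(tag, results, cp);
            assert(ok);
            (void)ok;
        }
    }
}
```

With tags 0, 1, 2 prescribed and steps 0 and 1 completed before the failure, `restart` restores 0:0 and 1:1 and re-executes only step 2, which produces 2:1.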
Proof of Concept Implementation
Goal: Assessing the viability of Asynchronous C/R in distributed memory CnC runtimes
Resilience Flavour:
● Dedicated checkpoint node
● Fine grained updates
● Uncoordinated restart
Runtime: Intel(R) Concurrent Collections for C++ (Architect: Frank Schlimbach)
Dedicated Checkpoint Node & Fine grained Updates
[Figure: four boxes labelled Node and one labelled Checkpoint]
Updates contain:
● data instances consumed
● data instances produced
● control instances produced
● producers
● consumers
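As a rough illustration of what one fine grained update might carry, the five categories listed above can be written down as a plain record. This is only a sketch: the slides do not show the real message layout, so `Instance`, `CheckpointUpdate`, and `recordCount` are hypothetical names, and representing every entry as a collection name plus a tag is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical identifier for one instance: collection name plus tag.
struct Instance {
    std::string collection;  // e.g. "Results", "Tags", or "Fibs"
    long tag;                // tag identifying the instance
};

// Hypothetical update sent from a compute node to the checkpoint node.
struct CheckpointUpdate {
    std::vector<Instance> dataConsumed;     // data instances consumed
    std::vector<Instance> dataProduced;     // data instances produced
    std::vector<Instance> controlProduced;  // control instances produced
    std::vector<Instance> producers;        // producing step instances
    std::vector<Instance> consumers;        // consuming step instances
};

// Number of instance records in one update: a rough proxy for how much
// a single fine grained update adds to the checkpoint traffic.
std::size_t recordCount(const CheckpointUpdate& u) {
    return u.dataConsumed.size() + u.dataProduced.size() +
           u.controlProduced.size() + u.producers.size() + u.consumers.size();
}
```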
Restart
[Figure: four boxes labelled Node, with the numbers 1 to 4]
Restart simulation ➜ No fault tolerant MPI
Uncoordinated ➜ Step duplication
Memory Management in CnC
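Non-trivial: data accessed by dynamic steps
One solution: get-counting method
// Number of gets expected on the data instance with tag t;
// once that many gets have happened, the instance can be freed.
int getCountFib( FibTag t ) {
    if ( t > 0 ) {
        return 2;
    } else {
        return 1;
    }
}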
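To show how a get count can drive memory reclamation, here is a minimal sketch of an item store that frees a data instance after its last expected get. It is not the Intel CnC implementation: `CountedStore` is a hypothetical class, `FibTag` is assumed to be an `int` because the slides do not define it, and the code assumes C++17.

```cpp
#include <cassert>
#include <map>
#include <optional>

using FibTag = int;  // assumption: the slides do not show FibTag's definition

// Get count as on the slide: expected number of gets on the item with tag t.
int getCountFib(FibTag t) { return t > 0 ? 2 : 1; }

// Hypothetical item store that erases an item after its last expected get.
class CountedStore {
public:
    void put(FibTag t, long value) { items_[t] = Entry{value, getCountFib(t)}; }

    // Returns the value and lowers the get count; erases the item at zero.
    std::optional<long> get(FibTag t) {
        auto it = items_.find(t);
        if (it == items_.end()) return std::nullopt;
        long value = it->second.value;
        if (--it->second.remainingGets == 0) items_.erase(it);
        return value;
    }

    bool contains(FibTag t) const { return items_.count(t) != 0; }

private:
    struct Entry {
        long value;
        int remainingGets;
    };
    std::map<FibTag, Entry> items_;
};
```

In this sketch the item with tag 0 disappears after one get, while an item with a tag above 0 survives its first get and is freed by the second.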
Solution
Extra bookkeeping in checkpoint:
➢ Consider steps only once when lowering get counts
  ○ Hashmap of considered steps
➢ Never re-add removed data instances
  ○ Marking data as removed
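The two bookkeeping rules can be sketched as follows. This is an illustration under assumptions, not the proof of concept's code: `CheckpointStore` and its members are hypothetical, tags are plain `int`s, and the sketch assumes that the update announcing a produced item reaches the checkpoint before any update that consumes it.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>

// Hypothetical checkpoint-side store with the two extra bookkeeping rules.
class CheckpointStore {
public:
    // Record a produced data instance, unless it was already removed.
    void addItem(int tag, long value, int getCount) {
        if (removed_.count(tag)) return;  // rule 2: never re-add removed data
        items_.emplace(tag, Entry{value, getCount});
    }

    // Record that step `stepTag` consumed item `itemTag`.
    // A duplicated step (for example one re-executed after an uncoordinated
    // restart) lowers the get count only once.
    void consume(int stepTag, int itemTag) {
        if (!considered_[stepTag].insert(itemTag).second) return;  // rule 1
        auto it = items_.find(itemTag);
        if (it == items_.end()) return;
        if (--it->second.remainingGets == 0) {
            items_.erase(it);
            removed_.insert(itemTag);  // mark the data instance as removed
        }
    }

    bool holds(int tag) const { return items_.count(tag) != 0; }

private:
    struct Entry {
        long value;
        int remainingGets;
    };
    std::unordered_map<int, Entry> items_;
    std::unordered_map<int, std::unordered_set<int>> considered_;  // step -> items
    std::unordered_set<int> removed_;
};
```

Without the `considered_` map, a duplicated step would lower a get count twice and free an item too early; without `removed_`, a duplicated producer would bring back an item that no remaining step will ever consume.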
Modelling Overhead (Tw/Ts)
Coordinated Checkpoint/Restart (Daly, 2006)
Asynchronous Checkpoint/Restart
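The equations on this slide are not preserved in the transcript. For the coordinated case, the sketch below encodes the model of Daly (2006) as it is commonly stated, which may differ in notation from what the slide showed. The parameter names are assumptions: `tau` is the compute time between checkpoints, `delta` the time to write a checkpoint, `R` the restart time, and `M` the mean time to interrupt, all in the same unit. The asynchronous model is not reconstructed here because nothing of it survives.

```cpp
#include <cassert>
#include <cmath>

// Daly's higher order estimate of the optimum compute interval between
// checkpoints, for checkpoint write time delta and mean time to interrupt M.
double dalyOptimumInterval(double delta, double M) {
    if (delta >= 2.0 * M) return M;
    const double x = delta / (2.0 * M);
    return std::sqrt(2.0 * delta * M) * (1.0 + std::sqrt(x) / 3.0 + x / 9.0) - delta;
}

// Expected overhead factor Tw/Ts for coordinated checkpoint/restart:
// Tw/Ts = M * exp(R/M) * (exp((tau + delta)/M) - 1) / tau.
double dalyOverheadFactor(double tau, double delta, double R, double M) {
    return M * std::exp(R / M) * std::expm1((tau + delta) / M) / tau;
}
```

When failures are very rare (large `M`), the factor approaches `(tau + delta) / tau`, the plain cost of writing checkpoints; as `M` shrinks towards the sub-30-minute SMTTI mentioned earlier, the exponential terms dominate.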
Evaluating Asynchronous Checkpoint/Restart
Benchmarks - Goals
Assessing overhead factor (φ): Ok if high
Method:
  Measure w/o resilience = Solve time (Ts)
  Measure with resilience = Wall clock time (Tw)
  Overhead factor = Tw/Ts
Assessing restart time (Tr): Should be low
Method:
  Measure time needed to calculate the restart set
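The measurement method can be sketched with a small timing helper. `secondsTaken` and `overheadFactor` are hypothetical names, not part of the benchmark code; the helper would be called once around a run without resilience (Ts) and once around a run with resilience (Tw).

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the elapsed wall clock time in seconds.
template <typename Work>
double secondsTaken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Overhead factor phi = Tw / Ts.
double overheadFactor(double wallClockTw, double solveTs) {
    return wallClockTw / solveTs;
}
```

The same helper could wrap the computation of the restart set to obtain Tr.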
Number of Steps
[Figures: Fibonacci and Mandelbrot]
Overhead factor (φ): Increases with number of steps
Restart Time
[Figure: Fibonacci: Restart Time]
Restart Time (Tr): Low
Optimization: Shifting some of the complexity to the overhead factor
Future Work
Distributed Checkpoint:
➢ Overhead high but constant
➢ Restart time?
Tag-only logging:
➢ Less communication
➢ Complex restart
Conclusion
Asynchronous C/R in a distributed memory CnC runtime
➢ Analyzing different cases
➢ Proof of concept implementation
Asynchronous C/R is viable for systems with low SMTTI
➢ Model
➢ Proof of concept implementation
References
Daly, J. T. (2006). A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps. Future Generation Computer Systems, 22(3), 303–312.
Avizienis, A., Laprie, J., Randell, B., & Landwehr, C. E. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33.
Snir, M., Wisniewski, R. W., Abraham, J. A., Adve, S. V., Bagchi, S., Balaji, P., . . . Hensbergen, E. V. (2014). Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications, 28(2), 129–173.
Cappello, F. (2009). Fault Tolerance in Petascale/Exascale Systems: Current Knowledge. International Journal of High Performance Computing Applications, 23(1), 212–226.
Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master’s thesis). Rice University, Houston, Texas, USA.