TRANSCRIPT
Computer and Computational Sciences Division, Los Alamos National Laboratory
On the Feasibility of Incremental Checkpointing for Scientific Computing
Jose Carlos Sancho
with Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg
Performance and Architectures Lab (PAL)
Jose C. Sancho, [email protected]
International Parallel & Distributed Processing Symposium, Santa Fe, NM
Talk Overview
Goal
Fault-tolerance for Scientific Computing
Methodology
Characterization of Scientific Applications
Performance Evaluation of Incremental Checkpointing
Concluding Remarks
Goal
Prove the Feasibility of Incremental Checkpointing
Frequent
Automatic
User-transparent
No changes to application
No special hardware support
Large Scale Computers
Large component count
Strongly coupled hardware
Example scale: 133,120 processors, 608,256 DRAM chips
[Figure: failure rate grows with component count]
Scientific Computing
Running for months
Demands high capability
Failures expected during the application’s execution
Providing Fault-tolerance
Hardware replication
+
Checkpointing and rollback recovery
High-cost solution!
[Figure: checkpointing to a spare node, then recovery after a failure]
Checkpointing and Recovery
Simplicity: easy implementation
Cost-effective: no additional hardware support
Critical aspect: the bandwidth required for saving the process state
Reducing Bandwidth
Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage (a rough cost comparison follows below)
[Figure: a full checkpoint saves the whole process state; an incremental checkpoint saves only the modified portion]
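To make the saving concrete, here is a toy C sketch comparing the data written by full versus incremental checkpointing over a run. The 954.6 MB footprint and 78.8 MB worst-case modified set are the Sage-1000MB figures reported later in this talk; the checkpoint count is an arbitrary illustration, not a measured value.

#include <stdio.h>

int main(void)
{
    const double footprint_mb = 954.6; /* full process state (Sage-1000MB)  */
    const double modified_mb  = 78.8;  /* worst-case data dirtied between
                                          two consecutive checkpoints       */
    const int    checkpoints  = 100;   /* arbitrary, for illustration only  */

    printf("full:        %10.1f MB written\n", checkpoints * footprint_mb);
    printf("incremental: %10.1f MB written\n", checkpoints * modified_mb);
    return 0;
}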
New Challenges
Frequent checkpoints: minimizing the rollback interval to increase system availability
Automatic and user-transparent: autonomic computing, a new vision for managing the high complexity of large systems through self-healing and self-repairing
Both put more pressure on bandwidth
Survey of Implementation Levels
Application: CLIP, Dome, CCITF
Run-time library: Ickp, CoCheck, Diskless
Operating system: just a few!
Hardware: Revive, Safetynet
Enabling Automatic Checkpointing
[Figure: moving from the application level down through the run-time library and operating system to hardware, user intervention goes from high to low (more automatic) while the amount of checkpoint data goes from low to high]
The Bandwidth Challenge
Does the current technology provide enough bandwidth for checkpointing that is:
• Frequent
• Automatic
Methodology
Analyzing the memory footprint of scientific codes with a run-time library
[Figure: the application’s memory footprint (text, static data, heap, stack, and mmap regions) is tracked with mprotect(); a sketch of the mechanism follows below]
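A minimal sketch of how such a run-time library can track modified pages on Linux, assuming the conventional mprotect()/SIGSEGV technique named on this slide: write-protect the footprint, record each page on its first write, and save only the recorded pages at checkpoint time. All names here (protect_region, on_write, checkpoint) are illustrative, not the authors' actual API, and error handling is omitted for brevity.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 4096

static char  *region;            /* tracked memory region             */
static size_t region_len;        /* its length in bytes               */
static long   page_size;
static void  *dirty[MAX_PAGES];  /* pages written since the last ckpt */
static int    ndirty;

/* First write to a protected page faults here: record the page and
 * make it writable again. Real implementations must be far more
 * careful about async-signal safety than this sketch is. */
static void on_write(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));
    dirty[ndirty++] = page;
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

/* Write-protect the region so the next write to each page traps. */
static void protect_region(void)
{
    ndirty = 0;
    mprotect(region, region_len, PROT_READ);
}

/* Incremental checkpoint: save only the dirty pages, then re-protect. */
static void checkpoint(FILE *out)
{
    for (int i = 0; i < ndirty; i++)
        fwrite(dirty[i], page_size, 1, out);
    fflush(out);
    protect_region();
}

int main(void)
{
    page_size  = sysconf(_SC_PAGESIZE);
    region_len = 64 * page_size;
    region     = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = { .sa_sigaction = on_write,
                            .sa_flags     = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    protect_region();
    region[0]              = 1;   /* dirties page 0  */
    region[10 * page_size] = 2;   /* dirties page 10 */

    FILE *out = fopen("ckpt.bin", "wb");
    int n = ndirty;
    checkpoint(out);              /* writes 2 pages, not all 64 */
    fclose(out);
    printf("saved %d of %zu pages\n", n, region_len / page_size);
    return 0;
}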
Methodology
Quantifying the bandwidth requirements
Checkpoint intervals: 1 s to 20 s
Comparing with the bandwidth currently available (a quick check of these numbers follows below):
900 MB/s sustained network bandwidth (Quadrics QsNet II)
75 MB/s sustained single-disk bandwidth (Ultra SCSI controller)
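As a back-of-the-envelope version of this comparison, the C sketch below divides the data modified per checkpoint interval by the interval length and checks the result against the two sustained rates above. The 78.8 MB modified-set figure is the Sage-1000MB worst case shown later in the talk; the assumption that the modified set stops growing at longer intervals is ours, for illustration.

#include <stdio.h>

int main(void)
{
    const double net_bw   = 900.0;  /* MB/s, Quadrics QsNet II (sustained) */
    const double disk_bw  = 75.0;   /* MB/s, single Ultra SCSI disk        */
    const double dirty_mb = 78.8;   /* MB modified per interval (Sage-1000MB
                                       worst case; assumed not to grow with
                                       longer intervals)                    */
    const double intervals[] = { 1.0, 5.0, 10.0, 20.0 };

    for (int i = 0; i < 4; i++) {
        double need = dirty_mb / intervals[i];   /* MB/s to sustain */
        printf("%5.0f s interval: %6.1f MB/s  disk: %s  network: %s\n",
               intervals[i], need,
               need <= disk_bw ? "ok" : "exceeded",
               need <= net_bw  ? "ok" : "exceeded");
    }
    return 0;
}

Note that at the 1 s interval the worst case slightly exceeds a single disk (78.8 vs. 75 MB/s) while remaining far below the network rate, which matches the "most demanding" annotation on a later slide.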
Experimental Environment
32-node Linux cluster
64 Itanium II processors, PCI-X I/O bus, Quadrics QsNet interconnection network
Parallel scientific codes: Sage, Sweep3D, and the NAS parallel benchmarks SP, LU, BT, and FT
Representative of the ASCI production codes at LANL
Memory Footprint
Application    Footprint
Sage-1000MB    954.6 MB
Sage-500MB     497.3 MB
Sage-100MB     103.7 MB
Sage-50MB       55.0 MB
Sweep3D        105.5 MB
SP Class C      40.1 MB
LU Class C      16.6 MB
BT Class C      76.5 MB
FT Class C     118.0 MB
The Sage configurations span an increasing range of memory footprints.
Talk Overview
Goal
Fault-tolerance for Scientific Computing
Methodology
Characterization of Scientific Applications
Performance Evaluation of Incremental Checkpointing: Bandwidth and Scalability
Characterization
[Figure: IWS size (MB) vs. execution time (s) for Sage-1000MB, showing a data-initialization phase followed by regular processing bursts]
Communication
[Figures: IWS size (MB) and data received (MB) vs. execution time (s) for Sage-1000MB; regular communication bursts are interleaved with the processing bursts]
Fraction of the Memory Footprint Overwritten during the Main Iteration
[Figure: percentage (0 to 100%) of the memory footprint overwritten for Sage-1000MB, Sage-500MB, Sage-100MB, Sage-50MB, Sweep3D, SP, LU, BT, and FT; some codes overwrite the full memory footprint, others stay below it]
Bandwidth Requirements
[Figure: maximum and average bandwidth (MB/s) vs. timeslice (1, 5, 10, and 20 s) for Sage-1000MB; annotated values of 78.8 MB/s (maximum) and 12.1 MB/s (average)]
The required bandwidth decreases with longer timeslices.
Bandwidth Requirements for a 1-second Timeslice
[Figure: maximum and average bandwidth (MB/s) for Sage-1000MB, Sage-500MB, Sage-100MB, Sage-50MB, Sweep3D, SP, LU, BT, and FT, with a line marking single SCSI disk performance; Sage-1000MB is the most demanding]
The requirement increases with the memory footprint.
Increasing Memory Footprint Size
[Figure: average bandwidth (MB/s) vs. timeslice (1, 5, 10, and 20 s) for Sage footprints of 50 MB, 100 MB, 500 MB, and 1000 MB]
The bandwidth requirement increases sublinearly with footprint size.
Increasing Processor Count
[Figure: average bandwidth (MB/s) vs. timeslice (1, 5, 10, and 20 s) for 8, 16, 32, and 64 processors under weak scaling]
The per-process bandwidth requirement decreases slightly with processor count.
Technological Trends
[Figure: performance improvement per year for processor, memory, storage, and network technology; storage and network bandwidth increase at a faster pace]
The performance of applications is bounded by memory improvements.
Conclusions
Commodity cluster components impose no technological limitations on implementing automatic, frequent, and user-transparent incremental checkpointing
Current hardware technology can sustain the bandwidth requirements
These results can be generalized to future large-scale computers
Conclusions
The per-process bandwidth requirement decreases slightly with processor count and increases sublinearly with the memory footprint size
Improvements in networking and storage will make incremental checkpointing even more effective in the future