TRANSCRIPT
Computer and Computational Sciences Division, Los Alamos National Laboratory
On the Feasibility of Incremental Checkpointing for Scientific Computing
Jose Carlos Sancho
with Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg
Performance and Architectures Lab (PAL)
Jose C. Sancho, [email protected]
International Parallel & Distributed Processing Symposium, Santa Fe, NM
Talk Overview
Goal
Fault-tolerance for Scientific Computing
Methodology
Characterization of Scientific Applications
Performance Evaluation of Incremental Checkpointing
Concluding Remarks
Goal
Prove the Feasibility of Incremental Checkpointing
Frequent
Automatic
User-transparent
No changes to application
No special hardware support
Large Scale Computers
Large component count
Strongly coupled hardware
Example scale: 133,120 processors, 608,256 DRAM chips
[Figure: failure rate grows with component count]
Scientific Computing
Running for months
Demands high capability
Failures expected during the application’s execution
Providing Fault-tolerance
Hardware replication
+
Checkpointing and rollback recovery
High-cost solution!
[Figure: checkpointing to a spare node, then recovery after a failure]
Checkpointing and Recovery
Simplicity: easy implementation
Cost-effective: no additional hardware support
Critical aspect: the bandwidth required for saving the process state
Reducing Bandwidth
Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage (a rough cost comparison follows below)
[Figure: a full checkpoint saves the whole process state; an incremental checkpoint saves only the modified portion]
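To make the saving concrete, here is a toy C sketch comparing the data written by full versus incremental checkpointing over a run. The 954.6 MB footprint and 78.8 MB worst-case modified set are the Sage-1000MB figures reported later in this talk; the checkpoint count is an arbitrary illustration, not a measured value.

#include <stdio.h>

int main(void)
{
    const double footprint_mb = 954.6; /* full process state (Sage-1000MB)  */
    const double modified_mb  = 78.8;  /* worst-case data dirtied between
                                          two consecutive checkpoints       */
    const int    checkpoints  = 100;   /* arbitrary, for illustration only  */

    printf("full:        %10.1f MB written\n", checkpoints * footprint_mb);
    printf("incremental: %10.1f MB written\n", checkpoints * modified_mb);
    return 0;
}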
New Challenges
Frequent checkpoints: minimizing the rollback interval to increase system availability
Automatic and user-transparent: autonomic computing, a new vision for managing the high complexity of large systems through self-healing and self-repairing
Both put more pressure on bandwidth
Survey of Implementation Levels
Application: CLIP, Dome, CCITF
Run-time library: Ickp, CoCheck, Diskless
Operating system: just a few!
Hardware: Revive, Safetynet
Enabling Automatic Checkpointing
[Figure: moving from the application level down through the run-time library and operating system to hardware, user intervention goes from high to low (more automatic) while the amount of checkpoint data goes from low to high]
The Bandwidth Challenge
Does the current technology provide enough bandwidth for checkpointing that is:
• Frequent
• Automatic
Methodology
Analyzing the memory footprint of scientific codes with a run-time library
[Figure: the application’s memory footprint (text, static data, heap, stack, and mmap regions) is tracked with mprotect(); a sketch of the mechanism follows below]
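A minimal sketch of how such a run-time library can track modified pages on Linux, assuming the conventional mprotect()/SIGSEGV technique named on this slide: write-protect the footprint, record each page on its first write, and save only the recorded pages at checkpoint time. All names here (protect_region, on_write, checkpoint) are illustrative, not the authors' actual API, and error handling is omitted for brevity.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 4096

static char  *region;            /* tracked memory region             */
static size_t region_len;        /* its length in bytes               */
static long   page_size;
static void  *dirty[MAX_PAGES];  /* pages written since the last ckpt */
static int    ndirty;

/* First write to a protected page faults here: record the page and
 * make it writable again. Real implementations must be far more
 * careful about async-signal safety than this sketch is. */
static void on_write(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));
    dirty[ndirty++] = page;
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

/* Write-protect the region so the next write to each page traps. */
static void protect_region(void)
{
    ndirty = 0;
    mprotect(region, region_len, PROT_READ);
}

/* Incremental checkpoint: save only the dirty pages, then re-protect. */
static void checkpoint(FILE *out)
{
    for (int i = 0; i < ndirty; i++)
        fwrite(dirty[i], page_size, 1, out);
    fflush(out);
    protect_region();
}

int main(void)
{
    page_size  = sysconf(_SC_PAGESIZE);
    region_len = 64 * page_size;
    region     = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = { .sa_sigaction = on_write,
                            .sa_flags     = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    protect_region();
    region[0]              = 1;   /* dirties page 0  */
    region[10 * page_size] = 2;   /* dirties page 10 */

    FILE *out = fopen("ckpt.bin", "wb");
    int n = ndirty;
    checkpoint(out);              /* writes 2 pages, not all 64 */
    fclose(out);
    printf("saved %d of %zu pages\n", n, region_len / page_size);
    return 0;
}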
Methodology
Quantifying the bandwidth requirements
Checkpoint intervals: 1 s to 20 s
Comparing with the bandwidth currently available (a quick check of these numbers follows below):
900 MB/s sustained network bandwidth (Quadrics QsNet II)
75 MB/s sustained single-disk bandwidth (Ultra SCSI controller)
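As a back-of-the-envelope version of this comparison, the C sketch below divides the data modified per checkpoint interval by the interval length and checks the result against the two sustained rates above. The 78.8 MB modified-set figure is the Sage-1000MB worst case shown later in the talk; the assumption that the modified set stops growing at longer intervals is ours, for illustration.

#include <stdio.h>

int main(void)
{
    const double net_bw   = 900.0;  /* MB/s, Quadrics QsNet II (sustained) */
    const double disk_bw  = 75.0;   /* MB/s, single Ultra SCSI disk        */
    const double dirty_mb = 78.8;   /* MB modified per interval (Sage-1000MB
                                       worst case; assumed not to grow with
                                       longer intervals)                    */
    const double intervals[] = { 1.0, 5.0, 10.0, 20.0 };

    for (int i = 0; i < 4; i++) {
        double need = dirty_mb / intervals[i];   /* MB/s to sustain */
        printf("%5.0f s interval: %6.1f MB/s  disk: %s  network: %s\n",
               intervals[i], need,
               need <= disk_bw ? "ok" : "exceeded",
               need <= net_bw  ? "ok" : "exceeded");
    }
    return 0;
}

Note that at the 1 s interval the worst case slightly exceeds a single disk (78.8 vs. 75 MB/s) while remaining far below the network rate, which matches the "most demanding" annotation on a later slide.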
Experimental Environment
32-node Linux cluster
64 Itanium II processors, PCI-X I/O bus, Quadrics QsNet interconnection network
Parallel scientific codes: Sage, Sweep3D, and the NAS parallel benchmarks SP, LU, BT, and FT
Representative of the ASCI production codes at LANL
Memory Footprint
Application    Footprint
Sage-1000MB    954.6 MB
Sage-500MB     497.3 MB
Sage-100MB     103.7 MB
Sage-50MB       55.0 MB
Sweep3D        105.5 MB
SP Class C      40.1 MB
LU Class C      16.6 MB
BT Class C      76.5 MB
FT Class C     118.0 MB
The Sage configurations span an increasing range of memory footprints.
Talk Overview
Goal
Fault-tolerance for Scientific Computing
Methodology
Characterization of Scientific Applications
Performance Evaluation of Incremental Checkpointing: Bandwidth and Scalability
Characterization
[Figure: IWS size (MB) vs. execution time (s) for Sage-1000MB, showing a data-initialization phase followed by regular processing bursts]
Communication
[Figures: IWS size (MB) and data received (MB) vs. execution time (s) for Sage-1000MB; regular communication bursts are interleaved with the processing bursts]
Fraction of the Memory Footprint Overwritten during the Main Iteration
[Figure: percentage (0 to 100%) of the memory footprint overwritten for Sage-1000MB, Sage-500MB, Sage-100MB, Sage-50MB, Sweep3D, SP, LU, BT, and FT; some codes overwrite the full memory footprint, others stay below it]
Bandwidth Requirements
[Figure: maximum and average bandwidth (MB/s) vs. timeslice (1, 5, 10, and 20 s) for Sage-1000MB; annotated values of 78.8 MB/s (maximum) and 12.1 MB/s (average)]
The required bandwidth decreases with longer timeslices.
Bandwidth Requirements for a 1-second Timeslice
[Figure: maximum and average bandwidth (MB/s) for Sage-1000MB, Sage-500MB, Sage-100MB, Sage-50MB, Sweep3D, SP, LU, BT, and FT, with a line marking single SCSI disk performance; Sage-1000MB is the most demanding]
The requirement increases with the memory footprint.
Increasing Memory Footprint Size
[Figure: average bandwidth (MB/s) vs. timeslice (1, 5, 10, and 20 s) for Sage footprints of 50 MB, 100 MB, 500 MB, and 1000 MB]
The bandwidth requirement increases sublinearly with footprint size.
Increasing Processor Count
[Figure: average bandwidth (MB/s) vs. timeslice (1, 5, 10, and 20 s) for 8, 16, 32, and 64 processors under weak scaling]
The per-process bandwidth requirement decreases slightly with processor count.
Technological Trends
[Figure: performance improvement per year for processor, memory, storage, and network technology; storage and network bandwidth increase at a faster pace]
The performance of applications is bounded by memory improvements.
Conclusions
Commodity cluster components impose no technological limitations on implementing automatic, frequent, and user-transparent incremental checkpointing
Current hardware technology can sustain the bandwidth requirements
These results can be generalized to future large-scale computers
Conclusions
The per-process bandwidth requirement decreases slightly with processor count and increases sublinearly with the memory footprint size
Improvements in networking and storage will make incremental checkpointing even more effective in the future