Application Level Fault Tolerance and Detection

Principal Investigators: C. Mani Krishna, Israel Koren
Presented By: Eric Ciocca

Architecture and Real-Time Systems (ARTS) Lab
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003


Post on 06-Jan-2016


What is ALFTD?

Application Level Fault Tolerance and Detection

ALFTD complements existing system-level or algorithm-level fault tolerance by leveraging information available only at the application level.
  - Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance.

ALFTD may be used alone or to supplement other fault detection schemes.


ALFTD Overview

Application Level Fault Tolerance and Detection allows for system survival of both data faults and system (instruction/hardware) faults.
  - System faults cause a process to eventually cease functioning.
  - Data faults cause a process to continue running with incorrect results.

ALFTD is scalable.
  - The level of fault tolerance can be traded off against the time overhead invested.


Principles of ALFTD

To provide system fault tolerance, every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node’s work (S, secondary).

If a fault should corrupt a process, the corresponding secondary of that task will still produce output, albeit at a lower (but acceptable) quality

Node      Primary   Secondary
Node 1    P1        S4
Node 2    P2        S1
Node 3    P3        S2
Node 4    P4        S3
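The node layout on this slide follows a ring: each node hosts its own primary plus the secondary of the node before it, with Node 1 wrapping around to host S4. A minimal Python sketch of that assignment follows; the function names `assign_tasks` and `host_of_secondary` are illustrative and do not come from the original implementation.

```python
def assign_tasks(num_nodes):
    """Map each 1-based node number to (primary, secondary) labels.

    Node i runs primary P_i and the secondary of its predecessor in the
    ring, so Node 1 wraps around and hosts the secondary of the last node.
    """
    layout = {}
    for node in range(1, num_nodes + 1):
        neighbor = num_nodes if node == 1 else node - 1
        layout[node] = (f"P{node}", f"S{neighbor}")
    return layout


def host_of_secondary(task, num_nodes):
    """Return the node that hosts the secondary copy of primary `task`."""
    return 1 if task == num_nodes else task + 1


layout = assign_tasks(4)
for node, (primary, secondary) in layout.items():
    print(f"Node {node}: {primary} {secondary}")
```

Because a primary and its secondary never share a node, a fault that takes down one node leaves a scaled-down copy of that node's work running elsewhere.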


Principles of ALFTD

The secondary processes can be scaled down by
  - reducing the resolution of input data
  - reducing the precision of calculations
  - heuristically predicting results from previous iterations’ output

In some applications the secondary can be run optionally, on an as-needed basis:
  - if the corresponding primary is approaching a deadline miss
  - if the corresponding primary has been incapacitated
  - if the corresponding primary has produced faulty data

If faults are infrequent, an optional secondary will incur very little additional overhead.
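The three trigger conditions for an optional secondary can be sketched as a single predicate. This is a hedged illustration only: the `PrimaryStatus` fields, the `safety_margin` parameter, and the simple expected-overhead estimate are assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class PrimaryStatus:
    alive: bool            # False if the primary has been incapacitated
    elapsed: float         # time spent so far (hypothetical units)
    deadline: float        # time budget for the task
    output_suspect: bool   # True if the primary's data looks faulty


def should_run_secondary(status, safety_margin=0.1):
    """Return True when any of the three as-needed conditions holds."""
    if not status.alive:
        return True
    if status.output_suspect:
        return True
    # Approaching a deadline miss: less than a (hypothetical) 10% of the
    # budget remains before the deadline.
    remaining = status.deadline - status.elapsed
    return remaining <= safety_margin * status.deadline


def expected_overhead(trigger_probability, secondary_cost):
    """Average extra cost, as a fraction of the primary's cost, when the
    secondary only runs with probability `trigger_probability`."""
    return trigger_probability * secondary_cost


print(should_run_secondary(PrimaryStatus(True, 1.0, 10.0, False)))
print(should_run_secondary(PrimaryStatus(True, 9.5, 10.0, False)))
```

Under these assumptions, a secondary that costs 25% of the primary and is triggered in 1% of runs adds about 0.25% overhead on average, which is the sense in which infrequent faults make the optional secondary cheap.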


ALFTD in OTIS

ALFTD was implemented in OTIS (Orbital Thermal Imaging Spectrometer) to test its viability as a fault tolerance and detection scheme

OTIS, part of the REE (Remote Exploration and Experimentation) program group from JPL, is intended to run on orbiting satellites

OTIS processes radiation data of a geographic area from a sensor array [input] and produces temperature and emissivity data [output]


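OTIS Structure

[Diagram: MPI launches one master process (M) and three slave processes (S); results flow to OUTPUT. The numbered arrows correspond to the steps below.]

1. MPI starts
2. MPI starts slave and master processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results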
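The master/slave flow in the OTIS structure can be sketched without MPI by letting a thread pool stand in for the slave processes. This is only an illustration of the task flow: `split_rows`, `worker`, and `run_master` are hypothetical names, and the doubling in `worker` is a placeholder for the real OTIS calculation, which the slides do not describe.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, num_workers):
    """Master side: cut the rows into at most `num_workers` contiguous tasks."""
    if not rows:
        return []
    size = -(-len(rows) // num_workers)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def worker(task):
    """Slave side: placeholder calculation applied to every value in the task."""
    return [[2.0 * value for value in row] for row in task]


def run_master(rows, num_workers=3):
    """Send tasks to the workers, then gather results in the original row order."""
    tasks = split_rows(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(worker, tasks))  # map preserves task order
    return [row for part in results for row in part]


print(run_master([[1.0], [2.0], [3.0], [4.0]]))
```

In a real MPI program the thread pool would be replaced by explicit sends and receives between ranks, but the split, compute, and gather phases keep the same shape.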


ALFTD in OTIS (cont’d)

ALFTD is suited for remote applications.
  - As a software-based fault handling mechanism, it requires no extra hardware.
  - The scaled secondaries require less power than full software redundancy.

In OTIS, and other applications, ALFTD is passive, only requiring extra runtime in a fault case.


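ALFTD OTIS Structure

[Diagram: MPI launches the master (M), the primary processes (P1, P2, P3), and the secondary processes (S1, S2, S3); results flow to OUTPUT. One element of the diagram is labeled "?". The numbered arrows correspond to the steps below.]

1. MPI starts
2. MPI starts master and slaves, primary and secondary processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results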


Secondaries in OTIS

The secondary required for ALFTD is implemented to be functionally similar to the primary.

Secondary scaling occurs through resolution reduction.
  - OTIS’ “natural” temperature data input exhibits spatial locality.
  - Points not directly calculated can be estimated by interpolating between calculated points.

Secondary processes have been tested at 20%-50% of the primary calculation overhead.
  - While 50% affords better quality, 20% has less overhead.
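Resolution reduction with interpolation can be sketched as follows: the secondary computes only every `step`-th row and fills the rows in between by linear interpolation. This is a minimal sketch under stated assumptions; the row-wise sampling, the linear interpolation, and the names `secondary_rows` and `compute_row` are illustrative choices, not the documented OTIS implementation.

```python
def secondary_rows(rows, step, compute_row):
    """Compute every `step`-th row (plus the last row) with `compute_row`
    and linearly interpolate the rows in between."""
    n = len(rows)
    if n == 0:
        return []
    anchors = list(range(0, n, step))
    if anchors[-1] != n - 1:
        anchors.append(n - 1)  # always compute the last row so every gap is bracketed
    out = [None] * n
    for i in anchors:
        out[i] = compute_row(rows[i])
    for lo, hi in zip(anchors, anchors[1:]):
        for i in range(lo + 1, hi):
            t = (i - lo) / (hi - lo)
            out[i] = [(1 - t) * a + t * b for a, b in zip(out[lo], out[hi])]
    return out


# Hypothetical input whose values change gradually from row to row.
rows = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(secondary_rows(rows, step=2, compute_row=lambda r: list(r)))
```

With `step=2` roughly half of the rows are actually computed, and with `step=4` roughly a quarter. The estimate is only as good as the spatial locality of the data: a sharp feature that falls entirely between two computed rows is smoothed away.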


Example of Secondary Resolution

[Figure: ALFTD compensation for 10 rows in a sample dataset, shown at 100%, 50%, 33%, and 25% secondary resolution]


Fault Detection

Output filters on the primary data determine when secondary validation is required.

Output filters are created to check for application-specific trends in data.
  - Aberrations from normal data characteristics can be considered to be the product of potentially faulty processes.

OTIS relies on natural temperature characteristics to detect potentially faulty data.
  - Spatial Locality: temperature changes gradually over small areas
  - Absolute Bounds: temperature should not exceed certain values
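The two OTIS checks, absolute bounds and spatial locality, can be sketched as simple filters over a rectangular grid of temperatures. The thresholds below (`low`, `high`, `max_step`, in kelvin) are hypothetical values chosen for the example; the slides do not give the limits actually used.

```python
def out_of_bounds(grid, low, high):
    """Count points whose temperature falls outside [low, high]."""
    return sum(1 for row in grid for t in row if not (low <= t <= high))


def locality_violations(grid, max_step):
    """Count horizontally or vertically adjacent pairs that differ by more
    than `max_step`. Assumes every row of `grid` has the same length."""
    count = 0
    for r, row in enumerate(grid):
        for c, t in enumerate(row):
            if c + 1 < len(row) and abs(t - row[c + 1]) > max_step:
                count += 1
            if r + 1 < len(grid) and abs(t - grid[r + 1][c]) > max_step:
                count += 1
    return count


def needs_secondary(grid, low=150.0, high=400.0, max_step=15.0):
    """Flag the primary output for secondary validation if either filter trips."""
    return out_of_bounds(grid, low, high) > 0 or locality_violations(grid, max_step) > 0


smooth = [[290.0, 291.0], [290.5, 291.5]]
spiky = [[290.0, 900.0], [290.5, 291.5]]
print(needs_secondary(smooth), needs_secondary(spiky))
```

A filter of this kind can only flag data as suspicious; legitimately turbulent regions may also trip the locality check, which is why flagged output is validated against a secondary rather than discarded outright.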


Fault Detection (cont’d)

After the secondary has been run to validate a primary’s results, the “better” data is chosen according to the following logic grid:

                          Primary Results
Secondary Results     Faultless    Ambiguous    Faulty
Faultless             Primary      Secondary    Secondary
Ambiguous             Primary      Primary      Secondary
Faulty                Primary      Primary      Primary*
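Read cell by cell, the logic grid reduces to one rule: the secondary's data is chosen only when its verdict is strictly better than the primary's, and ties go to the primary. A short Python sketch of that rule follows; the `Verdict` enum and `choose_output` are illustrative names, and the footnote attached to the Faulty/Faulty cell (marked with an asterisk in the grid) is not present in the transcript, so it is not modeled here.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Lower value means better data quality.
    FAULTLESS = 0
    AMBIGUOUS = 1
    FAULTY = 2


def choose_output(primary, secondary):
    """Pick the secondary only if it is strictly better than the primary."""
    return "secondary" if secondary < primary else "primary"


for secondary in Verdict:
    row = [choose_output(primary, secondary) for primary in Verdict]
    print(f"{secondary.name:<10}", row)
```

Preferring the primary on ties is consistent with the secondary being a reduced-resolution approximation: when both look equally trustworthy, the full-resolution result carries more information.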


Data Sets

Three data sets were chosen for their interesting characteristics:
  - “Blob”: broad, unchanging areas with dark spots
  - “Stripe”: relatively undynamic except for one “stripe”
  - “Spots”: turbulent spots may defy “spatial locality” predictions


Fault Tolerance Results: “Spots”

[Figure: fault tolerance with injected faults in “Spots”]


Fault Tolerance Results: “Spots” (cont’d)

[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]


Fault Tolerance Results: “Blob”

[Figure: fault tolerance with injected faults in “Blob”]


Fault Tolerance Results: “Blob” (cont’d)

[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]


Fault Tolerance Results: “Stripe”

[Figure: difference plots of faulty output versus faultless output, with no ALFTD and with 25%, 33%, and 50% ALFTD computation overhead; the color scale runs from "No Error" to "Max Error"]


Fault Tolerance Results: “Stripe” (cont’d)

[Figure panels: fault-free output; faulty output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD computation overhead]


Conclusion / Future Work

ALFTD has been shown to be a cost-effective alternative to full redundancy

Improvements on the scheme will increase fault coverage and decrease secondary calculation overhead

OTIS has general application characteristics that will make its implementation a springboard to other, similar programs

ALFTD should continue to be effective in any programs that have predictable data characteristics


Thank You!

For additional information, please contact Eric Ciocca ([email protected]), Israel Koren ([email protected]), or C. Mani Krishna ([email protected]).