Application Level Fault Tolerance and Detection

Principal Investigators: C. Mani Krishna, Israel Koren

Presented By: Eric Ciocca

Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering

University of Massachusetts, Amherst, MA 01003


What is ALFTD?


ALFTD complements existing system- or algorithm-level fault tolerance by leveraging information available only at the application level

  Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance

ALFTD may be used alone or to supplement other fault detection schemes


ALFTD Overview

Application Level Fault Tolerance and Detection allows for system survival of both data and system (instruction/hardware) faults

  System faults cause a process to eventually cease functioning

  Data faults cause a process to continue running with incorrect results

ALFTD is scalable

  The level of fault tolerance can be traded off with invested time overhead


Principles of ALFTD

To provide system fault tolerance, every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node’s work (S, secondary)

If a fault should corrupt a process, the corresponding secondary of that task will still produce output, albeit at a lower (but acceptable) quality

Node 1: P1, S4

Node 2: P2, S1

Node 3: P3, S2

Node 4: P4, S3
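The four-node arrangement above can be generated for any number of nodes. The following is a minimal sketch, assuming the pattern on the slide generalizes as a ring in which each node hosts the secondary of the node before it. The function names `ring_layout` and `surviving_copies` are illustrative and not part of ALFTD.

```python
def ring_layout(num_nodes):
    """Return, per node, the primary it owns and the secondary it hosts.

    Assumption: node i (1-based) hosts the secondary of node i-1, and
    node 1 hosts the secondary of the last node, as in the 4-node slide.
    """
    if num_nodes < 2:
        raise ValueError("need at least two nodes to host a neighbor's secondary")
    layout = []
    for node in range(1, num_nodes + 1):
        neighbor = num_nodes if node == 1 else node - 1
        layout.append((f"P{node}", f"S{neighbor}"))
    return layout


def surviving_copies(layout, failed_node):
    """List the task numbers that still have a live copy after one node fails."""
    alive = set()
    for index, (primary, secondary) in enumerate(layout, start=1):
        if index != failed_node:
            alive.add(primary[1:])    # strip the leading "P"
            alive.add(secondary[1:])  # strip the leading "S"
    return sorted(alive, key=int)
```

With this layout, losing any single node still leaves either the primary or the secondary of every task running on some other node.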


Principles of ALFTD (cont’d)

The secondary processes can be scaled down by

  reducing the resolution of input data

  reducing the precision of calculations

  heuristically predicting results from previous iterations’ output

In some applications the secondary can be run optionally on an as-needed basis

  If the corresponding primary is approaching a deadline miss

  If the corresponding primary has been incapacitated

  If the corresponding primary has produced faulty data

If faults are infrequent, an optional secondary will incur very little additional overhead
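The three as-needed conditions can be expressed as a single predicate. This is a sketch only: the `PrimaryStatus` fields, the `should_run_secondary` name, and the `deadline_margin` threshold are assumptions made for illustration, since the slides do not say how a near-miss of a deadline is measured.

```python
from dataclasses import dataclass


@dataclass
class PrimaryStatus:
    """Hypothetical snapshot of a primary, as seen by the secondary's scheduler."""
    alive: bool           # False if the primary has been incapacitated
    elapsed: float        # seconds spent on the current task so far
    deadline: float       # seconds allowed for the task
    output_suspect: bool  # True if the primary's output looks faulty


def should_run_secondary(status, deadline_margin=0.9):
    """Return True when any of the three as-needed conditions holds."""
    if not status.alive:
        return True
    if status.output_suspect:
        return True
    # Assumption: "approaching a deadline miss" means most of the budget is used.
    return status.elapsed >= deadline_margin * status.deadline
```

In the fault-free case none of the conditions fires, so the secondary never runs and adds no computation.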


ALFTD in OTIS

ALFTD was implemented in OTIS (Orbital Thermal Imaging Spectrometer) to test its viability as a fault tolerance and detection scheme

OTIS, part of the REE (Remote Exploration and Experimentation) program group from JPL, is intended to run on orbiting satellites

OTIS processes radiation data of a geographic area from a sensor array [input] and produces temperature and emissivity data [output]


OTIS Structure

[Diagram labels: MPI, master (M), three slaves (S), OUTPUT]

1. MPI Starts

2. MPI Starts Slave and master processes

3. Master sends tasks

4. Slave Calculations

5. Slave Returns Results

ALFTD in OTIS (cont’d)

ALFTD is suited for remote applications

  As a software-based fault handling mechanism, it requires no extra hardware

  The scaled secondaries require less power than full software redundancy

In OTIS, and other applications, ALFTD is passive, only requiring extra runtime in a fault case.


ALFTD OTIS Structure

[Diagram labels: MPI, master (M), primaries P1, P2, P3, secondaries S1, S2, S3, “?”, OUTPUT]

1. MPI Starts

2. MPI Starts master and slaves, primary and secondary processes

3. Master sends tasks

4. Slave Calculations

5. Slave Returns Results


Secondaries in OTIS

The secondary required for ALFTD is implemented to be functionally similar to the primary

Secondary scaling occurs through resolution reduction

  OTIS’ “natural” temperature data input exhibits spatial locality

  Points not directly calculated can be estimated using interpolation between calculated points

Secondary processes have been tested at 20%-50% of the primary calculation overhead

  While 50% affords better quality, 20% has less overhead
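Resolution reduction with interpolation can be illustrated on a single row of data. This is a minimal sketch, not the OTIS secondary: it assumes linear interpolation along one dimension, and the names `secondary`, `compute`, and `stride` are illustrative. Under the further assumption that cost scales with the number of points computed, a stride of 2 corresponds to roughly 50% of the primary's work and a stride of 5 to roughly 20%.

```python
def secondary(samples, compute, stride):
    """Compute every `stride`-th point exactly and interpolate the rest linearly."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    n = len(samples)
    if n == 0:
        return []
    # Always include the last point so every gap has two computed endpoints.
    anchors = sorted(set(range(0, n, stride)) | {n - 1})
    out = [0.0] * n
    for i in anchors:
        out[i] = float(compute(samples[i]))
    # Fill each gap between two computed points by linear interpolation.
    for left, right in zip(anchors, anchors[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            out[i] = out[left] + t * (out[right] - out[left])
    return out
```

The estimate is only as good as the spatial locality of the data: where the true values change smoothly between computed points the interpolation is close, and where they change abruptly it is not.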


Example of Secondary Resolution

[Figure: ALFTD compensation for 10 rows in a sample dataset. Panel: 100% Secondary Resolution]





Fault Detection

Output filters on the primary data determine when secondary validation is required

Output filters are created to check for application-specific trends in data

  Aberrations from normal data characteristics can be considered to be the product of potentially faulty processes

OTIS relies on natural temperature characteristics to detect potentially faulty data

  Spatial Locality: temperature changes gradually over small areas

  Absolute Bounds: temperature should not exceed certain values
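The two OTIS checks can be sketched as simple output filters over one row of temperatures. The thresholds below (`low`, `high`, `max_step`) are made-up values chosen for illustration, not the ones used in OTIS, and the function names are hypothetical.

```python
def absolute_bounds_ok(values, low, high):
    """Absolute Bounds: every temperature must lie within [low, high]."""
    return all(low <= v <= high for v in values)


def spatial_locality_ok(values, max_step):
    """Spatial Locality: neighboring temperatures may differ by at most max_step."""
    return all(abs(b - a) <= max_step for a, b in zip(values, values[1:]))


def needs_secondary_validation(values, low=150.0, high=400.0, max_step=5.0):
    """Flag a row of primary output as suspect if either filter fails.

    The default thresholds are illustrative assumptions (in kelvin).
    """
    return not (absolute_bounds_ok(values, low, high)
                and spatial_locality_ok(values, max_step))
```

A row that passes both filters is accepted as is; a row that fails either one is a candidate for validation by the secondary.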


Fault Detection (cont’d)

After the secondary has been run to validate a primary’s results, the “better” data is chosen according to the following logic grid:

Columns: Primary Results. Rows: Secondary Results.

Secondary \ Primary | Faultless | Ambiguous | Faulty
Faultless           | Primary   | Secondary | Secondary
Ambiguous           | Primary   | Primary   | Secondary
Faulty              | Primary   | Primary   | Primary*
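The logic grid can be written as a lookup table. The sketch below assumes the grid reads with the primary's verdict across the columns and the secondary's verdict down the rows, so that the secondary is chosen only when its verdict is strictly better than the primary's. The asterisk on the case where both are faulty is not modeled, and the names `Verdict`, `GRID`, and `choose_output` are illustrative.

```python
from enum import Enum


class Verdict(Enum):
    FAULTLESS = 0
    AMBIGUOUS = 1
    FAULTY = 2


F, A, X = Verdict.FAULTLESS, Verdict.AMBIGUOUS, Verdict.FAULTY

# Keys are (primary verdict, secondary verdict); values name the data to keep.
GRID = {
    (F, F): "primary", (A, F): "secondary", (X, F): "secondary",
    (F, A): "primary", (A, A): "primary",   (X, A): "secondary",
    (F, X): "primary", (A, X): "primary",   (X, X): "primary",
}


def choose_output(primary, secondary):
    """Return "primary" or "secondary" according to the selection grid."""
    return GRID[(primary, secondary)]
```

Under this reading the whole grid reduces to one rule: keep the primary's data unless the secondary's verdict is strictly better.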


Data Sets

Three data sets were chosen for their interesting characteristics

“Blob”: Broad, unchanging areas with dark spots

“Stripe”: Relatively undynamic except for one “stripe”

“Spots”: Turbulent spots may defy “spatial locality” predictions


Fault Tolerance Results: “Spots”

Fault Tolerance with injected faults in “Spots”


Fault Tolerance Results: “Spots” (cont’d)

[Figure panels: Fault-Free Output; Faulty Output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]


Fault Tolerance Results: “Blob”

Fault Tolerance with injected faults in “Blob”


Fault Tolerance Results: “Blob” (cont’d)

[Figure panels: Faulty Output; Fault-Free Output]



Fault Tolerance Results: “Stripe”

[Figure: Difference plots, faulty output versus faultless output. Panels: No ALFTD; 25%, 33%, and 50% ALFTD Computation Overhead. Scale: No Error to Max Error]


Fault Tolerance Results: “Stripe” (cont’d)

[Figure panels: Faulty Output; Fault-Free Output]



Conclusion / Future Work

ALFTD has been shown to be a cost-effective alternative to full redundancy

Improvements on the scheme will increase fault coverage and decrease secondary calculation overhead

OTIS has general application characteristics that will make its implementation a springboard to other, similar programs

ALFTD should continue to be effective in any programs that have predictable data characteristics


Thank You!

For additional information, please contact

  Eric Ciocca ([email protected])

  Israel Koren ([email protected])

  C. Mani Krishna ([email protected])
