Application Level Fault Tolerance and Detection
Principal Investigators: C. Mani Krishna, Israel Koren
Presented By: Eric Ciocca
Architecture and Real-Time Systems (ARTS) Lab, Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003
What is ALFTD?
ALFTD stands for Application Level Fault Tolerance and Detection
ALFTD complements existing system- or algorithm-level fault tolerance by leveraging information available only at the application level
Using such application-level semantic information significantly reduces the overall cost of providing fault tolerance
ALFTD may be used alone or to supplement other fault detection schemes
ALFTD Overview
Application Level Fault Tolerance and Detection allows the system to survive both data faults and system (instruction/hardware) faults
  System faults cause a process to eventually cease functioning
  Data faults cause a process to continue running with incorrect results
ALFTD is scalable
  The level of fault tolerance can be traded off against the time overhead invested
Principles of ALFTD
To provide system fault tolerance, every physical node runs its own work (P, primary) as well as a scaled-down copy of a neighboring node’s work (S, secondary)
If a fault should corrupt a process, the corresponding secondary of that task will still produce output, albeit at a lower (but acceptable) quality
Node 1: P1, S4
Node 2: P2, S1
Node 3: P3, S2
Node 4: P4, S3
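The node layout above can be sketched in a few lines of Python. This is only an illustration, assuming a ring in which each node hosts the secondary of the node before it, as in the four-node example. The function names `assign_processes` and `output_source` are hypothetical and do not come from the OTIS code.

```python
def assign_processes(num_nodes):
    """Return {node: (primary, secondary)} for nodes numbered from 1.

    Assumption: each node backs up its predecessor in the ring, which
    reproduces the slide's layout (Node 1 runs P1 and S4, and so on).
    """
    if num_nodes < 2:
        raise ValueError("a secondary must run on a different node than its primary")
    layout = {}
    for node in range(1, num_nodes + 1):
        backed_up = num_nodes if node == 1 else node - 1
        layout[node] = (f"P{node}", f"S{backed_up}")
    return layout


def output_source(layout, failed_node):
    """Map each task to the process that still produces its output
    when exactly one node has failed."""
    num_nodes = len(layout)
    sources = {}
    for task in layout:
        if task != failed_node:
            # The primary's node is healthy, so the primary output is used.
            sources[task] = layout[task][0]
        else:
            # The task's secondary lives on the next node in the ring.
            host = 1 if task == num_nodes else task + 1
            sources[task] = layout[host][1]
    return sources


layout = assign_processes(4)
print(layout)
print(output_source(layout, failed_node=2))
```

With Node 2 down, task 2 is covered by S2 on Node 3, while the other tasks keep their primaries.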
Principles of ALFTD (cont’d)
The secondary processes can be scaled down by
  reducing the resolution of input data
  reducing the precision of calculations
  heuristically predicting results from previous iterations’ output
In some applications the secondary can be run optionally, on an as-needed basis
  If the corresponding primary is approaching a deadline miss
  If the corresponding primary has been incapacitated
  If the corresponding primary has produced faulty data
If faults are infrequent, an optional secondary will incur very little additional overhead
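The three triggers for an optional secondary can be written as a single predicate. The sketch below is hypothetical: the `PrimaryStatus` fields and the rule for "approaching a deadline miss" (start the secondary once the remaining time is no longer than the secondary's own expected runtime) are assumptions made for illustration, not details from the presentation.

```python
from dataclasses import dataclass


@dataclass
class PrimaryStatus:
    alive: bool            # False if the primary has been incapacitated
    elapsed: float         # seconds already spent on the current task
    deadline: float        # seconds allowed for the task
    output_suspect: bool   # True if the primary's output looks faulty


def should_run_secondary(status, secondary_runtime):
    """Decide whether the optional secondary needs to run.

    secondary_runtime is the expected runtime of the scaled-down
    secondary in seconds (an assumed, application-supplied estimate).
    """
    if not status.alive:
        return True
    if status.output_suspect:
        return True
    # Assumed deadline rule: start once waiting any longer would leave
    # too little time for the secondary to finish before the deadline.
    remaining = status.deadline - status.elapsed
    return remaining <= secondary_runtime


healthy = PrimaryStatus(alive=True, elapsed=1.0, deadline=10.0, output_suspect=False)
late = PrimaryStatus(alive=True, elapsed=8.5, deadline=10.0, output_suspect=False)
print(should_run_secondary(healthy, secondary_runtime=2.0))
print(should_run_secondary(late, secondary_runtime=2.0))
```

In the fault-free case the predicate stays false and the secondary costs nothing beyond the check itself.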
ALFTD in OTIS
ALFTD was implemented in OTIS (Orbital Thermal Imaging Spectrometer) to test its viability as a fault tolerance and detection scheme
OTIS, part of the REE (Remote Exploration and Experimentation) program group from JPL, is intended to run on orbiting satellites
OTIS processes radiation data of a geographic area from a sensor array [input] and produces temperature and emissivity data [output]
OTIS Structure
1. MPI starts
2. MPI starts master and slave processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results
[Diagram: MPI, a master (M), three slaves (S), and OUTPUT; the arrows numbered 1-5 correspond to the steps above]
ALFTD in OTIS (cont’d)
ALFTD is suited for remote applications
  As a software-based fault handling mechanism, it requires no extra hardware
  The scaled secondaries require less power than full software redundancy
In OTIS, and other applications, ALFTD is passive, only requiring extra runtime in a fault case.
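ALFTD OTIS Structure
1. MPI starts
2. MPI starts master and slaves, primary and secondary processes
3. Master sends tasks
4. Slave calculations
5. Slave returns results
[Diagram: MPI, a master (M), primaries P1-P3, secondaries S1-S3, a node marked "?", and OUTPUT; the arrows numbered 1-5 correspond to the steps above]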
Secondaries in OTIS
The secondary required for ALFTD is implemented to be functionally similar to the primary
Secondary scaling occurs through resolution reduction
  OTIS’ “natural” temperature data input exhibits spatial locality
  Points not directly calculated can be approximately estimated using interpolation between calculated points
Secondary processes have been tested at 20%-50% of the primary calculation overhead
  While 50% affords better quality, 20% has less overhead
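Resolution reduction with interpolation can be illustrated as follows. This is a minimal sketch, assuming the secondary computes every k-th row of the output and fills the rows in between by linear interpolation; the slides do not state which interpolation OTIS uses, and the names `secondary_rows` and `interpolate_rows` are hypothetical.

```python
def secondary_rows(num_rows, fraction):
    """Rows the secondary computes directly, e.g. fraction=0.5 keeps
    every second row (assumed sampling scheme)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    step = max(1, round(1 / fraction))
    return list(range(0, num_rows, step))


def interpolate_rows(computed, num_rows):
    """Fill in rows that were not computed.

    computed maps a row index to its list of values. Rows between two
    computed rows are linearly interpolated; rows past the last computed
    row copy the nearest computed row.
    """
    if not computed:
        raise ValueError("at least one computed row is required")
    known = sorted(computed)
    result = []
    for r in range(num_rows):
        if r in computed:
            result.append(list(computed[r]))
            continue
        below = [k for k in known if k < r]
        above = [k for k in known if k > r]
        if below and above:
            lo, hi = below[-1], above[0]
            w = (r - lo) / (hi - lo)
            result.append([(1 - w) * a + w * b
                           for a, b in zip(computed[lo], computed[hi])])
        else:
            nearest = below[-1] if below else above[0]
            result.append(list(computed[nearest]))
    return result


print(secondary_rows(10, 0.5))
rows = interpolate_rows({0: [10.0, 20.0], 2: [20.0, 40.0]}, num_rows=4)
print(rows)
```

A smaller fraction means fewer directly computed rows and more interpolated ones, which is the quality-versus-overhead trade-off described above.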
Example of Secondary Resolution
(ALFTD Compensation for 10 rows in a sample dataset)
[Figure panels: 100%, 50%, 33%, and 25% Secondary Resolution]
Fault Detection
Output filters on the primary data determine when secondary validation is required
Output filters are created to check for application-specific trends in data
  Aberrations from normal data characteristics can be considered to be the product of potentially faulty processes
OTIS relies on natural temperature characteristics to detect potentially faulty data
  Spatial Locality: temperature changes gradually over small areas
  Absolute Bounds: temperature should not exceed certain values
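The two OTIS filters can be sketched as simple checks over a rectangular grid of temperatures. The thresholds (`low`, `high`, `max_step`) and function names here are hypothetical; the presentation does not give the actual values or the exact comparison used.

```python
def out_of_bounds(grid, low, high):
    """Absolute Bounds: coordinates of values outside [low, high]."""
    return [(r, c)
            for r, row in enumerate(grid)
            for c, v in enumerate(row)
            if not (low <= v <= high)]


def locality_violations(grid, max_step):
    """Spatial Locality: coordinates of cells that differ from an
    adjacent cell by more than max_step.

    Both cells of an offending pair are flagged, so a single bad value
    also flags its neighbours. That is acceptable here because the
    filter only decides whether secondary validation is needed.
    """
    flagged = set()
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            # Check the right and lower neighbour so each pair is seen once.
            if c + 1 < len(row) and abs(v - row[c + 1]) > max_step:
                flagged.update({(r, c), (r, c + 1)})
            if r + 1 < len(grid) and abs(v - grid[r + 1][c]) > max_step:
                flagged.update({(r, c), (r + 1, c)})
    return sorted(flagged)


def needs_secondary_validation(grid, low, high, max_step):
    """True if either filter finds potentially faulty data."""
    return bool(out_of_bounds(grid, low, high)
                or locality_violations(grid, max_step))


smooth = [[290.0, 291.0], [290.5, 291.5]]
spike = [[290.0, 291.0], [290.5, 900.0]]
print(needs_secondary_validation(smooth, low=200.0, high=350.0, max_step=5.0))
print(needs_secondary_validation(spike, low=200.0, high=350.0, max_step=5.0))
```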
Fault Detection (cont’d)
After the secondary has been run to validate a primary’s results, the “better” data is chosen according to the following logic grid:
                     Primary Results
Secondary Results    Faultless    Ambiguous    Faulty
Faultless            Primary      Secondary    Secondary
Ambiguous            Primary      Primary      Secondary
Faulty               Primary      Primary      Primary*
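Read with primary results as columns and secondary results as rows, the grid reduces to one rule: use the secondary only when its result is strictly better than the primary's. The sketch below encodes that reading; the status strings and the name `choose_output` are illustrative, and the case marked with an asterisk on the slide (both results faulty) is returned as plain "primary" because the slide's footnote is not available.

```python
# Lower rank means more trustworthy data.
RANK = {"faultless": 0, "ambiguous": 1, "faulty": 2}


def choose_output(primary_status, secondary_status):
    """Pick which result to keep after the secondary has validated
    the primary, following the logic grid."""
    if RANK[secondary_status] < RANK[primary_status]:
        return "secondary"
    # Ties and worse secondaries both fall back to the primary,
    # which was computed at full resolution.
    return "primary"


for secondary in RANK:
    print(secondary, [choose_output(primary, secondary) for primary in RANK])
```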
Data Sets
Three data sets were chosen for their interesting characteristics
“Blob”: Broad, unchanging areas with dark spots
“Stripe”: Relatively undynamic except for one “stripe”
“Spots”: Turbulent spots may defy “spatial locality” predictions
Fault Tolerance Results: “Spots”
[Figure: Fault Tolerance with injected faults in “Spots”]
Fault Tolerance Results: “Spots” (cont’d)
[Figure panels: Fault-Free Output; Faulty Output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
Fault Tolerance Results: “Blob”
[Figure: Fault Tolerance with injected faults in “Blob”]
Fault Tolerance Results: “Blob” (cont’d)
[Figure panels: Fault-Free Output; Faulty Output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
Fault Tolerance Results: “Stripe”
Difference Plots: faulty output versus faultless output
[Figure panels: No ALFTD; 25%, 33%, and 50% ALFTD Computation Overhead; scale from No Error to Max Error]
Fault Tolerance Results: “Stripe” (cont’d)
[Figure panels: Fault-Free Output; Faulty Output; ALFTD-corrected faulty output at 25%, 33%, and 50% ALFTD Computation Overhead]
Conclusion / Future Work
ALFTD has been shown to be a cost-effective alternative to full redundancy
Improvements on the scheme will increase fault coverage and decrease secondary calculation overhead
OTIS has general application characteristics that will make its implementation a springboard to other, similar programs
ALFTD should continue to be effective in any programs that have predictable data characteristics
Thank You!
For additional information, please contact Eric Ciocca ([email protected]), Israel Koren ([email protected]), or C. Mani Krishna ([email protected])