Designing Fault Management in Spaceflight Architectures
Chris J. Walter, WW Technology Group, [email protected], (410) 418-4353
The Dependability Solution Provider TM. WW Technology Group © Copyright 2015. All rights reserved.

Uploaded by cecily-long; posted on 19-Jan-2016



Page 2

Challenges
• NASA architectures are affected by trends in current computing architectures:
  – Network-centric designs
  – Security vulnerabilities
  – Lower voltages
  – SWaP (size, weight, and power)
  – Code reuse
• NASA demands:
  – Higher onboard processing
  – Reusable missions and fault tolerance

Page 3

Future Spacecraft Onboard Computing Needs

Vision-based Algorithms with Real-Time Requirements
• Mission need: Terrain Relative Navigation; Hazard Avoidance; Entry, Descent & Landing; Pinpoint Landing
• Objective of computation: conduct safe proximity operations around primitive bodies; land safely and accurately; achieve robust results within the available timeframe as input to control decisions
• Flight architecture attribute: severe fault tolerance and real-time requirements; fail-operational; high peak power needs

Model-Based Reasoning Techniques for Autonomy
• Mission need: mission planning, scheduling & resource management; fault management in uncertain environments
• Objective of computation: contingency planning to mitigate execution failures; detect, diagnose, and recover from faults
• Flight architecture attribute: high computational complexity; graceful degradation; memory usage (data movement) impacts energy management

High Rate Instrument Data Processing
• Mission need: high-resolution sensors, e.g., SAR, hyperspectral
• Objective of computation: downlink images and products rather than raw data; opportunistic science
• Flight architecture attribute: distributed, dedicated processors at sensors; less stringent fault tolerance

Source: results from NASA study on High Performance Space Computing (HPSC)

Page 4

Future Spacecraft Onboard Computing Needs (Computation Category: Flight Architecture Attribute)
• Vision-based Algorithms with Real-Time Requirements: severe fault tolerance and real-time requirements; fail-operational; high peak power needs
• Model-Based Reasoning Techniques for Autonomy: high computational complexity; graceful degradation; energy management
• High Rate Instrument Data Processing: distributed, dedicated processors at sensors; less stringent fault tolerance

Page 5

Large Scale “System-of-Systems”
[Figure legend: Processing Node, Processing Cluster, Constellation Cluster, Communication Link]

Page 6

WWTG Has Evolved a Vision for Highly Reliable Distributed Systems
• Our vision defines a system framework coupled with a middleware infrastructure that facilitates the deployment of robust, autonomous distributed systems.
• Features of our approach include:
  – Scalability: system size, complexity, and dependability
  – Flexibility: system composition and system functionality
  – Integrity: analyzable and verifiable system
  – Heterogeneity: diversity in hardware and software components
• These properties are provided by a cluster-based infrastructure that is applicable to many domains:
  – Embedded control systems
  – Distributed information systems

Page 7

Scalable “Systems Approach”
• Compositional, so that a specified set of methods, algorithms, and components can be used for construction in a customizable manner.
• Espouses forethought rather than afterthought in anticipating requirements for real-time and dependable computing properties.
• Contains an architectural framework with:
  – well-defined levels of abstraction
  – clear and clean interfaces between layers
• A general fault/error model that provides robust fault tolerance properties, enhancing flexibility and scalability.
• Well-defined error containment regions:
  – flexible, tailorable, quantifiable, analyzable

Page 8

Scalable “Systems Approach”
• Provides an integrated view of component interactions beyond healthy process-level interactions:
  – failure semantics and tolerance/detection algorithms
• Uses system-level abstractions that can be applied recursively:
  – application programs
  – distributed OS
  – board, multi-board
  – chip, multi-chip

Page 9

A Scalable Clustering Approach
• A clustering technique can be used to group system resources into composable units.
• The system framework provides guidance to system developers:
  – Allows reasoned trade-offs between competing system aspects: performance, fault tolerance, flexibility, determinism
  – Provides a structured approach for assembling the required system services, resulting in a system that is analyzable, verifiable, and testable

Page 10

Reliable Platform Services
[Figure: Reliable Platform Services (RPS) architecture. Labeled elements:
• System Applications
• Reliable Platform Interface (RPI): App Services APIs, Sys Organization API, Health Monitoring API
• Application Services: Frame Scheduler Service, Data Integrity Service, Process Group Service
• Cluster Services (Synchronization, Application Service Management)
• Local Services (Scheduler, Networking, OS Services)
• Native Hardware, Operating System, and Vendor Device Drivers
• Resource and capability management: Local Resource Management, System Capability Management, Element Discovery, Initial Formation, Startup Sequencer
• Health monitoring: Local Resource Health Monitoring, System Capability Health Monitoring, Application Service Monitoring, RPS Component Monitoring]

Page 11

Adaptiveness in Error Domain

Page 12

Property-Based Fault Tolerance
• Non-functional properties are qualitative in nature and define characteristics associated with the delivered service:
  – reliability, availability, safety, security
  – scalability, flexibility, integrity, interoperability
• Functional properties are quantitative in nature and define what services the system delivers:
  – communication, resource discovery, synchronization, detection and reconfiguration
  – process group management, health monitoring, scheduling, etc.

Page 13

Property Compositions
• Basic properties:
  – Functional (services delivered)
  – Non-functional (the "-ilities")
• Composite properties:
  – Properties of the system as a whole rather than of its parts taken individually
  – Composite (emergent) properties are a consequence of the relationships between system components
  – Can be assessed/measured only after the components/services are composed and integrated into a system
[Figure: basic properties (P1, P2, P3) combine into composite properties (CP2, CP3)]
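Reliability is a familiar composite property: it cannot be read off any single component, because it depends on how the components are related. A minimal sketch, assuming independent component failures and simple series/parallel structures (neither assumption comes from the slide), shows the same three components giving very different system reliability:

```python
from math import prod


def series_reliability(rs):
    """System works only if every component works: R = R1 * R2 * ... * Rn."""
    return prod(rs)


def parallel_reliability(rs):
    """System works if at least one component works: R = 1 - (1 - R1)...(1 - Rn)."""
    return 1.0 - prod(1.0 - r for r in rs)


components = [0.99, 0.99, 0.99]  # hypothetical per-component reliabilities
print(f"series:   {series_reliability(components):.6f}")    # series:   0.970299
print(f"parallel: {parallel_reliability(components):.6f}")  # parallel: 0.999999
```

The component values are identical in both cases; only the composition differs, which is the sense in which composite properties can be assessed only after integration.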

Page 14

Structured Service Hierarchy
[Figure: structured service hierarchy and its Idealized Design Space. Labeled elements:
• Services: Resource Discovery, Discovery Services, Asynchronous Messaging, Asynch Group Services, Synchronous Services, Data Integrity Services, Fault Management Services, Scheduling Services, Application Mgt Services
• Building blocks: Theories of Time & Failure Models, System Models, Communication Primitives, Voting/Convergence Functions
• Building Blocks Specification & Verification; Consistency of Specification Across Building Blocks; Synergistic Formulation of Dependable Distributed Operations]
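Two classic voting/convergence functions of the kind listed among the building blocks can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: `majority_vote` assumes exact-match comparison of replica outputs, and `fault_tolerant_midpoint` is the convergence function from Welch and Lynch's clock synchronization algorithm, which tolerates up to f arbitrary readings when at least 3f + 1 are available.

```python
from collections import Counter


def majority_vote(values):
    """Return the value reported by a strict majority of replicas, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None


def fault_tolerant_midpoint(readings, f):
    """Discard the f lowest and f highest readings, then return the midpoint of
    the remaining extremes. Needs at least 3f + 1 readings for f arbitrary faults."""
    if len(readings) < 3 * f + 1:
        raise ValueError("need at least 3f + 1 readings to tolerate f arbitrary faults")
    trimmed = sorted(readings)[f:len(readings) - f]
    return (trimmed[0] + trimmed[-1]) / 2


print(majority_vote([42, 42, 7]))  # 42: the single deviant replica is outvoted
print(majority_vote([1, 2, 3]))    # None: no strict majority
print(fault_tolerant_midpoint([10.0, 10.2, 9.9, 55.0], f=1))  # wild reading 55.0 is discarded
```

Voting suits discrete outputs that correct replicas reproduce exactly; a convergence function suits continuous values such as clock readings, where correct replicas agree only approximately.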

Page 15

Framework Contains Services That Establish System Properties
• Establishes the necessary properties of bounded behavior for real-time and dependable computing:
  – Timeliness: synchrony of operations; deadline agreement
  – Correctness: group formation; group management
  – Resilience: errors that can be tolerated
• Components used to implement the properties (e.g., COTS) can be exchanged, as long as the properties are maintained

Page 16

Example: System of Distributed Spacecraft
[Figure: reorganization of spacecraft to accomplish different mission goals]

Page 17

Fault Tolerant Element Discovery (FT-ED)
• FT-ED “Cold-Start” facilitates dependable initial organization formation. [Figure: nodes 1 to 5 form Cluster A]
• FT-ED “Warm-Start” facilitates dependable organization augmentation. [Figure: nodes 6 and 7 join the existing Cluster A (nodes 1 to 5)]

Page 18

Use Case: High Dependability Multi-Clustered System
• An instantiation of the framework:
  – Supports a multi-cluster system
  – Each cluster performs high-dependability processing
  – Clusters are interfaced to support:
    • Highly dependable cluster interfaces
    • Hierarchical processing across cluster boundaries
[Figure: two service stacks. Both contain HM, CM, Process Groups, Interfaces, Data Integrity Services, Application Management, and Apps; one uses Intra-Cluster Synchronization and the other Inter-Cluster Synchronization. Also shown: Local Services (Scheduler, Networking, ..).]

Page 19

Distributed Containment Regions
• Once the properties are identified, distributed error containment regions (DECRs) are established and tailored to provide the necessary degree of dependability.
• Support can be established for DECRs with different levels of criticality.

Page 20

Distributed Containment Regions
• These regions can be organized in a variety of ways:
  – leader-follower
  – peer-to-peer
  – hierarchical
  – a combination of the above
• Examples:
  – define hardware vs. software error containment regions
  – define regions of different criticality
• The approach is effective in dealing with COTS issues:
  – it contains unknown or unspecified behaviors and failure semantics

Page 21

Premeditated Composability
• The design space is considered before composition
• A framework exists to support methodical construction at run-time
• Capable of adapting
[Figure: one DESIGN SPACE and four Operating Spaces]

Page 22

Strategy
• Creation of an idealized design space:
  – encompasses CSR goals
  – accommodates everything from a single system to a multi-cluster system
  – a comprehensive error model that is tailorable to the specific use case
• Establishing useful abstractions and relationships:
  – ECRs, clusters, system-of-systems
  – component couplings and dependencies
• A composable service architecture:
  – inheritance of underlying established properties: time (boundedness & accuracy) and data (integrity & fault tolerance)
  – streamlines the organization of layers, so system users/developers can work at the most meaningful abstraction layer

Page 23

Example Use Case 1: COTS-Based Dependable Cluster
[Figure: four COTS CPUs (CPU 1 to CPU 4), each running a COTS RTOS platform and RPS middleware processes, linked by a network infrastructure to form the Reliable Platform. Above it, the RPS-enabled virtual platform space (hosted app space) runs replicated applications: App A (A-1, A-2, A-3), App B (B-1, B-2), and App C (C-1, C-2, C-3), each replica accessing the platform through an RPI.]
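The replica counts in the figure (three copies of App A and App C, two of App B) determine what each replica group can do with its outputs. A minimal sketch, assuming the replicas' outputs are compared by a single fault-free voter and that every faulty replica produces a wrong output (assumptions not stated on the slide): masking t faulty replicas by strict-majority voting needs n ≥ 2t + 1, while a disagreement remains detectable as long as one correct replica survives.

```python
def replica_coverage(n):
    """Return (detectable, maskable) faulty-replica counts for n replicas."""
    # Masking t faulty replicas by strict-majority voting requires n >= 2t + 1.
    maskable = (n - 1) // 2
    # A mismatch is visible as long as at least one correct replica remains.
    detectable = n - 1
    return detectable, maskable


replicas = {"A": 3, "B": 2, "C": 3}  # replica counts read from the figure
for app, n in replicas.items():
    detectable, maskable = replica_coverage(n)
    print(f"App {app}: {n} replicas, detects up to {detectable}, masks up to {maskable}")
```

Under these assumptions, the duplex App B can flag an error but cannot decide which copy is right, whereas the triplex Apps A and C can out-vote a single faulty replica.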

Page 24

Improving Performance of an Individual Node
• Reduce the lifetime operating and support costs of FPGA-based systems, specifically the signal processing components. Related needs include:
  – Reduction in cost of hardware selection
  – Reduction in cost of hardware modification (e.g., minimize cost and schedule impact due to COTS technology refresh evolutions)
• Reduce the development costs of FPGA-based applications. Related needs include:
  – Abstracted interfaces to external resources
  – Cost-effective application growth
  – Solutions that will adapt to future changes and improvements to the underlying FPGA technology

Page 25

Reconfigurable Fault Tolerance

Page 26

Fault Tolerance Triggers
[Figure: trigger sources for reconfiguration]
• Radiation Hazard Triggers
• Power Mgt Triggers
• Load Monitoring Triggers
• Performance Triggers
• User Demand Triggers
• RLO Reconfiguration Triggers
• Mission Modes
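One way to read the figure is as inputs to a policy that selects a redundancy mode at run time. The sketch below is entirely hypothetical: the `Trigger` and `Mode` names, the three triggers it covers, and the priority order are illustrative assumptions, not the RLO reconfiguration logic.

```python
from enum import Enum, auto


class Trigger(Enum):
    # Hypothetical subset of the trigger sources shown in the figure.
    RADIATION_HAZARD = auto()
    POWER_MGT = auto()
    PERFORMANCE = auto()


class Mode(Enum):
    SIMPLEX = 1  # one copy: maximum throughput, no redundancy
    DUPLEX = 2   # two copies: detect a single error
    TMR = 3      # three copies: mask a single error by majority vote


# Hypothetical priority table: the first matching active trigger decides the mode.
POLICY = [
    (Trigger.RADIATION_HAZARD, Mode.TMR),  # hazard outweighs resource savings
    (Trigger.POWER_MGT, Mode.SIMPLEX),     # free redundant nodes to cut power
    (Trigger.PERFORMANCE, Mode.SIMPLEX),   # use redundant nodes for throughput
]


def select_mode(active, default=Mode.DUPLEX):
    """Return the redundancy mode for the set of currently active triggers."""
    for trigger, mode in POLICY:
        if trigger in active:
            return mode
    return default


print(select_mode({Trigger.POWER_MGT}))                            # Mode.SIMPLEX
print(select_mode({Trigger.POWER_MGT, Trigger.RADIATION_HAZARD}))  # Mode.TMR
print(select_mode(set()))                                          # Mode.DUPLEX
```

Ordering the table by priority keeps the decision deterministic when several triggers are active at once, which matters if replicated nodes must agree on the same mode.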

Page 27

Tools for Analysis and Certification

Page 28

Fault Management Challenges
• There are many types of flexible system architectures to consider.
• To make the best use of resources, dynamic redundancy techniques need to be employed.
• This requires an intimate understanding of faults and errors:
  – Use a possibilistic rather than a probabilistic strategy: "nearly impossible" means possible.
  – Emphasize arbitrary errors rather than specific error types.
  – Utilize concepts related to Byzantine Agreement.
  – Focus on narrowing the windows of error arrival and accumulation so that fault tolerance complexity does not grow exponentially.
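Emphasizing arbitrary (Byzantine) errors has a concrete cost in redundancy. The classic result of Lamport, Shostak, and Pease is that agreement with unauthenticated (oral) messages is possible only if n ≥ 3f + 1 nodes are present to tolerate f arbitrarily faulty ones, compared with n ≥ 2f + 1 for out-voting f faulty replicas at a single trusted voter. A minimal sketch of these bounds:

```python
def byzantine_nodes_required(f):
    """Minimum nodes for agreement with oral (unsigned) messages when up to
    f nodes may behave arbitrarily (Lamport, Shostak, Pease): n >= 3f + 1."""
    return 3 * f + 1


def voting_nodes_required(f):
    """Minimum replicas to out-vote f faulty ones at a single trusted voter: n >= 2f + 1."""
    return 2 * f + 1


def byzantine_faults_tolerated(n):
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


for f in range(1, 4):
    print(f"f={f}: voting needs {voting_nodes_required(f)}, "
          f"Byzantine agreement needs {byzantine_nodes_required(f)}")
# f=1: voting needs 3, Byzantine agreement needs 4
# f=2: voting needs 5, Byzantine agreement needs 7
# f=3: voting needs 7, Byzantine agreement needs 10
```

For example, a four-node cluster tolerates one arbitrary fault, and a six-node cluster still tolerates only one; the seventh node is what buys tolerance of a second.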

Page 29

EDICT Tools
• Model-based engineering platform
• Coherent, aspect-specific views of organization and behavior
• Integration of architectural and analytical models of systems and their constituent components/services
[Figure: EDICT architecture and analysis views. Aspects: Structure, Behavior, Safety, Dependability, Performance, Security. Also shown: AADL, UML/SysML, Simulink, and EDICT augmentations.]

Page 30

Structural Architecture Visualization
• Architecture Browser provides a graphical view of architecture models
[Figure labels: Component Hierarchy, Component Connections, Software Components, Hardware Components, Externals]

Page 31

Structural and Behavioral Views
• Architecture Browser provides many views
• Views show data of concern in the context of the overall architecture:
  – Data elements and usage
  – Data/control flows and interaction sequences
  – Property assignments
• Aspect-specific augmentations are also shown:
  – Safety criticalities

Page 32

EDICT Tools Support Many Modeling and Analysis Features for Verification
• Architecture modeling:
  – Architectural flows
  – Timelines and events
• Error propagation analysis
• Safety tagging and visualization
• Performance and schedulability analysis
• Requirements tagging and architectural tracing
• Simulink integration for application verification
[Figure: example fault tree with top event "Pilot blackout due to excessive acceleration" and events Control System Failure; Sensor Feedback Error (Sensor Produces Incorrect Value, Sensor Fails to Produce a Value); Control Law Failure (Control Law Design Error, Control Law Run-time Error)]
[Figure: example architecture with components Input Device, User Input Processing, System Control, Sensor Device, Sensor Filtering, Data Recording, Device Control, Device Actuator, Display, and User Display connected by networks; four links are marked "Asym"]

Page 33

Example Analysis: Fault-Aware Fault Trees

Page 34

Challenge
• Fault trees are one of the most widely used fault management (FM) mechanisms among practitioners, both as a visualization/communication medium and as a quantitative analysis tool for building mission-critical systems.
• Fault tree analysis is often conducted in an ad hoc manner and cannot provide high-confidence results.
• The major problem is that with manual fault tree construction, the resulting trees can be incomplete and can misrepresent failure-event relationships.
• As systems and their interfaces grow rapidly in complexity, the problem has only worsened. In a remarkably large number of failure events, FM that was inappropriately applied to mitigate the effect of an anomaly actually increased its severity. Therefore, we must pay meticulous attention to the misuse of FM methods.

Page 35

Goal: Fault-Class-Aware Fault-Tree Generation & Analysis
• Go beyond mechanical translation and extend the method to consider the impact of:
  – Awareness of fault class and fault management (FM) coverage limitations during tree generation
  – Prioritizing fault-class-oriented decomposition over pure architectural decomposition
• Go beyond faults in application systems:
  – Model-based FM scheme checking to assess whether the FM schemes are appropriate
  – Vigilance about critical faults in the use of FM schemes
  – Assessment of the impact of exposure to faults that are not covered due to inappropriate FM application

Page 36

Multifaceted Fault Reference Model
Faults are classified along eight facets:
• By phase of occurrence: development fault, operational fault
• By system boundary: internal fault, external fault
• By dimension: hardware fault, software fault
• By persistence: permanent fault, transient fault
• By cause: physical fault, design fault
• By objective: malicious fault, benign fault
• By intent: deliberate fault, non-deliberate fault
• By capability: accidental fault, incompetence fault
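A fault-class-aware analysis needs the fault class as data, not just a label on a tree node. A minimal sketch, assuming a hypothetical `FaultClass` record that captures three of the eight facets, and using the widely accepted rule that identical replicas fail roughly independently under physical faults but share any design fault (a common-mode failure):

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"


class Persistence(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"


class Cause(Enum):
    PHYSICAL = "physical"
    DESIGN = "design"


@dataclass(frozen=True)
class FaultClass:
    """Hypothetical record of three facets of the fault reference model."""
    dimension: Dimension
    persistence: Persistence
    cause: Cause


def covered_by_identical_replication(fault):
    """Physical faults strike replicas (approximately) independently, so identical
    copies can cover them; a design fault is present in every copy, so they cannot."""
    return fault.cause is Cause.PHYSICAL


radiation_upset = FaultClass(Dimension.HARDWARE, Persistence.TRANSIENT, Cause.PHYSICAL)
software_bug = FaultClass(Dimension.SOFTWARE, Persistence.PERMANENT, Cause.DESIGN)
print(covered_by_identical_replication(radiation_upset))  # True
print(covered_by_identical_replication(software_bug))     # False
```

The same check underlies the ARIANE-5 fault trees that follow: the air data software fault is a design fault, so duplicating identical units does not remove it.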

Page 37

Misleading Fault Tree without Fault Awareness (ARIANE-5)
[Figure: top event "FT Inertial System failure" is an AND of "Primary SRI failure" and "Secondary SRI failure"; each SRI failure is an OR of "ADIRU device failure" and "Air data software failure". Basic-event probabilities: $P_{\text{ADIRU}} = 10^{-4}$, $P_{\text{dataSW}} = 10^{-4}$.]

$$P_{\text{inertialSys}} = \bigl(1 - (1 - P_{\text{ADIRU}})(1 - P_{\text{dataSW}})\bigr)^2 \approx 4 \times 10^{-8}$$
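The arithmetic behind the ≈ 4 × 10^-8 figure, writing $P_{\text{SRI}}$ (a symbol introduced here) for the failure probability of one SRI channel and assuming, as this tree implicitly does, that the two channels fail independently:

```latex
\begin{aligned}
P_{\text{SRI}} &= 1 - (1 - P_{\text{ADIRU}})(1 - P_{\text{dataSW}}) = 1 - (1 - 10^{-4})^2 = 1.9999 \times 10^{-4} \\
P_{\text{inertialSys}} &= P_{\text{SRI}}^2 = (1.9999 \times 10^{-4})^2 \approx 4.0 \times 10^{-8}
\end{aligned}
```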

Page 38

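Fault Tree with Fault Awareness
[Figure: top event "FT Inertial System failure" is an OR of "ADIRU device failure" (an AND of the primary and secondary SRI device failures) and a single "Air data software failure" shared by both channels. Basic-event probabilities: $P_{\text{ADIRU}} = 10^{-4}$ per device, $P_{\text{dataSW}} = 10^{-4}$.]

$$P_{\text{inertialSys}} = 1 - (1 - P_{\text{ADIRU}}^2)(1 - P_{\text{dataSW}}) \approx 1 \times 10^{-4}$$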
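Expanding the expression shows why the result is about 2,500 times larger than the 4 × 10^-8 of the misleading tree: the shared software failure enters the top event directly, while the duplicated hardware contributes only a second-order term.

```latex
\begin{aligned}
P_{\text{inertialSys}} &= 1 - (1 - P_{\text{ADIRU}}^2)(1 - P_{\text{dataSW}}) \\
&= P_{\text{ADIRU}}^2 + P_{\text{dataSW}} - P_{\text{ADIRU}}^2 P_{\text{dataSW}} \\
&= 10^{-8} + 10^{-4} - 10^{-12} \approx 1.0 \times 10^{-4}
\end{aligned}
```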

Page 39

Fault Tree with Augmentation
[Figure: top event "FT Inertial System failure" is an OR of "ADIRU device failure" (an AND of "Primary ADIRU failure" and "Secondary ADIRU failure") and "Air data software failure" (an AND of "Primary version failure" and "Secondary version failure"). All basic-event probabilities are $10^{-4}$.]

$$P_{\text{inertialSys}} = 1 - (1 - P_{\text{ADIRU}}^2)(1 - P_{\text{dataSW}}^2) \approx 2 \times 10^{-8}$$
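The three trees can be evaluated side by side with two gate functions. A minimal sketch, assuming statistically independent basic events (the standard fault-tree assumption) and the 10^-4 basic-event probabilities from the slides; the augmented tree additionally assumes the two software versions fail independently, which is optimistic for design-diverse software.

```python
from math import prod


def or_gate(*ps):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in ps)


def and_gate(*ps):
    """Probability that all independent input events occur."""
    return prod(ps)


P_ADIRU = 1e-4   # per-unit device failure probability used in the slides
P_DATASW = 1e-4  # air data software failure probability used in the slides

# Misleading tree: each channel fails on (device OR software); channels independent.
misleading = and_gate(or_gate(P_ADIRU, P_DATASW), or_gate(P_ADIRU, P_DATASW))
# Fault-aware tree: both channels run the same software, a single common-mode event.
fault_aware = or_gate(and_gate(P_ADIRU, P_ADIRU), P_DATASW)
# Augmented tree: two software versions, assumed to fail independently.
augmented = or_gate(and_gate(P_ADIRU, P_ADIRU), and_gate(P_DATASW, P_DATASW))

for name, p in [("misleading", misleading), ("fault-aware", fault_aware), ("augmented", augmented)]:
    print(f"{name:<12} {p:.2e}")
```

This prints approximately 4.00e-08, 1.00e-04, and 2.00e-08, matching the three slides; only the gate structure changes between the trees, not the basic-event numbers.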

Page 40

Summary of Major Points
• Fault management in spaceflight architectures is a multidimensional problem
• A Reliable Platform (RP) property-based architecture with hierarchical clustering has been shown to be effective
• RP FM strategies can be implemented in many ways
• Reconfigurable fault tolerance can accelerate performance and provide adaptive fault tolerance
• Clusters can be distributed and arranged in various hierarchical configurations
• Local fault management can be flexible and customizable
• Model fault effects and their impact on system reliability to avoid incorrect assessments of dependability
• Good modeling and analysis tools are needed (use EDICT!)