The Dependability Solution Provider TM
WW Technology Group
WW Technology Group © Copyright 2015 All rights reserved.
Designing Fault Management in Spaceflight Architectures
Chris J. Walter
WW Technology Group
[email protected]
(410) 418-4353
Challenges
• NASA architectures affected by trends in current computing architectures
– Network centric
– Security vulnerabilities
– Lower voltages
– SWAP
– Code reuse
• NASA demands
– Higher onboard processing
– Reusable missions and fault tolerance
Future Spacecraft Onboard Computing Needs
Computation Category: Vision-based Algorithms with Real-Time Requirements
Mission Need:
• Terrain Relative Navigation
• Hazard Avoidance
• Entry, Descent & Landing
• Pinpoint Landing
Objective of Computation:
• Conduct safe proximity operations around primitive bodies
• Land safely and accurately
• Achieve robust results within available timeframe as input to control decisions
Flight Architecture Attribute:
• Severe fault tolerance and real-time requirements
• Fail-operational
• High peak power needs
Computation Category: Model-Based Reasoning Techniques for Autonomy
Mission Need:
• Mission planning, scheduling & resource management
• Fault management in uncertain environments
Objective of Computation:
• Contingency planning to mitigate execution failures
• Detect, diagnose and recover from faults
Flight Architecture Attribute:
• High computational complexity
• Graceful degradation
• Memory usage (data movement) impacts energy management
Computation Category: High Rate Instrument Data Processing
Mission Need:
• High resolution sensors, e.g., SAR, Hyper-spectral
Objective of Computation:
• Downlink images and products rather than raw data
• Opportunistic science
Flight Architecture Attribute:
• Distributed, dedicated processors at sensors
• Less stringent fault tolerance
- Results from NASA study on High Performance Space Computing (HPSC)
Future Spacecraft Onboard Computing Needs
Computation Category: Vision-based Algorithms with Real-Time Requirements
Flight Architecture Attribute:
• Severe fault tolerance and real-time requirements
• Fail-operational
• High peak power needs
Computation Category: Model-Based Reasoning Techniques for Autonomy
Flight Architecture Attribute:
• High computational complexity
• Graceful degradation
• Energy management
Computation Category: High Rate Instrument Data Processing
Flight Architecture Attribute:
• Distributed, dedicated processors at sensors
• Less stringent fault tolerance
Large Scale “System-of-Systems”
[Figure labels: Communication Link, Processing Node, Constellation Cluster, Processing Cluster]
WWTG has Evolved a Vision for Highly Reliable Distributed Systems
• Our vision defines a system framework coupled with a middleware infrastructure that facilitates the deployment of robust, autonomous distributed systems.
• Features of our approach include:
– Scalability - System Size, Complexity and Dependability
– Flexibility - System Composition and System Functionality
– Integrity - Analyzable and Verifiable System
– Heterogeneity - Diversity in hardware and software components
• These properties are provided by a cluster-based infrastructure that is applicable to many domains
– Embedded Control Systems
– Distributed Information Systems
Scalable “Systems Approach”
• Compositional so that a specified set of methods, algorithms, and components can be used for construction in a customizable manner.
• Espouses the use of forethought rather than afterthought in anticipating requirements for real-time and dependable computing properties.
• Contains an architectural framework with
– well-defined levels of abstraction
– clear and clean interfaces between layers.
• A general fault/error model to provide robust fault tolerance properties that enhance flexibility and scalability.
• Well-defined error containment regions
– flexible, tailorable, quantifiable, analyzable
Scalable “Systems Approach”
• Provides an integrated view of component interactions beyond healthy process-level interactions
– failure semantics and tolerance/detection algorithms
• Uses system level abstractions that can be recursively applied
– application programs
– distributed OS
– board, multi-board
– chip, multi-chip
A Scalable Clustering Approach
• Clustering technique can be used to group system resources into composable units
• System Framework provides a set of guidance to system developers
– Allows for reasoned trade-offs between competing system aspects
• Performance, Fault Tolerance, Flexibility, Determinism
– Provides a structured approach for assembling required system services, resulting in a system that is:
• Analyzable
• Verifiable
• Testable
Reliable Platform Services
[Figure: block diagram. Labeled blocks:]
• System Applications
• Application Services: Frame Scheduler Service, Data Integrity Service, Process Group Service
• Reliable Platform Interface (RPI): App Services APIs, Sys Organization API, Health Monitoring API
• Cluster Services (Synchronization, Application Service Management)
• Local Services (Scheduler, Networking, OS Services)
• Local Resource Management
• System Capability Management
• Element Discovery
• Initial Formation
• Startup Sequencer
• Local Resource Health Monitoring
• System Capability Health Monitoring
• Application Service Monitoring
• RPS Component Monitoring
• Native Hardware, Operating System, and Vendor Device Drivers
Adaptiveness in Error Domain
Property Based Fault Tolerance
• Non-Functional properties are qualitative in nature and define characteristics associated with the delivered service
– reliability, availability, safety, security
– scalability, flexibility, integrity, interoperability
• Functional properties are quantitative in nature and define what services the system delivers
– communication
– resource discovery
– synchronization
– detection and reconfiguration
– process group management
– health monitoring
– scheduling
– etc.
Property Compositions
• BASIC PROPERTIES
– Functional (services delivered)
– Non-Functional (-ilities)
• COMPOSITE PROPERTIES
– Properties of the system as a whole rather than taken individually
– Composite (Emergent) properties are a consequence of the relationships between system components
– Can assess/measure only after composition of components/services integrated into a system
[Figure: basic properties P1, P2, P3 and composite properties CP2, CP3]
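One way to read the basic/composite distinction: a basic property belongs to a single service, while a composite property can only be computed once the services and their relationship are known. The following is a minimal sketch under stated assumptions, not the framework's method: it assumes services used in series with independent failures, and the names `Service` and `composite_availability` are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Service:
    """One composed service with a single basic non-functional property."""
    name: str
    availability: float  # probability the service is up (assumed value)


def composite_availability(chain):
    """Availability of services used in series.

    Assumes every service in the chain must work and that failures are
    independent, so the composite value is the product of the basic ones.
    """
    return prod(s.availability for s in chain)


# Illustrative numbers only; they are not taken from the slides.
chain = [
    Service("resource discovery", 0.999),
    Service("synchronization", 0.999),
    Service("scheduling", 0.999),
]
system_availability = composite_availability(chain)
```

The composite value is lower than any single basic value, which is only visible after composition. A different relationship between services (for example redundancy) would need a different composition rule.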
Structured Service Hierarchy
Discovery Services
Asynch Group Services
Synchronous Services
Data Integrity Services
Fault Management Services
Scheduling Services
Application Mgt Services
Asynchronous Messaging
Idealized Design Space
Building Blocks
Theories of Time & Failure Models
System Models
Communication Primitives
Voting/Convergence Functions
Building Blocks Specification & Verification
Consistency of Specification Across Building Blocks
Synergistic Formulation Of Dependable Distributed Operations
Resource Discovery
Framework Contains Services That Establish System Properties
• Establishes the necessary properties of bounded behavior for real-time and dependable computing
– Timeliness
• synchrony of operations
• deadline agreement
– Correctness
• group formations
• group management
– Resilience
• errors that can be tolerated
• Components that are used to implement the properties (COTS) can be exchanged, as long as properties are maintained
Example: System of Distributed Spacecraft
Reorganization of spacecraft for accomplishing different mission goals
Fault Tolerant Element Discovery
FT-ED “Cold-Start” Facilitates Dependable Initial Organization Formation
[Figure: Cluster A with elements 1 to 5]
FT-ED “Warm-Start” Facilitates Dependable Organization Augmentation
[Figure: Cluster A with elements 1 to 5, plus elements 6 and 7]
Use Case: High Dependability Multi-Clustered System
• An instantiation of the framework
– Supports a multi-cluster system
– Each cluster performs high dependability processing
– Clusters are interfaced to support:
• Highly dependable cluster interfaces
• Hierarchical processing across cluster boundaries
[Figure: two service stacks above Local Services (Scheduler, Networking, ..), one with Intra-Cluster Synchronization and one with Inter-Cluster Synchronization; each also shows HM, CM, Process Groups, Interfaces, Data Integrity Services, Application Management, and Apps]
Distributed Containment Regions
• Once properties are identified, DECRs are established and tailored to provide the necessary degrees of dependability.
• DECRs with different levels of criticality can be supported
Distributed Containment Regions
• These regions can be organized in a variety of ways
– leader-follower
– peer-to-peer
– hierarchical
– combination of above
• Examples:
– define hardware v. software error containment regions
– define regions of different criticality
• Approach is effective in dealing with COTS issues
– contain unknown or unspecified behaviors and failure semantics
Premeditated Composability
• Design Space is considered before composition
• Framework exists to support methodical construction at run-time
• Capable of adapting
[Figure: DESIGN SPACE with four regions labeled Operating Space]
Strategy
• Creation of idealized design space
– encompasses CSR goals
– accommodates single system to multi-cluster system
– comprehensive error model that is tailorable to specific use case
• Establishing useful abstractions and relationships
– ECRs, Clusters, System-of-Systems
– component couplings and dependencies
• Composable service architecture
– inheritance of underlying established properties
• time (boundedness & accuracy)
• data (integrity & fault tolerance)
– streamlines the organization of layers
• system users/developers can work at the most meaningful abstraction layer
Example Use Case 1: COTS Based Dependable Cluster
[Figure: four-node cluster. Labeled elements:]
• COTS CPU 1 to COTS CPU 4, joined by a Network Infrastructure
• A COTS RTOS Platform on each CPU
• RPS Middleware Processes on each CPU, labeled together as the Reliable Platform
• Hosted App Space: replicas A-1, A-2, A-3, B-1, B-2, C-1, C-2, C-3, each with an RPI
• RPS-Enabled Virtual Platform Space: Replicated App A, Replicated App B, Replicated App C
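The figure shows each hosted application running as several replicas (for example A-1, A-2, A-3) but does not say how their outputs are reconciled. As a sketch under that caveat, exact-match majority voting is one common choice; it is an assumption here, and `majority_vote` is a hypothetical helper, not an RPS or RPI call.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of replicas.

    Returns None when there are no outputs or no value reaches a strict
    majority, leaving the caller to decide how to handle disagreement.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# Hypothetical outputs from the three replicas of App A; A-3 is faulty.
replica_outputs = {"A-1": 42, "A-2": 42, "A-3": 17}
voted = majority_vote(list(replica_outputs.values()))
```

With three replicas one erroneous output is masked. With two replicas, as drawn for App B, a disagreement can be detected but not resolved by voting alone.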
Improving Performance of Individual Node
• Reduce the lifetime operating and support costs of FPGA based systems, specifically the signal processing components. Related needs include:
– Reduction in cost of hardware selection
– Reduction in cost of hardware modification (e.g., minimize cost and schedule impact due to COTS Technology Refresh Evolutions)
• Reduce the development costs of FPGA based applications. Related needs include:
– Abstracted interfaces to external resources
– Cost effective application growth
– Solutions that will adapt to future changes and improvements to the underlying FPGA technology
Reconfigurable Fault Tolerance
Fault Tolerance Triggers
Radiation Hazard Triggers
Power Mgt Triggers
Load Monitoring Triggers
Performance Triggers
User Demand Triggers
RLO Reconfiguration Triggers
Mission Modes
Tools for Analysis and Certification
Fault Management Challenges
• We can see there are many types of flexible system architectures to consider
• In order to make best use of resources there is a need to employ dynamic redundancy techniques
• This requires intimate understanding of faults and errors
– Use a possibilistic rather than a probabilistic strategy
• “Nearly impossible” means possible.
– Emphasize arbitrary errors rather than specific types
– Utilize concepts related to Byzantine Agreement
– Focus on narrowing windows of error arrival and accumulation so that fault tolerance complexities do not grow exponentially
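The slide names Byzantine Agreement without stating its cost. In the classical setting with unsigned messages, agreement among n nodes is possible only if the number f of arbitrarily faulty nodes satisfies n ≥ 3f + 1. The sketch below simply evaluates that bound; the function names are hypothetical and are not part of any WWTG tool.

```python
def min_nodes_for_byzantine_agreement(f):
    """Smallest n that tolerates f arbitrary faults (classical n >= 3f + 1 bound)."""
    if f < 0:
        raise ValueError("number of faults must be non-negative")
    return 3 * f + 1


def max_tolerated_byzantine_faults(n):
    """Largest f with n >= 3f + 1, for a group of n nodes."""
    if n < 1:
        raise ValueError("need at least one node")
    return (n - 1) // 3


four_node_tolerance = max_tolerated_byzantine_faults(4)
nodes_for_two_faults = min_nodes_for_byzantine_agreement(2)
```

The bound illustrates why narrowing the window of error accumulation matters: each additional simultaneous arbitrary fault that must be tolerated adds three nodes to the group.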
EDICT Tools
• Model-based engineering platform
• Coherent aspect-specific views of organization and behavior
• Integration of architectural and analytical models of systems and their constituent components/services
[Figure: EDICT Architecture and Analysis Views. Labels: Safety, Behavior, Structure, Dependability, Performance, Security, Simulink, AADL, UML/SysML, Augmentations, Aspects]
Structural Architecture Visualization
Architecture Browser provides a graphical view of architecture models
[Figure labels: Component Hierarchy, Component Connections, Software Components, Hardware Components, Externals]
Structural and Behavioral Views
• Architecture Browser provides many views
• Views show data of concern in the context of the overall architecture
– Data elements and usage
– Data/Control flows and interaction sequences
– Property assignments
• Aspect specific augmentations are also shown
– Safety criticalities
EDICT Tools Support Many Modeling and Analysis Features for Verification
• Architecture Modeling
– Architectural Flows
– Timelines and Events
• Error Propagation Analysis
• Safety Tagging and Visualization
• Performance and Schedulability Analysis
• Requirements Tagging and Architectural Tracing
• Simulink Integration for Application Verification
[Figure: fault tree. Events: Pilot Blackout due to excessive acceleration, Control System Failure, Sensor Feedback Error, Control Law Failure, Sensor Produces Incorrect Value, Sensor Fails to Produce a Value, Control Law Design Error, Control Law Run-time Error]
[Figure: architecture model. Components: Display, User Input Processing, System Control, Sensor Filtering, Data Recording, Device Control, Device Actuator, User Display, Input Device, Sensor Device; connections labeled Network and Asym]
Example Analysis: Fault Aware Fault Trees
Challenge
• Fault trees are one of the FM mechanisms most widely used by practitioners, both as a visualization/communication medium and as a quantitative analysis tool for building mission-critical systems.
• Fault tree analysis is often conducted in an ad hoc manner and is unable to provide us with high-confidence results.
• The major problem is that with manual fault tree construction, the resulting trees can be incomplete and failure-event relationships misrepresented.
• As systems and their interface complexities grow rapidly, the problem has only worsened. In a remarkably large number of failure events, fault management (FM) applied inappropriately to mitigate the effect of an anomaly actually increased its severity. Therefore we must pay meticulous attention to the misuse of FM methods.
Goal: Fault-Class-Aware Fault-tree Generation & Analysis
• Go beyond mechanical translation and extend method to consider impact of:
– Awareness of fault class and Fault Management (FM) coverage limitation during tree generation.
– Prioritize fault-class-oriented decomposition over pure architectural decomposition.
• Go beyond faults in application systems
– Model-based FM scheme checking to assess whether appropriate
– Vigilant about critical faults in the use of FM schemes.
– Impact assessment to the exposure of the faults that are not covered due to inappropriate FM application.
Multifaceted Fault Reference Model
[Figure: a Fault classified along eight facets]
• by phase of occurrence: development fault, operational fault
• by system boundary: internal fault, external fault
• by dimension: hardware fault, software fault
• by persistence: permanent fault, transient fault
• by cause: physical fault, design fault
• by objective: malicious fault, benign fault
• by intent: deliberate fault, non-deliberate fault
• by capability: accidental fault, incompetence fault
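The reference model can be held as a small lookup table so that a fault class attached to a tree node is checked against the eight facets. This is a sketch only: the facet names and values are taken from the figure, while `FAULT_FACETS` and `make_fault_class` are hypothetical names and not part of EDICT.

```python
# Facets and values as labeled in the figure.
FAULT_FACETS = {
    "phase of occurrence": {"development", "operational"},
    "system boundary": {"internal", "external"},
    "dimension": {"hardware", "software"},
    "persistence": {"permanent", "transient"},
    "cause": {"physical", "design"},
    "objective": {"malicious", "benign"},
    "intent": {"deliberate", "non-deliberate"},
    "capability": {"accidental", "incompetence"},
}


def make_fault_class(choices):
    """Validate a (possibly partial) classification and return a copy of it.

    Raises KeyError for an unknown facet and ValueError for a value that
    does not belong to the named facet.
    """
    for facet, value in choices.items():
        if facet not in FAULT_FACETS:
            raise KeyError(f"unknown facet: {facet}")
        if value not in FAULT_FACETS[facet]:
            raise ValueError(f"{value!r} is not a value of facet {facet!r}")
    return dict(choices)


# Example: a permanent software design fault introduced during development.
software_design_fault = make_fault_class({
    "phase of occurrence": "development",
    "dimension": "software",
    "cause": "design",
    "persistence": "permanent",
})
```

A partial classification is allowed on purpose, since not every facet is known when a tree node is first tagged.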
Misleading Fault Tree w/o Fault Awareness (ARIANE-5)
[Figure: fault tree. Top event: FT Inertial System failure. Primary SRI failure and Secondary SRI failure each decompose into ADIRU device failure and Air data software failure]
Leaf probabilities (for each SRI): $P_{ADIRU} = 10^{-4}$, $P_{dataSW} = 10^{-4}$
$P_{inertialSys} = \left(1 - (1 - P_{ADIRU})(1 - P_{dataSW})\right)^2 \approx 4 \times 10^{-8}$
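Fault Tree with Fault Awareness
[Figure: fault tree. Top event: FT Inertial System failure. Events shown: Primary SRI failure, Secondary SRI failure, ADIRU device failure (twice), Air data software failure (once)]
Leaf probabilities: $P_{ADIRU} = 10^{-4}$ (two leaves), $P_{dataSW} = 10^{-4}$ (one leaf)
$P_{inertialSys} = 1 - (1 - P_{ADIRU}^2)(1 - P_{dataSW}) \approx 1 \times 10^{-4}$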
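Fault Tree with Augmentation
[Figure: fault tree. Top event: FT Inertial System failure. ADIRU device failure decomposes into Primary ADIRU failure and Secondary ADIRU failure; Air data software failure decomposes into Primary version failure and Secondary version failure]
Leaf probabilities: $P_{ADIRU} = 10^{-4}$ (two leaves), $P_{dataSW} = 10^{-4}$ (two leaves)
$P_{inertialSys} = 1 - (1 - P_{ADIRU}^2)(1 - P_{dataSW}^2) \approx 2 \times 10^{-8}$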
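The three fault trees can be compared numerically with the leaf probability of 10^-4 given on the slides. The sketch below is one reading of the tree structures, with independent leaves under each gate; the helper names `p_or` and `p_and` are hypothetical and this is not EDICT's analysis engine.

```python
def p_or(*probabilities):
    """Probability that at least one of several independent events occurs."""
    none_occur = 1.0
    for p in probabilities:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def p_and(*probabilities):
    """Probability that all of several independent events occur."""
    all_occur = 1.0
    for p in probabilities:
        all_occur *= p
    return all_occur


P_ADIRU = 1e-4    # ADIRU device failure, per unit
P_DATA_SW = 1e-4  # air data software failure

# Without fault awareness: each SRI fails on device OR software failure,
# and the two SRIs are (wrongly) treated as independent.
p_misleading = p_and(p_or(P_ADIRU, P_DATA_SW), p_or(P_ADIRU, P_DATA_SW))

# With fault awareness: both devices must fail, but the identical software
# is a single common-mode leaf shared by both SRIs.
p_fault_aware = p_or(p_and(P_ADIRU, P_ADIRU), P_DATA_SW)

# With augmentation: two software versions, assumed to fail independently.
p_augmented = p_or(p_and(P_ADIRU, P_ADIRU), p_and(P_DATA_SW, P_DATA_SW))
```

Under these assumptions the fault-aware estimate is more than three orders of magnitude worse than the misleading one, and the augmented design brings the estimate back to the 10^-8 range.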
Summary of Major Points
• Fault Management in Spaceflight architectures is a many-dimensional problem
• Reliable Platform (RP) property based architecture with hierarchical clustering shown to be effective
• RP FM Strategies can be implemented in many ways
• Reconfigurable Fault Tolerance can accelerate performance and provide adaptive fault tolerance
– Clusters can be distributed and arranged in various hierarchical configurations
– Local fault management can be flexible and customizable
• Modeling fault effects and impact on system reliability to avoid incorrect assessments of dependability
• Need good modeling and analysis tools (use EDICT!)