

Langley Research Center

SPIDER Formal Models–Where are we now?

Paul S. Miner
paul.s.miner@nasa.gov

In collaboration with: Alfons Geser (NIA), Jeff Maddalon, and Lee Pike

Internal Formal Methods Workshop
NASA Langley Research Center

October 22, 2003

October 22, 2003 SPIDER Update 2

What is SPIDER?

• A family of fault-tolerant IMA architectures
  – Architecture concept due to Paul Miner, Mahyar Malekpour, and Wilfredo Torres-Pomales
• Inspired by several earlier designs
  – Main concept inspired by Palumbo’s Fault-tolerant processing system (U.S. Patent 5,533,188)
    • Developed as part of Fly-By-Light/Power-By-Wire project
  – Other ideas from Draper’s FTPP, FTP, and FTMP; Allied-Signal’s MAFT; SRI’s SIFT; Kopetz’s TTA; Honeywell’s SAFEbus; …

SPIDER Architecture

• N general purpose Processing Elements (PEs) logically connected via a Reliable Optical BUS (ROBUS)
  – A PE could be a general purpose processor, remote data concentrator, sensor, actuator, or any other device that needs to reliably communicate with other PEs
• SPIDER must be sufficiently reliable to support several aircraft functions
  – Persistent loss of single function could be catastrophic
• The ROBUS is an ultra-reliable unit providing basic fault-tolerant communication services
• ROBUS contains no software

Logical view of SPIDER (Sample Configuration)

[Figure: eight Processing Elements, numbered 0 to 7, attached to the ROBUS.]

Design Objectives

• FT-IMA Architecture proven to survive a bounded number of physical faults
  – Both permanent and transient
  – Must survive Byzantine faults
• Capability to survive or quickly recover from massive correlated transient failure (e.g. in response to HIRF)

Byzantine Faults

• Characterized by asymmetric error manifestations
  – different manifestations to different fault-free observers
  – including dissimilar values
• Can cause redundant computations to diverge
• If not properly handled, single Byzantine fault can defeat several layers of redundancy
• Many architectures neglect this class of fault
  – Assumed to be rare or even impossible

Byzantine faults are real

• Several examples cited in Byzantine Faults: From Theory to Reality, Driscoll, et al. (to appear in SAFECOMP 2003)
  – Byzantine failures nearly grounded a large fleet of aircraft
  – Quad-redundant system failed in response to a single fault
  – Typical cases are faulty transmitters (resulting in indeterminate voltage levels at receivers) or faults that cause timing violations (so that multiple observers perceive the same event differently)
• Heavy Ion fault-injection results for TTP/C (Sivencrona, et al.)
  – more than 1 in 1000 of observed errors had Byzantine manifestations

SPIDER Advantages

• Fault-Tolerance independent of applications
• Tolerates more failures
  – including any single Byzantine fault (and some combinations)
  – including many combinations of less severe failures
  – Hybrid fault model: good, asymmetric, symmetric, benign
• Does not require that nodes fail silent
  – But can take advantage when they do
• Simpler, stronger protocols with stronger assurance
• Can gracefully evolve to accommodate parts obsolescence
  – Off-the-shelf processors and low-level communication
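
The four classes of the hybrid fault model can be illustrated with a small sketch. It assumes the usual reading of the terms, which the slides do not spell out: a benign fault is detectably bad to every fault-free observer, a symmetric fault shows the same wrong value to all of them, and an asymmetric (Byzantine) fault may show each observer something different. The function name `classify` and the use of `None` for a detectably bad message are hypothetical choices made for this example.

```python
def classify(observations, correct):
    """Classify a source from what each fault-free observer received in one round.

    observations: one entry per fault-free observer; None marks a message
                  that the observer could tell was bad (assumed convention).
    correct:      the value a good source would have sent.
    """
    if all(o == correct for o in observations):
        return "good"
    if all(o is None for o in observations):
        return "benign"        # every observer detects the fault
    if len(set(observations)) == 1:
        return "symmetric"     # same wrong value seen by everyone
    return "asymmetric"        # observers see different things


print(classify([5, 5, 5], correct=5))             # good
print(classify([None, None, None], correct=5))    # benign
print(classify([7, 7, 7], correct=5))             # symmetric
print(classify([5, 7, None], correct=5))          # asymmetric
```

The classification is by manifestation in a single exchange, so the same physical fault could land in different classes at different times.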

Failures contained by ROBUS

• Arbitrary failure in any attached Processing Element
  – Hardware or Software
  – Converts potential asymmetric error manifestations to symmetric
  – ROBUS provides a partitioning mechanism between PEs
• Must also operate correctly if a bounded number of internal hardware devices fail
• Cannot tolerate design error within ROBUS

Design Assurance Strategy

• Fault-tolerance protocols and reliability models use the same fault classifications
• Reliability analysis using SURE (Butler & White)
  – Calculates P(enough good hardware)
• Formal proof of fault-tolerance protocols using PVS (SRI)
  – enough good hardware => correct operation
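
To give a feel for the quantity P(enough good hardware), here is a deliberately simplified sketch. It is not how SURE works (SURE bounds the solutions of semi-Markov models); it only assumes identical units that fail independently with a constant failure rate, and asks how likely it is that more units fail than the protocols tolerate. The function names, the failure rate, and the unit counts are hypothetical.

```python
from math import comb, expm1


def p_unit_fails(rate_per_hour, mission_hours):
    """Probability that one unit fails during the mission (exponential lifetime)."""
    return -expm1(-rate_per_hour * mission_hours)


def p_too_many_faults(n_units, max_faults, rate_per_hour, mission_hours):
    """Probability that more than max_faults of n_units fail, assuming independence."""
    p = p_unit_fails(rate_per_hour, mission_hours)
    return sum(
        comb(n_units, j) * p**j * (1 - p) ** (n_units - j)
        for j in range(max_faults + 1, n_units + 1)
    )


# Hypothetical numbers: 3 units, 1 tolerated fault, 1e-5 failures/hour, 10 hour mission.
print(p_too_many_faults(3, 1, 1e-5, 10.0))
```

The complement of this number plays the role of P(enough good hardware); the formal proofs then cover what happens when that condition holds.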

Strength of Formal Verification

• Proofs equivalent to testing the protocols
  – for all specified ROBUS configurations
  – for all combinations of faults that satisfy the maximum fault assumption for each specified ROBUS configuration
  – for all specified message values
• The PVS proofs provide verification coverage equivalent to an infinite number of test cases.
  – Provided that the PVS model of the protocols is faithful to the VHDL design

ROBUS Characteristics

• All good nodes agree on communication schedule
  – Currently bus access schedule statically determined
    • similar to SAFEbus, Time-Triggered Architecture (TTA)
  – Architecture supports on-the-fly schedule updates
    • similar to FTPP
    • Preliminary capability will be in our next prototype
• Some fault-tolerance capabilities must be provided by processing elements
  – Analogous to Fault Tolerance Layer in TTA
• Processing Elements need not be uniform
  – Some support for dissimilar architectures

Logical View of ROBUS

• ROBUS operates as a time-division multiple access broadcast bus
• ROBUS strictly enforces write access
  – no babbling idiots (prevented by ROBUS topology)
• Processing nodes may be grouped to provide differing degrees of fault-tolerance
  – PEs cannot exhibit Byzantine errors (prevented by ROBUS topology)
  – Simple N-modular redundancy strategies sufficient for PEs
  – Redundancy management for these groupings done by the PEs
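
The logical effect of strict write access on a time-division bus can be sketched as follows. This models only the behavior a PE would observe, not the mechanism: the slide attributes enforcement to the ROBUS topology. The schedule contents and the helper names `slot_owner` and `accepted_writes` are hypothetical.

```python
def slot_owner(schedule, slot):
    """PE that owns the given slot; the static schedule repeats cyclically."""
    return schedule[slot % len(schedule)]


def accepted_writes(schedule, attempts):
    """Keep only the (slot, sender) attempts made by the slot's owner."""
    return [(slot, pe) for slot, pe in attempts if pe == slot_owner(schedule, slot)]


schedule = [0, 1, 2, 3]                        # hypothetical four-PE round
babbler = [(slot, 2) for slot in range(8)]     # PE 2 tries to write in every slot
print(accepted_writes(schedule, babbler))      # [(2, 2), (6, 2)]
```

A babbling PE therefore disturbs nothing outside its own slots, which is the property the "no babbling idiots" bullet refers to.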

SPIDER Topology

[Figure: ROBUS(N,M) topology. Labels: PE 1, PE 2, PE 3, …, PE N; BIU 1, BIU 2, BIU 3, …, BIU N; RMU 1, RMU 2, …, RMU M.]


First ROBUS Prototype

First SPIDER Prototype

[Figure: the first SPIDER prototype, with labels PE & BIU 1, PE & BIU 2, PE & BIU 3, RMU 1, RMU 2, and RMU 3.]

Picture provided by Derivation Systems, Inc. (www.derivation.com)

ROBUS Requirements

• All fault-free PEs receive identical message sequences
  – If the source is also fault-free, they receive the message sent
• ROBUS provides a reliable time source (RTS)
  – The PEs are synchronized relative to this RTS
• ROBUS provides correct and consistent ROBUS diagnostic information to all fault-free PEs
• For 10 hour mission, P(ROBUS Failure) < 10^-10

Other Requirements

• Primary focus is on fault-tolerance requirements
  – Other requirements unspecified
    • Message format/encoding
    • Performance
  – These are implementation dependent
• Product Family
  – capable of range of performance
  – trade-off performance and reliability
  – Formal analysis valid for any instance

ROBUS Protocols

• Interactive Consistency (Byzantine Agreement)
  – loop unrolling of classic Oral Messages algorithm
  – Inspired by Draper FTP
• Distributed Diagnosis (Group Membership)
  – Initially adapted MAFT algorithm to SPIDER topology
    • Depends on Interactive Consistency protocol
  – Verification process suggested more efficient protocol
    • Improved protocol due to Alfons Geser
    • Suggested further generalizations
• Clock Synchronization
  – adaptation of Srikanth & Toueg protocol to SPIDER topology
  – Corresponds to Davies & Wakerly approach

Recap from last year

• All SPIDER fault-tolerance requirements may be realized using a repeated execution of a single abstract protocol
• Basic operation is single stage middle value select
  – Useful for readmission of failed nodes
• Two stage middle value select ensures validity and agreement properties for Interactive Consistency, Distributed Diagnosis, and Clock Synchronization
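
One plausible reading of the two stage exchange is two single stage rounds run back to back: sources send to a layer of relays, each relay selects a middle value, and the relays then send their selections to the final receivers, which select again. The sketch below follows that reading. The term "relay", the helper names, and the numeric values are all assumptions made for illustration; the slides do not give this level of detail. It shows one favorable case only: an asymmetric-faulty source in the first stage and fault-free relays in the second.

```python
def mvs(values):
    """Middle value of an odd-length collection."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def run_stage(sent):
    """sent[s][r] is the value that sender s delivers to receiver r.

    Returns the value each receiver selects.
    """
    receivers = range(len(sent[0]))
    return [mvs([row[r] for row in sent]) for r in receivers]


# Stage 1: two good sources and one asymmetric-faulty source feed three relays.
stage1 = [
    [10, 10, 10],   # good source, sends 10 to every relay
    [0, 15, 99],    # asymmetric-faulty source, a different value per relay
    [20, 20, 20],   # good source, sends 20 to every relay
]
relay_values = run_stage(stage1)
print(relay_values)          # [10, 15, 20]: valid, but the relays disagree

# Stage 2: every relay is assumed fault-free, so it sends one value to all receivers.
stage2 = [[v, v, v] for v in relay_values]
final_values = run_stage(stage2)
print(final_values)          # [15, 15, 15]: all receivers agree
```

The first stage keeps every relay inside the range of the good sources, and the second stage turns that into agreement because no relay behaves asymmetrically.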

Single Stage Middle Value Select

[Figure: three sources send x, y, and z to three receivers; each receiver computes mvs(x,y,z).]

mvs(a,b,c) selects middle value from set {a, b, c}
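
As a minimal sketch, the three-argument selection can be written as a sort followed by picking the middle element. Python is used here only for illustration; the actual design is in hardware.

```python
def mvs(a, b, c):
    """Return the middle value of a, b, and c (the median of three)."""
    return sorted((a, b, c))[1]


print(mvs(7, 2, 5))   # 5
print(mvs(4, 4, 9))   # 4
```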

Single Stage Middle Value Select Properties

• Validity: If there is a majority of good sources, then all good receivers select a value in the range of the good sources
• Agreement Propagation: If all good sources agree, and form a majority, then all good receivers will agree
• Agreement Generation: If there are no asymmetric-faulty sources, then all good receivers will agree
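
The three properties above can be spot-checked by brute force for the smallest interesting case: two good sources, one arbitrarily faulty source, and three good receivers. This is bounded testing over a small value domain, not a substitute for the PVS proofs, which cover all configurations and values. The function name `check_single_stage` and the domain size are choices made for this sketch.

```python
from itertools import product


def mvs(a, b, c):
    """Middle value of three."""
    return sorted((a, b, c))[1]


def check_single_stage(domain):
    """Check all three properties for every combination of values in domain.

    x and z come from the two good sources; a, b, c are what the faulty
    source delivers to receivers 1, 2, and 3. Returns the number of cases.
    """
    cases = 0
    for x, z, a, b, c in product(domain, repeat=5):
        outs = [mvs(x, f, z) for f in (a, b, c)]
        # Validity: good sources are a 2-of-3 majority.
        assert all(min(x, z) <= o <= max(x, z) for o in outs)
        # Agreement Propagation: the good sources agree.
        if x == z:
            assert outs == [x, x, x]
        # Agreement Generation: the faulty source is symmetric.
        if a == b == c:
            assert len(set(outs)) == 1
        cases += 1
    return cases


print(check_single_stage(range(4)))   # 1024 cases checked
```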

Single Stage Middle Value Select (Validity)

[Figure: two good sources send x and z; a source with any fault sends a, b, and c to the three receivers, which compute mvs(x,a,z), mvs(x,b,z), and mvs(x,c,z).]

min(x,z) ≤ mvs(x,?,z) ≤ max(x,z)

No guarantee of agreement!
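
A concrete instance makes the point. The values below are hypothetical: the good sources send 3 and 8, and the faulty source sends a different value to each receiver.

```python
def mvs(a, b, c):
    """Middle value of three."""
    return sorted((a, b, c))[1]


x, z = 3, 8                      # values from the two good sources
faulty_sends = (-5, 6, 100)      # what the faulty source shows each receiver

selected = [mvs(x, v, z) for v in faulty_sends]
in_range = all(min(x, z) <= s <= max(x, z) for s in selected)
agree = len(set(selected)) == 1

print(selected)    # [3, 6, 8]
print(in_range)    # True: validity holds
print(agree)       # False: the receivers disagree
```

Every receiver stays within the range of the good sources, yet no two receivers pick the same value.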

Single Stage Middle Value Select (Agreement Propagation)

[Figure: two good sources both send x; a third source has any fault. Each of the three receivers computes mvs(x,?,x) = x.]

Single Stage Middle Value Select (Agreement Generation)

[Figure: two good sources send x and z; a symmetric-faulty source sends the same value a to every receiver. Each of the three receivers computes mvs(x,a,z).]

Current Efforts

• Constructing new PVS proofs of all protocols based on generalized middle value select
  – Have to address conflict between mathematical generality and engineering utility
  – Exploiting structure to further generalize diagnosis protocol
    • Support a flexible group membership policy
    • Non-existence of ideal policy established this summer by Beth Latronico (NIA Intern)
• Adding transient fault recovery capabilities
  – to protocols, reliability model, and formal proofs
  – to lab prototypes

Current Efforts (2)

• Evaluating commercial embedded real-time operating systems for use on SPIDER Processing Elements
• Evolving requirements for Processing Elements
  – Adapt/extend existing embedded real-time operating system
    • Time and Space Partitioning
    • Fault-tolerance middleware
  – Dynamic computation of communication schedules

Current Efforts (3)

• Building up PVS library of reusable fault-tolerance results
  – SPIDER protocols expressed within this framework
    • Framework supports other network topologies
  – Improved generic clock synchronization properties
    • Improved accuracy results (tighter bounds)
    • Cleaner structure for precision results
  – Proof framework for general approximate agreement protocols (clock synchronization is special case)
  – Results generalized to accommodate weaker fault assumptions (including Azadmanesh & Kieckhafer model of strictly omissive asymmetric faults)
• Preliminary support for wireless fault models

Additional Resources

• A Conceptual Design for a Reliable Optical BUS (ROBUS); Paul Miner, Mahyar Malekpour, and Wilfredo Torres; in Proceedings 21st Digital Avionics Systems Conference (DASC) 2002
• A New On-Line Diagnosis Protocol for the SPIDER Family of Byzantine Fault Tolerant Architectures, Alfons Geser and Paul Miner, NASA/TM-2003-212432
• A Comparison of Bus Architectures for Safety-Critical Embedded Systems, John Rushby, NASA/CR-2003-212161