a verified design of a fault-tolerant clock ... · pdf filea verified design of a...
TRANSCRIPT
NASA Technical Memorandum107568
A Verified Design of a Fault-Tolerant ClockSynchronization Circuit: PreliminaryInvestigations
Paul S. Miner
March 1992
National Aeronautlcs and
Space Administration
Langley Research CenterHampton, VA 23665
(NASA-Tq-10756_) A VERIF_ OESIGN OF A
FAULT-TOLERANT CLOCK SYflCH_ON!ZATIONCIRCUIT: PR_LIHINARY INVESTIGATIONS (NASA)
IOO p CSCL 1ZA
G3159
N92-23559
Uncl as
0085775
https://ntrs.nasa.gov/search.jsp?R=19920014316 2018-05-15T17:18:55+00:00Z
Abstract
Schneider [I] demonstrates that many faull-tolerant clock synchronization
algorithms can be represented as refinements of a single proven correct
paradigm. Shankar [2] provides a mechanical proof (using EHD_I [3]) that
Schneider's schema, achieves Byzantine fault-tolerant clock synchronization
provided that eleven constraints are satisfied. Some of the constraints are
assumptions about physical properties of the system and ca.n not be estab-
lished formally. Proofs are given (ill EHI)M) that the fault-toleranl midpoint
convergence function satisfies three of these constraints. This paper presents
a hardware design, implementing the fault-tolerant midpoint function, which
will be shown to satisfy the remaining constraints. The synchronization cir-
cuit will recover completely from transiellt faults provided the ma:
Contents
1 Introduction 1
2 Description of the Reliable Computing Platform 2
3 Clock Definitions 4
3.1 Sha_lkar's Notation ........................ 5
3.2 Sha_l'l_r's Conditions ....................... 6
4 Fault-Tolerant Midpoint as an Instance of Schneider's Schema
4.1 Translation Invariance ...................... 9
4.2 Precision Enhancement ..................... 9
4.3 Accuracy Preservation ...................... 12
4.4 EHDM Proofs of Convergence Properties ............ 13
5 Proposed Verification 14
5.1 Informal Description ....................... 145.2 Correctness Criteria ........................ 1?
6 Transient Recovery 17
6.1 Single Fault Scenario ....................... 186.2 Genera] Case ........................... 18
6.3 Comparison with Other Approaches .............. 18
7 Initial Synchronization l0T.1 Mechanisms for Initialization .................. 19
7.2 Comparison to Other Approaches ................ 20
8 Concluding Remarks 20
A Proof Summary 22
B
C
BTEX printed EHDM Modules 23
Proof Chain Status 79
C.1 Translation Invariance ...................... 79
C.2 Precision Enhancement ..................... 81
C.3 Accuracy Preservation ...................... 88
ill
gnL__ |#II_MIIONALLYBLA_
pRF...CEDiNG PAC_ BLANK NOT F[LMIED
?
Introduction
NA.qA La||gley Research Cenler is currenlly involved in lhe developmen! of
a forn-lally verified Re]iabie Computing Plal]Th'm (R(!P) for real-lime digital
fllghl control systems [4, 5.6]. J_tll Oft,Oil quoled requiremenl for critical sys-
|0111g (_liiployed for ctvll air lransporl is a probability of cataslrophic failure
less ihan i0 -_ for a 10 hour flight [7]. _hwe t'alqure rates for dlgilal devices
are on the ordgt" of 10-" per hour [s_], hardware redundancy is required to
achieve the deslred level of reliabilily. V','hile there are many ways of incor-
poratlng rc,dundanl hardware, the approach taken in the RCP is the use of
identical redundanl channels with exact malch voting (see [-1, 5] and [6]).
A critical function in a fault-toleranl syslem is thai of synchronizing
th# clocks of the redunda.nt computing elements. The clocks musl be syn-
chronized in order to provide coordinated action among the redundant sites.
Although perfec! synchronization is no! possible, clocks can be synchronized
whhin a ._mall skew. The purpose of this work is lo provide a mechanically
tter_ged design ot" a faull-tolerant clock sync]tronization circuit.
The faull-tolerant clock sy||chronizaiion circuit is intended to be part
of a verified hardware base for the RCP. The prhnary intent of the RCP isto pl'ovide a Verified fault-tblerant system which is proven to recover from
a bounded number of translen_ faults. The current model of the system
_tggtlltlog (among other things) that tile clocks are synchronized within a
bolmded skow [.5]. It is crucial that the clock synchronization circuitry also
be able tO reeovgr from transient faults. Originally, Lamport and Melliar-
gm_th's Interactive Convergence Algorithm (ICA) [9] was to be the basis
l'of tile clock sVlic]|rol_lzat]on hardware, the primary reason being the ex-
lsteii_ of a mechanical proof that the algorithm is correct [10]. However,
modii_catlons to ICA to achieve transient fault recovery are unnecessarily
_Oltil)llcaied. The fauh-toleranl midpoint ',algorithm of [11] is more readily
adapted to transient recovery.
The synchrvnizaiion circ|Jlt is designed lo tolerate arbitrarily malicious
pdffrl_ill&it, ]nteftn]ttent and tral_sien! hardware faults. A fault is defined
ttg tl phy._ical perturbation ahering the function implemented by a physical
dgvi_$, intermittent faults are permanent physical faults which do not con-
glaiitiy alter the fulletion of a device (e.g. a loose wire}. A transient fault is
a _ile shot short duration physical perturbation of a device (e.g. caused by
a _t_smic ray el_ other eleCtrOrhagnetic effect). Once the source of the fault
is fg/fi6_ed, the device can function correctly.
Most pf6cil_s of fauli-tolerant clock synchronization algorithms are by
inductionon tile number of synchronizalion intervals. Usually. tile basecase of the induction, the initial skew. is assumed. The descriptions in
[1. 2. 9, 10] all assume initial synchronization with no mention of how it
is achieved. Others. including [11, 12. 13] and [14] address the issue of
iaitial synchronization and give descriptions of how it is achieved in varying
degrees of detail. In proving an inaplementation correct, the details of initial
synchronization cannel be ignored. If the initializati01 _ scheme is robust
enough, it can also serve as a recovery mechanism from multiple correlated
transient failures (as is noted in [14]). _ ..... ;_ _
Scl,neider [1] de moltstrates that many faull-tolerant clock synchroniza-tion algorithms can be represented as refinements of a single proven correct
paradigm. Shankar [2] provides a mechanical proof (using EaDbl [3]) that
Schneider's schema achieves Byzantine fault-lolerant clock synchronization,
provided that eleven constraints are.... satisfied. Some of the co_nstraints are
a.ssumptions about phvsical properties of the svstem and can..no! be es-tabllshed formally. This paper propo__ses_a=19a!'dware so!_ ution t__ .o tI)e clock ....
synchronization problem which will be sh0_wn t sa.tlsfy flh e rem_nlng conz: -
straints. . o -- -=-:_-- ._
This paper discusses preliminary results in t!m verification o.f the-d-.e-- ......
sign. The fault-toler_nl midpoinl funclion is formally proven (in EHDM)-to--
satisfy the properties of translation invariance_ precisio)_e_nhancement _ a_d
accuracy preservation. 1 A register transfer level design is prescLlted which
implements the synchronization algorithm. An argument for transient re-
covery from a single fault is presented and issues relating to the more general
case are raised. Finally, the approach for achieving initial synchronization
is discussed. The notation used here is from Shankar [2].
2 Description of the Reliable Computing Plat-
form
This section summarizes the key details of the Reliable Computing Platform
to establish a context, for the clock synchronization circuit, it is included
here for completeness. The material in this section is paraphrased from
Butler and DiVito [5]. The interested reader should consult [5] for nmredetailed information.
_The_e properties will be defined in the section describing the fault-tolerant midpointconvergence function. .....
[ Uniproees._or ,S'y._tem M_Mc I ( US')I
[Fault-tolerant Replicated S'ynchronous MMel (RS')
I[ Fa,,lt-tohtYmt Distributed ,qynchro,wus Model (DS)J
1[Fault-tolet.nt Distributed .4 sy,whro,,ous Mod,l (DA)]
I[ Hardwart/Soflwor_ l,,tpl, m_r,,tatio,,]
Figure 1: Hierarchical Specification of tile Reliable Computing Platform.
NASA-Langley is currently involved in the development of a formally
verified Reliable Computing Platform for reai-time control [4, .5]. A pri-
mar)' goal is to provide a fault-tolerant computing base that appears to
the application programmer as a single ultra-reliable computer. To achieve
this. it is necessary to conceal implementation details of the system. Some
characteristics of the system are as follows [5]:
"the system is non-reconfigurable
the system is frame-synchronous
the scheduling is static, non-preemptive
internal voting is used to recover the state of a processor affected bya transient fault"
A hierarchy of models is introduced which provides different levels of ab-
straction (figure 1, taken from [5]). The top level is the view presented
to the applications programmer, i.e. an ultra-reliable uniprocessor system.The details of fault-tolerance are introduced in the lower levels. The next
two levels, replicated synchronous and distributed synchronous, introduce
the redundancy and voting requir