www.tttech.com Page 1
Ensuring Reliable Networks
Discussion of Failure Mode Assumptions for IEEE 802.1ASbt
Wilfried Steiner, Corporate [email protected]
Clock Synchronization is a core building block of many RT Systems
[Figure: network of TTE, IEEE 1588, and Ethernet (Eth) nodes with one Grand Master clock]
The local clocks in a distributed system can accurately be synchronized to each other.
Basic Questions in Fault-Tolerant Clock Synchronization
[Figure: network of TTE, IEEE 1588, and Ethernet (Eth) nodes with one Grand Master clock]
Loss of the Grand Master clock requires a changeover.
- How long does the changeover take?
- Is the changeover fault-tolerant?
- Is a malicious failure behavior of the Grand Master clock tolerated?
Fault-Tolerance through Redundancy
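Situation: What is the color of the house?
[Figure: observers answer the question under three conditions labeled No Failure, Fail-Silence Failure, and Fail-Consistent Failure. Answers shown in the figure, in extraction order: "Green", "Don’t Know", "Green", "Red", "Green", "Green".]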
Failure Mode: Fail-Silence
When the current grandmaster clock fails, gPTP ensures that another clock becomes the new grandmaster
• if such a clock exists in the system, which we assume in the following
This means that there is some fail-over time, after which the system runs stably again, synchronized and syntonized to the new grandmaster clock.
The fail-silence failure mode is tolerated
• when the original grandmaster clock fails permanently.
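The fail-over described above can be sketched as follows. This is a minimal illustration and not the gPTP best master clock algorithm: the `Clock` record, its `priority` field, and `elect_grandmaster` are hypothetical names, and the election simply picks the lowest priority value among the clocks that are still alive.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clock:
    name: str
    priority: int       # hypothetical: lower value wins the election
    alive: bool = True  # a fail-silent clock simply stops participating


def elect_grandmaster(clocks: list[Clock]) -> Optional[Clock]:
    """Return the best clock that is still alive, or None if none is left."""
    candidates = [c for c in clocks if c.alive]
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c.priority, c.name))


clocks = [Clock("gm-a", 1), Clock("gm-b", 2), Clock("gm-c", 3)]
before = elect_grandmaster(clocks)

# gm-a fails silently and permanently.
clocks_after_failure = [Clock("gm-a", 1, alive=False), *clocks[1:]]
after = elect_grandmaster(clocks_after_failure)

print(before.name, "->", after.name)
```

The sketch deliberately ignores the fail-over time: in a real network the remaining clocks first have to detect that the grandmaster has gone silent.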
Failure Mode: Fail-Silence
What happens when the original grandmaster clock fails transiently or intermittently?
• e.g., the original grandmaster clock periodically reboots
Will the network oscillate between the original and a secondary grandmaster clock?
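A small simulation makes the concern concrete. It assumes a hypothetical two-clock network in which the preferred grandmaster is up for `up_ticks` and then down for `down_ticks`, repeatedly, and in which the network always follows the best clock that is currently alive. None of these names or timings come from IEEE 802.1AS.

```python
from __future__ import annotations


def grandmaster_trace(total_ticks: int, up_ticks: int, down_ticks: int) -> list[str]:
    """Per tick, name the clock the network follows.

    Hypothetical model: 'primary' is preferred whenever it is alive,
    otherwise 'secondary' takes over. 'primary' reboots periodically.
    """
    period = up_ticks + down_ticks
    trace = []
    for tick in range(total_ticks):
        primary_alive = (tick % period) < up_ticks
        trace.append("primary" if primary_alive else "secondary")
    return trace


def count_changeovers(trace: list[str]) -> int:
    """Count how often the grandmaster role moves between clocks."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)


trace = grandmaster_trace(total_ticks=20, up_ticks=3, down_ticks=2)
print("".join("P" if gm == "primary" else "s" for gm in trace))
print("changeovers:", count_changeovers(trace))
```

In this model every reboot of the primary causes two changeovers, so the network keeps oscillating for as long as the primary keeps rebooting.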
Model-Based Development I
Development of fault-tolerant clock synchronization algorithms is non-trivial:
• synchronization proof is hard for certain failure modes
• completeness has to be proven as well
• i.e., we need to prove that we have covered all possible failure scenarios
Therefore, formal methods are used in the development and in the verification of such algorithms.
• Theorem Proving is the process of developing a deductive proof, typically interactive with a proof assistant.
• Model Checking is an automated approach: a tool explores the states of a model and either confirms a property or returns a counterexample.
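To illustrate what automated checking means, the sketch below performs an explicit-state breadth-first search over a toy model. The two-node startup rules (INIT, LISTEN, COLDSTART, ACTIVE) are a hypothetical simplification chosen for this example; they are not the SAE AS6802 or IEEE 802.1AS model, and production model checkers are far more capable.

```python
from __future__ import annotations

from collections import deque


def step(own: str, other: str) -> str:
    """Next local state of one node (hypothetical toy startup rules)."""
    if own == "INIT":
        return "LISTEN"
    if own == "LISTEN":
        # Integrate if the other node already started, else cold-start.
        return "ACTIVE" if other in ("COLDSTART", "ACTIVE") else "COLDSTART"
    return "ACTIVE"  # COLDSTART -> ACTIVE, ACTIVE stays ACTIVE


def successors(state: tuple[str, str]) -> list[tuple[str, str]]:
    """Interleaving semantics: exactly one node takes a step."""
    a, b = state
    return [(step(a, b), b), (a, step(b, a))]


def check_invariant(initial, invariant):
    """Search all reachable states.

    Returns None if the invariant holds everywhere ("ok"), otherwise a
    shortest path from the initial state to a violating state
    ("no, because ...").
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def never_both_coldstart(state):
    return state != ("COLDSTART", "COLDSTART")


def never_active_next_to_init(state):
    return set(state) != {"ACTIVE", "INIT"}


ok_result = check_invariant(("INIT", "INIT"), never_both_coldstart)
counterexample = check_invariant(("INIT", "INIT"), never_active_next_to_init)
print(ok_result)
print(counterexample)
```

The first property holds in every reachable state, so the search returns `None`. The second does not, and the returned path is the explanation: it lists the steps that lead from the initial state to the violation.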
Model-Based Development II
[Figure: two startup state machines feed a Model Checker. The first has the states INIT (1), LISTEN (2), COLDSTART (3), ACTIVE (4); the second has the states INIT (1), LISTEN (2), STARTUP (3), SILENCE (4), Tentative ROUND (5), Protected STARTUP (6), ACTIVE (7). Labels next to the Model Checker: "e.g., IEEE 802.1ASbt", "e.g., fail-silence", "e.g., system will sync". The Model Checker answers "ok" or "no, because…".]
Example: SAE AS6802
First Byzantine fault-tolerant clock synchronization algorithm verified by model-checking only.
Basic algorithm addresses only synchronization of the clocks.
Extension for syntonization (we call it clock-rate correction) has been modeled and studied as well.
Fault-Tolerant Clock Synchronization
[Figure: network of TTE, IEEE 1588, and Ethernet (Eth) nodes with four Grand Master clocks]
Fault-tolerant synchronization services are needed for establishing a safe and highly available synchronized time.
SAE AS6802 Clock Synchronization Algorithm (the case of five SMs is updated in the standard)
Algorithm Specification
Byzantine Failure Tolerance
A Byzantine failure occurs as the combination of a fail-arbitrary synchronization master (end station) and an inconsistent-omission faulty compression master (bridge).
Rate-Correction with Stable Clock Drifts
Store 1st state-correction term
Store 2nd state-correction term
Calculate and apply rate-correction term
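The three steps on this slide can be sketched numerically. The sketch assumes a local clock with a constant drift relative to the reference, a fixed resynchronization interval, and a rate-correction term computed as the mean of the two stored state-correction terms divided by the interval. That formula and all names (`DRIFT_PPM`, `RESYNC_INTERVAL_S`, `run_interval`) are illustrative assumptions, not taken from SAE AS6802.

```python
DRIFT_PPM = 50.0           # assumed constant drift of the local oscillator
RESYNC_INTERVAL_S = 0.01   # assumed resynchronization interval (10 ms)


def run_interval(rate_correction_ppm: float) -> float:
    """Return the state correction (in seconds) needed after one interval.

    The local clock runs fast by DRIFT_PPM; the rate correction slows it
    down by rate_correction_ppm. Whatever error remains at the end of the
    interval has to be removed by a state correction.
    """
    effective_drift_ppm = DRIFT_PPM - rate_correction_ppm
    accumulated_error_s = effective_drift_ppm * 1e-6 * RESYNC_INTERVAL_S
    return -accumulated_error_s


# Steps 1 and 2: store two consecutive state-correction terms.
first = run_interval(rate_correction_ppm=0.0)
second = run_interval(rate_correction_ppm=0.0)

# Step 3: derive the rate-correction term from the stored terms and apply it.
mean_correction_s = (first + second) / 2
rate_correction_ppm = -mean_correction_s / RESYNC_INTERVAL_S * 1e6

residual = run_interval(rate_correction_ppm)
print(f"state corrections: {first:.3e} s, {second:.3e} s")
print(f"rate correction:   {rate_correction_ppm:.1f} ppm")
print(f"residual state correction: {residual:.3e} s")
```

With a stable drift the residual state correction is essentially zero, because the two stored terms describe the drift that is still present when the rate correction is applied.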
Rate-Correction with Unstable Clock Drifts
Store 1st state-correction term
Store 2nd state-correction term
Calculate and apply rate-correction term
Coincidentally, the speed of the oscillator also changes
What are the failure modes of IEEE 802.1ASbt?
Permanent fail-silence?
Transient/intermittent fail-silence?
Fail-consistent faulty?
• e.g., a grandmaster providing faulty time
Inconsistent faulty bridges?
• e.g., a bridge forwarding time information only on some ports
Byzantine faulty grandmaster clocks?
Backup
Static vs. Dynamic Systems
Situation: What is the color of the house?
Static situation: one truth
Situation: What is the color of the ball?
Dynamic situation: more than one truth
Origins: Byzantine Failures
[Figure: three nodes N1, N2, N3 exchange HOT and COLD messages; N1 is marked Faulty. Tally shown in the figure: N1: COLD, N2: HOT, N3: COLD, result COLD.]
A distributed system that measures the temperature of a vessel shall raise an alarm when the temperature exceeds a certain threshold. The system shall tolerate the arbitrary failure of one node.
How many nodes are required?
How many messages are required?
In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to agree on a value. A decent body of scientific literature exists that addresses this problem of dependable systems, in particular dependable communication.
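The difficulty can be reproduced in a few lines. The sketch below assumes that each node decides by simple majority over the three reports it sees, and that the faulty node N1 sends different values to N2 and N3. The function name and the specific readings are illustrative; the temperature is assumed to sit so close to the threshold that the two correct nodes legitimately read HOT and COLD.

```python
from __future__ import annotations

from collections import Counter


def majority(reports: dict[str, str]) -> str:
    """Return the value reported by most nodes."""
    return Counter(reports.values()).most_common(1)[0][0]


# Correct nodes broadcast their own reading consistently.
n2_reading, n3_reading = "HOT", "COLD"

# The faulty node N1 is two-faced: it tells each receiver what that
# receiver already believes.
view_of_n2 = {"N1": "HOT", "N2": n2_reading, "N3": n3_reading}
view_of_n3 = {"N1": "COLD", "N2": n2_reading, "N3": n3_reading}

print("N2 decides:", majority(view_of_n2))
print("N3 decides:", majority(view_of_n3))
```

Each correct node sees a clear two-to-one majority, yet the two majorities differ, so one node raises the alarm and the other does not.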
Byzantine Clocks
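[Figure: three nodes N1, N2, N3 exchange clock readings; N1 is marked Faulty, N2 reads 00:01, and N3 reads 00:04. Two tallies are shown. First tally: N1: 00:04, N2: 00:01, N3: 00:04, result 00:04. Second tally: N1: 00:01, N2: 00:01, N3: 00:04, result 00:01. A second plot shows clock time over Real Time for a Perfect Clock, a Slow Clock, and a Fast Clock, with intervals labeled R.int.]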
In a distributed system in which all nodes are equipped with local clocks, all clocks shall become and remain synchronized. The system shall tolerate the arbitrary failure of one node.
How many nodes are required?
How many messages are required?
In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to bring their clocks into close agreement. A decent body of scientific literature exists that addresses this problem of fault-tolerant clock synchronization.
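The two tallies in the figure can be replayed with a median vote. Using the median is an assumption made for this illustration, since the slide does not name the voting function, and the readings are assumed to be in `MM:SS` format.

```python
from __future__ import annotations

from statistics import median_low


def to_seconds(reading: str) -> int:
    """Convert a reading in the assumed 'MM:SS' format into seconds."""
    minutes, seconds = reading.split(":")
    return int(minutes) * 60 + int(seconds)


def vote(view: dict[str, str]) -> str:
    """Pick the median reading among those a node has received."""
    chosen = median_low(to_seconds(r) for r in view.values())
    return f"{chosen // 60:02d}:{chosen % 60:02d}"


# N1 is faulty and sends a different reading to each correct node.
first_tally = {"N1": "00:04", "N2": "00:01", "N3": "00:04"}
second_tally = {"N1": "00:01", "N2": "00:01", "N3": "00:04"}

print(vote(first_tally), vote(second_tally))
```

The two tallies produce 00:04 and 00:01, so the vote does not bring the clocks of the two correct nodes into agreement.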