clocking and timing in fault-tolerant systems-on-chip

Clocking and Timing in Fault-Tolerant Systems-on-Chip

Andreas Steininger

Outline

• The Clock as a Blessing
• The Clock as a Curse
• Alternative Synchronization Schemes
  - GALS
  - fully asynchronous
  - the DARTS approach
• Conclusion

Contributors to this Work

The DARTS project team

TU Vienna: Gottfried Fuchs, Matthias Fuegger, Ulrich Schmid, Thomas Handl

RUAG Space: Gerald Kempf, Manfred Sust, Wolfgang Zangerl

The Need for Fault Tolerance

miniaturization is key to progress in VLSI
=> smaller structures
=> lower voltage swing
=> smaller critical charge
=> higher operating frequencies

…result in higher susceptibility to faults (SET, EMI, …)

=> cannot avoid faults, need to tolerate them

The Role of Time

“The only reason for time is so that everything doesn’t happen at once”, Albert Einstein


The Need for Clocking

activities need to be co-ordinated
• on system level (braking of wheels, …)
• on algorithmic level (consensus, …)
• on communication level
• on logic level (state machine switching, …)

co-ordination in the time domain (synchronization) is an efficient way to attain this
=> need a global notion of time (discrete „ticks“)

The Quality of Synchronization

[Figure: local time (number of ticks) plotted against real time; the spread between the local clocks is the precision π]

Typical Precision Values

on system level: µs … ms
on algorithm level: µs … ms
on communication level: ns … ms
on logic level: ps … ns

Synchronization Requirements

phase synchronisation (for „hardware clock“ on logic level)

clock synchronisation (for distributed time base on algorithmic level)

1 µs is excellent precision for a distributed clock

at 1 GHz this means 360,000° phase shift
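The phase-shift figure can be checked with a short calculation: a time offset at a given clock frequency spans (offset × frequency) clock periods, each worth 360°. The sketch below is illustrative only; the function name and the choice of nanoseconds and MHz as units (picked here to keep the arithmetic exact) are assumptions, not part of the slides.

```python
from fractions import Fraction

def phase_shift_degrees(offset_ns: int, clock_mhz: int) -> Fraction:
    """Phase shift in degrees that a time offset represents at a given clock frequency."""
    # ns * MHz = 1e-9 s * 1e6 /s = 1e-3 periods, hence the division by 1000
    periods = Fraction(offset_ns * clock_mhz, 1000)
    return periods * 360

if __name__ == "__main__":
    # 1 µs offset at 1 GHz: 1000 full clock periods
    print(phase_shift_degrees(1_000, 1_000))
    # 1 ms offset at 1 GHz: a million clock periods
    print(phase_shift_degrees(1_000_000, 1_000))
```

Under these assumptions a 1 µs offset at 1 GHz comes out as 360,000° and a 1 ms offset as 360,000,000°. Either way, a precision that is excellent for a distributed time base is far too coarse for phase synchronisation on logic level.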

Globally Synchronous Design

• whole design is „isochronic“ („perfect“ precision)
  - time conveyed by clock transitions
  - perfect co-ordination of all activities

• very efficient design
  - can assume consistent states
  - high level of abstraction

• very efficient implementation:
  - single crystal oscillator
  - single control line (clock net)

„Isochronic“ Regions ?

speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns

[Figure: a reference clock (Ref) and 1 GHz, 4 GHz and 8 GHz clock signals over a distance of 2 cm]
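To see why a 2 cm distance challenges the „isochronic“ assumption, the propagation delay can be expressed as a share of the clock period. The sketch below uses the 20 cm/ns figure from the slide; the function name is hypothetical, and the model assumes pure signal propagation, ignoring the buffer and RC delays that make real clock nets slower still.

```python
SIGNAL_SPEED_CM_PER_NS = 20.0  # 2 x 10^8 m/s, the figure quoted on the slide

def delay_as_fraction_of_period(distance_cm: float, clock_ghz: float) -> float:
    """Wire delay over the given distance, expressed as a fraction of one clock period."""
    delay_ns = distance_cm / SIGNAL_SPEED_CM_PER_NS
    return delay_ns * clock_ghz  # the period in ns is 1 / clock_ghz

if __name__ == "__main__":
    for ghz in (1, 4, 8):
        share = delay_as_fraction_of_period(2.0, ghz)
        print(f"{ghz} GHz: {share:.0%} of a clock period")
```

With these numbers the 0.1 ns delay over 2 cm is a tenth of the period at 1 GHz but most of a period at 8 GHz, so the two ends of the wire can no longer be treated as seeing the same clock edge.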

The Variation Problem

[Figure: the designer works with a system model and projected conditions, covering the unknown with worst-case assumptions and safety margins; the user has the actual system (with imperfections) and the actual conditions]

Timing is completely fixed after design.
No way to react to actual conditions & system („PVT variations“)

Fault-Tolerant Architectures

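Duplication & Comparison
[Figure: two FUs feed a comparator (=?), which raises ERR on a mismatch]

Triple-Modular Redundancy
[Figure: three FUs feed a voter, which produces the output Y]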
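The two architectures differ in what they can do with a disagreement: duplication can only detect it, while triple-modular redundancy can mask a single faulty replica. A minimal behavioural sketch, with hypothetical function names and plain Python values standing in for the FU outputs:

```python
from collections import Counter

def compare(a, b):
    """Duplication & comparison: pass on the first output and flag any mismatch (ERR)."""
    return a, a != b

def vote(a, b, c):
    """TMR majority voter: return the value at least two replicas agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # three different outputs: more than one replica must be faulty
        raise ValueError("no majority among replica outputs")
    return value

if __name__ == "__main__":
    print(compare(3, 4))   # mismatch is detected but not corrected
    print(vote(3, 3, 4))   # single faulty replica is outvoted
```

Both functions compare outputs that are assumed to belong to the same computation step, which is exactly the replica determinism that lock-step clocking is meant to provide.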

Lock-Step Operation

single clock
• single point of failure
• good replica determinism

[Figure: three FUs driven by one clock step from state „3“ to state „4“ together and feed a voter producing Y]

Lock-Step Operation

independent clocks
• single fault tolerant
• bad replica determinism

[Figure: three FUs with independent clocks step from state „3“ to state „4“ at different times and feed a voter producing Y]

Fault-Tolerant HW-Clocking

[Figure: TMR arrangement of three FUs and a voter producing Y, with three clock voters (v); a second version of the figure adds delay elements (D) and question marks]

The Charm of SoCs

billions of transistors fit on one die
=> structuring into (IP) modules: „System-on-Chip“

BUT:
• large clock distribution networks => „isochronic“??
• FT clocking does not work with large skew
• may need individual clocks for function modules

=> clock-synchrony neither attainable nor desirable

Co-ordination of Data Exchange

[Figure: SRC and SNK connected through f(x)]

When can SNK use its input?
=> When it is valid and consistent

When can SRC apply the next input?
=> When SNK has consumed the previous one

The Synchronous Approach

[Figure: SRC and SNK connected through f(x)]

co-ordination based on (global) time

Alternative: Asynchronous Design

[Figure: SRC and SNK connected through f(x)]

co-ordination based on handshaking

REQ: „Data word valid, you can use it“

ACK: „Data word consumed, send the next“
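The REQ/ACK exchange can be sketched as two small state machines that act only when the handshake signals permit. The slides do not say which protocol variant is meant; the sketch below assumes a four-phase (return-to-zero) handshake, and all names (`Channel`, `src_step`, `snk_step`, `transfer`) are hypothetical. The random scheduler stands in for arbitrary relative speeds of the two sides.

```python
import random

class Channel:
    """Wires between SRC and SNK: one data word plus the two handshake signals."""
    def __init__(self):
        self.req = False
        self.ack = False
        self.data = None

def src_step(ch, pending):
    """SRC acts only when the handshake allows it."""
    if not ch.req and not ch.ack and pending:
        ch.data = pending.pop(0)  # drive the data first ...
        ch.req = True             # ... then signal "data word valid"
    elif ch.req and ch.ack:
        ch.req = False            # SNK has consumed the word: withdraw REQ

def snk_step(ch, received):
    """SNK acts only when the handshake allows it."""
    if ch.req and not ch.ack:
        received.append(ch.data)  # safe to read: REQ says the data is valid
        ch.ack = True             # signal "data word consumed"
    elif not ch.req and ch.ack:
        ch.ack = False            # back to idle: SRC may send the next word

def transfer(words, seed=0):
    """Run both sides in a random order until every word has been handed over."""
    rng = random.Random(seed)
    ch, pending, received = Channel(), list(words), []
    while pending or ch.req or ch.ack:
        if rng.random() < 0.5:
            src_step(ch, pending)
        else:
            snk_step(ch, received)
    return received

if __name__ == "__main__":
    print(transfer([0xFF14, 0x0001, 0x0002]))
```

Whatever order the scheduler picks, the words arrive once each and in order, because neither side ever acts on an assumption about the other side's timing.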

Async. Design – Advantages

• closed-loop control makes timing much more robust and adaptive to PVT variations
• no need for worst-case timing
• local handshakes replace global clock
• activity only when needed
• beneficial for EMI
• tends to stop operation in case of fault

Async. Design – Disadvantages

• Need to handle race between REQ and data
  - Solution 1: „Bundled Data“
  - Solution 2: „Delay Insensitive“ (Coding), with completion detection
• significant HW overhead (coding, delay elements)
• „adaptive“ timing not as predictable
• more difficult to design
• classical fault-tolerance schemes not applicable
• tends to stop operation in case of fault

[Figures: SRC and SNK connected through f(x), shown for each solution]
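One common way to realise a delay-insensitive code with completion detection is dual-rail encoding, where each bit travels on two wires and the receiver can tell from the data alone when the whole word has arrived. The slides do not name the code used, so dual-rail is an assumption here, as are the function names.

```python
SPACER = (0, 0)  # neither rail asserted: this bit has not arrived yet

def encode_dual_rail(bits):
    """Encode each bit as (true_rail, false_rail): 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]

def is_complete(rails):
    """Completion detection: every bit position has exactly one rail asserted."""
    return all(t + f == 1 for t, f in rails)

def decode_dual_rail(rails):
    """Recover the bits once the code word is complete."""
    if not is_complete(rails):
        raise ValueError("code word not complete yet")
    return [t for t, _ in rails]

if __name__ == "__main__":
    word = encode_dual_rail([1, 0, 1])
    partial = [word[0], SPACER, word[2]]   # the middle bit is still in flight
    print(is_complete(partial), is_complete(word))
    print(decode_dual_rail(word))
```

The doubled wire count and the completion detector illustrate the hardware overhead listed among the disadvantages; bundled data avoids it by delaying REQ instead.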

Best of Both Worlds

GALS: Globally Asynchronous Locally Synchronous

retain efficiency of synchronous design wherever possible: „intra-module“

use asynchronous principle where clock distribution is too cumbersome: „inter-module“

First mention in PhD thesis by Chapiro / Stanford 84

A GALS Example

[Figure: four modules, each with its own clock]
• CPU: 2 GHz
• PCI-IF: 533 MHz
• DSP: 2.7 GHz
• USB-IF: 24 MHz

Communication in GALS

Shared Memory
producer writes to memory, consumer reads from there
pro: control flow stays independent
• shared single-port memory
• true dual-port memory

Direct Messages (Data words)
move data word from producer's output register to consumer's input register
• non-buffered / buffered (FIFO-queues)
• clock fixed, data-driven or pausible

Shared Memory

decoupling of clock domains by memory acting as a third party
=> high area overhead
=> unusual

for single port memory arbitration required
• arbitration problem (unbounded delay…)
• one side may block the other at the arbiter

for multiport memory problems are confined to access to the same cell
• busy flag may become metastable
• blocking still possible for one specific address

Shared Memory

[Figure: CPU (2 GHz) and DSP (2.7 GHz) access a shared memory (address 0xff14) through arbitration]

• perfect decoupling of data path
• potential metastability problems at arbitration logic
• potential blocking through arbitration

Direct Messages

clock domain boundary is between producer's output register and consumer's input register

in general a synchronizer is needed at consumer's input
• definitely for conventional (fixed) clock
• can be avoided by data-driven / pausible clocking

control flows of producer and consumer are strongly coupled: a party that does not service its input/output register blocks the other party

buffers/queues/FIFOs can
• mitigate, but not avoid this problem (full/empty)
• compensate variations in the data rate on both sides, but not different average data rates
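The limit of buffering can be seen in a toy simulation: when the producer is faster on average, the FIFO fills and the producer stalls, however deep the FIFO is made. The sketch below is a simplified model under stated assumptions (integer time steps, strictly periodic producer and consumer, hypothetical function name), not a model of any particular FIFO design.

```python
def fifo_occupancy(depth, producer_period, consumer_period, steps):
    """Simulate a FIFO between two periodic parties.

    Returns (max_fill, producer_stalls): the highest fill level seen and the
    number of times the producer found the FIFO full.
    """
    fill = 0
    max_fill = 0
    producer_stalls = 0
    for t in range(1, steps + 1):
        if t % consumer_period == 0 and fill > 0:
            fill -= 1                      # consumer removes one word
        if t % producer_period == 0:
            if fill < depth:
                fill += 1                  # producer adds one word
            else:
                producer_stalls += 1       # FIFO full: producer is blocked
        max_fill = max(max_fill, fill)
    return max_fill, producer_stalls

if __name__ == "__main__":
    print(fifo_occupancy(depth=4, producer_period=2, consumer_period=2, steps=1000))
    print(fifo_occupancy(depth=4, producer_period=2, consumer_period=3, steps=1000))
    print(fifo_occupancy(depth=64, producer_period=2, consumer_period=3, steps=1000))
```

With equal rates the FIFO never holds more than one word in this model. With a faster producer, a deeper FIFO only postpones the first stall; it does not remove it.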

Direct Messages

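data moving over clock domain boundary
=> metastability problems
=> need to insert handshake
…with synchronizers and (optional) buffers

[Figure: CPU (2 GHz) passes a data word (0xff14) to DSP (2.7 GHz); synchronizers (S) sit on the signals crossing the boundary]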

Arbiter: Principle

purpose:
○ manage concurrent requests to shared resource

method:
○ handle pairs of request_in / grant_out
○ requests may arrive in any order
○ arbiter must activate only one grant_out at a time (respond to the first requester)

Mutual Exclusion (MUTEX)

problem:
○ resolve concurrent requests
=> metastability problem
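The required behaviour (at most one grant at a time, first requester wins) can be captured in a small behavioural model. This is a sketch with hypothetical names; it deliberately replaces the metastable resolution of truly simultaneous requests by a random choice and does not model the unbounded time that resolution may take in a real circuit.

```python
import random

class Mutex:
    """Behavioural model of a two-input arbiter."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.grant = None  # None, 1 or 2

    def update(self, r1, r2):
        """Apply the current request levels and return (g1, g2)."""
        # the current owner keeps the grant until it withdraws its request
        if self.grant == 1 and not r1:
            self.grant = None
        if self.grant == 2 and not r2:
            self.grant = None
        if self.grant is None:
            if r1 and r2:
                # simultaneous requests: the outcome is arbitrary
                self.grant = self.rng.choice((1, 2))
            elif r1:
                self.grant = 1
            elif r2:
                self.grant = 2
        return self.grant == 1, self.grant == 2

if __name__ == "__main__":
    m = Mutex(random.Random(0))
    print(m.update(True, False))   # requester 1 is first
    print(m.update(True, True))    # requester 2 has to wait
    print(m.update(False, True))   # 1 releases, 2 is granted
```

The waiting requester in the second call is the blocking effect mentioned for shared-memory arbitration: one side can hold up the other for as long as it keeps its request active.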

Arbiter: Circuit

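[Figure: arbiter circuit with request inputs R1, R2, a MUTEX element (SR-latch) with internal outputs G1’, G2’, and a „metastability filter“ (e.g., hi-threshold inverter) producing the grants G1, G2; an accompanying plot shows the latch output Vout,FF over time t together with the levels Vmeta and Vth,inv]

[from D. J. Kinniment „Synchronization and Arbitration in Digital Systems“, Wiley]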

Arbiter: Operation

[Figure: arbiter circuit and waveforms with signals R1, R2, G1’, G2’, G1, G2]

Muller C-Element

[Figure: C-element symbol (inputs a, b; output y) and an implementation driving the set and reset inputs of an RS latch]

IF a = b
THEN y = a
ELSE hold y
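The IF/THEN/ELSE rule translates directly into a tiny stateful model. The class name and the initial output value of 0 are assumptions made for this sketch.

```python
class CElement:
    """Muller C-element: the output follows the inputs when they agree, else it holds."""

    def __init__(self, y=0):
        self.y = y  # stored output value

    def update(self, a, b):
        if a == b:
            self.y = a   # both inputs agree: output takes their value
        return self.y    # inputs differ: output keeps its previous value

if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, c.update(a, b))
```

The output rises only after both inputs have risen and falls only after both have fallen, which is what makes the element useful for joining handshake signals.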

Muller C-Element: Circuit

[Alain Martin, Caltech]

Data-Driven Clocking

Principle:
○ as soon as new data arrive => start clocking
○ determine number k of clock cycles required to process new data
○ stop clocking after k cycles, wait for next data

Properties:
○ need to switch clock on and off => beware spurious clock pulses!
○ no metastability problem: data stable as soon as consumer clock starts
○ potential for power saving
○ useful for specific applications only (no pipe!)

Data-Driven Clock: Circuit / 1

[Figure: clock generator circuit with delay element D and output CLK out, with waveform]

CLK half period determined by D

Data-Driven Clock: Circuit / 2

[Figure: clock generator built from a Muller C-element (C) and a delay element D, with signals REQ, ACK and CLK out, and waveforms]

transition on REQ answered by transition on CLK out

min CLK half period determined by D

Pausible Clocking

Principle:
○ producer requests consumer's clock to pause
○ data provided to input register during idle time
○ consumer's clock may resume
  - free running („pausible clock“)
  - with one cycle only („stoppable clock“)

Properties:
○ need to switch clock on and off
  => beware spurious clock pulses!
  => beware of clock tree delays!
○ producer controls consumer's clock (blocking!)
○ applications must cope with paused clock

Pausible Clock: Circuit / 1

[Figure: the circuit of C-element (C) and delay element D with REQ, ACK and CLK out, and waveforms]

inverter generates next REQ from ACK

=> self-oscillation


[Figure: the oscillator extended with an arbiter (Arb) that accepts REQ’ and returns ACK’, with waveforms of CLK out, REQ’ and ACK’]

external unit can safely stop CLK by activating REQ’

… and gets ACK’ as a response

[Figure: the oscillator with several arbiters (Arb), one per external source REQ1/ACK1 … REQn/ACKn]

for more external sources arbiters can be added and “anded” before the Muller C-Element

the two inverters can be eliminated by using a Muller C-Element with inverting output

Advantages of GALS

• synchronous islands can be designed efficiently
• modules operate independently
• can use module-specific clock & timing
• clocking is no single point of failure

Problems with GALS

• operation of modules not (inherently) co-ordinated
  synchrony for communication but not on system / algorithm level

• communication has to cross clock boundaries
• potential for metastability
  => performance penalty through synchronizers OR
  => module must handle irregular clocking

The DARTS Idea

Distributed Algorithms for Robust Tick Synchronization

[Figure: tick synchronisation placed between phase synchronisation and clock synchronisation; function units Fu1, Fu2, Fu3 on a data bus, each with a TG-Alg, the TG-Algs connected by the TG-Net]

The DARTS Approach

Concept: Multiple synchronized tick generators

Method: Distributed algorithm for fault-tolerant tick generation implemented in (asynchronous) digital logic

Advantages
- No crystal oscillator(s)
- No critical clock tree
- Clock is no single point of failure!
- Reasonable synchrony

The DARTS Principle

Every function unit Fui is augmented with a simple local clock unit (TG-Alg)

TG-Algs communicate over dedicated TG-Net to generate tick-synchronized local clock signals

Up to f TG-Algs can be Byzantine faulty
=> need n ≥ 3f + 2 TG-Algs

Formally proven synchronization properties

[Figure: standard synchronous clocking (Fu1, Fu2, Fu3 on a data bus, fed by a clock tree) next to DARTS clocks (each Fu with a TG-Alg, connected by the TG-Net)]

A Comparison

[Figure: three clocking schemes side by side. Synchronous SoC: one oscillator feeds Fu1, Fu2 and Fu3 (on a data bus) through a clock tree. GALS: Fu1, Fu2 and Fu3 each have their own oscillator. DARTS: each Fu has a TG-Alg, the TG-Algs are connected by the TG-Net, and the waveforms Fu1 clk and Fu2 clk show tick(3) and tick(4).]

Labels in the figure:
• global synchrony (< 1 tick)
• single point of failure
• global synchrony (potentially 1 tick)
• no single point of failure
• no single point of failure, NO (inherent) global synchrony

The Distributed Algorithm

Initially:
    send tick(0) to all; clock := 0

Relay Rule:
    If received tick(m) from at least f+1 remote nodes and m > clock:
        send tick(clock+1), …, tick(m) to all [once]; clock := m

Increment Rule:
    If received tick(m) from at least 2f+1 remote nodes and m >= clock:
        send tick(m+1) to all [once]; clock := m+1

[Srikanth & Toueg, 87]

[Figure: six nodes, TG-Alg 1 to TG-Alg 6, connected by the TG-Net]
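The relay and increment rules can be tried out in a small event-driven software simulation. This is a sketch under several assumptions that are not from the slides: faulty nodes are modelled as simply silent (only one of many possible Byzantine behaviours), message delays are drawn uniformly from [d_min, d_max], the run stops at a chosen max_tick, and the function name `simulate` is hypothetical. It says nothing about the asynchronous hardware implementation.

```python
import heapq
import random

def simulate(f=1, max_tick=20, d_min=1.0, d_max=2.0, seed=0):
    """Run the tick generation rules on n = 3f + 2 nodes, f of them silent.

    Returns (clock, worst_skew): the final clock value of every correct node
    and the largest clock difference seen between any two correct nodes.
    """
    n = 3 * f + 2
    correct = list(range(n - f))         # the last f nodes never send anything
    rng = random.Random(seed)
    clock = {i: 0 for i in correct}
    received = {i: {} for i in correct}  # received[i][m] = remote senders of tick(m)
    queue = []                           # (arrival time, tie-breaker, dest, sender, m)
    counter = 0
    worst_skew = 0

    def send(now, sender, m):
        nonlocal counter
        for dest in correct:
            if dest != sender:           # only remote nodes are counted
                counter += 1
                arrival = now + rng.uniform(d_min, d_max)
                heapq.heappush(queue, (arrival, counter, dest, sender, m))

    for i in correct:                    # "Initially": send tick(0) to all
        send(0.0, i, 0)

    while queue:
        now, _, dest, sender, m = heapq.heappop(queue)
        received[dest].setdefault(m, set()).add(sender)
        for k in sorted(received[dest]):
            count = len(received[dest][k])
            if count >= f + 1 and k > clock[dest]:
                # relay rule: catch up, sending every tick not sent so far
                for t in range(clock[dest] + 1, k + 1):
                    send(now, dest, t)
                clock[dest] = k
            if count >= 2 * f + 1 and clock[dest] <= k < max_tick:
                # increment rule: enough nodes reached tick k, move on to k+1
                send(now, dest, k + 1)
                clock[dest] = k + 1
        skew = max(clock.values()) - min(clock.values())
        worst_skew = max(worst_skew, skew)
    return clock, worst_skew

if __name__ == "__main__":
    clocks, skew = simulate(f=1, max_tick=20, seed=1)
    print(clocks, skew)
```

With n = 3f + 2 each correct node has exactly 2f + 1 correct remote nodes, so the increment rule can still fire when all f faulty nodes stay silent; this is why every correct node in the sketch reaches max_tick.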

Implementation Challenges

[Figure: the software-based algorithm from the previous slide, annotated with the challenges of a hardware implementation]

• k-bit messages, k unbounded
  => replacement by zero-bit messages
• atomicity of actions
  => to be ensured by the architecture and delay constraints
• threshold functions for fault tolerance
  => glitch-free asynchronous implementation

[Figure: k-bit msg vs. zero-bit tick: messages TICK(k), TICK(k-1), ..., TICK(1), TICK(0)]

The DARTS Prototype

ASIC design:
• rad-hard 180 nm technology
• 2 designs:
  - flexible
  - fast

Prototype board: 8 chips plus fixed & programmable interconnect

Proof of Concept


Frequency Stability (Warm-up)

[Plot: frequency in MHz (axis 53.15 to 53.45) over time in hours (axis 0 to 18)]

Frequency Stability (detail)

[Plots: frequency in MHz (axis 51.94 to 52.0) and core voltage in V (axis 1.7968 to 1.7974), both over time in minutes (axis 0 to 15)]

DARTS – General Properties

Fully asynchronous implementation
NO oscillators

Tolerates up to three Byzantine faulty nodes (configurable number of TG-Algs; 5 to 12)

Adapts to operating conditions (asynchronous logic)

Still Room for Improvements

o Transient faults are permanently stored in the elastic pipelines
o No on-the-fly integration of TG-Alg
o Relatively low clock speed
o Interfacing to traditional synchronous designs
o Scaling with number of faults is costly

Summary: Trends & Needs

• Ongoing miniaturization necessitates fault tolerance

• Co-ordination of activities is fundamental, thus tight synchrony is a desirable feature on all levels

• SoCs are large modular designs on a single die

Summary: SoC Clocking

• globally synchronous clock:
  + ideal synchrony, efficient in design & implementation
  - isochrony unrealistic, single point of failure

• DARTS clock
  + best attainable global synchrony, adaptive timing, FT
  - high implementation efforts, frequency not stable

• GALS
  + uses best of syn & asyn, indep. & module-specific clock
  - no global synchrony, metastability issues

• asynchronous design
  + power-efficient, robust against faults & PVT
  - high overheads, difficult to design, timing hard to predict

More information on DARTS

http://ti.tuwien.ac.at/ecs/research/projects/darts
