
Post on 17-Jan-2018


Anish Arora, Ohio State University
Mikhail Nesterenko, Kent State University

Local Tolerance to Unbounded Byzantine Faults

Faults in System of Large Scale
large system size presents unique challenges and opportunities to ensuring dependability
• problem: faults
  – occur often
  – affect multiple components
  – interact unpredictably
  – asynchronous execution model: faults are spatially/temporally unbounded, complex & undetectable
• opportunity: a fault directly affects a region rather than the whole system
  – if faults are contained, the rest of the system continues to function
[Figure: faulty, affected, and unaffected regions of the system]

Difficulties Containing Unbounded Faults
• lack of spatial bound
  – arbitrary number of processes can be faulty
  – cannot rely on limited scope of fault or number of faulty processes
• lack of temporal bound
  – faulty process behaves incorrectly arbitrarily long
  – cannot wait until fault stops
• contain correctness and tolerance instead of faults
• use execution models that simplify such containment

Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  – reactive program: dining philosophers
  – transformational execution models and programs
    – output dependent: k-independent set selection
    – output independent: lightweight spanner construction

Containing Correctness
address specification first
• what does it mean for a system to be correct when an arbitrary portion of it is faulty?
• spec defines correct sequences for each process P
• sequence involves states of P and possibly others
a program is locally containing of faults of class F if there exists a constant l (containment radius) such that every P conforms to its spec whenever faulty processes are at least l hops away from P
problem: correctness of P depends on every process in the system conforming to spec or F
[Figure: fault of class F, containment radius l, containment locality]
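The definition above can be made concrete with a small sketch. It assumes the network is an undirected graph stored as an adjacency dictionary, and it computes which processes keep all faulty processes at least l hops away. The names `hop_distances` and `protected_processes` and the line topology are illustrative assumptions, not part of the talk.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable process."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def protected_processes(graph, faulty, radius):
    """Correct processes for which every faulty process is >= radius hops away."""
    protected = set(graph) - set(faulty)
    for f in faulty:
        for p, d in hop_distances(graph, f).items():
            if d < radius:
                protected.discard(p)
    return protected


# Hypothetical line topology 0 - 1 - 2 - 3 - 4 - 5 with process 0 faulty.
line = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
print(sorted(protected_processes(line, faulty={0}, radius=2)))
```

With a containment radius of 2, only the faulty process and its immediate neighbor fall outside the protected set; a locally containing program would have to keep every remaining process within its spec.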

Strict Fault Containment
• a strict fault containing (SFC) program is locally containing of unbounded Byzantine faults
  – a process satisfies its spec regardless of actions of processes outside the locality
• an SFC program is containing of bounded and unbounded faults of any class
• for each P the spec can only mention processes inside the locality
  – a problem lacking such specs (e.g. routing) does not have SFC solutions
[Figure label: Byzantine fault]

Strict Stabilization
• additional tolerance properties to faults within the locality, for a strictly fault containing program
• strict stabilization: stabilization from transient faults; regardless of actions outside the locality, each P eventually satisfies its spec

Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  – reactive program: dining philosophers
  – transformational execution models and programs
    – output dependent: k-independent set selection
    – output independent: lightweight spanner construction

Dining Philosophers Problem
definition
• network of processes, each may request to eat
properties
– mutual exclusion: no two neighbors eat together
– liveness: each requesting process eats eventually
execution model
– interleaving
– communication via shared registers
– high atomicity
[Figure: cycle for a requesting process: thinking (T), hungry (H), eating (E)]

Solution to Dining Philosophers
priority based
actions
• if T & higher priority neighbors thinking → become hungry
• if H & no neighbors are eating → eat (ensures mutual exclusion)
• if E & done → think & give priority to neighbors (ensures liveness)
waiting chain ≤ 3
optimal containment radius of 2
[Figure labels: E, TH, any; decreasing priority]
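The three guarded actions can be exercised in a small simulation. This is a minimal sketch, not the authors' program: it assumes a ring topology, a round-robin interleaving scheduler, processes that always want to eat and finish eating after one step, and an initial priority order by process id. All function names are hypothetical.

```python
def make_ring(n):
    """Ring of n >= 3 processes; each process has two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def mutual_exclusion_holds(graph, state):
    """True if no two neighbors are eating at the same time."""
    return all(not (state[u] == "E" and state[v] == "E")
               for u in graph for v in graph[u])


def step(graph, state, higher, p, meals):
    """Atomically execute the enabled action of process p, if any."""
    nbrs = graph[p]
    if state[p] == "T":
        # T & higher priority neighbors thinking -> become hungry
        if all(state[q] == "T" for q in nbrs
               if higher[frozenset((p, q))] == q):
            state[p] = "H"
    elif state[p] == "H":
        # H & no neighbors are eating -> eat
        if all(state[q] != "E" for q in nbrs):
            state[p] = "E"
            meals[p] += 1
    else:
        # E & done -> think and give priority to all neighbors
        state[p] = "T"
        for q in nbrs:
            higher[frozenset((p, q))] = q


def simulate(n=5, rounds=50):
    graph = make_ring(n)
    state = {p: "T" for p in graph}
    # Assumed initial priorities: on each edge the lower id has priority.
    higher = {frozenset((u, v)): min(u, v) for u in graph for v in graph[u]}
    meals = {p: 0 for p in graph}
    for _ in range(rounds):
        for p in graph:  # round-robin interleaving, one action at a time
            step(graph, state, higher, p, meals)
            assert mutual_exclusion_holds(graph, state)
    return meals


print(simulate())
```

Because each action reads the neighbors' states and writes its own state in one atomic step (the high-atomicity model named on the previous slide), the guard on the eat action is enough to preserve mutual exclusion in every interleaving. The sketch does not model Byzantine neighbors.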

Fault Containment and Information Propagation
• fault containment leverages limit on information propagation
• idea: abstract from the process of information propagation and highlight the result
  – process: a sends info to b, b sends a's info to c, c sends a's info to d
  – result: d reads from a
[Figure: chain of processes a, b, c, d]

Execution Models
transformation program: given input, computes output (e.g. leader election)
models for transformation programs: each process reads from processes within range (finite distance)
• output dependent: each process reads all information within range, input and (atomically) output
• output independent: each process reads only input within range
  – every program in this model is strictly fault containing
[Figure: P reads input & output within range; P reads input only]

k-Independent Set Selection (cf. [HHJS01])
problem: select a maximal subset of processes S such that
• for each process in S, each other process of S is at least k hops away
solution actions
• if no member of S is less than k hops away → join S
• if there exists a member of S less than k hops away → leave S
observe:
• only a faulty node P can make another process Q leave S
• if Q leaves S, it can make another process R join S
containment radius is 2k
[Figure: 1-independent set; processes P, Q, R spaced k hops apart, labeled joins S, leaves S, joins S]
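The two solution actions can be run to a fixed point from an arbitrary starting set. The sketch below assumes an interleaving scheduler that visits processes one at a time in dictionary order and a fault-free network; `within` and `stabilize` are hypothetical helper names, and the path topology is only an illustration.

```python
from collections import deque


def within(graph, p, hops):
    """Processes other than p whose distance from p is at most `hops`."""
    dist = {p: 0}
    queue = deque([p])
    while queue:
        u = queue.popleft()
        if dist[u] == hops:
            continue
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[p]
    return set(dist)


def stabilize(graph, k, members):
    """Apply the join/leave actions, one process at a time, until none is enabled."""
    S = set(members)
    # "less than k hops away" means distance at most k - 1
    near = {p: within(graph, p, k - 1) for p in graph}
    changed = True
    while changed:
        changed = False
        for p in graph:
            conflict = bool(near[p] & S)
            if p in S and conflict:
                S.remove(p)      # a member of S is too close: leave S
                changed = True
            elif p not in S and not conflict:
                S.add(p)         # no member of S is too close: join S
                changed = True
    return S


# Hypothetical path 0 - 1 - ... - 6, started from the arbitrary state "all in S".
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
print(sorted(stabilize(path, k=3, members=range(7))))
```

Under this one-at-a-time schedule a process that joins S never leaves again, because no process close to it can join afterwards, so the loop terminates. Other schedulers, or a Byzantine process that keeps flipping its membership, are outside what this sketch shows.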

Outline
• containing correctness and tolerance: strict fault containment and strict stabilization
• execution models and example programs
  – reactive program: dining philosophers
  – transformational execution models and programs
    – output dependent: k-independent set selection
    – output independent: lightweight spanner construction
      • practical problem: fast routing tree construction in sensor networks
      • spanner construction with double range
      • spanner optimization with larger ranges

Experimental Platform: Wireless Sensors

• 4 MHz Atmel processor
• 8 KB of program memory
• 512 B of data memory
• 916 MHz single-channel, low-power radio
• 10 Kbps of raw bandwidth
• uniform antenna length & orientation
• TinyOS as the runtime system
• fresh AA batteries

Experiment: Fast Routing Tree Construction By Flooding [G+02]
• 156 nodes are arranged in a 13x12 grid on an open parking lot, with grid spacing of 2 feet
• the base station is placed in the middle of the base of the grid and starts the flooding
• each receiving node rebroadcasts the flood message immediately upon receipt and then squelches further broadcasts
  – the sender is selected as parent, thus a routing tree to the base station is formed
• expectation: a routing tree with relatively regular structure: # of children, link length, path size, etc.
[Figure: flood snapshots at 1 hop, 2 hops, 3 hops, and final, annotated with backward link, long link, straggler, and clustering]
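The expected "regular" tree can be reproduced with an idealized flood. The sketch below assumes a perfect radio in which each node hears only its four grid neighbors and messages are delivered in hop order; these are simplifying assumptions, not properties of the real deployment, and `flood_tree` is a hypothetical name.

```python
from collections import deque


def flood_tree(cols, rows, root):
    """Idealized flood: each node adopts the first sender it hears as parent."""
    parent = {root: None}
    hops = {root: 0}
    queue = deque([root])
    while queue:
        x, y = queue.popleft()  # this node rebroadcasts exactly once
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nx < cols and 0 <= ny < rows
            if inside and (nx, ny) not in parent:
                parent[(nx, ny)] = (x, y)  # sender is selected as parent
                hops[(nx, ny)] = hops[(x, y)] + 1
                queue.append((nx, ny))
    return parent, hops


# 13x12 grid with the base station in the middle of the bottom row.
parent, hops = flood_tree(cols=13, rows=12, root=(6, 0))
print(len(parent), max(hops.values()))
```

In this idealized model every link has length one grid step and every path is a shortest path. The backward links, long links, stragglers, and clustering annotated in the figure are exactly the deviations from this picture.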

Problems and Solution Approach
problem: routing tree constructed fast over “raw” topology is inadequate
– uneven clustering (some nodes have too many neighbors)
– long links (possibly unreliable)
– suboptimal paths (backward links)
idea: pre-process the topology to mitigate the problem
– weigh links (by length, error rate, node degree, etc.)
– locally construct a connected but lightweight spanner
  – link weight may be reflexive (depend on the spanner, ex: node degree)

Lightweight Spanner Construction Using 2k-Range
• spanner: connected subgraph that includes all nodes (ex: spanning tree)
• k-local spanner: there is a path within distance ≤ k to each neighbor
problem: given a weighted graph (all weights unique) and 2k-range, build a lightweight k-local spanner
solution: each process P computes the minimum spanning tree (MST) for each process Q at distance no more than k and selects the union of incident edges
[Figure: P and Q with radius k around each; P can compute the MST for each process Q in this region; MST for Q's region]
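One way to read the solution is sketched below. It assumes that "Q's region" is the subgraph induced by the processes within k hops of Q, that the MST is computed with Kruskal's algorithm, and that the spanner is the union of the edges selected by all processes. These are interpretations made for the sketch, not statements from the talk; all function names are hypothetical.

```python
from collections import deque


def ball(adj, center, k):
    """Processes within k hops of center (including center)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def region_mst(weights, region):
    """Kruskal's algorithm on the subgraph induced by `region`."""
    leader = {v: v for v in region}

    def find(v):
        while leader[v] != v:
            leader[v] = leader[leader[v]]  # path halving
            v = leader[v]
        return v

    tree = set()
    local = [(w, e) for e, w in weights.items() if e <= region]
    for _, e in sorted(local, key=lambda item: item[0]):
        u, v = tuple(e)
        ru, rv = find(u), find(v)
        if ru != rv:
            leader[ru] = rv
            tree.add(e)
    return tree


def local_spanner(weights, k):
    """Union, over all P, of P's incident edges in the MST of each nearby Q's region."""
    adj = {}
    for e in weights:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    spanner = set()
    for p in adj:
        for q in ball(adj, p, k):  # every Q within k hops of P
            tree = region_mst(weights, ball(adj, q, k))
            spanner |= {e for e in tree if p in e}
    return spanner


# Hypothetical 4-cycle with unique weights.
cycle = {frozenset((0, 1)): 1, frozenset((1, 2)): 2,
         frozenset((2, 3)): 3, frozenset((3, 0)): 4}
print(len(local_spanner(cycle, k=1)), len(local_spanner(cycle, k=2)))
```

With unique weights, every edge of the global MST survives in the MST of any region that contains it, so the union stays connected. On the 4-cycle, k = 1 regions never see the whole cycle and keep all four edges, while k = 2 regions see it and drop the heaviest edge.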

Spanner Optimization Using Ranges > 2k
• each P computes the spanner's topology in the neighborhood with radius (range − k)
  – P knows the complete spanner in this region
• P iteratively repeats the procedure on the resultant spanner
[Figure: P and Q with radius k around each and an additional k of range; P can compute the MST for each process Q in this region]

Conclusion
• complexity and scale of large systems force unorthodox approaches to faults
• we explored the spatial dimension of fault tolerance to complex unbounded faults
  – used lack of global info propagation
  – stated necessary conditions and impossibility results
  – gave first examples of programs
• question: how to solve problems that do have global info propagation?
  – is it possible to contain problems before they spread?
