pasc fault tolerance

Taming Data Corruptions in Distributed Systems Marco Serafini (Yahoo! Research BCN)


Post on 02-Dec-2014


DESCRIPTION

A new generic and rigorous approach to the tolerance of data corruptions. Presentation of the paper "Practical Hardening of Crash-Tolerant Systems" published at USENIX ATC 2012. See video at http://bit.ly/LNc5mc

TRANSCRIPT

Page 1: PASC fault tolerance

Taming Data Corruptions in Distributed Systems

Marco Serafini (Yahoo! Research BCN)

Page 2: PASC fault tolerance

Infrastructure dependability

o Service availability, data durability
o In presence of hardware faults
o Current approaches tolerate crashes

Page 3: PASC fault tolerance

Crashes

o Assumptions
  o A server (process) suddenly stops
  o Until then, only correct steps

[Figure: timeline ending in a crash]

Page 5: PASC fault tolerance

Data corruptions

o What if there are data corruptions?
  o The state of a process may be corrupted
  o The process may make incorrect steps before stopping

[Figure: timeline with data corruptions]

NOT COVERED!

Page 6: PASC fault tolerance

Sources of data corruptions

o Commodity disks are known to be unreliable
  o Faulty firmware, bad sectors etc.
o RAM: ECC errors are frequent
  o Production machines only see detected errors
  o Coverage not known
o Interconnects and CPUs also fail
  o Faulty drivers or bit flips

Page 7: PASC fault tolerance

A horror story

An 8-hour system-wide outage due to a single hardware fault

Page 8: PASC fault tolerance

What happened?

o Quoted from the Amazon service health dashboard
  o “A handful of messages had a single bit corrupted”
  o “The message was still intelligible, but the system state information was incorrect”
  o “We used MD5 checksums throughout the system (but not) for this particular internal state information”
  o “(The corruption) spread throughout the system causing the symptoms described above”
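
The quotes above turn on a missing checksum. A minimal sketch of how a digest exposes a single flipped bit, assuming a hypothetical message layout (payload bytes plus an MD5 digest); the helper names seal, verify and flip_bit are illustrative and are not taken from the Amazon system or from PASC:

```python
import hashlib

def seal(payload: bytes) -> tuple[bytes, bytes]:
    """Attach an MD5 digest to a payload (hypothetical message layout)."""
    return payload, hashlib.md5(payload).digest()

def verify(payload: bytes, digest: bytes) -> bool:
    """Return True only if the payload still matches its digest."""
    return hashlib.md5(payload).digest() == digest

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating a corruption."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

payload, digest = seal(b"state: node-7 healthy")
damaged = flip_bit(payload, 3)   # the message stays "intelligible" but is wrong

print(verify(payload, digest))   # intact message passes the check
print(verify(damaged, digest))   # corrupted message fails the check
```

A checksum like this only protects data between the point where the digest is computed and the point where it is checked; state corrupted before sealing is sealed as-is.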

Page 9: PASC fault tolerance

Error propagation

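[Figure: two processes, Process i (variables u, v) and Process j (variables x, y), each with an event-handling step fed by an input message m_in; Process i produces an output message m_out]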

Page 10: PASC fault tolerance

Common practice

o Manual placement of ad-hoc error detection checks
  o Application knowledge
  o Time consuming
o Hard to structure without fault model
o No error isolation guarantee

Page 11: PASC fault tolerance

Research: Byzantine faults

o Byzantine model
  o Faulty nodes controlled by an adversary
  o Worst-case model

[Figure: timeline with a Byzantine fault]

Page 12: PASC fault tolerance

Byzantine fault model

o Black-box model of faulty processes: adversarial
o Hardening for error isolation [Nysiad NSDI 2008]
  o Based on state machine replication
  o Replication and performance costs

[Figure: a client and a group of servers, with agreement on requests]

Page 13: PASC fault tolerance

Byzantine faults

o Byzantine hardening covers attacks and bugs…
o … assuming, e.g., design diversity of replicas
  o Impractical in most systems: no real adoption

[Figure: fault classes paired with the technique that addresses each. Attacks: Security. Bugs: V & V. Data corruptions: ASC hardening.]

Page 15: PASC fault tolerance

A new approach to error isolation

[Figure: two processes, Process i (variables u, v) and Process j (variables x, y), each with an event-handling step fed by an input message m_in; Process i produces an output message m_out]

1. General model of process behavior
2. Arbitrary State Corruption (ASC) fault model
3. Guarantee error isolation through hardening

with M. Correia, D. Ferro and F. Junqueira
2012 USENIX Annual Technical Conference

Page 16: PASC fault tolerance

Process and fault models

Defining Arbitrary State Corruptions

Page 17: PASC fault tolerance

Process model

Upon receive message <REQ, r> do
    if v > 5 then
        u = r + v + 5;
    else
        u = r + v;
    v = u;
    send <WRITE, v> to process p

[Figure labels: input message m_in, output message m_out, State; stages 1) Event dispatching, 2) Event handling, 3) Message sending]
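
The handler on this slide can be written out as runnable code. A minimal sketch, assuming a dictionary-free state object and tuples for messages; the names Process, handle_req and dispatch are illustrative and do not come from the PASC library:

```python
from dataclasses import dataclass

@dataclass
class Process:
    """State of the example process: the two variables used on the slide."""
    u: int = 0
    v: int = 0

def handle_req(state: Process, r: int) -> tuple[str, int, str]:
    """2) Event handling for <REQ, r>: update the state, build <WRITE, v> for process p."""
    if state.v > 5:
        state.u = r + state.v + 5
    else:
        state.u = r + state.v
    state.v = state.u
    return ("WRITE", state.v, "p")

def dispatch(state: Process, message: tuple[str, int]) -> tuple[str, int, str]:
    """1) Event dispatching: route an input message to its handler."""
    kind, r = message
    if kind == "REQ":
        return handle_req(state, r)
    raise ValueError(f"unknown message type: {kind}")

proc = Process(v=5)
m_out = dispatch(proc, ("REQ", 2))   # 3) the returned tuple is what would be sent
print(m_out)
```

Starting from v = 5, the first request takes the else branch; once v exceeds 5, later requests take the first branch and add the extra 5.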

Page 18: PASC fault tolerance

ASC fault model

o An Arbitrary State Corruption can make a process
  o Crash
  o Assign an arbitrary value to any variable
  o Start the execution from an arbitrary instruction

[Figure: example state before and after a fault]

      before   after
v     5        12
z     10       7
PC    20       320
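
The three effects listed on this slide can be simulated for testing. A rough sketch, assuming the process state is a dictionary in which the key PC stands in for the program counter; the 25% crash probability, the 16-bit value range and the function name inject_asc_fault are arbitrary choices for illustration, not part of the ASC definition:

```python
import random

class Crash(Exception):
    """Raised when the simulated process stops."""

def inject_asc_fault(state: dict, rng: random.Random) -> list[str]:
    """Apply one simulated ASC fault to `state`; return the names that were overwritten.

    Either the process crashes, or an arbitrary subset of its variables
    (including "PC", i.e. where execution resumes) gets arbitrary values.
    """
    if rng.random() < 0.25:                 # assumed crash probability
        raise Crash("process stopped")
    victims = [name for name in state if rng.random() < 0.5]
    for name in victims:
        state[name] = rng.randrange(2**16)  # arbitrary value
    return victims

state = {"v": 5, "z": 10, "PC": 20}
try:
    hit = inject_asc_fault(state, random.Random(1))
    print("overwritten:", hit, "state:", state)
except Crash:
    print("process crashed")
```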

Page 19: PASC fault tolerance

Fault frequency

o One fault for every processed input message

Upon receive message <REQ, r> do
    if v > 5 then
        u = r + v + 5;
    else
        u = r + v;
    v = u;
    send <WRITE, v> to process p

[Figure labels: input message m_in, output message m_out, State; stages 1) Event dispatching, 2) Event handling, 3) Message sending]

Page 20: PASC fault tolerance

Fault diversity

o A corrupted variable is different from its replica
o Only holds immediately after the fault
  o Can be invalidated if instructions modify the variable

[Figure: original and replica values before and after a fault]

      before fault          after fault
      original  replica     original  replica
v     5         5           12        5
z     10        10          7         41
PC    20                    320
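
The diversity assumption is what makes a simple comparison effective. A small sketch using the values from the figure, assuming its columns pair up as original and replica, before and after the fault; the helper name mismatches is illustrative:

```python
def mismatches(original: dict, replica: dict) -> list[str]:
    """Names of variables whose original and replica values differ."""
    return sorted(name for name in original if original[name] != replica[name])

# Values read from the slide figure (pairing of columns is an assumption).
before_original, before_replica = {"v": 5, "z": 10}, {"v": 5, "z": 10}
after_original, after_replica = {"v": 12, "z": 7}, {"v": 5, "z": 41}

print(mismatches(before_original, before_replica))  # nothing differs before the fault
print(mismatches(after_original, after_replica))    # both variables differ after it

# A later instruction that overwrites z in both copies hides that corruption.
after_original["z"] = 0
after_replica["z"] = 0
print(mismatches(after_original, after_replica))
```

The last step illustrates why diversity only holds immediately after the fault: once an instruction rewrites a variable, the comparison no longer sees that it was ever corrupted.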

Page 21: PASC fault tolerance

Error propagation

o Fault diversity does not hold
o Hardening preserves diversity

[Figure: original and replica copies of u and v; after a fault on v, is u still different from its replica?]
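
A small sketch of how propagation can erase diversity, and one way to keep it. This is an illustration under assumptions (dictionary state, a made-up instruction u = v + 1), not the PASC algorithm itself:

```python
def naive_step(orig: dict, repl: dict) -> None:
    """Compute u once from the original state and copy the result into the replica."""
    orig["u"] = orig["v"] + 1
    repl["u"] = orig["u"]        # a corrupted result is duplicated: diversity is lost

def independent_step(orig: dict, repl: dict) -> None:
    """Compute u separately on each copy, so a corrupted v yields differing u values."""
    orig["u"] = orig["v"] + 1
    repl["u"] = repl["v"] + 1

# v was corrupted in the original (12) but not in the replica (5).
o1, r1 = {"u": 0, "v": 12}, {"u": 0, "v": 5}
naive_step(o1, r1)
print(o1["u"] == r1["u"])        # the error in u is now invisible to a comparison

o2, r2 = {"u": 0, "v": 12}, {"u": 0, "v": 5}
independent_step(o2, r2)
print(o2["u"] == r2["u"])        # the error in u can still be detected
```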

Page 22: PASC fault tolerance

ASC hardening

From ASC faults to crashes and message omissions

Page 23: PASC fault tolerance

From ASC to crashes

o Transparent: to the hardened process
o Local: no process replication on multiple machines
o Untrusted: can have faults while executing hardening

[Figure: a hardening runtime wrapped around the process: input message m_in, event handling on variables u and v, output message m_out]
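
The idea of turning a corruption into a crash can be sketched with a local wrapper that keeps a second copy of the state. This is a simplified illustration, assuming the handler is deterministic; the class and exception names are hypothetical and this is not the API of the PASC library:

```python
import copy

class DetectedCorruption(Exception):
    """Raised to stop the process instead of emitting a possibly corrupted message."""

class HardenedProcess:
    """Runs each event handler on the state and on a local replica, then compares."""

    def __init__(self, handler, initial_state):
        self.handler = handler
        self.state = copy.deepcopy(initial_state)
        self.replica = copy.deepcopy(initial_state)

    def on_message(self, message):
        out_a = self.handler(self.state, message)
        out_b = self.handler(self.replica, message)
        if out_a != out_b or self.state != self.replica:
            raise DetectedCorruption("original and replica diverged")
        return out_a

def handler(state, r):
    """Example deterministic handler: accumulate r into v and report it."""
    state["v"] += r
    return ("WRITE", state["v"])

proc = HardenedProcess(handler, {"v": 5})
print(proc.on_message(2))        # both copies agree, the message is released

proc.state["v"] = 99             # simulate a corruption of the original state
try:
    proc.on_message(1)
except DetectedCorruption as err:
    print("stopped:", err)
```

The sketch is transparent to the handler and local to one machine, but it trusts the comparison itself; tolerating faults that strike while the checks run, as the "Untrusted" bullet requires, needs more than this.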

Page 24: PASC fault tolerance

PASC runtime

[Figure: user-defined event handlers EH1, EH2, EH3 on top of the transparent PASC library, which holds the process state, the replica state, and the PASC checks]

github.com/yahoo/pasc

Page 25: PASC fault tolerance

Evaluation

Page 26: PASC fault tolerance

Hardening an echo server

o Little computation, network bound, no overhead
o PBFT is a reference (Nysiad not available)

Page 27: PASC fault tolerance

Hardening State Machine Replication

[Figure annotations: +70 %, -15 %]

Page 28: PASC fault tolerance

Zookeeper (core)

Page 29: PASC fault tolerance

Memory overhead

Page 30: PASC fault tolerance

Scalability

o SimpleKV: eventually consistent store, no replication
o Scales similarly with hardening
o No server “wasted” for replication

[Figure: max. throughput (kops/sec) versus number of servers (1, 3, 5, 7) for PASC sKV and Unprot. sKV]

Page 31: PASC fault tolerance

PASC fault coverage

o Injected random bit flips in Paxos
  o Code corruptions: bytecode and binary code
  o State corruptions: pointers and primitive values

              Code corruptions     State corruptions
              Unprot    PASC       Unprot    PASC
Undet.        3         0          93        0
Det.          -         1          -         330
Crash         1640      1663       2301      2066
Not manif.    1213      1193       2843      2841
Total         2856      2856       5237      5237
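
The undetected rows are easier to compare as fractions of the injected faults. A short calculation over the undetected and total counts on this slide; the dictionary layout and the function name undetected_rate are illustrative:

```python
# Undetected and total counts copied from the fault-injection results on this slide.
COUNTS = {
    ("code", "Unprot"): {"undetected": 3, "total": 2856},
    ("code", "PASC"): {"undetected": 0, "total": 2856},
    ("state", "Unprot"): {"undetected": 93, "total": 5237},
    ("state", "PASC"): {"undetected": 0, "total": 5237},
}

def undetected_rate(campaign: str, system: str) -> float:
    """Fraction of injected faults that led to an undetected corruption."""
    row = COUNTS[(campaign, system)]
    return row["undetected"] / row["total"]

for key in COUNTS:
    print(key, f"{undetected_rate(*key):.4%}")
```

Without protection, roughly 0.1% of code corruptions and roughly 1.8% of state corruptions went undetected; with PASC both rates are zero in this experiment.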

Page 32: PASC fault tolerance

Wrap up

o Hardware data corruptions are a real danger
o Proposed new systematic approach
  o BFT not realistic
  o Ad-hoc approaches are not systematic
o Hardening algorithm for error isolation
  o Local: does not require replication
  o Efficient: PASC-Paxos has up to 70% more throughput than PBFT
  o High fault coverage

Page 33: PASC fault tolerance

Directions

o Systematic protection of Yahoo! infrastructure against data corruptions
o ASC just scratched the surface; some to-dos:
  o Reduce memory footprint
  o Support for external memory (disks/SSDs)
  o Hardening of legacy code
  o Theoretical foundations

Page 34: PASC fault tolerance

Thank you
