reasoning with mad distributed systems -...

1
UpRight Practical and configurable BFT replication Applying BFT replication to real services BFT HDFS BFT ZooKeeper Refining the architecture and APIs Introduce Request Quorum stage before agreement Clean application API for: processing requests taking application state snapshots Configurable replication u: services are Up (live) despite u failures r: services are Right (safe) despite r commission failures Replication costs expressed as a function of u and r Synchronous, no failures (high performance) Zyzzyva Speculative Byzantine Fault Tolerance Simplifies the design of BFT replication One protocol to rule them all latency throughput cost of replication What is a MAD system? Any system that spans Multiple Administrative Domains (e.g. peer-to-peer services, cloud/outsourced storage, Internet routing, and wireless mesh routing. BAR-Gossip BAR-tolerant Nash for P2P live streaming Gossip is attractive infrastructure for P2P live streaming But gossip protocols perform poorly if many peers behave selfishly Tolerating arbitrary faults MAD Systems Dependable Storage Reasoning with MAD distributed systems Lorenzo Alvisi The University of Texas at Austin Just in! Local social defenses against Sybil attacks Current sybil defenses can distinguish honest from forged identities in social graphs that are fast mixing (equivalently, have constant conductance) Alas, many social graphs are not fast mixing! We are developing a new approach that provides better protection without relying on global graph properties, such as constant conductance, but rather leverages the social graph’s community structure. Lorenzo Alvisi is a Professor in the Department of Computer Science at UT Austin, where he is a co-director of the Laboratory for Advanced Systems Software (LASR). He holds a Ph.D. and M.S. in Computer Science from Cornell University, and a Laurea summa cum laude in Physics from the University of Bologna, Italy. He is a Fellow of the ACM and the recipient of an Alfred P. Sloan Fellowship and of the NSF CAREER Award. EVE Replicating multithreaded servers Replication state-of-the-art Agree on order of requests, then execute them Requires deterministic execution Practically this means single-threaded execution Eve First execute requests nondeterministically, without agreeing on order Then verify if state and responses match among replicas Efficient rollback on divergence Benefits Allows multithreaded execution Up to 12x speedup on a 16-core machine 25% slower than an unreplicated server Bad things do happen to good systems ... 0 200 400 600 800 1000 1200 1400 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Throughput (requests/sec) # execution threads Throughput scalability sequential replicated unreplicated unsafe replicated Eve BFT? Yes Zyzzyva Command Voter Execution Command Agreement Voter Execution To From Synchronous, with faults (minimal liveness guarantees) Asynchronous (minimal liveness guarantees) Synchronous, with or without failures (high performance) Byzantine (FT) Empire Byzantine (FT) Empire BAR-B A BAR-tolerant cooperative backup system Peers assign each other chunks to store on their behalf Assignment need not be symmetric Deterministic retrieval guarantee despite Byzantine and Rational peers When assigning work, challenge is handling ``he said / she said” Problem could be solved by interposing an acquiescent witness W between A and B But, just in FT distributed computing it is not prudent to assume that any particular node will be correct, we donʼt want to assume that any W that we may use will not either fail or turn selfish. Solution: use BAR-tolerant State Machine Replication to build the abstraction of an acquiescent W out of node each of which may be Byzantine or selfish Depot Cloud storage with minimal trust Removes trust from providers Not the same thing as making providers more trustworthy! What is so special about MAD systems? Traditional threshold FT does not apply! Nodes can be selfish: cooperation requires incentives Sybil attacks can overwhelm any threshold mechanism If this were not enough, each domain is a black box to its peers: what basis is there for trust? The BAR Model Three classes of MAD nodes: Byzantine: deviate arbitrarily, for any reason Acquiescent: follow the assigned protocol obediently Rational: deviate iff doing so increases their utility BAR Tolerant Systems No more than n/3 Byzantine nodes No bound on number of rational nodes before UpRight after UpRight A I sent work to B but never received an answer! B I never received work from A! A B W A B W Asynchronous (minimal liveness guarantees) fil ile fi le file Chunks are erasure-coded Traditional Gossip Traditional Gossip Avg. upload b/w of altruistic peer (Kbps) Percentage of peers acting rationally Data stream BAR Gossip relies on Balance Exchange, a provably incentive compatible protocol: no selfish node has unilateral incentives to deviate from it Reliability with BAR Gossip is way up... But not all news are good: 99% Balanced exchange % rounds jittered % of peers Balanced Exchange 1 anomaly per minute 512 Kbps ($40/month home connection) Kbps % of peers data stream Balanced Exchange Peak bandwidth Jitter (1 anomaly/minute) is unacceptably high Peak bandwidth too high for home use! Flightpath Approximate equilibria for practical live streaming The price for proving BAR Gossip a Nash equilibrium is lack of flexibility: peers cannot join streaming mid-way communication patterns are inflexible extra overhead Flightpath balances obedience with choice through approximate equilibria not Nash, but -Nash: selfish node deviate only if doing so increases their utility by more than a factor of Flightpath supports dynamic membership; provides stable performance despite flash crowds; minimizes jitter; and lowers peak bandwidth below home-use threshold ε ε Balanced Exchange FlightPath % of peers % rounds jittered Jitterdämmerung! 512 Kbps Balanced Exchange Kbps FlightPath data stream % of peers Peak bandwidth below 512Kbps Rounds Average Bandwidth (Kbps) join event new epoch Banwidth is robust to churn 50 peers join 50 100 peers join 50 200 peers join 50 400 peers join 50 Add metadata to update Eliminates trust for safety Check metadata on receipt of update Can always write Can read if at least one node has data Can always trade updates Minimizes trust for liveness Teapot Minimal trust for today’s cloud Same guarantees of Depot, but using unmodified Amazon S3 servers In progress! Collaborators on the efforts noted here include my UT Austin colleagues Mike Dahllin and Mike Walfish; Allen Clement (MPI-SWS); Rama Kotla (MSR Silicon Valley); Alessandro Panconesi (Sapienza University Rome), Silvio Lattanzi (Google); and Edmund Wong, Manos Kapritsos, and Yang Wang, all Ph.D. students at UT.

Upload: vuongkien

Post on 09-Mar-2018

220 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Reasoning with MAD distributed systems - DIMACSdimacs.rutgers.edu/Workshops/China-USSoftware/Slides/AlvisiLorenzo.… · What is a MAD system? Any system that spans Multiple Administrative

UpRightPractical and configurable BFT replication • Applying BFT replication to real services

‣ BFT HDFS‣ BFT ZooKeeper

• Refining the architecture and APIs‣ Introduce Request Quorum stage before agreement‣ Clean application API for:

• processing requests• taking application state snapshots

• Configurable replication‣ u: services are Up (live) despite u failures‣ r: services are Right (safe) despite r commission failures‣ Replication costs expressed as a function of u and r

Synchronous, no failures

(high performance)

ZyzzyvaSpeculative Byzantine Fault Tolerance • Simplifies the design of BFT replication

‣One protocol to rule them all ✓latency ✓throughput ✓cost of replication

What is a MAD system?Any system that spans Multiple Administrative Domains (e.g. peer-to-peer services, cloud/outsourced storage, Internet routing, and wireless mesh routing.

BAR-Gossip BAR-tolerant Nash for P2P live streaming• Gossip is attractive infrastructure for P2P live streaming

• But gossip protocols perform poorly if many peers behave selfishly

Tolerating arbitrary faults MAD Systems

Dependable Storage

Reasoning with MAD distributed systemsLorenzo Alvisi

The University of Texas at Austin

Just in! Local social defenses against Sybil attacks• Current sybil defenses can distinguish honest from forged identities in social

graphs that are fast mixing (equivalently, have constant conductance)• Alas, many social graphs are not fast mixing!• We are developing a new approach that provides better protection without

relying on global graph properties, such as constant conductance, but rather leverages the social graph’s community structure.

Lorenzo Alvisi is a Professor in the Department of Computer Science at UT Austin, where he is a co-director of the Laboratory for Advanced Systems Software (LASR). He holds a Ph.D. and M.S. in Computer Science from Cornell University, and a Laurea summa cum laude in Physics from the University of Bologna, Italy. He is a Fellow of the ACM and the recipient of an Alfred P. Sloan Fellowship and of the NSF CAREER Award.

EVEReplicating multithreaded servers• Replication state-of-the-art

• Agree on order of requests, then execute them• Requires deterministic execution

• Practically this means single-threaded execution• Eve

• First execute requests nondeterministically, without agreeing on order• Then verify if state and responses match among replicas• Efficient rollback on divergence

• Benefits• Allows multithreaded execution• Up to 12x speedup on a 16-core machine• 25% slower than an unreplicated server

Bad things

do happen to good

systems...

0

200

400

600

800

1000

1200

1400

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Thro

ughp

ut (r

eque

sts/

sec)

# execution threads

Throughput scalabilitysequential replicated

unreplicatedunsafe replicated

Eve

BFT?

Yes

Zyzzyva

Command

Voter

Execution

Command

Agreement

Voter

Execution

ToFrom

Synchronous, with faults

(minimal liveness guarantees)

Asynchronous(minimal liveness

guarantees)

Synchronous, with or

without failures(high performance)

Byzantine (FT) Empire

Byzantine (FT) Empire

BAR-B A BAR-tolerant cooperative backup system• Peers assign each other chunks to store on their behalf

• Assignment need not be symmetric

• Deterministic retrieval guarantee despite Byzantine and Rational peers

When assigning work, challenge is handling ``he said / she said”

Problem could be solved by interposing an acquiescent witness W between A and B

But, just in FT distributed computing it is not prudent to assume that any particular node will be correct, we donʼt want to assume that any W that we may use will not either fail or turn selfish.

Solution: use BAR-tolerant State Machine Replication tobuild the abstraction of anacquiescent W out of node each of which may be Byzantine or selfish

DepotCloud storage with minimal trust• Removes trust from providers• Not the same thing as making providers more trustworthy!

What is so special about MAD systems?Traditional threshold FT does not apply!

Nodes can be selfish: cooperation requires incentivesSybil attacks can overwhelm any threshold mechanism

If this were not enough, each domain is a black box to its peers: what basis is there for trust?

The BAR ModelThree classes of MAD nodes:

Byzantine: deviate arbitrarily, for any reason

Acquiescent: follow the assigned protocol obediently

Rational: deviate iff doing so increases their utility

BAR Tolerant SystemsNo more than n/3 Byzantine nodes

No bound on number of rational nodes

before UpRight

after UpRight

A

I sent work to B but never received an answer!

B

I never received work from A!

A BW

A BW

Asynchronous(minimal liveness guarantees)

file

file

file

file

file

Chunks are erasure-coded

Traditional Gossip Traditional Gossip

Avg. u

ploa

d b/

w o

f altr

uist

ic p

eer

(Kbp

s)

Percentage of peers acting rationally

Data stream

• BAR Gossip relies on Balance Exchange, a provably incentive compatible protocol: no selfish node has unilateral incentives to deviate from it

• Reliability with BAR Gossip is way up...

• But not all news are good:

99%Balanced exchange

% rounds jittered

% o

f pe

ers

Balanced Exchange1 anomaly per minute

512

Kbps

($4

0/mon

th h

ome

conn

ection

)

Kbps

% o

f pe

ers

data

str

eam

Balanced ExchangePeak bandwidth

Jitter (1 anomaly/minute) is unacceptably high

Peak bandwidth too high for home use!

Flightpath Approximate equilibria for practical live streaming• The price for proving BAR Gossip a Nash equilibrium is lack of flexibility:‣ peers cannot join streaming mid-way‣ communication patterns are inflexible ‣ extra overhead

• Flightpath balances obedience with choice through approximate equilibria‣ not Nash, but -Nash: selfish node deviate only if doing so increases their

utility by more than a factor of • Flightpath supports dynamic membership; provides stable performance despite flash crowds; minimizes jitter; and lowers peak bandwidth below home-use threshold

εε

Balanced Exchange

FlightPath

% o

f pe

ers

% rounds jittered

Jitterdämmerung!

512

Kbps

Balanced Exchange

Kbps

FlightPath

data

str

eam

% o

f pe

ers

Peak bandwidth below 512Kbps

Rounds

Aver

age

Band

wid

th (Kb

ps)

join event new epoch

Banwidth is robust to churn

50 peers join 50100 peers join 50200 peers join 50400 peers join 50

Add metadata to update

Eliminates trust for safetyCheck metadata on

receipt of update

Can always write

Can read if at least one node has data

Can always trade updates

Minimizes trust for liveness

TeapotMinimal trust for today’s cloud• Same guarantees of Depot, but using unmodified

Amazon S3 servers In progress!

Collaborators on the efforts noted here include my UT Austin colleagues Mike Dahllin and Mike Walfish; Allen Clement (MPI-SWS); Rama Kotla (MSR Silicon Valley); Alessandro Panconesi (Sapienza University Rome), Silvio Lattanzi (Google); and Edmund Wong, Manos Kapritsos, and Yang Wang, all Ph.D. students at UT.