DSN 2008 1 Byzantine Replication Under Attack Yair Amir, Jonathan Kirsch, John Lane Johns Hopkins University Brian Coan Telcordia Technologies

DSN 2008 1

Byzantine Replication Under Attack

Yair Amir, Jonathan Kirsch, John Lane
Johns Hopkins University

Brian Coan
Telcordia Technologies


Motivation

• Society depends on large-scale, distributed computer systems for critical infrastructure.
• Insider attacks are a real threat, even for systems designed with security in mind.
• Byzantine replication provides fault tolerance by protecting against partial system compromises.
  – Attacker must compromise more than some threshold fraction of the system to cause inconsistency or prevent the system from functioning.
  – Systems perform well in fault-free or benign fault runs.
  – What about performance when under attack?


The Downside of Asynchrony

• Existing correctness criteria: safety and liveness
  – Safety: servers remain consistent.
  – Liveness: each update is eventually executed.
• Protocols are designed to be safe in all executions.
  – Do not rely on synchrony for safety!
  – Guarantee liveness only when the network is sufficiently stable.
• Real systems are not completely asynchronous.
  – Systems can satisfy much stronger performance guarantees than liveness during stable periods.
• Consequence: Performance attacks!
  – An attacker can exploit the gap between what is promised during stable periods (liveness) and what is possible.


Performance Attacks: A First-Hand Look

• Red-team attack on Steward [DSN 06].
• Goal was to violate safety or liveness.
• Steward survived all of the attacks!
• Most did not affect performance.
• The system was slowed down in one experiment.
  – Speed of update ordering was slowed down by a factor of 5.
• Big problem:
  – A better attack could slow the system down by a factor of 100.
  – But the system is still considered live!
• Liveness is a necessary but insufficient correctness criterion for practical systems on wide-area networks.


Byzantine Performance Failures

• If the adversary cannot violate safety and liveness, the next best thing is to slow down the system beyond usefulness.

• Performance failures: send correct messages slowly but without triggering timeouts.

Previously Considered Byzantine Failures

Failure Type    Failure Behavior                                       Mitigated by
Value Domain    Sending incorrect, conflicting, or invalid messages    Cryptography, agreement protocols
Time Domain     Messages arrive after timeouts or not at all           Timeouts, view change


A New Problem: Performance Under Attack

• Existing systems are vulnerable to performance attacks.
  – A small number of faulty servers can cause the system to make progress at an extremely slow rate, indefinitely!
• Leader-based protocols are vulnerable to performance attacks by a malicious leader.
  – Problem is magnified in wide-area networks, where it is difficult to predict the performance that should be expected of the leader.
• Main challenges:
  – Developing meaningful performance metrics for evaluating Byzantine replication protocols.
  – Designing protocols that perform well according to these metrics, even when the system is under attack.


Outline

• Motivation
• Byzantine Performance Failures
• Relevant Prior Work
• Case Study: BFT Under Attack
• The Prime Replication System
• Bounded-Delay
• Protocol Overview
• Experimental Results
• Summary


Relevant Prior Work

• Leader-based Byzantine replication
  – BFT [Castro, Liskov 99]
  – Separating agreement from execution [Yin et al. 03]
  – Fast Byzantine Consensus [Martin, Alvisi 05]
  – Zyzzyva [Kotla et al. 07]
• Randomized Byzantine replication
  – SINTRA [Cachin, Poritz 02]
  – RITAS [Moniz et al. 06]
• Quorum-based Byzantine replication
  – Q/U [Abd-El-Malek et al. 05]
  – HQ [Cowling et al. 06]


Case Study: BFT Under Attack [Castro and Liskov 99]

[Figure: BFT message flow among the client and servers 0 to 3 (one marked as leader). Phases: request, pre-prepare, prepare, commit, reply.]

• Attack 1: Pre-Prepare Delay
  – Malicious leader can add delay into the ordering path by withholding its Pre-Prepare.
  – Non-leaders maintain a FIFO queue of pending updates.
    • Use timeouts to monitor the leader.
    • Timeout placed on execution of first update in queue.
  – Malicious leader can stay in power by ordering one update per queue per timeout period!
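
To make the cost of Attack 1 concrete, here is a minimal Python sketch of the timer rule described above. It is an illustration only: it assumes a single FIFO queue, a fixed timeout that restarts whenever the head of the queue is executed, and a hypothetical `margin_s` by which the malicious leader beats each timer. The function name and parameters are made up for this example and are not part of BFT.

```python
from collections import deque


def drain_under_attack(pending, timeout_s, margin_s=0.001):
    """Return (update, execution_time) pairs for a maximally slow leader.

    Assumption: the non-leader's timer covers only the head of its FIFO
    queue, so the leader avoids suspicion by ordering that one update
    just before the timer would expire.
    """
    if not 0 < margin_s < timeout_s:
        raise ValueError("margin must be positive and smaller than the timeout")
    queue = deque(pending)
    now = 0.0
    executed = []
    while queue:
        # The leader waits almost a full timeout period, then orders the head.
        now += timeout_s - margin_s
        executed.append((queue.popleft(), now))
        # The head changed, so the timer restarts for the new head.
    return executed


if __name__ == "__main__":
    log = drain_under_attack([f"u{i}" for i in range(1, 101)], timeout_s=0.3)
    print(f"{len(log)} updates executed in {log[-1][1]:.1f} s")
```

With a 300 ms timeout, this model executes roughly 3.3 updates per second no matter how many updates are waiting, yet every update is eventually executed, so liveness still holds.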


Case Study: BFT Under Attack [Castro and Liskov 99]

[Figure: BFT message flow among the client and servers 0 to 3 (one marked as leader). Phases: request, pre-prepare, prepare, commit, reply.]

• Attack 2: Timeout Manipulation
  – Timeout doubles every time the leader is replaced.
  – Use a denial of service attack to increase the timeout, then stop on a malicious leader.
• Each update is eventually executed, but performance is much worse than if there were only correct servers.
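
The effect of the doubling rule in Attack 2 can be sketched with a few lines of Python. This is a simplified model, assuming the timeout starts at some initial value and doubles exactly once per leader replacement with no upper cap; the function name is hypothetical.

```python
def timeout_after_view_changes(initial_timeout_s: float, view_changes: int) -> float:
    """Timeout in force after a number of forced leader replacements.

    Assumption: the timeout doubles on every replacement and is never reset.
    """
    if initial_timeout_s <= 0 or view_changes < 0:
        raise ValueError("timeout must be positive and view_changes non-negative")
    return initial_timeout_s * (2 ** view_changes)


if __name__ == "__main__":
    for k in range(6):
        print(k, timeout_after_view_changes(0.3, k))
```

Under these assumptions, five forced view changes turn a 0.3 s timeout into 9.6 s, which is the window a malicious leader can then exploit with the Pre-Prepare delay attack.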


The Prime Replication System

• Performance-Oriented Replication in Malicious Environments
  – Leader-based protocol providing Bounded-Delay, a stronger guarantee than liveness, when the network is stable.
• System components:
  – Prime Ordering Protocol (Preordering phase, Global ordering phase)
  – Suspect-Leader Protocol for detecting malicious leaders.
• Main Ideas:
  – Resources needed by the leader to do its job are bounded and independent of system throughput.
    • Leader has “no excuse” for not sending timely messages.
  – Non-leader servers compute a threshold level of acceptable performance that the leader should meet.
    • Upper-bounded by a function of the latency between correct servers after the network stabilizes.


Bounded Delay

• Prime-Stability: There is a time after which the following condition holds for a set of at least 2f+1 correct servers (the stable servers):
  – For each pair of stable servers r and s, there exists a value Min_Lat(r,s), unknown to the servers, such that if r sends a message to s, it will arrive with a delay bounded in terms of Min_Lat(r,s) [delay symbol and bound not preserved in this transcript].
• Bounded-Delay: There exists a time after which the update latency for any update initiated by a stable server is upper-bounded.


Prime: Ordering Protocol

• Preordering (PO) Phase:
  – Each server, o, disseminates its updates to the other servers (PO-Request).
  – Agreement protocol binds update u to preorder identifier (o, i), where u is the ith update originated by server o (PO-ACK).
  – Each server cumulatively acknowledges the updates it preorders (PO-ARU).

[Figure: message flow with no attack. Phases: PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit. Legend: L = Leader, O = Originator, and a marker (symbol lost in extraction) for Aggregation Delay.]


Prime: Ordering Protocol

[Figure: message flow with no attack. Phases: PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit. Legend: L = Leader, O = Originator, and a marker (symbol lost in extraction) for Aggregation Delay.]

Preordering Protocol

Server      Updates originated    Preorder bindings
Server 1    ua, ub, uc            (1, 1, ua), (1, 2, ub), (1, 3, uc)
Server 2    ud, ue                (2, 1, ud), (2, 2, ue)
Server 3    uf                    (3, 1, uf)
Server 4    ug, uh, ui            (4, 1, ug), (4, 2, uh), (4, 3, ui)

PO-ARU: 3 2 1 3
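
The preordering example on this slide can be reproduced with a short Python sketch. It is a simplified illustration, assuming servers are numbered from 1 and that a PO-ARU is just a vector whose entry for originator o is the highest i such that updates (o, 1) through (o, i) have all been preordered. The helper names `bind` and `po_aru` are hypothetical.

```python
def bind(origin: int, updates: list[str]) -> dict[tuple[int, int], str]:
    """Bind the ith update originated by a server to preorder identifier (origin, i)."""
    return {(origin, i): u for i, u in enumerate(updates, start=1)}


def po_aru(preordered, num_servers: int) -> list[int]:
    """Cumulative acknowledgement vector over the preordered identifiers.

    Entry o-1 is the largest i with (o, 1), ..., (o, i) all preordered,
    so a gap in the sequence stops the count.
    """
    vector = []
    for origin in range(1, num_servers + 1):
        seq = 0
        while (origin, seq + 1) in preordered:
            seq += 1
        vector.append(seq)
    return vector


if __name__ == "__main__":
    bound = {}
    bound.update(bind(1, ["ua", "ub", "uc"]))
    bound.update(bind(2, ["ud", "ue"]))
    bound.update(bind(3, ["uf"]))
    bound.update(bind(4, ["ug", "uh", "ui"]))
    print(po_aru(bound, num_servers=4))
```

For the four servers in the example this yields the vector 3 2 1 3 shown on the slide. If a server had not yet preordered (1, 2), its entry for Server 1 would stay at 1 even though (1, 3) had arrived.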


Prime: Ordering Protocol

• Global Ordering Phase:
  – Similar to BFT (Pre-Prepare, Prepare, Commit)
  – Leader periodically sends a Pre-Prepare containing a proof matrix (vector of PO-ARU messages).
  – Each globally ordered Pre-Prepare maps to a batch of preordered updates based on contents of proof matrix.
  – Final total order is obtained by deterministically ordering the updates in each batch based on preorder identifier.

[Figure: message flow with no attack. Phases: PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit. Legend: L = Leader, O = Originator, and a marker (symbol lost in extraction) for Aggregation Delay.]


Prime: Ordering Protocol

[Figure: Global Ordering Protocol, drawn over the no-attack message flow (PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit; L = Leader, O = Originator). Labels: Pre-Prepare 1, Pre-Prepare 2, PP1, PP2, Final Total Order, PO-ARU1 to PO-ARU4, PO-ARU1’ to PO-ARU4’.]
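
One way to read the global ordering step is sketched below in Python. It is an illustration under stated assumptions, not the protocol's specification: each PO-ARU is modeled as a plain vector of per-originator sequence numbers, an update is treated as eligible once at least `quorum` rows of the proof matrix acknowledge it (the slides do not state the threshold), and each batch is ordered lexicographically by (o, i). The function names are hypothetical.

```python
def eligible_upto(matrix: list[list[int]], quorum: int) -> list[int]:
    """For each originator, the highest sequence number acknowledged by
    at least `quorum` rows of the proof matrix (assumed eligibility rule)."""
    if not 1 <= quorum <= len(matrix):
        raise ValueError("quorum must be between 1 and the number of rows")
    # The quorum-th largest value in a column is acknowledged by >= quorum rows.
    return [sorted(column, reverse=True)[quorum - 1] for column in zip(*matrix)]


def batch(previous: list[int], current: list[int]) -> list[tuple[int, int]]:
    """Preorder identifiers newly made eligible, sorted by (originator, seq)."""
    ordered = []
    for origin, (low, high) in enumerate(zip(previous, current), start=1):
        ordered.extend((origin, i) for i in range(low + 1, high + 1))
    return ordered


if __name__ == "__main__":
    # Four servers, f = 1, assumed quorum of 2f + 1 = 3; the last row is stale.
    pp1 = [[3, 2, 1, 3], [3, 2, 1, 2], [2, 2, 0, 3], [0, 0, 0, 0]]
    pp2 = [[3, 2, 1, 3], [3, 2, 1, 3], [3, 2, 1, 3], [0, 0, 0, 0]]
    done = [0, 0, 0, 0]
    total_order = []
    for matrix in (pp1, pp2):
        upto = eligible_upto(matrix, quorum=3)
        total_order.extend(batch(done, upto))
        done = upto
    print(total_order)
```

Because every correct server applies the same rule to the same ordered sequence of Pre-Prepares, each derives the same batches and therefore the same final total order, without the leader ever sending the updates themselves.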


Attack Analysis

[Figure: message flow with no attack. Phases: PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit. Legend: L = Leader, O = Originator, and a marker (symbol lost in extraction) for Aggregation Delay.]

• Key Points:
  – Preordering phase for updates sent by correct servers cannot be slowed down by faulty servers.
  – Once all correct servers receive a Pre-Prepare, global ordering cannot be slowed down by faulty servers.
• Possible Attacks:
  – 1. Leader sends its Pre-Prepare to only some correct servers.
  – 2. Leader sends a Pre-Prepare with out-of-date PO-ARUs.
  – 3. Leader delays its Pre-Prepare.


Addition 1: Pre-Prepare Flooding

[Figure: message-flow diagrams for the attack and no-attack cases. Phases: PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit. Legend: L = Leader, O = Originator, and a marker (symbol lost in extraction) for Aggregation Delay.]

• Intuition:
  1. The leader must withhold the Pre-Prepare from all correct servers to significantly impact latency.
  2. If we can force the leader to send timely, up-to-date Pre-Prepares to at least one correct server, we can ensure timely ordering!


Addition 2: Proof Matrix Messages

[Figure: message flow under attack with an added Proof-Matrix message. Phases: PO-Request, PO-ACK, PO-ARU, Proof-Matrix, Pre-Prepare, Prepare, Commit. Legend: L = Leader, O = Originator.]

• Each server periodically sends a Proof-Matrix message, containing the latest PO-ARU messages it has received, to the leader.
  – A correct server expects a leader to include, in its next Pre-Prepare, PO-ARU messages that are at least as up-to-date as those in the Proof-Matrix message.
• Why is this expectation justified?
  – A correct leader can simply adopt any PO-ARU messages that are more up to date than what it currently has.


Key Idea: Turn-Around Time

[Figure: message flow under attack with the Proof-Matrix message. Phases: PO-Request, PO-ACK, PO-ARU, Proof-Matrix, Pre-Prepare, Prepare, Commit. Legend: L = Leader, O = Originator.]

• Turn-around time
  – Time between sending a Proof-Matrix message, PM, and receiving a Pre-Prepare “covering” all of the PO-ARU messages in PM.
• Key Observation:
  – The resources required by the leader to send a Pre-Prepare (bandwidth, CPU) are bounded and independent of system throughput.
  – We can use turn-around time as a measure by which to judge the leader!
• Intuition: Force the leader to be timely by ensuring that it provides a fast enough turn-around time to at least one correct server.
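
The turn-around time measurement can be sketched as follows in Python. This is a simplified model rather than Prime's actual data structures: each PO-ARU message is reduced to a single integer freshness counter per server (an assumption made here for brevity), a Pre-Prepare "covers" a Proof-Matrix message when every one of its counters is at least as large, and timestamps are integer milliseconds. All names are hypothetical.

```python
def covers(pre_prepare: list[int], proof_matrix: list[int]) -> bool:
    """True if the Pre-Prepare is at least as up to date as PM for every server."""
    return len(pre_prepare) == len(proof_matrix) and all(
        pp >= pm for pp, pm in zip(pre_prepare, proof_matrix)
    )


def turn_around_time(sent_at_ms: int, proof_matrix: list[int], received):
    """Time from sending PM until the first Pre-Prepare covering it arrives.

    `received` is a list of (arrival_time_ms, pre_prepare) pairs in arrival
    order. Returns None if no covering Pre-Prepare has arrived yet.
    """
    for arrived_at_ms, pre_prepare in received:
        if arrived_at_ms >= sent_at_ms and covers(pre_prepare, proof_matrix):
            return arrived_at_ms - sent_at_ms
    return None


if __name__ == "__main__":
    pm = [5, 4, 7]
    arrivals = [(1020, [5, 3, 7]), (1050, [6, 4, 7])]
    print(turn_around_time(1000, pm, arrivals))
```

In this run the first Pre-Prepare is stale for the second server, so only the second one counts. A leader that replies promptly but with out-of-date PO-ARUs therefore gains nothing: the measured turn-around time keeps growing until a covering Pre-Prepare arrives.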


Suspect-Leader Protocol

• Protocol Strategy:
  – Dynamically determine an acceptable turn-around time based on roundtrip measurements (TAT_acceptable).
  – Use turn-around times measured in the current view to compute a measure of the current leader’s performance (TAT_leader).
  – Suspect the leader if TAT_leader > TAT_acceptable.
• Design Challenges:
  – Malicious servers can lie to try to lower expectation of acceptable performance.
    • Leader could remain in power while going slowly.
  – Malicious servers can lie to make a correct leader look bad.
    • Would lead to continuous view changes.


Suspect-Leader: Key Properties

• Any server that retains a role as leader must provide a TAT to at least one correct server that is no more than a fixed bound [formula not preserved in this transcript].
  – Maximum update latency: [formula not preserved in this transcript]
  – Legend (symbols not preserved): one symbol denotes the maximum delay between correct servers, another the aggregation delay.
  – Bounded-Delay!
• There exists a set of at least f+1 correct servers that will not be suspected by any correct server if elected leader.
  – Aggressive but not overly aggressive.


Experimental Results

• 7 servers (f = 2)
• Symmetric network
  – 50ms diameter, 10 Mbps links
• Leader performs just well enough to stay in power.
• BFT: aggressive timeout (300ms)
• BFT: Pre-Prepare delay
• Prime:
  – Leader adds as much delay as possible.
  – Non-leader servers force as much reconciliation as possible.

[Figure: Update Throughput vs. Clients (50ms diameter, 10 Mbps links). x-axis: Number of Clients (0 to 500); y-axis: Update Throughput (updates/sec, 0 to 900). Series: BFT - No Attack, Prime - No Attack, Prime - Under Attack, BFT - Under Attack.]

[Figure: Update Latency vs. Clients (50ms diameter, 10 Mbps links). x-axis: Number of Clients (0 to 500); y-axis: Update Latency (s, 0 to 1.4). Series: BFT - No Attack, Prime - No Attack, Prime - Under Attack, BFT - Under Attack.]


Summary

• Existing leader-based Byzantine replication protocols are vulnerable to performance attacks.
  – Liveness is not a meaningful performance metric for evaluating Byzantine replication protocols.
• Bounded-Delay: a new performance metric.
  – Can we provide stronger guarantees?
  – Can we guarantee a minimum throughput?
• Prime: a new Byzantine replication protocol.
  – Achieves Bounded-Delay when the network is sufficiently stable.


Questions?