byzantine fault-tolerant state machine replication · 2016-04-14 · byzantine fault-tolerant state...



INTOL ©2010-16 P. Veríssimo and A. Bessani – All rights reserved.

Byzantine Fault-Tolerant State Machine Replication

MSI – MEI – MI 2015/2016


Replicas as State Machines

•  Characteristics
   –  confinement - atomic commands/operations
   –  fault tolerance - “easy” replication

•  Execution model
   –  Initial state - All correct servers start in the same state
   –  Agreement - All correct servers execute the same input commands
   –  Total order - All correct servers execute the commands in the same order
   –  Determinism - The same command, executed in the same initial state on any two correct servers, generates the same final state and the same outputs

•  Programming
   –  message-based interactions
   –  requires deterministic execution
   –  reduces concurrency if commands are long

[Figure: Client1 and Client2, an INPUT QUEUE holding messages m1 to m4, the STATE MACHINE, and its OUTPUT.]
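The execution model above can be illustrated with a minimal sketch. The key-value service, the `KVStateMachine` class and the `run` helper are hypothetical names chosen for illustration; they are not part of the slides.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """Hypothetical deterministic service: a tiny key-value store."""
    state: dict = field(default_factory=dict)

    def execute(self, command):
        """Apply one atomic command and return its output."""
        op, key, *args = command
        if op == "put":
            self.state[key] = args[0]
            return "ok"
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown operation: {op}")


def run(commands):
    """Start a fresh replica (same initial state) and execute the commands in order."""
    replica = KVStateMachine()
    outputs = [replica.execute(c) for c in commands]
    return replica.state, outputs


log = [("put", "x", 1), ("put", "x", 2), ("get", "x")]
# Same initial state + same commands + same order => same state and outputs.
assert run(log) == run(log)
# Without total order, replicas may end up in different states.
reordered = [log[1], log[0], log[2]]
assert run(reordered)[0] != run(log)[0]
```

The second assertion shows why agreement on the set of commands alone is not enough: the order in which they are executed matters as well.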


Replicated State Machine (active replication)

•  replicated state machine:
   –  all replicas execute “at same time”
   –  achieves error masking
   –  determinism mandatory

•  replica quorums:
   –  benign communication failures
   –  omissive process failures - f+1 replicas
      »  client waits for a single reply
   –  affirmative process failures - 2f+1 replicas
      »  client waits for f+1 matching replies

•  message ordering:
   –  total order of commands to replicas
   –  same commands in same order implies the same results

[Figure: REPLICATED STATE MACHINE. The INPUT from Client1 and Client2 is disseminated to all replicas, each of which processes m1, m2, m3; the OUTPUT is consolidated.]
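Result consolidation at the client can be sketched as a simple vote: with up to f replicas returning wrong values, a result backed by f+1 matching replies was produced by at least one correct replica. The `consolidate` function and the reply format (a mapping from replica id to result) are assumptions made for this sketch.

```python
from collections import Counter


def consolidate(replies, f):
    """Return the result backed by at least f+1 matching replies, else None.

    `replies` maps a replica id to the (hashable) result it returned,
    so each replica contributes at most one vote.
    """
    if not replies:
        return None
    result, votes = Counter(replies.values()).most_common(1)[0]
    return result if votes >= f + 1 else None
```

With f = 1, two matching replies are enough even if a third replica answers something else; a single reply never is.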


Replicated State Machine (passive replication)

•  passive replication
   –  only the primary executes the commands
   –  in the order it decides
   –  supports preemption and non-determinism (active replication doesn’t)
   –  does not support value faults

•  state transferred to backup(s)
   –  inter-replica deferred state-level synchronization (checkpoints)
   –  backup(s) log commands until a checkpoint is received
   –  if the primary fails, a backup takes over
   –  potentially long takeover glitch

•  message ordering:
   –  non-ordered message diffusion

[Figure: passive replication with P1 (PRIMARY) and P2 (BACKUP). Labels: message log, Execute(m1), Checkpoint S(m1), empty log, OUTPUT.]

Can it be used for intrusion tolerance?


Since passive replication does not tolerate value faults, active (state machine) replication is the way to go!

Byzantine Fault-Tolerance (BFT) Replication!


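Byzantine State Machine Replication

[Figure: REPLICATED STATE MACHINE with disseminated INPUT from Client1 and Client2 and consolidated OUTPUT. Messages appear corrupted or reordered at the replicas (e.g., m56, m24, m565, m13). Two points are marked, "Byzantine communication" and "Byzantine processing", each with the question: What can happen here?]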


Byzantine State Machine Replication

•  input requirements:
   –  communication failures can be arbitrary
   –  commands delivered by a Byzantine atomic broadcast protocol

•  execution requirements:
   –  failures of servers can be arbitrary
   –  masked by replicated execution
   –  result consolidation may be a problem

•  given N servers, the maximum number of servers that can fail is: $f = \lfloor (N-1)/3 \rfloor$

•  or in other words: $N \geq 3f+1$
   –  this limit is actually imposed by the protocol used to disseminate messages (i.e., Byzantine atomic broadcast)
   –  Byzantine state machine replication by itself (replicated execution plus result voting) requires only 2f+1 replicas
   –  ex: N=4 for f=1, N=7 for f=2, and so on
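The relation between N and f can be checked with a few lines of arithmetic. This sketch assumes the usual bound N ≥ 3f+1, which is consistent with the examples N=4 for f=1 and N=7 for f=2; the function names are illustrative.

```python
def max_faults(n):
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3


def min_replicas(f):
    """Smallest N that tolerates f Byzantine servers under N >= 3f + 1."""
    return 3 * f + 1


for f in (1, 2, 3):
    n = min_replicas(f)
    # Adding one or two replicas to 3f + 1 does not raise the tolerated f.
    assert max_faults(n) == max_faults(n + 1) == max_faults(n + 2) == f
```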


PBFT – Practical Byzantine Fault Tolerance

•  Primary-based BFT SMR algorithm
   –  system evolves in views, numbered sequentially
   –  in each view v, one server is the primary, the others are the backups
      »  “Primary of view v” = v mod N

•  Efficient and fast
   –  uses message authentication codes (MACs) instead of asymmetric crypto signatures

•  Prevention-tolerance mix
   –  client requests are authenticated
   –  each client shares a key with each server
   –  a request contains an authenticator: a vector of MACs, one per server-shared key
   –  no forging/corruption - servers discard messages with an invalid authenticator

(Castro and Liskov, 1999, 2002)
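An authenticator can be sketched as a vector of MACs, one entry per server, each computed with the key the client shares with that server. HMAC-SHA256 from the Python standard library is used here only as a convenient stand-in; the MAC construction, the function names and the key layout are assumptions of this sketch, not what PBFT prescribes.

```python
import hashlib
import hmac


def make_authenticator(request: bytes, client_keys: dict) -> dict:
    """Client side: one MAC per server, keyed by the client-server shared key.

    `client_keys` maps a server id to the key shared with that server.
    """
    return {
        sid: hmac.new(key, request, hashlib.sha256).digest()
        for sid, key in client_keys.items()
    }


def server_accepts(server_id, shared_key: bytes, request: bytes,
                   authenticator: dict) -> bool:
    """Server side: verify only this server's entry of the vector."""
    expected = hmac.new(shared_key, request, hashlib.sha256).digest()
    return hmac.compare_digest(expected, authenticator.get(server_id, b""))
```

Each server can verify only its own entry, which is why an authenticator is weaker than a digital signature: a server cannot prove to a third party that the request was valid.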


PBFT - System Model

•  Asynchronous distributed system

•  Network can lose, delay, reorder and duplicate messages, but cannot do that indefinitely
   –  i.e., they require fair links to implement reliable channels

•  Byzantine fault model
   –  with fault independence (i.e., no common mode faults)

•  Cryptography
   –  PK signatures to facilitate the protocol presentation
   –  MAC (each pair of processes shares a key)
   –  Digests (hashes)

•  Adversary cannot break cryptographic primitives


PBFT – Service Properties

•  Deterministic replicated service

•  Requires 3f+1 replicas to tolerate f faults

•  Service safety:
   –  The replicated service should behave as its centralized counterpart (Linearizability)
   –  Malicious replicas can compromise their states

•  Service liveness (requires synchrony assumptions):
   –  A command issued by a correct client will eventually be executed (Wait-freedom) if the network transmission delay doesn’t grow faster than real time
   –  This condition is satisfied by the eventually synchronous system model


PBFT – Algorithm

•  algorithm essentials:
   –  two operation modes: normal operation and view-change
   –  system evolves in views (if some operation cannot be totally ordered, a new view is started)
   –  a checkpoint and state transfer protocol is also executed periodically

•  algorithm outline:
   –  all messages are signed (with authenticators)
   –  client multicasts a request with a service command and a timestamp to all servers
   –  servers reach agreement about the sequence number of the request
   –  client waits for at least f+1 replies with the same result (at least one correct server executed the operation and produced the result)

[Figure: message patterns of the two operation modes, Normal Operation and View Change.]


PBFT – Normal Operation I

•  pre-prepare phase:
   –  the primary receives a correctly signed request m
   –  it assigns a sequence number n to the message and sends this number, a digest of the request D(m) and its current view number v to all backups (other replicas) in a PRE-PREPARE message
   –  backup replicas receive the message and check its validity, i.e., that n was not assigned to another request in the current view v
   –  if a replica has m and a valid PRE-PREPARE for it, it proceeds to the prepare phase (m is pre-prepared)

v is the view number; n is the sequence number of m
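The validity test a backup applies to a PRE-PREPARE can be sketched as follows. This is a simplified illustration: signature/authenticator checks and the sequence-number window are left out, SHA-256 stands in for the digest D(m), and the `Backup` class and its method names are hypothetical.

```python
import hashlib


def digest(m: bytes) -> str:
    """Stand-in for D(m); SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(m).hexdigest()


class Backup:
    def __init__(self, view):
        self.view = view
        self.accepted = {}  # (v, n) -> digest bound to n in view v

    def on_pre_prepare(self, v, n, d, m):
        """Return True if m becomes (or already is) pre-prepared at this backup."""
        if v != self.view or digest(m) != d:
            return False
        # Bind n to d the first time; later messages must carry the same digest.
        bound = self.accepted.setdefault((v, n), d)
        return bound == d
```

A second PRE-PREPARE for the same (v, n) with a different digest is rejected, which is what stops a faulty primary from getting two requests accepted by the same correct backup under one sequence number.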


PBFT – Normal Operation II

•  prepare phase:
   –  replicas store the received PRE-PREPARE message
   –  each replica sends a PREPARE message to the other replicas containing v, n and the digest D(m) of the message
   –  all servers that receive 2f PREPARE messages from other replicas with the same v, n and D(m) proceed to the commit phase
   –  when a replica finishes the prepare phase for m, we say that m is prepared on this replica

v is the view number; n is the sequence number of m
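The "prepared" condition can be written as a small predicate: the replica holds a matching PRE-PREPARE and has received 2f matching PREPARE messages from distinct other replicas. The tuple layouts and the function name below are assumptions of this sketch.

```python
def is_prepared(pre_prepared, prepares, v, n, d, f):
    """Check whether the request with digest d is prepared for (v, n).

    pre_prepared: set of (v, n, d) tuples accepted from the primary.
    prepares: iterable of (sender, v, n, d) received from other replicas.
    """
    if (v, n, d) not in pre_prepared:
        return False
    # Count distinct senders so a replica repeating itself adds nothing.
    senders = {s for (s, pv, pn, pd) in prepares if (pv, pn, pd) == (v, n, d)}
    return len(senders) >= 2 * f
```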


PBFT – Normal Operation III

•  commit phase:
   –  each replica multicasts a COMMIT message containing v and n
   –  the request m to which n was assigned is executed when:
      1.  a replica receives 2f COMMIT messages with the same v and n from other replicas, and
      2.  all requests with sequence number lower than n are executed
   –  when replica i finishes the commit phase, we say that m is committed in i

v is the view number; n is the sequence number of m
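The second execution condition (all lower sequence numbers already executed) amounts to an in-order delivery buffer. In this sketch, `on_committed` is assumed to be called once the 2f matching COMMIT messages for n have been collected; the `Executor` class is a hypothetical helper, not part of the protocol description.

```python
class Executor:
    def __init__(self):
        self.last_executed = 0  # sequence numbers start at 1 in this sketch
        self.committed = {}     # n -> request, committed but not yet executed
        self.executed = []      # requests in execution order

    def on_committed(self, n, request):
        """Record a committed request and execute every request that is now in order."""
        self.committed[n] = request
        while self.last_executed + 1 in self.committed:
            self.last_executed += 1
            self.executed.append(self.committed.pop(self.last_executed))
```

A request that commits early simply waits in the buffer until the gap below it is filled.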


PBFT - Some Protocol Invariants

•  <m,n,v> is prepared in a correct replica
   →  2f+1 replicas pre-prepared <m,n,v>
   →  at least f+1 of them are correct
   →  (f+1) + (2f+1) > 3f+1 (any 2f+1 quorum of the system will contain at least one of these correct replicas)
   →  it is impossible to have <m’,n,v> (m’ ≠ m) prepared on some correct replica (a correct replica will not pre-prepare two messages with the same n and v)

•  <m,n,v> is committed in a correct replica
   →  2f+1 replicas prepared <m,n,v>
   →  at least f+1 of them are correct
   →  any 2f+1 quorum of this system will contain at least one of these correct replicas (that can show that <m,n,v> is prepared)

v is the view number; n is the sequence number of m
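The quorum-intersection argument behind these invariants can be checked by brute force for small f: among 3f+1 replicas, any two quorums of 2f+1 share at least f+1 members, so at least one shared member is correct. The function name is illustrative.

```python
from itertools import combinations


def min_quorum_overlap(f):
    """Smallest intersection between any two quorums of size 2f+1 out of 3f+1 replicas."""
    n = 3 * f + 1
    quorums = list(combinations(range(n), 2 * f + 1))
    return min(len(set(a) & set(b)) for a in quorums for b in quorums)


for f in (1, 2):
    # The overlap is f+1, so it cannot consist of faulty replicas only.
    assert min_quorum_overlap(f) == f + 1
```

The same number follows from counting: (2f+1) + (2f+1) − (3f+1) = f+1.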


PBFT – Checkpoint

•  Every protocol message is only accepted (and logged) if the assigned sequence number falls within a certain interval marked by two values: h and H = h + L (maximum log size)

•  Periodically (after every K request executions), the replicas exchange CHECKPOINT messages to advance h and H by K

•  CHECKPOINT messages contain a digest of the system’s state before the checkpoint and the sequence number n of the last executed request to reach this state (n mod K = 0)

•  Replicas store 2f+1 CHECKPOINT messages as a proof that no other checkpoint for n is possible
   –  (2f+1) + (2f+1) > 3f+1; two such sets intersect in at least f+1 replicas, so there is always at least one correct replica in the intersection

•  All messages regarding requests with sequence number smaller than n can be discarded from the log

•  Late replicas can update themselves by fetching states that can be proved correct with 2f+1 CHECKPOINT messages
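The sequence-number window and its advance on a stable checkpoint can be sketched as below. The `Log` class is hypothetical; the window is taken as h < n ≤ H, and entries up to and including the checkpointed n are dropped, which is one reading of the discard rule and should be treated as an assumption.

```python
class Log:
    def __init__(self, L, K):
        self.h = 0          # low watermark: last stable checkpoint
        self.L = L          # maximum log size
        self.K = K          # checkpoint period
        self.entries = {}   # n -> list of logged protocol messages

    @property
    def H(self):
        return self.h + self.L

    def accept(self, n, msg):
        """Log msg only if its sequence number lies in the window (h, H]."""
        if not (self.h < n <= self.H):
            return False
        self.entries.setdefault(n, []).append(msg)
        return True

    def on_stable_checkpoint(self, n):
        """Call once 2f+1 matching CHECKPOINT messages for n were collected."""
        if n % self.K != 0 or n <= self.h:
            return
        self.h = n
        # Assumption: everything covered by the checkpoint is garbage-collected.
        self.entries = {k: v for k, v in self.entries.items() if k > n}
```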


PBFT – View Change I

•  a backup replica triggers the view change protocol if some message m remains pending for longer than a certain time limit (the request timeout expires)

•  At this point, replica i stops accepting messages for v and sends a VIEW-CHANGE message containing:
   –  the next view number v+1
   –  the sequence number n of the last stable checkpoint
   –  a set C of 2f+1 signed CHECKPOINT messages that validate n
   –  a set P of messages prepared on i in views v’ ≤ v
   –  a set Q of messages pre-prepared on i in views v’ ≤ v


PBFT – View Change II

•  VIEW-CHANGE messages are accepted if C validates n and all messages in P and Q are from views ≤ v

•  for each accepted VIEW-CHANGE message, a replica sends a VIEW-CHANGE-ACK to the primary of the next view (v+1)

•  the new primary only accepts a VIEW-CHANGE from a replica if it receives 2f-1 VIEW-CHANGE-ACKs for it from other replicas (the 1999 paper does not contain this phase, but it requires PK signatures on view changes)


PBFT – View Change III

•  the new primary uses the information on accepted VIEW-CHANGE messages to define the new view’s h as the highest sequence number found on a valid checkpoint

•  for each sequence number n such that h < n ≤ h + L
   –  if there is some message m prepared with n in 2f+1 replicas (possibly committed in some of them), the sequence number n must be assigned to m
   –  otherwise, n must be assigned to a null operation (to fill gaps)

•  these assignments must be sent to other replicas in a NEW-VIEW message together with a digest from each accepted VIEW-CHANGE message used to define them
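The assignment rule for the new view can be sketched as a single pass over the window. The hard part of the real protocol, deciding from the P and Q sets which request counts as prepared for each n, is deliberately not modelled: `prepared` is assumed to be given, and `NULL_OP` and the function name are illustrative.

```python
NULL_OP = "null"


def new_view_assignments(h, L, prepared):
    """Assign a request (or a null operation) to every n with h < n <= h + L.

    prepared: dict mapping n to the request (or digest) found prepared for n
    in the accepted VIEW-CHANGE messages; how it is derived is not shown here.
    """
    return {n: prepared.get(n, NULL_OP) for n in range(h + 1, h + L + 1)}
```

Backups can rerun the same function over the same VIEW-CHANGE messages to verify the assignments carried by the NEW-VIEW message.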


PBFT – View Change IV

•  each backup replica that receives the NEW-VIEW obtains the VIEW-CHANGE messages used to build it
   –  it may have them already or it can fetch them from other replicas

•  with these messages, each <message, sequence number> assignment contained in the NEW-VIEW message can be verified (with the same procedure the primary used to choose these assignments)
   –  if some assignment is invalid, a VIEW-CHANGE for v+2 is sent to all replicas
   –  otherwise, a PREPARE message is sent for each assignment and the protocol resumes its normal behavior, as if the assignment was a PRE-PREPARE message

[Figure callout: What happens now?]


Why does PBFT work? (safety)

•  A Byzantine primary cannot “create” its own requests because:
   –  backup replicas only process authenticated requests from clients

•  A Byzantine primary cannot give the same sequence number to two different messages (violating the agreement property) because:
   –  a correct backup sends a PREPARE message only for the first request it receives for a certain sequence number n
   –  a correct backup replica sends a COMMIT message only if it receives PREPARE messages from 2f other replicas
   –  there cannot be two different quorums of 2f+1 (out of 3f+1) replicas that send PREPARE messages for the same n and different requests
      »  These (f-dissemination) quorums would overlap on at least f+1 replicas
      »  Thus, at least one correct replica would have sent contradictory messages
      »  Which is impossible!

•  Consequently, if a correct replica executes a request associated with a sequence number, all other correct replicas that execute this request do so with the same number


Why does PBFT work? (liveness)

•  A Byzantine primary can decide not to send PRE-PREPARE messages for some requests or to skip order numbers:
   –  however, when a backup replica receives a request from a client it starts a timer, which is stopped when the request is executed
   –  if the timer expires, the backup triggers the view change protocol
   –  when enough backup replicas trigger a view change, a new primary is defined and a new view is installed

•  When a timer expires, the expiration time is doubled

•  Liveness can be ensured as long as eventually a timeout value will suffice to finish the protocol execution with a correct primary
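The effect of doubling the timeout can be illustrated with a small calculation: if finishing the protocol under a correct primary needs some unknown but bounded time, a doubling timeout eventually exceeds it. The function below is a sketch under that assumption and ignores the fact that some of the intermediate primaries may themselves be faulty.

```python
def view_changes_until_sufficient(initial_timeout, needed):
    """Count how many times the timeout doubles before it reaches `needed`.

    Both arguments are in the same time unit; initial_timeout must be positive.
    """
    if initial_timeout <= 0:
        raise ValueError("initial_timeout must be positive")
    changes = 0
    timeout = initial_timeout
    while timeout < needed:
        timeout *= 2  # the expiration time is doubled on each expiry
        changes += 1
    return changes
```

Starting from a 1-second timeout, a protocol run that needs 5 seconds is covered after three doublings (1, 2, 4, 8).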


Optimizations

•  Rationale for optimizations: “Faults, concurrency and asynchrony are very rare”

•  Is it true for intrusion tolerance?

•  Anyway, one of the key contributions of PBFT is its set of optimizations


Optimizations

•  MAC vectors instead of digital signatures
   –  Main reason for the high performance of the protocol
   –  MAC vectors are weaker than digital signatures, so the former cannot always be used to substitute the latter

•  Digest replies
   –  Instead of all replicas sending the whole reply for a request, the client can choose just one replica to send it
   –  The others will only send a digest of the reply to allow voting
   –  If the received reply is wrong, the client can ask for a full reply from other replicas

•  Batching
   –  Instead of running the agreement protocol for every request to be executed, it can be run for sets of requests (batches)
   –  This technique dramatically increases the protocol throughput


Optimizations

•  Read-only requests
   –  Read-only requests generally do not require ordering because they don’t change the system’s state
   –  All replicas can immediately reply to the client, and the client can finish the read if there are 2f+1 matching replies (instead of f+1 – why?)
   –  Otherwise (due to faulty replicas or concurrency), the client retries the request using the normal protocol

[Figure callout: No 2f+1 matching replies!]


Optimizations

•  Tentative execution
   –  Replicas can tentatively execute a request when it is prepared and they have committed all requests with lower sequence numbers
   –  This reduces the protocol end-to-end latency from 5 to 4 communication steps
   –  The client needs to wait for 2f+1 matching replies from different replicas to be sure that the execution order will eventually commit
   –  If the client doesn’t receive these replies and its timer expires, it resends the request without asking for tentative execution


Experimental Setup: Unless stated otherwise, all experiments ran with three (CFT) and four (BFT) replicas hosted in separate machines. Up to 1600 client processes were distributed uniformly across another four machines.

Clients and replicas were deployed in JRE 1.7.0_21 on Ubuntu Linux 10.04, hosted in Dell PowerEdge R410 servers. Each machine has 32 GB of memory and two quad-core 2.27 GHz Intel Xeon E5520 processors with hyper-threading, i.e., supporting 16 hardware threads. All machines communicate through an isolated gigabit Ethernet network.

Micro-benchmarks: We start by reporting the results of a set of micro-benchmarks commonly used to evaluate state machine replication systems. Such benchmarks consist of an “empty” service implemented with BFT-SMART to perform raw throughput calculations at the server side and latency measurements at the client side. Throughput measurements were gathered from the leader replica, while latency results were gathered from one of the clients (always the same).

Figure 4 presents results for both BFT and CFT setups of BFT-SMART considering different request/reply sizes: 0/0, 100/100, 1024/1024 and 4096/4096 bytes. In the figure it is possible to see that the CFT protocol consistently outperforms its BFT counterpart. This happens due to the smaller number of messages exchanged in the CFT setup, which results in less work per client request for the replicas. Furthermore, as expected, as the payload size increases, BFT-SMART overall performance decreases.

Figure 4. Latency vs. throughput configured for f = 1. [Plot series: Byz-0B, Crash-0B, Byz-100B, Crash-100B, Byz-1kB, Crash-1kB, Byz-4kB, Crash-4kB.]

Fault-scalability: Our next experiment considers the impact of the number of replicas on the throughput of the system with different payloads. Figure 5 reports the results.

For all configurations, the results show that the performance of BFT-SMART degrades gracefully as f increases, both for CFT and BFT setups. This happens because: (1) it exploits the many cores of the replicas (which our machines have plenty) to calculate MACs; (2) only the n − 1 PROPOSE messages of the consensus protocol contain batches of messages (the other 2n(n − 1) messages exchanged during consensus only contain the hash of the batches); and (3) we avoid the use of IP multicast, which is known to cause problems with many senders (e.g., multicast storms) [17].

It is also interesting to see that, with relatively big requests (1024 bytes), the difference between BFT and CFT tends to be very small, regardless of the number of tolerated faults.

Figure 5. Throughput of BFT-SMART (Kops/s) for CFT (n = 2f + 1) and BFT (n = 3f + 1) for different workloads and f = 1...3. [Panels: (a) 0/0, (b) 0/1024, (c) 1024/0, (d) 1024/1024.]

Signatures and Multi-core Awareness: Our next experiment considers the performance of the system when client signatures are enabled. In this setup, the clients sign every request to the replicas, which first verify its authenticity before ordering it. There are two fundamental service-throughput overheads associated with 1024-bit RSA signatures. First, the messages are 112 bytes bigger than when SHA-1 MACs are used. Second, the replicas need to verify the signatures, which is a relatively costly computational operation.

Figure 6 shows the throughput of BFT-SMART with different numbers of hardware threads being used to verify signatures. As the results show, the architecture of BFT-SMART exploits the existence of multiple cores with hyper-threading. This happens because the signatures are verified by the Netty thread pool, which uses a number of threads proportional to the number of hardware threads in the machine (see Figure 3).

Figure 6. Throughput of BFT-SMART (in Kops/sec) using 1024-bit RSA signatures for 0/0 payload and n = 4.

Comparison with others: We compared BFT-SMART against some representative SMR systems considering the 0/0 benchmark. More precisely, we compared BFT-SMART (both in BFT and CFT setups) with PBFT [2], UpRight [4] and JPaxos [16] (a modern multi-core CFT replication

What about Performance? (how close BFT SMR is from)

•  PBFT numbers (on the papers) are outdated…

•  Here’s a modern comparison using BFT-SMaRt (2014)


Things to Remember

•  State machine replication requires
   –  Same initial state
   –  Replica determinism
   –  Agreement on the sequence of operations to be executed

•  Primary-backup replication is difficult to use for intrusion tolerance

•  PBFT: first practical Byzantine fault-tolerant protocol
   –  Requires 3f+1 replicas and an eventually synchronous system model

•  Key optimizations:
   –  MAC vectors avoid the use of public-key signatures
   –  Read-only requests do not require consensus
   –  Request batching improves throughput


(some) Bibliography

•  F. Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4), 1990.

•  M. Castro, B. Liskov. Practical Byzantine Fault Tolerance. OSDI 1999.

•  M. Castro, B. Liskov. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Transactions on Computer Systems, 20(4), 2002.

•  A. Bessani, J. Sousa, E. Alchieri. State Machine Replication for the Masses with BFT-SMaRt. DSN 2014.