![Page 1: Reaching reliable agreement in an unreliable worldhh360.user.srcf.net/slides/consensus_lecture.pdf · Reaching reliable agreement in an unreliable world Heidi Howard heidi.howard@cl.cam.ac.uk](https://reader035.vdocuments.net/reader035/viewer/2022081600/602296621f8eb736204fae80/html5/thumbnails/1.jpg)
Reaching reliable agreement in an unreliable world
Heidi Howard
heidi.howard@cl.cam.ac.uk @heidiann
Research Students Lecture Series, 13th October 2015
Introducing Alice
Alice is a new graduate from Cambridge, off to the world of work.
She joins a cool new start-up, where she is responsible for a key-value store.
Single Server System
[Diagram, four frames: one server holds the key-value state A=7, B=2 and is contacted by Client 1, Client 2 and Client 3.]
Frame 1: the initial state, A=7 and B=2.
Frame 2: a client asks "A?" and the server replies 7.
Frame 3: a client sends "B=3", the server replies OK and B becomes 3.
Frame 4: a client asks "B?" and the server replies 3.
Single Server System
Pros
• linearizable semantics
• durability with write-ahead logging
• easy to deploy
• low latency (1 RTT in common case)
• partition tolerance with retransmission & command cache
Cons
• system unavailable if server fails
• throughput limited to one server
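Two of the listed pros, write-ahead logging and the command cache, can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the lecture: the class name `SingleServerKV`, the JSON log format and the `(client, request_id)` cache key are all hypothetical.

```python
import json
import os


class SingleServerKV:
    """Toy key-value server: write-ahead log plus a command cache."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self.replies = {}  # (client, request_id) -> reply already sent
        # Recovery: replay every logged write in its original order.
        if os.path.exists(log_path):
            with open(log_path) as log:
                for line in log:
                    self._apply(json.loads(line))

    def _apply(self, cmd):
        self.state[cmd["key"]] = cmd["value"]
        self.replies[(cmd["client"], cmd["request_id"])] = "OK"
        return "OK"

    def set(self, client, request_id, key, value):
        # A retransmitted command gets the cached reply and is not re-applied.
        if (client, request_id) in self.replies:
            return self.replies[(client, request_id)]
        cmd = {"client": client, "request_id": request_id,
               "key": key, "value": value}
        # Write-ahead: the command reaches disk before the state changes.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(cmd) + "\n")
            log.flush()
            os.fsync(log.fileno())
        return self._apply(cmd)

    def get(self, key):
        return self.state.get(key)
```

Restarting the server with the same log path rebuilds both the state and the command cache, so a client that retries after a crash still gets a single application of its write.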
Backups
aka primary-backup replication
[Diagram, three frames: Client 1 and Client 2 talk to a Primary, which is backed by three backups (each labelled "Backup 1" on the slide). Every replica starts with A=7, B=2.]
Frame 1: a client asks "A?" and the primary replies 7.
Frame 2: a client sends "B=1"; the primary now holds A=7, B=1 while the backups still hold A=7, B=2.
Frame 3: the primary and all three backups hold A=7, B=1, and an OK is returned at each hop.
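The three frames above can be sketched as follows. This is a minimal Python illustration, assuming in-process method calls in place of network messages and assuming no node ever fails; the class and method names are hypothetical.

```python
class Backup:
    def __init__(self):
        self.state = {"A": 7, "B": 2}

    def replicate(self, key, value):
        self.state[key] = value
        return "OK"


class Primary:
    def __init__(self, backups):
        self.state = {"A": 7, "B": 2}
        self.backups = backups

    def read(self, key):
        # Reads are answered from the primary's own copy.
        return self.state[key]

    def write(self, key, value):
        self.state[key] = value
        # Forward the write to every backup and collect the acknowledgements.
        acks = [backup.replicate(key, value) for backup in self.backups]
        if all(ack == "OK" for ack in acks):
            return "OK"  # only now is the client told the write succeeded
        raise RuntimeError("a backup did not acknowledge the write")
```

The sketch quietly relies on every backup seeing writes in the same order as the primary, which is exactly the assumption the next slide calls out.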
Big Gotcha
We are assuming totally ordered broadcast.
Totally Ordered Broadcast
(aka atomic broadcast) The guarantee that messages are received reliably and in the same order by all nodes.
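A small Python sketch of why the ordering guarantee matters: two backups that receive the same writes in different orders end up disagreeing. Stamping messages with sequence numbers, as done here, is one common way to impose an order; it is an assumption of this sketch rather than the lecture's method, and it does nothing about lost messages or a failed stamper.

```python
def apply_in_order(writes):
    """Apply (key, value) writes to a fresh copy of the slide's initial state."""
    state = {"A": 7, "B": 2}
    for key, value in writes:
        state[key] = value
    return state


def deliver(stamped_messages):
    """Deliver messages by sequence number, whatever order they arrived in."""
    return [msg for _, msg in sorted(stamped_messages, key=lambda pair: pair[0])]


# Without ordering: two backups see the same writes in different orders.
backup_1 = apply_in_order([("B", 1), ("B", 3)])
backup_2 = apply_in_order([("B", 3), ("B", 1)])

# With sequence numbers: arrival order no longer matters.
arrived_1 = [(0, ("B", 1)), (1, ("B", 3))]
arrived_2 = [(1, ("B", 3)), (0, ("B", 1))]
ordered_1 = apply_in_order(deliver(arrived_1))
ordered_2 = apply_in_order(deliver(arrived_2))
```

Here `backup_1` and `backup_2` disagree on B, while `ordered_1` and `ordered_2` are identical.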
Requirements
• Scalability - High-throughput processing of operations.
• Latency - Low-latency commit of operations as perceived by the client.
• Fault-tolerance - Availability in the face of machine and network failures.
• Linearizable semantics - Operate as if a single-server system.
Doing the Impossible
CAP Theorem
Pick 2 of 3:
• Consistency
• Availability
• Partition tolerance
Proposed by Brewer in 1998, still debated and regarded as misleading. [Brewer’12] [Kleppmann’15]
Eric Brewer
FLP Impossibility
It is impossible to guarantee consensus when messages may be delayed, if even one node may fail. [JACM’85]
Consensus is impossible [PODC’89]
Nancy Lynch
Aside from Simon PJ
Don’t drag your reader or listener through your blood-stained path.
Simon Peyton Jones
Paxos
Paxos is at the foundation of (almost) all distributed consensus protocols.
It is a general approach of using two phases and majority quorums.
It takes much more to construct a complete fault-tolerant distributed system.
Leslie Lamport
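The reason majority quorums are useful is that any two of them share at least one node, so a later quorum always includes some node that saw the earlier decision. A brute-force Python check of that property follows; the function names are illustrative only.

```python
from itertools import combinations


def majority(n):
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can fail while a majority is still reachable."""
    return n - majority(n)


def majorities_always_intersect(n):
    """Check every pair of majority-sized node sets for a shared member."""
    quorums = [set(q) for q in combinations(range(n), majority(n))]
    return all(q1 & q2 for q1 in quorums for q2 in quorums)
```

For example, with five nodes a majority is three, any two such sets overlap, and two failures can be tolerated.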
Beyond Paxos
• Replication techniques - mapping a consensus algorithm to a fault-tolerant system
• Classes of operations - handling of reads vs writes, operating on stale data, soft vs hard state
• Leadership - fixed, variable or leaderless, Multi-Paxos, how to elect a leader, how to discover a leader
• Failure detectors - heartbeats & timers
• Dynamic membership, log compaction & GC, sharding, batching etc.
Consensus is hard
A raft in the sea of confusion
Case Study: Raft
Raft is the understandable replication algorithm.
Provides linearizable client semantics, 1 RTT best-case latency for clients.
A complete(ish) architecture for making our application fault-tolerant.
Uses SMR and Paxos.
State Machine Replication
[Diagram, three frames: a client and three servers, each server holding A=7, B=2. The client sends the command B=3; in the final frame all three servers show B=3.]
Leadership
[State diagram with three roles: Follower, Candidate and Leader. A node enters Follower on startup or restart. Follower → Candidate on timeout. Candidate → Candidate on timeout. Candidate → Leader on winning. Candidate → Follower and Leader → Follower on step down.]
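The state diagram above can be written down as a transition table. This is a minimal Python sketch; the event names `timeout`, `win` and `step_down` are taken from the arrow labels, and treating any other combination as "stay in the current role" is an assumption.

```python
TRANSITIONS = {
    ("follower", "timeout"): "candidate",
    ("candidate", "timeout"): "candidate",  # start a fresh election
    ("candidate", "win"): "leader",
    ("candidate", "step_down"): "follower",
    ("leader", "step_down"): "follower",
}


def next_role(role, event):
    # Events that have no arrow in the diagram leave the role unchanged.
    return TRANSITIONS.get((role, event), role)


def run(events, role="follower"):
    """Feed a sequence of events to a node that starts as a follower."""
    for event in events:
        role = next_role(role, event)
    return role
```

For instance, `run(["timeout", "win"])` walks Follower → Candidate → Leader.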
Ordering
Each node stores its own perspective on a value known as the term.
Each message includes the sender’s term and this is checked by the recipient.
The term orders periods of leadership to aid in avoiding conflict.
Each node has one vote per term, thus there is at most one leader per term.
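The term and one-vote-per-term rules can be sketched in Python as below. This is a simplification under stated assumptions: the `Node` class and its method names are hypothetical, and the sketch leaves out the further check Raft makes on the candidate's log before granting a vote.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.term = 0
        self.voted_for = None

    def start_election(self):
        # A candidate moves to a new term and votes for itself.
        self.term += 1
        self.voted_for = self.node_id
        return self.term

    def on_vote_request(self, candidate_id, candidate_term):
        if candidate_term < self.term:
            return False  # stale request from an older term
        if candidate_term > self.term:
            # Newer term seen: adopt it and forget the old vote.
            self.term = candidate_term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False  # already voted for someone else in this term
```

Because a node grants at most one vote per term, two candidates cannot both collect a majority in the same term.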
Ordering
[Diagram: three nodes (ID 1, ID 2 and ID 3); the labels on the diagram are the term numbers 1 and 2.]
[Animation: leader election among five nodes (ID 1 to 5), each showing its current term and vote.]
Frame 1: all five nodes are in term 0 and have not voted (Vote: n).
Frame 2: node 4 moves to term 1, votes for itself and sends "Vote for me in term 1!".
Frame 3: nodes 1, 2, 3 and 5 move to term 1 and vote for node 4 ("Ok!").
Frame 4: the leader fails. Node 4 stays at term 1 from here on.
Frame 5: node 5 moves to term 2, votes for itself and sends "Vote for me in term 2!". Node 1 moves to term 2 and votes for node 5; nodes 2 and 3 are still in term 1.
Frame 6: node 2 also moves to term 2, votes for itself and sends "Vote for me in term 2!". Node 3 moves to term 2 and votes for node 2.
Frame 7: the votes in term 2 are split, two for node 5 and two for node 2, so neither has a majority of the five nodes.
Frame 8: node 2 moves to term 3 and sends "Vote for me in term 3!". Nodes 1, 3, 4 and 5 are in term 3 with their vote given to node 2.
Frame 9: all five nodes are in term 3 and every vote is for node 2.
Replication
Each node has a log of client commands and an index into this log representing which commands have been committed.
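One way to picture the commit index: an entry counts as committed once a majority of nodes store it. The Python sketch below computes that index from how much of the log each node is known to hold. The name `match_index` follows common Raft usage, but the function itself is an illustration, and it omits Raft's additional rule about only counting entries from the leader's current term.

```python
def commit_index(match_index):
    """Highest log index stored on a majority of nodes.

    match_index[i] is the number of log entries node i is known to hold,
    with the leader included in the list.
    """
    ranked = sorted(match_index, reverse=True)
    # The value at position n // 2 is held by at least n // 2 + 1 nodes.
    return ranked[len(match_index) // 2]
```

With three nodes holding 2, 2 and 1 entries the commit index is 2, since the second entry is already on two of the three nodes.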
Simple Replication
[Animation: three nodes (ID 1, ID 2 and ID 3), each with a log and a copy of the key-value state.]
Frame 1: every log contains A=4 and every state is A=4, B=2.
Frame 2: a new command, B=7, arrives.
Frame 3: B=7 has been appended to all three logs.
Frame 4: one node has applied B=7, so its state is A=4, B=7; the other two still show B=2.
Frame 5: all three states are A=4, B=7.
Frame 6: a read, B?, arrives.
Frames 7-8: B? has been appended to all three logs.
Catchup
[Animation: nodes ID 1, ID 2 and ID 3. Two nodes have the log A=4, B=7, B? and the state A=4, B=7. The third has fallen behind, with only A=4 in its log and the state A=4, B=2.]
Frame 1: a new command, A=6, arrives.
Frames 2-3: A=6 is appended to the up-to-date logs and offered to the lagging node, whose response is shown as :(
Frame 4: the response from the lagging node changes to :)
Frame 5: the entries B=7, B? and A=6 are sent to the lagging node.
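The catch-up behaviour can be sketched as a follower that accepts new entries only when its log matches the sender's at the preceding position, and a leader that steps back one entry at a time until it does. This is a simplified Python illustration in the spirit of Raft's log matching, not the full protocol: entries are `(term, command)` tuples, indices are 1-based, the leader's log is assumed non-empty, and the unconditional truncation would need more care with delayed or reordered messages.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower side: accept entries only if the log matches at prev_index."""
    if prev_index > len(log):
        return False  # the follower is missing earlier entries
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False  # the entry before the new ones differs
    del log[prev_index:]  # drop anything after the matching point
    log.extend(entries)
    return True


def catch_up(leader_log, follower_log):
    """Leader side: back up until the follower accepts; return attempt count."""
    next_index = len(leader_log)  # optimistic: offer only the newest entry
    attempts = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        attempts += 1
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return attempts
        next_index -= 1  # rejected: try again from one entry earlier
```

Run on the logs from the animation (leader holding A=4, B=7, B?, A=6 and a follower holding only A=4), the first two attempts are rejected and the third delivers B=7, B? and A=6 together.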
Evaluation
• The leader is a serious bottleneck -> limited scalability
• Can only handle the failure of a minority of nodes
• Some rare network partitions leave the protocol in livelock
Beyond Raft
Case Study: Tango
Tango is designed to be a scalable replication protocol.
It’s a variant of chain replication + Paxos.
It is leaderless and pushes more work onto clients.
Simple Replication
[Animation: Client 1 (local view A=7, B=2), Client 2 (local view A=4, B=2), a Sequencer, and Server 1, Server 2 and Server 3. Each server stores A=4 at log position 0.]
Frame 1: the sequencer shows Next: 1. A client wants to write B=5.
Frame 2: the client asks the sequencer "Next?" and receives 1; the sequencer now shows Next: 2.
Frame 3: the client sends "B=5 @ 1"; one server stores B=5 at position 1 and replies OK.
Frame 4: "B=5 @ 1" is stored on a second server, which replies OK.
Frame 5: "B=5 @ 1" is stored on the third server, which replies OK.
Frame 6: all three servers hold B=5 at position 1 and Client 2's view is A=4, B=5.
Fast Read
[Diagram: the same setup, with the sequencer at Next: 2 and every server holding A=4 at position 0 and B=5 at position 1. For a read "B?", a client asks the sequencer "Check?" and receives 1.]
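The write and fast-read frames above can be approximated by the following Python sketch. It is a reading of the slides rather than Tango's actual interface: the class names are hypothetical, the client is assumed to write to the servers one after another in chain order, a read is assumed to replay unseen log positions from the last server in the chain, and nothing here handles failures or positions that were handed out but never written.

```python
class Sequencer:
    def __init__(self):
        self.next = 0

    def take(self):
        """Hand out the next free log position."""
        position = self.next
        self.next += 1
        return position

    def check(self):
        """Last position handed out so far, or -1 if none."""
        return self.next - 1


class LogServer:
    def __init__(self):
        self.slots = {}

    def write(self, position, command):
        if position in self.slots:
            raise ValueError("position already written")
        self.slots[position] = command
        return "OK"

    def read(self, position):
        return self.slots[position]


class Client:
    def __init__(self, sequencer, chain):
        self.sequencer = sequencer
        self.chain = chain
        self.view = {}
        self.applied = -1  # last log position folded into self.view

    def write(self, key, value):
        position = self.sequencer.take()
        for server in self.chain:  # head first, tail last
            server.write(position, (key, value))
        return position

    def read(self, key):
        tail = self.sequencer.check()
        # Replay any log positions this client has not yet seen.
        for position in range(self.applied + 1, tail + 1):
            k, v = self.chain[-1].read(position)
            self.view[k] = v
        self.applied = max(self.applied, tail)
        return self.view.get(key)
```

The sequencer only hands out numbers, so it does little work per operation; the clients carry the replication traffic themselves.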
Handling Failures
The sequencer is soft state, which can be reconstructed by querying the head server.
Clients failing between receiving a token and the first read leave gaps in the log. Clients can mark these as empty, and the space (though not the address) is reused.
Clients may fail before completing a write. The next client can fill in and complete the operation.
Server failures are detected by clients, who initiate a membership change, using term numbers.
Evaluation
Tango is scalable; the leader is no longer the bottleneck.
Dynamic membership and sharding come for free with the design.
High latency of chain replication.
Next Steps
wait… we’re not finished yet!
Requirements
• Scalability - High-throughput processing of operations.
• Latency - Low-latency commit of operations as perceived by the client.
• Fault-tolerance - Availability in the face of machine and network failures.
• Linearizable semantics - Operate as if a single-server system.
Many more examples
• Raft [ATC’14] - Good starting point, understandable algorithm from SMR + Multi-Paxos variant
• VRR [MIT-TR’12] - Raft with round-robin leadership & more distributed load
• Tango [SOSP’13] - Scalable algorithm for f+1 nodes, uses CR + Multi-Paxos variant
• Zookeeper [ATC'10] - Primary backup replication + atomic broadcast protocol (Zab [DSN’11])
• EPaxos [SOSP’13] - leaderless Paxos variant for WANs
Can we do even better?
• Self-scaling replication - adapting resources to maintain resilience level.
• Geo replication - strong consistency across wide-area links
• Auto configuration - adapting timeouts and configuration as the network changes
• Integrated with unikernels, virtualisation, containers and other such deployment tech
Evaluation is hard
• few common evaluation metrics.
• often only one experiment setup is used.
• different workloads
• evaluation to demonstrate protocol strength
Lessons Learned
• Reaching consensus in distributed systems is doable
• Exploit domain knowledge
• Raft is a good starting point but we can do much better!
Questions?