there is more consensus in egalitarian parliaments€¦ · iulian moraru, david andersen, michael...
TRANSCRIPT
There Is More Consensus in Egalitarian Parliaments
Iulian Moraru, David Andersen, Michael Kaminsky
Carnegie Mellon University Intel Labs
Friday, November 8, 13
Faulttolerance Redundancy⇒
Friday, November 8, 13
State Machine Replication
3
Friday, November 8, 13
State Machine Replication
3
Execute the same commands in the same order
Friday, November 8, 13
State Machine Replication
3
Execute the same commands in the same order
Friday, November 8, 13
State Machine Replication
3
Paxos
Execute the same commands in the same order
Friday, November 8, 13
State Machine Replication
3
Paxos
• No external failure detector required
Execute the same commands in the same order
Friday, November 8, 13
State Machine Replication
3
Paxos
• No external failure detector required
• Fast fail-over (high availability)
Execute the same commands in the same order
Friday, November 8, 13
Paxos is important in clustersChubby, Boxwood, SMARTER, ZooKeeper
• Synchronization• Resource discovery• Data replication
4
High throughputHigh availability
Friday, November 8, 13
Paxos is important in the wide-areaSpanner, Megastore
• Bring data closer to clients
• Tolerate datacenter outages
5
Low latency
Friday, November 8, 13
Paxos is important in the wide-areaSpanner, Megastore
• Bring data closer to clients
• Tolerate datacenter outages
5
Low latency
Friday, November 8, 13
Paxos is important in the wide-areaSpanner, Megastore
• Bring data closer to clients
• Tolerate datacenter outages
5
Low latency
85ms130ms
Friday, November 8, 13
Paxos overview
6
Friday, November 8, 13
Paxos overview• Agreement protocol
6
Friday, November 8, 13
Paxos overview• Agreement protocol
• Tolerates F failures with 2F+1 replicas (optimal)
• No external failure detector required
6
Friday, November 8, 13
Paxos overview• Agreement protocol
• Tolerates F failures with 2F+1 replicas (optimal)
• No external failure detector required
• Replicas can fail by crashing (non-Byzantine)
6
Friday, November 8, 13
Paxos overview• Agreement protocol
• Tolerates F failures with 2F+1 replicas (optimal)
• No external failure detector required
• Replicas can fail by crashing (non-Byzantine)
• Asynchronous communication
6
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
A BC
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
A BC
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
A BCvote C
vote B
vote A
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
A BCvote C
vote B
vote A
ack B ack B
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
A
B
C
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
A
B
C
Friday, November 8, 13
1 2 3 4 ...
Using Paxos to order commands
7
AB
C
Friday, November 8, 13
1 2 3 4 ...
DUsing Paxos to order commands
7
AB
C
Friday, November 8, 13
1 2 3 4 ...
DUsing Paxos to order commands
7
AB
C
Friday, November 8, 13
1 2 3 4 ...D
Using Paxos to order commands
7
AB C
Friday, November 8, 13
1 2 3 4 ...D
Using Paxos to order commands
7
AB C
• Choose commands independently for each slot
Friday, November 8, 13
1 2 3 4 ...D
Using Paxos to order commands
7
AB C
• Choose commands independently for each slot
• At least 2 RTTs per slot:
Friday, November 8, 13
1 2 3 4 ...D
Using Paxos to order commands
7
AB C
• Choose commands independently for each slot
• At least 2 RTTs per slot:1. Take ownership of a slot
Friday, November 8, 13
1 2 3 4 ...D
Using Paxos to order commands
7
AB C
• Choose commands independently for each slot
• At least 2 RTTs per slot:1. Take ownership of a slot
2. Propose command
Friday, November 8, 13
1 2 3 4 ...
Multi-Paxos
8
Friday, November 8, 13
1 2 3 4 ...
Multi-Paxos
8
Friday, November 8, 13
1 2 3 4 ...
DMulti-Paxos
8
ABC
Friday, November 8, 13
1 2 3 4 ...
DMulti-Paxos
8
A
BC
Friday, November 8, 13
1 2 3 4 ...
DMulti-Paxos
8
A B
C
Friday, November 8, 13
1 2 3 4 ...
DMulti-Paxos
8
A B C
Friday, November 8, 13
1 2 3 4 ...D
Multi-Paxos
8
A B C
Friday, November 8, 13
1 2 3 4 ...D
Multi-Paxos
8
A B C
• 1 RTT to commit
Friday, November 8, 13
1 2 3 4 ...D
Multi-Paxos
8
A B C
• 1 RTT to commit
• Bottleneck for performance and availabilty
Friday, November 8, 13
Can we have it all?
9
Friday, November 8, 13
Can we have it all?
9
• High throughput, low latency
Friday, November 8, 13
Can we have it all?
9
• High throughput, low latency
• Constant availability
Friday, November 8, 13
Can we have it all?
9
• High throughput, low latency
• Constant availability
• Distribute load evenly across all replicas
Friday, November 8, 13
Can we have it all?
9
• High throughput, low latency
• Constant availability
• Distribute load evenly across all replicas
• Use fastest replicas
Friday, November 8, 13
Can we have it all?
9
• High throughput, low latency
• Constant availability
• Distribute load evenly across all replicas
• Use fastest replicas
• Use closest (lowest latency) replicas
Friday, November 8, 13
Can we have it all?
9
Paxos• High throughput, low latency
• Constant availability
• Distribute load evenly across all replicas
• Use fastest replicas
• Use closest (lowest latency) replicas
Friday, November 8, 13
Can we have it all?
9
PaxosMulti-• High throughput, low latency
• Constant availability
• Distribute load evenly across all replicas
• Use fastest replicas
• Use closest (lowest latency) replicas
Friday, November 8, 13
Can we have it all?
9
PaxosMulti-
Egalitarian Paxos(EPaxos)
• High throughput, low latency
• Constant availability
• Distribute load evenly across all replicas
• Use fastest replicas
• Use closest (lowest latency) replicas
Friday, November 8, 13
EPaxos is all about ordering
10
Friday, November 8, 13
EPaxos is all about orderingPrevious strategies:
10
Friday, November 8, 13
EPaxos is all about orderingPrevious strategies:
10
...
Contend for slots
Friday, November 8, 13
EPaxos is all about orderingPrevious strategies:
10
...
Contend for slots Paxos
Friday, November 8, 13
EPaxos is all about orderingPrevious strategies:
10
...
Contend for slots
...
One replicadecides
Paxos
Friday, November 8, 13
EPaxos is all about orderingPrevious strategies:
10
...
Contend for slots
...
One replicadecides
Paxos
Multi-Paxos, Fast Paxos,Generalized Paxos
Friday, November 8, 13
EPaxos is all about orderingPrevious strategies:
10
...
Contend for slots
...
One replicadecides
Paxos
Multi-Paxos, Fast Paxos,Generalized Paxos
...
Take turnsround-robin
Friday, November 8, 13
EPaxos is all about orderingPrevious strategies:
10
...
Contend for slots
...
One replicadecides
Paxos
Multi-Paxos, Fast Paxos,Generalized Paxos
Mencius
...
Take turnsround-robin
Friday, November 8, 13
EPaxos intuition
11
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...
11
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...
A
11
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...A
C B
11
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
D
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
D
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
D
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
D
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
D
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
D
After commit @ each replica
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
D
A B C DAfter commit @ each replica
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
• Load balance (every replica is a leader)
D
A B C DAfter commit @ each replica
Friday, November 8, 13
EPaxos intuition
1 2 3 4 ...ACB
11
• Load balance (every replica is a leader)
• EPaxos can choose any quorum for each command
D
A B C DAfter commit @ each replica
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
Commit A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
Commit A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
Commit A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅ B ➝ ∅
Commit A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
B ➝ ∅
Commit A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅
Commit A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅ ACK
Commit A ➝ ∅
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅ ACK
Commit A ➝ ∅
Commit B ➝ {A}
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅ ACK
Commit A ➝ ∅
Commit B ➝ {A}
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅ ACK
Commit A ➝ ∅
Commit B ➝ {A}
PreAccept(C)
C ➝ {A}
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅ ACK
Commit A ➝ ∅
Commit B ➝ {A}
PreAccept(C)
C ➝ {A} C ➝ {A, B}
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅ ACK
Commit A ➝ ∅
Commit B ➝ {A}
PreAccept(C)
C ➝ {A} C ➝ {A, B}
Commit C ➝ {A, B}
Friday, November 8, 13
EPaxos commit protocol
12
R1
R2
R3
R4
R5
PreAccept(A)
A ➝ ∅ A ➝ ∅
PreAccept(B)
B ➝ ∅
B ➝ {A}
Accept(B)
B ➝ {A}B ➝ ∅ ACK
Commit A ➝ ∅
Commit B ➝ {A}
PreAccept(C)
C ➝ {A} C ➝ {A, B}
Commit C ➝ {A, B}
Friday, November 8, 13
Order only interfering commands• 1 RTT
• Non-concurrent commands
• OR non-interfering commands
• 2 RTTs
• Concurrent AND interfering
13
Friday, November 8, 13
Interference is application-specific
14
Friday, November 8, 13
Interference is application-specificKV store
• Infer from operation key
14
Friday, November 8, 13
Interference is application-specificKV store
• Infer from operation key
Google App Engine
• Programmer-specified
14
Friday, November 8, 13
Interference is application-specificKV store
• Infer from operation key
Google App Engine
• Programmer-specified
Relational databases
• Most transactions are simple, can be analyzed
• Few remaining transactions interfere w/ everything
14
Friday, November 8, 13
Execution
15
A
B
D
C
E
Friday, November 8, 13
Execution
15
A
B
D
C
E
strongly-connected component
Friday, November 8, 13
Execution
15
A
B
D
C
E
strongly-connected component
Friday, November 8, 13
Execution
15
A
B
D
C
E
strongly-connected component
Friday, November 8, 13
Execution
15
A
B
D
C
E
E
D
B, C, A
1
2
3
strongly-connected component
Friday, November 8, 13
Execution
15
A
B
D
C
E
E
D
B, C, A
1
2
3
strongly-connected component
Approximate sequence # order(Lamport clock)
Friday, November 8, 13
EPaxos properties
16
Friday, November 8, 13
EPaxos propertiesLinearizability: If A~B, and A committed before B proposed
then A will be executed before B.
16
Friday, November 8, 13
EPaxos propertiesLinearizability: If A~B, and A committed before B proposed
then A will be executed before B.
16
Fast-path quorum: F +⎡F / 2⎤• Optimal for 3 and 5 replicas
• Better than Fast / Generalized Paxos by 1
Friday, November 8, 13
EPaxos propertiesLinearizability: If A~B, and A committed before B proposed
then A will be executed before B.
16
Fast-path quorum: F +⎡F / 2⎤• Optimal for 3 and 5 replicas
• Better than Fast / Generalized Paxos by 1
Friday, November 8, 13
EPaxos propertiesLinearizability: If A~B, and A committed before B proposed
then A will be executed before B.
16
Fast-path quorum: F +⎡F / 2⎤• Optimal for 3 and 5 replicas
• Better than Fast / Generalized Paxos by 1
Friday, November 8, 13
EPaxos propertiesLinearizability: If A~B, and A committed before B proposed
then A will be executed before B.
16
Fast-path quorum: F +⎡F / 2⎤• Optimal for 3 and 5 replicas
• Better than Fast / Generalized Paxos by 1
Friday, November 8, 13
EPaxos propertiesLinearizability: If A~B, and A committed before B proposed
then A will be executed before B.
16
Fast-path quorum: F +⎡F / 2⎤• Optimal for 3 and 5 replicas
• Better than Fast / Generalized Paxos by 1
Friday, November 8, 13
Results
17
Friday, November 8, 13
Optimal wide-area commit latency
18
Friday, November 8, 13
Optimal wide-area commit latency
18
Friday, November 8, 13
Optimal wide-area commit latency
18
Friday, November 8, 13
Optimal wide-area commit latency
18
Friday, November 8, 13
Optimal wide-area commit latency
18
Friday, November 8, 13
Optimal wide-area commit latency
18
Friday, November 8, 13
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
Friday, November 8, 13
EPaxos
Mencius
Generalized Paxos
Multi-Paxos (CA leader)
0 50 100 150 200 250
Median Commit Latency [ms]
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
Friday, November 8, 13
EPaxos
Mencius
Generalized Paxos
Multi-Paxos (CA leader)
0 50 100 150 200 250
Median Commit Latency [ms]
CA
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
Friday, November 8, 13
VA
EPaxos
Mencius
Generalized Paxos
Multi-Paxos (CA leader)
0 50 100 150 200 250
Median Commit Latency [ms]
CA
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
Friday, November 8, 13
VA OR
EPaxos
Mencius
Generalized Paxos
Multi-Paxos (CA leader)
0 50 100 150 200 250
Median Commit Latency [ms]
CA
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
Friday, November 8, 13
VA OR
EPaxos
Mencius
Generalized Paxos
Multi-Paxos (CA leader)
0 50 100 150 200 250
Median Commit Latency [ms]
JPCA
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
Friday, November 8, 13
VA OR
EPaxos
Mencius
Generalized Paxos
Multi-Paxos (CA leader)
0 50 100 150 200 250
Median Commit Latency [ms]
EUJPCA
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
Friday, November 8, 13
VA OR
EPaxos
Mencius
Generalized Paxos
Multi-Paxos (CA leader)
0 50 100 150 200 250
Median Commit Latency [ms]
EUJPCA
85ms90ms130ms
300ms150ms
Optimal wide-area commit latency
18
EPaxos: Optimal commit latency in wide-areafor 3 and 5 replicas
Friday, November 8, 13
Operations / secOperations / sec
Multi-PaxosMencius
EPaxos 0%EPaxos 2%
EPaxos 100%
0 10000 20000 30000 40000 50000 60000
Higher + more stable throughput
19
0 10000 20000 30000 40000 50000 60000
5 replicas 3 replicas
Friday, November 8, 13
Operations / secOperations / sec
Multi-PaxosMencius
EPaxos 0%EPaxos 100%
0 10000 20000 30000 40000 50000 60000
Multi-PaxosMencius
EPaxos 0%EPaxos 2%
EPaxos 100%
0 10000 20000 30000 40000 50000 60000
Higher + more stable throughput
19
0 10000 20000 30000 40000 50000 60000
5 replicas 3 replicas
0 10000 20000 30000 40000 50000 60000
When one replica is slow
Operations / secOperations / sec
Friday, November 8, 13
EPaxos: higher throughput w/ batching
20
Latency[ms]
log scale
5 ms batching, local area
Multi-PaxosEPaxos 0%EPaxos 100%
Throughput (ops / sec)
Friday, November 8, 13
EPaxos: higher throughput w/ batching
20
Latency[ms]
log scale
5 ms batching, local area
0 100000 200000 300000 400000 5000001
10
100
1000
10000Multi-PaxosEPaxos 0%EPaxos 100%
Throughput (ops / sec)
Friday, November 8, 13
EPaxos: higher throughput w/ batching
20
Latency[ms]
log scale
5 ms batching, local area
0 100000 200000 300000 400000 5000001
10
100
1000
10000Multi-PaxosEPaxos 0%EPaxos 100%
Throughput (ops / sec)
Friday, November 8, 13
EPaxos: higher throughput w/ batching
20
Latency[ms]
log scale
5 ms batching, local area
0 100000 200000 300000 400000 5000001
10
100
1000
10000Multi-PaxosEPaxos 0%EPaxos 100%
Throughput (ops / sec)
Friday, November 8, 13
EPaxos: higher throughput w/ batching
20
Latency[ms]
log scale
5 ms batching, local area
0 100000 200000 300000 400000 5000001
10
100
1000
10000Multi-PaxosEPaxos 0%EPaxos 100%
Throughput (ops / sec)
Friday, November 8, 13
Time [sec]
0 10000 20000 30000
0 10000 20000 30000
0
10000
20000
30000
0 5 10 15 20 25 30 35
Constant availability
21
Committhroughput[ops / sec]
leader failure
replica failure
Multi-Paxos
Mencius
EPaxosreplica failure
delayed commits
Friday, November 8, 13
Time [sec]
0 10000 20000 30000
0 10000 20000 30000
0
10000
20000
30000
0 5 10 15 20 25 30 35
Constant availability
21
Committhroughput[ops / sec]
leader failure
replica failure
Multi-Paxos
Mencius
EPaxosreplica failure
delayed commits
Friday, November 8, 13
Time [sec]
0 10000 20000 30000
0 10000 20000 30000
0
10000
20000
30000
0 5 10 15 20 25 30 35
Constant availability
21
Committhroughput[ops / sec]
leader failure
replica failure
Multi-Paxos
Mencius
EPaxosreplica failure
delayed commits
Friday, November 8, 13
EPaxos insights
22
Friday, November 8, 13
EPaxos insights
22
Order commands explicitly
Friday, November 8, 13
EPaxos insights
22
Order commands explicitly
High throughput
Friday, November 8, 13
EPaxos insights
22
Order commands explicitly
High throughput Stability
Friday, November 8, 13
EPaxos insights
22
Order commands explicitly
High throughput Stability Low latency
Friday, November 8, 13
EPaxos insights
22
Order commands explicitly
High throughput Stability Low latency
Optimize only delays that matter(clients co-located w/ closest replica)
Friday, November 8, 13
EPaxos insights
22
Order commands explicitly
High throughput Stability Low latency
Optimize only delays that matter(clients co-located w/ closest replica)
Smaller quorums
Friday, November 8, 13
EPaxos insights
22
Order commands explicitly
High throughput Stability Low latency
Optimize only delays that matter(clients co-located w/ closest replica)
Smaller quorums
Friday, November 8, 13
23
TLA+ Spec
http://cs.cmu.edu/~imoraru/epaxos/tr.pdf
Formal Proof
Open Source Release
http://github.com/efficient/epaxos
Friday, November 8, 13