TRANSCRIPT
Conflict-free Replicated Data Types
MARC SHAPIRO, NUNO PREGUIÇA, CARLOS BAQUERO AND MAREK ZAWIRSKI
Presented by: Ron Zisman
Slide 2: Motivation
Replication and consistency are essential features of large distributed systems such as the web, P2P networks, and cloud computing.
Many replicas: great for fault tolerance and read latency.
× Problematic when updates occur:
• Slow synchronization
• Conflicts when there is no synchronization
Slide 3: Motivation
We look for an approach that:
• supports replication
• guarantees eventual consistency
• is fast and simple
Conflict-free objects = no synchronization whatsoever.
Is this practical?
Slide 4: Contributions
Theory:
• Strong Eventual Consistency (SEC)
• A solution to the CAP problem
• Formal definitions
• Two sufficient conditions
• Strong equivalence between the two
• Incomparable to sequential consistency
Practice:
• CRDTs = Convergent or Commutative Replicated Data Types
• Counters
• Set
• Directed graph
Slide 5: Strong Consistency
Ideal consistency: all replicas know about the update immediately after it executes.
• Precludes conflicts: replicas apply updates in the same total order
• Works for any deterministic object
• Requires consensus: serialization bottleneck; tolerates < n/2 faults
• Correct, but doesn’t scale
Slide 10: Eventual Consistency
• Update locally and propagate: no foreground synchronization
• Eventual, reliable delivery
• On conflict: arbitrate, roll back, reconcile
• Consensus moved to the background: better performance
× Still complex
Slide 17: Strong Eventual Consistency
• Update locally and propagate: no synchronization
• Eventual, reliable delivery
• No conflict: deterministic outcome of concurrent updates
• No consensus: tolerates up to n−1 faults
• Solves the CAP problem
Slide 22: Definition of EC
• Eventual delivery: an update delivered at some correct replica is eventually delivered to all correct replicas.
• Termination: all method executions terminate.
• Convergence: correct replicas that have delivered the same updates eventually reach equivalent state.
Note: this doesn’t preclude roll-backs and reconciliation.
Slide 23: Definition of SEC
• Eventual delivery: an update delivered at some correct replica is eventually delivered to all correct replicas.
• Termination: all method executions terminate.
• Strong convergence: correct replicas that have delivered the same updates have equivalent state.
Slide 24: System model
• A system of non-Byzantine processes interconnected by an asynchronous network
• Partition tolerance and recovery
What are the two simple conditions that guarantee strong convergence?
Slide 25: Query
• The client sends the query to any of the replicas
• Local at the source replica
• Evaluated synchronously, no side effects
Slide 28: State-based approach
An object is a tuple (S, s⁰, q, u, m):
• payload set S, initial state s⁰
• query q, update u, merge m
Local queries, local updates.
Episodically send the full state; on receive, merge it into the local state.
An update is said to be ‘delivered’ at some replica when it is included in its causal history.
Causal history: C(s⁰) = ∅
Slide 29: State-based replication
Local at source: xᵢ.u(a), xᵢ.u(b), …
• Check the precondition, compute the new state
• Update the local payload
Causal history:
• on query: unchanged
• on update u: C(u(s)) = C(s) ∪ {u}
Slide 31: State-based replication
Local at source:
• Precondition, compute
• Update the local payload
Convergence:
• Episodically: send the payload to another replica
• On delivery: merge payloads
Causal history:
• on query: unchanged
• on update u: C(u(s)) = C(s) ∪ {u}
• on merge: C(m(s, s′)) = C(s) ∪ C(s′)
Slide 34: Semi-lattice
A poset (S, ≤) is a join-semilattice if for all x, y in S a LUB x ⊔ y exists.
LUB = Least Upper Bound:
• Associative: (x ⊔ y) ⊔ z = x ⊔ (y ⊔ z)
• Commutative: x ⊔ y = y ⊔ x
• Idempotent: x ⊔ x = x
Examples: integers with max; sets with union.
Slide 35: State-based: monotonic semi-lattice ⇒ CvRDT
If:
• the payload type forms a semi-lattice,
• updates are increasing (s ≤ u(s)), and
• merge computes the Least Upper Bound,
then replicas converge to the LUB of the last values.
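These three conditions can be illustrated with a minimal grow-only set, a sketch not taken from the talk itself: the payload (a set) forms a semi-lattice under union, add() is increasing, and merge computes the LUB (set union).

```python
# Minimal CvRDT sketch (illustrative): a grow-only set.

class GSet:
    def __init__(self):
        self.payload = set()

    def add(self, e):                 # update is increasing: s ⊆ u(s)
        self.payload |= {e}

    def lookup(self, e):              # query: local, no side effects
        return e in self.payload

    def merge(self, other):           # merge = set union = LUB
        self.payload |= other.payload

# Two replicas update concurrently, then exchange state in both directions:
a, b = GSet(), GSet()
a.add("x"); b.add("y")
a.merge(b); b.merge(a)
assert a.payload == b.payload == {"x", "y"}   # replicas converge to the LUB
```

Because merge is associative, commutative, and idempotent, replicas may exchange states in any order and any number of times and still converge.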
Slide 36: Operation-based approach
An object is a tuple (S, s⁰, q, t, u, P):
• payload set S, initial state s⁰, query q
• prepare-update t, effect-update u, delivery precondition P
prepare-update: precondition checked at the source; 1st phase: at source, synchronous, no side effects.
effect-update: precondition P checked against the downstream state; 2nd phase: asynchronous, side effects applied to the downstream state.
Slide 37: Operation-based replication
Local at source:
• Precondition, compute
• Broadcast to all replicas
Causal history:
• on query/prepare-update: unchanged
Slide 38: Operation-based replication
Local at source:
• Precondition, compute
• Broadcast to all replicas
Eventually, at all replicas:
• Check the downstream precondition
• Apply the effect-update to the local replica
Causal history:
• on query/prepare-update: unchanged
• on effect-update u: C(u(s)) = C(s) ∪ {u}
Slide 40: Op-based: commutativity ⇒ CmRDT
If:
• Liveness: all replicas execute all operations, in an order where the downstream precondition (P) holds, and
• Safety: all concurrent operations commute,
then replicas converge.
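A sketch of the commutativity condition, using an op-based counter (an illustrative example, not code from the talk): increments and decrements commute, so any delivery order of the same set of operations yields the same state.

```python
# Illustrative op-based (CmRDT) counter.

class OpCounter:
    def __init__(self):
        self.value = 0

    def prepare(self, op, amount=1):    # at source: no side effect, just build the op to broadcast
        return (op, amount)

    def effect(self, op):               # downstream: apply the operation's side effect
        kind, amount = op
        self.value += amount if kind == "inc" else -amount

# The same operations delivered in two different orders:
ops = [("inc", 1), ("dec", 2), ("inc", 5)]
r1, r2 = OpCounter(), OpCounter()
for op in ops:
    r1.effect(op)
for op in reversed(ops):
    r2.effect(op)
assert r1.value == r2.value == 4        # commutativity ⇒ convergence
```

Unlike the state-based merge, effect-update is not idempotent here, so the delivery layer must deliver each operation exactly once.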
Slide 41: Monotonic semi-lattice ⇔ Commutative
A state-based object can emulate an operation-based object, and vice versa.
Practical approach: use state-based reasoning, then convert to operation-based for better efficiency.
Slide 42: Comparison
State-based:
• Update operation ≠ merge
• Simple data types
• State includes the effect of preceding updates; no separate historical information
• Inefficient if the payload is large
• File systems (NFS, Dynamo)
Operation-based:
• Update operation
• Higher level, more complex
• More powerful, more constraining
• Small messages
• Collaborative editing (Treedoc), Bayou, PNUTS
Use state-based or op-based, as convenient.
Slide 43: SEC is incomparable to sequential consistency
There is a SEC object that is not sequentially consistent. Consider a Set CRDT S with operations add(e) and remove(e):
• sequential remove(e) → add(e): e ∈ S
• concurrent add(e) ║ remove(e’): e ∈ S ∧ e’ ∉ S
• concurrent add(e) ║ remove(e): e ∈ S (suppose add wins)
Consider the following scenario with replicas r1, r2, r3:
1. r1 executes [add(e); remove(e’)] ║ r2 executes [add(e’); remove(e)]
2. r3 merges the states from r1 and r2: e ∈ S ∧ e’ ∈ S
The state of replica r3 will never occur in a sequentially-consistent execution (either remove(e) or remove(e’) must be last).
Slide 44: SEC is incomparable to sequential consistency
There is a sequentially-consistent object that is not SEC: if no crashes occur, a sequentially-consistent object is SEC.
In general, sequential consistency requires consensus to determine the single order of operations, which cannot be solved if n−1 processes crash (while SEC tolerates n−1 crashes).
Slide 47: Multi-master counter
Increment / Decrement
Payload: two vectors of integers, P (increments) and N (decrements), one entry per replica.
Partial order: (P, N) ≤ (P’, N’) ⇔ ∀i: P[i] ≤ P’[i] ∧ N[i] ≤ N’[i]
value() = Σᵢ P[i] − Σᵢ N[i]
increment() = P[myID]++
decrement() = N[myID]++
merge(x, y) = (element-wise max of x.P and y.P, element-wise max of x.N and y.N)
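The multi-master counter above can be sketched directly (an illustrative rendering of the slide's spec, not code from the talk): two grow-only vectors P and N, value = ΣP − ΣN, and merge = element-wise max, which is the LUB of the partial order.

```python
# Sketch of the multi-master (PN) counter.

class PNCounter:
    def __init__(self, n_replicas, my_id):
        self.p = [0] * n_replicas   # increments recorded per replica
        self.n = [0] * n_replicas   # decrements recorded per replica
        self.my_id = my_id

    def increment(self):            # P[myID]++
        self.p[self.my_id] += 1

    def decrement(self):            # N[myID]++
        self.n[self.my_id] += 1

    def value(self):                # Σ P[i] − Σ N[i]
        return sum(self.p) - sum(self.n)

    def merge(self, other):         # element-wise max = LUB
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]

x, y = PNCounter(2, 0), PNCounter(2, 1)
x.increment(); x.increment()
y.decrement()
x.merge(y); y.merge(x)
assert x.value() == y.value() == 1
```

Two separate grow-only vectors are needed because a single vector with ± updates would not be monotonic, so merge by max would lose decrements.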
Slide 48: Set design alternatives
Sequential specification:
• {true} add(e) {e ∈ S}
• {true} remove(e) {e ∉ S}
Concurrent: {true} add(e) ║ remove(e) {???}
• linearizable?
• error state?
• last writer wins?
• add wins?
• remove wins?
Slide 54: Observed-Remove Set
Payload: two sets, A (added) and R (removed), of (element, unique-token) pairs.
add(e) = A := A ∪ {(e, α)}, where α is a fresh unique token
remove(e) = R := R ∪ {(e, α) | (e, α) ∈ A} — removes all unique pairs observed for e
lookup(e) = ∃α: (e, α) ∈ A \ R
merge(S, S’) = (A ∪ A’, R ∪ R’)
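A sketch of this tombstone-based OR-Set (illustrative; element names and the token scheme are my own): elements are tagged with unique tokens, and remove() only deletes tokens it has observed, so a concurrent add wins.

```python
import uuid

class ORSet:
    def __init__(self):
        self.added = set()      # (element, unique token) pairs
        self.removed = set()    # tombstones for observed pairs

    def add(self, e):
        self.added.add((e, uuid.uuid4().hex))   # fresh unique token

    def remove(self, e):        # remove only the pairs observed locally
        self.removed |= {p for p in self.added if p[0] == e}

    def lookup(self, e):
        return any(p[0] == e for p in self.added - self.removed)

    def merge(self, other):     # component-wise union
        self.added |= other.added
        self.removed |= other.removed

a, b = ORSet(), ORSet()
a.add("e")
b.merge(a)          # b observes a's add
b.remove("e")       # b removes the observed token
a.add("e")          # concurrent re-add with a fresh token
a.merge(b); b.merge(a)
assert a.lookup("e") and b.lookup("e")   # add wins over the concurrent remove
```

The fresh token per add is what distinguishes a re-added element from the removed one, resolving the add ║ remove ambiguity of the previous slide in favor of add.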
Slide 57: OR-Set + Snapshot
Read consistent snapshots despite concurrent, incremental updates.
• A vector clock per process (global time); payload: a set of (event, timestamp) pairs
• Snapshot: a vector clock value
• lookup(e, t): e is in the set as of snapshot time t
• Garbage collection: retain tombstones until no longer needed; a log entry is discarded as soon as its timestamp is less than all remote vector clocks (i.e., it has been delivered to all processes)
Slide 58: Sharded OR-Set
Very large objects → independent shards (static: hash; dynamic: consensus).
Statically-sharded CRDT:
• Each shard is a CRDT
• An update touches a single shard
• No cross-object invariants
• A combination of independent CRDTs remains a CRDT
Statically-sharded OR-Set:
• A combination of smaller OR-Sets
• Consistent snapshots: the clock crosses shards
Slide 59: Directed Graph – Motivation
Design a web search engine: compute page rank over a directed graph.
• Efficiency and scalability: asynchronous processing
• Responsiveness: incremental processing, as fast as each page is crawled
Operations:
• Find new pages: add vertex
• Parse page links: add/remove arc
• Add URLs of linked pages to be crawled: add vertex
• Deleted pages: remove vertex (lookup masks incident arcs)
• Broken links allowed: add arc works even if the tail vertex doesn’t exist
Slide 60: Graph design alternatives
Graph = (V, A) where A ⊆ V × V
Sequential specification:
• {v’, v’’ ∈ V} addArc(v’, v’’) {…}
• {(v’, v’’) ∉ A} removeVertex(v’) {…}
Concurrent: removeVertex(v) ║ addArc(v’, v’’)
• linearizable?
• last writer wins?
• addArc(v’, v’’) wins? – v’ or v’’ is restored if removed
• removeVertex(v) wins? – all arcs to or from v are removed
Slide 63: Summary
A principled approach: Strong Eventual Consistency.
Two sufficient conditions:
• State-based: monotonic semi-lattice
• Operation-based: commutativity
Useful CRDTs: multi-master counter, OR-Set, directed graph.
Slide 64: Future Work
Theory:
• The class of computations that CRDTs can accomplish
• Complexity classes of CRDTs
• Classes of invariants supported by a CRDT
• CRDTs and self-stabilization, aggregation, and so on
Practice:
• A library implementation of CRDTs
• Supporting non-critical synchronous operations (committing a state, global reset, etc.)
• Sharding
Slide 65: Extras: MV-Register and the Shopping Cart Anomaly
MV-Register ≈ LWW-Set
Register payload = { (value, versionVector) }
• assign: overwrite the value, vv++
• merge: the union of every element in each input set that is not dominated by an element in the other input set
A more recent assignment overwrites an older one; concurrent assignments are merged by union (vector-clock merge).
Slide 66: Extras: MV-Register and the Shopping Cart Anomaly
Shopping cart anomaly: a deleted element reappears.
The MV-Register does not behave like a set: assignment is not an alternative to proper add and remove operations.
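The anomaly can be reproduced with a small sketch (illustrative; the cart contents, version vectors, and helper names are my own): the MV-Register payload is a set of (value, version-vector) pairs, and merge keeps every pair not dominated by a pair from the other replica. When the register's value is a whole shopping cart, a concurrent remove and add both survive the merge, and unioning the surviving values resurrects the deleted item.

```python
# Illustrative MV-Register merge reproducing the shopping-cart anomaly.

def dominated(vv, others):
    # vv is strictly dominated by some other version vector
    return any(all(o[i] >= vv[i] for i in range(len(vv))) and o != vv
               for o in others)

def merge(s1, s2):
    # keep every (value, vv) pair not dominated by a pair from the other side
    vvs1 = [vv for _, vv in s1]
    vvs2 = [vv for _, vv in s2]
    return ({p for p in s1 if not dominated(p[1], vvs2)} |
            {p for p in s2 if not dominated(p[1], vvs1)})

# Both replicas start from a cart {"book"} at version (1, 0).
# Replica 0 removes the book: it assigns an empty cart, bumping its clock entry.
r0 = {(frozenset(), (2, 0))}
# Replica 1 concurrently adds milk: it assigns {"book", "milk"}.
r1 = {(frozenset({"book", "milk"}), (1, 1))}

m = merge(r0, r1)                        # neither vv dominates: both pairs kept
cart = set().union(*(v for v, _ in m))   # client unions the concurrent values
assert "book" in cart                    # the deleted book reappears
```

An OR-Set avoids this because add and remove act on individual tagged elements rather than overwriting the whole value.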