TRANSCRIPT
Conflict-free Replicated Data Types
MARC SHAPIRO, NUNO PREGUIÇA, CARLOS BAQUERO AND MAREK ZAWIRSKI
Presented by: Ron Zisman
Slide 2: Motivation
Replication and consistency are essential features of large distributed systems such as the web, P2P networks, and cloud computing.
Many replicas: great for fault tolerance and read latency.
× Problematic when updates occur:
• Slow synchronization
• Conflicts when there is no synchronization
Slide 3: Motivation
We look for an approach that:
• supports replication
• guarantees eventual consistency
• is fast and simple
Conflict-free objects = no synchronization whatsoever.
Is this practical?
Slide 4: Contributions
Theory:
• Strong Eventual Consistency (SEC)
• A solution to the CAP problem
• Formal definitions
• Two sufficient conditions
• Strong equivalence between the two
• Incomparable to sequential consistency
Practice:
• CRDTs = Convergent or Commutative Replicated Data Types
• Counters
• Set
• Directed graph
Slide 5: Strong Consistency
Ideal consistency: all replicas know about the update immediately after it executes.
• Precludes conflicts: replicas apply updates in the same total order
• Works for any deterministic object
• Requires consensus: serialization bottleneck; tolerates < n/2 faults
• Correct, but doesn’t scale
Slide 10: Eventual Consistency
• Update locally and propagate: no foreground synchronization
• Eventual, reliable delivery
• On conflict: arbitrate, roll back, reconcile
• Consensus moved to the background: better performance
× Still complex
Slide 17: Strong Eventual Consistency
• Update locally and propagate: no synchronization
• Eventual, reliable delivery
• No conflict: deterministic outcome of concurrent updates
• No consensus: tolerates up to n−1 faults
• Solves the CAP problem
Slide 22: Definition of EC
• Eventual delivery: an update delivered at some correct replica is eventually delivered to all correct replicas.
• Termination: all method executions terminate.
• Convergence: correct replicas that have delivered the same updates eventually reach equivalent state.
Note: this doesn’t preclude roll-backs and reconciliation.
Slide 23: Definition of SEC
• Eventual delivery: an update delivered at some correct replica is eventually delivered to all correct replicas.
• Termination: all method executions terminate.
• Strong convergence: correct replicas that have delivered the same updates have equivalent state.
Slide 24: System model
• A system of non-Byzantine processes interconnected by an asynchronous network
• Partition tolerance and recovery
What are the two simple conditions that guarantee strong convergence?
Slide 25: Query
• The client sends the query to any of the replicas
• Local at the source replica
• Evaluated synchronously, no side effects
Slide 28: State-based approach
An object is a tuple (S, s⁰, q, u, m):
• payload set S, initial state s⁰
• query q, update u, merge m
Local queries, local updates.
Episodically send the full state; on receive, merge it into the local state.
An update is said to be ‘delivered’ at some replica when it is included in its causal history.
Causal history: C(s⁰) = ∅
Slide 29: State-based replication
Local at source: xᵢ.u(a), xᵢ.u(b), …
• Check the precondition, compute the new state
• Update the local payload
Causal history:
• on query: unchanged
• on update u: C(u(s)) = C(s) ∪ {u}
Slide 31: State-based replication
Local at source:
• Precondition, compute
• Update the local payload
Convergence:
• Episodically: send the payload to another replica
• On delivery: merge payloads
Causal history:
• on query: unchanged
• on update u: C(u(s)) = C(s) ∪ {u}
• on merge: C(m(s, s′)) = C(s) ∪ C(s′)
Slide 34: Semi-lattice
A poset (S, ≤) is a join-semilattice if for all x, y in S a LUB x ⊔ y exists.
LUB = Least Upper Bound:
• Associative: (x ⊔ y) ⊔ z = x ⊔ (y ⊔ z)
• Commutative: x ⊔ y = y ⊔ x
• Idempotent: x ⊔ x = x
Examples: integers with max; sets with union.
Slide 35: State-based: monotonic semi-lattice ⇒ CvRDT
If:
• the payload type forms a semi-lattice,
• updates are increasing (s ≤ u(s)), and
• merge computes the Least Upper Bound,
then replicas converge to the LUB of the last values.
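These three conditions can be illustrated with a minimal grow-only set, a sketch not taken from the talk itself: the payload (a set) forms a semi-lattice under union, add() is increasing, and merge computes the LUB (set union).

```python
# Minimal CvRDT sketch (illustrative): a grow-only set.

class GSet:
    def __init__(self):
        self.payload = set()

    def add(self, e):                 # update is increasing: s ⊆ u(s)
        self.payload |= {e}

    def lookup(self, e):              # query: local, no side effects
        return e in self.payload

    def merge(self, other):           # merge = set union = LUB
        self.payload |= other.payload

# Two replicas update concurrently, then exchange state in both directions:
a, b = GSet(), GSet()
a.add("x"); b.add("y")
a.merge(b); b.merge(a)
assert a.payload == b.payload == {"x", "y"}   # replicas converge to the LUB
```

Because merge is associative, commutative, and idempotent, replicas may exchange states in any order and any number of times and still converge.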
Slide 36: Operation-based approach
An object is a tuple (S, s⁰, q, t, u, P):
• payload set S, initial state s⁰, query q
• prepare-update t, effect-update u, delivery precondition P
prepare-update: precondition checked at the source; 1st phase: at source, synchronous, no side effects.
effect-update: precondition P checked against the downstream state; 2nd phase: asynchronous, side effects applied to the downstream state.
Slide 37: Operation-based replication
Local at source:
• Precondition, compute
• Broadcast to all replicas
Causal history:
• on query/prepare-update: unchanged
Slide 38: Operation-based replication
Local at source:
• Precondition, compute
• Broadcast to all replicas
Eventually, at all replicas:
• Check the downstream precondition
• Apply the effect-update to the local replica
Causal history:
• on query/prepare-update: unchanged
• on effect-update u: C(u(s)) = C(s) ∪ {u}
Slide 40: Op-based: commutativity ⇒ CmRDT
If:
• Liveness: all replicas execute all operations, in an order where the downstream precondition (P) holds, and
• Safety: all concurrent operations commute,
then replicas converge.
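A sketch of the commutativity condition, using an op-based counter (an illustrative example, not code from the talk): increments and decrements commute, so any delivery order of the same set of operations yields the same state.

```python
# Illustrative op-based (CmRDT) counter.

class OpCounter:
    def __init__(self):
        self.value = 0

    def prepare(self, op, amount=1):    # at source: no side effect, just build the op to broadcast
        return (op, amount)

    def effect(self, op):               # downstream: apply the operation's side effect
        kind, amount = op
        self.value += amount if kind == "inc" else -amount

# The same operations delivered in two different orders:
ops = [("inc", 1), ("dec", 2), ("inc", 5)]
r1, r2 = OpCounter(), OpCounter()
for op in ops:
    r1.effect(op)
for op in reversed(ops):
    r2.effect(op)
assert r1.value == r2.value == 4        # commutativity ⇒ convergence
```

Unlike the state-based merge, effect-update is not idempotent here, so the delivery layer must deliver each operation exactly once.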
Slide 41: Monotonic semi-lattice ⇔ Commutative
A state-based object can emulate an operation-based object, and vice versa.
Practical approach: use state-based reasoning, then convert to operation-based for better efficiency.
Slide 42: Comparison
State-based:
• Update operation ≠ merge
• Simple data types
• State includes the effect of preceding updates; no separate historical information
• Inefficient if the payload is large
• File systems (NFS, Dynamo)
Operation-based:
• Update operation
• Higher level, more complex
• More powerful, more constraining
• Small messages
• Collaborative editing (Treedoc), Bayou, PNUTS
Use state-based or op-based, as convenient.
Slide 43: SEC is incomparable to sequential consistency
There is a SEC object that is not sequentially consistent. Consider a Set CRDT S with operations add(e) and remove(e):
• sequential remove(e) → add(e): e ∈ S
• concurrent add(e) ║ remove(e’): e ∈ S ∧ e’ ∉ S
• concurrent add(e) ║ remove(e): e ∈ S (suppose add wins)
Consider the following scenario with replicas r1, r2, r3:
1. r1 executes [add(e); remove(e’)] ║ r2 executes [add(e’); remove(e)]
2. r3 merges the states from r1 and r2: e ∈ S ∧ e’ ∈ S
The state of replica r3 will never occur in a sequentially-consistent execution (either remove(e) or remove(e’) must be last).
Slide 44: SEC is incomparable to sequential consistency
There is a sequentially-consistent object that is not SEC: if no crashes occur, a sequentially-consistent object is SEC.
In general, sequential consistency requires consensus to determine the single order of operations, which cannot be solved if n−1 processes crash (while SEC tolerates n−1 crashes).
Slide 47: Multi-master counter
Increment / Decrement
Payload: two vectors of integers, P (increments) and N (decrements), one entry per replica.
Partial order: (P, N) ≤ (P’, N’) ⇔ ∀i: P[i] ≤ P’[i] ∧ N[i] ≤ N’[i]
value() = Σᵢ P[i] − Σᵢ N[i]
increment() = P[myID]++
decrement() = N[myID]++
merge(x, y) = (element-wise max of x.P and y.P, element-wise max of x.N and y.N)
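The multi-master counter above can be sketched directly (an illustrative rendering of the slide's spec, not code from the talk): two grow-only vectors P and N, value = ΣP − ΣN, and merge = element-wise max, which is the LUB of the partial order.

```python
# Sketch of the multi-master (PN) counter.

class PNCounter:
    def __init__(self, n_replicas, my_id):
        self.p = [0] * n_replicas   # increments recorded per replica
        self.n = [0] * n_replicas   # decrements recorded per replica
        self.my_id = my_id

    def increment(self):            # P[myID]++
        self.p[self.my_id] += 1

    def decrement(self):            # N[myID]++
        self.n[self.my_id] += 1

    def value(self):                # Σ P[i] − Σ N[i]
        return sum(self.p) - sum(self.n)

    def merge(self, other):         # element-wise max = LUB
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]

x, y = PNCounter(2, 0), PNCounter(2, 1)
x.increment(); x.increment()
y.decrement()
x.merge(y); y.merge(x)
assert x.value() == y.value() == 1
```

Two separate grow-only vectors are needed because a single vector with ± updates would not be monotonic, so merge by max would lose decrements.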
Slide 48: Set design alternatives
Sequential specification:
• {true} add(e) {e ∈ S}
• {true} remove(e) {e ∉ S}
Concurrent: {true} add(e) ║ remove(e) {???}
• linearizable?
• error state?
• last writer wins?
• add wins?
• remove wins?
Slide 54: Observed-Remove Set
Payload: two sets, A (added) and R (removed), of (element, unique-token) pairs.
add(e) = A := A ∪ {(e, α)}, where α is a fresh unique token
remove(e) = R := R ∪ {(e, α) | (e, α) ∈ A} — removes all unique pairs observed for e
lookup(e) = ∃α: (e, α) ∈ A \ R
merge(S, S’) = (A ∪ A’, R ∪ R’)
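A sketch of this tombstone-based OR-Set (illustrative; element names and the token scheme are my own): elements are tagged with unique tokens, and remove() only deletes tokens it has observed, so a concurrent add wins.

```python
import uuid

class ORSet:
    def __init__(self):
        self.added = set()      # (element, unique token) pairs
        self.removed = set()    # tombstones for observed pairs

    def add(self, e):
        self.added.add((e, uuid.uuid4().hex))   # fresh unique token

    def remove(self, e):        # remove only the pairs observed locally
        self.removed |= {p for p in self.added if p[0] == e}

    def lookup(self, e):
        return any(p[0] == e for p in self.added - self.removed)

    def merge(self, other):     # component-wise union
        self.added |= other.added
        self.removed |= other.removed

a, b = ORSet(), ORSet()
a.add("e")
b.merge(a)          # b observes a's add
b.remove("e")       # b removes the observed token
a.add("e")          # concurrent re-add with a fresh token
a.merge(b); b.merge(a)
assert a.lookup("e") and b.lookup("e")   # add wins over the concurrent remove
```

The fresh token per add is what distinguishes a re-added element from the removed one, resolving the add ║ remove ambiguity of the previous slide in favor of add.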
Slide 57: OR-Set + Snapshot
Read consistent snapshots despite concurrent, incremental updates.
• A vector clock per process (global time); payload: a set of (event, timestamp) pairs
• Snapshot: a vector clock value
• lookup(e, t): e is in the set as of snapshot time t
• Garbage collection: retain tombstones until no longer needed; a log entry is discarded as soon as its timestamp is less than all remote vector clocks (i.e., it has been delivered to all processes)
Slide 58: Sharded OR-Set
Very large objects → independent shards (static: hash; dynamic: consensus).
Statically-sharded CRDT:
• Each shard is a CRDT
• An update touches a single shard
• No cross-object invariants
• A combination of independent CRDTs remains a CRDT
Statically-sharded OR-Set:
• A combination of smaller OR-Sets
• Consistent snapshots: the clock crosses shards
Slide 59: Directed Graph – Motivation
Design a web search engine: compute page rank over a directed graph.
• Efficiency and scalability: asynchronous processing
• Responsiveness: incremental processing, as fast as each page is crawled
Operations:
• Find new pages: add vertex
• Parse page links: add/remove arc
• Add URLs of linked pages to be crawled: add vertex
• Deleted pages: remove vertex (lookup masks incident arcs)
• Broken links allowed: add arc works even if the tail vertex doesn’t exist
Slide 60: Graph design alternatives
Graph = (V, A) where A ⊆ V × V
Sequential specification:
• {v’, v’’ ∈ V} addArc(v’, v’’) {…}
• {(v’, v’’) ∉ A} removeVertex(v’) {…}
Concurrent: removeVertex(v) ║ addArc(v’, v’’)
• linearizable?
• last writer wins?
• addArc(v’, v’’) wins? – v’ or v’’ is restored if removed
• removeVertex(v) wins? – all arcs to or from v are removed
Slide 63: Summary
A principled approach: Strong Eventual Consistency.
Two sufficient conditions:
• State-based: monotonic semi-lattice
• Operation-based: commutativity
Useful CRDTs: multi-master counter, OR-Set, directed graph.
Slide 64: Future Work
Theory:
• The class of computations that CRDTs can accomplish
• Complexity classes of CRDTs
• Classes of invariants supported by a CRDT
• CRDTs and self-stabilization, aggregation, and so on
Practice:
• A library implementation of CRDTs
• Supporting non-critical synchronous operations (committing a state, global reset, etc.)
• Sharding
Slide 65: Extras: MV-Register and the Shopping Cart Anomaly
MV-Register ≈ LWW-Set
Register payload = { (value, versionVector) }
• assign: overwrite the value, vv++
• merge: the union of every element in each input set that is not dominated by an element in the other input set
A more recent assignment overwrites an older one; concurrent assignments are merged by union (vector-clock merge).
Slide 66: Extras: MV-Register and the Shopping Cart Anomaly
Shopping cart anomaly: a deleted element reappears.
The MV-Register does not behave like a set: assignment is not an alternative to proper add and remove operations.
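The anomaly can be reproduced with a small sketch (illustrative; the cart contents, version vectors, and helper names are my own): the MV-Register payload is a set of (value, version-vector) pairs, and merge keeps every pair not dominated by a pair from the other replica. When the register's value is a whole shopping cart, a concurrent remove and add both survive the merge, and unioning the surviving values resurrects the deleted item.

```python
# Illustrative MV-Register merge reproducing the shopping-cart anomaly.

def dominated(vv, others):
    # vv is strictly dominated by some other version vector
    return any(all(o[i] >= vv[i] for i in range(len(vv))) and o != vv
               for o in others)

def merge(s1, s2):
    # keep every (value, vv) pair not dominated by a pair from the other side
    vvs1 = [vv for _, vv in s1]
    vvs2 = [vv for _, vv in s2]
    return ({p for p in s1 if not dominated(p[1], vvs2)} |
            {p for p in s2 if not dominated(p[1], vvs1)})

# Both replicas start from a cart {"book"} at version (1, 0).
# Replica 0 removes the book: it assigns an empty cart, bumping its clock entry.
r0 = {(frozenset(), (2, 0))}
# Replica 1 concurrently adds milk: it assigns {"book", "milk"}.
r1 = {(frozenset({"book", "milk"}), (1, 1))}

m = merge(r0, r1)                        # neither vv dominates: both pairs kept
cart = set().union(*(v for v, _ in m))   # client unions the concurrent values
assert "book" in cart                    # the deleted book reappears
```

An OR-Set avoids this because add and remove act on individual tagged elements rather than overwriting the whole value.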