Eventual consistency (Jinyang)
Sequential consistency
• Sequential consistency properties:
– Latest read must see latest write
• Handles caching
– All writes are applied in a single order
• Handles concurrent writes
• Realizing sequential consistency:
– Reads/writes from a single node execute one at a time
– All reads/writes to address X must be ordered by one memory/storage module responsible for X
Realizing sequential consistency
[Figure: two caches or replicas; operations shown: W(A)1, W(A)2, W(B)3, Invalidate, R(B)]
Disadvantages of sequential consistency
• Requires highly available connections
– Lots of chatter between clients/servers
• Not suitable for certain scenarios:
– Disconnected clients (e.g. your laptop)
– Apps might prefer potential inconsistency to loss of availability
Why (not) eventual consistency?
• Support disconnected operations
– Better to read a stale value than nothing
– Better to save writes somewhere than nothing
• Potentially anomalous application behavior
– Stale reads and conflicting writes…
Operating w/o total connectivity
[Figure: two replicas, with writes W(A)1 and W(A)2]
Client writes to its local replica
Sync w/ server resolves non-conflicting changes, reports conflicting ones to user
No sync between clients
Pair-wise synchronization
[Figure: three replicas, with writes W(A)1, W(A)2 and W(B)3]
Pair-wise sync resolves non-conflicting changes, reports conflicting ones to users
File synchronizer
• Goal
1. All replica contents eventually become identical
2. No lost updates
– Do not replace a new version with an old one
Prevent lost updates
• Detect if updates were sequential
– If so, replace old version with new one
– If not, detect conflict
• “Optimistic” vs. “Pessimistic”
– Eventual Consistency: Let updates happen, worry about whether they can be serialized later
– Sequential Consistency: Updates cannot take effect unless they are serialized first
How to prevent lost updates?
• Strawman: use mtime to decide which version should replace the other
• Problem w/ wallclock: cannot detect disagreement on ordering
[Figure: hosts H1 and H2 write file f. W(f)a has mtime 15648, W(f)b has mtime 16679, W(f)c has mtime 23657; copies of f are also shown with mtimes 12354 and 15648]
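The weakness of the mtime strawman can be seen in a small sketch. The function name `newer_by_mtime` is an assumption for illustration; the timestamps are the ones in the figure, with b written on H1 and c written on H2 (as the histories on the next slide indicate).

```python
def newer_by_mtime(mtime_a: int, mtime_b: int) -> str:
    """Strawman rule: the copy with the larger wall-clock mtime wins."""
    if mtime_a > mtime_b:
        return "a"
    if mtime_b > mtime_a:
        return "b"
    return "same"


# H1 writes b at 16679. H2, which has only seen version a, writes c at 23657.
# The two updates are concurrent, but comparing mtimes always names a winner,
# so H1's update would be overwritten instead of being flagged as a conflict.
winner = newer_by_mtime(16679, 23657)
```

Two integers can only be ordered or equal, so this rule has no way to answer "neither is newer".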
Strawman fix
• Carry the entire modification history
• If history X is a prefix of Y, Y is newer
[Figure: writes W(f)a, W(f)b and W(f)c. Histories shown: H1:15648; H1:15648, H1:16679; H1:15648, H2:23657]
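The prefix rule can be sketched as follows. The names `is_prefix` and `compare_histories` and the result strings are assumptions for illustration; a history is modeled as a list of (host, mtime) pairs, oldest first.

```python
from typing import List, Tuple

History = List[Tuple[str, int]]  # (host, mtime) entries, oldest first


def is_prefix(x: History, y: History) -> bool:
    """True if history x is a leading part of history y."""
    return len(x) <= len(y) and y[:len(x)] == x


def compare_histories(x: History, y: History) -> str:
    if x == y:
        return "equal"
    if is_prefix(x, y):
        return "y_newer"   # y extends x, so y may replace x
    if is_prefix(y, x):
        return "x_newer"
    return "conflict"      # neither extends the other


a = [("H1", 15648)]
b = a + [("H1", 16679)]   # H1 writes b after a
c = a + [("H2", 23657)]   # H2 writes c after a, without seeing b
```

Here `compare_histories(a, b)` reports that b is newer, while `compare_histories(b, c)` reports a conflict, which the mtime strawman could not detect.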
Compress version history
[Figure: hosts H1 and H2; writes W(f)a, W(f)b and W(f)c. Full histories shown: H1:1; H1:1, H1:2; H1:1, H1:2, H2:1. Compressed forms shown: H1:1; H1:2; H1:2, H2:1]
H1:2 implies H1:1, so we only need one number per host
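With one number per host, a history becomes a vector timestamp, and the prefix test becomes an element-wise comparison. A minimal sketch, assuming vectors are dicts from host name to counter and that a missing host counts as 0 (`compare_vectors` is a hypothetical name):

```python
def compare_vectors(x: dict, y: dict) -> str:
    """Compare two vector timestamps element-wise."""
    hosts = set(x) | set(y)
    x_le_y = all(x.get(h, 0) <= y.get(h, 0) for h in hosts)
    y_le_x = all(y.get(h, 0) <= x.get(h, 0) for h in hosts)
    if x_le_y and y_le_x:
        return "equal"
    if x_le_y:
        return "y_newer"   # y has seen everything x has
    if y_le_x:
        return "x_newer"
    return "conflict"      # each side has an update the other lacks


after_b = {"H1": 2}            # compressed form of H1:1, H1:2
after_c = {"H1": 2, "H2": 1}   # compressed form of H1:1, H1:2, H2:1
```

In the figure's scenario `after_c` dominates `after_b`, so c may replace b. Had H2 written c after seeing only H1:1, its vector would be `{"H1": 1, "H2": 1}` and the comparison with `after_b` would report a conflict.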
How to deal w/ conflicts?
• Easy: mailboxes w/ two different sets of messages
• Medium: changes to different lines of a C source file
• Hard: changes to same line of a C source file
• After conflict resolution, what should the vector timestamp be?
What about file deletion?
• Can we forget about the vector timestamp for deleted files?
• Simple solution: treat deletion as a write
– Conflicts involving a deleted file are easy
• Downside:
– Need to remember vector timestamp for deleted files indefinitely
Tra [Cox, Josephson]
• What are Tra’s novel properties?
– Easy to compress storage of vector timestamps
– No need to check every file’s version vector during sync
– Allows partial sync of subtrees
– No need to keep timestamp for deleted files forever
Tra’s key technique
• Two vector timestamps:
1. One represents modification time
– Tracks what a host has
2. One represents synchronization time
– Tracks what a host knows
• Sync time implies no modification has happened since mod time
[Figure: timestamps shown: H1:1 H2:5 H3:7 and H1:10 H2:20 H3:25; files f1 and f2, each with H1:0 and H1:0 H2:0]
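One way to use the two timestamps when syncing a file from a source replica to a destination is sketched below. The names `leq` and `sync_decision` and the three outcomes are assumptions for illustration, built on the idea that mod time tracks what a host has and sync time tracks what it knows; the slides do not spell out this exact procedure.

```python
def leq(x: dict, y: dict) -> bool:
    """True if vector x is element-wise <= vector y (missing hosts count as 0)."""
    return all(t <= y.get(h, 0) for h, t in x.items())


def sync_decision(src_mtime: dict, src_synctime: dict,
                  dst_mtime: dict, dst_synctime: dict) -> str:
    if leq(src_mtime, dst_synctime):
        return "skip"      # dst already knows about src's version
    if leq(dst_mtime, src_synctime):
        return "copy"      # src's version was made knowing dst's version
    return "conflict"      # each side has a change the other has not seen


# H1 wrote f1 at H1:1 and has synced up to H1:2; H2 still holds the H1:0 copy.
first = sync_decision({"H1": 1}, {"H1": 2}, {"H1": 0}, {"H1": 0, "H2": 0})
# After that sync H2's sync time is H1:2 H2:0, so a repeat sync is skipped.
again = sync_decision({"H1": 1}, {"H1": 2}, {"H1": 1}, {"H1": 2, "H2": 0})
# If H2 had modified f1 locally (H2:1), H1 would not know about it.
clash = sync_decision({"H1": 1}, {"H1": 2}, {"H1": 0, "H2": 1}, {"H1": 0, "H2": 1})
```

The comparison needs only one file's mod time and the other side's sync time, which is what later lets whole directories be skipped.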
Using sync time
[Figure: H1 performs W(f1)a at H1:1 and W(f2)b at H1:2; H2 holds f1 and f2. Timestamps shown: H1:1; H1:1 H2:0; H1:2; H1:2 H2:0]
Compress mtime and synctime
• dir synctime = element-wise min of child sync times
• dir mtime = element-wise max of child mod times
• Sync(d1 → d1’)
– Skip d1 if mtime of d1 is less than synctime of d1’
• Can we achieve this with single mtime?
– Skip d1 if mtime of d1 is less than mtime of d1’
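The directory rules can be sketched as below. The helper names are assumptions for illustration, vectors are dicts from host to counter with missing hosts counted as 0, and the skip test uses `<=` rather than a strict "less than", which is also an assumption.

```python
def elementwise(op, vectors: list) -> dict:
    """Combine a list of vectors host by host with op (min or max)."""
    hosts = set().union(*vectors)
    return {h: op(v.get(h, 0) for v in vectors) for h in hosts}


def dir_synctime(child_synctimes: list) -> dict:
    return elementwise(min, child_synctimes)


def dir_mtime(child_mtimes: list) -> dict:
    return elementwise(max, child_mtimes)


def can_skip(src_dir_mtime: dict, dst_dir_synctime: dict) -> bool:
    """Skip the directory if dst already knows everything src has in it."""
    return all(t <= dst_dir_synctime.get(h, 0) for h, t in src_dir_mtime.items())


# On H2, f1 has been synced up to H1:2 but f2 has not been synced at all.
h2_sync = dir_synctime([{"H1": 2, "H2": 0}, {"H1": 0, "H2": 0}])
h2_mod = dir_mtime([{"H1": 1}, {"H1": 0}])
# H1's directory has mtime H1:2, so it must not be skipped yet.
skip = can_skip({"H1": 2}, h2_sync)
```

Because the directory sync time is a minimum, syncing one child never makes the directory look more up to date than its least-synced child.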
Synctime enables partial synchronization
• Directory d1 contains f1 and f2; suppose a host syncs a subtree (d1/f1)
– With synctime+mtime: synctime of d1 does not change. Mtime of d1 increases
– With mtime only: Mtime of d1 increases
• Host later syncs subtree d1/f2
– With synctime+mtime: will pull in modifications in f2 because synctime of d1 is smaller
– With mtime only: skips d1 because mtime is high enough
[Figure: f2 with timestamps H1:0 and H1:0 H2:0]
Using sync time
[Figure: H1 performs W(f1)a at H1:1 and W(f2)b at H1:2 in directory d; H2 holds d with f1 and f2. Panels "Sync f1 only" and "Sync f2 only" show the timestamps of f1, f2 and d on H2 after each step]
How to deal w/ deletion
[Figure: H1 performs W(f1)a at H1:1 and D(f2) in directory d; H2 holds d with f1 and f2, with timestamps H1:0 and H1:0 H2:0. The deletion notice carries H1:2 H2:0]
Deletion notice for a deleted file contains its sync time
How to deal w/ deletion
[Figure: H1 performs W(f1)a at H1:1 and D(f2) in directory d; H2 holds d with f1 and f2, with timestamps H1:0 and H1:0 H2:1, and H2:1 marked. The deletion notice carries H1:2 H2:0]
Deletion notice for a deleted file contains its sync time
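A deletion notice that carries a sync time can be checked against the receiver's copy roughly as follows. This is a sketch under assumptions: the function name and the two outcomes are hypothetical, and the rule (delete only if the deleting host knew about the local version) is one reading of the two figures above.

```python
def apply_deletion_notice(local_mtime: dict, notice_synctime: dict) -> str:
    """Decide what to do with a local file when a deletion notice arrives."""
    deleter_knew_local_version = all(
        t <= notice_synctime.get(h, 0) for h, t in local_mtime.items()
    )
    # If the deleter had already seen this version, the deletion supersedes it.
    # Otherwise the local copy holds a change the deleter never saw.
    return "delete" if deleter_knew_local_version else "conflict"


notice = {"H1": 2, "H2": 0}
# First figure: H2's copy of f2 is unmodified (H1:0).
unmodified = apply_deletion_notice({"H1": 0}, notice)
# Second figure: H2 has modified f2 locally (H2:1).
modified = apply_deletion_notice({"H1": 0, "H2": 1}, notice)
```

Since the notice only needs a sync time, no per-file modification timestamp has to be kept for the deleted file.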
Another definition of eventual consistency
• Eventual consistency (Tra)
– All replica contents are eventually identical
– Do not care about individual writes, just overwrite old replica w/ new one
• Eventual consistency (Bayou)
– Writes are eventually applied in total order
– Reads might not see most recent writes in total order
Bayou
[Figure: nodes N0, N1 and N2, each with a version vector and a write log. Version vector shown on every node: 0:0 1:0 2:0]
Bayou propagation
[Figure: nodes N0, N1 and N2, each with a version vector and a write log.
Version vectors shown: 0:3 1:0 2:0; 0:0 1:1 2:0; 0:0 1:0 2:0.
Write logs shown: 1:0 W(x), 2:0 W(y), 3:0 W(z); 1:1 W(x)]
Bayou propagation
[Figure: nodes N0, N1 and N2, each with a version vector and a write log.
Version vectors shown: 0:3 1:0 2:0; 0:3 1:4 2:0; 0:0 1:0 2:0.
Write logs shown: 1:0 W(x), 2:0 W(y), 3:0 W(z); 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z); 1:1 W(x)]
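The propagation step in these figures can be approximated by the sketch below. Writes are modeled as (timestamp, node id, operation) tuples, matching the "1:0 W(x)" notation; the function names are assumptions, and the sketch leaves out the receiver advancing its own clock entry, which the figures appear to show.

```python
def writes_to_send(sender_log: list, receiver_vv: dict) -> list:
    """Pick the writes the receiver has not seen, judged by its version vector."""
    return [w for w in sender_log if w[0] > receiver_vv.get(w[1], 0)]


def receive(log: list, vv: dict, incoming: list):
    """Merge incoming writes into the log, ordered by (timestamp, node id)."""
    merged = sorted(set(log) | set(incoming))
    new_vv = dict(vv)
    for ts, node, _op in incoming:
        new_vv[node] = max(new_vv.get(node, 0), ts)
    return merged, new_vv


n0_log = [(1, 0, "W(x)"), (2, 0, "W(y)"), (3, 0, "W(z)")]
n1_log = [(1, 1, "W(x)")]
n1_vv = {0: 0, 1: 1, 2: 0}

batch = writes_to_send(n0_log, n1_vv)
n1_log, n1_vv = receive(n1_log, n1_vv, batch)
```

After the exchange N1's log reads 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z), the same order as in the figure.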
Bayou propagation
[Figure: nodes N0, N1 and N2, each with a version vector and a write log.
Version vectors shown: 0:3 1:4 2:0; 0:0 1:0 2:0; 0:4 1:4 2:0.
Write log shown on two nodes: 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z)]
Which portion of the log is stable?
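One common way to answer the question is sketched below, under the assumption that each node's timestamps only increase and that a version vector entry is the latest timestamp heard from that node. A write is then stable once its timestamp is at or below the smallest entry, because no node can later produce a write that sorts before it. The name `stable_prefix` is hypothetical.

```python
def stable_prefix(log: list, vv: dict, all_nodes: list) -> list:
    """Return the writes that can no longer be reordered by late arrivals."""
    horizon = min(vv.get(n, 0) for n in all_nodes)
    return [w for w in sorted(log) if w[0] <= horizon]


log = [(1, 0, "W(x)"), (1, 1, "W(x)"), (2, 0, "W(y)"), (3, 0, "W(z)")]
nodes = [0, 1, 2]

# Nothing has been heard from N2 yet, so nothing is stable.
none_stable = stable_prefix(log, {0: 4, 1: 4, 2: 0}, nodes)
# Once N2 has been heard from at time 5, every write up to time 3 is stable.
all_stable = stable_prefix(log, {0: 3, 1: 4, 2: 5}, nodes)
```

A single silent node keeps the horizon at its last known timestamp.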
Bayou propagation
[Figure: nodes N0, N1 and N2, each with a version vector and a write log.
Version vectors shown: 0:3 1:4 2:0; 0:4 1:4 2:0; 0:3 1:4 2:5.
Write log shown on all three nodes: 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z)]
Bayou propagation
[Figure: nodes N0, N1 and N2, each with a version vector and a write log.
Version vectors shown: 0:3 1:6 2:5; 0:4 1:4 2:0; 0:4 1:4 2:5; 0:3 1:4 2:5.
Write log shown on all three nodes: 1:0 W(x), 1:1 W(x), 2:0 W(y), 3:0 W(z)]
Bayou uses a primary to commit a total order
• Why is it important to make log stable?
– Stable writes can be committed
– Stable portion of the log can be truncated
• Problem: If any node is offline, the stable portion of all logs stops growing
• Bayou’s solution:
– A designated primary defines a total commit order
– Primary assigns CSNs (commit-seq-no)
– Any write with a known CSN is stable
– All stable writes are ordered before tentative writes
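The CSN scheme can be sketched as follows. Writes are modeled as (CSN, timestamp, node id, operation) tuples, matching the "1:1:0 W(x)" notation used in the figures, with an infinite CSN standing for "tentative". The names `TENTATIVE`, `log_order` and `primary_commit` are assumptions for illustration.

```python
import math

TENTATIVE = math.inf  # CSN of a write the primary has not yet committed


def log_order(write):
    """Sort key: committed writes by CSN first, then tentative ones by (ts, node)."""
    csn, ts, node, _op = write
    return (csn, ts, node)


def primary_commit(log: list, next_csn: int):
    """On the primary, give every tentative write the next free CSN."""
    committed = []
    for csn, ts, node, op in sorted(log, key=log_order):
        if csn == TENTATIVE:
            csn = next_csn
            next_csn += 1
        committed.append((csn, ts, node, op))
    return sorted(committed, key=log_order), next_csn


primary_log = [
    (1, 1, 0, "W(x)"), (2, 2, 0, "W(y)"), (3, 3, 0, "W(z)"),
    (TENTATIVE, 1, 1, "W(x)"),   # tentative write received from N1
]
primary_log, next_csn = primary_commit(primary_log, 4)
```

N1's write keeps its timestamp 1 but receives CSN 4, so it is ordered after the three writes that were committed earlier, even though its timestamp is smaller than two of theirs.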
Bayou propagation
[Figure: nodes N0, N1 and N2, each with a version vector and a write log.
Version vectors shown: 0:3 1:0 2:0; 0:0 1:1 2:0; 0:0 1:0 2:0.
Write logs shown: 1:1:0 W(x), 2:2:0 W(y), 3:3:0 W(z); ∞:1:1 W(x)]
Bayou propagation
[Figure: nodes N0, N1 and N2, each with a version vector and a write log.
Version vectors shown: 0:4 1:1 2:0; 0:0 1:1 2:0; 0:0 1:0 2:0.
Write logs shown: 1:1:0 W(x), 2:2:0 W(y), 3:3:0 W(z), 4:1:1 W(x); ∞:1:1 W(x), which becomes 4:1:1 W(x)]