-
Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica
VLDB 2012
UC Berkeley
Probabilistically Bounded Staleness for Practical Partial Quorums
PBS
-
NOTE TO READER
watch a recording (of an earlier talk) at: http://vimeo.com/37758648
-
NOTE TO READER
play with a live demo at: http://pbs.cs.berkeley.edu/#demo
-
R+W
strong consistency, higher latency
eventual consistency, lower latency
-
consistency continuum as a binary choice: strong vs. eventual
-
our focus: latency vs. consistency, informed by practice
not in this talk: availability, partitions, failures
-
our contributions:
quantify eventual consistency: wall-clock time (“how eventual?”), versions (“how consistent?”)
analyze real-world systems: EC is often strongly consistent; describe when and why
-
intro / system model
practice / metrics / insights
integration
-
Apache, DataStax
Project Voldemort
Dynamo: Amazon’s Highly Available Key-value Store
SOSP 2007
-
Adobe
Cisco
Digg
Gowalla
IBM
Morningstar
Netflix
Palantir
Rackspace
Reddit
Rhapsody
Shazam
Spotify
Soundcloud
Twitter
Mozilla
Ask.com
Yammer
Aol
GitHub
Joyent Cloud
Best Buy
LinkedIn
Boeing
Comcast
Cassandra
Riak
Voldemort
Gilt Groupe
-
N replicas per key
read: wait for R replies
write: wait for W acks
N=3, R=2
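A minimal sketch (not from the talk or paper) of this N/R/W partial-quorum model, assuming last-writer-wins timestamps; the names (`Replica`, `quorum_read`, etc.) are illustrative only:

```python
# Minimal partial-quorum sketch: N replicas per key, write waits for W acks,
# read waits for R replies and keeps the freshest value it saw.
import random

N, R, W = 3, 2, 2   # example configuration

class Replica:
    def __init__(self):
        self.value, self.ts = None, -1
    def apply_write(self, value, ts):
        if ts > self.ts:                 # last-writer-wins by timestamp
            self.value, self.ts = value, ts
    def serve_read(self):
        return self.value, self.ts

replicas = [Replica() for _ in range(N)]

def quorum_write(value, ts):
    # The coordinator sends the write to all N replicas but acknowledges the
    # client after W acks; here we apply it at only W replicas to mimic the
    # window in which the remaining replicas are still stale.
    for rep in replicas[:W]:
        rep.apply_write(value, ts)

def quorum_read():
    # Any R replicas may respond first; return the freshest of their replies.
    replies = [rep.serve_read() for rep in random.sample(replicas, R)]
    return max(replies, key=lambda vt: vt[1])

quorum_write("v1", ts=1)
print(quorum_read())  # with R + W > N the read quorum always overlaps the
                      # write quorum; with R + W <= N it may return stale data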
-
if: R+W > N
then: “strong” consistency
else: eventual consistency
-
R+W > N
“strong” consistency: regular register
reads return the last acknowledged write or an in-flight write (per-key)
-
Latency (LinkedIn disk-based model, N=3), relative to R=1 / W=1:

R    99th     99.9th
1    1x       1x
2    1.59x    2.35x
3    4.8x     6.13x

W    99th     99.9th
1    1x       1x
2    2.01x    1.9x
3    4.96x    14.96x
-
⇧ consistency, ⇧ latency: wait for more replicas, read more recent data
⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data
-
eventual consistency: “if no new updates are made to the object, eventually all accesses will return the last updated value”
W. Vogels, CACM 2008
R+W ≤ N
-
How eventual?
How long do I have to wait?
-
How consistent?
What happens if I don’t wait?
-
R+W
strong consistency, higher latency
eventual consistency, lower latency
-
intro / system model
practice / metrics / insights
integration
-
Cassandra: R=W=1, N=3 by default (1+1 ≯ 3)
-
eventual consistency
“maximum performance”
“very low latency”
okay for “most data”
“general case”
in the wild
-
anecdotally, EC “good enough” for many kinds of data
How eventual? How consistent?
“eventual and consistent enough”
-
Probabilistically Bounded Staleness
can’t make promises
can give expectations
Can we do better?
-
intro / system model
practice / metrics / insights
integration
-
How eventual?
How long do I have to wait?
-
t-visibility: probability p of consistent reads after t seconds
(e.g., 10ms after write, 99.9% of reads consistent)
How eventual?
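Stated slightly more formally (our paraphrase of the definition above, not notation from the slides), writing t_w for the time the write is acknowledged:

```latex
% t-visibility: a (t, p) bound on staleness in wall-clock time
\Pr\big[\text{a read starting at or after } t_w + t \text{ returns the latest acknowledged write}\big] \;\ge\; p
```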
-
t-visibility depends on
messaging and processing delays
-
Coordinator → Replica: write
Replica → Coordinator: ack; wait for W responses
t seconds elapse
Coordinator → Replica: read
Replica → Coordinator: response; wait for R responses
response is stale if the read arrives before the write (once per replica)
-
N=2, W=1, R=1
Alice: write; ack arrives from R1 (W=1 satisfied)
Bob: read; response arrives from R2: inconsistent (R2 has not yet received Alice’s write)
-
Coordinator → Replica: write, delay (W)
Replica → Coordinator: ack, delay (A); wait for W responses
t seconds elapse
Coordinator → Replica: read, delay (R)
Replica → Coordinator: response, delay (S); wait for R responses
response is stale if the read arrives before the write (once per replica)
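Reading the diagram into symbols (a paraphrase, not the paper’s exact notation), with W_i, A_i, R_i, S_i the per-replica delays of the write, ack, read, and response messages: the write commits once the W-th ack arrives, and a replica serves stale data exactly when the read request reaches it before the write does:

```latex
t_{\text{commit}} = \text{$W$-th smallest of } \{\,W_i + A_i\,\}, \qquad
\text{replica } i \text{ is stale} \iff W_i > t_{\text{commit}} + t + R_i
```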
-
solving WARS in closed form requires order statistics over dependent variables
instead: Monte Carlo methods
-
to use WARS:
1. gather latency data (ms), e.g.:
   W: 53.2, 44.5, 101.1, ...
   A: 10.3, 8.2, 11.3, ...
   R: 15.3, 22.4, 19.8, ...
   S: 9.6, 14.2, 6.7, ...
2. run simulation: Monte Carlo, sampling
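A minimal Monte Carlo sketch of this procedure (assumed, not the authors’ simulator); the `t_visibility` and `consistent_read` names and the tiny sample lists are illustrative only:

```python
# Estimate t-visibility by resampling empirical W/A/R/S latency data.
import random

def sample(latencies):
    return random.choice(latencies)

def consistent_read(t, N, R_quorum, W_quorum, W_lat, A_lat, R_lat, S_lat):
    """One trial: does a read issued t ms after the write commits return it?"""
    w = [sample(W_lat) for _ in range(N)]   # write propagation delay per replica
    a = [sample(A_lat) for _ in range(N)]   # ack delay per replica
    # The write commits when the W_quorum-th ack reaches the coordinator.
    commit = sorted(wi + ai for wi, ai in zip(w, a))[W_quorum - 1]
    r = [sample(R_lat) for _ in range(N)]   # read request delay per replica
    s = [sample(S_lat) for _ in range(N)]   # read response delay per replica
    # Replica i has the new value iff the write arrived before the read request.
    fresh = [w[i] <= commit + t + r[i] for i in range(N)]
    # The coordinator answers from the R_quorum fastest responses (by r + s).
    fastest = sorted(range(N), key=lambda i: r[i] + s[i])[:R_quorum]
    return any(fresh[i] for i in fastest)

def t_visibility(t, trials=100_000, **kw):
    """Estimate Pr[consistent read] t ms after commit by Monte Carlo."""
    return sum(consistent_read(t, **kw) for _ in range(trials)) / trials

# Example using the latency samples shown on the slide (ms); real use would
# plug in full production latency distributions.
params = dict(N=3, R_quorum=1, W_quorum=1,
              W_lat=[53.2, 44.5, 101.1], A_lat=[10.3, 8.2, 11.3],
              R_lat=[15.3, 22.4, 19.8], S_lat=[9.6, 14.2, 6.7])
print(t_visibility(t=10.0, **params))
```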
-
WARS accuracy
real Cassandra cluster, varying latencies:
t-visibility RMSE: 0.28%
latency N-RMSE: 0.48%
-
How eventual?
key: WARS model; need: latencies
t-visibility: consistent reads with probability p after t seconds
-
intro / system model
practice / metrics / insights
integration
-
Yammer: 100K+ companies, uses Riak
LinkedIn: 175M+ users, built and uses Voldemort
production latencies fit Gaussian mixtures
-
10 ms
N=3
-
LNKD-DISK, N=3
(latency is combined read and write latency at the 99.9th percentile)
R=3, W=1: 100% consistent, latency 15.01 ms
R=2, W=1: 99.9% consistent at t = 13.6 ms, latency 12.53 ms (16.5% faster)
worthwhile?
-
N=3
-
LNKD-SSD, N=3
(latency is combined read and write latency at the 99.9th percentile)
R=3, W=1: 100% consistent, latency 4.20 ms
R=1, W=1: 99.9% consistent at t = 1.85 ms, latency 1.32 ms (59.5% faster)
-
[plot: CDFs of write latency (ms, log scale) for W=1, W=2, W=3; distributions: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]
-
Coordinator → Replica: write (W)
Replica → Coordinator: ack (A); wait for W responses
t seconds elapse
Coordinator → Replica: read (R)
Replica → Coordinator: response (S); wait for R responses
response is stale if the read arrives before the write (once per replica)
SSDs reduce variance compared to disks!
-
Yammer, N=3
latency reduced 81.1% (187 ms)
t-visibility: 202 ms at the 99.9th percentile
-
in the paper:
How consistent? k-staleness (versions)
monotonic reads
quorum load
-
in the paper:
⟨k,t⟩-staleness: versions and time
-
latency distributions
WAN model
varying quorum sizes
staleness detection
in the paper
-
introsystem model
practicemetricsinsights
integration
-
Integration
1. Tracing
2. Simulation
3. Tune N, R, W
Project Voldemort
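A hedged sketch of step 3, reusing the `t_visibility` Monte Carlo sketch from earlier: sweep (R, W) and keep the smallest quorums that meet a hypothetical staleness SLA (the SLA values and function names are assumptions, not part of the talk):

```python
# Pick the cheapest (R, W) pair whose predicted t-visibility meets the SLA.
def choose_quorums(N, t_sla, p_sla, latencies):
    candidates = []
    for R in range(1, N + 1):
        for W in range(1, N + 1):
            p = t_visibility(t=t_sla, trials=20_000, N=N,
                             R_quorum=R, W_quorum=W, **latencies)
            if p >= p_sla:
                candidates.append((R + W, R, W))  # smaller quorums ~ lower latency
    # Fall back to R=W=N if no partial quorum meets the SLA.
    return min(candidates)[1:] if candidates else (N, N)

# e.g.: quorums giving >= 99.9% consistent reads 10 ms after commit, using
# traced W/A/R/S samples (lat_samples is a placeholder from step 1):
# R, W = choose_quorums(N=3, t_sla=10.0, p_sla=0.999, latencies=lat_samples)
```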
-
https://issues.apache.org/jira/browse/CASSANDRA-4261
-
http://pbs.cs.berkeley.edu/#demo
-
Related Work
Quorum Systems: probabilistic quorums [PODC ’97]; deterministic k-quorums [DISC ’05, ’06]
Consistency Verification: Golab et al. [PODC ’11]; Bermbach and Tai [M4WSOC ’11]; Wada et al. [CIDR ’11]; Anderson et al. [HotDep ’10]; transactional consistency: Zellag and Kemme [ICDE ’11], Fekete et al. [VLDB ’09]
Latency-Consistency: Daniel Abadi [Computer ’12]; Kraska et al. [VLDB ’09]
Bounded Staleness Guarantees: TACT [OSDI ’00]; FRACS [ICDCS ’03]; AQuA [IEEE TPDS ’03]
-
R+W
strong consistency, higher latency
eventual consistency, lower latency
-
consistency is a continuum: strong ↔ eventual
-
quantify eventual consistency
model staleness in time, versions
latency-consistency trade-offs
analyze real systems and hardware
PBS: quantify which choice is best and explain why EC is often strongly consistent
pbs.cs.berkeley.edu
http://bailis.org/projects/pbs
-
Extra Slides
-
PBS and apps
-
staleness requires either:
1. staleness-tolerant data structures: timelines, logs
   cf. commutative data structures, logical monotonicity
2. asynchronous compensation code: detect violations after data is returned; see paper
   cf. “Building on Quicksand”: memories, guesses, apologies
   write code to fix any errors
-
asynchronous compensation
minimize: (compensation cost) × (# of expected anomalies)
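As an illustration of that objective (all numbers here are hypothetical), the PBS-predicted inconsistency rate turns it into a simple expected-cost estimate:

```python
# Expected compensation cost per unit time, using a t-visibility estimate
# like the Monte Carlo sketch above. All inputs are made-up examples.
p_inconsistent = 1 - 0.999       # e.g. 99.9% of reads consistent at the chosen t
reads_per_second = 10_000
cost_per_compensation = 0.05     # e.g. cost of repairing one anomaly

expected_cost = p_inconsistent * reads_per_second * cost_per_compensation
print(expected_cost)             # expected compensation cost per second
```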
-
Read only newer data? (monotonic reads session guarantee)
tolerable staleness (# versions, for a given key) = global write rate / client’s read rate
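For instance (hypothetical numbers): if a key is written 100 times per second globally and a given client re-reads it only 10 times per second, roughly 10 new versions accumulate between that client’s reads, so a staleness bound of about 10 versions still yields monotonic reads for that client.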
-
Failure?
-
Treat failures as latency spikes
-
How long do partitions last?
-
what time interval?
99.9% uptime/yr ⇒ 8.76 hours downtime/yr
8.76 consecutive hours down ⇒ bad 8-hour rolling average
hide in tail of distribution OR continuously evaluate SLA, adjust
-
[plot: CDFs of write latency (ms, log scale) for W=1, W=2, W=3; distributions: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]
-
[plot: latency CDFs (ms, log scale), panels R=3, W=1, W=2; distributions: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3; LNKD-SSD and LNKD-DISK identical for reads]
-
⟨k,t⟩-staleness: versions and time
approximation: exponentiate t-staleness by k
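One hedged reading of this approximation, with p_t the t-visibility probability and the last k versions treated as independently visible:

```latex
% probability the read returns one of the k most recent versions after t seconds
P_{\langle k,t\rangle\text{-consistent}} \;\approx\; 1 - (1 - p_t)^{k}
```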
-
Synthetic, Exponential Distributions
W 1/4x ARS
W 10x ARS
N=3, W=1, R=1
-
concurrent writes: deterministically choose
Coordinator R=2
(“key”, 1) (“key”, 2)
-
N = 3 replicas, R=3
client → Coordinator: read(“key”)
Coordinator → R1, R2, R3: read
R1, R2, R3 each hold (“key”, 1)
-
N = 3 replicas, R=3
R1, R2, R3 → Coordinator: (“key”, 1)
Coordinator → client: (“key”, 1)
-
N = 3 replicas, R=1
client → Coordinator: read(“key”)
Coordinator sends the read to all replicas, waits for R=1 response
R1, R2, R3 each hold (“key”, 1); Coordinator → client: (“key”, 1)
-
W=1, R=1
writer → Coordinator: write(“key”, 2); Coordinator forwards to R1, R2, R3
R1 acks (“key”, 2); the write is acknowledged after W=1 ack while R2, R3 still hold (“key”, 1)
reader → Coordinator: read(“key”); a replica that has not yet applied the write responds (“key”, 1): stale
eventually all replicas hold (“key”, 2)
-
N=3