
  • Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica

    VLDB 2012

    UC Berkeley

    PBS: Probabilistically Bounded Staleness for Practical Partial Quorums

  • NOTE TO READER: watch a recording (of an earlier talk) at http://vimeo.com/37758648

  • NOTE TO READER: play with a live demo at http://pbs.cs.berkeley.edu/#demo

  • R+W: strong consistency, higher latency ⟷ eventual consistency, lower latency

  • consistency is a continuum, not a binary choice: strong ⟷ eventual

  • our focus: latency vs. consistency, informed by practice

    not in this talk: availability, partitions, failures

  • our contributions:

    quantify eventual consistency: wall-clock time (“how eventual?”) and versions (“how consistent?”)

    analyze real-world systems: EC is often strongly consistent; describe when and why

  • outline: intro, system model, practice, metrics, insights, integration

  • Dynamo: Amazon’s Highly Available Key-value Store (SOSP 2007)

    Apache Cassandra (DataStax), Project Voldemort

  • in use at: Adobe, Cisco, Digg, Gowalla, IBM, Morningstar, Netflix, Palantir, Rackspace, Reddit, Rhapsody, Shazam, Spotify, SoundCloud, Twitter, Mozilla, Ask.com, Yammer, Aol, GitHub, Joyent Cloud, Best Buy, LinkedIn, Boeing, Comcast, Gilt Groupe

    (Cassandra, Riak, Voldemort)

  • N replicas per key; read: wait for R replies; write: wait for W acks

    example: N=3, R=2

  • if R+W > N: “strong” consistency

    else: eventual consistency

  • R+W > N: “strong” consistency (regular register)

    reads return the last acknowledged write or an in-flight write (per key)
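
    A minimal sketch (not from the slides) of the R+W > N rule; the function name is illustrative:

        def consistency_mode(n: int, r: int, w: int) -> str:
            """Classify a partial-quorum configuration of N replicas,
            R-reply reads, and W-ack writes."""
            if r + w > n:
                # Read and write quorums must intersect in at least one replica,
                # so reads see the last acknowledged (or an in-flight) write.
                return "strong (regular register)"
            return "eventual"

        print(consistency_mode(3, 2, 2))  # strong (regular register)
        print(consistency_mode(3, 1, 1))  # eventual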

  • Latency, LinkedIn disk-based model, N=3 (relative to R=1 / W=1):

    R    99th    99.9th        W    99th    99.9th
    1    1x      1x            1    1x      1x
    2    1.59x   2.35x         2    2.01x   1.9x
    3    4.8x    6.13x         3    4.96x   14.96x

  • ⇧ consistency, ⇧ latency: wait for more replicas, read more recent data

    ⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data

  • eventual consistency (R+W ≤ N): “if no new updates are made to the object, eventually all accesses will return the last updated value” (W. Vogels, CACM 2008)

  • How eventual? (How long do I have to wait?)

  • How consistent? (What happens if I don’t wait?)

  • R+W: strong consistency, higher latency ⟷ eventual consistency, lower latency

  • outline: intro, system model, practice, metrics, insights, integration

  • Cassandra: R=W=1, N=3 by default (1+1 ≯ 3)
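
    For illustration only (not from the talk): in Cassandra, quorum sizes are chosen per request via consistency levels, while N is the keyspace’s replication factor. A hedged sketch with the DataStax Python driver, assuming a reachable local cluster and a hypothetical ks.t table:

        from cassandra import ConsistencyLevel
        from cassandra.cluster import Cluster
        from cassandra.query import SimpleStatement

        session = Cluster(["127.0.0.1"]).connect()  # N set by ks's replication_factor, e.g. 3

        # R = W = 1 (consistency level ONE, the default): fast, eventually consistent.
        write = SimpleStatement("INSERT INTO ks.t (k, v) VALUES (%s, %s)",
                                consistency_level=ConsistencyLevel.ONE)
        session.execute(write, ("key", 1))

        # R + W > N: e.g., QUORUM reads and writes (2 of 3) give "strong" reads.
        read = SimpleStatement("SELECT v FROM ks.t WHERE k = %s",
                               consistency_level=ConsistencyLevel.QUORUM)
        row = session.execute(read, ("key",)).one()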

  • eventual consistency in the wild: “maximum performance”, “very low latency”, okay for “most data”, the “general case”

  • anecdotally, EC is “good enough” for many kinds of data; but how eventual? how consistent? when is it “eventual and consistent enough”?

  • Can we do better? Probabilistically Bounded Staleness: we can’t make promises, but we can give expectations

  • outline: intro, system model, practice, metrics, insights, integration

  • How eventual? (How long do I have to wait?)

  • How eventual? t-visibility: probability p of consistent reads after t seconds

    (e.g., 10 ms after a write, 99.9% of reads are consistent)

  • t-visibility depends on

    messaging and processing delays

  • Coordinator/Replica timeline: write sent, ack returned (wait for W responses); t seconds elapse; read sent, response returned (wait for R responses)

    a response is stale if the read arrives at the replica before the write does; this happens once per replica
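
    A hedged restatement of that per-replica staleness condition as code (names are mine, not the authors’):

        def replica_returns_stale(w_latency: float, commit_time: float,
                                  t: float, r_latency: float) -> bool:
            """One replica's view: the write reaches it after w_latency;
            the read, issued t seconds after the write committed (commit_time,
            i.e. when the coordinator had W acks), reaches it after r_latency.
            The replica returns stale data iff the read gets there first."""
            return commit_time + t + r_latency < w_latency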

  • example (N=2, W=1, R=1): Alice’s write to R1 and R2 is acknowledged as soon as R1 acks; Bob then reads, and R2 responds before the write has reached it, so Bob’s read is inconsistent

  • WARS: the timeline’s four latencies, measured once per replica

    (W) write from coordinator to replica, (A) ack from replica back, (R) read from coordinator to replica, (S) response from replica back

    the coordinator waits for W acks, t seconds elapse, then it waits for R read responses; a replica’s response is stale if its read arrives before its write

  • solving WARS analytically means order statistics over dependent variables; instead: Monte Carlo methods

  • to use WARS:

    1. gather latency data, e.g. samples (ms):
       W: 53.2, 44.5, 101.1, ...
       A: 10.3,  8.2,  11.3, ...
       R: 15.3, 22.4,  19.8, ...
       S:  9.6, 14.2,   6.7, ...

    2. run the simulation: Monte Carlo, sampling from the measured latencies (e.g., one trial might draw W=44.5, A=11.3, R=15.3, S=14.2)
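
    A minimal Monte Carlo sketch of this procedure, my reconstruction rather than the authors’ simulator: resample the traced W/A/R/S latencies per replica and estimate the probability that a read issued t seconds after a write commits sees that write.

        import random

        def simulate_t_visibility(W, A, R, S, n, r, w, t, iters=100_000):
            """Estimate P(consistent read t seconds after a write commits)
            under the WARS model, by resampling measured latencies (ms)."""
            consistent = 0
            for _ in range(iters):
                ws = [random.choice(W) for _ in range(n)]   # write: coordinator -> replica
                as_ = [random.choice(A) for _ in range(n)]  # ack:   replica -> coordinator
                rs = [random.choice(R) for _ in range(n)]   # read:  coordinator -> replica
                ss = [random.choice(S) for _ in range(n)]   # resp.: replica -> coordinator
                # The write commits when the w-th ack arrives.
                commit = sorted(wi + ai for wi, ai in zip(ws, as_))[w - 1]
                # The read coordinator uses the first r replicas to respond.
                responders = sorted(range(n), key=lambda i: rs[i] + ss[i])[:r]
                # A responder is fresh if the write reached it before the read did.
                if any(ws[i] <= commit + t + rs[i] for i in responders):
                    consistent += 1
            return consistent / iters

        # Hypothetical reuse of the slide's sample values (ms):
        W_samples, A_samples = [53.2, 44.5, 101.1], [10.3, 8.2, 11.3]
        R_samples, S_samples = [15.3, 22.4, 19.8], [9.6, 14.2, 6.7]
        print(simulate_t_visibility(W_samples, A_samples, R_samples, S_samples,
                                    n=3, r=1, w=1, t=10.0))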

  • WARS accuracy (real Cassandra cluster, varying latencies): t-visibility RMSE 0.28%; latency N-RMSE 0.48%

  • How eventual? t-visibility: consistent reads with probability p after t seconds

    key: the WARS model; need: latency data

  • outline: intro, system model, practice, metrics, insights, integration

  • Yammer (100K+ companies) uses Riak; LinkedIn (175M+ users) built and uses Voldemort

    production latencies fit Gaussian mixtures
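
    As an illustration of that last point (library choice is mine, not the authors’): each traced latency distribution could be fit with a Gaussian mixture and then sampled during simulation.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        # Hypothetical write-latency samples (ms) traced from production.
        w_samples = np.array([53.2, 44.5, 101.1, 48.7, 95.4, 51.0, 47.3, 99.8])

        # Fit a two-component Gaussian mixture to the traced latencies.
        gmm = GaussianMixture(n_components=2).fit(w_samples.reshape(-1, 1))

        # Draw fresh latencies from the fitted model for Monte Carlo runs.
        synthetic_w, _ = gmm.sample(1000)
        synthetic_w = np.clip(synthetic_w.ravel(), 0.0, None)  # latencies can't be negative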

  • [plot omitted: N=3; 10 ms]

  • LNKD-DISK, N=3 (latency is combined read and write latency at the 99.9th percentile):

    R=3, W=1: 100% consistent, latency 15.01 ms
    R=2, W=1, t=13.6 ms: 99.9% consistent, latency 12.53 ms (16.5% faster)

    worthwhile?


  • LNKD-SSD, N=3 (latency is combined read and write latency at the 99.9th percentile):

    R=3, W=1: 100% consistent, latency 4.20 ms
    R=1, W=1, t=1.85 ms: 99.9% consistent, latency 1.32 ms (59.5% faster)

  • [figure: CDFs of write latency (ms, log scale) for W=1, W=2, W=3; latency models: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

  • (WARS timeline, as before) SSDs reduce variance compared to disks!

  • Yammer, N=3: 81.1% (187 ms) latency reduction with a t-visibility of 202 ms (99.9th percentile)

  • How consistent? k-staleness (versions), monotonic reads, quorum load: in the paper

  • in the paper: ⟨k,t⟩-staleness: versions and time

  • also in the paper: latency distributions, WAN model, varying quorum sizes, staleness detection

  • outline: intro, system model, practice, metrics, insights, integration

  • Integration (Project Voldemort): 1. tracing, 2. simulation, 3. tune N, R, W
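
    A hedged sketch of step 3 (illustrative only, not the talk’s integration code): sweep (R, W) choices with a simulator like simulate_t_visibility above and keep configurations that meet a consistency target at a given t.

        def tune_quorums(simulate, n=3, t=10.0, target=0.999):
            """Return (r, w, p) triples meeting the consistency target at time t,
            smallest quorums first (a rough proxy for lowest latency)."""
            candidates = []
            for r in range(1, n + 1):
                for w in range(1, n + 1):
                    p = simulate(n=n, r=r, w=w, t=t)
                    if p >= target:
                        candidates.append((r + w, r, w, p))
            return [(r, w, p) for _, r, w, p in sorted(candidates)]

        # Usage with the earlier sketch (latency samples bound via a lambda):
        # sim = lambda **kw: simulate_t_visibility(W_samples, A_samples,
        #                                          R_samples, S_samples, **kw)
        # print(tune_quorums(sim))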

  • https://issues.apache.org/jira/browse/CASSANDRA-4261


  • http://pbs.cs.berkeley.edu/#demo


  • Related Work

    Quorum Systems: probabilistic quorums [PODC ’97]; deterministic k-quorums [DISC ’05, ’06]

    Consistency Verification: Golab et al. [PODC ’11]; Bermbach and Tai [M4WSOC ’11]; Wada et al. [CIDR ’11]; Anderson et al. [HotDep ’10]; transactional consistency: Zellag and Kemme [ICDE ’11], Fekete et al. [VLDB ’09]

    Latency-Consistency: Daniel Abadi [Computer ’12]; Kraska et al. [VLDB ’09]

    Bounded Staleness Guarantees: TACT [OSDI ’00]; FRACS [ICDCS ’03]; AQuA [IEEE TPDS ’03]

  • R+W: strong consistency, higher latency ⟷ eventual consistency, lower latency

  • consistency is a continuum: strong ⟷ eventual

  • PBS: quantify eventual consistency

    model staleness in time and versions; latency-consistency trade-offs; analyze real systems and hardware; quantify which choice is best and explain why EC is often strongly consistent

    pbs.cs.berkeley.edu · http://bailis.org/projects/pbs

  • Extra Slides

  • PBS and apps

  • tolerating staleness requires either:

    staleness-tolerant data structures (timelines, logs); cf. commutative data structures, logical monotonicity

    or asynchronous compensation code: detect violations after data is returned and write code to fix any errors (see paper); cf. “Building on Quicksand”: memories, guesses, apologies

  • asynchronous compensation: minimize (compensation cost) × (# of expected anomalies)

  • Read only newer data? (monotonic reads session guarantee)

    for a given key: # versions of tolerable staleness = global write rate / client’s read rate
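
    A back-of-the-envelope version of that rule, under my reading of the slide’s ratio: if a key is written faster than a given client reads it, several versions of staleness still never send that client backwards in time.

        def tolerable_k(global_write_rate: float, client_read_rate: float) -> int:
            """Versions of staleness a single client can tolerate for one key while
            still reading monotonically (rule of thumb, rates in ops/sec)."""
            return max(1, int(global_write_rate / client_read_rate))

        print(tolerable_k(global_write_rate=100.0, client_read_rate=10.0))  # 10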

  • Failure?

  • Treat failures as latency spikes

  • How l o n g do partitions last?

  • what time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr; 8.76 consecutive hours down ⇒ a bad 8-hour rolling average

    hide failures in the tail of the distribution OR continuously evaluate the SLA and adjust

  • [figure repeated: CDFs of write latency (ms, log scale) for W=1, W=2, W=3; latency models: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

  • [figure: latency CDFs (ms, log scale) across quorum sizes; latency models: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3; LNKD-SSD and LNKD-DISK are identical for reads]

  • ⟨k,t⟩-staleness: versions and time

    approximation: exponentiate t-staleness by k
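
    A small sketch of that approximation (my paraphrase): treat missing each of the last k versions as roughly independent, each with the t-staleness probability 1 − p_t.

        def p_kt_consistent(p_t: float, k: int) -> float:
            """Approximate P(read is within k versions, t seconds after a write):
            exponentiate the t-staleness probability (1 - p_t) by k."""
            return 1.0 - (1.0 - p_t) ** k

        print(p_kt_consistent(p_t=0.9, k=3))  # 0.999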

  • [figure: synthetic exponential latency distributions, N=3, W=1, R=1; write latency set to 1/4x and 10x the ARS latencies]

  • concurrent writes: deterministically choose between versions (e.g., a coordinator with R=2 receiving (“key”, 1) and (“key”, 2))

  • N = 3 replicas, R=3: the client sends read(“key”) to the coordinator, which forwards the read to R1, R2, R3, each storing (“key”, 1)

  • the coordinator waits for R=3 responses of (“key”, 1) and returns (“key”, 1) to the client

  • with R=1 the coordinator still sends the read to all replicas, but returns (“key”, 1) to the client after the first response

  • with W=1, a write(“key”, 2) is acknowledged to the client as soon as one replica acks; the other replicas, still holding (“key”, 1), receive the write asynchronously

    a concurrent read(“key”) with R=1 can be answered by a replica that has not yet received the write and returns the stale (“key”, 1)
