
  • Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica

    VLDB 2012

    UC Berkeley

    PBS: Probabilistically Bounded Staleness for Practical Partial Quorums

  • NOTE TO READER: watch a recording (of an earlier talk) at http://vimeo.com/37758648

  • NOTE TO READER: play with a live demo at http://pbs.cs.berkeley.edu/#demo

  • R+W: strong consistency, higher latency ⟷ eventual consistency, lower latency

  • consistency is a continuum, not a binary choice: strong ⟷ eventual

  • our focus: latency vs. consistency, informed by practice

    not in this talk: availability, partitions, failures

  • our contributions:

    quantify eventual consistency: wall-clock time (“how eventual?”) and versions (“how consistent?”)

    analyze real-world systems: EC is often strongly consistent; describe when and why

  • outline: intro, system model, practice, metrics, insights, integration

  • Dynamo: Amazon’s Highly Available Key-value Store (SOSP 2007)

    Apache Cassandra (DataStax), Project Voldemort

  • in use at: Adobe, Cisco, Digg, Gowalla, IBM, Morningstar, Netflix, Palantir, Rackspace, Reddit, Rhapsody, Shazam, Spotify, SoundCloud, Twitter, Mozilla, Ask.com, Yammer, Aol, GitHub, Joyent Cloud, Best Buy, LinkedIn, Boeing, Comcast, Gilt Groupe

    (Cassandra, Riak, Voldemort)

  • N replicas per key; read: wait for R replies; write: wait for W acks

    example: N=3, R=2

  • if R+W > N: “strong” consistency

    else: eventual consistency

  • R+W > N: “strong” consistency (regular register)

    reads return the last acknowledged write or an in-flight write (per key)
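
    A minimal sketch (not from the slides) of the R+W > N rule; the function name is illustrative:

        def consistency_mode(n: int, r: int, w: int) -> str:
            """Classify a partial-quorum configuration of N replicas,
            R-reply reads, and W-ack writes."""
            if r + w > n:
                # Read and write quorums must intersect in at least one replica,
                # so reads see the last acknowledged (or an in-flight) write.
                return "strong (regular register)"
            return "eventual"

        print(consistency_mode(3, 2, 2))  # strong (regular register)
        print(consistency_mode(3, 1, 1))  # eventual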

  • Latency, LinkedIn disk-based model, N=3 (relative to R=1 / W=1):

    R    99th    99.9th        W    99th    99.9th
    1    1x      1x            1    1x      1x
    2    1.59x   2.35x         2    2.01x   1.9x
    3    4.8x    6.13x         3    4.96x   14.96x

  • ⇧ consistency, ⇧ latency: wait for more replicas, read more recent data

    ⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data

  • eventual consistency (R+W ≤ N): “if no new updates are made to the object, eventually all accesses will return the last updated value” (W. Vogels, CACM 2008)

  • How eventual? (How long do I have to wait?)

  • How consistent? (What happens if I don’t wait?)

  • R+W: strong consistency, higher latency ⟷ eventual consistency, lower latency

  • outline: intro, system model, practice, metrics, insights, integration

  • Cassandra: R=W=1, N=3 by default (1+1 ≯ 3)
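
    For illustration only (not from the talk): in Cassandra, quorum sizes are chosen per request via consistency levels, while N is the keyspace’s replication factor. A hedged sketch with the DataStax Python driver, assuming a reachable local cluster and a hypothetical ks.t table:

        from cassandra import ConsistencyLevel
        from cassandra.cluster import Cluster
        from cassandra.query import SimpleStatement

        session = Cluster(["127.0.0.1"]).connect()  # N set by ks's replication_factor, e.g. 3

        # R = W = 1 (consistency level ONE, the default): fast, eventually consistent.
        write = SimpleStatement("INSERT INTO ks.t (k, v) VALUES (%s, %s)",
                                consistency_level=ConsistencyLevel.ONE)
        session.execute(write, ("key", 1))

        # R + W > N: e.g., QUORUM reads and writes (2 of 3) give "strong" reads.
        read = SimpleStatement("SELECT v FROM ks.t WHERE k = %s",
                               consistency_level=ConsistencyLevel.QUORUM)
        row = session.execute(read, ("key",)).one()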

  • eventual consistency in the wild: “maximum performance”, “very low latency”, okay for “most data”, the “general case”

  • anecdotally, EC is “good enough” for many kinds of data; but how eventual? how consistent? when is it “eventual and consistent enough”?

  • Can we do better? Probabilistically Bounded Staleness: we can’t make promises, but we can give expectations

  • outline: intro, system model, practice, metrics, insights, integration

  • How eventual? (How long do I have to wait?)

  • How eventual? t-visibility: probability p of consistent reads after t seconds

    (e.g., 10 ms after a write, 99.9% of reads are consistent)

  • t-visibility depends on

    messaging and processing delays

  • Coordinator/Replica timeline: write sent, ack returned (wait for W responses); t seconds elapse; read sent, response returned (wait for R responses)

    a response is stale if the read arrives at the replica before the write does; this happens once per replica
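
    A hedged restatement of that per-replica staleness condition as code (names are mine, not the authors’):

        def replica_returns_stale(w_latency: float, commit_time: float,
                                  t: float, r_latency: float) -> bool:
            """One replica's view: the write reaches it after w_latency;
            the read, issued t seconds after the write committed (commit_time,
            i.e. when the coordinator had W acks), reaches it after r_latency.
            The replica returns stale data iff the read gets there first."""
            return commit_time + t + r_latency < w_latency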

  • example (N=2, W=1, R=1): Alice’s write to R1 and R2 is acknowledged as soon as R1 acks; Bob then reads, and R2 responds before the write has reached it, so Bob’s read is inconsistent

  • WARS: the timeline’s four latencies, measured once per replica

    (W) write from coordinator to replica, (A) ack from replica back, (R) read from coordinator to replica, (S) response from replica back

    the coordinator waits for W acks, t seconds elapse, then it waits for R read responses; a replica’s response is stale if its read arrives before its write

  • solving WARS analytically means order statistics over dependent variables; instead: Monte Carlo methods

  • to use WARS:

    1. gather latency data, e.g. samples (ms):
       W: 53.2, 44.5, 101.1, ...
       A: 10.3,  8.2,  11.3, ...
       R: 15.3, 22.4,  19.8, ...
       S:  9.6, 14.2,   6.7, ...

    2. run the simulation: Monte Carlo, sampling from the measured latencies (e.g., one trial might draw W=44.5, A=11.3, R=15.3, S=14.2)
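
    A minimal Monte Carlo sketch of this procedure, my reconstruction rather than the authors’ simulator: resample the traced W/A/R/S latencies per replica and estimate the probability that a read issued t seconds after a write commits sees that write.

        import random

        def simulate_t_visibility(W, A, R, S, n, r, w, t, iters=100_000):
            """Estimate P(consistent read t seconds after a write commits)
            under the WARS model, by resampling measured latencies (ms)."""
            consistent = 0
            for _ in range(iters):
                ws = [random.choice(W) for _ in range(n)]   # write: coordinator -> replica
                as_ = [random.choice(A) for _ in range(n)]  # ack:   replica -> coordinator
                rs = [random.choice(R) for _ in range(n)]   # read:  coordinator -> replica
                ss = [random.choice(S) for _ in range(n)]   # resp.: replica -> coordinator
                # The write commits when the w-th ack arrives.
                commit = sorted(wi + ai for wi, ai in zip(ws, as_))[w - 1]
                # The read coordinator uses the first r replicas to respond.
                responders = sorted(range(n), key=lambda i: rs[i] + ss[i])[:r]
                # A responder is fresh if the write reached it before the read did.
                if any(ws[i] <= commit + t + rs[i] for i in responders):
                    consistent += 1
            return consistent / iters

        # Hypothetical reuse of the slide's sample values (ms):
        W_samples, A_samples = [53.2, 44.5, 101.1], [10.3, 8.2, 11.3]
        R_samples, S_samples = [15.3, 22.4, 19.8], [9.6, 14.2, 6.7]
        print(simulate_t_visibility(W_samples, A_samples, R_samples, S_samples,
                                    n=3, r=1, w=1, t=10.0))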

  • WARS accuracy (real Cassandra cluster, varying latencies): t-visibility RMSE 0.28%; latency N-RMSE 0.48%

  • How eventual? t-visibility: consistent reads with probability p after t seconds

    key: the WARS model; need: latency data

  • outline: intro, system model, practice, metrics, insights, integration

  • Yammer (100K+ companies) uses Riak; LinkedIn (175M+ users) built and uses Voldemort

    production latencies fit Gaussian mixtures
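
    As an illustration of that last point (library choice is mine, not the authors’): each traced latency distribution could be fit with a Gaussian mixture and then sampled during simulation.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        # Hypothetical write-latency samples (ms) traced from production.
        w_samples = np.array([53.2, 44.5, 101.1, 48.7, 95.4, 51.0, 47.3, 99.8])

        # Fit a two-component Gaussian mixture to the traced latencies.
        gmm = GaussianMixture(n_components=2).fit(w_samples.reshape(-1, 1))

        # Draw fresh latencies from the fitted model for Monte Carlo runs.
        synthetic_w, _ = gmm.sample(1000)
        synthetic_w = np.clip(synthetic_w.ravel(), 0.0, None)  # latencies can't be negative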

  • [plot omitted: N=3; 10 ms]

  • LNKD-DISK, N=3 (latency is combined read and write latency at the 99.9th percentile):

    R=3, W=1: 100% consistent, latency 15.01 ms
    R=2, W=1, t=13.6 ms: 99.9% consistent, latency 12.53 ms (16.5% faster)

    worthwhile?


  • LNKD-SSD, N=3 (latency is combined read and write latency at the 99.9th percentile):

    R=3, W=1: 100% consistent, latency 4.20 ms
    R=1, W=1, t=1.85 ms: 99.9% consistent, latency 1.32 ms (59.5% faster)

  • [figure: CDFs of write latency (ms, log scale) for W=1, W=2, W=3; latency models: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

  • (WARS timeline, as before) SSDs reduce variance compared to disks!

  • Yammer, N=3: 81.1% (187 ms) latency reduction with a t-visibility of 202 ms (99.9th percentile)

  • How consistent? k-staleness (versions), monotonic reads, quorum load: in the paper

  • in the paper: ⟨k,t⟩-staleness: versions and time

  • also in the paper: latency distributions, WAN model, varying quorum sizes, staleness detection

  • outline: intro, system model, practice, metrics, insights, integration

  • Integration (Project Voldemort): 1. tracing, 2. simulation, 3. tune N, R, W
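
    A hedged sketch of step 3 (illustrative only, not the talk’s integration code): sweep (R, W) choices with a simulator like simulate_t_visibility above and keep configurations that meet a consistency target at a given t.

        def tune_quorums(simulate, n=3, t=10.0, target=0.999):
            """Return (r, w, p) triples meeting the consistency target at time t,
            smallest quorums first (a rough proxy for lowest latency)."""
            candidates = []
            for r in range(1, n + 1):
                for w in range(1, n + 1):
                    p = simulate(n=n, r=r, w=w, t=t)
                    if p >= target:
                        candidates.append((r + w, r, w, p))
            return [(r, w, p) for _, r, w, p in sorted(candidates)]

        # Usage with the earlier sketch (latency samples bound via a lambda):
        # sim = lambda **kw: simulate_t_visibility(W_samples, A_samples,
        #                                          R_samples, S_samples, **kw)
        # print(tune_quorums(sim))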

  • https://issues.apache.org/jira/browse/CASSANDRA-4261


  • http://pbs.cs.berkeley.edu/#demo


  • Related Work

    Quorum Systems: probabilistic quorums [PODC ’97]; deterministic k-quorums [DISC ’05, ’06]

    Consistency Verification: Golab et al. [PODC ’11]; Bermbach and Tai [M4WSOC ’11]; Wada et al. [CIDR ’11]; Anderson et al. [HotDep ’10]; transactional consistency: Zellag and Kemme [ICDE ’11], Fekete et al. [VLDB ’09]

    Latency-Consistency: Daniel Abadi [Computer ’12]; Kraska et al. [VLDB ’09]

    Bounded Staleness Guarantees: TACT [OSDI ’00]; FRACS [ICDCS ’03]; AQuA [IEEE TPDS ’03]

  • R+W: strong consistency, higher latency ⟷ eventual consistency, lower latency

  • consistency is a continuum: strong ⟷ eventual

  • PBS: quantify eventual consistency

    model staleness in time and versions; latency-consistency trade-offs; analyze real systems and hardware; quantify which choice is best and explain why EC is often strongly consistent

    pbs.cs.berkeley.edu · http://bailis.org/projects/pbs

  • Extra Slides

  • PBS and apps

  • tolerating staleness requires either:

    staleness-tolerant data structures (timelines, logs); cf. commutative data structures, logical monotonicity

    or asynchronous compensation code: detect violations after data is returned and write code to fix any errors (see paper); cf. “Building on Quicksand”: memories, guesses, apologies

  • asynchronous compensation: minimize (compensation cost) × (# of expected anomalies)

  • Read only newer data? (monotonic reads session guarantee)

    for a given key: # versions of tolerable staleness = global write rate / client’s read rate
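
    A back-of-the-envelope version of that rule, under my reading of the slide’s ratio: if a key is written faster than a given client reads it, several versions of staleness still never send that client backwards in time.

        def tolerable_k(global_write_rate: float, client_read_rate: float) -> int:
            """Versions of staleness a single client can tolerate for one key while
            still reading monotonically (rule of thumb, rates in ops/sec)."""
            return max(1, int(global_write_rate / client_read_rate))

        print(tolerable_k(global_write_rate=100.0, client_read_rate=10.0))  # 10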

  • Failure?

  • Treat failures as latency spikes

  • How l o n g do partitions last?

  • what time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr; 8.76 consecutive hours down ⇒ a bad 8-hour rolling average

    hide failures in the tail of the distribution OR continuously evaluate the SLA and adjust

  • [figure repeated: CDFs of write latency (ms, log scale) for W=1, W=2, W=3; latency models: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

  • [figure: latency CDFs (ms, log scale) across quorum sizes; latency models: LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3; LNKD-SSD and LNKD-DISK are identical for reads]

  • ⟨k,t⟩-staleness: versions and time

    approximation: exponentiate t-staleness by k
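
    A small sketch of that approximation (my paraphrase): treat missing each of the last k versions as roughly independent, each with the t-staleness probability 1 − p_t.

        def p_kt_consistent(p_t: float, k: int) -> float:
            """Approximate P(read is within k versions, t seconds after a write):
            exponentiate the t-staleness probability (1 - p_t) by k."""
            return 1.0 - (1.0 - p_t) ** k

        print(p_kt_consistent(p_t=0.9, k=3))  # 0.999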

  • [figure: synthetic exponential latency distributions, N=3, W=1, R=1; write latency set to 1/4x and 10x the ARS latencies]

  • concurrent writes: deterministically choose between versions (e.g., a coordinator with R=2 receiving (“key”, 1) and (“key”, 2))

  • N = 3 replicas, R=3: the client sends read(“key”) to the coordinator, which forwards the read to R1, R2, R3, each storing (“key”, 1)

  • the coordinator waits for R=3 responses of (“key”, 1) and returns (“key”, 1) to the client

  • with R=1 the coordinator still sends the read to all replicas, but returns (“key”, 1) to the client after the first response

  • with W=1, a write(“key”, 2) is acknowledged to the client as soon as one replica acks; the other replicas, still holding (“key”, 1), receive the write asynchronously

    a concurrent read(“key”) with R=1 can be answered by a replica that has not yet received the write and returns the stale (“key”, 1)
