TRANSCRIPT
Probabilistically Consistent
Indranil Gupta (Indy)
Department of Computer Science, UIUC
FuDiCo 2015
DPRG: http://dprg.cs.uiuc.edu
Joint Work With
• Muntasir Rahman (Graduating PhD Student)
• Luke Leslie, Lewis Tseng
• Mayank Pundir (MS, now at Facebook)
• Work funded by Air Force Research Labs/AFOSR, National Science Foundation, Google, Yahoo!, and Microsoft
Hard Choices in Extensible Distributed Systems
• Users in extensible distributed systems desire
  • Timeliness and correctness guarantees
• But these are at odds with…
  • Unpredictability
  • Network delays and failures
• The research community and industry often tend to translate this into hard choices in systems design
• Examples
  1. CAP Theorem: choice between consistency and availability (or latency)
    • Either relational databases or eventually consistent NoSQL stores
    • (Maybe a convergence now?)
  2. Always get 100% answers in computation engines (batch or stream)
    • Use checkpointing
Hard Choices… Can in Fact Be Probabilistic Choices!
• Many of these are in fact probabilistic choices
• One of the earliest examples: pbcast/Bimodal Multicast
• Examples
  1. CAP Theorem:
    • We derive a probabilistic CAP theorem that defines an achievable boundary between consistency and latency in any database system
    • We use this to incorporate probabilistic consistency and latency SLAs into Cassandra and Riak
  2. Always get 100% answers in computation engines (batch or stream)
    • In many systems, checkpointing results in 8–31x higher execution time!
    • We show that in systems like distributed graph processing systems
      • We can avoid checkpointing altogether
      • Instead, take a reactive approach: upon failure, reactively scrounge state (which is naturally replicated)
      • And achieve very high accuracy (95–99%)
Key-value/NoSQL Storage Systems
• Key-value/NoSQL stores: a $3.4B sector by 2018
• Distributed storage in the cloud
  • Netflix: video position (Cassandra)
  • Amazon: shopping cart (DynamoDB)
  • And many others
• Necessary API operations: get(key) and put(key, value)
  • And some extended operations, e.g., “CQL” in the Cassandra key-value store
Key-value/NoSQL Storage: Fast and Fresh
• Cloud clients expect both
  • Latency: low latency for all operations (reads/writes)
    • A 500 ms latency increase at Google.com costs a 20% drop in revenue
    • Each extra ms => $4M revenue loss
    • Long latency => user cognitive drift
  • Consistency: a read returns the value of one of the latest writes
    • Freshness of data means accurate tracking and higher user satisfaction
    • Most KV stores offer only weak consistency (eventual consistency)
    • Eventual consistency = if writes stop, all replicas converge, eventually
Hard vs. Soft Partitions
• The CAP Theorem considers hard partitions
• However, soft partitions may happen inside a data-center
  • Periods of elevated message delays
  • Periods of elevated loss rates
• Soft partitions are more frequent than hard partitions

[Figure: Data-center 1 (America) and Data-center 2 (Europe) separated by a hard partition across the WAN; congestion at ToR and core switches inside a data-center => soft partition]
Our work: From Impossibility to Possibility
• C -> Probabilistic C (Consistency)
• A -> Probabilistic A (Latency)
• P -> Probabilistic P (Partition Model)
• A probabilistic CAP theorem
• A system that validates how close we are to the achievable envelope
• (Goal is not: another consistency model, or NoSQL vs. New/Yes SQL)
[Figure: timeline with writes W(1), W(2) and read R(1); the freshness interval tc precedes the read]

Probabilistic Consistency (pic, tc)
• A read is tc-fresh if it returns the value of a write that starts at most tc time before the read
• pic is the likelihood that a read is NOT tc-fresh

Probabilistic Latency (pua, ta)
• pua is the likelihood that a read does NOT return an answer within ta time units

Probabilistic Partition (α, tp)
• α is the likelihood that a random path (client -> server -> client) has message delay exceeding tp time units
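As a concrete illustration of the (pic, tc) definition, here is a minimal sketch (not from the talk; the trace values are invented) that estimates pic from a log of writes and reads:

```python
# Illustrative sketch: estimating pic, the fraction of reads that are
# NOT tc-fresh, from a trace. A read is tc-fresh if the value it returns
# was written by a write starting at most tc time units before the read.

def is_tc_fresh(read_start, read_value, writes, tc):
    # writes: list of (start_time, value) pairs
    return any(read_start - w_start <= tc and w_val == read_value
               for w_start, w_val in writes)

def estimate_pic(reads, writes, tc):
    # reads: list of (start_time, returned_value) pairs
    stale = sum(1 for r_start, r_val in reads
                if not is_tc_fresh(r_start, r_val, writes, tc))
    return stale / len(reads)

writes = [(0.0, "a"), (5.0, "b")]
reads = [(6.0, "b"),   # returns the write from t=5: 1 time unit old, fresh
         (9.0, "a")]   # returns the write from t=0: 9 time units old, stale
print(estimate_pic(reads, writes, tc=3.0))  # 0.5
```

Note that the check uses the write's start time, which (as discussed later vs. PBS) lets a read overlap the write it returns.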
Probabilistic CAP (PCAP) Theorem: It is impossible to achieve both Probabilistic Consistency and Probabilistic Latency under Probabilistic Partitions if:

  tc + ta < tp  and  pua + pic < α

• Bad network -> high (α, tp)
• To get better consistency -> lower (pic, tc)
• To get better latency -> lower (pua, ta)

Full proof in our arXiv paper: http://arxiv.org/abs/1509.02464
Special case: Original CAP has α=1 and tp = ∞
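The impossibility condition can be checked mechanically. A minimal sketch (the parameter values are illustrative, not from the talk):

```python
import math

# The PCAP theorem's impossibility condition: SLAs (pic, tc) and (pua, ta)
# are jointly unachievable under partition model (alpha, tp) if
#   tc + ta < tp  AND  pua + pic < alpha
def pcap_impossible(pic, tc, pua, ta, alpha, tp):
    return (tc + ta < tp) and (pua + pic < alpha)

# Special case: classical CAP has alpha = 1 and tp = infinity, so perfect
# consistency and availability (all probabilities/timeouts zero) is impossible
print(pcap_impossible(pic=0.0, tc=0.0, pua=0.0, ta=0.0,
                      alpha=1.0, tp=math.inf))  # True

# Relaxing the SLAs enough (here, tc + ta exceeds tp) escapes the bound
print(pcap_impossible(pic=0.1, tc=0.3, pua=0.1, ta=0.3,
                      alpha=0.3, tp=0.2))  # False
```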
Towards Probabilistic SLAs
• Latency SLA: similar to latency SLAs already existing in industry
  • Meet a desired probability that the client receives an operation’s result within the timeout
  • Maximize the freshness probability within a given freshness interval
• Example: Amazon shopping cart
  • Doesn’t want to lose customers due to high latency
  • Only 10% of operations may take longer than 300 ms
    • SLA: (pua, ta) = (0.1, 300 ms)
  • Minimize staleness (don’t want customers to lose items)
    • Minimize pic (given tc)
Towards Probabilistic SLAs (2)
• Consistency SLA: goal is to
  • Meet a desired freshness probability (given a freshness interval)
  • Maximize the probability that the client receives an operation’s result within the timeout
• Example: Google search / Twitter search
  • Wants users to receive “recent” data as search results
  • Only 10% of results may be more than 5 min stale
    • SLA: (pic, tc) = (0.1, 5 min)
  • Minimize response time (fast response to query)
    • Minimize pua (given ta)
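To make the latency SLA concrete, a minimal sketch (with an invented latency trace) that checks an observed workload against (pua, ta) = (0.1, 300 ms):

```python
# Illustrative sketch: checking an observed latency trace against a
# latency SLA (pua, ta). The SLA is met if at most a pua fraction of
# operations exceed the timeout ta.

def meets_latency_sla(latencies_ms, pua, ta_ms):
    missed = sum(1 for lat in latencies_ms if lat > ta_ms)
    return missed / len(latencies_ms) <= pua

# Invented trace: 2 of 10 operations exceed 300 ms, blowing a 10% budget
trace_ms = [120, 80, 450, 95, 200, 310, 60, 150, 90, 110]
print(meets_latency_sla(trace_ms, pua=0.1, ta_ms=300))  # False (20% missed)
print(meets_latency_sla(trace_ms, pua=0.2, ta_ms=300))  # True
```

The consistency SLA check is symmetric: measure the fraction of reads that are not tc-fresh and compare it against pic.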
Meeting these SLAs: PCAP Systems
Control knobs and their effects:

Knob (increased)    Latency      Consistency
Read Delay          Degrades     Improves
Read Repair Rate    Unaffected   Improves
Consistency Level   Degrades     Improves

• Continuously adapt the control knobs to always satisfy the PCAP SLA
• Architecture: the PCAP system wraps a KV store (Cassandra, Riak), using adaptive control over these knobs to satisfy the PCAP SLA
• System assumptions:
  • A client sends its query to a coordinator server, which forwards it to replicas (answers take the reverse path)
  • There exist background mechanisms to bring stale replicas up to date
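The adaptive control loop can be sketched as follows. This is an assumed multiplicative-style controller for the read-delay knob, not the paper's exact algorithm, and all constants are illustrative:

```python
# Assumed sketch (not the paper's exact controller): the read-delay knob is
# raised when measured pic violates the consistency SLA (more delay =>
# fresher reads, per the knob table) and decayed when there is slack
# (less delay => better latency).

def adapt_read_delay(delay_ms, measured_pic, target_pic,
                     step_up=2.0, step_down=0.9, max_delay_ms=100.0):
    if measured_pic > target_pic:
        # SLA violated: multiplicative increase of read delay
        return min(max(delay_ms, 1.0) * step_up, max_delay_ms)
    # SLA met: decay read delay to recover latency
    return delay_ms * step_down

delay = 0.0
for measured in [0.20, 0.18, 0.12, 0.10, 0.09]:  # pic samples around SLA 0.135
    delay = adapt_read_delay(delay, measured, target_pic=0.135)
print(round(delay, 3))  # 2.916
```

The later geo-distributed results compare this kind of multiplicative controller against a PID controller, which oscillates less.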
Meeting Consistency SLA for PCAP Cassandra (pic = 0.135)
• Consistency always stays below the target SLA
• Setup
  • 9-server Emulab cluster: each server has 4 Xeon cores + 12 GB RAM
  • 100 Mbps Ethernet
  • YCSB workload (144 client threads)
  • Network delay: log-normal distribution [Benson 2010]; mean latency = 3 ms | 4 ms | 5 ms
Meeting Consistency SLA for PCAP Cassandra (pic = 0.135)
• Optimal envelopes under different network conditions (based on the PCAP theorems)
• The PCAP system satisfies the SLA and stays close to the optimal envelope
Geo-Distributed PCAP
• Network delay distribution jumps from N(20, sqrt(2)) to N(22, sqrt(2.2))
• Latency SLA met before and after the jump
• Consistency degrades after the delay jump
• Fast convergence initially, and after the delay jump
• Reduced oscillation compared to the multiplicative PCAP controller
Related Work
• Pileus/Tuba [Doug Terry et al.]
  • Utility-based SLAs
  • Focus on wide-area deployment
  • Can be used underneath our PCAP system (instead of our SLAs)
• Consistency metrics: PBS [Peter Bailis et al.]
  • Considers write end time (we consider write start time)
  • May not be able to define consistency for some read-write pairs (PCAP accommodates all combinations)
  • Can be used inside the PCAP system
• Approximate answers: Hadoop [ApproxHadoop], querying [BlinkDB], Bimodal Multicast
PCAP Summary
• The CAP Theorem motivated the NoSQL revolution
• But apps need freshness + fast responses, even under soft partitions
• We proposed
  • Probabilistic models for C, A, P
  • A probabilistic CAP theorem that generalizes the classical CAP theorem
  • A PCAP system that satisfies latency/consistency SLAs
  • Integrated into the Apache Cassandra and Riak KV stores
• Riak has expressed interest in incorporating these into their mainline code
Distributed Graph Processing and Checkpointing
• Checkpointing: proactively save state to persistent storage
• If there’s a failure, recover 100% of the state
• Used by:
  • PowerGraph [Gonzalez et al., OSDI 2012]
  • Giraph [Apache Giraph]
  • Distributed GraphLab [Low et al., VLDB 2012]
  • Hama [Seo et al., CloudCom 2010]
Checkpointing Bad
Graph Dataset   Vertex Count   Edge Count
CA-Road         1.96 M         2.77 M
Twitter         41.65 M        1.47 B
UK Web          105.9 M        3.74 B

8–31x increased per-iteration execution time with checkpointing enabled
Users Already Don’t Use (or Like) Checkpointing
• “While we could turn on checkpointing to handle some of these failures, in practice we choose to disable checkpointing.” [Ching et al. (Giraph @ Facebook), VLDB 2015]
• “Existing graph systems only support checkpoint-based fault tolerance, which most users leave disabled due to performance overhead.” [Gonzalez et al. (GraphX), OSDI 2014]
• “The choice of interval must balance the cost of constructing the checkpoint with the computation lost since the last checkpoint in the event of a failure.” [Low et al. (GraphLab), VLDB 2012]
• “Better performance can be obtained by balancing fault tolerance costs against that of a job restart.” [Low et al. (GraphLab), VLDB 2012]
Our Approach: Zorro
• No checkpointing; the common case is fast
• When a failure occurs, opportunistically scrounge state (from surviving servers) and continue the computation
• Exploits natural replication in distributed graph processing systems
  • A vertex’s data is present at its neighboring vertices
  • Each vertex is assigned to one server, and its neighbors are likely on other servers
• We get very high accuracy (95%+)
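The scrounging step can be sketched as follows. This is an illustrative toy, not Zorro's actual code; the data structures and values are invented:

```python
# Illustrative sketch: after a failure, surviving servers are queried for
# copies of the lost vertices' state. Because each vertex's value is cached
# at its neighbors (which likely live on other servers), most state is
# recoverable without any checkpoint.

def scrounge_state(lost_vertices, surviving_servers):
    """surviving_servers: list of dicts mapping vertex id -> last seen value."""
    recovered = {}
    for v in lost_vertices:
        for server in surviving_servers:
            if v in server:            # a neighbor replica survives here
                recovered[v] = server[v]
                break
    missing = [v for v in lost_vertices if v not in recovered]
    return recovered, missing          # missing vertices restart from defaults

# Invented scenario: server 0 fails; its vertices {1, 2, 3} are scrounged
# from servers 1 and 2, which cached neighbor values during message exchange
recovered, missing = scrounge_state(
    lost_vertices=[1, 2, 3],
    surviving_servers=[{2: 0.7, 5: 0.1}, {1: 0.4, 6: 0.2}])
print(recovered, missing)  # {1: 0.4, 2: 0.7} [3]
```

Vertices with no surviving replica (here, vertex 3) are re-initialized, which is what makes the recovered result approximate rather than exact.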
Natural Replication => Can Retrieve a Lot of State
• 87–95% of graph state is recoverable, even after half the servers fail
  • PowerGraph: 92–95%
  • LFGraph: 87–91%
Natural Replication => Low Inaccuracy

Algorithm                      PowerGraph   LFGraph
PageRank                       2 %          3 %
Single-Source Shortest Paths   0.0025 %     0.06 %
Connected Components           1.6 %        2.15 %
K-Core                         0.0054 %     1.4 %
Graph Coloring*                5.02 %       NA
Group-Source Shortest Paths*   0.84 %       NA
Triangle Count*                0 %          NA
Approximate Diameter*          0 %          NA
Takeaways
• Impossibility theorems and 100% correct answers are great
• But they entail
  • Inflexibility in design (NoSQL or SQL)
  • High overhead (checkpointing)
• Important to explore
  • Probabilistic tradeoffs and achievable envelopes
  • These lead to more flexibility in design
• Other applicable areas: stream processing, machine learning
DPRG: http://dprg.cs.uiuc.edu
Plug: MOOC on “Cloud Computing Concepts”
• Free course on Coursera
• Ran Feb–Apr 2015 with 120K+ students
• Next run: Spring 2016
• Covers distributed systems and algorithms used in cloud computing
• Free and open to everyone
• https://www.coursera.org/course/cloudcomputing
• Or search Google for “Cloud Computing Course” (click on the first link)
Backup Slides
PCAP Consistency Metric Is More Generic Than PBS

PCAP: A read is tc-fresh if it returns the value of a write that starts at most tc time before the read STARTS. W(1) and R(1) can overlap.

PBS: A read is tc-fresh if it returns the value of a write that starts at most tc time before the read ENDS. W(1) and R(1) cannot overlap.

[Figures: two timelines with writes W(1), W(2), read R(1), and freshness interval tc, illustrating the two definitions]
GeoPCAP: 2 Key Techniques
(1) Probabilistic composition rules
  • A client read arrives with an SLA at the local DC; each data-center offers a probabilistic consistency/latency model (Prob C1, L1), (Prob C2, L2), (Prob C3, L3)
  • These per-DC models are combined with a probabilistic WAN delay model (geo-delay Δ per DC) into a composed model (Prob CC, LC), which is compared against the SLA
  • Given a client C or L SLA:
    • QUICKEST: at least one DC satisfies the SLA
    • ALL: each DC satisfies the SLA
(2) Tune the geo-delay Δ using PID control
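Under an independence assumption across data-centers, plausible latency-composition rules for the two modes can be sketched as follows (an illustrative reconstruction, not necessarily the paper's exact rules; pua values are invented):

```python
from math import prod

# Illustrative composition of per-DC latency models, assuming independence.
# puas[i] = probability that data-center i misses the latency deadline.

def compose_quickest(puas):
    # QUICKEST: the composed read misses only if EVERY DC misses
    return prod(puas)

def compose_all(puas):
    # ALL: the composed read misses if ANY DC misses
    return 1 - prod(1 - p for p in puas)

puas = [0.1, 0.2, 0.05]
print(round(compose_quickest(puas), 6))  # 0.001
print(round(compose_all(puas), 3))       # 0.316
```

QUICKEST improves the composed probability as DCs are added, while ALL degrades it, which is why the two modes suit different client SLAs.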
CAP Theorem => NoSQL Revolution
• Conjectured: [Brewer 00]
• Proved: [Gilbert, Lynch 02]
• Kicked off the NoSQL revolution
• Abadi’s PACELC
  • If P, choose A or C
  • Else, choose L (latency) or C

[Figure: CAP triangle with vertices Consistency, Availability (Latency), and Partition-tolerance; RDBMSs (non-replicated) sit on the C-A side; Cassandra, Riak, Dynamo, and Voldemort on the A-P side; HBase, HyperTable, BigTable, and Spanner on the C-P side]