TRANSCRIPT
Probabilistically Consistent
Indranil Gupta (Indy)
Department of Computer Science, UIUC
FuDiCo 2015
DPRG: http://dprg.cs.uiuc.edu
Joint Work With
• Muntasir Rahman (Graduating PhD Student)
• Luke Leslie, Lewis Tseng
• Mayank Pundir (MS, now at Facebook)
• Work funded by Air Force Research Labs/AFOSR, National Science Foundation, Google, Yahoo!, and Microsoft
Hard Choices in Extensible Distributed Systems
• Users in extensible distributed systems desire
  • Timeliness and correctness guarantees
• But these are at odds with…
  • Unpredictability
  • Network delays and failures
• The research community and industry often tend to translate this into hard choices in systems design
• Examples
  1. CAP Theorem: choice between consistency and availability (or latency)
    • Either relational databases or eventually consistent NoSQL stores
    • (Maybe a convergence now?)
  2. Always get 100% answers in computation engines (batch or stream)
    • Use checkpointing
Hard Choices… Can in Fact Be Probabilistic Choices!
• Many of these are in fact probabilistic choices
• One of the earliest examples: pbcast/Bimodal Multicast
• Examples
  1. CAP Theorem:
    • We derive a probabilistic CAP theorem that defines an achievable boundary between consistency and latency in any database system
    • We use this to incorporate probabilistic consistency and latency SLAs into Cassandra and Riak
  2. Always get 100% answers in computation engines (batch or stream)
    • In many systems, checkpointing results in 8–31x higher execution time!
    • We show that in systems like distributed graph processing systems
      • We can avoid checkpointing altogether
      • Instead, take a reactive approach: upon failure, reactively scrounge state (which is naturally replicated)
      • And achieve very high accuracy (95–99%)
Key-value/NoSQL Storage Systems
• Key-value/NoSQL stores: a $3.4B sector by 2018
• Distributed storage in the cloud
  • Netflix: video position (Cassandra)
  • Amazon: shopping cart (DynamoDB)
  • And many others
• Necessary API operations: get(key) and put(key, value)
  • And some extended operations, e.g., “CQL” in the Cassandra key-value store
Key-value/NoSQL Storage: Fast and Fresh
• Cloud clients expect both
  • Latency: low latency for all operations (reads/writes)
    • A 500 ms latency increase at Google.com costs a 20% drop in revenue
    • Each extra ms => $4M revenue loss
    • Long latency => user cognitive drift
  • Consistency: a read returns the value of one of the latest writes
    • Freshness of data means accurate tracking and higher user satisfaction
    • Most KV stores offer only weak consistency (eventual consistency)
    • Eventual consistency = if writes stop, all replicas converge, eventually
Hard vs. Soft Partitions
• The CAP Theorem considers hard partitions
• However, soft partitions may happen inside a data-center
  • Periods of elevated message delays
  • Periods of elevated loss rates
• Soft partitions are more frequent than hard partitions

[Figure: Data-center 1 (America) and Data-center 2 (Europe) separated by a hard partition across the WAN; congestion at ToR and core switches inside a data-center => soft partition]
Our work: From Impossibility to Possibility
• C -> Probabilistic C (Consistency)
• A -> Probabilistic A (Latency)
• P -> Probabilistic P (Partition Model)
• A probabilistic CAP theorem
• A system that validates how close we are to the achievable envelope
• (Goal is not: another consistency model, or NoSQL vs. New/Yes SQL)
[Figure: timeline with writes W(1), W(2) and read R(1); the freshness interval tc precedes the read]

Probabilistic Consistency (pic, tc)
• A read is tc-fresh if it returns the value of a write that starts at most tc time before the read
• pic is the likelihood that a read is NOT tc-fresh

Probabilistic Latency (pua, ta)
• pua is the likelihood that a read does NOT return an answer within ta time units

Probabilistic Partition (α, tp)
• α is the likelihood that a random path (client -> server -> client) has message delay exceeding tp time units
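As a concrete illustration of the (pic, tc) definition, here is a minimal sketch (not from the talk; the trace values are invented) that estimates pic from a log of writes and reads:

```python
# Illustrative sketch: estimating pic, the fraction of reads that are
# NOT tc-fresh, from a trace. A read is tc-fresh if the value it returns
# was written by a write starting at most tc time units before the read.

def is_tc_fresh(read_start, read_value, writes, tc):
    # writes: list of (start_time, value) pairs
    return any(read_start - w_start <= tc and w_val == read_value
               for w_start, w_val in writes)

def estimate_pic(reads, writes, tc):
    # reads: list of (start_time, returned_value) pairs
    stale = sum(1 for r_start, r_val in reads
                if not is_tc_fresh(r_start, r_val, writes, tc))
    return stale / len(reads)

writes = [(0.0, "a"), (5.0, "b")]
reads = [(6.0, "b"),   # returns the write from t=5: 1 time unit old, fresh
         (9.0, "a")]   # returns the write from t=0: 9 time units old, stale
print(estimate_pic(reads, writes, tc=3.0))  # 0.5
```

Note that the check uses the write's start time, which (as discussed later vs. PBS) lets a read overlap the write it returns.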
Probabilistic CAP (PCAP) Theorem: It is impossible to achieve both Probabilistic Consistency and Probabilistic Latency under Probabilistic Partitions if:

  tc + ta < tp  and  pua + pic < α

• Bad network -> high (α, tp)
• To get better consistency -> lower (pic, tc)
• To get better latency -> lower (pua, ta)

Full proof in our arXiv paper: http://arxiv.org/abs/1509.02464
Special case: Original CAP has α=1 and tp = ∞
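The impossibility condition can be checked mechanically. A minimal sketch (the parameter values are illustrative, not from the talk):

```python
import math

# The PCAP theorem's impossibility condition: SLAs (pic, tc) and (pua, ta)
# are jointly unachievable under partition model (alpha, tp) if
#   tc + ta < tp  AND  pua + pic < alpha
def pcap_impossible(pic, tc, pua, ta, alpha, tp):
    return (tc + ta < tp) and (pua + pic < alpha)

# Special case: classical CAP has alpha = 1 and tp = infinity, so perfect
# consistency and availability (all probabilities/timeouts zero) is impossible
print(pcap_impossible(pic=0.0, tc=0.0, pua=0.0, ta=0.0,
                      alpha=1.0, tp=math.inf))  # True

# Relaxing the SLAs enough (here, tc + ta exceeds tp) escapes the bound
print(pcap_impossible(pic=0.1, tc=0.3, pua=0.1, ta=0.3,
                      alpha=0.3, tp=0.2))  # False
```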
Towards Probabilistic SLAs
• Latency SLA: similar to latency SLAs already existing in industry
  • Meet a desired probability that the client receives an operation’s result within the timeout
  • Maximize the freshness probability within a given freshness interval
• Example: Amazon shopping cart
  • Doesn’t want to lose customers due to high latency
  • Only 10% of operations may take longer than 300 ms
    • SLA: (pua, ta) = (0.1, 300 ms)
  • Minimize staleness (don’t want customers to lose items)
    • Minimize pic (given tc)
Towards Probabilistic SLAs (2)
• Consistency SLA: goal is to
  • Meet a desired freshness probability (given a freshness interval)
  • Maximize the probability that the client receives an operation’s result within the timeout
• Example: Google search / Twitter search
  • Wants users to receive “recent” data as search results
  • Only 10% of results may be more than 5 min stale
    • SLA: (pic, tc) = (0.1, 5 min)
  • Minimize response time (fast response to query)
    • Minimize pua (given ta)
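To make the latency SLA concrete, a minimal sketch (with an invented latency trace) that checks an observed workload against (pua, ta) = (0.1, 300 ms):

```python
# Illustrative sketch: checking an observed latency trace against a
# latency SLA (pua, ta). The SLA is met if at most a pua fraction of
# operations exceed the timeout ta.

def meets_latency_sla(latencies_ms, pua, ta_ms):
    missed = sum(1 for lat in latencies_ms if lat > ta_ms)
    return missed / len(latencies_ms) <= pua

# Invented trace: 2 of 10 operations exceed 300 ms, blowing a 10% budget
trace_ms = [120, 80, 450, 95, 200, 310, 60, 150, 90, 110]
print(meets_latency_sla(trace_ms, pua=0.1, ta_ms=300))  # False (20% missed)
print(meets_latency_sla(trace_ms, pua=0.2, ta_ms=300))  # True
```

The consistency SLA check is symmetric: measure the fraction of reads that are not tc-fresh and compare it against pic.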
Meeting these SLAs: PCAP Systems
Control knobs and their effects:

Knob (increased)    Latency      Consistency
Read Delay          Degrades     Improves
Read Repair Rate    Unaffected   Improves
Consistency Level   Degrades     Improves

• Continuously adapt the control knobs to always satisfy the PCAP SLA
• Architecture: the PCAP system wraps a KV store (Cassandra, Riak), using adaptive control over these knobs to satisfy the PCAP SLA
• System assumptions:
  • A client sends its query to a coordinator server, which forwards it to replicas (answers take the reverse path)
  • There exist background mechanisms to bring stale replicas up to date
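The adaptive control loop can be sketched as follows. This is an assumed multiplicative-style controller for the read-delay knob, not the paper's exact algorithm, and all constants are illustrative:

```python
# Assumed sketch (not the paper's exact controller): the read-delay knob is
# raised when measured pic violates the consistency SLA (more delay =>
# fresher reads, per the knob table) and decayed when there is slack
# (less delay => better latency).

def adapt_read_delay(delay_ms, measured_pic, target_pic,
                     step_up=2.0, step_down=0.9, max_delay_ms=100.0):
    if measured_pic > target_pic:
        # SLA violated: multiplicative increase of read delay
        return min(max(delay_ms, 1.0) * step_up, max_delay_ms)
    # SLA met: decay read delay to recover latency
    return delay_ms * step_down

delay = 0.0
for measured in [0.20, 0.18, 0.12, 0.10, 0.09]:  # pic samples around SLA 0.135
    delay = adapt_read_delay(delay, measured, target_pic=0.135)
print(round(delay, 3))  # 2.916
```

The later geo-distributed results compare this kind of multiplicative controller against a PID controller, which oscillates less.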
Meeting Consistency SLA for PCAP Cassandra (pic = 0.135)
• Consistency always stays below the target SLA
• Setup
  • 9-server Emulab cluster: each server has 4 Xeon cores + 12 GB RAM
  • 100 Mbps Ethernet
  • YCSB workload (144 client threads)
  • Network delay: log-normal distribution [Benson 2010]; mean latency = 3 ms | 4 ms | 5 ms
Meeting Consistency SLA for PCAP Cassandra (pic = 0.135)
• Optimal envelopes under different network conditions (based on the PCAP theorems)
• The PCAP system satisfies the SLA and stays close to the optimal envelope
Geo-Distributed PCAP
• Network delay distribution jumps from N(20, sqrt(2)) to N(22, sqrt(2.2))
• Latency SLA met before and after the jump
• Consistency degrades after the delay jump
• Fast convergence initially, and after the delay jump
• Reduced oscillation compared to the multiplicative PCAP controller
Related Work
• Pileus/Tuba [Doug Terry et al.]
  • Utility-based SLAs
  • Focus on wide-area deployment
  • Can be used underneath our PCAP system (instead of our SLAs)
• Consistency metrics: PBS [Peter Bailis et al.]
  • Considers write end time (we consider write start time)
  • May not be able to define consistency for some read-write pairs (PCAP accommodates all combinations)
  • Can be used inside the PCAP system
• Approximate answers: Hadoop [ApproxHadoop], querying [BlinkDB], Bimodal Multicast
PCAP Summary
• The CAP Theorem motivated the NoSQL revolution
• But apps need freshness + fast responses, even under soft partitions
• We proposed
  • Probabilistic models for C, A, P
  • A probabilistic CAP theorem that generalizes the classical CAP theorem
  • A PCAP system that satisfies latency/consistency SLAs
  • Integrated into the Apache Cassandra and Riak KV stores
• Riak has expressed interest in incorporating these into their mainline code
Distributed Graph Processing and Checkpointing
• Checkpointing: proactively save state to persistent storage
• If there’s a failure, recover 100% of the state
• Used by:
  • PowerGraph [Gonzalez et al., OSDI 2012]
  • Giraph [Apache Giraph]
  • Distributed GraphLab [Low et al., VLDB 2012]
  • Hama [Seo et al., CloudCom 2010]
Checkpointing Bad
Graph Dataset   Vertex Count   Edge Count
CA-Road         1.96 M         2.77 M
Twitter         41.65 M        1.47 B
UK Web          105.9 M        3.74 B

8–31x increased per-iteration execution time with checkpointing enabled
Users Already Don’t Use (or Like) Checkpointing
• “While we could turn on checkpointing to handle some of these failures, in practice we choose to disable checkpointing.” [Ching et al. (Giraph @ Facebook), VLDB 2015]
• “Existing graph systems only support checkpoint-based fault tolerance, which most users leave disabled due to performance overhead.” [Gonzalez et al. (GraphX), OSDI 2014]
• “The choice of interval must balance the cost of constructing the checkpoint with the computation lost since the last checkpoint in the event of a failure.” [Low et al. (GraphLab), VLDB 2012]
• “Better performance can be obtained by balancing fault tolerance costs against that of a job restart.” [Low et al. (GraphLab), VLDB 2012]
Our Approach: Zorro
• No checkpointing; the common case is fast
• When a failure occurs, opportunistically scrounge state (from surviving servers) and continue the computation
• Exploits natural replication in distributed graph processing systems
  • A vertex’s data is present at its neighboring vertices
  • Each vertex is assigned to one server, and its neighbors are likely on other servers
• We get very high accuracy (95%+)
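The scrounging step can be sketched as follows. This is an illustrative toy, not Zorro's actual code; the data structures and values are invented:

```python
# Illustrative sketch: after a failure, surviving servers are queried for
# copies of the lost vertices' state. Because each vertex's value is cached
# at its neighbors (which likely live on other servers), most state is
# recoverable without any checkpoint.

def scrounge_state(lost_vertices, surviving_servers):
    """surviving_servers: list of dicts mapping vertex id -> last seen value."""
    recovered = {}
    for v in lost_vertices:
        for server in surviving_servers:
            if v in server:            # a neighbor replica survives here
                recovered[v] = server[v]
                break
    missing = [v for v in lost_vertices if v not in recovered]
    return recovered, missing          # missing vertices restart from defaults

# Invented scenario: server 0 fails; its vertices {1, 2, 3} are scrounged
# from servers 1 and 2, which cached neighbor values during message exchange
recovered, missing = scrounge_state(
    lost_vertices=[1, 2, 3],
    surviving_servers=[{2: 0.7, 5: 0.1}, {1: 0.4, 6: 0.2}])
print(recovered, missing)  # {1: 0.4, 2: 0.7} [3]
```

Vertices with no surviving replica (here, vertex 3) are re-initialized, which is what makes the recovered result approximate rather than exact.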
Natural Replication => Can Retrieve a Lot of State
• 87–95% of graph state is recoverable, even after half the servers fail
  • PowerGraph: 92–95%
  • LFGraph: 87–91%
Natural Replication => Low Inaccuracy

Algorithm                      PowerGraph   LFGraph
PageRank                       2 %          3 %
Single-Source Shortest Paths   0.0025 %     0.06 %
Connected Components           1.6 %        2.15 %
K-Core                         0.0054 %     1.4 %
Graph Coloring*                5.02 %       NA
Group-Source Shortest Paths*   0.84 %       NA
Triangle Count*                0 %          NA
Approximate Diameter*          0 %          NA
Takeaways
• Impossibility theorems and 100% correct answers are great
• But they entail
  • Inflexibility in design (NoSQL or SQL)
  • High overhead (checkpointing)
• Important to explore
  • Probabilistic tradeoffs and achievable envelopes
  • These lead to more flexibility in design
• Other applicable areas: stream processing, machine learning
DPRG: http://dprg.cs.uiuc.edu
Plug: MOOC on “Cloud Computing Concepts”
• Free course on Coursera
• Ran Feb–Apr 2015 with 120K+ students
• Next run: Spring 2016
• Covers distributed systems and algorithms used in cloud computing
• Free and open to everyone
• https://www.coursera.org/course/cloudcomputing
• Or search Google for “Cloud Computing Course” (click on the first link)
Backup Slides
PCAP Consistency Metric Is More Generic Than PBS

PCAP: A read is tc-fresh if it returns the value of a write that starts at most tc time before the read STARTS. W(1) and R(1) can overlap.

PBS: A read is tc-fresh if it returns the value of a write that starts at most tc time before the read ENDS. W(1) and R(1) cannot overlap.

[Figures: two timelines with writes W(1), W(2), read R(1), and freshness interval tc, illustrating the two definitions]
GeoPCAP: 2 Key Techniques
(1) Probabilistic composition rules
  • A client read arrives with an SLA at the local DC; each data-center offers a probabilistic consistency/latency model (Prob C1, L1), (Prob C2, L2), (Prob C3, L3)
  • These per-DC models are combined with a probabilistic WAN delay model (geo-delay Δ per DC) into a composed model (Prob CC, LC), which is compared against the SLA
  • Given a client C or L SLA:
    • QUICKEST: at least one DC satisfies the SLA
    • ALL: each DC satisfies the SLA
(2) Tune the geo-delay Δ using PID control
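Under an independence assumption across data-centers, plausible latency-composition rules for the two modes can be sketched as follows (an illustrative reconstruction, not necessarily the paper's exact rules; pua values are invented):

```python
from math import prod

# Illustrative composition of per-DC latency models, assuming independence.
# puas[i] = probability that data-center i misses the latency deadline.

def compose_quickest(puas):
    # QUICKEST: the composed read misses only if EVERY DC misses
    return prod(puas)

def compose_all(puas):
    # ALL: the composed read misses if ANY DC misses
    return 1 - prod(1 - p for p in puas)

puas = [0.1, 0.2, 0.05]
print(round(compose_quickest(puas), 6))  # 0.001
print(round(compose_all(puas), 3))       # 0.316
```

QUICKEST improves the composed probability as DCs are added, while ALL degrades it, which is why the two modes suit different client SLAs.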
CAP Theorem => NoSQL Revolution
• Conjectured: [Brewer 00]
• Proved: [Gilbert, Lynch 02]
• Kicked off the NoSQL revolution
• Abadi’s PACELC
  • If P, choose A or C
  • Else, choose L (latency) or C

[Figure: CAP triangle with vertices Consistency, Availability (Latency), and Partition-tolerance; RDBMSs (non-replicated) sit on the C-A side; Cassandra, Riak, Dynamo, and Voldemort on the A-P side; HBase, HyperTable, BigTable, and Spanner on the C-P side]