SPANStore: Cost-Effective Geo-Replicated Storage Spanning Multiple Cloud Services
Zhe Wu, Michael Butkiewicz, Dorian Perkins, Ethan Katz-Bassett, Harsha V. Madhyastha
UC Riverside and USC
• 2
Geo-distributed Services for Low Latency
• 3
Cloud Services Simplify Geo-distribution
• 4
Need for Geo-Replication
Data uploaded by a user may be viewed/edited by users in other locations
• Social networking (Facebook, Twitter)
• File sharing (Dropbox, Google Docs)
Geo-replication of data is necessary
Isolated storage service in each cloud data center
Application needs to handle replication itself
• 5
Geo-replication on Cloud Services
Lots of recent work on enabling geo-replication
• Walter (SOSP’11), COPS (SOSP’11), Spanner (OSDI’12), Gemini (OSDI’12), Eiger (NSDI’13)…
• Faster performance or stronger consistency
Added consideration on cloud services: minimizing cost
• 6
Outline
Problem and motivation
SPANStore overview
Techniques for reducing cost
Evaluation
• 7
SPANStore
Key-value store (GET/PUT interface) spanning cloud storage services
Main objective: minimize cost
Satisfy application requirements
• Latency SLOs
• Consistency (eventual vs. sequential consistency)
• Fault tolerance
• 8
SPANStore Overview
[Figure: an application sends a request to the SPANStore library; the library performs metadata lookups, reads/writes data based on the optimal replication policy across data centers A, B, C and D, and returns data/ACK to the application.]
• 9
SPANStore Overview
[Figure: SPANStore runs in data centers A, B, C and D alongside application instances. The Placement Manager takes as input the workload, the SPANStore characterization (inter-DC latencies, pricing policies) and the application input (latency, consistency and fault tolerance requirements), and produces the replication policy.]
• 10
Questions to be addressed for every object:
• Where to store replicas
• How to execute PUTs and GETs
• 12
Cloud Storage Service Cost
Storage cost (the amount of data stored)
+ Request cost (the number of PUT and GET requests issued)
+ Data transfer cost (the amount of data transferred out of data center)
= Storage service cost
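The three-part cost breakdown above can be written as a small function. This is a minimal sketch: the `Prices` class, the function name, and every number in the example are hypothetical and are not taken from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Per-unit prices for one storage service (hypothetical inputs)."""
    storage_per_gb: float    # $ per GB stored
    transfer_per_gb: float   # $ per GB transferred out of the data center
    get_per_request: float   # $ per GET request
    put_per_request: float   # $ per PUT request


def storage_service_cost(prices, gb_stored, gb_out, num_gets, num_puts):
    """Sum the storage, request and data transfer components."""
    storage_cost = prices.storage_per_gb * gb_stored
    request_cost = (prices.get_per_request * num_gets
                    + prices.put_per_request * num_puts)
    transfer_cost = prices.transfer_per_gb * gb_out
    return storage_cost + request_cost + transfer_cost


# Made-up prices and workload, for illustration only.
example = Prices(storage_per_gb=0.10, transfer_per_gb=0.10,
                 get_per_request=0.000001, put_per_request=0.00001)
total = storage_service_cost(example, gb_stored=100, gb_out=50,
                             num_gets=1_000_000, num_puts=100_000)
print(f"total cost: ${total:.2f}")
```

Which component dominates depends on object size and request mix, which is why the later slides treat big-object and small-object workloads separately.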
• 13
Low Latency SLO Requires High Replication in Single Cloud Deployment
[Figure: replica (R) placement on a map; labels: latency bound = 100ms, AWS regions]
• 14
Technique 1: Harness Multiple Clouds
[Figure: replica (R) placement on a map; labels: latency bound = 100ms, AWS regions]
• 15
Price Discrepancies across Clouds
Cloud region   Storage price ($/GB)   Data transfer price ($/GB)   GET request price ($/10,000 requests)   PUT request price ($/1,000 requests)
S3 US West     0.095                  0.12                         0.004                                   0.005
Azure Zone2    0.095                  0.19                         0.001                                   0.0001
GCS            0.085                  0.12                         0.01                                    0.01
…              …                      …                            …                                       …
Leveraging discrepancies judiciously can reduce cost
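To make the discrepancies concrete, the sketch below finds the cheapest service for each cost component using only the three rows listed in the table. The dictionary keys and the `cheapest` helper are illustrative names, not part of SPANStore.

```python
# Prices transcribed from the table: storage and transfer in $/GB,
# GET in $ per 10,000 requests, PUT in $ per 1,000 requests.
PRICES = {
    "S3 US West":  {"storage": 0.095, "transfer": 0.12, "get_10k": 0.004, "put_1k": 0.005},
    "Azure Zone2": {"storage": 0.095, "transfer": 0.19, "get_10k": 0.001, "put_1k": 0.0001},
    "GCS":         {"storage": 0.085, "transfer": 0.12, "get_10k": 0.01,  "put_1k": 0.01},
}


def cheapest(component):
    """Return the services (sorted by name) with the lowest price for a component."""
    low = min(p[component] for p in PRICES.values())
    return sorted(name for name, p in PRICES.items() if p[component] == low)


for component in ("storage", "transfer", "get_10k", "put_1k"):
    print(component, "->", cheapest(component))
```

Among these three rows no single service is cheapest on every component, so the best choice depends on whether a workload is dominated by storage, data transfer, GETs or PUTs.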
• 16
Range of Candidate Replication Policies
Strategy 1: single replica in cheapest storage cloud
High latencies
• 17
Range of Candidate Replication Policies
Strategy 2: few replicas to reduce latencies
High data transfer cost
• 18
Range of Candidate Replication Policies
Strategy 3: replicated everywhere
[Figure: a PUT propagated to replicas (R) in every data center]
High latencies & cost of PUTs
High storage cost
Optimal replication policy depends on:
1. Application requirements
2. Workload properties
• 19
High Variability of Individual Objects
Analyze predictability of Twitter workload
Estimate workload based on same hour in previous week
• 60% of hours have error higher than 50%
• 20% of hours have error higher than 100%
• Error can be as high as 1000%
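A minimal sketch of the same-hour-in-previous-week estimator follows. The slides do not define how error is measured, so relative error |predicted − actual| / actual is an assumption here, as are the function names and the synthetic trace.

```python
HOURS_PER_WEEK = 7 * 24


def relative_errors(hourly_counts):
    """Predict each hour's count as the count 168 hours earlier.

    Returns the relative error for every hour in the second week onward.
    Hours whose actual count is zero are skipped, since the relative
    error is undefined there (an assumption of this sketch).
    """
    errors = []
    for t in range(HOURS_PER_WEEK, len(hourly_counts)):
        actual = hourly_counts[t]
        predicted = hourly_counts[t - HOURS_PER_WEEK]
        if actual == 0:
            continue
        errors.append(abs(predicted - actual) / actual)
    return errors


def fraction_above(errors, threshold):
    """Fraction of hours whose error exceeds the threshold."""
    return sum(e > threshold for e in errors) / len(errors)


# Synthetic trace: a flat first week, then a second week that alternates
# between 10 and 4 requests per hour.
trace = [10] * HOURS_PER_WEEK + [10 if h % 2 == 0 else 4 for h in range(HOURS_PER_WEEK)]
errs = relative_errors(trace)
print(fraction_above(errs, 0.5))
```

The percentages on the slide are statistics of this kind, computed over individual objects in the Twitter workload.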
• 20
Technique 2: Aggregate Workload Prediction per Access Set
Observation: stability in aggregate workload
• Diurnal and weekly patterns
Classify objects by access set:
• Set of data centers from which object is accessed
Leverage application knowledge of sharing pattern
• Dropbox/Google Docs know users that share a file
• Facebook controls every user’s news feed
• 21
Technique 2: Aggregate Workload Prediction per Access Set
Aggregate workload is more stable and predictable
Estimate workload based on same hour in previous week
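Grouping by access set can be sketched as follows: objects accessed from the same set of data centers are pooled, and the prediction is made on the pooled counts. The function name, the input layout and the sample data are all hypothetical.

```python
from collections import defaultdict


def aggregate_by_access_set(object_access_sets, object_hourly_counts):
    """Sum hourly request counts over all objects that share an access set.

    object_access_sets: object id -> iterable of data center names
    object_hourly_counts: object id -> list of hourly request counts
    (all lists are assumed to have the same length)
    """
    totals = defaultdict(list)
    for obj, counts in object_hourly_counts.items():
        key = frozenset(object_access_sets[obj])
        if not totals[key]:
            totals[key] = [0] * len(counts)
        totals[key] = [a + b for a, b in zip(totals[key], counts)]
    return dict(totals)


access = {"o1": ["US-East", "EU"], "o2": ["EU", "US-East"], "o3": ["Asia"]}
counts = {"o1": [5, 0, 1], "o2": [0, 5, 4], "o3": [2, 2, 2]}
aggregated = aggregate_by_access_set(access, counts)
for access_set, series in aggregated.items():
    print(sorted(access_set), series)
```

In this toy data, objects `o1` and `o2` are individually bursty, but their shared access set sees a steady 5 requests per hour, which is the effect the slide describes.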
• 22
Optimizing Cost for GETs and PUTs
[Figure: a GET served from one of several replicas (R)]
Use cheap (request + data transfer) data centers
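One way to read this slide: among the replicas that can serve a GET within the latency SLO, pick the one with the lowest combined request and data transfer cost. The sketch below assumes that interpretation; the function name and all numbers are made up.

```python
def pick_get_replica(replicas, latency_ms, get_cost, slo_ms):
    """Return the cheapest replica among those reachable within the SLO.

    latency_ms: replica -> latency from the client's data center (ms)
    get_cost:   replica -> request + data transfer cost of one GET ($)
    """
    eligible = [r for r in replicas if latency_ms[r] <= slo_ms]
    if not eligible:
        raise ValueError("no replica satisfies the latency SLO")
    return min(eligible, key=lambda r: get_cost[r])


replicas = ["A", "B", "C"]
latency_ms = {"A": 40, "B": 80, "C": 150}
get_cost = {"A": 0.19, "B": 0.12, "C": 0.05}
chosen = pick_get_replica(replicas, latency_ms, get_cost, slo_ms=100)
print(chosen)
```

Replica `C` is the cheapest overall but misses the 100 ms bound, so the choice falls to `B` rather than the nearest replica `A`.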
• 23
Technique 3: Relay Propagation
Asynchronous propagation (no latency constraint)
[Figure: a PUT propagated to five replicas (R); data transfer prices shown: $0.25/GB, $0.19/GB, $0.2/GB, $0.19/GB, $0.12/GB]
• 24
Technique 3: Relay Propagation
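Asynchronous propagation (no latency constraint)
Synchronous propagation (bounded by latency SLO)
[Figure: a PUT propagated to five replicas (R); data transfer prices shown: $0.25/GB, $0.19/GB, $0.2/GB, $0.19/GB, $0.12/GB; one propagation path is marked "Violate SLO"]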
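The idea can be sketched under a simplified model: the data center that receives the PUT either sends the object to every other replica directly, or sends it once to a relay replica with cheaper outbound transfer, which forwards it to the rest. For synchronous propagation, a relay is allowed only if its two-hop latency stays within a bound. The data center names, the assignment of the figure's five prices to them, the latencies and the function names are all assumptions of this sketch.

```python
def propagation_cost(source, replicas, egress_price, size_gb, relay=None):
    """Cost of sending one object from `source` to every other replica.

    egress_price: data center -> $ per GB transferred out.
    """
    targets = [r for r in replicas if r != source]
    if relay is None:
        return len(targets) * size_gb * egress_price[source]
    others = [r for r in targets if r != relay]
    return (size_gb * egress_price[source]
            + len(others) * size_gb * egress_price[relay])


def best_plan(source, replicas, egress_price, size_gb,
              latency_ms=None, bound_ms=None):
    """Return (relay or None, cost) for the cheapest plan.

    If bound_ms is given (synchronous case), a relay is rejected when any
    source -> relay -> target path exceeds the bound. Direct propagation
    is assumed to satisfy the bound.
    """
    best = (None, propagation_cost(source, replicas, egress_price, size_gb))
    for relay in replicas:
        if relay == source:
            continue
        if bound_ms is not None:
            hops = [latency_ms[source][relay] + latency_ms[relay][t]
                    for t in replicas if t not in (source, relay)]
            if hops and max(hops) > bound_ms:
                continue
        cost = propagation_cost(source, replicas, egress_price, size_gb, relay)
        if cost < best[1]:
            best = (relay, cost)
    return best


dcs = ["A", "B", "C", "D", "E"]
egress = {"A": 0.25, "B": 0.19, "C": 0.20, "D": 0.19, "E": 0.12}
latency = {a: {b: 80 for b in dcs if b != a} for a in dcs}
latency["A"]["E"] = latency["E"]["A"] = 100  # hypothetical slower link

async_plan = best_plan("A", dcs, egress, size_gb=1.0)
sync_plan = best_plan("A", dcs, egress, size_gb=1.0,
                      latency_ms=latency, bound_ms=170)
print("asynchronous:", async_plan)
print("synchronous: ", sync_plan)
```

With these made-up inputs, asynchronous propagation relays through the cheapest data center `E`, while the synchronous case must fall back to a more expensive relay because the path through `E` would exceed the bound.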
• 25
Summary
Insights to reduce cost
• Multi-cloud deployment
• Use aggregate workload per access set
• Relay propagation
Placement manager uses ILP to combine insights
Other techniques
• Metadata management
• Two-phase locking protocol
• Asymmetric quorum sets
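The placement decision can be illustrated at toy scale with exhaustive search in place of an ILP solver. The sketch below is not the SPANStore formulation: it considers only storage and GET costs, ignores PUTs and consistency, and treats fault tolerance as a simple minimum replica count. All names and numbers are hypothetical.

```python
from itertools import combinations


def best_replica_set(candidates, clients, latency_ms, slo_ms,
                     storage_cost, get_cost, gets_per_client, min_replicas=1):
    """Enumerate replica sets and return the cheapest feasible (set, cost).

    A set is feasible if every client data center has at least one replica
    within the latency SLO. Each client reads from its cheapest eligible
    replica.
    """
    best_set, best_cost = None, float("inf")
    for k in range(min_replicas, len(candidates) + 1):
        for replica_set in combinations(candidates, k):
            total = sum(storage_cost[r] for r in replica_set)
            feasible = True
            for c in clients:
                eligible = [r for r in replica_set
                            if latency_ms[c][r] <= slo_ms]
                if not eligible:
                    feasible = False
                    break
                total += gets_per_client[c] * min(get_cost[r] for r in eligible)
            if feasible and total < best_cost:
                best_set, best_cost = set(replica_set), total
    return best_set, best_cost


candidates = ["X", "Y", "Z"]
clients = ["c1", "c2"]
latency_ms = {"c1": {"X": 30, "Y": 120, "Z": 60},
              "c2": {"X": 150, "Y": 40, "Z": 90}}
storage_cost = {"X": 1.0, "Y": 1.0, "Z": 3.0}
get_cost = {"X": 0.01, "Y": 0.01, "Z": 0.02}
gets = {"c1": 100, "c2": 100}

placement, cost = best_replica_set(candidates, clients, latency_ms, 100,
                                   storage_cost, get_cost, gets)
print(placement, cost)
```

Exhaustive search grows exponentially with the number of candidate data centers, which is one reason to express the problem as an ILP and hand it to a solver instead.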
• 27
Evaluation
Scenario
• Application is deployed on EC2
• SPANStore is deployed across S3, Azure and GCS
Simulations to evaluate cost savings
Deployment to verify application requirements
• Retwis
• ShareJS
• 28
Simulation Settings
Compare SPANStore against
• Replicate everywhere
• Single replica
• Single cloud deployment
Application requirements
• Sequential consistency
• PUT SLO: minimum SLO that replicate everywhere can satisfy
• GET SLO: minimum SLO that single replica can satisfy
• 29
SPANStore Enables Cost Savings across Disparate Workloads
#1: big objects, more GETs (lots of data transfers from replicas)
#2: big objects, more PUTs (lots of data transfers to replicas)
• Savings by relay propagation
• Savings by reducing data transfer
#3: small objects, more GETs (lots of GET requests)
• Savings by price discrepancy of GET request
#4: small objects, more PUTs (lots of PUT requests)
• Savings by price discrepancy of PUT request
• 30
Deployment Settings
Retwis
• Scale down Twitter workload
• GET: read timeline
• PUT: make post
• Insert: read follower’s timeline and append post to it
Requirements:
• Eventual consistency
• 90%ile PUT/GET SLO = 100ms
• 31
SPANStore Meets SLOs
[Figure: measured latencies compared against the SLOs; labels: SLO, 90%ile, Insert SLO]
Conclusions
SPANStore
• Minimize cost while satisfying latency, consistency and fault-tolerance requirements
Use multiple cloud providers for greater data center density and pricing discrepancies
Judiciously determine replication policy based on workload properties and application needs