C3: Cutting Tail Latency in Cloud Data Stores via
Adaptive Replica Selection
Lalith Suresh (TU Berlin)
with Marco Canini (UCL), Stefan Schmid, Anja Feldmann (TU Berlin)
2
One User Request
Tens to thousands of data accesses
Tail-latency matters
3
With 100 leaf servers, 63% of user requests will see at least one access at or beyond the 99th percentile latency!
One User Request
Tail-latency matters
Tens to thousands of data accesses
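The 63% figure can be reproduced with a short calculation. This is a minimal sketch that assumes each of the 100 accesses independently has a 1% chance of landing in a server's latency tail; the function name is illustrative.

```python
def fraction_hitting_tail(fanout: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `fanout` independent accesses
    is slower than the given per-server latency percentile.

    Independence between accesses is an assumption of this sketch.
    """
    return 1.0 - percentile ** fanout


# A user request that fans out to 100 leaf servers.
print(f"{fraction_hitting_tail(100):.1%}")
```

With a fan-out of 1 the same function returns 1%, which is why the tail only starts to dominate once a request touches many servers.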
4
Server performance fluctuations are the norm
Queueing delays
Skewed access patterns
CDF
Resource contention
Background activities
Effectiveness of replica selection in reducing tail latency?
5
[Figure: a client with a request must choose among three replica servers]
Replica Selection Challenges
6
Replica Selection Challenges
• Service-time variations
7
[Figure: a client with a request and three servers with service times of 4 ms, 5 ms, and 30 ms]
Replica Selection Challenges
• Herd behavior and load oscillations
8
[Figure: three clients, each with a request, choosing among the same three servers]
9
Impact of Replica Selection in Practice?
Dynamic Snitching
Uses history of read latencies and I/O load for replica selection
10
Experimental Setup
• Cassandra cluster on Amazon EC2
• 15 nodes, m1.xlarge instances
• Read-heavy workload with YCSB (120 threads)
• 500M 1KB records (larger than memory)
• Zipfian key access pattern
11
Cassandra Load Profile
12
Also observed that 99.9th percentile latency ~ 10x median latency
Cassandra Load Profile
13
Load Conditioning in our Approach
C3: an adaptive replica selection mechanism that is robust to service-time heterogeneity
14
C3
• Replica Ranking
• Distributed Rate Control
15
C3
• Replica Ranking
• Distributed Rate Control
16
17
[Figure: three clients and two servers with mean service times $\mu^{-1} = 2$ ms and $\mu^{-1} = 6$ ms]
18
Balance the product of queue size and service time, $q \cdot \mu^{-1}$
[Figure: three clients and two servers with mean service times $\mu^{-1} = 2$ ms and $\mu^{-1} = 6$ ms]
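A minimal sketch of this balancing rule, assuming the client already knows each server's queue size and mean service time. The `Replica` class, the function names, and the queue sizes are hypothetical; only the 2 ms and 6 ms service times come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queue_size: int          # q: requests queued at the server
    service_time_ms: float   # mu^-1: mean service time per request


def expected_wait_ms(replica: Replica) -> float:
    """The product q * mu^-1 that the slide proposes to balance."""
    return replica.queue_size * replica.service_time_ms


def pick_replica(replicas: list[Replica]) -> Replica:
    """Choose the replica with the smallest q * mu^-1."""
    return min(replicas, key=expected_wait_ms)


# Queue sizes below are made up for illustration.
fast = Replica("fast", queue_size=6, service_time_ms=2.0)   # 6 * 2 = 12
slow = Replica("slow", queue_size=3, service_time_ms=6.0)   # 3 * 6 = 18
chosen = pick_replica([fast, slow])
```

Under this rule the faster server is preferred even though its queue is twice as long, because its queued work drains sooner.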
19
Server-side Feedback
Servers piggyback $q_s$ and $\mu_s^{-1}$ in every response
[Figure: the server returns $\{ q_s, \mu_s^{-1} \}$ to the client with each response]
20
Server-side Feedback
• Concurrency compensation
Servers piggyback $q_s$ and $\mu_s^{-1}$ in every response
21
Server-side Feedback
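• Concurrency compensation
$\hat{q}_s = 1 + os_s \cdot w + q_s$
where $os_s$ is the client's number of outstanding requests to server $s$ and $q_s$ is the queue size from server feedback
Servers piggyback $q_s$ and $\mu_s^{-1}$ in every response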
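The concurrency-compensation estimate can be sketched as a one-line function. The symbol `w` is taken from the slide as given and is treated here as a plain weight; the function name and the example values are assumptions for illustration.

```python
def estimated_queue_size(outstanding: int, w: float, feedback_queue: float) -> float:
    """Client-side queue estimate: q_hat_s = 1 + os_s * w + q_s.

    outstanding    -- os_s, requests this client has in flight to server s
    w              -- weight applied to the outstanding requests (from the slide)
    feedback_queue -- q_s, queue size piggybacked by the server
    """
    return 1 + outstanding * w + feedback_queue


# Hypothetical values: 4 requests in flight, w = 3, server reported 10 queued.
q_hat = estimated_queue_size(outstanding=4, w=3, feedback_queue=10)
```

The estimate grows with the client's own in-flight requests, so a client that has just sent a burst to one server sees that server as busier before any new feedback arrives.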
22
Select server with min $\hat{q}_s \cdot \mu_s^{-1}$?
23
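Select server with min $\hat{q}_s \cdot \mu_s^{-1}$?
[Figure: two servers with $\mu^{-1} = 4$ ms and $\mu^{-1} = 20$ ms; 20 requests queued at the slower server balance against 100 requests at the faster one!]
• Potentially long queue sizes
• What if a GC pause happens?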
24
Penalizing Long Queues
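[Figure: the same two servers ($\mu^{-1} = 4$ ms and $\mu^{-1} = 20$ ms); 20 requests at the slower server now balance against 35 requests at the faster one]
Select server with min $(\hat{q}_s)^b \cdot \mu_s^{-1}$, with $b = 3$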
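The effect of the exponent can be checked numerically. This is a sketch using the slide's two servers (4 ms and 20 ms) and 20 requests at the slower one; the function names are illustrative.

```python
def score(q_hat: float, service_time_ms: float, b: int = 3) -> float:
    """Ranking score (q_hat ** b) * mu^-1; lower is better."""
    return (q_hat ** b) * service_time_ms


def matching_queue_size(slow_queue: float, slow_ms: float,
                        fast_ms: float, b: int) -> float:
    """Queue size at which the fast server's score equals the slow server's."""
    return (slow_queue ** b * slow_ms / fast_ms) ** (1.0 / b)


linear = matching_queue_size(slow_queue=20, slow_ms=20.0, fast_ms=4.0, b=1)
cubic = matching_queue_size(slow_queue=20, slow_ms=20.0, fast_ms=4.0, b=3)
print(f"b=1: {linear:.0f} requests, b=3: {cubic:.1f} requests")
```

With `b = 1` the faster server is preferred until its queue reaches 100 requests; with `b = 3` the balance point falls between 34 and 35 requests, in line with the slide's figure of 35.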
C3
• Replica Ranking
• Distributed Rate Control
25
26
Need for rate control
Replica ranking insufficient
• Avoid saturating individual servers?
• Non-internal sources of performance fluctuations?
27
Cubic Rate Control
• Clients adjust sending rates according to a cubic function
• If the receive rate stops increasing, decrease the sending rate multiplicatively
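A rough sketch of such a controller follows. The slide states only "cubic function" and "multiplicatively decrease", so the curve shape (borrowed from CUBIC-style congestion control), the constants `beta` and `gamma`, and all names are assumptions rather than the values used in C3.

```python
class CubicRateController:
    """Per-server sending-rate controller (illustrative sketch only)."""

    def __init__(self, initial_rate: float, beta: float = 0.2, gamma: float = 1e-4):
        self.rate = initial_rate              # current sending rate (req/s)
        self.rate_at_decrease = initial_rate  # rate just before the last decrease
        self.last_decrease_ms = 0.0
        self.prev_receive_rate = 0.0
        self.beta = beta                      # multiplicative decrease factor (assumed)
        self.gamma = gamma                    # steepness of the cubic curve (assumed)

    def _cubic(self, now_ms: float) -> float:
        """Cubic growth curve anchored at the last decrease.

        At dt = 0 it equals (1 - beta) * rate_at_decrease; it flattens near
        rate_at_decrease and then probes above it.
        """
        dt = now_ms - self.last_decrease_ms
        k = (self.beta * self.rate_at_decrease / self.gamma) ** (1.0 / 3.0)
        return self.gamma * (dt - k) ** 3 + self.rate_at_decrease

    def update(self, now_ms: float, receive_rate: float) -> float:
        if receive_rate > self.prev_receive_rate:
            # Server keeps up: follow the cubic curve upward.
            self.rate = self._cubic(now_ms)
        else:
            # Receive rate stalled: back off multiplicatively.
            self.rate_at_decrease = self.rate
            self.last_decrease_ms = now_ms
            self.rate = (1.0 - self.beta) * self.rate
        self.prev_receive_rate = receive_rate
        return self.rate
```

The cubic shape grows quickly when far below the last known good rate, slows as it approaches that rate, and only then probes for more capacity.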
28
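Putting everything together
[Figure: in the C3 client, a replica group scheduler sorts replicas by score and sends requests through per-server rate limiters (1000 req/s and 2000 req/s in the example); the servers return { Feedback } to the client]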
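How the two pieces might fit together can be sketched as follows: replicas are tried in score order, and a request goes to the first one whose rate limiter still has capacity. The fixed one-second window, the class and function names, and the behaviour when every limiter is exhausted are simplifying assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateLimiter:
    """Allows at most `rate_per_s` sends per one-second window (simplified)."""
    rate_per_s: int
    window_start: float = 0.0
    sent_in_window: int = 0

    def try_acquire(self, now: float) -> bool:
        if now - self.window_start >= 1.0:
            # A new window begins: reset the counter.
            self.window_start = now
            self.sent_in_window = 0
        if self.sent_in_window < self.rate_per_s:
            self.sent_in_window += 1
            return True
        return False


def schedule(scores: dict[str, float],
             limiters: dict[str, RateLimiter],
             now: float) -> Optional[str]:
    """Return the best-scored replica that is within its rate limit.

    Returns None when every replica is rate-limited, in which case the
    caller would hold the request until a limiter frees up.
    """
    for server in sorted(scores, key=scores.get):
        if limiters[server].try_acquire(now):
            return server
    return None
```

For example, with limits of 1 and 2 requests per second on two replicas, the first three requests in a window are spread across both replicas in score order and the fourth is held back.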
29
Implementation in Cassandra
Details in the paper!
30
Evaluation
Amazon EC2
Controlled Testbed
Simulations
31
Evaluation
Amazon EC2
• 15-node Cassandra cluster
• m1.xlarge instances
• Workloads generated using YCSB (120 threads)
• Read-heavy, update-heavy, read-only
• 500M 1KB records dataset (larger than memory)
• Compare against Cassandra’s Dynamic Snitching (DS)
32
Lower is better
33
2x – 3x improved 99.9th percentile latencies
Also improves median and mean latencies
34
2x – 3x improved 99.9th percentile latencies
26% - 43% improved throughput
35
Takeaway:
C3 does not trade off throughput for latency
36
How does C3 react to dynamic workload changes?
• Begin with 80 read-heavy workload generators
• 40 update-heavy generators join the system after 640s
• Observe latency profile with and without C3
37
Latency profile degrades gracefully with C3
Takeaway: C3 reacts effectively to dynamic workloads
38
Summary of other results
Higher system load
Skewed record sizes
SSDs instead of HDDs
> 3x better 99.9th percentile latency
50% higher throughput than with DS
39
Ongoing work
• Tests at SoundCloud and Spotify
• Stability analysis of C3
• Alternative rate adaptation algorithms
• Token-aware Cassandra clients
40
Summary
C3: Replica Ranking + Dist. Rate Control
[Figure: a client choosing among three replica servers]