seer: leveraging big data to navigate the increasing complexity … · 2019-12-18 · yu gan,...
![Page 1: Seer: Leveraging Big Data to Navigate The Increasing Complexity … · 2019-12-18 · Yu Gan, MeghnaPancholi, DailunCheng, SiyuanHu, Yuan He and Christina Delimitrou Cornell University](https://reader034.vdocuments.net/reader034/viewer/2022042218/5ec469dee70ddc2d884049d4/html5/thumbnails/1.jpg)
Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He and Christina Delimitrou
Cornell University
HotCloud, July 9th, 2018
Seer: Leveraging Big Data to Navigate The Increasing Complexity of Cloud Debugging
Executive Summary
- Microservices put more pressure on performance predictability
  - Microservice dependencies → propagate & amplify QoS violations
  - Finding the culprit of a QoS violation is difficult
  - Post-QoS violation, returning to nominal operation is hard
- Anticipating QoS violations & identifying culprits
- Seer: Data-driven Performance Debugging for Microservices
  - Combines lightweight RPC-level distributed tracing with hardware monitoring
  - Leverages scalable deep learning to signal QoS violations with enough slack to apply corrective action
From Monoliths to Microservices
Motivation
- Advantages of microservices:
  - Ease & speed of code development & deployment
  - Security, error isolation
  - PL/framework heterogeneity
- Challenges of microservices:
  - Change server design assumptions
  - Complicate resource management → dependencies
  - Amplify tail-at-scale effects
  - More sensitive to performance unpredictability
  - No representative end-to-end apps with microservices
An End-to-End Suite for Cloud & IoT Microservices
- 4 end-to-end applications using popular open-source microservices → ~30-40 microservices per app
  - Social Network
  - Movie Reviewing/Renting/Streaming
  - E-commerce
  - Drone control service
- Programming languages and frameworks:
  - node.js, Python, C/C++, Java/JavaScript, Scala, PHP, and Go
  - Nginx, memcached, MongoDB, CockroachDB, Mahout, Xapian
  - Apache Thrift RPC, RESTful APIs
  - Docker containers
  - Lightweight RPC-level distributed tracing
Resource Management Implications
- Challenges of microservices:
  - Dependencies complicate resource management
  - Dependencies change over time → difficult for users to express
  - Amplify tail-at-scale effects
[Figure panels: Netflix, Twitter, Amazon, Movie Streaming]
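
A rough way to see why long dependency chains amplify tail effects is sketched below. It assumes, purely for illustration, that each of the $n$ microservices on a request's path misses its latency target independently with probability $p$. Both symbols and the independence assumption are not from the slides, and real dependencies are correlated, which is part of the difficulty the talk describes.

```latex
% Probability that a request touching n microservices hits at least one slow one,
% assuming independent per-service slowdowns with probability p (an assumption).
\[
  P(\text{request is slow}) \;=\; 1 - (1 - p)^{n}
\]
% Example: p = 0.01 and n = 30 (the low end of the ~30-40 services per app):
\[
  1 - 0.99^{30} \;\approx\; 0.26
\]
```

Under these assumptions, a 1% per-service chance of a slow response becomes roughly a 26% chance per end-to-end request.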
The Need for Proactive Performance Debugging
- Detecting QoS violations after they occur:
  - Unpredictable performance propagates through the system
  - Long time until return to nominal operation
  - Does not scale
Performance Implications
[Figure legend: CPU, Mem, Net, Disk, Queue]
Seer: Data-Driven Performance Debugging
- Leverage the massive amount of traces collected over time
  1. Apply online, practical data mining techniques that identify the culprit of an upcoming QoS violation
  2. Use per-server hardware monitoring to determine the cause of the QoS violation
  3. Take corrective action to prevent the QoS violation from occurring
- Need to predict 100s of msec to a few sec into the future
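
Predicting a violation "100s of msec to a few sec" ahead implies that historical traces are labeled with what happened shortly afterwards. The sketch below shows one way such labels could be derived from a latency time series. The function name, the fixed sampling interval, and the `qos_target_ms` and `horizon_steps` parameters are all hypothetical, and the slides do not describe Seer's actual labeling procedure.

```python
from typing import List


def label_upcoming_violations(
    tail_latency_ms: List[float],
    qos_target_ms: float,
    horizon_steps: int,
) -> List[bool]:
    """For each timestep t, report whether any of the next `horizon_steps`
    samples (t+1 .. t+horizon_steps) exceeds the QoS target."""
    labels = []
    for t in range(len(tail_latency_ms)):
        # Look only at the future window, not at the current sample.
        window = tail_latency_ms[t + 1 : t + 1 + horizon_steps]
        labels.append(any(sample > qos_target_ms for sample in window))
    return labels


# Hypothetical latency samples taken at a fixed interval (values are made up).
latencies = [4.0, 5.0, 4.5, 6.0, 12.0, 15.0, 5.0]
labels = label_upcoming_violations(latencies, qos_target_ms=10.0, horizon_steps=2)
print(labels)
```

A label of `True` at time t marks a point where a predictor would ideally raise a signal, leaving the horizon as slack for corrective action.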
Tracing Framework
- RPC-level tracing
- Based on Apache Thrift
- Timestamp start-end for each microservice
- Store in centralized DB (Cassandra)
- Record all requests → no sampling
- Overhead: <0.1% in throughput and <0.2% in tail latency
[Figure: tracing architecture; labels: Client, http, microservices, zTracer, Tracing Collector, Cassandra, Query Engine, WebUI, latency, Gantt charts]
[Figure: RPC timeline between uService K and uService K+1; labels: zTracer, TCP, Proc, RPC time TX, RPC time RX, TCP proc TX, TCP proc RX, App proc]
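
The idea of timestamping the start and end of every RPC, with no sampling, can be sketched as follows. This is a minimal illustration and not the tracer from the talk: `Span`, `traced_rpc`, `TRACE_STORE`, and the service names are hypothetical, and an in-memory list stands in for the centralized Cassandra store.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Span:
    request_id: str
    microservice: str
    start_ns: int
    end_ns: int

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.start_ns


# In-memory stand-in for the centralized trace store (hypothetical).
TRACE_STORE: List[Span] = []


@contextmanager
def traced_rpc(request_id: str, microservice: str) -> Iterator[None]:
    """Record start and end timestamps around one RPC handler."""
    start_ns = time.monotonic_ns()
    try:
        yield
    finally:
        # Every request is recorded; there is no sampling decision here.
        TRACE_STORE.append(
            Span(request_id, microservice, start_ns, time.monotonic_ns())
        )


def handle_request(request_id: str) -> None:
    # A frontend call that invokes one downstream microservice.
    with traced_rpc(request_id, "frontend"):
        with traced_rpc(request_id, "user-service"):
            time.sleep(0.001)  # placeholder for real work


handle_request("req-1")
for span in TRACE_STORE:
    print(span.microservice, span.duration_ns)
```

Because the inner call finishes first, its span is stored first, and the enclosing frontend span always covers at least the inner duration. This nesting is what a per-request Gantt chart would visualize.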
Deep Learning to the Rescue
- Why?
  - Architecture-agnostic
  - Adjusts to changes in dependencies over time
  - High accuracy, good scalability
  - Inference within the required window
DNN Configuration
- Input signal
  - Container utilization
  - Latency
  - Queue depth
- Output signal
  - Which microservice will cause a QoS violation in the near future?
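
To make the input/output framing concrete, the sketch below runs a tiny feed-forward network over per-microservice features and produces one probability per candidate culprit. It is an illustration under stated assumptions, not the network from the talk: the layer sizes, the extra "no violation" class, the feature normalization, and the random (untrained) weights are all hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    shift = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


NUM_SERVICES = 4          # hypothetical; the apps in the talk have ~30-40
FEATURES_PER_SERVICE = 3  # utilization, latency, queue depth
HIDDEN = 8                # hypothetical hidden-layer width

rng = random.Random(0)
w1 = random_matrix(HIDDEN, NUM_SERVICES * FEATURES_PER_SERVICE, rng)
b1 = [0.0] * HIDDEN
w2 = random_matrix(NUM_SERVICES + 1, HIDDEN, rng)  # extra class: "no violation"
b2 = [0.0] * (NUM_SERVICES + 1)


def predict(features: Vector) -> Vector:
    """Map a feature snapshot to a probability per candidate culprit."""
    return softmax(dense(relu(dense(features, w1, b1)), w2, b2))


# One snapshot: [utilization, latency, queue depth] per service, scaled to [0, 1].
snapshot = [0.3, 0.2, 0.1, 0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.5, 0.4, 0.1]
probs = predict(snapshot)
likely_culprit = max(range(len(probs)), key=probs.__getitem__)
print(likely_culprit, probs)
```

With untrained weights the prediction is meaningless. The point is only the shape of the problem: a fixed-length feature vector in, a distribution over microservices out.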
DNN Configuration
- Training once: slow (hours to days)
  - Across load levels, load distributions, request types
  - Distributed queue traces, annotated with QoS violations
  - Weight/bias inference with SGD
  - Retraining in the background
- Inference continuously: streaming trace data
- 93% accuracy in signaling upcoming QoS violations
- 91% accuracy in attributing QoS violation to correct microservice
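
The "weight/bias inference with SGD" step can be illustrated with a deliberately small stand-in: a single-layer logistic classifier trained by per-sample stochastic gradient descent on synthetic data. This is not the deep network from the talk. The two features, the labeling rule, the learning rate, and the epoch count are made-up assumptions chosen only to show the update rule.

```python
import math
import random
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (features, label); label 1 = violation follows


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(data: List[Sample], w: List[float], b: float) -> float:
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)


def sgd_train(
    data: List[Sample], epochs: int = 20, lr: float = 0.5, seed: int = 0
) -> Tuple[List[float], float]:
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    b = 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            x, y = data[i]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Synthetic data: a violation "follows" when queue depth plus utilization is high.
data_rng = random.Random(42)
data: List[Sample] = []
for _ in range(200):
    queue_depth, utilization = data_rng.random(), data_rng.random()
    data.append(([queue_depth, utilization], int(queue_depth + utilization > 1.0)))

initial_loss = mean_log_loss(data, [0.0, 0.0], 0.0)
weights, bias = sgd_train(data)
final_loss = mean_log_loss(data, weights, bias)
print(initial_loss, final_loss)
```

Background retraining, as mentioned on the slide, would amount to rerunning a loop like this on newer traces while the previous weights keep serving inference.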
DNN Configuration
- Challenges:
  - In large clusters, inference is too slow to prevent QoS violations
  - Offload to TPUs: 10-100x improvement; 10 ms for 90th-percentile inference
  - Fast enough for most corrective actions to take effect (network bandwidth partitioning, RAPL, cache partitioning, scale-up/out, etc.)
- Accuracy stable or increasing with cluster size
Experimental Setup
- 40 dedicated servers
- ~1000 single-concerned containers
- Machine utilization 80-85%
- Inject interference to cause QoS violation
  - Using microbenchmarks (CPU, cache, memory, network, disk I/O)
Restoring QoS
- Identify cause of QoS violation
  - Private cluster: performance counters & utilization monitors
  - Public cluster: contentious microbenchmarks
- Adjust resource allocation
  - RAPL (fine-grain DVFS) & scale-up for CPU contention
  - Cache partitioning (CAT) for cache contention
  - Memory capacity partitioning for memory contention
  - Network bandwidth partitioning (HTB) for net contention
  - Storage bandwidth partitioning for I/O contention
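
The mapping from contention type to corrective action listed above is essentially a lookup table. The sketch below encodes it that way. The enum, function name, and example service names are hypothetical, and the actions are returned as descriptive strings rather than real calls into RAPL, CAT, or HTB tooling.

```python
from enum import Enum
from typing import Dict, List


class Contention(Enum):
    CPU = "cpu"
    CACHE = "cache"
    MEMORY = "memory"
    NETWORK = "network"
    IO = "io"


# Contention type → corrective actions, as listed on the slide.
CORRECTIVE_ACTIONS: Dict[Contention, List[str]] = {
    Contention.CPU: ["RAPL (fine-grain DVFS)", "scale-up"],
    Contention.CACHE: ["cache partitioning (CAT)"],
    Contention.MEMORY: ["memory capacity partitioning"],
    Contention.NETWORK: ["network bandwidth partitioning (HTB)"],
    Contention.IO: ["storage bandwidth partitioning"],
}


def plan_corrective_actions(microservice: str, cause: Contention) -> List[str]:
    """Return human-readable actions for the predicted culprit and its cause."""
    return [f"{microservice}: apply {action}" for action in CORRECTIVE_ACTIONS[cause]]


plan = plan_corrective_actions("memcached-cache", Contention.CACHE)
print(plan)
```

Keeping the policy in a table separates "what is contended" (from hardware monitoring) from "what to do about it", so either side can change independently.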
Restoring QoS
- Post-detection, baseline system → dropped requests
- Post-detection, Seer → maintain nominal performance
Demo
[Figure legend: CPU, Mem, Net, Disk, Queue]
Challenges Ahead
- Security implications of data-driven approaches
- Fall-back mechanisms when ML goes wrong
- Not a single-layer solution → predictability needs vertical approaches
Thank you!
Serverless microservices IoT swarms