![Page 1: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/1.jpg)
Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud
Coburn Watson
Manager, Cloud Performance, Netflix
Surge ’13
![Page 2: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/2.jpg)
Netflix, Inc.
• World's leading internet television network
• ~38 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
  • Increased Originals catalog
  • Large open source contribution
  • OpenConnect (homegrown CDN)
![Page 3: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/3.jpg)
About Me
• Manage Cloud Performance Engineering Team
  • Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
  • Large-scale billing applications, eCommerce, datacenter mgmt., etc.
  • Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• [email protected]
![Page 4: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/4.jpg)
Freedom and Responsibility
• Culture deck: a great read
• Good performers: 2x, Top performers: 10x
• What engineers dislike
  • cumbersome processes
  • deployment inefficiency
  • restricted access
  • restricted technical freedom
  • lack of trust
• If removed… maximize:
  • Engineering velocity
  • Engineer satisfaction
![Page 5: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/5.jpg)
Maximizing: Engineering Velocity
![Page 6: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/6.jpg)
How
• Implementation freedom
  • SCM, libraries, language
  • That said, platform benefits exist
• Deployment freedom
  • Service team owns
    • push schedule, functionality, performance
    • operational activities (being paged)
• On-demand cloud capacity
  • Thousands of instances at the push of a button
![Page 7: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/7.jpg)
Rapid Deployment?
Impossible..
3-6 Months?
![Page 8: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/8.jpg)
Rapid (Cloud) Deployment
3-5 Minutes
![Page 9: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/9.jpg)
BaseAMI
• Supply the foundation
  • Monitoring, Java, Apache, Tomcat, etc.
• Open source project: Aminator
![Page 10: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/10.jpg)
Pushing Code: Red-Black
• Gracefully roll code in, or out, of production
• Asgard is our AWS configuration mgmt. tool
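
A minimal sketch of the red-black idea, assuming two server groups behind one load balancer: the new group is brought up next to the old one, traffic is switched over, and the old group is left running but disabled so the push can be rolled back quickly. `ServerGroup`, `LoadBalancer`, and `red_black_push` are hypothetical names for illustration and are not Asgard's API.

```python
from dataclasses import dataclass, field


@dataclass
class ServerGroup:
    """Hypothetical stand-in for an auto scaling group running one build."""
    name: str
    build: str
    healthy: bool = True


@dataclass
class LoadBalancer:
    """Hypothetical load balancer that tracks which groups take traffic."""
    active: list = field(default_factory=list)

    def enable(self, group):
        if group not in self.active:
            self.active.append(group)

    def disable(self, group):
        if group in self.active:
            self.active.remove(group)


def red_black_push(lb, old, new):
    """Roll `new` into production; keep `old` idle so rollback is cheap."""
    lb.enable(new)            # both groups briefly serve traffic
    if not new.healthy:       # new build failed its checks: roll it back out
        lb.disable(new)
        return old
    lb.disable(old)           # old group stays up but takes no traffic
    return new


lb = LoadBalancer()
v1 = ServerGroup("api-v001", build="1.0")
lb.enable(v1)
v2 = ServerGroup("api-v002", build="1.1")
serving = red_black_push(lb, v1, v2)
print(serving.name, [g.name for g in lb.active])
```

Because the old group is only disabled, not deleted, rolling back amounts to re-enabling it and disabling the new one.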
![Page 11: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/11.jpg)
Not all Roses
Compounded risks with increased velocity
Risks: Decreased Reliability, Performance, and Scalability
![Page 12: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/12.jpg)
Goal: CI (Continuous Improvement)
![Page 13: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/13.jpg)
Maximizing: Reliability
![Page 14: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/14.jpg)
Fear (Revere) the Monkeys
• Simulate
  • Latency
  • Errors
• Initiate
  • Instance Termination
  • Availability Zone Failure
• Identify
  • Configuration Drift
… in Test and Production
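
To make the "simulate latency and errors" idea concrete, here is a small fault-injection wrapper. It is only a sketch of the general technique, not how the Netflix monkeys are implemented; `with_chaos`, `InjectedFailure`, and `fetch_title` are hypothetical names, and the error rate and delay are arbitrary.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised when the wrapper simulates a failed call."""


def with_chaos(func, error_rate=0.0, max_delay_s=0.0, rng=None,
               sleep=time.sleep):
    """Wrap `func` so that calls may be delayed or fail at random."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if max_delay_s > 0:
            sleep(rng.uniform(0, max_delay_s))    # simulated latency
        if rng.random() < error_rate:
            raise InjectedFailure("simulated error")
        return func(*args, **kwargs)

    return wrapper


def fetch_title(movie_id):
    """Hypothetical dependency call used only for the demo."""
    return f"title-{movie_id}"


flaky = with_chaos(fetch_title, error_rate=0.5, max_delay_s=0.01,
                   rng=random.Random(42))
outcomes = []
for movie_id in range(10):
    try:
        outcomes.append(flaky(movie_id))
    except InjectedFailure:
        outcomes.append(None)    # the caller must cope with the failure
print(outcomes)
```

Running callers against a wrapper like this in test shows whether they degrade gracefully before the same failure happens for real.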
![Page 15: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/15.jpg)
Tracking Change: Chronos
• Aggregate Significant Events *
• Current Sources:
  • Pushes (Asgard)
  • Production Change Requests (JIRA)
  • AWS Notifications
  • Dynamic Property Changes
  • ASG Scaling Events
• Implementation
  • Simple REST-service; customized adapters
* - “can disrupt production service”
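
The adapter approach can be sketched as follows: each source gets a small function that maps its own payload into one common event shape, and a single store answers "what changed in this window?". This is an in-memory illustration only; the `Event` fields, the adapter names, the payload keys, and the timestamp are all assumptions, and the real Chronos service is not shown in the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Common shape for a change that could disrupt production."""
    source: str
    service: str
    when: datetime
    summary: str


# Hypothetical adapters: each maps a source-specific payload to an Event.
def from_push(payload):
    return Event("asgard-push", payload["app"], payload["time"],
                 f"pushed {payload['build']}")


def from_scaling(payload):
    return Event("asg-scaling", payload["group"], payload["at"],
                 f"capacity {payload['old']} -> {payload['new']}")


class EventLog:
    """In-memory stand-in for the event store behind a REST service."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def between(self, start, end, service=None):
        """Return events in [start, end], oldest first, optionally by service."""
        hits = [e for e in self._events
                if start <= e.when <= end
                and (service is None or e.service == service)]
        return sorted(hits, key=lambda e: e.when)


now = datetime(2013, 9, 12, 20, 0)    # arbitrary example timestamp
log = EventLog()
log.record(from_scaling({"group": "api", "at": now - timedelta(minutes=5),
                         "old": 200, "new": 260}))
log.record(from_push({"app": "api", "time": now - timedelta(minutes=20),
                      "build": "1.1"}))
for event in log.between(now - timedelta(hours=1), now, service="api"):
    print(event.when.time(), event.source, event.summary)
```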
![Page 16: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/16.jpg)
Chronos, cont.
![Page 17: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/17.jpg)
Automated Canary Analysis
• Identify regression between new and existing code
• Point ACA to baseline (prod) and canary ASG
  • Typically analyze an hour's worth of time series data
  • Compare ratio of averages between canary and baseline
  • Evaluate range and noise; determine quality of signal
• Bucket: Hot, Cold, Noisy, or OK
  • Multiple classifiers available
  • Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
  • Constrained: along metric dimensions
  • Final: Score the canary
• Implementation: R-based analysis
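
The classification step can be sketched as below. The slides say the real analysis is R-based and offers multiple classifiers; this Python version is one simple possibility, with assumed thresholds (`tolerance`, `max_noise`), the coefficient of variation as an assumed noise measure, and a final score defined here as the percentage of metrics bucketed OK. It assumes strictly positive metric values.

```python
from statistics import mean, pstdev


def classify(baseline, canary, tolerance=0.2, max_noise=0.5):
    """Bucket one metric as HOT, COLD, NOISY, or OK.

    `tolerance` and `max_noise` are illustrative thresholds, not values
    from the talk.
    """
    base_avg, canary_avg = mean(baseline), mean(canary)
    # Coefficient of variation as a crude measure of signal quality.
    noise = max(pstdev(baseline) / base_avg, pstdev(canary) / canary_avg)
    if noise > max_noise:
        return "NOISY"
    ratio = canary_avg / base_avg          # ratio of averages
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score(buckets):
    """Final rollup: percentage of metrics classified OK."""
    return 100.0 * sum(b == "OK" for b in buckets.values()) / len(buckets)


# Made-up samples: (baseline series, canary series) per metric.
metrics = {
    "latency_ms": ([100, 102, 98, 101], [150, 148, 152, 151]),
    "cpu_pct": ([40, 41, 39, 40], [41, 40, 42, 40]),
    "errors_per_s": ([1, 9, 0.5, 12], [2, 1, 10, 8]),
    "rps": ([1000, 1010, 990, 1000], [600, 610, 590, 600]),
}
buckets = {name: classify(base, can) for name, (base, can) in metrics.items()}
print(buckets, score(buckets))
```

A constrained rollup would apply the same scoring to a subset of metrics grouped along one dimension before the final score is computed.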
![Page 18: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/18.jpg)
ACA: in Action
[Figure: per-metric classifications (HOT, COLD, NOISY, OK) with the constrained rollup (dashed) and the final rollup]
![Page 19: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/19.jpg)
Hystrix: Defend Your App
• Protection from downstream service failures
  • Functional (unavailable) or performance in nature
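
Hystrix itself is a Java library; the sketch below is not its API. It only illustrates, in Python, the circuit-breaker-with-fallback pattern that this kind of protection relies on: after repeated failures the caller stops invoking the dependency for a while and serves a fallback instead. `CircuitBreaker` and its parameters are hypothetical, and timeouts for slow (performance) failures are left out for brevity.

```python
import time


class CircuitBreaker:
    """Tiny circuit breaker: fail fast to a fallback after repeated errors."""

    def __init__(self, max_failures=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()          # open: skip the dependency
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                  # success closes the circuit
        return result


calls = {"n": 0}


def broken_dependency():
    """Hypothetical downstream call that always fails."""
    calls["n"] += 1
    raise TimeoutError("downstream service unavailable")


breaker = CircuitBreaker(max_failures=2)
results = [breaker.call(broken_dependency, lambda: "cached default")
           for _ in range(5)]
print(results, "dependency calls:", calls["n"])
```

Once the circuit opens, the failing dependency is no longer called, which protects both the caller's threads and the struggling downstream service.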
![Page 20: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/20.jpg)
Maximizing: Scalability and Performance
![Page 21: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/21.jpg)
Dynamic Scaling
EC2 footprint autoscales 2500-3500 instances per day
• order of tens of thousands of EC2 instances
• Larger ASG spans 200-900 m2.4xlarge daily
Why:
• Improved scalability during unexpected workloads
• Absorb variance in service performance profile
• Reactive chain of dependencies
• Creates "reserved instance troughs" for batch activity
![Page 22: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/22.jpg)
Dynamic Scaling, cont.
Example covers 3 services
• 2 edge (A,B), 1 mid-tier (C)
• C has more upstream services than simply A and B
Multiple Autoscaling Policies
• (A) System Load Average
• (B,C) Request-Rate based
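
The arithmetic behind a request-rate based policy can be sketched like this: pick a target request rate per instance and size the group so each instance stays near it, within fixed bounds. This is a simplification for illustration; real autoscaling policies are typically driven by alarms and step adjustments, and the function name and the 100 requests/second target are assumptions.

```python
import math


def desired_instances(total_rps, target_rps_per_instance,
                      min_size, max_size):
    """Size a group so each instance handles about the target request rate."""
    needed = math.ceil(total_rps / target_rps_per_instance)
    return max(min_size, min(max_size, needed))


# Hypothetical numbers: the 200-900 bounds echo the ASG range quoted on the
# previous slide; the 100 RPS per-instance target is invented for the demo.
for rps in (5_000, 45_000, 120_000):
    print(rps, desired_instances(rps, 100, min_size=200, max_size=900))
```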
![Page 23: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/23.jpg)
Dynamic Scaling, cont.
![Page 24: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/24.jpg)
Dynamic Scaling, cont.
• Response time variability greatest during scaling events
• Average response time primarily between 75-150 msec
![Page 25: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/25.jpg)
Dynamic Scaling, cont.
• Instance counts 3x, Aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
![Page 26: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/26.jpg)
Cassandra Performance
Study performed:
• 24 node C* SSD-based cluster (hi1.4xlarge)
• mid-tier service load application
• Targeting 2x production rates
  • Increase read ops from 30k to 70k in ~3 minutes
  • Increase write ops from 750 to 1500 in ~3 minutes
Results:
• 95th pctl response time increase: ~17 msec to 45 msec
• 99th pctl response time increase: ~35 msec to 80 msec
![Page 27: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/27.jpg)
EVcache (memcached) Scalability
Response times consistent during 4x increase in load *
* Due to upstream code change
![Page 28: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/28.jpg)
Cloud-scale Load Testing
• Ad-Hoc or CI-based load test model
  • (CI) Run-over-run comparison; email on rule violation
1. Jenkins initiates job
2. JMeter instances apply load
3. Results written to S3
4. Instance metrics published to Atlas
5. Raw data fetched and processed
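
The run-over-run comparison in the CI model can be sketched as a rule check between two result sets. The metric names, sample values, and rule thresholds below are assumptions for illustration; the actual rules and the email step are not described in the slides.

```python
def compare_runs(previous, current, rules):
    """Return a message for each metric that regressed beyond its rule.

    `rules` maps a metric name to the maximum allowed fractional increase.
    """
    violations = []
    for metric, max_increase in rules.items():
        before, after = previous[metric], current[metric]
        change = (after - before) / before
        if change > max_increase:
            violations.append(
                f"{metric}: {before} -> {after} (+{change:.0%}, "
                f"limit +{max_increase:.0%})")
    return violations


# Made-up results from two consecutive load test runs.
previous = {"p95_ms": 40.0, "p99_ms": 80.0, "error_rate": 0.010}
current = {"p95_ms": 42.0, "p99_ms": 100.0, "error_rate": 0.010}
rules = {"p95_ms": 0.10, "p99_ms": 0.10, "error_rate": 0.50}

for line in compare_runs(previous, current, rules):
    print("RULE VIOLATION", line)   # a CI job would send this by email
```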
![Page 29: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/29.jpg)
Conclusions
• Continually accelerate engineering velocity
  • Evolve architecture and processes to mitigate risks
  • Stateless micro-service architectures win!
• Remove barriers for engineers
  • Last option should be to reduce rate of change
• Exercise failure and “thundering herd” scenarios
• Cloud native scaling and resiliency are key factors
  • Leverage pre-existing OSS PaaS when possible
![Page 30: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/30.jpg)
Netflix Open Source
Our Open Source Software simplifies mgmt at scale
Great projects, stunning colleagues: jobs.netflix.com
![Page 31: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud](https://reader034.vdocuments.net/reader034/viewer/2022052618/554f4401b4c905423f8b4767/html5/thumbnails/31.jpg)
Q&A
• Netflix Tech Blog: http://techblog.netflix.com