(ISM301) Engineering Netflix Global Operations in the Cloud
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Josh Evans - Director of Operations Engineering
Approaching Global Reach
October - Spain, Portugal, Italy
Early 2016 - Korea, Taiwan, Singapore, Hong Kong
65m members 100m
~60 countries 200
The Bigger Picture
[Diagram labels: Netflix CDN (Open Connect); Cloud; Control Plane; Internet; Service Partners]
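[Figure: Availability vs. Rate of Change. Y-axis: availability (nines), 0 to 6. X-axis: rate of change, ticks at 1, 10, 100, 1000.]
Availability and allowed downtime per year:
99.9999% (6 nines): 31.5 seconds
99.999% (5 nines): 5.26 minutes
99.99% (4 nines): 52.56 minutes
99.9% (3 nines): 8.76 hours
99% (2 nines): 3.65 days
90% (1 nine): 36.5 days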
Quality vs. Velocity
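The downtime values on the chart are the unavailable fraction multiplied by one year. A minimal sketch of that arithmetic, assuming a 365-day year; the function names are illustrative and not from the talk:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 (99.9% available)
    return SECONDS_PER_YEAR * unavailability


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


for n in range(6, 0, -1):
    availability = 100 * (1 - 10 ** -n)
    print(f"{availability:.4f}%  {humanize(downtime_seconds_per_year(n))}")
```

Each additional nine divides the yearly downtime budget by ten.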
[Figure: Availability (nines) vs. Rate of Change]
The Zero Sum Game
[Figure: Availability (nines) vs. Rate of Change]
Shifting the Curve
Operational Excellence is the continuous improvement
of the management, design, and function of operational
environments to achieve greater quality, velocity, and
competitive advantage.
Build It
design
code
build
bake
test
deploy
Run It
operate
configure
monitor
respond
You build it, you run it…
…globally
Operations Engineering is the application of software
engineering practices and principles to achieve and sustain
operational excellence.
• automation
• modular components
• tools & services
• best practices
Our Journey – Operations Engineering
• Engineering Tools
• Insight & Real-time Analytics
• Performance & Reliability
• Leverage
Data Center
● Delayed provisioning
● Hand-crafted servers
● Variations and complexity
Our Artisanal Past
Delivery
● Late night, manual deployments
● Repeated mistakes
● Painful delays to production fixes
• DES (double exponential smoothing) on time series data
• Predict the future based on history
• Favor recent history
• Threshold-based alerts
• 6-8 minute delay
Anomaly Detection
Alert!
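The bullets above name DES and threshold-based alerts but give no parameters. A minimal sketch, assuming Holt's linear form of double exponential smoothing and an alert whenever the actual value deviates from the one-step forecast by more than a fixed threshold; `alpha`, `beta`, the threshold rule, and the function names are assumptions, not details from the talk:

```python
def des_forecasts(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts via double exponential smoothing.

    Returns a list aligned with `series`. The first two entries are None
    because two points are needed to initialise the level and the trend.
    A larger `alpha` favors recent history more strongly.
    """
    if len(series) < 2:
        raise ValueError("need at least two points")
    level, trend = series[1], series[1] - series[0]
    forecasts = [None, None]
    for x in series[2:]:
        forecasts.append(level + trend)  # prediction made before seeing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def detect_anomalies(series, threshold, alpha=0.5, beta=0.5):
    """Indices where the actual value is more than `threshold` from the forecast."""
    forecasts = des_forecasts(series, alpha, beta)
    return [
        i
        for i, (x, f) in enumerate(zip(series, forecasts))
        if f is not None and abs(x - f) > threshold
    ]


# Hypothetical metric: steady growth, then a sudden spike at the last point.
print(detect_anomalies([10, 11, 12, 13, 14, 40], threshold=5))
```

In this sketch a spike also shifts the level and trend, so the points right after an anomaly may be flagged as well until the model settles.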
• Unsupervised machine learning
• Density-based clustering algorithm
• Actions
  • Email, page
  • OOS (out of service), detach, terminate
Kepler
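The slide does not say which density-based algorithm Kepler uses or how it is tuned. As an illustration only, the sketch below uses DBSCAN, a well-known density-based clustering algorithm, to flag a server whose metrics sit far from the rest of its cluster. The metric values, `eps`, and `min_pts` are made up for the example:

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical per-server metrics: (normalized error rate, normalized latency).
servers = [(0.10, 0.20), (0.12, 0.21), (0.11, 0.19),
           (0.09, 0.22), (0.13, 0.20), (0.90, 0.95)]
labels = dbscan(servers, eps=0.05, min_pts=3)
outliers = [i for i, label in enumerate(labels) if label == -1]
print(outliers)
```

Servers labeled as noise would be the candidates for the actions listed on the slide (email, page, OOS, detach, terminate).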
[Diagram: Customers, Load Balancer, Metrics. Old Version (v1.0): 100 servers, 95% of traffic. New Version (v1.1): 5 servers, 5% of traffic.]
Canary Release Process
[Diagram: Customers, Load Balancer, Metrics. Old Version (v1.0): 0 servers. New Version (v1.1): 100 servers, 100% of traffic.]
Canary Release Process
Define
• Metrics
• A threshold
Every n minutes
• Classify metrics
• Compute score
• Make a decision
Automatic Canary Analysis
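The classify, score, decide loop above can be sketched as follows. The slides do not describe how metrics are classified or how the score is computed, so this is an assumption: a metric passes if the canary mean is within a relative tolerance of the baseline mean, the score is the percentage of passing metrics, and the canary continues only if the score meets the threshold. All names and numbers are hypothetical:

```python
from statistics import mean


def classify(baseline, canary, tolerance=0.10):
    """Return 'pass' if the canary mean is within `tolerance` of the baseline mean."""
    b, c = mean(baseline), mean(canary)
    if b == 0:
        return "pass" if c == 0 else "fail"
    return "pass" if abs(c - b) / abs(b) <= tolerance else "fail"


def canary_score(metrics, tolerance=0.10):
    """Percentage of metrics classified as 'pass'.

    `metrics` maps a metric name to a (baseline_samples, canary_samples) pair.
    """
    results = [classify(b, c, tolerance) for b, c in metrics.values()]
    return 100.0 * results.count("pass") / len(results)


def decide(score, threshold):
    """Continue the rollout only if the score meets the threshold."""
    return "continue" if score >= threshold else "fail"


# Hypothetical samples collected during one analysis interval.
metrics = {
    "latency_ms": ([100, 102, 98], [103, 101, 105]),
    "error_rate": ([1.0, 1.0], [2.0, 2.0]),
    "cpu_percent": ([50, 50], [52, 52]),
    "requests_per_sec": ([200, 200], [195, 205]),
}
score = canary_score(metrics)
print(score, decide(score, threshold=90))
```

Running this every n minutes against fresh samples gives a repeated go/no-go decision without a human watching dashboards.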
• Systematic observation of facets & permutations
• Unsupervised monitoring & decision-making
• Automated tuning & recovery
• Alerts with analysis
Thinking Globally
Chaos Engineering is the discipline of experimenting on
a distributed system in order to build confidence in the
system's capability to withstand turbulent conditions in
production.
Xen Hypervisor vulnerability – 9/25/14
218 out of 2700+ Cassandra nodes rebooted
22 did not reboot successfully
Automation handled the rest
A State of Xen – Chaos Monkey & Cassandra
[Diagram labels: Device, Internet, ELB, Edge, Zuul, FIT, Service A, Service B, Service C]
Fault-Injection Testing (FIT)
• Simulate service failures
• Override by device or account
• % of member traffic
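The three FIT bullets (simulate service failures, override by device or account, % of member traffic) suggest a per-request decision about whether to inject a failure. The sketch below is one way such a decision could look, not Netflix's implementation. `FailureScenario`, `should_inject`, and the hash-based traffic bucketing are all hypothetical:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FailureScenario:
    service: str                       # service whose failure is simulated
    traffic_percent: float = 0.0       # share of member traffic to affect
    account_ids: set = field(default_factory=set)  # explicit account overrides
    device_ids: set = field(default_factory=set)   # explicit device overrides


def traffic_bucket(account_id: str) -> float:
    """Map an account to a stable value in [0, 100) so its requests behave consistently."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:4], "big") % 10_000) / 100.0


def should_inject(scenario, service, account_id, device_id) -> bool:
    if service != scenario.service:
        return False
    if account_id in scenario.account_ids or device_id in scenario.device_ids:
        return True
    return traffic_bucket(account_id) < scenario.traffic_percent


class SimulatedFailure(Exception):
    pass


def call_service(scenario, service, account_id, device_id, real_call):
    """Fail the call artificially when the scenario applies, else run it for real."""
    if should_inject(scenario, service, account_id, device_id):
        raise SimulatedFailure(f"FIT: simulated failure of {service}")
    return real_call()


scenario = FailureScenario(service="service-b", account_ids={"test-account"})
print(should_inject(scenario, "service-b", "test-account", "tv-1"))
print(should_inject(scenario, "service-b", "other-account", "tv-1"))
```

Hashing the account identifier, rather than drawing a random number per request, keeps a given member consistently inside or outside the experiment.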
• Alerting and Monitoring
• Apache & Tomcat Hardening
• Automated Canary Analysis
• Autoscaling
• Chaos Participation
• Consistent Naming
• ELB Configuration
• Healthcheck Configured
• Red-Black Pipeline
• Squeeze Testing
• Timeout & Fallback Tuning
• Workload Reliability
Production Ready?
Canary Analysis
Conformity
Integration Tests
Citrus
Chaos
Static
Unit Tests
Deep Integration
Modular Components
Functional Testing
RTA auto-tuning
• Alerts
• Apache/Tomcat
• Auto-scaling
• Hystrix fallbacks
RTA decision support
• ACA
• Citrus
• Flow
Conformity checks
• Consistent names
• ELBs
• Health check
• Red/black deployment
Delivery integration
• ACA
• Citrus
• FIT
Production Ready – Automation & Integration
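The conformity checks listed above (consistent names, ELBs, health check, red/black deployment) lend themselves to automation as a set of small predicates run against each application's configuration. A minimal sketch under that assumption; the `AppConfig` fields and the naming pattern are hypothetical and do not reflect Netflix's actual conventions or tooling:

```python
import re
from dataclasses import dataclass


@dataclass
class AppConfig:
    cluster_name: str
    has_elb: bool
    health_check_url: str
    deployment_strategy: str


# Hypothetical convention: lowercase alphanumeric segments joined by hyphens.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")

CHECKS = {
    "consistent name": lambda app: bool(NAME_PATTERN.match(app.cluster_name)),
    "ELB": lambda app: app.has_elb,
    "health check": lambda app: bool(app.health_check_url),
    "red/black deployment": lambda app: app.deployment_strategy == "red/black",
}


def failed_checks(app: AppConfig) -> list:
    """Names of the conformity checks that the application does not satisfy."""
    return [name for name, check in CHECKS.items() if not check(app)]


good = AppConfig("api-prod-v042", True, "/healthcheck", "red/black")
bad = AppConfig("My_App", True, "", "red/black")
print(failed_checks(good))
print(failed_checks(bad))
```

An empty result would mean the application conforms; anything else is a list a team can act on.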
Session | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Availability: The New Kind of Innovator’s Dilemma | Coburn Watson | Wed @ 4:15pm | Marcello 4501B
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B