(SPOT302) Availability: The New Kind of Innovator’s Dilemma
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Coburn Watson, Director of Performance and Reliability, Netflix
October 2015
SPOT302
Availability: The New Kind of Innovator’s Dilemma
@coburnw
• Cloud performance and reliability @ Netflix
• Reduce time-to-detect and time-to-resolve
• Optimize usage of AWS cloud
• Steer global user traffic and support failover
• Inject chaos into production environment
• Build innovative performance analysis tooling
• Drive operational best practice adoption
• 67M+ subscribers
• > 50 countries
• > 3 billion hours of video streamed monthly
• Massive cloud footprint
• Homegrown CDN
• Strong Originals slate
Atlas
https://netflix.github.io/
What to Expect from the Session
• Strategies
• Maximizing engineering velocity in the cloud
• Minimizing risks to availability
The cloud is a journey
…not a destination*
* Adapted from Ralph Waldo Emerson
The Netflix Cloud Journey
Timeline figure; years marked: 2008, 2010, 2011, 2013, 2015. Milestones in slide order:
• Datacenter Failure
• Serving off AWS US-EAST-1
• Three AZ Deployments
• Serving off AWS EU-WEST-1
• Chaos Monkey Unleashed
• Serving from AWS US-WEST-2
• Running Active-Active
• Chaos Kong Unleashed
• Last Application to the Cloud
• Active-Active in three AWS regions
The Innovator’s Dilemma
vs.
Shifting the Curve @ Netflix
• Maintain or improve availability as engineering velocity increases
Maximize Engineering Velocity
"FA-18 Hornet breaking sound barrier (7 July 1999) - filtered" by Ensign John Gay, U.S. Navy
Infrastructure on Demand
• No procurement process
• “All you can eat” **
• Expose IaaS via Spinnaker
• No passwords, no keys
** please don’t eat all of it
Accelerate Code Deployment
• Commit-to-cloud in minutes
• Across three AWS regions
Decouple Services
• µservice architecture (500+ @Netflix)
• One Auto Scaling group per service
• Independent push schedules (1 day to 4 weeks)
• Communicate via API
• Independent databases (280+ Cassandra clusters)
• Minimize aggregate rate of change
• Update code which needs updating…
Minimize Risks to Availability
“If everything seems under control, you're not going fast enough.”
― Mario Andretti
Maximize Infrastructure Stability
• Run on AWS
• Purchase 3-year EC2 Reserved Instances (for failover as well)
• Distribute Auto Scaling groups across 3 Availability Zones per region
Propagate Changes Safely into Production
• Rolling regional “red-black” pushes
• Build pipelines & automated canary analysis
• 30 second time-to-detect on critical metrics
• Rigorous quality and performance checks part of code push
• Canary score is the gate for push
Automated Canary Analysis
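A minimal sketch of how a canary score could gate a push, assuming a simple scoring rule that is not taken from the talk: each metric passes if the canary's mean stays within a relative tolerance of the baseline's mean, and the score is the percentage of metrics that pass. The function names, the 10% tolerance, and the 95-point threshold are all hypothetical.

```python
from statistics import mean


def canary_score(baseline, canary, tolerance=0.10):
    """Return the percentage (0-100) of metrics where the canary's mean
    is within `tolerance` (relative) of the baseline's mean.

    `baseline` and `canary` map metric name -> list of samples.
    This scoring rule is an illustrative assumption.
    """
    passed = 0
    for name, base_samples in baseline.items():
        base = mean(base_samples)
        can = mean(canary[name])
        if base == 0:
            # Avoid dividing by zero: only an exact match passes.
            ok = can == 0
        else:
            ok = abs(can - base) / abs(base) <= tolerance
        passed += 1 if ok else 0
    return 100.0 * passed / len(baseline)


def should_promote(score, threshold=95.0):
    """Gate the push: promote only if the score meets the threshold."""
    return score >= threshold
```

In this sketch a canary whose latency drifts 50% from baseline fails that metric, which pulls the score below the threshold and blocks the push.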
Cross-Service Resiliency
• Isolate misbehaving services
• Open “circuits” and provide fallback experiences
Normal (personalized)
Degraded (unpersonalized)
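The circuit-and-fallback idea above can be sketched as follows. This is a simplified illustration, not the Hystrix API: the class name, the failure threshold, and the reset timeout are assumptions chosen for the example.

```python
import time


class CircuitBreaker:
    """Tiny circuit breaker: after repeated failures, skip the primary
    call for a while and serve the fallback instead (hypothetical sketch)."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Circuit is open: isolate the misbehaving service.
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if (self.failures >= self.failure_threshold
                    or self.opened_at is not None):
                # Open (or re-open) the circuit.
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Here `primary` would stand for the personalized call and `fallback` for the unpersonalized response, so callers always get an answer even while the downstream service is failing.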
Improve Time-To-Detect
• 30 second alerts vs. prior 8 minutes
• Utilize streaming analysis infrastructure at the edge tier
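As a rough illustration of why streaming evaluation shortens time-to-detect, the sketch below re-checks an error-rate threshold on every new per-second sample over a sliding window, rather than waiting for a batch job. The class name, the 30-second window default, and the 5% threshold are assumptions for the example, not details of the Netflix pipeline.

```python
from collections import deque


class SlidingWindowAlert:
    """Evaluate an error-rate threshold over a sliding time window,
    re-checked on every incoming sample (hypothetical sketch)."""

    def __init__(self, window_seconds=30, max_error_rate=0.05):
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.samples = deque()  # (timestamp, requests, errors)

    def observe(self, timestamp, requests, errors):
        """Add one sample and return True if the alert should fire."""
        self.samples.append((timestamp, requests, errors))
        # Drop samples that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        total = sum(r for _, r, _ in self.samples)
        failed = sum(e for _, _, e in self.samples)
        return total > 0 and failed / total > self.max_error_rate
```

Because the check runs on every sample, a sharp regression trips the alert within a few seconds of onset instead of at the next scheduled evaluation.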
Dynamically Provision Capacity
• Reactively scale Auto Scaling groups
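One common way to scale reactively is to size the group in proportion to how far an observed load metric is from its target. The sketch below shows that calculation only; it does not call any AWS API, and the function name and parameters are hypothetical.

```python
import math


def desired_capacity(current_instances, observed_load, target_load,
                     min_size, max_size):
    """Proportional scaling: choose an instance count that would bring the
    per-instance load back to `target_load`, clamped to the group limits."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_instances * observed_load / target_load)
    return max(min_size, min(max_size, raw))
```

For instance, a group of 10 instances observing 90 units of load per instance against a target of 60 would be asked to grow to 15, subject to the configured maximum.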
Flexibility in Traffic Management
• Target three primary AWS regions
• Maintain capacity to allow regional evacuation
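The capacity requirement behind regional evacuation can be worked through with simple arithmetic. With three regions carrying equal traffic, evacuating one leaves each survivor handling 1.5 times its normal load. The sketch below assumes the evacuated traffic is redistributed in proportion to each surviving region's existing load; that policy and the load figures in any usage are illustrative assumptions.

```python
def post_evacuation_load(regional_load, evacuated):
    """Return the load per surviving region after `evacuated` is drained,
    assuming its traffic is spread proportionally across the survivors."""
    survivors = {r: load for r, load in regional_load.items()
                 if r != evacuated}
    total_survivor_load = sum(survivors.values())
    shifted = regional_load[evacuated]
    return {r: load + shifted * load / total_survivor_load
            for r, load in survivors.items()}
```

Each region therefore has to keep enough headroom (or be able to scale fast enough) to absorb its share of another region's traffic, which is the capacity that the failover exercises validate.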
Frequently Exercise “Chaos”
• Netflix runs regional failover exercises monthly
• Can you spot the chaos?
Frequently Exercise “Chaos”
• Validates
• Failover correctness
• Capacity
• Failover velocity
• Confidence in usage
(same time window as previous slide)
Continually Lower Operational Barriers
• “Production Ready” Program
• Identify operational best practices
• Develop tooling
• Consult with engineering teams
• Identify and address reliability “anti-patterns”
• Example key areas
• Auto Scaling, Hystrix tuning, alerting, automated canary analysis, Apache/Tomcat tuning
It Works
Regional Isolation
Push-induced failure
Automated Service Fallbacks
• Downstream service issue; fallbacks gracefully applied
…but what about efficiency?
…that’s a separate talk altogether
Wrapping it Up
• “To the cloud” – a journey
• Abstract complexity via platform
• Don’t be afraid to break things
• Break things intentionally and frequently
• Invest in reliability to support increased innovation
• Hire top talent
Related Sessions
Talk | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B
Remember to complete your evaluations!
Thank you!