(spot302) availability: the new kind of innovator’s dilemma

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Coburn Watson, Director of Performance and Reliability, Netflix

October 2015

SPOT302

Availability: The New Kind of Innovator’s Dilemma

@coburnw

• Cloud performance and reliability @ Netflix

• Reduce time-to-detect and time-to-resolve

• Optimize usage of AWS cloud

• Steer global user traffic and support failover

• Inject chaos into production environment

• Build innovative performance analysis tooling

• Drive operational best practice adoption

• 67M+ subscribers

• > 50 countries

• > 3 billion hours of video streamed monthly

• Massive cloud footprint

• Homegrown CDN

• Strong Originals slate

Atlas

https://netflix.github.io/

What to Expect from the Session

• Strategies

• Maximizing engineering velocity in the cloud

• Minimizing risks to availability

The cloud is a journey

…not a destination*

* Adapted from Ralph Waldo Emerson

Timeline: 2008, 2010, 2011, 2013, 2015

• Datacenter Failure

• Serving off AWS US-EAST-1

• Three AZ Deployments

• Serving off AWS EU-WEST-1

• Chaos Monkey Unleashed

• Serving from AWS US-WEST-2

• Running Active-Active

• Chaos Kong Unleashed

• Last Application to the Cloud

• Active-Active in three AWS regions

The Netflix Cloud Journey

The Innovator’s Dilemma

vs.

Shifting the Curve @ Netflix

• Maintain or improve availability as engineering velocity increases

Maximize Engineering Velocity

"FA-18 Hornet breaking sound barrier (7 July 1999) - filtered" by Ensign John Gay, U.S. Navy

Infrastructure on Demand

• No procurement process

• “all you can eat” **

• Expose IaaS via Spinnaker

• No passwords, no keys

** please don’t eat all of it

Accelerate Code Deployment

• Commit-to-cloud in minutes

• Across three AWS regions

Decouple Services

• µservice architecture (500+ @Netflix)

• One Auto Scaling group per service

• Independent push schedules (1 day to 4 weeks)

• Communicate via API

• Independent databases (280+ Cassandra clusters)

• Minimize aggregate rate of change

• Update code which needs updating…

Minimize Risks to Availability

“If everything seems under control, you're not going fast enough.”

― Mario Andretti

http://www.goodreads.com/author/show/2115694.Mario_Andretti

Maximize Infrastructure Stability

• Run on AWS

• Purchase 3-year EC2 Reserved instances (for failover as well)

• Distribute Auto Scaling groups across 3 Availability Zones per region

Propagate Changes Safely into Production

• Rolling regional “red-black” pushes

• Build pipelines & automated canary analysis

• 30 second time-to-detect on critical metrics

• Rigorous quality and performance checks part of code push

• Canary score is the gate for push
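The idea of a canary score gating a push can be sketched as follows. This is a minimal illustration, not Netflix's automated canary analysis: the function names, the 10% tolerance, the 95-point threshold, and the sample metrics are all assumptions made up for the example, and every metric is assumed to be "lower is better".

```python
from statistics import mean

def canary_score(baseline, canary, tolerance=0.10):
    """Percentage of metrics where the canary is no more than `tolerance`
    worse than the baseline (hypothetical scoring rule)."""
    passed = 0
    for name, base_samples in baseline.items():
        base = mean(base_samples)
        cand = mean(canary[name])
        # All metrics here are "lower is better", so a pass means the
        # canary mean stays within the allowed margin above the baseline.
        if cand <= base * (1 + tolerance):
            passed += 1
    return 100.0 * passed / len(baseline)

def should_promote(score, threshold=95.0):
    """Gate the push: only promote when the score clears the threshold."""
    return score >= threshold

# Made-up samples: error rate is unchanged, latency has regressed.
baseline = {"error_rate": [0.010, 0.012, 0.011], "latency_ms": [120, 118, 125]}
canary = {"error_rate": [0.011, 0.010, 0.012], "latency_ms": [180, 175, 190]}

score = canary_score(baseline, canary)
promote = should_promote(score)
```

Here one of the two metrics fails, so the score falls below the threshold and the push would be blocked.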

Automated Canary Analysis

Cross-Service Resiliency

• Isolate misbehaving services

• Open “circuits” and provide fallback experiences

Normal (personalized)

Degraded (unpersonalized)
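Opening a "circuit" and serving a fallback experience can be sketched as below. This is a deliberately simplified consecutive-failure breaker, not how Hystrix is implemented; the class, its parameters, and the two row functions are hypothetical names chosen for illustration.

```python
import time

class CircuitBreaker:
    """Simplified breaker: opens after N consecutive failures, then
    serves the fallback until a cool-down period has elapsed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the circuit and try the dependency again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary, fallback):
        if self.is_open():
            # Misbehaving service is isolated: do not even attempt the call.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result

def personalized_rows():
    # Stand-in for a call to a failing downstream service.
    raise TimeoutError("recommendation service unavailable")

def unpersonalized_rows():
    return ["Popular on Netflix"]

breaker = CircuitBreaker(failure_threshold=2)
results = [breaker.call(personalized_rows, unpersonalized_rows) for _ in range(3)]
```

The first two calls fail and fall back; the third is served from the fallback without touching the failing service, which corresponds to the degraded (unpersonalized) experience.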

Improve Time-To-Detect

• 30 second alerts vs. prior 8 minutes

• Utilize streaming analysis infrastructure at the edge tier
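One way to picture short time-to-detect with streaming analysis is a rolling window that is updated on every event rather than evaluated in periodic batches. The sketch below is an assumption-laden illustration, not Netflix's tooling: the class name, the 30-second window, the 20% error-rate threshold, and the synthetic one-request-per-second stream are all invented for the example.

```python
from collections import deque

class RollingErrorRate:
    """Error rate over the most recent `window_s` seconds of events."""

    def __init__(self, window_s=30):
        self.window_s = window_s
        self.events = deque()  # (timestamp, is_error) pairs, oldest first
        self.errors = 0

    def observe(self, ts, is_error):
        self.events.append((ts, is_error))
        self.errors += int(is_error)
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= ts - self.window_s:
            _, old_error = self.events.popleft()
            self.errors -= int(old_error)

    def rate(self):
        return self.errors / len(self.events) if self.events else 0.0

# Synthetic stream: one request per second, all requests fail from t=60 on.
detector = RollingErrorRate(window_s=30)
alert_at = None
for ts in range(120):
    detector.observe(ts, is_error=(ts >= 60))
    if alert_at is None and detector.rate() > 0.2:
        alert_at = ts
```

Because the rate is recomputed on each event, the alert fires a few seconds after the failure begins instead of waiting for the next batch evaluation.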

Dynamically Provision Capacity

• Reactively scale Auto Scaling groups
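Reactive scaling can be illustrated with a simple proportional rule: size the group so that the observed load per instance returns to a target. This is a generic sketch, not the policy Netflix or AWS Auto Scaling actually applies; the function name, the load units, and the bounds are assumptions.

```python
import math

def desired_instances(current, observed_load, target_load, min_size, max_size):
    """Proportional rule: scale the group by observed/target load per
    instance, then clamp to the group's configured bounds."""
    desired = math.ceil(current * observed_load / target_load)
    return max(min_size, min(max_size, desired))

# Hypothetical group of 10 instances with a target of 50 load units each.
scale_out = desired_instances(10, observed_load=75.0, target_load=50.0, min_size=4, max_size=40)
scale_in = desired_instances(10, observed_load=20.0, target_load=50.0, min_size=4, max_size=40)
capped = desired_instances(10, observed_load=500.0, target_load=50.0, min_size=4, max_size=40)
```

The clamp matters in both directions: the minimum keeps enough instances to absorb a sudden spike, and the maximum bounds the cost of a runaway metric.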

Flexibility in Traffic Management

• Target three primary AWS regions

• Maintain capacity to allow regional evacuation
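The capacity needed to allow a regional evacuation can be estimated with some simple arithmetic. The sketch below assumes, purely for illustration, that an evacuated region's peak traffic is split evenly across the surviving regions; the peak figures and function names are made up, and real traffic steering may distribute load differently.

```python
def capacity_after_evacuation(peak_by_region, evacuated):
    """Load on each surviving region if `evacuated` is drained and its
    peak traffic is split evenly across the others (an assumption)."""
    survivors = [r for r in peak_by_region if r != evacuated]
    extra = peak_by_region[evacuated] / len(survivors)
    return {r: peak_by_region[r] + extra for r in survivors}

def capacity_needed(peak_by_region):
    """Worst case per region across every possible single-region evacuation."""
    need = dict(peak_by_region)
    for evacuated in peak_by_region:
        for region, load in capacity_after_evacuation(peak_by_region, evacuated).items():
            need[region] = max(need[region], load)
    return need

# Hypothetical peak load per region, in arbitrary units.
peaks = {"us-east-1": 600, "us-west-2": 300, "eu-west-1": 300}
need = capacity_needed(peaks)
```

With these numbers, each smaller region must be able to double its normal peak to absorb an evacuation of the largest region, which is the headroom that has to be kept available.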

Frequently Exercise “Chaos”

• Netflix runs regional failover exercises monthly

• Can you spot the chaos?

Frequently Exercise “Chaos”

• Validates

• Failover correctness

• Capacity

• Failover velocity

• Confidence in usage

(same time window as previous slide)

Continually Lower Operational Barriers

• “Production Ready” Program

• Identify operational best practices

• Develop tooling

• Consult with engineering teams

• Identify and address reliability “anti-patterns”

• Example key areas

• Auto Scaling, Hystrix tuning, alerting, automated canary analysis, Apache/Tomcat tuning

It Works

Regional Isolation

Push-induced failure

Automated Service Fallbacks

• Downstream service issue; fallbacks gracefully applied

…but what about efficiency?

…That’s a separate talk altogether

Wrapping it Up

• “To the cloud” – a journey

• Abstract complexity via platform

• Don’t be afraid to break things

• Break things intentionally and frequently

• Invest in reliability to support increased innovation

• Hire top talent

Related Sessions

Talk | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B

Remember to complete your evaluations!

Thank you!
