(SPOT302) Availability: The New Kind of Innovator’s Dilemma

SPOT302: Availability: The New Kind of Innovator’s Dilemma. Coburn Watson, Director of Performance and Reliability, Netflix. October 2015. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Upload: amazon-web-services

Post on 14-Feb-2017


Page 1: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Coburn Watson, Director of Performance and Reliability, Netflix

October 2015

SPOT302

Availability: The New Kind of Innovator’s Dilemma

Page 2: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

@coburnw

• Cloud performance and reliability @ Netflix

• Reduce time-to-detect and time-to-resolve

• Optimize usage of AWS cloud

• Steer global user traffic and support failover

• Inject chaos into production environment

• Build innovative performance analysis tooling

• Drive operational best practice adoption

Page 3: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

• 67M+ subscribers

• > 50 countries

• > 3 billion hours of video streamed monthly

• Massive cloud footprint

• Homegrown CDN

• Strong Originals slate

Page 4: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Atlas

https://netflix.github.io/

Page 5: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

What to Expect from the Session

• Strategies

• Maximizing engineering velocity in the cloud

• Minimizing risks to availability

Page 6: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

The cloud is a journey

…not a destination*

* Adapted from Ralph Waldo Emerson

Page 7: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

The Netflix Cloud Journey

Timeline figure; year markers: 2008, 2010, 2011, 2013, 2015. Milestone labels, in the order they appear on the slide:

• Datacenter Failure

• Serving off AWS US-EAST-1

• Three AZ Deployments

• Serving off AWS EU-WEST-1

• Chaos Monkey Unleashed

• Serving from AWS US-WEST-2

• Running Active-Active

• Chaos Kong Unleashed

• Last Application to the Cloud

• Active-Active in three AWS regions

Page 8: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

The Innovator’s Dilemma

vs.

Page 9: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Shifting the Curve @ Netflix

• Maintain or improve availability as engineering velocity increases

Page 10: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Maximize Engineering Velocity

"FA-18 Hornet breaking sound barrier (7 July 1999) - filtered" by Ensign John Gay, U.S. Navy

Page 11: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Infrastructure on Demand

• No procurement process

• “All you can eat” **

• Expose IaaS via Spinnaker

• No passwords, no keys

** please don’t eat all of it

Page 12: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Accelerate Code Deployment

• Commit-to-cloud in minutes

• Across three AWS regions

Page 13: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Decouple Services

• µservice architecture (500+ @Netflix)

• One Auto Scaling group per service

• Independent push schedules (1 day to 4 weeks)

• Communicate via API

• Independent databases (280+ Cassandra clusters)

• Minimize aggregate rate of change

• Update code which needs updating…

Page 14: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Minimize Risks to Availability

“If everything seems under control, you're not going fast enough.”

― Mario Andretti

Page 15: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Maximize Infrastructure Stability

• Run on AWS

• Purchase 3-year EC2 Reserved Instances (for failover capacity as well)

• Distribute Auto Scaling groups across 3 Availability Zones per region
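
The arithmetic behind spreading a group across three Availability Zones can be sketched as follows. This is a minimal illustration, not Netflix's or AWS's placement logic: the function names, the zone names, and the "balanced placement" rule are all assumptions made for the example.

```python
import math


def spread_across_zones(desired_instances, zones):
    """Place instances as evenly as possible across the given zones (assumed rule)."""
    base, extra = divmod(desired_instances, len(zones))
    # The first `extra` zones each take one additional instance.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def group_size_to_survive_zone_loss(required_instances, zone_count):
    """Smallest balanced group that still has `required_instances` after losing its largest zone."""
    size = required_instances
    while size - math.ceil(size / zone_count) < required_instances:
        size += 1
    return size


# Hypothetical zone names, used only for illustration.
zones = ["zone-a", "zone-b", "zone-c"]
placement = spread_across_zones(10, zones)
needed = group_size_to_survive_zone_loss(10, len(zones))
```

Under these assumptions, a service that needs 10 instances to carry its load would run 15 across three zones, so that losing any single zone still leaves 10.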

Page 16: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Propagate Changes Safely into Production

• Rolling regional “red-black” pushes

• Build pipelines & automated canary analysis

• 30 second time-to-detect on critical metrics
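
A rolling regional red-black push can be sketched as a small simulation. This is an illustration under assumptions, not Spinnaker's API: `ServerGroup`, `rolling_red_black_push`, and the `is_healthy` callback are hypothetical names. The idea shown is that each region gets a new server group alongside the old one, traffic moves only if the new group passes its check, the old group is kept as standby for rollback, and the rollout stops at the first unhealthy region.

```python
from dataclasses import dataclass


@dataclass
class ServerGroup:
    version: str
    instances: int
    serving: bool = True


def rolling_red_black_push(active_by_region, new_version, is_healthy):
    """Push region by region; return (active groups, standby groups kept for rollback)."""
    active = dict(active_by_region)
    standby = {}
    for region, old_group in active_by_region.items():
        # Bring up the new group next to the old one, not yet taking traffic.
        candidate = ServerGroup(new_version, old_group.instances, serving=False)
        if not is_healthy(region, candidate):
            # Stop the rollout; this region and all later ones stay on the old version.
            break
        candidate.serving = True
        old_group.serving = False  # kept around, disabled, so rollback is quick
        active[region] = candidate
        standby[region] = old_group
    return active, standby


groups = {
    "us-east-1": ServerGroup("v1", 10),
    "us-west-2": ServerGroup("v1", 8),
    "eu-west-1": ServerGroup("v1", 6),
}
# Hypothetical health check: the new version misbehaves in the second region.
active, standby = rolling_red_black_push(
    groups, "v2", lambda region, group: region != "us-west-2"
)
```

In this run only the first region ends up on the new version, which limits the blast radius of a bad push to one region.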

Page 17: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Automated Canary Analysis

• Rigorous quality and performance checks are part of every code push

• Canary score is the gate for push
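
The idea of a canary score acting as a gate can be sketched as below. The scoring rule here (percentage of metrics where the canary is within a tolerance of the baseline, with lower values assumed to be better) is an assumption chosen for illustration, not the scoring method used at Netflix; the metric names, tolerance, and threshold are hypothetical.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Percentage of metrics where the canary is no more than `tolerance` worse than baseline.

    Assumes every metric is "lower is better" (error rate, latency, CPU).
    """
    passed = 0
    for name, base_value in baseline.items():
        allowed = base_value * (1 + tolerance)
        if canary[name] <= allowed:
            passed += 1
    return 100.0 * passed / len(baseline)


def gate_push(score, threshold=95.0):
    """The push proceeds only if the canary score meets the threshold."""
    return score >= threshold


# Hypothetical metric values for a baseline cluster and a canary cluster.
baseline = {"error_rate": 0.01, "latency_ms": 100, "cpu_percent": 50, "gc_pause_ms": 20}
canary = {"error_rate": 0.01, "latency_ms": 105, "cpu_percent": 70, "gc_pause_ms": 20}

score = canary_score(baseline, canary)
proceed = gate_push(score)
```

Here the canary regresses on one of four metrics, so the score falls below the threshold and the push is blocked.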

Page 18: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Cross-Service Resiliency

• Isolate misbehaving services

• Open “circuits” and provide fallback experiences

Normal (personalized)

Degraded (unpersonalized)
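
Opening a "circuit" and serving a fallback can be sketched with a small circuit breaker. This is a minimal illustration of the general pattern, not the Hystrix API: the class name, its parameters, and the trial-call behaviour after the reset window are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; while open, callers get the fallback immediately."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        # After the reset window, let one trial call through to probe the service.
        return self.clock() - self.opened_at < self.reset_after_seconds

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()  # isolate the misbehaving service
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result


def unpersonalized_rows():
    # Hypothetical degraded experience: a generic list instead of personalized rows.
    return ["Popular on Netflix"]
```

A caller would wrap the personalization request as `breaker.call(fetch_personalized_rows, unpersonalized_rows)`, where `fetch_personalized_rows` is a hypothetical function that calls the downstream service.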

Page 19: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Improve Time-To-Detect

• 30 second alerts vs. prior 8 minutes

• Utilize streaming analysis infrastructure at the edge tier
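
Streaming detection over a short window can be sketched as a sliding-window error-rate check. This is an illustration only: the class name, the 5% threshold, and the per-request event shape are assumptions, and the slide does not describe how the actual streaming analysis is built. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class ErrorRateAlert:
    """Tracks request outcomes over a sliding time window and flags a high error rate."""

    def __init__(self, window_seconds=30, max_error_rate=0.05):
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.events = deque()  # (timestamp, is_error) pairs, oldest first
        self.errors = 0

    def observe(self, timestamp, is_error):
        """Record one request; return True if the windowed error rate exceeds the limit."""
        self.events.append((timestamp, is_error))
        self.errors += int(is_error)
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_is_error = self.events.popleft()
            self.errors -= int(old_is_error)
        return self.errors / len(self.events) > self.max_error_rate
```

Because the check runs on every event instead of on a periodic batch, the alert can fire as soon as the window crosses the threshold.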

Page 20: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Dynamically Provision Capacity

• Reactively scale Auto Scaling groups
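
Reactive scaling can be sketched as a proportional rule: resize the group so that the observed per-instance metric moves back toward a target. The function below is an assumed simplification for illustration, not the algorithm used by AWS Auto Scaling policies; the parameter names and the example numbers are hypothetical.

```python
import math


def desired_capacity(current_instances, observed_metric, target_metric, min_size, max_size):
    """Scale the group in proportion to how far the observed metric is from its target."""
    if current_instances <= 0 or target_metric <= 0:
        raise ValueError("current_instances and target_metric must be positive")
    raw = math.ceil(current_instances * observed_metric / target_metric)
    # Never leave the configured bounds of the group.
    return max(min_size, min(max_size, raw))


# 10 instances averaging 80% CPU against a 50% target: scale out.
scale_out = desired_capacity(10, 80.0, 50.0, min_size=2, max_size=100)
# 10 instances averaging 20% CPU against a 50% target: scale in.
scale_in = desired_capacity(10, 20.0, 50.0, min_size=2, max_size=100)
```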

Page 21: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Flexibility in Traffic Management

• Target three primary AWS regions

• Maintain capacity to allow regional evacuation
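
Whether a region can be evacuated comes down to a capacity check on the surviving regions. The sketch below assumes the evacuated traffic is redistributed in proportion to each survivor's current traffic; real traffic steering would also weigh geography and latency. The function names and all traffic and capacity figures are hypothetical.

```python
def traffic_after_evacuation(traffic_by_region, evacuated):
    """Move the evacuated region's traffic to the survivors, proportionally (assumed rule)."""
    survivors = {r: t for r, t in traffic_by_region.items() if r != evacuated}
    total_survivor_traffic = sum(survivors.values())
    moved = traffic_by_region[evacuated]
    return {r: t + moved * t / total_survivor_traffic for r, t in survivors.items()}


def can_evacuate(traffic_by_region, capacity_by_region, evacuated):
    """True if every surviving region can absorb its share of the evacuated traffic."""
    after = traffic_after_evacuation(traffic_by_region, evacuated)
    return all(load <= capacity_by_region[region] for region, load in after.items())


# Hypothetical traffic and capacity, in arbitrary units.
traffic = {"us-east-1": 50, "us-west-2": 30, "eu-west-1": 20}
capacity = {"us-east-1": 60, "us-west-2": 60, "eu-west-1": 45}

after = traffic_after_evacuation(traffic, "us-east-1")
ok = can_evacuate(traffic, capacity, "us-east-1")
```

In this example the two survivors double their load, so each must hold roughly twice its steady-state capacity in reserve for the largest region to be evacuated safely.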

Page 22: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Frequently Exercise “Chaos”

• Netflix runs regional failover exercises monthly

• Can you spot the chaos?

Page 23: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Frequently Exercise “Chaos”

• Validates

• Failover correctness

• Capacity

• Failover velocity

• Confidence in usage

(same time window as previous slide)

Page 24: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Continually Lower Operational Barriers

• “Production Ready” Program

• Identify operational best practices

• Develop tooling

• Consult with engineering teams

• Identify and address reliability “anti-patterns”

• Example key areas

• Auto Scaling, Hystrix tuning, alerting, automated canary analysis, Apache/Tomcat tuning

Page 25: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

It Works

Page 26: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Regional Isolation

Push-induced failure

Page 27: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Automated Service Fallbacks

• Downstream service issue; fallbacks gracefully applied

Page 28: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

…but what about efficiency?

…That’s a separate talk altogether

Page 29: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Wrapping it Up

• “To the cloud” – a journey

• Abstract complexity via platform

• Don’t be afraid to break things

• Break things intentionally and frequently

• Invest in reliability to support increased innovation

• Hire top talent

Page 30: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Related Sessions

Talk | Speaker | When? | Where?

Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N

Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C

Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B

A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H

Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B

Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F

Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B

Page 31: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Remember to complete your evaluations!

Page 32: (SPOT302) Availability: The New Kind of Innovator’s Dilemma

Thank you!