(ISM301) Engineering Netflix Global Operations in the Cloud

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Josh Evans - Director of Operations Engineering ISM301 Engineering Netflix Global Operations in the Cloud

Upload: amazon-web-services

Post on 21-Jan-2017



Internet

Our Journey

• Two Operational Challenges

• Operational Excellence

• Operations Engineering

Product Innovation

winning moments of truth

● Every facet of the product

● 1400 A/B tests in the last year & accelerating

Continuous Innovation

Challenge #1: Accelerate Innovation and Rate of Change

Scale & Complexity

100,000s of requests per second

1000s of Global Starts per Second

Approaching Global Reach

October - Spain, Portugal, Italy

Early 2016 - Korea, Taiwan, Singapore, Hong Kong

65m → 100m members

~60 → 200 countries

US-West, US-East, EU-West

Multi-Zone, Multi-Region

Netflix CDN (Open Connect)

Cloud

Control Plane

Internet

The Bigger Picture

Service Partners

Challenge #2: Sustain & Improve Quality

in the face of ever growing scale & complexity

Our Journey

• Two Operational Challenges

• Operational Excellence

• Operations Engineering

Operational Excellence

Quality Velocity

Availability vs. Rate of Change

[Chart: availability in nines (0 to 6) against rate of change (ticks at 1, 10, 100, 1000)]

Nines  Availability  Downtime per year
6      99.9999%      31.5 seconds
5      99.999%       5.26 minutes
4      99.99%        52.56 minutes
3      99.9%         8.76 hours
2      99%           3.65 days
1      90%           36.5 days
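The downtime figures on this chart follow directly from the number of nines. A minimal sketch of that arithmetic, assuming a 365-day year (the function names are illustrative, not from the talk); by this calculation two nines (99%) allows about 3.65 days of downtime per year:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


def humanize(seconds: float) -> str:
    """Render a duration in the largest convenient unit."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} nines: {humanize(downtime_seconds_per_year(n))}")
```

Each extra nine divides the downtime budget by ten, which is why the chart plots availability on a nines scale.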

Quality vs. Velocity

Availability vs. Rate of Change

[Chart: availability in nines (0 to 6) against rate of change (ticks at 1, 10, 100, 1000), annotated with availability percentages and yearly downtime]

The Zero Sum Game

Availability vs. Rate of Change

[Chart: availability in nines (0 to 6) against rate of change (ticks at 1, 10, 100, 1000), annotated with availability percentages and yearly downtime]

The Zero Sum Game

Availability vs. Rate of Change

[Chart: availability in nines (0 to 6) against rate of change (ticks at 1, 10, 100, 1000), annotated with availability percentages]

Shifting the Curve

Operational Excellence is the continuous improvement of the management, design, and function of operational environments to achieve greater quality, velocity, and competitive advantage.

Our Journey

• Two Operational Challenges

• Operational Excellence

• Operations Engineering

Build It

design

code

build

bake

test

deploy

Run It

operate

configure

monitor

respond

You build it, you run it…

…globally

Undifferentiated Heavy Lifting

Operations Engineering is the application of software engineering practices and principles to achieve and sustain operational excellence.

• automation

• modular components

• tools & services

• best practices

Our Journey – Operations Engineering

• Engineering Tools

• Insight & Real-time Analytics

• Performance & Reliability

• Leverage

Data Center

● Delayed provisioning

● Hand-crafted servers

● Variations and complexity

Our Artisanal Past

Delivery

● Late night, manual deployments

● Repeated mistakes

● Painful delays to production fixes

• productivity

• velocity

• quality

Engineering Tools

• cloud management

• delivery engine

• automation platform

Global Cloud Management

Delivery Pipelines

Automated Global Delivery

The Paved Road

• Stash

• Gradle

• Ubuntu

• Jenkins

• Spinnaker

Our Journey

• Engineering Tools

• Insight & Real-time Analytics

• Performance & Reliability

• Leverage

Insight & Real-Time Analytics

OODA loop

An outage may not be life or death but…

• DES (double exponential smoothing) on time series data

• Predict the future based on history

• Favor recent history

• Threshold-based alerts

• 6-8 minute delay

Anomaly Detection
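The bullets above describe forecasting a metric from its own history, weighting recent points more heavily, and alerting when the actual value strays too far from the forecast. A minimal sketch of that idea, assuming DES here means double exponential smoothing in its textbook (Holt) form; the function names, smoothing constants, and tolerance are illustrative and not taken from the talk:

```python
from typing import List, Sequence


def des_forecasts(series: Sequence[float], alpha: float = 0.5,
                  beta: float = 0.3) -> List[float]:
    """Return one-step-ahead forecasts; forecasts[i] predicts series[i + 1]."""
    if len(series) < 2:
        raise ValueError("need at least two points to initialise level and trend")
    level = series[0]
    trend = series[1] - series[0]
    forecasts: List[float] = []
    for actual in series[1:]:
        forecasts.append(level + trend)  # prediction made before seeing `actual`
        # Higher alpha/beta favour recent history over older history.
        new_level = alpha * actual + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def threshold_alerts(series: Sequence[float], tolerance: float,
                     alpha: float = 0.5, beta: float = 0.3) -> List[int]:
    """Indices where the actual value deviates from its forecast by more than tolerance."""
    forecasts = des_forecasts(series, alpha, beta)
    return [i + 1 for i, predicted in enumerate(forecasts)
            if abs(series[i + 1] - predicted) > tolerance]


if __name__ == "__main__":
    print(threshold_alerts([10, 11, 12, 13, 50, 15], tolerance=5.0))
```

Note that a spike also distorts the forecast for the following point, so the sample once recovery begins can trigger an alert as well.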

Alert!

Finer Granularity, Shorter Time Windows

Ensemble Learning

Median Absolute Deviation

IQR

Least Squares

HDI

Voting
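The slide lists several detectors combined by voting. A rough sketch of how such an ensemble could work, assuming simple majority voting over three of the listed techniques (median absolute deviation, IQR, and a least-squares trend fit); HDI is omitted, and all function names and cut-off constants are hypothetical:

```python
import statistics
from typing import Callable, List, Sequence

Detector = Callable[[Sequence[float], float], bool]


def mad_detector(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Flag values more than k median absolute deviations from the median."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    if mad == 0:
        return value != med
    return abs(value - med) / mad > k


def iqr_detector(history: Sequence[float], value: float, k: float = 1.5) -> bool:
    """Flag values outside the quartiles extended by k times the IQR."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    spread = q3 - q1
    return value < q1 - k * spread or value > q3 + k * spread


def least_squares_detector(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Fit a line to history and flag a next value far from the extrapolation."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in enumerate(history)]
    spread = statistics.pstdev(residuals)
    deviation = abs(value - (intercept + slope * n))
    return deviation > 1e-9 if spread == 0 else deviation > k * spread


DETECTORS: List[Detector] = [mad_detector, iqr_detector, least_squares_detector]


def count_votes(history: Sequence[float], value: float) -> int:
    """Number of detectors that consider `value` anomalous."""
    return sum(1 for detect in DETECTORS if detect(history, value))


def is_anomaly(history: Sequence[float], value: float, min_votes: int = 2) -> bool:
    """Majority vote across the detectors."""
    return count_votes(history, value) >= min_votes
```

Voting means no single noisy detector can raise an alert on its own, which is one way to tighten the time window without increasing false alarms.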

observe, orient, decide, act

Alert!

From 6-8 minutes to < 1 minute

observe, orient…

…decide, act

How do we take humans out of the equation?

Outlier Detection & Remediation

• Unsupervised machine learning

• Density-based clustering algorithm

• Actions

  • Email, page

  • OOS (out of service), detach, terminate
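To make the density-based idea concrete: servers whose metrics sit in a dense group are treated as healthy, and a server with too few close neighbours is an outlier and a candidate for the actions listed above. The sketch below uses DBSCAN as one well-known density-based clustering algorithm; the slides do not name the algorithm, and the instance names, metric values, `eps`, and `min_pts` are all made up for illustration:

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

NOISE = -1
Point = Tuple[float, ...]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Label each point with a cluster id, or NOISE if it belongs to no cluster."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be adopted as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


def find_outliers(metrics: Dict[str, Point], eps: float, min_pts: int) -> List[str]:
    """Names of servers whose metrics fall outside every dense cluster."""
    names = list(metrics)
    labels = dbscan([metrics[name] for name in names], eps, min_pts)
    return [name for name, label in zip(names, labels) if label == NOISE]


if __name__ == "__main__":
    fleet = {"a": (1.0, 1.0), "b": (1.1, 0.9), "c": (0.9, 1.1),
             "d": (1.0, 1.2), "e": (5.0, 5.0)}
    print(find_outliers(fleet, eps=0.5, min_pts=3))
```

Because the approach is unsupervised, nobody has to define what "normal" looks like per service; the fleet's own behaviour sets the baseline.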

Kepler

An ounce of prevention…

[Diagram: Customers, Load Balancer; Old Version (v1.0): 100 servers, 95% of traffic; New Version (v1.1): 5 servers, 5% of traffic; Metrics]

Canary Release Process

[Diagram: Customers, Load Balancer; Old Version (v1.0): 0 servers; New Version (v1.1): 100 servers, 100% of traffic; Metrics]

Canary Release Process

Define

• Metrics

• A threshold

Every n minutes

● Classify metrics

● Compute score

● Make a decision

Automatic Canary Analysis
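The classify, score, decide loop can be sketched in a few lines. This is only an illustration under stated assumptions: a metric "passes" when the canary is within a relative tolerance of the baseline, the score is the percentage of passing metrics, and the metric names, tolerance, and threshold are hypothetical rather than Netflix's actual rules:

```python
from typing import Dict


def classify_metric(baseline: float, canary: float, tolerance: float = 0.10) -> bool:
    """True if the canary value is within `tolerance` (relative) of the baseline."""
    if baseline == 0:
        return canary == 0
    return abs(canary - baseline) / abs(baseline) <= tolerance


def canary_score(baseline: Dict[str, float], canary: Dict[str, float],
                 tolerance: float = 0.10) -> float:
    """Percentage of metrics on which the canary matches the baseline."""
    passed = sum(classify_metric(baseline[name], canary[name], tolerance)
                 for name in baseline)
    return 100.0 * passed / len(baseline)


def decide(score: float, threshold: float) -> str:
    """Continue the rollout only if the score meets the pre-defined threshold."""
    return "continue" if score >= threshold else "rollback"


if __name__ == "__main__":
    old = {"error_rate": 0.010, "latency_ms": 200.0, "cpu": 50.0}
    new = {"error_rate": 0.0105, "latency_ms": 260.0, "cpu": 52.0}
    score = canary_score(old, new)
    print(round(score, 1), decide(score, threshold=80.0))
```

Running this evaluation every n minutes, with the metrics and threshold defined up front, is what lets the decision be made without a human watching dashboards.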

• Systematic observation of facets & permutations

• Unsupervised monitoring & decision-making

• Automated tuning & recovery

• Alerts with analysis

Thinking Globally

Our Journey

• Engineering Tools

• Insight & Real-time Analytics

• Performance & Reliability

• Leverage

Performance & Reliability

Internet

Zuul

API

NCCP

Playback

History

Playback Sessions

MAP

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Cluster A Cluster D

Edge Cluster

Cluster B

Cluster C

Imagine a monkey loose in your data center…
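The "monkey in the data center" idea amounts to deliberately terminating a random instance in each cluster and checking that the service survives. A minimal sketch of the selection step only, with hypothetical cluster and instance names; it picks victims but does not call any cloud API, and it is not the actual Chaos Monkey implementation:

```python
import random
from typing import Dict, List


def pick_victims(clusters: Dict[str, List[str]], probability: float,
                 rng: random.Random) -> Dict[str, str]:
    """For each cluster, with the given probability, choose one instance to terminate."""
    victims: Dict[str, str] = {}
    for cluster, instances in clusters.items():
        if instances and rng.random() < probability:
            victims[cluster] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    fleet = {
        "cluster-a": ["i-a1", "i-a2", "i-a3"],
        "cluster-b": ["i-b1", "i-b2"],
        "cluster-c": [],
    }
    # A seeded generator keeps experiments reproducible.
    print(pick_victims(fleet, probability=0.5, rng=random.Random(42)))
```

Passing the random generator in explicitly makes a run repeatable, which helps when an experiment uncovers a weakness that needs to be reproduced.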

Xen Hypervisor vulnerability – 9/25/14

218 out of 2700+ Cassandra nodes rebooted

22 did not reboot successfully

Automation handled the rest

A State of Xen – Chaos Monkey & Cassandra

[Diagram labels: Device, Internet, ELB, Edge, Zuul, Service A, Service B, Service C, FIT]

Fault-Injection Testing (FIT)

• Simulate service failures

• Override by device or account

• % of member traffic


US-East, US-West

AZ1

EU-West

Global Traffic Management

The Internet

DNS-based Routing

Zuul Proxy

Back Channel

###, ###, ###

• Alerting and Monitoring

• Apache & Tomcat Hardening

• Automated Canary Analysis

• Autoscaling

• Chaos Participation

• Consistent Naming

• ELB Configuration

• Healthcheck Configured

• Red-Black Pipeline

• Squeeze Testing

• Timeout & Fallback Tuning

• Workload Reliability

Production Ready?
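A checklist like this lends itself to an automated report. The sketch below is illustrative only: the item names come from the slide, while the snake_case keys, the status dictionary, and the scoring are assumptions rather than a description of Netflix's tooling:

```python
from typing import Dict, List

# Item names are taken from the slide; the snake_case keys are hypothetical.
CHECKLIST: List[str] = [
    "alerting_and_monitoring",
    "apache_tomcat_hardening",
    "automated_canary_analysis",
    "autoscaling",
    "chaos_participation",
    "consistent_naming",
    "elb_configuration",
    "healthcheck_configured",
    "red_black_pipeline",
    "squeeze_testing",
    "timeout_fallback_tuning",
    "workload_reliability",
]


def missing_items(status: Dict[str, bool]) -> List[str]:
    """Checklist items the service has not satisfied (absent counts as missing)."""
    return [item for item in CHECKLIST if not status.get(item, False)]


def readiness_percent(status: Dict[str, bool]) -> float:
    """Share of checklist items satisfied, as a percentage."""
    done = len(CHECKLIST) - len(missing_items(status))
    return 100.0 * done / len(CHECKLIST)


if __name__ == "__main__":
    status = {item: True for item in CHECKLIST}
    status["squeeze_testing"] = False
    print(missing_items(status), round(readiness_percent(status), 1))
```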

Our Journey

• Engineering Tools

• Insight & Real-time Analytics

• Performance & Reliability

• Leverage

● A federation of tools

● Common UI elements

● Deep linking

Operational Tools as a Product

Canary Analysis

Conformity

Integration Tests

Citrus

Chaos

Static

Unit Tests

Deep Integration

Modular Components

Functional Testing

RTA auto-tuning

• Alerts

• Apache/Tomcat

• Auto-scaling

• Hystrix fallbacks

RTA decision support

• ACA

• Citrus

• Flow

Conformity checks

• Consistent names

• ELBs

• Health check

• Red/black deployment

Delivery integration

• ACA

• Citrus

• FIT

Production Ready – Automation & Integration

Internet

Our Journey Ends

https://netflix.github.io/

Session | Speaker | When? | Where?

Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N

Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C

Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B

A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H

Availability: The New Kind of Innovator’s Dilemma | Coburn Watson | Wed @ 4:15pm | Marcello 4501B

Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B

Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F

Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B


Thank you!

Josh Evans

[email protected]

@josh_evans_nflx