evolve or die - sigcommconferences.sigcomm.org/sigcomm/2016/files/program/...ramesh govindan, ina...

Evolve or Die High-Availability Design Principles Drawn from Google’s Network Infrastructure Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google


Page 1: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides

Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google

Page 2:

Network availability is the biggest challenge facing large content and cloud providers today

Page 3:

Why?

The push towards higher 9s of availability

At four 9s availability
❖ Outage budget is 4 mins per month

At five 9s availability
❖ Outage budget is 24 seconds per month
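The budgets on this slide follow from simple arithmetic. The sketch below is illustrative only: the function name and the 30-day month are assumptions, not something the slides state. With those assumptions the results come out slightly above the slide's figures, which appear to be rounded.

```python
def outage_budget_seconds(nines: int, days_per_month: float = 30.0) -> float:
    """Allowed downtime per month, in seconds, at `nines` nines of availability.

    The 30-day month is an assumption made for this sketch.
    """
    unavailability = 10.0 ** (-nines)            # four 9s -> 0.0001
    seconds_per_month = days_per_month * 24 * 60 * 60
    return unavailability * seconds_per_month


for nines in (4, 5):
    budget = outage_budget_seconds(nines)
    print(f"{nines} nines: about {budget / 60:.2f} min ({budget:.1f} s) per month")
```

Each additional 9 divides the budget by ten, which is why failures lasting tens of minutes consume many months of budget at once.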

Page 4:

How do providers achieve these levels?

By learning from failures

Page 5:

What has Google Learnt from Failures?

Paper’s Focus
❖ Why is high network availability a challenge?
❖ What are the characteristics of network availability failures?
❖ What design principles can achieve high availability?

Page 6:

Why is high network availability a challenge?
❖ Velocity of Evolution
❖ Scale
❖ Management Complexity

Page 7:

Evolution

Network hardware evolves continuously

[Figure: capacity (y-axis) over time (x-axis); labels: Saturn, Firehose 1.0, Watchtower, Firehose 1.1, 4 Post, Jupiter]

Page 8:

Evolution

So does network software

[Figure: timeline with ticks at 2006, 2008, 2010, 2012, 2014; labels: B4, Google Global Cache, BwE, Jupiter, gRPC, Freedome, Watchtower, QUIC, Andromeda]

Page 9:

Evolution

New hardware and software can
❖ Introduce bugs
❖ Disrupt existing software

Result: Failures!

Page 10:

Scale and Complexity

[Figure: network overview; labels: B2, B4, Data centers, Other ISPs]

Page 11:

Scale and Complexity

Design Differences

B4 and Data Centers
❖ Use merchant silicon chips
❖ Centralized control planes

B2
❖ Vendor gear
❖ Decentralized control plane

Page 12:

Scale and Complexity

Design Differences

These differences increase management complexity and pose availability challenges

Page 13:

The Management Plane

Management Plane Software

Manages network evolution

Page 14:

Management Plane Operations
❖ Connect a new data center to B2 and B4
❖ Upgrade B4 or data center control plane software
❖ Drain or undrain links, switches, routers, services (to drain is to temporarily remove from service)

Many operations require multiple steps and can take hours or days

Page 15:

The Management Plane

Low-level abstractions for management operations
❖ Command-line interfaces to high capacity routers

A small mistake by an operator can impact a large part of the network

Page 16:

Why is high network availability a challenge?

What are the characteristics of network availability failures?
❖ Duration, Severity, Prevalence
❖ Root-cause Categorization

Page 17:

Key Takeaway

Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

Page 18:

We analyzed over 100 post-mortem reports written over a 2-year period

Page 19:

What is a Post-mortem?

Carefully curated description of a previously unseen failure that had significant availability impact

Helps learn from failures

Blame-free process

Page 20:

What a Post-Mortem Contains

Description of failure, with detailed timeline

Root-cause(s) confirmed by reproducing the failure

Discussion of fixes, follow-up action items

Page 21:

Failure Examples and Impact

Examples
❖ Entire control plane fails
❖ Upgrade causes backbone traffic shift
❖ Multiple top-of-rack switches fail

Impact
❖ Data center goes offline
❖ WAN capacity falls below demand
❖ Several services fail concurrently

Page 22:

Key Quantitative Results

70% of failures occur when a management plane operation is in progress
❖ Evolution impacts availability

Failures are everywhere: all three networks and all three planes see comparable failure rates
❖ No silver bullet

80% of failure durations are between 10 and 100 minutes
❖ Need fast recovery

Page 23:

Root causes

Lessons learned from root causes motivate availability design principles

Page 24:

Why is high network availability a challenge?

What are the characteristics of network availability failures?

What design principles can achieve high availability?
❖ Re-Think Management Plane
❖ Avoid and Mitigate Large Failures
❖ Evolve or Die

Page 25:

Re-think the Management Plane

Page 26:

Availability Principle: Minimize Operator Intervention

Operator types wrong CLI command, runs wrong script

Backbone router fails

Page 27:

Availability Principle

To upgrade part of a large device…
❖ Line card, block of Clos fabric

… proceed while rest of device carries traffic
❖ Enables higher availability

Necessary for upgrade-in-place

Page 28:

Availability Principle: Assess risk continuously

Ensure residual capacity > demand

Early risk assessments were manual
❖ Risky!
❖ High packet loss
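The check "residual capacity > demand" can be sketched in a few lines. This is a minimal illustration, not the tooling described in the talk: the `Link` type, the function names, the Gb/s units and the example numbers are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class Link:
    name: str                 # hypothetical link identifier
    capacity_gbps: float
    drained: bool = False     # already out of service


def residual_capacity(links: Iterable[Link], to_drain: Set[str]) -> float:
    """Capacity that remains in service if the links in `to_drain` are removed."""
    return sum(l.capacity_gbps for l in links
               if not l.drained and l.name not in to_drain)


def is_safe_to_drain(links: Iterable[Link], to_drain: Set[str],
                     demand_gbps: float) -> bool:
    """True only if residual capacity strictly exceeds current demand."""
    return residual_capacity(links, to_drain) > demand_gbps


# Hypothetical example: four 100 Gb/s links carrying 250 Gb/s of demand.
links = [Link(f"link{i}", 100.0) for i in range(4)]
print(is_safe_to_drain(links, {"link0"}, 250.0))           # one link drained
print(is_safe_to_drain(links, {"link0", "link1"}, 250.0))  # two links drained
```

Because both capacity and demand change over time, a check like this is only useful if it is re-evaluated throughout the operation rather than once at the start.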

Page 29:

Re-think the Management Plane

“Intent”: I want to upgrade this router

Management Plane Software
❖ Management Operations
❖ Device Configurations
❖ Tests to Verify Operation

Page 30:

Re-think the Management Plane

Management Operations, Device Configurations, Tests to Verify Operation

Management Plane Run-time
❖ Apply Configuration
❖ Perform management operation
❖ Verify operation

Assess Risk Continuously
❖ Automated Risk Assessment

Minimize Operator Intervention
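One way to picture the run-time loop on this slide is a plan of steps, each with an action and a verification test, executed only while a risk check keeps passing. The sketch below is a toy model under stated assumptions: `Step`, `Plan`, `run_plan` and the dictionary standing in for a router are hypothetical names, not the management plane software the talk describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    apply: Callable[[], None]    # apply a configuration or perform the operation
    verify: Callable[[], bool]   # test that confirms the step took effect


@dataclass
class Plan:
    intent: str                  # e.g. "I want to upgrade this router"
    steps: List[Step] = field(default_factory=list)


def run_plan(plan: Plan, risk_ok: Callable[[], bool]) -> List[str]:
    """Run steps in order; stop as soon as the risk check or a verification fails."""
    log: List[str] = []
    for step in plan.steps:
        if not risk_ok():        # risk is re-assessed before every step
            log.append(f"halted before {step.name}: risk check failed")
            break
        step.apply()
        if not step.verify():
            log.append(f"halted after {step.name}: verification failed")
            break
        log.append(f"{step.name}: ok")
    return log


# Hypothetical usage: a toy router modeled as a dict.
router = {"drained": False, "version": 1}
plan = Plan(
    intent="I want to upgrade this router",
    steps=[
        Step("drain", lambda: router.update(drained=True), lambda: router["drained"]),
        Step("upgrade", lambda: router.update(version=2), lambda: router["version"] == 2),
        Step("undrain", lambda: router.update(drained=False), lambda: not router["drained"]),
    ],
)
log = run_plan(plan, risk_ok=lambda: True)
print(log)
```

The operator states the intent once; the loop, not the operator, decides whether each step may proceed.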

Page 31:

Avoid and Mitigate Large Failures

Page 32:

Availability Principle
❖ Contain failure radius
❖ Fail open

B4 and data centers have a dedicated control-plane network
❖ Failure of this can bring down the entire control plane

Page 33:

Fail Open

Preserve forwarding state of all switches
❖ Fail-open the entire data center

Exceedingly tricky!

[Figure labels: Centralized Control Plane, Data center, Traffic]
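The idea of failing open can be illustrated with a toy switch agent that keeps its last known forwarding state when it loses the controller instead of clearing it. The class and method names are hypothetical, and the sketch deliberately ignores what makes this "exceedingly tricky" in practice, such as forwarding state going stale while the controller is away.

```python
from typing import Dict, Optional


class SwitchAgent:
    """Toy model of a switch managed by a centralized control plane."""

    def __init__(self) -> None:
        self.forwarding_table: Dict[str, str] = {}
        self.controller_reachable = True

    def on_controller_update(self, table: Dict[str, str]) -> None:
        # Normal operation: install the state pushed by the controller.
        self.forwarding_table = dict(table)
        self.controller_reachable = True

    def on_controller_lost(self) -> None:
        # Fail open: record the loss but keep the last known forwarding
        # state, so traffic continues to flow.
        self.controller_reachable = False

    def next_hop(self, prefix: str) -> Optional[str]:
        return self.forwarding_table.get(prefix)


agent = SwitchAgent()
agent.on_controller_update({"10.0.0.0/8": "port1"})
agent.on_controller_lost()
print(agent.next_hop("10.0.0.0/8"))   # still forwards using preserved state
```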

Page 34:

Availability Principle: Design fallback strategies

A bug can cause state inconsistency between control plane components ➔ Capacity reduction in WAN or data center

Page 35:

Design Fallback Strategies

A large section of the WAN fails, so demand exceeds capacity

[Figure label: B4]

Page 36:

Design Fallback Strategies

Fallback to B2!

Can shift large traffic volumes from many data centers

[Figure labels: B2, B4]

Page 37:

Design Fallback Strategies

When centralized traffic engineering fails...
❖ … fall back to IP routing

Big Red Buttons
❖ For every new software upgrade, design controls so the operator can initiate fallback to a “safe” version
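The two fallbacks on this slide can be sketched as a small state holder: an automatic switch from centralized traffic engineering to IP routing, and an operator-triggered "big red button" that reverts to a known safe software version. All names and version strings below are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Mode(Enum):
    CENTRALIZED_TE = "centralized traffic engineering"
    IP_ROUTING = "IP routing"


class FallbackControls:
    """Toy model of the fallback controls; not an actual Google interface."""

    def __init__(self, new_version: str, safe_version: str) -> None:
        self.mode = Mode.CENTRALIZED_TE
        self.active_version = new_version
        self.safe_version = safe_version

    def on_te_health(self, healthy: bool) -> None:
        # When centralized traffic engineering fails, fall back to IP routing.
        if not healthy:
            self.mode = Mode.IP_ROUTING

    def big_red_button(self) -> None:
        # Operator-initiated fallback to the "safe" software version.
        self.active_version = self.safe_version


controls = FallbackControls(new_version="v2-candidate", safe_version="v1-safe")
controls.on_te_health(healthy=False)
controls.big_red_button()
print(controls.mode.value, controls.active_version)
```

The design point is that the fallback path exists and is exercised before it is needed, so invoking it during an outage is a single, well-understood action.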

Page 38:

Evolve or Die!

Page 39:

We cannot treat a change to the network as an exceptional event

Page 40:

Evolve or Die

Make change the common case

Make it easy and safe to evolve the network daily
❖ Forces management automation
❖ Permits small, verifiable changes

Page 41:

Conclusion

Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

Page 42:

Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure

Presentation template from SlidesCarnival

Page 43:

Older Slides

Page 44:

Popular root-cause categories

Cabling error, interface card failure, cable cut…

Page 45:

Popular root-cause categories

Operator types wrong CLI command, runs wrong script

Page 46:

Popular root-cause categories

Incorrect demand or capacity estimation for upgrade-in-place

Page 47:

Upgrade in place

Page 48:

Assessing Risk Correctly

Residual Capacity?
❖ Varies by interconnect

Demand?
❖ Can change dynamically

Page 49:

Popular root-cause categories

Hardware or link layer failures in control plane network

Page 50:

Popular root-cause categories

Two control plane components have inconsistent views of control plane state, caused by a bug

Page 51:

Popular root-cause categories

Running out of memory, CPU, OS resources (threads)...

Page 52:

Lessons from Failures

The role of evolution in failures
▸ Rethink the Management Plane

The prevalence of large, severe failures
▸ Prevent and mitigate large failures

Long failure durations
▸ Recover fast

Page 53:

High-level Management Plane Abstractions

“Intent”: I want to upgrade this router

Why is this difficult? Modern high capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high capacity fabrics with complicated dynamics
❖ Have configuration files which run into hundreds of thousands of lines

Page 54:

High-level Management Plane Abstractions

“Intent”: I want to upgrade this router

Management Plane Software
❖ Management Operations
❖ Device Configurations
❖ Tests to Verify Operation

Page 55:

Management Plane Automation

Management Plane Software
❖ Management Operations
❖ Device Configurations
❖ Tests to Verify Operation

Apply Configuration

Perform management operation

Verify operation

Assess Risk Continuously

Minimize Operator Intervention

Page 56:

Large Control Plane Failures

[Figure label: Centralized Control Plane]

Page 57:

Contain the blast radius

Smaller failure impact, but increased complexity

[Figure labels: Centralized Control Plane (two instances)]

Page 58:

Fail-Open

Preserve forwarding state of all switches
❖ Fail-open the entire fabric

[Figure label: Centralized Control Plane]

Page 59:

Defensive Control-Plane Design

One piece of this large update seems wrong!!

[Figure labels: Gateway, Topology Modeler, TE Server, BwE]

Page 60:

Trust but Verify

Let me check the correctness of the update...

[Figure labels: Gateway, Topology Modeler, TE Server, BwE]

Page 61:

Fallback to B2

[Figure labels: Gateway, Topology Modeler, TE Server, BwE, B2]

Page 62:

Mitigating Large Failures

Design Fallback Strategies
▸ B4 ➔ B2
▸ Tunneling ➔ IP routing
▸ Big Red Buttons

Page 63:

Continuously Monitor Invariants

Must have one functional backup SDN controller

Anycast route must have AS path length of 3

Data center must peer with two B2 routers
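The three invariants on this slide lend themselves to a simple table of named predicates evaluated against a snapshot of network state. The sketch below assumes a hypothetical snapshot dictionary and field names. It also reads "one" backup controller and "two" B2 routers as minimums, which is an interpretation rather than something the slide states.

```python
from typing import Callable, Dict, List

# Each invariant is a name plus a predicate over a state snapshot.
# The snapshot keys are hypothetical, chosen for this sketch.
INVARIANTS: Dict[str, Callable[[dict], bool]] = {
    "has a functional backup SDN controller":
        lambda s: s["functional_backup_controllers"] >= 1,   # "one" read as a minimum
    "anycast route has AS path length of 3":
        lambda s: s["anycast_as_path_length"] == 3,
    "data center peers with two B2 routers":
        lambda s: s["b2_peers"] >= 2,                        # "two" read as a minimum
}


def violated_invariants(state: dict) -> List[str]:
    """Return the names of all invariants that do not hold for `state`."""
    return [name for name, holds in INVARIANTS.items() if not holds(state)]


healthy = {"functional_backup_controllers": 1,
           "anycast_as_path_length": 3,
           "b2_peers": 2}
degraded = dict(healthy, b2_peers=1)

print(violated_invariants(healthy))
print(violated_invariants(degraded))
```

Running such checks continuously, rather than only after an incident, is what turns an invariant into an early warning.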

Page 64:

This Alone isn’t Enough...

Page 65:

We cannot treat a change to the network as an exceptional event

Page 66:

Evolve or Die

Make change the common case

Make it easy and safe to evolve the network daily
❖ Forces management automation
❖ Permits small, verifiable changes

Page 67:

Key Takeaway

Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

Page 68:

Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google

Presentation template from SlidesCarnival

Page 69:

Impact of Availability Failures

Page 70:

A Case Study: Google

Why is high network availability a challenge?

What are the characteristics of network availability failures?

What design principles can achieve high availability?

Page 71:

The velocity of evolution is fueled by traffic growth...

Page 72:

… and by an increase in product and service offerings

Page 73:

Networks have very different designs
❖ Different hardware
❖ Different control planes
❖ Different forwarding paradigms

These differences increase management and evolution complexity

Page 74:

Data centers
❖ Fabrics with merchant silicon chips
❖ Centralized control plane
❖ Out-of-band control plane network

[Figure label: Control plane network]

[Citation label on slide: SIGCOMM 2015]

Page 75:

B4
❖ B4 routers built using merchant silicon chips
❖ Centralized control plane within each B4 site
❖ Centralized traffic engineering
❖ Bandwidth enforcement for traffic metering

[Figure labels: Gateway, Topology Modeler, TE Server, BwE]

[Citation labels on slide: SIGCOMM 2015, SIGCOMM 2013]

Page 76:

B2
❖ B2 routers based on vendor gear
❖ Decentralized routing and MPLS TE
❖ Class of service (high/low) using MPLS priorities

[Figure label: Other ISPs]

Page 77:

The Management Plane

Low-level, per-device abstractions for management operations

Page 78:

Where do failures happen?

No single network or plane dominates

Page 79:

How long do the failures last?

Durations much longer than outage budgets

Shorter failures on B2

Page 80:

What role does evolution play?

70% of failures happen when a management operation is in progress

Page 81:

Where do failures happen?

[Figure: failure counts per network and plane; the numeric labels did not survive extraction in a recoverable layout. Remaining label: Control plane network]

Page 82:

Failures are everywhere

Page 83:

Across networks

[Figure: five annotations, each reading "All three"]

Page 84:

Across planes

[Figure: seven annotations naming a plane: Data, Management, Data, Data, Control, Management, Management]

Page 85:

Root-Cause Categorization

What are the root causes for these failures?

Page 86:

Rethink the Management Plane

Low-level network management cannot ensure high availability

Page 87:

Re-think the Management Plane

“Intent”: I want to upgrade this router

Lots of complexity hidden below this statement. Modern high capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high capacity fabrics with complicated dynamics
❖ Have configuration files which run into hundreds of thousands of lines

Page 88:

Contain failure radius

Each partition is managed by a different control plane

Even if one partition fails, others can carry traffic

Adds design complexity

[Figure labels: Centralized Control Plane (two instances)]

Page 89:

Key Takeaway

Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

Page 90:

By learning from failures

Page 91:

What has Google Learnt from Failures?

Why is high network availability a challenge?
▸ Factors that impact availability

What are the characteristics of network failures?
▸ Severity, duration, prevalence
▸ Root-cause categorization

What design principles can achieve high availability?
▸ Lessons learned from root-causes

Page 92:

In a global network
❖ Failures are common
❖ Configuration can change

These can impact network availability

[Figure labels: Data Center (three sites)]

Page 93:

How long does it take...

… to root-cause a failure: 10s of minutes to hours

… to upgrade part of the network: hours to days

[Figure labels: Data Center (three sites)]

Page 94:

Outage budgets...

… for four 9s availability? 99.99% uptime: 4 minutes per month

… for five 9s availability? 99.999% uptime: 24 seconds per month

Page 95:

To move towards higher availability targets, it is important to learn from failures