DPC 2016 - 53 Minutes or Less - Architecting For Failure

53 Minutes or Less - Architecting For Failure In The Cloud

Ben Andersen-Waine

Upload: benwaine

Post on 16-Apr-2017

TRANSCRIPT

Page 1: DPC 2016 - 53 Minutes or Less - Architecting For Failure

53 Minutes or Less - Architecting For Failure In The Cloud

Ben Andersen-Waine

53 Minutes?

99.99%

Availability (%) | Downtime per year | Downtime per month | Downtime per week
90               | 36.5 days         | 72 hours           | 16.8 hours
99               | 3.65 days         | 7.2 hours          | 1.68 hours
99.9             | 8.76 hours        | 43.8 min           | 10.1 min
99.99            | 52.56 min         | 4.38 min           | 1.01 min

Adapted From: https://en.wikipedia.org/wiki/High_availability
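
The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a month of one twelfth of a year (the names `PERIOD_HOURS` and `allowed_downtime_minutes` are illustrative, not from the talk):

```python
# Hours in each reporting period (assumption: 365-day year, month = year / 12).
PERIOD_HOURS = {"year": 365 * 24, "month": 365 * 24 / 12, "week": 7 * 24}


def allowed_downtime_minutes(availability_percent: float, period: str) -> float:
    """Return the downtime budget, in minutes, for one period."""
    unavailable_fraction = 1 - availability_percent / 100
    return PERIOD_HOURS[period] * 60 * unavailable_fraction


if __name__ == "__main__":
    for period in ("year", "month", "week"):
        budget = allowed_downtime_minutes(99.99, period)
        print(f"99.99% over one {period}: {budget:.2f} minutes")
```

For 99.99% over a year this gives the roughly 53 minutes in the talk's title.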

Architecting For Failure?

Who are you?

1) You have some kind of web application / service

2) You are using an IaaS cloud provider

3) The service needs to be “highly available”

SAMPLE

http://example.com/more/info/README

High Level Content

Deeper Reading

Infrastructure

• Regions & Availability Zones

• Autoscaling

• Multi Region

Regions And Availability Zones

“Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones. Amazon EC2 provides you the ability to place resources, such as instances, and data in multiple locations. Resources aren't replicated across regions unless you do so specifically.”

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
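
One way to read the quote: because zones are isolated, spreading instances across them limits what a single zone failure can take out. A toy sketch of that placement, assuming round-robin assignment and hypothetical zone names (this models the idea only and calls no AWS API):

```python
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin and return the zone of each instance."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_capacity(placement, failed_zone):
    """Count the instances that are still running if one zone is lost."""
    return sum(1 for zone in placement if zone != failed_zone)


if __name__ == "__main__":
    # Zone names are placeholders, not real identifiers.
    placement = spread_across_zones(6, ["zone-a", "zone-b", "zone-c"])
    print(placement)
    print(surviving_capacity(placement, "zone-a"))
```

With six instances over three zones, losing any one zone still leaves four instances serving traffic.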

http://aws.amazon.com/about-aws/global-infrastructure/

Auto Scaling

“Auto Scaling helps you maintain application availability and allows you to scale your Amazon EC2 capacity up or down automatically according to conditions you define. ”

https://aws.amazon.com/autoscaling/

• Instance metrics (useful for containers)

• Load balancer health check (useful for web apps on EC2)
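
The health-check path can be pictured as a reconcile loop: compare the desired capacity with the instances that currently pass their check, terminate the failures, and launch replacements. A minimal sketch under that assumption (the `reconcile` function and the instance ids are hypothetical; this is a toy model, not the AWS Auto Scaling API):

```python
def reconcile(desired_capacity, instance_health):
    """Decide what to do so that healthy instances match the desired capacity.

    instance_health maps an instance id to True (healthy) or False (unhealthy).
    Returns (ids_to_terminate, number_to_launch).
    """
    unhealthy = sorted(i for i, healthy in instance_health.items() if not healthy)
    healthy_count = len(instance_health) - len(unhealthy)
    to_launch = max(0, desired_capacity - healthy_count)
    return unhealthy, to_launch


if __name__ == "__main__":
    # One of three instances is failing its load balancer health check.
    health = {"i-1": True, "i-2": False, "i-3": True}
    print(reconcile(3, health))
```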

Multi Region

DevOps

One day I had this fantasy of starting a certification service for operations. The certification assessment would consist of a colleague and I turning up at the corporate data center and setting about critical production servers with a baseball bat, a chainsaw, and a water pistol. The assessment would be based on how long it would take for the operations team to get all the applications up and running again.

http://martinfowler.com/bliki/PhoenixServer.html

Immutable Infrastructure

DevOps

• Environment Creation

• Releasing

• Secret Management

• Service Discovery

Environment Creation

• Vendor tools (AWS CloudFormation / Google Cloud Deployment Manager)

• Third-party tools (Terraform, Ansible)

Immutable Infrastructure

http://martinfowler.com/bliki/SnowflakeServer.html

Configuration changes are regularly needed to tweak the environment so that it runs efficiently and communicates properly with other systems. This requires some mix of command-line invocations, jumping between GUI screens, and editing text files.

The result is a unique snowflake - good for a ski resort, bad for a data center.

Releases: Build An Artifact

• Build a VM image (AWS AMI / GCE image)

• Use Containers

Releases: Building A VM

Releases: Building A Container

Releases: Canaries

http://martinfowler.com/bliki/CanaryRelease.html
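
A canary release sends a small share of traffic to the new version before everyone gets it. One possible way to pick that share, sketched here as an assumption rather than the talk's method, is to hash a stable user id into 100 buckets so each user consistently sees the same version (`bucket` and `choose_release` are illustrative names):

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_release(user_id: str, canary_percent: int) -> str:
    """Route roughly canary_percent of users to the canary, the rest to stable."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, choose_release(user, canary_percent=10))
```

Raising `canary_percent` step by step widens the rollout, and setting it to 0 rolls everyone back to the stable version.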

Releases: Blue / Green Deploy

https://cloudnative.io/blog/2015/02/the-dos-and-donts-of-bluegreen-deployment/
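
In a blue / green deploy two identical environments exist, one live and one idle. The new release goes to the idle one, and traffic is switched only once it looks healthy. A minimal sketch, assuming the switch is a single pointer flip guarded by a caller-supplied health check (`BlueGreenRouter` is a hypothetical name, not a real tool):

```python
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic."""

    def __init__(self, live="blue"):
        self.live = live

    @property
    def idle(self):
        """The environment that is not currently serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def switch(self, health_check):
        """Promote the idle environment if health_check(environment) passes.

        Returns True when traffic was switched, False when it was left alone.
        """
        candidate = self.idle
        if not health_check(candidate):
            return False
        self.live = candidate
        return True


if __name__ == "__main__":
    router = BlueGreenRouter()
    print(router.switch(lambda env: True), router.live)
```

Because the previous environment is left untouched, rolling back is the same switch in the other direction.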

Service Discovery

https://www.nginx.com/blog/service-discovery-in-a-microservices-architecture/

• https://github.com/coreos/etcd

• https://www.consul.io/

• https://zookeeper.apache.org/
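
The tools above differ in detail, but the core idea can be sketched as a registry in which instances announce themselves with heartbeats and drop out of lookups once they stop. The following is an in-memory illustration under that assumption (`ServiceRegistry` and its methods are hypothetical and do not mirror the etcd, Consul, or ZooKeeper APIs):

```python
import time


class ServiceRegistry:
    """In-memory registry: entries expire when heartbeats stop arriving."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._last_seen = {}  # (service name, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register an instance, or refresh it if it is already known."""
        self._last_seen[(service, address)] = self._clock()

    def lookup(self, service):
        """Return the addresses of instances seen within the TTL."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )


if __name__ == "__main__":
    registry = ServiceRegistry(ttl_seconds=10.0)
    registry.heartbeat("api", "10.0.0.1:8080")
    print(registry.lookup("api"))
```

A client that looks up `api` just before each call never needs a hard-coded address, which is what lets instances come and go.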

Secrets

• Use secret keeper or vault

• Use environment variables
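
For the environment-variable option, a small sketch of reading a secret at startup and failing fast when it is absent. The variable name `DB_PASSWORD` and the helper `require_secret` are assumptions for illustration only:

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def require_secret(name, environ=os.environ):
    """Return the secret stored under name, or raise if it is unset or empty."""
    value = environ.get(name)
    if not value:
        # Report only the name; never log the value of a secret.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value


if __name__ == "__main__":
    try:
        password = require_secret("DB_PASSWORD")
        print("secret loaded,", len(password), "characters")
    except MissingSecretError as error:
        print("startup aborted:", error)
```

Keeping the secret out of the built artifact means the same image or container can be promoted between environments unchanged.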

• https://www.vaultproject.io/

• https://square.github.io/keywhiz/

Software Development

General Best Practice

• Write tests (preferably first)

• Continuously integrate

• Write Documentation

Problem: Services Go Away

Circuit Breaking

http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Available solutions:

• https://github.com/Netflix/Hystrix

• https://github.com/ejsmont-artur/php-circuit-breaker
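
The pattern itself is small enough to sketch. The version below is a simplified illustration, not the Hystrix or php-circuit-breaker implementation: it opens after a number of consecutive failures, fails fast while open, and lets one trial call through once a timeout has passed. All names and default values are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so allow this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as the cue to serve a fallback.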

Problem: Spiky Workloads

Queue Based Load Levelling

https://msdn.microsoft.com/en-gb/library/dn589783.aspx
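
A toy simulation of the idea: a burst lands in a queue and a worker drains it at a steady rate, so the backend never sees the spike. Discrete ticks and a fixed `service_rate` are simplifying assumptions made for this sketch, not part of the linked pattern description.

```python
from collections import deque


def simulate(arrivals_per_tick, service_rate):
    """Return the queue depth after each tick.

    arrivals_per_tick lists how many messages arrive in each tick;
    the worker processes at most service_rate messages per tick.
    """
    queue = deque()
    depths = []
    for tick, arrivals in enumerate(arrivals_per_tick):
        # The whole burst is accepted immediately by the queue.
        queue.extend((tick, n) for n in range(arrivals))
        # The worker drains at its own steady pace.
        for _ in range(min(service_rate, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths


if __name__ == "__main__":
    print(simulate([10, 0, 0, 0, 0], service_rate=3))
```

A burst of ten messages is absorbed at once and worked off over the following ticks. The trade-off is latency for the queued messages in exchange for a flat load on the service.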

Priority Queue

https://msdn.microsoft.com/en-gb/library/dn589794.aspx
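
A minimal in-process sketch of a priority queue, assuming that a lower number means higher priority and that messages of equal priority keep their arrival order. `PriorityMessageQueue` is an illustrative name; a real deployment would typically use a message broker rather than a Python heap.

```python
import heapq
from itertools import count


class PriorityMessageQueue:
    """Messages with a lower priority number are delivered first."""

    def __init__(self):
        self._heap = []
        self._sequence = count()  # tie-breaker that preserves arrival order

    def put(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._sequence), message))

    def get(self):
        """Remove and return the most urgent message."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    messages = PriorityMessageQueue()
    messages.put(5, "nightly report")
    messages.put(1, "payment")
    while messages:
        print(messages.get())
```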

Competing Consumers

https://msdn.microsoft.com/en-gb/library/dn568101.aspx
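
With competing consumers, several workers pull from the same queue and each message is handled by exactly one of them. A small sketch using threads and the standard library queue, which is an assumption made to keep the example self-contained (in production the consumers would usually be separate processes or instances reading from a message broker):

```python
import queue
import threading


def run_consumers(messages, consumer_count, handler):
    """Process every message once, sharing the work between consumers."""
    work = queue.Queue()
    for message in messages:
        work.put(message)

    results = []
    results_lock = threading.Lock()

    def consume():
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this consumer stops
            outcome = handler(message)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=consume) for _ in range(consumer_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(sorted(run_consumers(range(10), consumer_count=3, handler=lambda m: m * 2)))
```

The order of results is not guaranteed, which is why the example sorts them. Adding consumers raises throughput, and losing one only slows the queue down.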

Monitoring / SLAs

SLA - Service Level Agreement

http://www.nkarten.com/handbook.pdf

Monitoring

Obligatory Meme

The Simian Army

http://techblog.netflix.com/2011/07/netflix-simian-army.html

https://github.com/Netflix/SimianArmy/

Final Thoughts

Questions

Feedback

https://joind.in/talk/41c42