cloud immortality - architecting for high availability & disaster recovery

25
Cloud Immortality: Architecting for High Availability & Disaster Recovery Brian Adler, Sr. Professional Services Architect, RightScale Josep Blanquer, Sr. Systems Architect, RightScale Watch the video of this presentation

Upload: rightscale

Post on 20-Aug-2015

879 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Cloud Immortality - Architecting for High Availability & Disaster Recovery

1

Cloud Immortality: Architecting for High Availability & Disaster Recovery

Brian Adler, Sr. Professional Services Architect, RightScale

Josep Blanquer, Sr. Systems Architect, RightScale

Watch the video of this presentation

Page 2: Cloud Immortality - Architecting for High Availability & Disaster Recovery

2

Real Cloud Experience. Shared.

# 2

Agenda

• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)

• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&A

Page 3: Cloud Immortality - Architecting for High Availability & Disaster Recovery

3

Real Cloud Experience. Shared.

# 3

Terminology• Fault Tolerance

• Designs incorporating redundancy and replication to enable systems to continue operating properly (perhaps at a degraded level) if one or more components fails

• High Availability (HA)• Fault Tolerant systems are measured by their Availability in terms of

planned and unplanned service outages for end users• 99% Availability = 3.65 days of downtime per year• 99.9% Availability = 8.76 hours of downtime per year• 99.99% Availability = 53 minutes of downtime per year• 99.999% Availability = 5.26 minutes of downtime per year

• Disaster Recovery (DR)• The process, policies and procedures related to restoring critical systems

after a catastrophic event

Page 4: Cloud Immortality - Architecting for High Availability & Disaster Recovery

4

Real Cloud Experience. Shared.

# 4

Agenda

• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)

• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&A

Page 5: Cloud Immortality - Architecting for High Availability & Disaster Recovery

5

Real Cloud Experience. Shared.

# 5

Takeaways• Introduction to architectural options for designing highly-

available, fault-tolerant applications

• Best Practices for implementation of these architectural options

• Multi-Availability Zone (AZ), Multi-Region, and Multi-Cloud• Architectural options• Considerations/pros and cons of these options

Page 6: Cloud Immortality - Architecting for High Availability & Disaster Recovery

6

Real Cloud Experience. Shared.

# 6

Agenda

• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)

• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&A

Page 7: Cloud Immortality - Architecting for High Availability & Disaster Recovery

7

Real Cloud Experience. Shared.

# 7

Designing for Failure• Large scale failures in the cloud are rare but do happen• Application owners are ultimately responsible for

availability and recoverability• Need to balance cost and complexity of HA efforts against

risk(s) you are willing to bear• Cloud infrastructure has made DR and HA remarkably

affordable versus past options• Multi-server• Multi-AZ• Multi-Region• Multi-Cloud

Page 8: Cloud Immortality - Architecting for High Availability & Disaster Recovery

8

Real Cloud Experience. Shared.

# 8

Agenda

• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)

• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&A

Page 9: Cloud Immortality - Architecting for High Availability & Disaster Recovery

9

Real Cloud Experience. Shared.

# 9

What do we mean by “Cloud”?

• A cloud is a physical datacenter entity behind an API endpoint

• What does that really mean?• Amazon Web Services is not a cloud• EC2 is not a cloud• Eucalyptus, Cloud.com, OpenStack are not clouds• EC2 US-East, Rackspace, ‘my private cloud’… these are clouds• An availability zone is not a cloud (but it is part of one)

Think of a cloud as a “resource pool” accessed via an API

Page 10: Cloud Immortality - Architecting for High Availability & Disaster Recovery

10

Real Cloud Experience. Shared.

# 10

Regions & Availability Zones

• Zones within a region share a LAN (high bandwidth, low latency, private IP access)• Zones utilize separate power sources, are physically segregated• Regions are “islands”, and share no resources

Page 11: Cloud Immortality - Architecting for High Availability & Disaster Recovery

11

Real Cloud Experience. Shared.

# 11

Overcoming Multi-Cloud Pain Points• APIs differ

• Different sets of available resources• Different formats, encodings and versions

• Abstractions and features differ• Network architectures differ: VLANs, security groups, NAT, IPs, ACLs, …• Storage architectures differ: local/attachable disks, backup, snapshots, …• Hypervisors, machine images, cost models, billing, reporting…etc.

Each cloud is unique in some/many/all respects, with differentaccess mechanisms and varying functionalities providedby the managed resources.

Page 12: Cloud Immortality - Architecting for High Availability & Disaster Recovery

12

Real Cloud Experience. Shared.

# 12

Overcoming Multi-Cloud Pain Points• Navigating the obstacles

• Design using generic concepts (“durable storage”) yet deploy using cloud specifics (“EBS volumes”)

• Have tools that translate your concepts to cloud-specific ones (e.g. scripts/recipes that choose the correct provider for the desired resource)

• Think of how to share resources across clouds (i.e. data sharing)

Page 13: Cloud Immortality - Architecting for High Availability & Disaster Recovery

13

Real Cloud Experience. Shared.

# 13

Agenda

• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)

• Infrastructure abstraction and automation as building blocks for highly available applications

• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&A

Page 14: Cloud Immortality - Architecting for High Availability & Disaster Recovery

14

Real Cloud Experience. Shared.

# 14

HA/DR Checklist for Risk Mitigation

Determine who owns the architecture, DR process and testing. Develop expertise in-house and / or get outside help. Conduct a risk assessment for each application. Specify your target Recovery Time Objective

and Recovery Point Objective. Design for failure starting with application architecture. This will help

drive the infrastructure architecture. Implement HA best practices balancing cost, complexity and risk.

Automate infrastructure for consistency and reliability.

Document operational processes and automations. Test the failover... then test it again. Release the Chaos Monkey.

Page 15: Cloud Immortality - Architecting for High Availability & Disaster Recovery

15

Real Cloud Experience. Shared.

# 15

General HA Best Practices

Avoid single points of failure Always place (at least) one of each component (load balancers,

app servers, databases) in at least two AZs Maintain sufficient capacity to absorb AZ / cloud failures

Reserved Instances – guarantee capacity is available in a separate region/cloud

Replicate data across AZs and backup or replicate across clouds/regions for failover

Setup monitoring, alerts and operations to identify and automate problem resolution or failover process

Design stateless applications for resilience to reboot / relaunch

Page 16: Cloud Immortality - Architecting for High Availability & Disaster Recovery

16

Real Cloud Experience. Shared.

# 16

Agenda

• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)

• Infrastructure abstraction and automation as building blocks for highly available applications

• HA and DR Checklists and Best Practices• Building a complete HA example across multiple clouds• Conclusions / Q&A

Page 17: Cloud Immortality - Architecting for High Availability & Disaster Recovery

17

Real Cloud Experience. Shared.

# 17

Zone1 Zone2

The Start: A Typical Single-Cloud Application

LB

App

Master

LB

App App

Slave

Cloud 1• Three tiered Application

• RR DNS Load Balancers• Array of Application Servers• Master – Slave Databases

• Isolation Features• Faults: using Availability Zones

• Security: Security Groups / Firewalls

• Plus• Monitoring• Alerts• Automation

Page 18: Cloud Immortality - Architecting for High Availability & Disaster Recovery

18

Real Cloud Experience. Shared.

# 18

Scaling: Bursting App Capacity to another Cloud

LB

Master

LB

Slave

App App App…

Private Cloud Public Cloud

App App App…

Page 19: Cloud Immortality - Architecting for High Availability & Disaster Recovery

19

Real Cloud Experience. Shared.

# 19

ServerArray

Scaling: Bursting App Capacity to another Cloud

LB

App

Master

LB

App App

Slave

… App App App…

Public Cloud

CPUScale UpScale Down

Private Cloud

Page 20: Cloud Immortality - Architecting for High Availability & Disaster Recovery

20

Real Cloud Experience. Shared.

# 20

Public Cloud

ServerArray

CPUScale UpScale Down

Scaling: Bursting App Capacity to another Cloud

LB

App

Master

LB

App App

Slave

… App App App…

Contact DB:• Using DNS / IP• Executing a script

Security:• Access Control• Encryption

Public Internet

Private Cloud

Register with LB’s:• Using SSH• Using RightNet• Central repo

Page 21: Cloud Immortality - Architecting for High Availability & Disaster Recovery

21

Real Cloud Experience. Shared.

# 21

Public Cloud

ServerArray

CPUScale UpScale Down

DR: Replicating Data in another Cloud

LB

App

Master

LB

App App

Slave

… App App App…

Private Cloud

Where’s the data?

Public Internet

Slavesnapsnapsnapsnap

snapsnapsnapsnap

snapsnapsnapsnap

Access and encryption

Page 22: Cloud Immortality - Architecting for High Availability & Disaster Recovery

22

Real Cloud Experience. Shared.

# 22

Public Cloud

ServerArray

CPUScale UpScale Down

DR: Replicating Data in another Cloud

LB

App

Master

LB

App App

Slave

… App App App…

Private Cloud

Public Internet

Slave

snapsnapsnapsnap

snapsnapsnapsnap

snapsnapsnapsnap

• Global storage• Live replication + rsync• NoSQL

S3/Cloudfiles

Page 23: Cloud Immortality - Architecting for High Availability & Disaster Recovery

23

Real Cloud Experience. Shared.

# 23

Public Cloud

ServerArray

CPUScale UpScale Down

Full distribution: Splitting the Load Balancers

LB

App

Master

LB

App App

Slave

… App App App…

Private Cloud

Public Internet

Slave

snapsnapsnapsnap

snapsnapsnapsnap

snapsnapsnapsnap

Page 24: Cloud Immortality - Architecting for High Availability & Disaster Recovery

24

Real Cloud Experience. Shared.

# 24

Agenda

• Terminology/Level-Setting• Takeaways• Designing for Failure• Cloud and component definitions (more terminology)

• Infrastructure abstraction and automation as building blocks for highly available applications

• Building a complete HA example across multiple clouds• Conclusions / Q&A

Page 25: Cloud Immortality - Architecting for High Availability & Disaster Recovery

25

Real Cloud Experience. Shared.

# 25

So how do I make my service immortal?• Be pessimistic: Design for failure

• Assume everything will fail and architect a solution capable of handling it• Trade off between levels of resiliency and cost

• Embrace the Cloud mentality: Unprecedented features• Build architectural building blocks you can reuse, build up and teardown• Use the powerful provider capabilities within a cloud• Build the glue across clouds for global redundancy (RightScale helps a lot)• Automate every single last detail

• Have it your way: No single architecture that fits all cases• Reuse all patterns that fit your use case • Customize where they don’t fit

• Crawl, then walk: Build HA within cloud, then expand• Cost exponentially increases, the 80-20 rule.• Multi-AZ HA, solid DR plan, then full multi-cloud HA