outage-proof your applications - rightscale compute 2013

34
april25-26 sanfrancisco cloud success starts here Outage-Proof Your Applications Brian Adler, Sr. Services Architect, RightScale Sanket Naik, VP Cloud Operations, Coupa

Upload: rightscale

Post on 20-Aug-2015

434 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Outage-Proof Your Applications - RightScale Compute 2013

april25-26 sanfrancisco

cloud success starts here

Outage-Proof Your ApplicationsBrian Adler, Sr. Services Architect,

RightScaleSanket Naik, VP Cloud Operations,

Coupa

Page 2: Outage-Proof Your Applications - RightScale Compute 2013

# 2# 2

#RightscaleCompute

Agenda• Introductions• Terminology/Level-Setting

• Takeaways

• Cloud and Component Definitions• Designing for Failure• Architectural Options and Considerations

High AvailabilityDisaster Recovery

• Conclusions / Q&A

Page 3: Outage-Proof Your Applications - RightScale Compute 2013

# 3# 3

#RightscaleCompute

Our Mission

Delivering software innovation

that breeds responsible spending

while impacting the company bottom

line

Page 4: Outage-Proof Your Applications - RightScale Compute 2013

# 4# 4

#RightscaleCompute

What does success look like?

Manual Companies

10%Spend Under Management

MostCompanies

30%-50%Spend Under Management

Spend Under Management

80%

Source: Aberdeen Group

Page 5: Outage-Proof Your Applications - RightScale Compute 2013

# 5# 5

#RightscaleCompute

Spend Happens

Pre-Approved Un-Approved

We are the only solution that captures all three

ProcurementExpense

ManagementPost-ApprovedInvoice

Management

Page 6: Outage-Proof Your Applications - RightScale Compute 2013

Over 300 Customers and growing…You

smarter spending simplified

Page 7: Outage-Proof Your Applications - RightScale Compute 2013

# 7# 7

#RightscaleCompute

Terminology

High Availability (HA)

Disaster Recovery (DR)

Fault Tolerance

Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fails.

The process, policies and procedures related to restoring critical systems after a catastrophic event.

Goal is to get application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).

Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users.

Page 8: Outage-Proof Your Applications - RightScale Compute 2013

# 8# 8

#RightscaleCompute

Terminology - continued

Recovery Point Objective (RPO)

Recovery Time Objective (RTO)

Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives

Acceptable data loss as a result of a recovering from a disaster/catastrophic event

RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground

Page 9: Outage-Proof Your Applications - RightScale Compute 2013

# 9# 9

#RightscaleCompute

Takeaways• Understand core concepts behind HA and DR

• Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures

• Best Practices for implementation of these architectural options within AWS (independent of RightScale)• Multi-Availability Zone (AZ) and Multi-Region

• Architectural options and Considerations / pros and cons of these options

• Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments

Page 10: Outage-Proof Your Applications - RightScale Compute 2013

# 10# 10

#RightscaleCompute

Regions & Zones

• Zones within a region share a LAN (high bandwidth, low latency, private IP access)

• Zones utilize separate power sources, are physically segregated • Regions are “islands”, and share no resources.

Region 3

Availability Zone A

Availability Zone B

Region 2

Availability Zone A

Availability Zone B

Region 1

Availability Zone A

Availability Zone C

Availability Zone B

Region 4

Availability Zone A

Availability Zone B

Region 5

Availability Zone A

Availability Zone B

Page 11: Outage-Proof Your Applications - RightScale Compute 2013

# 11# 11

#RightscaleCompute

Designing for Failure

• Large scale failures in the cloud are rare but do happen

• Application owners are ultimately responsible for availability and recoverability

• Balance cost and complexity of HA efforts against risk(s) you are willing to bear

• Cloud infrastructure has made DR and HA remarkably affordable versus past options-Multi-Server-Multi-AZ (Availability Zone)-Multi-Region

“Everything fails, all the time.” Werner Vogels, CTO Amazon.com

Page 12: Outage-Proof Your Applications - RightScale Compute 2013

# 12# 12

#RightscaleCompute

Designing for Failure – Basic Concepts

• Fault tolerance is the goal. Degradation of service may occur, but application continues to function.

• Avoid single points of failure (SPOF)

• Assume everything fails (remember Werner’s mantra) and design accordingly

• Plan and practice your recovery process (both for HA and DR)

• Remember that better HA and DR equals more $$$. So find that acceptable balance.

Page 13: Outage-Proof Your Applications - RightScale Compute 2013

# 13# 13

#RightscaleCompute

High Availability

Don’t sweat the small stuff. And it’s all small stuff*

*(until it’s not)

Follow a few general best practices to absorbapplication component outages…

Page 14: Outage-Proof Your Applications - RightScale Compute 2013

# 14# 14

#RightscaleCompute

General HA Best Practices

• Avoid single points of failure.

• Always place one of each component (load balancers, app servers, databases) in at least two AZs.

• Replicate data across AZs (HA) and backup or replicate across regions for failover (DR)

• Setup monitoring, alerts and operations to identify and automate problem resolution or failover process.

Page 15: Outage-Proof Your Applications - RightScale Compute 2013

# 15# 15

#RightscaleCompute

Multi-Zone HA

SLAVE DBMASTER DB

SNAPSHOTS

LOAD BALANCERS

REPLICATE

DNS

S3

EBS

US-EAST 1a 1US-EAST 1b

LOAD BALANCERS

APP SERVERS

AUTOSCALE

172.168.7.31 172.168.8.62

Snapshot data volume for backups so the database can be readily

recovered within the region.

Place Slave databases in one or more zones for failover.

Consider local storage for additional slave database to remove

dependency on attached volume

Consider distributed

NoSQL databases with

the same distribution

considerations.

Page 16: Outage-Proof Your Applications - RightScale Compute 2013

# 16# 16

#RightscaleCompute

Disaster Recovery

DR presents a few new wrinkles compared to HA,but there are multiple options depending on yourneeds and budget…

Don’t sweat the small stuff. And it’s all small stuff*

*(until it’s not)

Page 17: Outage-Proof Your Applications - RightScale Compute 2013

# 17# 17

#RightscaleCompute

HA/DR Checklist for Risk Mitigation• Determine who owns the architecture, DR process and

testing.

• Develop expertise in-house and / or get outside help.

• Conduct a risk assessment for each application.

• Specify your target RTO and RPO.

• Design for failure starting with application architecture. This will help drive the infrastructure architecture.

Page 18: Outage-Proof Your Applications - RightScale Compute 2013

# 18# 18

#RightscaleCompute

HA/DR Checklist for Risk Mitigation• Implement HA best practices balancing cost,

complexity and risk.-Automate infrastructure for consistency and reliability.

• Document operational processes and automations.

• Test the failover... then test it again.

• Release the Chaos Monkey.

Page 19: Outage-Proof Your Applications - RightScale Compute 2013

# 19# 19

#RightscaleCompute

Multi-Region/Cloud DR Options

Cold DR

Warm DR

Hot DR

Multi-Cloud HA0

< 5 Mins

< 1 Hour

> 1 Hour

$ $$ $$$ $$$$

(Most Common)

(Recommended)

(Least Common)

(Live/Live Config)

DowntimeAvailability

99.999%

99.9%

99.5%

99%

Page 20: Outage-Proof Your Applications - RightScale Compute 2013

# 20# 20

#RightscaleCompute

Multi-Region Cold DR

LOAD BALANCERS

MASTER DB SLAVE DB

APP SERVERS

LOAD BALANCERS

REPLICATE

DNS

APP SERVERS

US WEST

SNAPSHOTS

172.168.7.31

SLAVE DB

US EAST

PersistentStorage

Staged Server Configuration and generally no staged data• Not recommended if rapid recovery is required• Slow to replicate data to other cloud and bring database online

Block Storage

Page 21: Outage-Proof Your Applications - RightScale Compute 2013

# 21# 21

#RightscaleCompute

Multi-Region Warm DR

LOAD BALANCERS

MASTER DB SLAVE DB

APP SERVERS

LOAD BALANCERS

REPLICATE

DNS

APP SERVERS

SLAVE DB

REPLICATE

US WEST

172.168.7.31

US EAST

SNAPSHOTS

Staged Server Configuration, pre-staged data and running Slave Database Server• Generally recommended DR solution• Minimal additional cost and allows fairly rapid recovery

SNAPSHOTS

Block Storage

PersistentStorage

Page 22: Outage-Proof Your Applications - RightScale Compute 2013

# 22# 22

#RightscaleCompute

APP SERVERS

Multi-Region Hot DR

LOAD BALANCERS

MASTER DB SLAVE DB

APP SERVERS

LOAD BALANCERS

REPLICATE

DNS

SLAVE DB

REPLICATE

US WEST

SNAPSHOTS

172.168.7.31

US EAST

Parallel Deployment with all servers running but all traffic going to primary• Not recommended• Very high additional cost to allow rapid recovery

SNAPSHOTS

Block Storage

PersistentStorage

Page 23: Outage-Proof Your Applications - RightScale Compute 2013

# 23# 23

#RightscaleCompute

Hybrid HA

APP SERVERS

LOAD BALANCERS

MASTER DB SLAVE DB

APP SERVERS

LOAD BALANCERS

REPLICATE

DNS

SLAVE DB

REPLICATE

CHICAGO

SNAPSHOTS

172.168.7.31 172.168.8.62

US-EAST

SWIFT

SNAPSHOTS

Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.• Possible, but not recommended (more to follow…)• Max additional cost and max availability, but complex to implement and manage

Block Storage

PersistentStorage

Page 24: Outage-Proof Your Applications - RightScale Compute 2013

# 24# 24

#RightscaleCompute

APP SERVERS

LOAD BALANCERS

MASTER DB SLAVE DB

APP SERVERS

LOAD BALANCERS

REPLICATE

DNS

SLAVE DB

REPLICATE

CHICAGO

SNAPSHOTS

172.168.7.31 172.168.8.62

US-EAST

Hybrid HA

You need DNS management or a global load balancer.

Security requires addt’l effort as security groups are Region-

specific.

Machine Images are specific to the

cloud/region.

Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared

SNAPSHOTS

SWIFT

VOLUMEBlock Storage

PersistentStorage

Page 25: Outage-Proof Your Applications - RightScale Compute 2013

# 25# 25

#RightscaleCompute

Designed for Failure

Page 26: Outage-Proof Your Applications - RightScale Compute 2013

# 26# 26

#RightscaleCompute

Multi-Region Warm DR

LOAD BALANCERS

MASTER DB SLAVE DB

APP SERVERS

LOAD BALANCERS

REPLICATE

DNS

APP SERVERS

SLAVE DB

REPLICATE

US WEST

172.168.7.31

US EAST

SNAPSHOTSSNAPSHOTS

Block Storage

PersistentStorage

PersistentStorage

Zero Data Loss 99.99% Uptime Enterprise Software True Cloud

Page 27: Outage-Proof Your Applications - RightScale Compute 2013

# 27# 27

#RightscaleCompute

In the Dashboard

Multi-region or cloud

Multi-region Warm DR

Staged servers

Cost forecasting

for DR environment

Page 28: Outage-Proof Your Applications - RightScale Compute 2013

# 28# 28

#RightscaleCompute

Automating HA and DR• Use dynamic DNS for your database servers

Allow app servers to use a single FQDN.Use a low TTL to allow rapid failover in the case of a change in master database

• Automatic connection of app servers to load balancing serversApp servers can connect to all load balancers automatically at launchNo manual interventionNo DNS modifications

• Automated promotion of slave to masterProcess is automatedDecision to run process is manual

Page 29: Outage-Proof Your Applications - RightScale Compute 2013

# 29# 29

#RightscaleCompute

MultiCloud Images• MultiCloud Images can be launched across regions and

hybrid without modification

How RightScale makes it possible

MultiCloud Images

Cloud A, RightImage 1

Cloud B, RightImage 2

Cloud C, RightImage 3

ServerTemplate contains a list of MultiCloud Images (MCIs)

When the Server is created, a specific MCI is chosen.

Cloud A, RightImage 1

Cloud A

Image 1

The appropriate RightImage is used at launch.

RightImage

Stability across clouds

1

2

3

Page 30: Outage-Proof Your Applications - RightScale Compute 2013

# 30# 30

#RightscaleCompute

How RightScale makes it possibleServerTemplates, Tags, and Inputs• Automated load balancer registration and database

connections• Autoscaling across zones• Dynamic configuration

Page 31: Outage-Proof Your Applications - RightScale Compute 2013

# 31# 31

#RightscaleCompute

DR Cost Comparison ExampleMulti-RegionCold DR

Multi-RegionWarm DR

Multi-RegionHot DR

Total $4480 / month $5630 / month $8800 / month

Running $4470 / month3 Load Balancers (Large)6 App Servers (XLarge)1 Master DB (2XLarge)1 Slave DB (2XLarge)

$5540 / month3 Load Balancers (Large)6 App Servers (XLarge)1 Master DB (2XLarge)2 Slave DB (2XLarge)

$8440 / month6 Load Balancers (Large)12 App Servers (XLarge)1 Master DB (2XLarge)2 Slave DB (2XLarge)

Staged $0 / month3 Load Balancers (Large)6 App Servers (XLarge)1 Slave DB (2XLarge)

$0 / month3 Load Balancers (Large)6 App Servers (Xlarge)

Replication $10 / month25GB / day cross-zone

$90 / month25GB / day cross-region

$360 / month100GB / day cross-region

Page 32: Outage-Proof Your Applications - RightScale Compute 2013

# 32# 32

#RightscaleCompute

Outage-Proofing Best Practices

Place in >1 zone:• Load balancers• App servers• Databases

Maintain capacity to absorb zone or region failures

Replicate data across zones

Design stateless apps for resilience to reboot / relaunch

Replicate data across zones

Backup across regions

Monitoring, alert, and automate operations to speed up failover

Replication and Failover

Application Design

Resource Placement

Page 33: Outage-Proof Your Applications - RightScale Compute 2013

# 33# 33

#RightscaleCompute

Resources• White Papers

• http://www.rightscale.com/info_center/white-papers.php

• Webinars• http://www.rightscale.com/info_center/webinars.php• http://

www.rightscale.com/info_center/webinars/safeguard-cloud-apps-aws.php

You can reach Sanket at: [email protected]

is hiring: jobs.coupa.com

Page 34: Outage-Proof Your Applications - RightScale Compute 2013

april25-26 sanfrancisco

cloud success starts here

Questions?