Outage-Proof Your Applications - RightScale Compute 2013
TRANSCRIPT
April 25-26, San Francisco
Cloud success starts here
Outage-Proof Your Applications
Brian Adler, Sr. Services Architect, RightScale
Sanket Naik, VP Cloud Operations, Coupa
# 2
#RightscaleCompute
Agenda
• Introductions
• Terminology/Level-Setting
• Takeaways
• Cloud and Component Definitions
• Designing for Failure
• Architectural Options and Considerations
  - High Availability
  - Disaster Recovery
• Conclusions / Q&A
# 3
Our Mission
Delivering software innovation that breeds responsible spending while impacting the company bottom line
# 4
What does success look like?
Manual Companies: 10% Spend Under Management
Most Companies: 30%-50% Spend Under Management
80% Spend Under Management
Source: Aberdeen Group
# 5
Spend Happens
• Pre-Approved: Procurement
• Un-Approved: Expense Management
• Post-Approved: Invoice Management
We are the only solution that captures all three
Over 300 Customers and growing… You
smarter spending simplified
# 7
Terminology
High Availability (HA): Fault-tolerant systems are measured by their availability in terms of planned and unplanned service outages for end users.
Disaster Recovery (DR): The process, policies and procedures related to restoring critical systems after a catastrophic event. The goal is to get the application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
Fault Tolerance: Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail.
# 8
Terminology - continued
Recovery Point Objective (RPO): Acceptable data loss as a result of recovering from a disaster/catastrophic event.
Recovery Time Objective (RTO): Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives.
RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground.
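To make the two objectives concrete, here is a minimal sketch of checking one recovery event against them. The class and function names, the one-hour RTO, the 15-minute RPO and the timestamps are all hypothetical values chosen for illustration; none of them come from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


def meets_objectives(objectives, outage_start, service_restored, last_replicated_write):
    """Return (rto_met, rpo_met) for a single recovery event."""
    downtime = service_restored - outage_start
    # Data written after the last replicated write is lost in the disaster.
    data_loss_window = outage_start - last_replicated_write
    return downtime <= objectives.rto, data_loss_window <= objectives.rpo


# Hypothetical targets: restore within 1 hour, lose at most 15 minutes of data.
objectives = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
outage = datetime(2013, 4, 25, 10, 0)
result = meets_objectives(
    objectives,
    outage_start=outage,
    service_restored=outage + timedelta(minutes=40),
    last_replicated_write=outage - timedelta(minutes=20),
)
print(result)  # (True, False): restored in time, but 20 minutes of data were lost
```

In this made-up event the service came back within the RTO, but the last replicated write was 20 minutes old, so the RPO was missed. Tightening the RPO usually means more frequent replication, which is one place the tradeoff with cost and recovery time shows up.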
# 9
Takeaways
• Understand core concepts behind HA and DR
• Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures
• Best practices for implementation of these architectural options within AWS (independent of RightScale)
• Multi-Availability Zone (AZ) and Multi-Region
• Architectural options and considerations / pros and cons of these options
• Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
# 10
Regions & Zones
• Zones within a region share a LAN (high bandwidth, low latency, private IP access)
• Zones utilize separate power sources and are physically segregated
• Regions are “islands”, and share no resources.
[Diagram: Regions 1 through 5; Region 1 contains Availability Zones A, B and C, and Regions 2 through 5 each contain Availability Zones A and B]
# 11
Designing for Failure
• Large scale failures in the cloud are rare but do happen
• Application owners are ultimately responsible for availability and recoverability
• Balance cost and complexity of HA efforts against risk(s) you are willing to bear
• Cloud infrastructure has made DR and HA remarkably affordable versus past options
  - Multi-Server
  - Multi-AZ (Availability Zone)
  - Multi-Region
“Everything fails, all the time.” Werner Vogels, CTO Amazon.com
# 12
Designing for Failure – Basic Concepts
• Fault tolerance is the goal. Degradation of service may occur, but application continues to function.
• Avoid single points of failure (SPOF)
• Assume everything fails (remember Werner’s mantra) and design accordingly
• Plan and practice your recovery process (both for HA and DR)
• Remember that better HA and DR equals more $$$. So find that acceptable balance.
# 13
High Availability
Don’t sweat the small stuff. And it’s all small stuff*
*(until it’s not)
Follow a few general best practices to absorb application component outages…
# 14
General HA Best Practices
• Avoid single points of failure.
• Always place one of each component (load balancers, app servers, databases) in at least two AZs.
• Replicate data across AZs (HA) and backup or replicate across regions for failover (DR)
• Set up monitoring, alerts and operations to identify problems and automate their resolution or the failover process.
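The "at least two AZs" rule is easy to check mechanically. The sketch below works on a plain list of (role, zone) pairs; the role names, zone names and the `single_zone_components` helper are hypothetical and not part of any AWS or RightScale API.

```python
from collections import defaultdict


def single_zone_components(servers):
    """Return the sorted roles that run in fewer than two distinct zones."""
    zones_by_role = defaultdict(set)
    for role, zone in servers:
        zones_by_role[role].add(zone)
    return sorted(role for role, zones in zones_by_role.items() if len(zones) < 2)


# Hypothetical inventory: the database tier has no presence in a second zone.
deployment = [
    ("load_balancer", "us-east-1a"),
    ("load_balancer", "us-east-1b"),
    ("app_server", "us-east-1a"),
    ("app_server", "us-east-1b"),
    ("database", "us-east-1a"),
]
print(single_zone_components(deployment))  # ['database']
```

A check like this could run as part of monitoring, so that a tier collapsing into one zone raises an alert before that zone fails.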
# 15
Multi-Zone HA
[Diagram: DNS (172.168.7.31, 172.168.8.62); zones US-EAST 1a and US-EAST 1b; load balancers and autoscaling app servers; master DB and slave DB with replication; EBS snapshots to S3]
• Snapshot the data volume for backups so the database can be readily recovered within the region.
• Place slave databases in one or more zones for failover.
• Consider local storage for an additional slave database to remove the dependency on an attached volume.
• Consider distributed NoSQL databases with the same distribution considerations.
# 16
Disaster Recovery
DR presents a few new wrinkles compared to HA, but there are multiple options depending on your needs and budget…
Don’t sweat the small stuff. And it’s all small stuff*
*(until it’s not)
# 17
HA/DR Checklist for Risk Mitigation
• Determine who owns the architecture, DR process and testing.
• Develop expertise in-house and / or get outside help.
• Conduct a risk assessment for each application.
• Specify your target RTO and RPO.
• Design for failure starting with application architecture. This will help drive the infrastructure architecture.
# 18
HA/DR Checklist for Risk Mitigation
• Implement HA best practices balancing cost, complexity and risk.
  - Automate infrastructure for consistency and reliability.
• Document operational processes and automations.
• Test the failover... then test it again.
• Release the Chaos Monkey.
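Chaos Monkey is Netflix's tool for randomly terminating running instances to prove that a system survives the loss. The idea can be rehearsed on paper first. The following is a toy simulation over an in-memory list, not the real tool; the fleet contents and both helper functions are hypothetical.

```python
import random


def release_chaos_monkey(fleet, rng):
    """Remove one randomly chosen server from the fleet and return it."""
    victim = rng.choice(fleet)
    fleet.remove(victim)
    return victim


def roles_still_served(fleet, required_roles):
    """True if every required role still has at least one server."""
    return required_roles <= {role for role, _zone in fleet}


# Hypothetical fleet with every role present in two zones.
fleet = [
    ("load_balancer", "us-east-1a"),
    ("load_balancer", "us-east-1b"),
    ("app_server", "us-east-1a"),
    ("app_server", "us-east-1b"),
    ("database", "us-east-1a"),
    ("database", "us-east-1b"),
]
required = {"load_balancer", "app_server", "database"}

victim = release_chaos_monkey(fleet, random.Random(2013))
print("terminated:", victim)
print("all roles still served:", roles_still_served(fleet, required))
```

Because every role starts in two zones, losing any single server leaves all roles served. Removing one of the duplicates from the starting fleet makes the check fail for some victims, which is exactly the single point of failure the exercise is meant to expose.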
# 19
Multi-Region/Cloud DR Options
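• Cold DR (Most Common): cost $, downtime > 1 hour, availability 99%
• Warm DR (Recommended): cost $$, downtime < 1 hour, availability 99.5%
• Hot DR (Least Common): cost $$$, downtime < 5 mins, availability 99.9%
• Multi-Cloud HA (Live/Live Config): cost $$$$, downtime 0, availability 99.999%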
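An availability percentage can be read as a yearly downtime budget. The sketch below does that arithmetic for the four availability figures on this slide; it assumes a 365-day year, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_percent):
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.5, 99.9, 99.999):
    minutes = annual_downtime_minutes(pct)
    print(f"{pct}% -> {minutes:.1f} minutes/year ({minutes / 60:.1f} hours)")
```

For example, 99% allows about 5,256 minutes (87.6 hours) of downtime per year, while 99.999% allows only about 5.3 minutes. Note that this is a budget over a whole year, which is a different quantity from the per-incident recovery time shown alongside each option.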
# 20
Multi-Region Cold DR
[Diagram: DNS (172.168.7.31); US EAST and US WEST; load balancers, app servers, master DB and slave DBs; replication; snapshots; block storage and persistent storage]
Staged server configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
# 21
Multi-Region Warm DR
[Diagram: DNS (172.168.7.31); US EAST and US WEST; load balancers, app servers, master DB and slave DBs; replication; snapshots; block storage and persistent storage]
Staged server configuration, pre-staged data and running slave database server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
# 22
Multi-Region Hot DR
[Diagram: DNS (172.168.7.31); US EAST and US WEST; load balancers, app servers, master DB and slave DBs; replication; snapshots; block storage and persistent storage]
Parallel deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery
# 23
Hybrid HA
[Diagram: DNS (172.168.7.31, 172.168.8.62); CHICAGO and US-EAST; load balancers, app servers, master DB and slave DBs; replication; snapshots; Swift; block storage and persistent storage]
Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage
# 24
Hybrid HA
[Diagram: DNS (172.168.7.31, 172.168.8.62); CHICAGO and US-EAST; load balancers, app servers, master DB and slave DBs; replication; snapshots; Swift; volume; block storage and persistent storage]
Looks similar to Multi-Zone… but there are additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security requires additional effort as security groups are Region-specific.
• Machine Images are specific to the cloud/region.
# 25
Designed for Failure
# 26
Multi-Region Warm DR
[Diagram: DNS (172.168.7.31); US EAST and US WEST; load balancers, app servers, master DB and slave DBs; replication; snapshots; block storage and persistent storage]
• Zero Data Loss
• 99.99% Uptime
• Enterprise Software
• True Cloud
# 27
In the Dashboard
Multi-region or cloud
Multi-region Warm DR
Staged servers
Cost forecasting for DR environment
# 28
Automating HA and DR
• Use dynamic DNS for your database servers
  - Allow app servers to use a single FQDN.
  - Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
  - App servers can connect to all load balancers automatically at launch
  - No manual intervention
  - No DNS modifications
• Automated promotion of slave to master
  - Process is automated
  - Decision to run process is manual
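Why a low TTL matters for database failover can be shown with a small in-memory simulation. This is a toy model, not a real DNS client or a RightScale script: the classes, the FQDN, the addresses and the 60-second TTL are all hypothetical. It also models the last bullet: the promotion steps run automatically, but only after an operator explicitly confirms.

```python
class DnsRecord:
    """Toy stand-in for a dynamic DNS record pointing at the master database."""

    def __init__(self, fqdn, address, ttl_seconds):
        self.fqdn = fqdn
        self.address = address
        self.ttl_seconds = ttl_seconds


class CachingClient:
    """App-server view of the record: caches the answer until the TTL expires."""

    def __init__(self, record, clock):
        self.record = record
        self.clock = clock
        self._cached = None
        self._expires_at = 0.0

    def resolve(self):
        now = self.clock()
        if self._cached is None or now >= self._expires_at:
            self._cached = self.record.address
            self._expires_at = now + self.record.ttl_seconds
        return self._cached


def promote_slave(record, slave_address, operator_confirmed):
    """Repoint the master FQDN at the slave; refuses without a manual go-ahead."""
    if not operator_confirmed:
        raise PermissionError("promotion requires an explicit operator decision")
    record.address = slave_address


now = [0.0]  # simulated clock, in seconds
record = DnsRecord("master-db.example.internal", "10.0.1.10", ttl_seconds=60)
client = CachingClient(record, clock=lambda: now[0])

before = client.resolve()
promote_slave(record, "10.0.2.20", operator_confirmed=True)
now[0] = 30.0
stale = client.resolve()   # still inside the TTL, so the old master is returned
now[0] = 61.0
after = client.resolve()   # TTL expired, so the new master is picked up
print(before, stale, after)  # 10.0.1.10 10.0.1.10 10.0.2.20
```

The app servers never change their configuration; they keep using the same FQDN. The TTL bounds how long they can keep talking to the old master after a promotion, which is why it should be short.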
# 29
MultiCloud Images
• MultiCloud Images can be launched across regions and hybrid without modification
How RightScale makes it possible
MultiCloud Images: Cloud A, RightImage 1; Cloud B, RightImage 2; Cloud C, RightImage 3
1. ServerTemplate contains a list of MultiCloud Images (MCIs)
2. When the Server is created, a specific MCI is chosen.
3. The appropriate RightImage is used at launch (for Cloud A, RightImage 1).
RightImage: stability across clouds
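The lookup described on this slide can be sketched as a nested mapping: a template holds a list of MCIs, and each MCI maps a cloud to the image to launch there. This is an illustration of the idea only, not RightScale's data model or API; the dictionary layout, the template and MCI names, and `image_for_launch` are hypothetical.

```python
# Hypothetical template: one MCI that knows which image to use on each cloud.
server_template = {
    "name": "app-server-template",
    "multicloud_images": [
        {
            "name": "mci-default",
            "images": {
                "Cloud A": "RightImage 1",
                "Cloud B": "RightImage 2",
                "Cloud C": "RightImage 3",
            },
        },
    ],
}


def image_for_launch(template, mci_name, cloud):
    """Pick the image for the given cloud from the chosen MCI."""
    for mci in template["multicloud_images"]:
        if mci["name"] == mci_name:
            if cloud not in mci["images"]:
                raise LookupError(f"{mci_name} has no image for {cloud}")
            return mci["images"][cloud]
    raise LookupError(f"template has no MCI named {mci_name}")


print(image_for_launch(server_template, "mci-default", "Cloud A"))  # RightImage 1
```

The point of the indirection is that the same template can be launched in another region or cloud during a failover without being edited; only the cloud argument changes.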
# 30
How RightScale makes it possible
ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration
# 31
DR Cost Comparison Example
Multi-Region Cold DR: Total $4480 / month
  - Running: $4470 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge))
  - Staged: $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge))
  - Replication: $10 / month (25GB / day cross-zone)
Multi-Region Warm DR: Total $5630 / month
  - Running: $5540 / month (3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge))
  - Staged: $0 / month (3 Load Balancers (Large), 6 App Servers (XLarge))
  - Replication: $90 / month (25GB / day cross-region)
Multi-Region Hot DR: Total $8800 / month
  - Running: $8440 / month (6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge))
  - Replication: $360 / month (100GB / day cross-region)
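The totals in this comparison are the sum of the running, staged and replication lines. A short sketch that reproduces them from the slide's figures follows; the dictionary layout is illustrative, and the staged cost of 0 for Hot DR is an assumption, since the slide lists no staged servers for that option.

```python
# Monthly costs in dollars, taken from the slide.
costs = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    # Assumption: Hot DR has nothing staged, so its staged cost is 0.
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}

totals = {name: sum(parts.values()) for name, parts in costs.items()}
baseline = totals["Cold DR"]

for name, total in totals.items():
    extra = total - baseline
    print(f"{name}: ${total} / month (${extra} more than Cold DR)")
```

The sums come to $4480, $5630 and $8800 per month, matching the slide. Staged servers cost nothing until launched, so the step from Cold to Warm is mostly the extra running slave database plus cross-region replication, while Hot roughly doubles the running fleet.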
# 32
Outage-Proofing Best Practices
Resource Placement
• Place in >1 zone: load balancers, app servers, databases
• Maintain capacity to absorb zone or region failures
Application Design
• Design stateless apps for resilience to reboot / relaunch
Replication and Failover
• Replicate data across zones
• Back up across regions
• Monitor, alert, and automate operations to speed up failover
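"Maintain capacity to absorb zone failures" translates into simple sizing arithmetic: the zones that survive must carry the full peak load on their own. A minimal sketch, assuming load spreads evenly across zones; the function name and the peak of 6 app servers are hypothetical.

```python
import math


def servers_per_zone(peak_servers_needed, zones, zones_lost=1):
    """Servers to run in each zone so the surviving zones still cover peak load."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_servers_needed / surviving)


# Hypothetical peak requirement of 6 app servers.
for zones in (2, 3):
    per_zone = servers_per_zone(6, zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} total")
```

With two zones, each must hold all 6 servers (12 total); with three zones, 3 per zone is enough (9 total). Spreading across more zones therefore lowers the overhead of being able to lose one.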
# 33
Resources
• White Papers
  - http://www.rightscale.com/info_center/white-papers.php
• Webinars
  - http://www.rightscale.com/info_center/webinars.php
  - http://www.rightscale.com/info_center/webinars/safeguard-cloud-apps-aws.php
You can reach Sanket at: [email protected]
Coupa is hiring: jobs.coupa.com
Questions?