designing for failure
DESCRIPTION
You can count on one of three things failing: hardware, software, or people. One of the most important considerations when moving applications into the public cloud is how to plan for - and mitigate - these failures. Certainly there are best practices in building any application that help you to handle failures, but what are the practices when your applications run in the public cloud?
TRANSCRIPT
Planning for Failure in Cloud Applications
Wade Wegner
CTO, Aditi Technologies
http://www.wadewegner.com
http://twitter.com/WadeWegner
Monday, August 13 | 7pm – 12am | Mess Hall
Join fellow geeks at hack night and show off your creativity and summer camp spirit. Form a team or work individually. You’ll have approximately 4 hours to build an app and develop your pitch. Judges are looking for some working code and the sprout of a cool idea.
You bring: your laptop (Mac or PC), mobile devices to test with
We bring: power, internet, food/drink, music and technical experts
Entry categories:
• Best use of Windows Azure
• Best use of Twilio
• Best summer camp or Wisconsin theme
All participants will have the chance to earn cool swag, and winning teams can choose from a JAMBOX or Sphero. Prizes will be awarded at midnight.
Win a JAMBOX or Sphero at “Hack without a Thon”
Keep it Legit. Generating ideas ahead of hack night is
encouraged, but please wait to create environments and code
so everyone has an equal starting point.
The real value of attending a conference is making the one-on-one connections and having focused discussions about your business and technology challenges. Visit the Microsoft and Aditi tables in between sessions on Monday & Tuesday, talk to a Windows Azure expert and get a free drink ticket ($5.50 value) for the evening activities.
Let a Windows Azure expert buy your drink
Monday, Aug 13
10:00 – 11:30am
• Economics of doing business in the cloud | Jeff Nuckolls, Aditi
• Getting started with Windows Azure development | Michael Collier, Microsoft MVP
• Understanding Windows Azure Storage | Brian Prince, Microsoft
1:00 – 3:00pm
• Is the cloud a fit for my project or business? | Brian Hanna, Aditi
• Understanding Web and Worker Roles in Windows Azure | Brian Prince, Microsoft
• Demystifying Windows Azure Service Bus | Rick Garibay, Microsoft MVP
3:30 – 5:50pm
• Planning for failure in the cloud | Wade Wegner, Aditi
Tuesday, Aug 14
10:00 – 11:30am
• Supporting the Windows Azure application – diagnostics and monitoring | Michael Collier, Microsoft MVP
1:00 – 3:00pm
• Preparing your organization for cloud adoption | Jeff Nuckolls, Aditi
Planning for Failure in Cloud Applications
Design For Failure
Recent cloud outages
Building blocks
• Infrastructure abstraction
• Automation
Architectural options to mitigate failures
Conclusions
Takeaways
Outline of architectural options for designing highly-available, fault-tolerant applications
Best practices for implementation of these architectural options
Multi-Availability Zone (AZ) & Fault Domain, Multi-Region, and Multi-Cloud
Architectural options
Considerations / pros and cons of these options
Why plan for failure?
Notable recent outages:
AWS – Thursday, 4/21/2011
Windows Azure – Wednesday, 2/29/2012
AWS – Thursday, 6/14/2012
AWS – Friday, 6/29/2012
Windows Azure – Thursday, 7/26/2012
Why plan for failure?
It will happen!
Terminology
Fault Tolerance
Designs incorporating redundancy and replication to enable systems to continue operating properly (perhaps at a degraded level) if one or more components fail
High Availability (HA)
Fault-tolerant systems are measured by their availability in terms of planned and unplanned service outages for end users
• 99% availability = 3.65 days of downtime per year
• 99.5% = 1.83 days of downtime per year
• 99.9% = 8.76 hours of downtime per year
• 99.95% = 4.38 hours of downtime per year
• 99.99% = 53 minutes of downtime per year
• 99.999% = 5.26 minutes of downtime per year
Disaster Recovery (DR)
The process, policies, and procedures related to restoring critical systems after a catastrophic event
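The downtime figures above follow directly from the availability percentage. A minimal Python sketch of that arithmetic, assuming a 365-day year (the function name is illustrative, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Print the same ladder of availability targets listed on the slide.
for pct in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the cost of reaching it tends to climb so quickly.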
Compounding SLAs
Windows Azure Compute (2 instances) = 99.95%
SQL Azure = 99.9%
Windows Azure Storage = 99.9%
Total SLA
4.38 hours + 8.76 hours + 8.76 hours = 21.9 hours of downtime per year
Target: 99.75%
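The compounding can be checked two ways: by adding the downtime allowances, as the slide does, or by multiplying the availabilities. A small Python sketch, assuming the three services fail independently and the application needs all of them (the dictionary keys and function name are illustrative):

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def composite_availability(slas_percent) -> float:
    """Availability of a chain in which every dependency must be up."""
    return prod(p / 100 for p in slas_percent) * 100


slas = {"compute": 99.95, "sql": 99.9, "storage": 99.9}

# Multiplicative view: probability that all three are up at once.
composite = composite_availability(slas.values())

# Additive view used on the slide: sum of each service's yearly downtime.
additive_downtime = sum((1 - p / 100) * HOURS_PER_YEAR for p in slas.values())

print(f"composite availability: {composite:.2f}%")
print(f"summed downtime: {additive_downtime:.1f} hours per year")
```

Both views land on roughly 99.75%, so an application built on these three services cannot promise more than that without adding redundancy of its own.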
A moment about “the Cloud” …
A frame for discussing the cloud:
A cloud is a physical data center behind an API endpoint
Think of a cloud as a “resource pool” that you can access via API
A cloud is not …
• Windows Azure
• Amazon Web Services
A cloud is defined by the isolation of the resources
A cloud is …
• Windows Azure North Central Region
• AWS US East (North Virginia) Region
This is important moving forward as we talk about DR and HA …
What does HA require?
No single points of failure
• Multiple web servers
• Multiple load balancers
• Data replication
Graceful failover when individual components fail (and they will)
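Graceful failover can be as simple as trying redundant components in order and moving on when one fails. A minimal Python sketch of that idea; the replica functions and the exception class are hypothetical stand-ins, not part of any provider API:

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant component has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # any failure means "try the next one"
            errors.append(exc)
    raise AllReplicasFailed(errors)


def broken_server() -> str:
    # Hypothetical replica that is currently down.
    raise ConnectionError("web server 1 is down")


def healthy_server() -> str:
    # Hypothetical replica that answers normally.
    return "200 OK from web server 2"


result = call_with_failover([broken_server, healthy_server])
print(result)
```

With a single replica in the list, the first failure surfaces to the caller, which is the single point of failure the slide warns about.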
Disaster Recovery
Disaster recovery is about preparing for and recovering from a disaster.
Hardware or software
Network or power outage
Physical damage to a building
Human error
… or something else!
Invest time and resources to plan, prepare, rehearse, document, train, and update processes to deal with events.
Continual process of analysis and improvement
Typical DR approach
Duplication of infrastructure to ensure the availability of spare capacity in a disaster scenario
Procured, installed, and maintained so that it’s ready
Significant physical distance apart to ensure isolation from faults
Typically under-utilized or over-provisioned
DR with the Cloud
Essential to consider the best use of services and features that support data migration and durable storage; restore data when disaster strikes
Scale up as needed
Windows Azure
• Regions
• Fault Domains
Amazon Web Services
• Regions
• Availability Zones
Design For Failure
Large-scale failures in the cloud are rare but happen
Application owners are ultimately responsible for availability and recoverability
Balance cost and complexity of HA efforts against risk(s) you’re willing to bear
Cloud infrastructure has made DR and HA remarkably affordable versus past options
• Multi-server
• Multi-availability zone / fault domain
• Multi-region
• Multi-cloud
Overcoming Multi-Cloud Pain Points
APIs differ
• Different sets of resources
• Different formats, encodings, and versions
Abstractions and features differ
• Network architectures differ: VLANs, security groups, NAT, IPs, ACLs, …
• Storage architectures differ: local/attachable disks, backup, snapshots, …
• Hypervisors, machine images … cost models, billing, reporting, …
Each cloud is unique in some/many/all respects, with different access mechanisms and varying functionalities provided by the managed resources.
Overcoming Multi-Cloud Pain Points
Design using generic concepts (e.g. “durable storage”) yet deploy using cloud specifics (e.g. “EBS volumes”)
Have tools that translate your concepts to cloud-specific ones (e.g. scripts/recipes that choose the correct provider for the desired resource)
Think of how to share resources across clouds (i.e. data sharing)
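One way to picture the "translate your concepts" step is a lookup from a generic concept to each cloud's specific resource. The Python sketch below is only an illustration: "EBS volume", Availability Zones, and Fault Domains appear in this deck, while the Azure storage entry, the map layout, and the function name are assumptions.

```python
# Generic concept -> cloud-specific resource.
# "page blob disk" is an illustrative assumption, not taken from the talk.
CONCEPT_MAP = {
    "durable storage": {"aws": "EBS volume", "azure": "page blob disk"},
    "isolation unit": {"aws": "Availability Zone", "azure": "Fault Domain"},
}


def resolve(concept: str, cloud: str) -> str:
    """Translate a generic design concept into one cloud's resource name."""
    try:
        return CONCEPT_MAP[concept][cloud]
    except KeyError:
        raise LookupError(f"no mapping for {concept!r} on {cloud!r}") from None


# The design names the concept once; the deployment picks the specifics.
deployment_plan = {
    cloud: resolve("durable storage", cloud) for cloud in ("aws", "azure")
}
print(deployment_plan)
```

Keeping the mapping in one place means the application design only ever mentions the generic concept, and a new cloud is added by extending the table rather than rewriting the design.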
Infrastructure Abstraction & Automation
Architecture & Application Portability
• Allows simplified deployments across multiple regions/clouds
Automated Deployments
• Reproducible configurations with change control (avoids manual configuration errors)
• Cost effective
Advanced Server and Deployment Monitoring
Automated Scaling and Operations
HA/DR Checklist for Risk Mitigation
Determine who owns the architecture, DR process, and testing.
Develop expertise in house and/or get outside help.
Conduct a risk assessment for each application.
Specify your target Recovery Time Objective and Recovery Point Objective.
Design for Failure starting with application architecture.
Implement HA best practices, balancing cost, complexity, and risk.
Automate infrastructure for consistency and reliability.
Abstract applications for flexibility and portability.
Document operational processes and automations.
Test the failover … then test it again.
Release the Chaos Monkey.
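The last two checklist items amount to deliberately breaking something and confirming the service survives. A toy Python simulation of such a drill, assuming a fleet modeled as a name-to-health dictionary (the instance names and both functions are hypothetical; a real drill would terminate actual instances):

```python
import random


def service_is_up(instances: dict) -> bool:
    """The service survives as long as at least one instance is healthy."""
    return any(instances.values())


def release_chaos_monkey(instances: dict, rng: random.Random) -> str:
    """Mark one randomly chosen healthy instance as failed; return its name."""
    healthy = [name for name, ok in instances.items() if ok]
    victim = rng.choice(healthy)
    instances[victim] = False
    return victim


rng = random.Random(2012)  # fixed seed so the drill is repeatable
fleet = {"zone1-app1": True, "zone1-app2": True, "zone2-app1": True}

killed = release_chaos_monkey(fleet, rng)
print(f"killed {killed}; service up: {service_is_up(fleet)}")
```

Running the drill repeatedly, and against every tier rather than only app servers, is what turns "test the failover" into a habit instead of a one-off exercise.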
General HA Best Practices
Avoid single points of failure
Always place (at least) one of each component (load balancers, app servers, databases, …) in at least two AZs or fault domains
Maintain sufficient capacity to absorb AZ / fault domain failures
• Reserved Instances – guarantee capacity is available in a separate region/cloud.
Replicate data across AZs and back up or replicate across clouds/regions for failover
Set up monitoring, alerts, and operations to identify and automate problem resolution or failover process
Design stateless applications for resilience to reboot / relaunch
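"Sufficient capacity to absorb AZ / fault domain failures" can be turned into a sizing rule: the zones that survive must carry the full load on their own. A short Python sketch under that assumption, with instances spread evenly across zones (the function name and parameters are illustrative):

```python
import math


def instances_per_zone(required_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to run in each zone so the survivors still meet demand."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive the failure")
    return math.ceil(required_instances / surviving)


# Example: peak load needs 6 app servers.
for zones in (2, 3, 4):
    per_zone = instances_per_zone(6, zones)
    print(f"{zones} zones -> {per_zone} per zone, {per_zone * zones} total")
```

With two zones, each zone has to hold the entire required capacity, so the deployment runs at double size; spreading across more zones lowers that overhead.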
General DR Scenarios
Backup and Restore
“Pilot Light” for Simple Recovery
Warm Standby Solution
Multi-site Solution
Multi-cloud Solution
Single-Cloud Multi-Zone/Fault Domain
Running within a single cloud.
Theoretically it is highly available but it’s at the mercy of the single data center.
If something takes out the data center you lose.
[Diagram: Your Site and Data Storage; Zone / Fault Domain 1 with LB, app servers, master database, and volume; Zone / Fault Domain 2 with LB, app servers, slave database, and volume; replication from master to slave.]
Multi-Cloud Cold / Warm / Hot DR/HA Options
Option | Relative cost | Recovery time
Cold DR (Most Common) | $ | > Few Hours
Warm DR (Recommended) | $$ | > 1 Hour
Hot DR (Least Common) | $$$ | > 5 Minutes
Hot HA (Live/Live Config) | $$$$ | No Downtime
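The four options above trade cost against recovery time, so a target Recovery Time Objective (RTO) narrows the choice. The Python sketch below picks the cheapest option that fits an RTO. It reads the slide's figures as rough recovery-time estimates, and the 4-hour value standing in for "few hours" is an assumption, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    name: str
    cost_rank: int         # 1 = "$" ... 4 = "$$$$"
    recovery_minutes: int  # rough estimate read off the slide


OPTIONS = [
    DrOption("Cold DR", 1, 4 * 60),  # "few hours": 4 hours is an assumption
    DrOption("Warm DR", 2, 60),
    DrOption("Hot DR", 3, 5),
    DrOption("Hot HA", 4, 0),
]


def cheapest_option(rto_minutes: int) -> DrOption:
    """Return the lowest-cost option whose recovery time fits the RTO."""
    if rto_minutes < 0:
        raise ValueError("RTO cannot be negative")
    candidates = [o for o in OPTIONS if o.recovery_minutes <= rto_minutes]
    return min(candidates, key=lambda o: o.cost_rank)


for rto in (600, 60, 30, 0):
    print(f"RTO {rto} min -> {cheapest_option(rto).name}")
```

Real recovery times depend on data volume and how well the failover has been rehearsed, so the numbers here are placeholders to be replaced with measured values.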
Multi-Cloud Cold DR
Only provisioned environment is primary.
Not good if rapid recovery is required.
Slow to replicate data to other cloud.
Slow to bring database to an operational state.
[Diagram: Your Site and Data Storage; Cloud 1 with LB, app servers, master database, and volume; Cloud 2 with LB, app servers, slave database, and volume.]
Multi-Cloud Warm DR
Database replicated in secondary but all traffic going to primary.
Generally a recommended DR solution
Minimal additional cost
Allows fairly rapid recovery
[Diagram: Your Site; Cloud 1 with Data Storage, LB, app servers, master database, and volume; Cloud 2 with Data Storage, LB, app servers, slave database, and volume; replication from master to slave.]
Multi-Cloud Hot DR
Parallel deployment with all servers running but all traffic going to primary.
Atypical. Very high additional cost.
Allows rapid recovery but not significantly faster than “warm” configuration.
[Diagram: Your Site; Cloud 1 with Data Storage, LB, app servers, master database, and volume; Cloud 2 with Data Storage, LB, app servers, slave database, and volume; replication from master to slave.]
Multi-Cloud HA
Live/Live configuration. May use geo-targeting services to direct traffic to regional load balancers.
Possible, but costly.
Provides high availability, but complex to implement and manage.
[Diagram: Your Site; Cloud 1 with Data Storage, LB, app servers, master database, and volume; Cloud 2 with Data Storage, LB, app servers, slave database, and volume; replication from master to slave.]
How do I make my service immortal?
Be pessimistic: Design for Failure
• Assume everything will fail and architect a solution capable of handling it
• Trade off between levels of resiliency and cost
Embrace the cloud mentality: Unprecedented features
• Build architectural building blocks you can reuse, build up, and tear down
• Use the powerful provider capabilities within a cloud
• Build the glue across clouds for global redundancy
• Automate every single last detail
Have it your way: No single architecture fits all cases
• Reuse all patterns that fit your use case
• Customize where they don’t fit
Crawl, then walk: Build HA within cloud, then expand
• Cost increases exponentially, the 80-20 rule
• Multi-AZ HA, solid DR plan, then full multi-cloud HA
Next Year: August 12th – 14th, 2013