Designing for failure

Upload: wade-wegner

Post on 21-May-2015

DESCRIPTION

You can count on one of three things failing: hardware, software, or people. One of the most important considerations when moving applications into the public cloud is how to plan for - and mitigate - these failures. Certainly there are best practices in building any application that help you to handle failures, but what are the practices when your applications run in the public cloud?

TRANSCRIPT

Page 1: Designing for failure
Page 2: Designing for failure

Planning for Failure in Cloud Applications

Wade Wegner
CTO, Aditi Technologies
http://www.wadewegner.com
http://twitter.com/WadeWegner

Page 3: Designing for failure

Monday, August 13 | 7pm – 12am | Mess Hall
Join fellow geeks at hack night and show off your creativity and summer camp spirit. Form a team or work individually. You’ll have approximately 4 hours to build an app and develop your pitch. Judges are looking for some working code and the sprout of a cool idea.

You bring: your laptop (Mac or PC), mobile devices to test with
We bring: power, internet, food/drink, music and technical experts

Entry categories:
• Best use of Windows Azure
• Best use of Twilio
• Best summer camp or Wisconsin theme

All participants will have the chance to earn cool swag, and winning teams can choose from a JAMBOX or Sphero.  Prizes will be awarded at midnight.

Win a JAMBOX or Sphero at “Hack without a Thon”

Keep it Legit. Generating ideas ahead of hack night is encouraged, but please wait to create environments and code so everyone has an equal starting point.

Page 4: Designing for failure

The real value of attending a conference is making the one-on-one connections and having focused discussions about your business and technology challenges. Visit the Microsoft and Aditi tables in between sessions on Monday & Tuesday, talk to a Windows Azure expert and get a free drink ticket ($5.50 value) for the evening activities.

Let a Windows Azure expert buy your drink

Monday, Aug 13

10:00 – 11:30am
  Economics of doing business in the cloud (Jeff Nuckolls, Aditi)
  Getting started with Windows Azure development (Michael Collier, Microsoft MVP)
  Understanding Windows Azure Storage (Brian Prince, Microsoft)

1:00 – 3:00pm
  Is the cloud a fit for my project or business? (Brian Hanna, Aditi)
  Understanding Web and Worker Roles in Windows Azure (Brian Prince, Microsoft)
  Demystifying Windows Azure Service Bus (Rick Garibay, Microsoft MVP)

3:30 – 5:50pm
  Planning for failure in the cloud (Wade Wegner, Aditi)

Tuesday, Aug 14

10:00 – 11:30am
  Supporting the Windows Azure application – diagnostics and monitoring (Michael Collier, Microsoft MVP)

1:00 – 3:00pm
  Preparing your organization for cloud adoption (Jeff Nuckolls, Aditi)

Page 5: Designing for failure

Planning for Failure in Cloud Applications

Design For Failure
Recent cloud outages
Building blocks
  Infrastructure abstraction
  Automation
Architectural options to mitigate failures
Conclusions

Page 6: Designing for failure

Takeaways

Outline of architectural options for designing highly-available, fault-tolerant applications
Best practices for implementation of these architectural options
Multi-Availability Zone (AZ) & Fault Domain, Multi-Region, and Multi-Cloud
Architectural options
Considerations / pros and cons of these options

Page 7: Designing for failure

Why plan for failure?

Notable recent outages:

AWS – Thursday, 4/21/2011
Windows Azure – Wednesday, 2/29/2012
AWS – Thursday, 6/14/2012
AWS – Friday, 6/29/2012
Windows Azure – Thursday, 7/26/2012

Page 8: Designing for failure

Why plan for failure?

It will happen!

Page 9: Designing for failure

Terminology

Fault Tolerance
Designs incorporating redundancy and replication to enable systems to continue operating properly (perhaps at a degraded level) if one or more components fail

High Availability (HA)
Fault-tolerant systems are measured by their availability in terms of planned and unplanned service outages for end users

Downtime per year at a given availability:
  99% = 3.65 days
  99.5% = 1.83 days
  99.9% = 8.76 hours
  99.95% = 4.38 hours
  99.99% = 53 minutes
  99.999% = 5.26 minutes

Disaster Recovery (DR)
The process, policies, and procedures related to restoring critical systems after a catastrophic event
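The downtime figures on this slide follow from simple arithmetic on a 365-day year. A minimal sketch, assuming a non-leap year of 8,760 hours (the function name is illustrative, not from the deck):

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year allowed by a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Reproduce the slide's list of availability levels.
for pct in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours / 24:.2f} days, {hours * 60:.1f} minutes)")
```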

Page 10: Designing for failure

Compounding SLAs

Windows Azure Compute (2 instances) = 99.95%
SQL Azure = 99.9%
Windows Azure Storage = 99.9%

Total SLA
4.38 hours + 8.76 hours + 8.76 hours = 21.9 hours of downtime per year

Target: 99.75%
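The 99.75% target can be checked two ways: by multiplying the individual availabilities, or by adding the allowed downtimes as the slide does. The sketch below assumes the three services fail independently and that the application needs all of them at once; the dictionary keys are illustrative labels.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year

# SLA figures from the slide, written as fractions rather than percentages.
slas = {"compute": 0.9995, "sql": 0.999, "storage": 0.999}


def composite_availability(availabilities) -> float:
    """Availability of a chain in which every service must be up (independence assumed)."""
    return prod(availabilities)


exact = composite_availability(slas.values())
# Additive shortcut used on the slide: sum the individual unavailabilities.
approx = 1 - sum(1 - a for a in slas.values())

print(f"product of SLAs:        {exact:.4%}")
print(f"additive approximation: {approx:.4%}")
print(f"downtime (product):     {(1 - exact) * HOURS_PER_YEAR:.2f} hours per year")
```

The two results differ only slightly because each unavailability is small, which is why adding downtimes is a reasonable shortcut for a slide.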

Page 11: Designing for failure

A moment about “the Cloud” …

A frame for discussing the cloud:

A cloud is a physical data center behind an API endpoint
Think of a cloud as a “resource pool” that you can access via API

A cloud is not …
  Windows Azure
  Amazon Web Services

A cloud is defined by the isolation of the resources

A cloud is …
  Windows Azure North Central Region
  AWS US East (North Virginia) Region

This is important moving forward as we talk about DR and HA …

Page 12: Designing for failure

What does HA require?

No single points of failure
  Multiple web servers
  Multiple load balancers
  Data replication

Graceful failover when individual components fail (and they will)
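One way to picture graceful failover is a router that walks an ordered list of redundant endpoints and uses the first one that passes a health check. This is a minimal sketch; the endpoint names and the `is_healthy` callback are hypothetical, and a real health check would involve network calls and timeouts.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical endpoints, with the primary currently failing its health check.
endpoints = ["lb-primary", "lb-secondary"]
down = {"lb-primary"}
print(pick_endpoint(endpoints, lambda e: e not in down))
```

Returning `None` when nothing is healthy leaves the caller to decide how to degrade, for example by serving a static error page.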

Page 13: Designing for failure

Disaster Recovery

Disaster recovery is about preparing for and recovering from a disaster.

Hardware or software
Network or power outage
Physical damage to a building
Human error
… or something else!

Invest time and resources to plan, prepare, rehearse, document, train, and update processes to deal with events.
Continual process of analysis and improvement

Page 14: Designing for failure

Typical DR approach

Duplication of infrastructure to ensure the availability of spare capacity in a disaster scenario

Procured, installed, and maintained so that it’s ready
Significant physical distance apart to ensure isolation from faults
Typically under-utilized or over-provisioned

Page 15: Designing for failure

DR with the Cloud

Essential to consider the best use of services and features that support data migration and durable storage; restore data when disaster strikes
Scale up as needed

Windows Azure
  Regions
  Fault Domains

Amazon Web Services
  Regions
  Availability Zones

Page 16: Designing for failure

Design For Failure

Large-scale failures in the cloud are rare but happen
Application owners are ultimately responsible for availability and recoverability
Balance cost and complexity of HA efforts against risk(s) you’re willing to bear
Cloud infrastructure has made DR and HA remarkably affordable versus past options

Multi-server
Multi-availability zone / fault domain
Multi-region
Multi-cloud

Page 17: Designing for failure

Overcoming Multi-Cloud Pain Points

APIs differ
  Different sets of resources
  Different formats, encodings, and versions

Abstractions and features differ
  Network architectures differ: VLANs, security groups, NAT, IPs, ACLs, …
  Storage architectures differ: local/attachable disks, backup, snapshots, …
  Hypervisors, machine images … cost models, billing, reporting, …

Each cloud is unique in some/many/all respects, with different access mechanisms and varying functionalities provided by the managed resources.

Page 18: Designing for failure

Overcoming Multi-Cloud Pain Points

Design using generic concepts (e.g. “durable storage”) yet deploy using cloud specifics (e.g. “EBS volumes”)
Have tools that translate your concepts to cloud-specific ones (e.g. scripts/recipes that choose the correct provider for the desired resource)
Think of how to share resources across clouds (i.e. data sharing)
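The idea of translating generic concepts into cloud specifics can be sketched as a small lookup table. Only the “durable storage” to “EBS volume” pairing comes from the slide; the provider keys, the Azure entry, and the function name are assumptions made for illustration.

```python
# Illustrative catalog: generic concept -> provider -> concrete resource.
# Only the AWS entry is taken from the slide; the Azure entry is an assumption.
CATALOG = {
    "durable storage": {
        "aws": "EBS volume",
        "azure": "Windows Azure Storage blob",
    },
}


def resolve(concept: str, provider: str) -> str:
    """Translate a generic design concept into a provider-specific resource name."""
    try:
        return CATALOG[concept][provider]
    except KeyError:
        raise KeyError(f"no mapping for {concept!r} on provider {provider!r}") from None


print(resolve("durable storage", "aws"))
```

Keeping the mapping in data rather than in deployment scripts means adding a cloud is a matter of adding entries, and a missing entry fails loudly instead of deploying the wrong resource.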

Page 19: Designing for failure

Infrastructure Abstraction & Automation

Architecture & Application Portability
  Allows simplified deployments across multiple regions/clouds

Automated Deployments
  Reproducible configurations with change control (avoids manual configuration errors)
  Cost effective

Advanced Server and Deployment Monitoring
Automated Scaling and Operations

Page 20: Designing for failure

HA/DR Checklist for Risk Mitigation

Determine who owns the architecture, DR process, and testing.
Develop expertise in house and/or get outside help.
Conduct a risk assessment for each application.
Specify your target Recovery Time Objective and Recovery Point Objective.
Design for Failure starting with application architecture.
Implement HA best practices, balancing cost, complexity, and risk.
  Automate infrastructure for consistency and reliability.
  Abstract applications for flexibility and portability.
Document operational processes and automations.
Test the failover … then test it again.
Release the Chaos Monkey.
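A Recovery Point Objective becomes testable once it is written as a limit on how stale the newest replicated or backed-up data may be. A minimal sketch, with hypothetical timestamps and a 15-minute RPO chosen only as an example:

```python
from datetime import datetime, timedelta


def meets_rpo(last_replicated_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure at `now` would lose no more data than the RPO allows."""
    return now - last_replicated_at <= rpo


# Hypothetical values for illustration only.
now = datetime(2012, 8, 13, 15, 30)
rpo = timedelta(minutes=15)
print(meets_rpo(now - timedelta(minutes=4), now, rpo))
print(meets_rpo(now - timedelta(hours=2), now, rpo))
```

A check like this can run on a schedule and raise an alert, which ties the RPO target to the monitoring items later in the deck.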

Page 21: Designing for failure

General HA Best Practices

Avoid single points of failure
Always place (at least) one of each component (load balancers, app servers, databases, …) in at least two AZs or fault domains
Maintain sufficient capacity to absorb AZ / fault domain failures
  Reserved Instances – guarantee capacity is available in a separate region/cloud.
Replicate data across AZs, and back up or replicate across clouds/regions for failover
Set up monitoring, alerts, and operations to identify problems and automate resolution or failover
Design stateless applications for resilience to reboot / relaunch
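“Sufficient capacity to absorb AZ / fault domain failures” can be turned into a sizing rule: the zones that survive must carry the full load on their own. The sketch below assumes load is spread evenly and instances are interchangeable; the function name and example numbers are illustrative.

```python
import math


def instances_per_zone(required: int, zones: int, tolerated_zone_failures: int = 1) -> int:
    """Instances to run in each zone so the surviving zones still provide `required`."""
    surviving = zones - tolerated_zone_failures
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(required / surviving)


# Example: 6 instances are needed at peak.
for zones in (2, 3):
    per_zone = instances_per_zone(6, zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} total")
```

With two zones every instance is duplicated, while with three zones the overhead drops to half, which is one reason to spread across more zones when they are available.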

Page 22: Designing for failure

General DR Scenarios

Backup and Restore
“Pilot Light” for Simple Recovery
Warm Standby Solution
Multi-site Solution
Multi-cloud Solution

Page 23: Designing for failure

Single-Cloud Multi-Zone/Fault Domain

Running within a single cloud.
Theoretically it is highly available but it’s at the mercy of the single data center.
If something takes out the data center you lose.

[Diagram: Your Site, with Data Storage. Zone / Fault Domain 1: LB, App servers, Master database, Volume. Zone / Fault Domain 2: LB, App servers, Slave database, Volume. Replication between Master and Slave.]

Page 24: Designing for failure

Multi-Cloud Cold / Warm / Hot DR/HA Options

Option  | Note             | Relative cost | Recovery time
Cold DR | Most Common      | $             | > Few Hours
Warm DR | Recommended      | $$            | > 1 Hour
Hot DR  | Least Common     | $$$           | > 5 Minutes
Hot HA  | Live/Live Config | $$$$          | No Downtime
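The options above trade cost against recovery time, so choosing one amounts to picking the cheapest option whose recovery time fits the Recovery Time Objective. In this sketch the recovery times are rough assumptions read off the slide (“few hours” is taken as 240 minutes), and the cost rank is simply the number of dollar signs.

```python
from typing import NamedTuple, Optional


class DrOption(NamedTuple):
    name: str
    cost_rank: int          # number of "$" on the slide
    recovery_minutes: int   # rough, assumed typical recovery time


# Assumed figures: the slide gives only approximate bounds.
OPTIONS = [
    DrOption("Cold DR", 1, 240),
    DrOption("Warm DR", 2, 60),
    DrOption("Hot DR", 3, 5),
    DrOption("Hot HA", 4, 0),
]


def cheapest_option(rto_minutes: int) -> Optional[DrOption]:
    """Cheapest option whose assumed recovery time is within the RTO, if any."""
    candidates = [o for o in OPTIONS if o.recovery_minutes <= rto_minutes]
    return min(candidates, key=lambda o: o.cost_rank) if candidates else None


for rto in (600, 120, 30, 0):
    print(rto, cheapest_option(rto).name)
```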

Page 25: Designing for failure

Multi-Cloud Cold DR

Only provisioned environment is primary.
Not good if rapid recovery is required.
Slow to replicate data to other cloud.
Slow to bring database to an operational state.

[Diagram: Your Site, with Data Storage. Cloud 1: LB, App servers, Master database, Volume. Cloud 2: LB, App servers, Slave database, Volume.]

Page 26: Designing for failure

Multi-Cloud Warm DR

Database replicated in secondary but all traffic going to primary.
Generally a recommended DR solution
Minimal additional cost
Allows fairly rapid recovery

[Diagram: Your Site. Cloud 1: LB, App servers, Master database, Volume, Data Storage. Cloud 2: LB, App servers, Slave database, Volume, Data Storage. Replication between Master and Slave.]

Page 27: Designing for failure

Multi-Cloud Hot DR

Parallel deployment with all servers running but all traffic going to primary.
Atypical. Very high additional cost.
Allows rapid recovery but not significantly faster than “warm” configuration.

[Diagram: Your Site. Cloud 1: LB, App servers, Master database, Volume, Data Storage. Cloud 2: LB, App servers, Slave database, Volume, Data Storage. Replication between Master and Slave.]

Page 28: Designing for failure

Multi-Cloud HA

Live/Live configuration. May use geo-targeting services to direct traffic to regional load balancers.
Possible, but costly.
Provides high availability, but complex to implement and manage.

[Diagram: Your Site. Cloud 1: LB, App servers, Master database, Volume, Data Storage. Cloud 2: LB, App servers, Slave database, Volume, Data Storage. Replication between Master and Slave.]

Page 29: Designing for failure

How do I make my service immortal?

Be pessimistic: Design for Failure
  Assume everything will fail and architect a solution capable of handling it
  Trade off between levels of resiliency and cost

Embrace the cloud mentality: Unprecedented features
  Build architectural building blocks you can reuse, build up, and tear down
  Use the powerful provider capabilities within a cloud
  Build the glue across clouds for global redundancy
  Automate every single last detail

Have it your way: No single architecture fits all cases
  Reuse all patterns that fit your use case
  Customize where they don’t fit

Crawl, then walk: Build HA within cloud, then expand
  Cost increases exponentially; apply the 80-20 rule
  Multi-AZ HA, solid DR plan, then full multi-cloud HA

Page 30: Designing for failure

Next year: August 12th – 14th, 2013