Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016

Felix Candelario

Global Financial Services Solutions Architect

“Disaster Recovery and Business Continuity”

Agenda

• AWS Disaster Recovery Concepts & Terminology

• Architecting for Recovery & Resiliency

• Disaster Recovery Testing & Assurance

• Architecting for the Cloud

“Everything fails, all the time”

- Werner Vogels (CTO, Amazon.com)

Concepts & Terminology

DR Terminology Map

Your Data Centers → AWS

• Load Balancers → ELB/Appliance

• Web/App Servers → EC2/Auto Scaling

• DNS → Route 53

• Database Servers → Amazon RDS

• Firewall → Security Groups / ACL

• Data Centers → Availability Zones / VPC

• Geographical Redundancy → Multi-region

What is an AWS Region?

• A geographic location that contains a cluster of Availability Zones in a given metropolitan area

• Each region is completely isolated and independent from other regions

• Each region consists of 2 or more AZs to support high availability (HA) through AZ independence

Highly Reliable Global Footprint

• Over 1 million active customers per month across 190 countries

• 2,300 government agencies

• 7,000 educational institutions

• 35 availability zones + 9 more coming soon

• 59 edge locations

13+ worldwide regions

What are Availability Zones?

• Groupings of one or more data centers that are physically isolated

• AZs are connected to each other over low-latency links within the same region

• Using 2 or more AZs within a region can provide support for capabilities such as synchronous database replication and better pricing when using Amazon EC2 Spot instances

Availability Zones are Notated as Letters

35 Availability Zones (AZs)

• Example

• US East 1 (Northern VA)

– us-east-1a

– us-east-1b

– us-east-1c

– us-east-1d

– us-east-1e

[Diagram: US-EAST-1 region containing Availability Zones A, B, C, D, and E]
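The letter notation above can be sketched in a few lines of Python. This is only an illustration of the naming pattern shown on the slide; the helper name `az_names` is hypothetical, and the set of AZs actually visible to an account should be looked up through the AWS APIs rather than generated.

```python
import string


def az_names(region: str, count: int) -> list[str]:
    """Build AZ names by appending the letters a, b, c, ... to a region code."""
    if not 1 <= count <= len(string.ascii_lowercase):
        raise ValueError("count must be between 1 and 26")
    return [f"{region}{letter}" for letter in string.ascii_lowercase[:count]]


# The five AZs listed for US East 1 in the example above.
zones = az_names("us-east-1", 5)
print(zones)
```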

What is an Amazon VPC?

• Virtual isolated network that you define, in which you can launch AWS resources such as Amazon EC2 instances

• Complete control of your virtual networking environment, such as:

– Set your own IP address ranges

– Create subnets

– Configure routing tables and network gateways

• Allows extension of your corporate network to the AWS Cloud
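As a rough sketch of "set your own IP address ranges" and "create subnets", the standard-library `ipaddress` module can carve a VPC range into per-AZ subnets before anything is created in AWS. The 10.0.0.0/16 range, the /24 subnet size, and the public/private split are assumptions chosen for illustration, not values from the slides.

```python
import ipaddress

# Assumed VPC range and AZs; both are illustrative choices.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
availability_zones = ["us-east-1a", "us-east-1b"]

# Split the VPC range into /24 blocks and hand out two per AZ.
blocks = list(vpc_cidr.subnets(new_prefix=24))
subnet_plan = {
    az: {"public": blocks[2 * i], "private": blocks[2 * i + 1]}
    for i, az in enumerate(availability_zones)
}

for az, subnets in subnet_plan.items():
    print(az, "public:", subnets["public"], "private:", subnets["private"])
```

Planning the ranges up front helps avoid overlap with the corporate network when the VPC is later connected to it.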

VPC Pattern Diagram - Example

[Diagram: one Amazon VPC per environment]

• Development Amazon VPC

• Integration Amazon VPC

• Pre-production Amazon VPC

• Production Amazon VPC

Putting It All Together

What Compute Services are available?

• Amazon EC2: Elastic virtual servers in the cloud

• Elastic Load Balancing: Dynamic traffic distribution

• Auto Scaling: Automated scaling of EC2 capacity

What Network Services are available?

• Amazon VPC: Private, isolated section of the AWS Cloud

• AWS Direct Connect: Private connectivity between AWS and your datacenter

• Amazon Route 53: Domain Name System (DNS) web service

Architecting for Recovery & Resiliency

• Resiliency: Reducing likelihood of service failure

• Backup: Maintaining data integrity

• Disaster Recovery: Recovery after loss of availability

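It’s not all or nothing. Choose a strategy that fits the business objective.

[Timeline: Recovery point, then Disaster, then Recovery time]

• Data loss: between the recovery point and the disaster

• Down time: between the disaster and the recovery time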
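A small worked example of the timeline, assuming made-up timestamps and objectives: data loss is measured back from the disaster to the last recovery point, and down time forward from the disaster to the moment service is restored.

```python
from datetime import datetime, timedelta

# Hypothetical event times, for illustration only.
last_recovery_point = datetime(2016, 10, 1, 2, 0)   # last good backup or replica
disaster = datetime(2016, 10, 1, 9, 30)
service_restored = datetime(2016, 10, 1, 13, 30)

data_loss = disaster - last_recovery_point   # 7 hours 30 minutes of lost data
down_time = service_restored - disaster      # 4 hours without service

# Hypothetical business objectives to compare against.
rpo = timedelta(hours=12)
rto = timedelta(hours=4)
objectives_met = data_loss <= rpo and down_time <= rto

print("data loss:", data_loss, "| down time:", down_time, "| met:", objectives_met)
```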

Ascending levels of DR options

• Backup & Restore: Backup of on-premises data to AWS to use in a DR event

– RPO: 24 hours; aRTO: 24 hours; Cost: $

• Pilot Light: Replicate data and minimal running services into AWS, ready to take over and flare up

– RPO: 12 hours; aRTO: 4 hours; Cost: $$

• Warm Standby: Replicate data and services into AWS, ready to take over

– RPO: 1-4 hours; aRTO: 15 min; Cost: $$$

• Hot-Site: Replicated and load-balanced environments that are both actively taking production traffic

– RPO: <15 min; aRTO: 0-5 min; Cost: $$$

[Scale: from "Business continuity begins" to "Un-interrupted business continuity"]
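The trade-off between the four options can be sketched as a lookup: pick the lowest-cost option whose RPO and aRTO both fit the business objective. The figures are the ones on the slide, with the upper end of each range used as the worst case (an assumption); the `cost_rank` values simply count the dollar signs shown.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo: timedelta   # worst-case data loss
    rto: timedelta   # worst-case down time
    cost_rank: int   # number of "$" signs on the slide


OPTIONS = [
    DrOption("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    DrOption("Pilot Light", timedelta(hours=12), timedelta(hours=4), 2),
    DrOption("Warm Standby", timedelta(hours=4), timedelta(minutes=15), 3),
    DrOption("Hot-Site", timedelta(minutes=15), timedelta(minutes=5), 3),
]


def cheapest_option(target_rpo: timedelta, target_rto: timedelta) -> Optional[DrOption]:
    """Return the lowest-cost option meeting both targets, or None if none does."""
    candidates = [o for o in OPTIONS if o.rpo <= target_rpo and o.rto <= target_rto]
    # min() keeps the earliest entry on ties, so Warm Standby wins over Hot-Site.
    return min(candidates, key=lambda o: o.cost_rank) if candidates else None


choice = cheapest_option(timedelta(hours=12), timedelta(hours=4))
print(choice.name if choice else "no option meets the targets")
```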

[Diagram: corporate data center backing up to an AWS region used for DR failover]

• On-premises (active production, www.example.com): web servers receiving Internet traffic, app servers, DB server, backup system, 1TB data volume

• Connectivity: VPN connection, Storage Gateway (iSCSI)

• AWS region: S3 bucket, Glacier archive

• Estimated cost: ~$200 / month in US-EAST + VPN

– S3 (1TB): $31/month

– Glacier (2TB): $22/month

– Storage Gateway: $125/month

– S3 (1TB): $31/month
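The per-service figures can be added up to check the headline estimate. The prices are the 2016 numbers from the slide; treating the two "S3 (1TB)" labels as two separate buckets is an assumption about the diagram.

```python
# Monthly figures as shown on the slide (USD, 2016 pricing).
monthly_costs_usd = [
    ("S3 (1TB)", 31),
    ("Glacier (2TB)", 22),
    ("Storage Gateway", 125),
    ("S3 (1TB), second bucket (assumed)", 31),
]

total_usd = sum(cost for _, cost in monthly_costs_usd)
print(f"Total: ${total_usd}/month before VPN charges")
```

The sum comes to $209, in line with the "~$200 / month + VPN" figure.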

Backup and Restore Architecture

Suitable for

• Solutions that can sustain higher technical debt

• Lower business critical nature

• Low cost DR option

Leverage existing investments in

• De-duplication

• Compression

• WAN Acceleration

Backup and Restore Details

Pilot light

Pilot light–prep

[Diagram: www.example.com served from the corporate data center, with a pilot light system in AWS]

• Corporate data center: reverse proxy/caching server, application server, master database server

• Data mirroring/replication from the master database server to a subordinate database server and data volume in AWS

• Pilot light system: reverse proxy/caching server and application server, not running

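Pilot light–recovery

[Diagram: www.example.com served from the pilot light system in AWS]

• Reverse proxy/caching server and application server: start in minutes

• Add additional capacity, if needed

• Data volume already in place from replication

• Corporate data center: application server, master database server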
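A minimal sketch of the "start in minutes" step, assuming the stopped servers are EC2 instances whose IDs are known in advance. `start_instances` is the call exposed by a boto3 EC2 client; the instance IDs are hypothetical, and `RecordingClient` is a stand-in so the example runs without AWS access.

```python
def start_pilot_light(ec2_client, instance_ids):
    """Start the pre-provisioned, stopped servers in the DR environment."""
    if not instance_ids:
        raise ValueError("no instance IDs given")
    return ec2_client.start_instances(InstanceIds=list(instance_ids))


class RecordingClient:
    """Stand-in for an EC2 client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def start_instances(self, **kwargs):
        self.calls.append(kwargs)
        return {"StartingInstances": [{"InstanceId": i} for i in kwargs["InstanceIds"]]}


# Hypothetical IDs for the reverse proxy and application servers.
client = RecordingClient()
result = start_pilot_light(client, ["i-proxy-example", "i-app-example"])
print(result)

# With real credentials the same function would take a boto3 client:
#   import boto3
#   start_pilot_light(boto3.client("ec2"), ["i-..."])
```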

Considerations

Suitable for:

• Solutions that need lower RTO & RPO

• Higher business critical nature

• Mid-range cost DR option

Pilot Light Details

Warm standby

Warm standby–prep

[Diagram: Route 53 directs www.example.com to the corporate data center, with a scaled-down standby in an AWS region]

• Corporate data center (active): application server, master database server

• Mirroring/replication to AWS; application data source cut over

• AWS region (scaled-down standby, not active for production traffic): Elastic Load Balancer, application server, subordinate database server, data volume

Warm standby–recover

[Diagram: Route 53 directs www.example.com to the AWS region]

• AWS region (active, scaled-up production): Elastic Load Balancer, application server, database server, data volume

• Corporate data center: application server, master database server
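The slides do not say which Route 53 routing policy drives the cut over. One possible approach, sketched here as an assumption, is a pair of weighted records whose weights are shifted toward AWS. The function only builds the change batch; the target hostnames are hypothetical, and the result has the shape accepted by the Route 53 `change_resource_record_sets` call.

```python
def weighted_cutover(record_name, on_prem_target, aws_target, aws_weight):
    """Build a Route 53 change batch that splits traffic by weight (0-255)."""
    if not 0 <= aws_weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")

    def upsert(set_id, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,  # short TTL so resolvers pick up the change quickly
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Changes": [
            upsert("on-premises", on_prem_target, 255 - aws_weight),
            upsert("aws", aws_target, aws_weight),
        ]
    }


# Full cut over: all weight on the AWS side (hostnames are placeholders).
batch = weighted_cutover(
    "www.example.com", "dc.example.com", "standby-elb.example.com", aws_weight=255
)
print(batch["Changes"][1]["ResourceRecordSet"]["Weight"])
```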

Hot site

Hot site–prep

[Diagram: Route 53 serves www.example.com from both sites, with cut over available]

• Corporate data center (active): application server, master database server

• AWS region (active): application server, subordinate database server, data volume

Hot site–recovery

[Diagram: Route 53 directs www.example.com to the AWS region]

• AWS region (active, scaled up for production use): application server, database server, data volume

• Corporate data center: application server, master database server

Considerations

Suitable for:

• Solutions that require RTO & RPO in minutes

• Core business critical functions

• Higher cost DR option

Warm Standby and Multi-site Details

Disaster Recovery Testing & Assurance

Continuous Testing of Infrastructure

• Continuously and constantly test.

• Regularly execute tests in stable, production & production-like test environments.

• Infrastructure as Code

• CI/CD Test in Infrastructure Build Pipeline

• Testing of infrastructure during Integration Test

Warm Standby – Testing



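[Diagram: Route 53 sends www.example.com production traffic to the corporate data center; cut over to the scaled-down AWS region is exercised as a test]

• Corporate data center: application server, master database server

• AWS region (scaled down): application server, subordinate database server, data volume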


aws rds reboot-db-instance --db-instance-identifier dbInstanceID --force-failover
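The same forced failover can be scripted for repeatable DR tests. The sketch below assumes a boto3 RDS client, whose `reboot_db_instance` call takes `ForceFailover=True` as the counterpart of `--force-failover`; a forced failover applies to Multi-AZ DB instances. `StubRdsClient` is a stand-in so the example runs without AWS access.

```python
def force_failover(rds_client, db_instance_id):
    """Reboot a Multi-AZ DB instance and force a failover to its standby."""
    return rds_client.reboot_db_instance(
        DBInstanceIdentifier=db_instance_id,
        ForceFailover=True,
    )


class StubRdsClient:
    """Stand-in for an RDS client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def reboot_db_instance(self, **kwargs):
        self.calls.append(kwargs)
        return {"DBInstance": {"DBInstanceIdentifier": kwargs["DBInstanceIdentifier"]}}


stub = StubRdsClient()
response = force_failover(stub, "dbInstanceID")
print(stub.calls)

# With real credentials:
#   import boto3
#   force_failover(boto3.client("rds"), "dbInstanceID")
```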

Architecting for Cloud

Architecting for Resiliency

Cloud Based Architectures

• High level of control over the environment

• Automate Everything! – Utilise AWS APIs

• Infrastructure as code – CloudFormation

• Parallel environment

• Rolling Update / All at Once

• Blue / Green Deployments

- A significant difference between physical and cloud environments is the control and visibility that cloud provides
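To make "infrastructure as code" concrete, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and serializes it to JSON. The resource names (`DrVpc`, `DrSubnetA`), the parameter, and the CIDR ranges are hypothetical; the resource types `AWS::EC2::VPC` and `AWS::EC2::Subnet` are standard CloudFormation types.

```python
import json

# A deliberately small template: one VPC and one subnet for a DR environment.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Hypothetical minimal network stack for a DR environment",
    "Parameters": {
        "VpcCidr": {"Type": "String", "Default": "10.0.0.0/16"},
    },
    "Resources": {
        "DrVpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": {"Ref": "VpcCidr"}},
        },
        "DrSubnetA": {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Ref": "DrVpc"},
                "CidrBlock": "10.0.0.0/24",
            },
        },
    },
}

template_body = json.dumps(template, indent=2)
print(template_body)
```

Because the template is plain text, it can be versioned and used to stand up a parallel environment for a blue/green deployment or a DR test.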

Common thread: Environment automation

Deployment success depends on mitigating risk for:

• Application issues (functional)

• Application performance

• People/process errors

• Infrastructure failure

• Rollback capability

• Large costs

CloudFormation is the most comprehensive automation platform

• Scope stacks from network to software

• Control higher-level automation services: Elastic Beanstalk, ECS, OpsWorks, Auto Scaling

[Diagram axis: strength of automation platform]

Benefits of deployment on AWS

AWS:

• Agile deployments

• Flexible options

• RPO/RTO & Business Continuity objectives

• Scalable capacity

• Pay for what you use

• Automation capabilities

Enterprise Observations

• Business Enablement

• Art of the Possible

• Legacy Tech Debt

Art of the Possible - State of DevOps 2016

• Frequent Deployments: 200x more frequent deployment

• Faster Recovery: 24x faster recovery from failure

• Lower Failure Rate: 3x lower change failure rate

• Less Unplanned Work: 22% less time spent on unplanned work and rework

• Shorter Lead Times: 2,555x shorter lead times

Source: Puppet Labs - State of DevOps 2016 Report

Thank You