Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016

Felix Candelario Global Financial Services Solutions Architect “Disaster Recovery and Business Continuity”

Upload: amazon-web-services

Posted on 15-Apr-2017

TRANSCRIPT

Page 1: Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016

Felix Candelario

Global Financial Services Solutions Architect

“Disaster Recovery and Business Continuity”

Page 2:

Agenda

• AWS Disaster Recovery Concepts & Terminology

• Architecting for Recovery & Resiliency

• Disaster Recovery Testing & Assurance

• Architecting for the Cloud

Page 3:

“Everything fails, all the time”

- Werner Vogels (CTO, Amazon.com)

Page 4:

Concepts & Terminology

Page 5:

DR Terminology Map

Your Data Centers: AWS equivalent

• Load Balancers: ELB/Appliance

• Web/App Servers: EC2/Auto Scaling

• DNS: Route 53

• Database Servers: Amazon RDS

• Firewall: Security Groups / ACL

• Data Centers: Availability Zones / VPC

• Geographical Redundancy: Multi-region

Page 6:

What is an AWS Region?

• A geographic location that contains a cluster of availability zones in a given metropolitan area.

• Each region is completely isolated and independent from other regions.

• Each region consists of 2 or more AZs to support high availability (HA) through AZ independence.

Page 7:

Highly Reliable Global Footprint

13+ worldwide regions

• Over 1 million active customers per month across 190 countries

• 2,300 government agencies

• 7,000 educational institutions

• 35 availability zones + 9 more coming soon

• 59 edge locations

Page 8:

What are Availability Zones?

• Groupings of one or more data centers that are physically isolated.

• AZs are connected to each other over low-latency links within the same region.

• Using 2 or more AZs within a region can provide support for capabilities such as synchronous database replication and better pricing when using Amazon EC2 Spot instances.

Page 9:

Availability Zones are Notated as Letters

35 Availability Zones (AZs)

• Example

• US East 1 (Northern VA)

– us-east-1a

– us-east-1b

– us-east-1c

– us-east-1d

– us-east-1e

[Diagram: region US-EAST-1 containing Availability Zones A, B, C, D and E]
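
The naming convention above can be sketched in a few lines: an AZ name is the region code followed by a letter. The helper name `az_names` is hypothetical and only rebuilds the us-east-1 list from the slide. Keep in mind that AZ letters are mapped per AWS account, so us-east-1a in one account is not necessarily the same physical zone as us-east-1a in another.

```python
import string
from typing import List


def az_names(region: str, count: int) -> List[str]:
    """Return AZ names for a region: the region code plus a letter suffix."""
    if not 1 <= count <= len(string.ascii_lowercase):
        raise ValueError("count must be between 1 and 26")
    return [f"{region}{letter}" for letter in string.ascii_lowercase[:count]]


# Rebuild the example list shown on the slide.
print(az_names("us-east-1", 5))
```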

Page 10:

What is an Amazon VPC?

• Virtual isolated network that you define in which you can launch AWS resources such as Amazon EC2 instances

• Complete control of your virtual networking environment, such as:

  • Set your own IP address ranges

  • Create subnets

  • Configure routing tables and network gateways

• Allows extension of your corporate network to the AWS Cloud
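
As a rough illustration of "set your own IP address ranges" and "create subnets", the sketch below carves one subnet per Availability Zone out of a VPC range using Python's standard `ipaddress` module. The 10.0.0.0/16 range, the /24 subnet size, the AZ list and the function name `plan_subnets` are all assumptions chosen for the example, not values from the talk.

```python
import ipaddress

# Assumed values for illustration only; they do not come from the slides.
VPC_CIDR = "10.0.0.0/16"
AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]


def plan_subnets(vpc_cidr, azs, new_prefix=24):
    """Assign one non-overlapping subnet of the VPC range to each AZ."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)  # generator of /24 blocks
    return {az: str(next(subnets)) for az in azs}


plan = plan_subnets(VPC_CIDR, AZS)
for az, cidr in plan.items():
    print(az, cidr)
```

Planning the ranges up front matters when the VPC extends a corporate network, because overlapping ranges on the two sides cannot be routed cleanly.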

Page 11:

VPC Pattern Diagram - Example

[Diagram: one Amazon VPC per environment: Development, Integration, Pre-production, Production]

Page 12:

Putting It All Together

Page 13:

What Compute Services are available?

• Amazon EC2: Elastic virtual servers in the cloud

• Elastic Load Balancing: Dynamic traffic distribution

• Auto Scaling: Automated scaling of EC2 capacity

Page 14:

What Network Services are available?

• Amazon VPC: Private, isolated section of the AWS Cloud (diagram shows Availability Zone A and Availability Zone B)

• AWS Direct Connect: Private connectivity between AWS and your datacenter

• Amazon Route 53: Domain Name System (DNS) web service

Page 15:

Architecting for Recovery & Resiliency

Page 16:

• Resiliency: Reducing likelihood of service failure

• Backup: Maintaining data integrity

• Disaster Recovery: Recovery after loss of availability

It’s not all or nothing. Choose a strategy that fits the business objective.

Page 17:

[Timeline: Recovery point, then Disaster, then Recovery time. Data loss is the span between the recovery point and the disaster; down time is the span between the disaster and the recovery time.]
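
To make the two spans concrete, here is a minimal sketch that computes data loss and down time for a single incident from three timestamps. The timestamps and the function name `recovery_metrics` are hypothetical; in practice the results would be compared against the agreed RPO and RTO.

```python
from datetime import datetime


def recovery_metrics(last_recovery_point, disaster, service_restored):
    """Return (data_loss, down_time) as timedeltas for one incident."""
    if not last_recovery_point <= disaster <= service_restored:
        raise ValueError("timestamps must be in chronological order")
    data_loss = disaster - last_recovery_point  # compare against the RPO
    down_time = service_restored - disaster     # compare against the RTO
    return data_loss, down_time


# Hypothetical incident: last backup 02:00, disaster 09:30, restored 13:30.
data_loss, down_time = recovery_metrics(
    datetime(2016, 10, 1, 2, 0),
    datetime(2016, 10, 1, 9, 30),
    datetime(2016, 10, 1, 13, 30),
)
print("data loss:", data_loss)
print("down time:", down_time)
```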

Page 18:

Ascending levels of DR options

Backup & Restore

• Backup of on-premises data to AWS to use in a DR event

• RPO: 24 hours

• aRTO: 24 hours

• Cost: $

Pilot Light

• Replicate data and minimal running services into AWS, ready to take over and flare up

• RPO: 12 hours

• aRTO: 4 hours

• Cost: $$

Warm Standby

• Replicate data and services into AWS ready to take over

• RPO: 1-4 hours

• aRTO: 15 min

• Cost: $$$

Hot-Site

• Replicated and load balanced environments that are both actively taking production traffic

• RPO: <15 min

• aRTO: 0-5 min

• Cost: $$$

The options range from "Business continuity begins" to "Un-interrupted business continuity".
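
One way to read the four options is as a lookup: pick the lowest-cost option whose RPO and aRTO both fit the business requirement. The sketch below encodes the slide's figures in minutes, taking the upper bound wherever the slide gives a range (an assumption); the names `Tier`, `TIERS` and `cheapest_tier` are hypothetical.

```python
from typing import NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    rpo_min: int  # worst-case RPO from the slide, in minutes
    rto_min: int  # worst-case aRTO from the slide, in minutes


# Ordered from lowest to highest cost, as on the slide.
TIERS = [
    Tier("Backup & Restore", 24 * 60, 24 * 60),
    Tier("Pilot Light", 12 * 60, 4 * 60),
    Tier("Warm Standby", 4 * 60, 15),
    Tier("Hot-Site", 15, 5),
]


def cheapest_tier(required_rpo_min: int, required_rto_min: int) -> Optional[Tier]:
    """Return the first (cheapest) tier meeting both objectives, or None."""
    for tier in TIERS:
        if tier.rpo_min <= required_rpo_min and tier.rto_min <= required_rto_min:
            return tier
    return None


# Example: at most 12 hours of data loss and 1 hour of down time.
print(cheapest_tier(12 * 60, 60))
```

A result of None means none of the listed options meets the requirement as stated, which is a prompt to revisit either the requirement or the architecture.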

Page 19:

Backup and Restore Architecture

~$200 / Month in US-EAST + VPN

[Diagram: Internet traffic for www.example.com reaches the on-premises active production site in the corporate data center (web servers, app servers, DB server, 1TB data volume, backup system). A Storage Gateway (iSCSI) and a VPN connection link it to the AWS region used for AWS DR failover, where data lands in S3 buckets and a Glacier archive.]

Cost labels on the diagram:

• S3 (1TB): $31/Month (shown on two S3 buckets)

• Glacier (2TB): $22/Month

• Storage Gateway: $125/Month
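
The headline figure can be checked against the labels with simple arithmetic. The sketch below assumes the two "S3 (1TB)" labels are two separate buckets, which is a reading of the flattened diagram rather than something the slide states; with a single bucket the listed items would come to $178 instead.

```python
# Monthly line items as printed on the slide (USD, US-EAST).
# Assumption: the two "S3 (1TB)" labels refer to two separate buckets.
line_items = [
    ("S3 (1TB)", 31),
    ("S3 (1TB)", 31),
    ("Glacier (2TB)", 22),
    ("Storage Gateway", 125),
]

total = sum(cost for _, cost in line_items)
print(f"Listed items: ${total}/month, VPN charged separately")
```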

Page 20:

Backup and Restore Details

Suitable for

• Solutions that can sustain higher technical debt

• Lower business critical nature

• Low cost DR option

Leverage existing investments in

• De-duplication

• Compression

• WAN Acceleration

Page 21:

Pilot light

Page 22:

Pilot light–prep

[Diagram: www.example.com is served from the corporate data center (reverse proxy/caching server, application server, master database server). Data mirroring/replication feeds the pilot light system: a subordinate database server with its data volume. The pilot light's reverse proxy/caching server and application server are not running.]

Page 23:

Pilot light–recovery

[Diagram: the pilot light system takes over www.example.com. Its reverse proxy/caching server and application server start in minutes alongside the database server and data volume; add additional capacity if needed. The corporate data center stack (reverse proxy/caching server, application server, master database server) is shown alongside.]

Page 24:

Pilot Light Details

Considerations

Suitable for:

• Solutions that need lower RTO & RPO

• Higher business critical nature

• Mid-range cost DR option

Page 25:

Warm standby

Page 26:

Warm standby–prep

[Diagram: Route 53 resolves www.example.com to the corporate data center, which is active (reverse proxy/caching server, application server, master database server). Mirroring/replication feeds a scaled-down standby in the AWS region (Elastic Load Balancer, reverse proxy/caching server, application server, subordinate database server, data volume) that is not active for production traffic. The application data source can be cut over.]

Page 27:

Warm standby–recover

[Diagram: Route 53 resolves www.example.com to the Elastic Load Balancer in the AWS region, which is now active and scaled up for production (reverse proxy/caching server, application server, database server, data volume). The corporate data center stack (reverse proxy/caching server, application server, master database server) is shown alongside.]

Page 28:

Hot site

Page 29:

Hot site–prep

[Diagram: Route 53 serves www.example.com from two active sites: the corporate data center (reverse proxy/caching server, application server, master database server) and the AWS region (Elastic Load Balancer, reverse proxy/caching server, application server, subordinate database server, data volume). Mirroring/replication links the two; the application data source can be cut over.]

Page 30:

Hot site–recovery

[Diagram: Route 53 sends www.example.com traffic to the Elastic Load Balancer in the AWS region, which is active and scaled up for production use (reverse proxy/caching server, application server, database server, data volume). The corporate data center stack (reverse proxy/caching server, application server, master database server) is shown alongside.]

Page 31:

Warm Standby and Multi-site Details

Considerations

Suitable for:

• Solutions that require RTO & RPO in minutes

• Core business critical functions

• Higher cost DR option

Page 32:

Disaster Recovery Testing & Assurance

Page 33:

Continuous Testing of Infrastructure

• Continuously and constantly test.

• Regularly execute tests in stable, production & production-like test environments.

• Infrastructure as Code

• CI/CD Test in Infrastructure Build Pipeline

• Testing of infrastructure during Integration Test
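
An infrastructure test in a build pipeline can be as small as an assertion over the declared environment. The sketch below checks that a tier is spread over at least two Availability Zones; the inventory format and the function name `spans_multiple_azs` are hypothetical stand-ins for whatever the infrastructure-as-code tooling actually produces.

```python
def spans_multiple_azs(instances, minimum=2):
    """True if the instances are spread over at least `minimum` distinct AZs."""
    return len({inst["az"] for inst in instances}) >= minimum


# Hypothetical inventory, e.g. parsed from an infrastructure-as-code template.
inventory = [
    {"id": "web-1", "az": "us-east-1a"},
    {"id": "web-2", "az": "us-east-1b"},
]

# Fail the pipeline step if the web tier sits in a single AZ.
assert spans_multiple_azs(inventory), "web tier must span at least two AZs"
print("AZ spread check passed")
```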

Page 34:

Warm Standby – Testing

[Diagram: Route 53 resolves www.example.com to the corporate data center, which is active (reverse proxy/caching server, application server, master database server). Mirroring/replication feeds a scaled-down standby in the AWS region (Elastic Load Balancer, reverse proxy/caching server, application server, subordinate database server, data volume) that is not active for production traffic. The application data source can be cut over.]

Page 37:

Warm Standby – Testing

[Diagram: the same warm standby layout (active corporate data center, scaled-down standby in the AWS region that is not active for production traffic), shown with the database failover test command:]

aws rds reboot-db-instance --db-instance-identifier dbInstanceID --force-failover
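
For scripted drills, the same command can be assembled programmatically so that it is reviewed and logged before it runs. The sketch below only builds and prints the argument list; the function name `force_failover_command` is hypothetical, and `dbInstanceID` is the slide's placeholder, not a real identifier. A forced failover only applies to an RDS instance deployed as Multi-AZ.

```python
import shlex


def force_failover_command(db_instance_id: str) -> list:
    """Build the argv for the RDS forced-failover reboot shown on the slide."""
    if not db_instance_id:
        raise ValueError("db_instance_id must be non-empty")
    return [
        "aws", "rds", "reboot-db-instance",
        "--db-instance-identifier", db_instance_id,
        "--force-failover",
    ]


cmd = force_failover_command("dbInstanceID")
# Print for review; a drill script could then pass cmd to subprocess.run.
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if an identifier ever contains unexpected characters.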

Page 38:

Page 39:

Architecting for Cloud

Page 40:

Architecting for Resiliency

Page 41:

Cloud Based Architectures

• High level of control over the environment

• Automate Everything! – Utilise AWS APIs

• Infrastructure as code – CloudFormation

• Parallel environment

• Rolling Update / All at Once

• Blue / Green Deployments

- A significant difference between physical and cloud infrastructure is the control and visibility the cloud provides
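
To show what "infrastructure as code" looks like in its smallest form, the sketch below generates a CloudFormation template containing a single VPC as a Python dictionary and prints it as JSON. The logical name `DrVpc`, the description and the 10.0.0.0/16 range are assumptions for illustration; a real DR stack would add subnets, routing and the application tiers.

```python
import json


def minimal_vpc_template(cidr: str = "10.0.0.0/16") -> dict:
    """Return a CloudFormation template (as a dict) declaring one VPC."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative DR VPC (sketch only)",
        "Resources": {
            "DrVpc": {  # hypothetical logical resource name
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": cidr, "EnableDnsSupport": True},
            }
        },
    }


template = minimal_vpc_template()
print(json.dumps(template, indent=2))
```

Because the template is plain data, the same definition can be kept in version control and used to build a parallel environment for DR tests or blue/green deployments.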

Page 42:

Common thread: Environment automation

Deployment success depends on mitigating risk for:

• Application issues (functional)

• Application performance

• People/process errors

• Infrastructure failure

• Rollback capability

• Large costs

Strength of automation platform

CloudFormation is the most comprehensive automation platform:

• Scope stacks from network to software

• Control higher-level automation services: Elastic Beanstalk, ECS, OpsWorks, Auto Scaling

Page 43:

Benefits of deployment on AWS

AWS:

• Agile deployments

• Flexible options

• RPO/RTO & Business Continuity objectives

• Scalable capacity

• Pay for what you use

• Automation capabilities

Page 44:

Enterprise Observations

• Business Enablement

• Art of the Possible

• Legacy Tech Debt

Page 45:

Art of the Possible - State of DevOps 2016

• Frequent Deployments: 200x more frequent deployment

• Faster Recovery: 24x faster recovery from failure

• Lower Failure Rate: 3x lower change failure rate

• Less Unplanned Work: 22% less time spent on unplanned work and rework

• Shorter Lead Times: 2,555x shorter lead times

Source: Puppet Labs - State of DevOps 2016 Report

Page 46:

Thank You