
Post on 22-May-2020


Disaster Recovery Technologies

Seann Herdejurgen

Technical Product Manager

Your Business is Expected to Be Running 24x7, but Outages Happen

DRJ - Disaster Recovery Technologies 2


[Bar chart: "How many of each of the following has caused your organization to experience downtime in the past five years? (Mark all that apply)". Horizontal axis: 0% to 100%. Category labels that survive: Power outage / failure / issues; Configuration change management issues; Data leakage or loss; Flood; Earthquake; Terrorism; Volcano; Other. Bar values that survive: 72%, 70%, 69%, 64%, 63%, 63%, 63%, 48%, 47%, 46%, 46%, 45%, 44%, 42%, 42%, 1%.]

Direct Cost of Downtime

As outage duration increases, actual & indirect costs accelerate.

Downtime   Retailer   Financial
1 hour     $60K       $16M
6 hours    $360K      $96M
1 day      $1.44M     $384M
3 days     $4.32M     $1.15B
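The direct-cost figures in the table scale linearly with outage duration, which a short sketch can reproduce. The hourly rates are taken from the 1-hour row; the function name and the linear model are illustrative assumptions and cover direct costs only, not the indirect costs that accelerate over time.

```python
def direct_cost(hours: float, hourly_rate: float) -> float:
    """Direct downtime cost under a simple linear model (an assumption)."""
    return hours * hourly_rate

RETAILER_RATE = 60_000        # $ per hour, from the 1-hour row
FINANCIAL_RATE = 16_000_000   # $ per hour, from the 1-hour row

for label, hours in [("1 hour", 1), ("6 hours", 6), ("1 day", 24), ("3 days", 72)]:
    print(label, direct_cost(hours, RETAILER_RATE), direct_cost(hours, FINANCIAL_RATE))
```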


Costs: Direct and Indirect

Direct Costs
• Lost Revenue
• Lost Productivity

Indirect Costs
• Stock Price
• Reputation
• Market Share
• Brand Equity
• Customer Satisfaction

Availability

Availability   Downtime per year
99%            3.65 days
99.9%          8.76 hours
99.99%         52 minutes
99.999%        5 minutes
99.9999%       31 seconds

Availability is the percentage of total time that a Network, System or Service is available for use.
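The downtime figures for each availability level follow directly from this definition when measured over one year. The sketch below shows the arithmetic; the function name is illustrative and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year (assumption)

def annual_downtime_seconds(availability_pct: float) -> float:
    """Seconds of downtime per year allowed at a given availability percentage."""
    return (100.0 - availability_pct) / 100.0 * SECONDS_PER_YEAR

for pct in (99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {annual_downtime_seconds(pct):,.1f} s/year")
```

For 99% this gives 315,360 seconds (3.65 days), and for 99.9999% about 31.5 seconds.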


Threats to Availability

Logical
• Software defect
• Virus
• Data corruption
• Hacker
• Accidental delete
• Dropped table
• Memory leak

Hardware
• CPU
• Disk
• Memory
• NIC
• Switch
• HBA
• SAN
• UPS

Site-wide
• Power outage
• Network outage
• Flood
• Fire
• Tornado
• Hurricane
• Earthquake

Definitions

• Application – one or more infrastructure components needed to perform a particular function. An application can be as simple as a database instance, mail server, or DNS server, or it can be as complicated as amazon.com.

• Clustering – automating processes to maximize availability of an application in the event of a component failure.

• Replication – copying data from a master copy to a replica copy. Replication is used to recover from a failure to access the master copy.

• Write order fidelity – data is replicated in the order it was written at the primary site. This provides a crash-consistent data copy at the target site.
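Write order fidelity can be illustrated with a toy replica that applies writes strictly in the sequence the primary issued them, holding back any write that arrives early. This is a minimal sketch, not how any particular replication product works; the class, the sequence numbers, and the block addressing are all hypothetical.

```python
class Replica:
    """Applies replicated writes strictly in primary write order."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}                  # applied data by block number
        self.next_seq = 0                                   # next sequence number to apply
        self.pending: dict[int, tuple[int, bytes]] = {}     # writes waiting for predecessors

    def receive(self, seq: int, block: int, data: bytes) -> None:
        # Buffer the write, then apply every write whose predecessors are applied.
        self.pending[seq] = (block, data)
        while self.next_seq in self.pending:
            blk, payload = self.pending.pop(self.next_seq)
            self.blocks[blk] = payload
            self.next_seq += 1

replica = Replica()
replica.receive(1, block=7, data=b"commit")  # arrives early, so it is held back
replica.receive(0, block=3, data=b"row")     # unblocks both writes, in order
```

Because the replica never applies write 1 before write 0, its contents at any moment match some state the primary actually passed through, which is what makes the copy crash-consistent.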


Why do you need clustering?

Automate with Confidence


What makes up an application?

• Application Code
• Data
• Network

How do you implement DR?

• Application code: static content; no need to replicate
• Data: dynamic content; must replicate
• Network: generally static content; no need to replicate

Clustering Concepts

[Diagram labels: APP 1, APP 2, APP 3, APP 4, SAP; Replication]

• HA: Local Clustering
• DR: Remote Clustering


Clustering Concepts

[Diagram: Shared Storage vs. Replicated Data Cluster (host-based or array-based replication)]

Business Decision
• Manual failover if RPO > 0
• Automatic failover if RPO = 0
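The business rule on this slide (manual failover if RPO > 0, automatic failover if RPO = 0) is small enough to express directly. A plausible reading is that with a non-zero RPO a failover may discard recent writes, so a person should make the call. The function name and the use of seconds are illustrative assumptions.

```python
def failover_mode(rpo_seconds: float) -> str:
    """Return the failover style the slide suggests for a given RPO."""
    if rpo_seconds < 0:
        raise ValueError("RPO cannot be negative")
    # Zero RPO means no data is lost by failing over, so it can be automated.
    return "automatic" if rpo_seconds == 0 else "manual"

print(failover_mode(0))
print(failover_mode(300))
```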

Clustering Concepts

                          Local            Remote
Shared Storage            HA Cluster       Campus Cluster
Replicated Data Cluster   Shared Nothing   Global Cluster


Campus Cluster

[Diagram: Active Site A and Active Site B, less than 65 km apart, with mirrored shared storage.]

Clustering Concepts

[Diagram: Application Clustering vs. System Clustering. Labels: Host 1, Host 2, VMware ESX, VMware HA, VM, OS, SQL, ApplicationHA.]


Cluster Building Blocks

Clustering uses a combination of redundant hardware, communication links and software configuration to achieve high availability.

Servers
• Similar sized servers are recommended
• N+1 redundancy

Storage
• Redundant storage
• Redundant HBAs & SAN switches*
• Multi-Pathing

Networking
• Redundant network interfaces for heartbeat links
• Redundant network links for TCP/IP

The transformation of High Availability

Infrastructure Availability

Application Availability

High Availability


What are the benefits of High Availability?

• Minimize downtime

• Automate DR

• Load balancing

• Increase manageability / serviceability

• Enforce dependencies between application components

• Reduce number of personnel needed during incidents


Typical clustering projects for customers

• HA from scratch

• Hardware refresh standardization

• New deployments

• Mergers & Acquisitions

• Cluster standardization

• Regulatory requirements that necessitate HA / DR

• Reduce downtime costs


Business Continuity

• Disaster Recovery – Recover mission-critical technology and applications at an alternate site.
• Business Continuity Planning – Developing contingency plans for external events that interrupt business operations.
• Business Impact Analysis – Analyzing and assigning a level of importance to business functions.
• Work Area Recovery – Recover the business process at an alternate site.

Business Continuity Concepts

[Timeline diagram: Recovery Point and Recovery Time, each on a scale of seconds, minutes, hours, days, and weeks.]

• Recovery Point Objective (RPO)

– The point in time to which data can be successfully restored

• Amount of data loss acceptable

• Recovery Time Objective (RTO)

– The time it takes to restore data and applications

• Amount of time it takes to come back online
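To make the two objectives concrete, the sketch below measures what an incident actually achieved and compares it with the targets. The timestamps, the function names, and the 30-minute / 1-hour objectives are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_good_copy: datetime, failure: datetime, restored: datetime):
    """Return (data-loss window, outage duration) for one incident."""
    return failure - last_good_copy, restored - failure

def meets_objectives(loss: timedelta, outage: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True when the data lost is within the RPO and the outage is within the RTO."""
    return loss <= rpo and outage <= rto

loss, outage = recovery_metrics(
    datetime(2020, 1, 1, 11, 45),  # last replicated or backed-up state
    datetime(2020, 1, 1, 12, 0),   # failure occurs
    datetime(2020, 1, 1, 13, 30),  # service back online
)
print(loss, outage)
print(meets_objectives(loss, outage, rpo=timedelta(minutes=30), rto=timedelta(hours=1)))
```

In this example the 15 minutes of lost data is within the RPO, but the 90-minute outage misses the 1-hour RTO, so the check fails.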


Architecting Disaster Recovery

As your business requirements for RPO / RTO decrease from days to minutes to zero, the technology required to support your DR solution changes.

Objective   RPO – protection methods                 RTO – recovery methods
Days        Vault backup tapes                       Restore vaulted backup tapes
Minutes     Asynchronous replication                 HA/DR failover
Zero        Maximum WAN bandwidth and synchronous    Campus cluster (<80 km);
            replication / mirroring                  active / active application support

Failover Times

Approach                        RTO          Cost   Complexity
Single Instance Failover        Minutes      $      Low
Single Instance Fast Failover   Sub-minute   $$     Medium
  (Cluster File System)
Clustered Databases             Seconds      $$$$   High

[Diagram: HA-protected DB instances for the single-instance approaches, a Cluster File System for fast failover, and multiple DB instances forming a Clustered Database.]


How Long Does it Take You to Recover from a Failure?

Manual Recovery (timeline from 00:00:00 to 04:00:00)
• Failure occurs
• Contact on-call personnel, subject matter experts & business leaders
• Troubleshoot incident
• Declare disaster
• Take application components offline in order (production)
• Wait for data to replicate
• Bring application components online in order (contingency)
• Validate applications are running correctly
• Resume normal operations

Challenges
• Incident recognition?
• Problem diagnosis?
• Personnel available?
• Operator error?
• Missing patch?
• Wrong configuration?
• Coordination between IT teams?

Business Impact Analysis

[Table: Tier 1 through Tier 4, with rows for RTO, RPO, and Apps. The cell layout did not survive extraction; the values that appear are listed below.]
• RTO values: < 1 hour, < 6 hours, < 12 hours, When Convenient
• RPO values: Near Zero, Hours, Today, Days
• Apps: Oracle, DB2, SAP; Web Applications; Other Applications; Email; Other Databases; Applications; Intra-web; VMware; Less critical applications; Other less critical applications

Constraints
• Data growth
• DR testing
• Cost of recovery
• Manual process errors
• # of sites
• Virtualization
• Distance between sites


Data Protection Evolution

Technology                          Frequency
Backup to tape                      Daily
Point-in-time snapshot              Several times a day
Periodic replication                Every 30 minutes
Sync/Async replication              Continuous
Continuous Data Protection (CDP)    Continuous backups

Synchronous Replication

[Chart: Writes (MB/s) over Time for a typical workload; the required bandwidth line sits at the Max write rate. The slide also shows a diagram with steps numbered 1 to 4.]


Asynchronous Replication

[Chart: Writes (MB/s) over Time for a typical workload; the required bandwidth line sits at the Average write rate. The slide also shows a diagram with steps numbered 1 to 4.]
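The two bandwidth charts reduce to a simple sizing rule: synchronous replication needs a link that can carry the peak write rate, while asynchronous replication needs one that can carry the average. A minimal sketch, assuming write rates are sampled in MB/s; the sample values and the function name are hypothetical.

```python
from statistics import mean

def required_bandwidth(write_rates_mb_s: list[float]) -> dict[str, float]:
    """Size the replication link from sampled write rates (MB/s)."""
    return {
        "synchronous": max(write_rates_mb_s),    # must absorb the peak
        "asynchronous": mean(write_rates_mb_s),  # must keep up with the average
    }

samples = [10, 20, 80, 30, 10]  # hypothetical MB/s samples over time
print(required_bandwidth(samples))
```

For these samples the synchronous link needs 80 MB/s, while the asynchronous link needs 30 MB/s plus enough log space to absorb the bursts.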

Mirroring vs. Replication

[Diagram: distance scale (1 m, 1 km, 100 km, 1,000 km, 10,000 km) showing the ranges covered by Logical Volume Mirror, Synchronous replication, and Asynchronous replication.]
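Distance matters for synchronous techniques because each write waits for an acknowledgment from the remote site, so it pays at least one round trip. The sketch below estimates that floor, assuming light travels through optical fiber at roughly 200,000 km/s; it ignores switching, protocol, and storage latency, so real figures will be higher.

```python
FIBER_KM_PER_SECOND = 200_000  # approximate speed of light in optical fiber (assumption)

def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay over fiber, in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000

for km in (1, 100, 1_000, 10_000):
    print(f"{km:>6} km: {round_trip_ms(km):.2f} ms added per acknowledged write")
```

At 100 km the floor is about 1 ms per write; at 10,000 km it is about 100 ms, which is why long distances call for asynchronous replication.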


Asynchronous Replication Considerations

• What do you do when your storage replicator log fills up?

– suspend replication to preserve a crash-consistent copy of data

– delay I/O until replication catches up (performance hit)

– add more SRL / journal storage

– increase WAN bandwidth

• If you don’t have enough network bandwidth to support your average write rate, your replication solution will fail
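The point about average write rate can be quantified: whenever writes arrive faster than the link drains them, the replication log grows, and its size determines how long that can continue. A minimal sketch, assuming constant rates in bytes per second and an initially empty log; the function name and the example sizes are hypothetical.

```python
def seconds_until_log_full(log_bytes: float, write_rate: float, link_rate: float):
    """Time until the replication log fills, or None if the link keeps up.

    Rates are in bytes per second; the log is assumed empty at the start.
    """
    backlog_growth = write_rate - link_rate
    if backlog_growth <= 0:
        return None  # the link drains writes at least as fast as they arrive
    return log_bytes / backlog_growth

GIB = 1024 ** 3
MIB = 1024 ** 2
print(seconds_until_log_full(10 * GIB, write_rate=50 * MIB, link_rate=30 * MIB))
```

With a 10 GiB log, 50 MiB/s of writes, and a 30 MiB/s link, the log fills in 512 seconds. A larger log only buys time during bursts; if the sustained average exceeds the link rate, the log eventually fills regardless of its size.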


Based on

actual events

Replication Network Performance Tuning

• Configure bandwidth limits

• Enable jumbo frames

• Max out TCP window size

• Firewalls - Increase network buffers

• Enable compression

• Use fewer network devices

• Use multiple FCIP tunnels

• Update NIC firmware
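The advice to max out the TCP window size follows from the bandwidth-delay product: a TCP connection can have at most one window of unacknowledged data in flight per round trip, so the window must be at least bandwidth times round-trip time to fill the link. The sketch below computes that figure; the link speed and round-trip time are hypothetical.

```python
def bandwidth_delay_product_bytes(link_bits_per_second: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the link full."""
    return link_bits_per_second * rtt_seconds / 8

# Hypothetical link: 1 Gbit/s with a 50 ms round trip
print(bandwidth_delay_product_bytes(1e9, 0.050))
```

That link needs about 6.25 MB in flight, far more than a 64 KB window without window scaling can carry.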


Replication modes to meet your SLA

Synchronous
• Maximum Protection: Zero data loss
• Ideal for small distances (< 100 km)

Asynchronous
• Maximum Performance: Limited data loss
• Ideal for any distance between sites

Bunker
• Maximum Protection + Maximum Performance
• Zero data loss over any distance

Bunker Replication

Bunker replication allows you to replicate data synchronously to your bunker site and asynchronously to your contingency site. This supports a zero RPO for data during a primary site failure.

[Diagram: Primary Site, Bunker Site, and Contingency Site, with Synchronous Replication to the Bunker Site and Asynchronous Replication to the Contingency Site.]


Parallel replication to multiple DR sites

[Diagram: VVR replicating from the Production Site to Disaster Recovery Site 1 and Disaster Recovery Site 2.]

• Each secondary site can be at a different RPO
• Ideal for bringing up a new site and retiring an old site with no loss of DR coverage

Disaster Recovery


• Periodic replication

• Save bandwidth – replicate subset of files

• Suitable for triggered replication

[Diagram: App at the Primary Site, Periodic Replication over any distance, App at the Disaster Recovery Site.]


Content Distribution

• Distribute files between sites
• One source to many targets
• Share selected files or directories
• Single direction of data flow

[Diagram: App at the Central Office sending Content Refresh / Periodic Updates to Branch Office 1 and Branch Office 2.]

On-Demand Replication


• Replicate On-Demand

• Content distribution on a non-periodic interval

• Refresh test / pre-production environments

[Diagram: App in Production, Replicate On-Demand, App in Pre-Prod.]


Off Host Processing

[Diagram: in the Primary Datacenter, the application writes to the Primary File System; data crosses the WAN to a Read-Only Target File System in the Secondary Datacenter.]

• Replication has exclusive R/W access to target file systems
• Prevents accidental writes

Global Clustering

[Diagram: clusters connected over an IP Network.]


High Availability and Disaster Recovery Architectures

[Diagram: application instances (SAP, APP 1 to APP 4) spread across the architectures below.]

• Local HA
• Metropolitan HA (Campus Cluster): Sync Replication or Mirroring
• Wide-Area DR (Global Cluster): Asynchronous Replication
• Bunker Site: Synchronous Replication and Asynchronous Replication between the Primary Site and the Secondary Site

Replication Options

• Array Based Replication
• Appliance Based Replication
• Application Based Replication
• Host Based Replication


Array Based Replication

Vendor    Synchronous    Asynchronous
EMC       SRDF/S         SRDF/A
Hitachi   TrueCopy       HUR
IBM       MetroMirror    Global Mirror
NetApp    MetroCluster   SnapMirror

Fire Drill

[Diagram: a Volume at the Primary Site is replicated to the DR Site, where a Snapshot is used for testing. Labels: Initiate Fire Drill, Simulate DR Failover, Mount Snapshot, Test Application, Resume Operations.]

The Benefits
• Comprehensive Disaster Recovery simulation
• No impact on application at production site
• Logs for DR readiness audit


Automating Business Applications

[Diagram: a Billing application made up of Web, App, and DB tiers. Labels: Start Web Tier, Start App Tier, Start DB Tier (shown as Started / ON); Application Start; Application Stop; High Availability; Disaster Recovery; Security; Status Summary.]
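Automating a multi-tier application start comes down to ordering tiers by their dependencies. The sketch below derives a start order with Python's standard `graphlib` module (Python 3.9+) and stops in the reverse order. The dependency map is an assumption for illustration: it supposes the web tier needs the app tier, which in turn needs the database.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each tier lists the tiers it needs running first.
DEPENDS_ON = {"web": {"app"}, "app": {"db"}, "db": set()}

def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order in which tiers can be started so dependencies come up first."""
    return list(TopologicalSorter(depends_on).static_order())

def stop_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Stop in the reverse of the start order."""
    return list(reversed(start_order(depends_on)))

print(start_order(DEPENDS_ON))
print(stop_order(DEPENDS_ON))
```

Encoding the dependencies as data rather than as a hand-written runbook is what lets the same sequence run identically for an HA failover and a DR failover.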

Types of Disaster Recovery Tests

• Walkthrough – Key stakeholders meet to review the layout and contents of a plan
• Tabletop Exercise – Key stakeholders rehearse a specific threat scenario
• Simulation – IT team invokes the plan in a controlled situation without impacting business operations
• Full Test – IT team performs an actual failover of IT systems and end-user processing to the DR site


Disaster Recovery Testing

• Reality – Plan documents and procedures are rarely referenced during an actual disaster

• Testing helps train team members to operate effectively despite the “heat of battle” that often obscures centralized command, control and communications capabilities of emergency decision-makers


Top DR Testing Rules

• Test regularly, more is better

• Test using different personnel

• Test after significant changes in business or infrastructure

• Test ALL infrastructure / application components

• Re-test when test fails to meet objectives


Advanced DR Configurations

3DC and 4DC

Hardware Replication

• SAN vendors support various replication configurations which can be cascaded to configure complex solutions to meet a customer’s needs

– Bunker replication

– 3DC – Three data center replication

– 4DC – Four data center replication


3DC Replication

• A 3DC solution is any combination of replication that involves three data centers

[Diagrams: (1) primary, bunker, DR; (2) primary, DR1, DR2.]

4DC Replication

• A 4DC solution is any combination of replication that involves four data centers

[Diagrams: (1) primary, bunker, secondary, tertiary; (2) primary, bunker 1, secondary, bunker 2.]

Bunker Site vs. Asynchronous Replication

• Bunker sites can be costly in terms of capital and operational expenses

• Businesses can instead purchase additional bandwidth to reduce their RPO time to near zero

• Near zero RPO may be good enough to avoid the cost and complexity of running a bunker site


Thank You!

Tak!

Dank u!

شكرا لك

Kiitos! 謝謝!

Merci! Obrigado!

Děkuji vám!

Danke!

谢谢!

Falemnderit!

תודה

σας ευχαριστώ!

धन्यवाद!

Köszönöm!

Grazie!

ありがとうございました

감사합니다

Takk!

Спасибо

Gracias!

ขอบคุณ

Tack!

Thank you!

Seann Herdejurgen

[email protected]
