Tiered Disaster Recovery
DR SLAs That Work
Bill Peldzus
Vice President, Data Center and Business Continuity/Disaster Recovery Services
GlassHouse Technologies
It’s all about Risk Avoidance
Audience Response
Today, at your company, DR is …
1. A top priority, with full management support and budget, dedicated staff and detailed testing.
2. A “best effort” endeavor, with ad hoc testing, some management support and a limited budget.
3. We ship some of our backup tapes offsite and hope for the best.
4. What’s DR?
Some Recent DR Events
Unfortunately, it’s now a question of when, not if, some type of disaster will hit
• Terrorism
  • 9/11 and ongoing threats
• Internal mistakes
  • Newbie DBA meant backup, not delete!
• Employee sabotage
  • Former sys admin planted a logic bomb in financial sector
• Blackouts
  • Rolling blackouts are the norm the past few summers
• Pandemics
  • Avian flu threat
• Geographic
  • Hurricanes, blizzards, tornados, earthquakes…
Agenda
• Today, we will be discussing…
  • Business Continuance definition
  • RTOs, RPOs and why they are important
  • Classifying applications and systems into tiers
    • Determining how many classifications and service levels you should consider
  • Documenting your tiers and service level agreements
  • Getting buy in from management and business units
  • Data Protection Options to meet SLAs
  • Best practices and experiences in real BC/DR testing
Business Continuity
• Business continuance is the umbrella for many concepts including:
  • Fault tolerance
  • High availability
  • Backup/restore
  • Redundancy
  • DR
• For this presentation, the focus is on DR and SLAs, with a business continuity theme
Key Recovery Metrics: RTO/RPO
RTO -- Recovery Time Objective
Maximum amount of time to recover from a disruption and be up and running again
“How long before we are operational again?”
RPO -- Recovery Point Objective
The maximum amount of data loss (measured in time) acceptable in the event of a disruption
“How much data can we lose?”
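As a minimal sketch of how the two metrics could be measured after an incident or a test, the Python below compares the achieved downtime and data loss with the objectives. The helper name, the field names and the timestamps are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryResult:
    downtime: timedelta   # disruption -> service operational again
    data_loss: timedelta  # last recoverable copy -> disruption
    met_rto: bool
    met_rpo: bool


def evaluate_recovery(disruption, last_good_copy, operational_again, rto, rpo):
    """Compare what was achieved with the RTO and RPO (hypothetical helper)."""
    downtime = operational_again - disruption
    data_loss = disruption - last_good_copy
    return RecoveryResult(downtime, data_loss, downtime <= rto, data_loss <= rpo)


# Illustrative timestamps only, not taken from a real event.
result = evaluate_recovery(
    disruption=datetime(2024, 1, 10, 9, 0),
    last_good_copy=datetime(2024, 1, 9, 23, 0),      # last nightly backup
    operational_again=datetime(2024, 1, 11, 15, 0),
    rto=timedelta(hours=24),
    rpo=timedelta(hours=24),
)
print(result.downtime, result.met_rto)    # 30 hours of downtime misses a 24-hour RTO
print(result.data_loss, result.met_rpo)   # 10 hours of lost data is within a 24-hour RPO
```

Keeping the two checks separate mirrors the two questions on the slide: an event can meet the RPO and still miss the RTO, as it does here.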
RTO/RPO: Recovery Types
• Operational
  • Daily recovery within the primary site
  • Recover from
    • Accidental deletion of a file
    • Corrupted database, virus, etc.
• Disaster
  • Recovery from a catastrophic event to a remote location
  • Recover from
    • Geographic
    • Terrorist
    • Environmental
Defining RTO/RPO Classes
Class 1: RTO & RPO: 4 hours. Relative cost: $$$$$
• Replication
• Hot standby (dedicated); in-sourced or outsourced facilities
Class 2: RTO = 24 hours, RPO = 24 hours. Relative cost: $$$$
• Replication and/or off-site DR backups
• In-house or outsourced
Class 3: RTO = 72 hours, RPO = 72 hours. Relative cost: $$$
• Available hardware; standard recovery from tape
• Outsourced/Contracted usually more cost-effective (hot site or mobile)
Class 4: RTO = 5 days, RPO = 1 week. Relative cost: $$
• Quick-ship program most cost-effective; location TBD
• Standard recovery from tape
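One way to put the classes above to work: given the RTO and RPO a business unit asks for, pick the least expensive class that still satisfies both. The Python sketch below uses the four RTO/RPO pairs from the slide. The function name is hypothetical, and reading each value as an upper bound ("recover within N hours") is an assumption.

```python
# (class number, RTO hours, RPO hours) as listed on the slide,
# ordered from most to least expensive.
RECOVERY_CLASSES = [
    (1, 4, 4),
    (2, 24, 24),
    (3, 72, 72),
    (4, 5 * 24, 7 * 24),  # RTO = 5 days, RPO = 1 week
]


def cheapest_class(required_rto_hours, required_rpo_hours):
    """Return the least expensive class whose RTO and RPO both fit the requirement.

    Assumes each class value is an upper bound on recovery time or data loss.
    Returns None when even Class 1 cannot meet the requirement.
    """
    for number, rto, rpo in reversed(RECOVERY_CLASSES):
        if rto <= required_rto_hours and rpo <= required_rpo_hours:
            return number
    return None


print(cheapest_class(48, 24))    # 2: Class 3's 72-hour RTO is too slow
print(cheapest_class(200, 200))  # 4
print(cheapest_class(1, 1))      # None: tighter than Class 1 offers
```

Walking the list from the cheapest class upward keeps the expensive options for the applications that actually need them.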
Recovery Tiers
Picking the Cost-Effective Solution to Recovery…
DR SLAs: A More Complex Approach
[Diagram: risk management capability by recovery class]
• Recovery class: 1, 2, 3, 4, 5, 6
• Maximum downtime: Near 0, 4 hrs, 24 hrs, 72 hrs, 720 hrs, Best Efforts
• Data loss options: 0 mins, 15 mins, 24 hrs or 48 hrs, depending on class
• System recovery speed: from Instant to Not defined
• Risk management capability: from Complete to Negligible
RTO/RPO: Consider “Data” Metric for IT
• Understand the difference between external, service-level RTO/RPOs and what your IT teams are mapping to
• Just restoring the data and having it available does not mean the application is up and running
• DBAs, Application owners and others need to “do their thing” to ensure the application is up, running correctly with the right data and available to the end users
  • That takes time
• RTO-Data and RPO-Data help SLAs meet reality
Example: Tape-based DR Recovery
[Timeline diagram]
• Milestones: Disaster Declared, Tapes Retrieved, Equipment Provisioning, Data Recovery Begins, Application Team Recovers Business Applications
• Durations marked on the timeline: 4 hrs (typical SLA), 8-24 hrs, 24-48 hrs
• RTO (data) = 3 days: data is recovered on working systems
• RTO = 6 days
Tape Retrieval:
- Identify Tapes
- Request Retrieval
- Transport Tapes
- Inventory Tapes
Equipment Provisioning:
- Identify Required Configs
- Procure hardware
- Provision
Data Recovery:
- Recover Tape Backup System
- Load Tapes
- Perform Restore
Recovery of Business Capability:
- Recover Databases
- Ensure Service Consistency
- Acceptance testing
Key Concept: SLAs versus Reality
So when do the apps start getting restored?
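The gap between RTO (data) and the full RTO in the tape-based example can be sketched by adding up the recovery phases. In the Python below the phase names follow the slide, but the individual durations are illustrative assumptions picked so that the totals match the slide (3 days and 6 days), the phases are assumed to run one after another, and the 72-hour promised RTO is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    hours: int
    needed_for_data: bool  # False for work done after the data is back


# Assumed durations, chosen only to reproduce the slide's totals.
PHASES = [
    Phase("Tape retrieval", 24, True),
    Phase("Equipment provisioning", 24, True),
    Phase("Data recovery", 24, True),
    Phase("Recovery of business capability", 72, False),
]


def rto_data_hours(phases):
    """Hours until data is recovered on working systems."""
    return sum(p.hours for p in phases if p.needed_for_data)


def rto_hours(phases):
    """Hours until the business application is usable again."""
    return sum(p.hours for p in phases)


promised_rto_hours = 72  # hypothetical SLA given to the business

print(rto_data_hours(PHASES) / 24)              # 3.0 days
print(rto_hours(PHASES) / 24)                   # 6.0 days
print(rto_hours(PHASES) - promised_rto_hours)   # 72 hours over the SLA
```

In this sketch an IT team that reports "data restored in 3 days" has met the promised figure, while the business waits another 3 days for a working application.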
Example OR/DR Storage SLA Matrix
Getting Buy-In from Management
• Where’s your BIA?
  • Looks at the Business, not IT
  • Directly correlates applications to RTOs and RPOs (recovery classes)
• Compare the investment in DR today to the inability to recover
  • Lost Customers
  • Corporate Reputation
  • Hit on your Brand
DR Data Protection Options
How to Meet Those SLAs from a Technology Perspective
Operational Recovery: The Old World
[Diagram labels: Disk, Tape, Backup; Standard Disk; Intelligent Disk; NAS; SAN; Traditional filer; Grid-based NAS; VTL (Standalone, Integrated); Internal media server; VTC]
Operational Recovery: The New World
Some new technologies
• CDP (Continuous data protection)
  • Back up files or blocks every single time they change
  • Also Near-CDP with Snapshots and replication
  • Emerging in Replication & DR
• Data de-duplication backup
  • Eliminate redundant blocks wherever we can
Snapshots & Mirrors
• Snapshots
  • Usually rely on primary storage/software
  • Known as copy-on-write; instantaneous
  • Great for protecting against logical corruption
  • Cannot protect against hardware failure
• Mirrors
  • Usually rely on primary storage/software
  • Full copy of data; must disassociate for protection
  • Takes more space than Snapshots
  • Cannot protect against hardware failure
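A toy model can make the copy-on-write behaviour concrete. The Python sketch below is an in-memory illustration only, with a hypothetical `Volume` class; it is not how any particular array implements snapshots.

```python
class Volume:
    """Toy block volume with copy-on-write snapshots (illustration only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        # Nothing is copied at creation time, which is why it is instantaneous.
        snap = {}  # block index -> content at snapshot time, filled lazily
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Preserve the old block in every snapshot that has not saved it yet.
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, snap, index):
        # Unchanged blocks are still read from the primary volume.
        return snap.get(index, self.blocks[index])


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "corrupted")

print(vol.blocks[1])                # corrupted
print(vol.read_snapshot(snap, 1))   # b
print(len(snap))                    # 1: only the overwritten block was copied
```

The last read path shows both properties from the slide: the snapshot recovers the pre-corruption block, but unchanged blocks live only on the primary, so losing that hardware loses the snapshot too.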
Data Replication: The Core Foundation of DR
• A majority of the clients I’ve worked with use some type of data replication for DR due to more aggressive RTOs/RPOs for mission-critical applications
• Let’s explore replication in more detail
Replication: Where Do You Start?
[Diagram: the replication options to choose from: Array-based, Host-based, Network-based, Fabric-based, In-band, Out-of-band, Synchronous, Asynchronous, Mirrors, Point-in-time copies]
Sync. versus Async. Replication
• Synchronous
  • The number of writes is doubled (one to each site) before the acknowledgement is sent back to the primary server
  • This double write, and the wait for dual confirmation, adds latency to application response time
  • The impact on application response time depends on distance
  • Primarily used either within an array, within a single site or between two sites in close proximity
• Asynchronous
  • The write is acknowledged to the primary host as soon as it is received by the source storage
  • After the acknowledgement, the write I/O is replicated to the target site with minimal performance impact on the host at the source site
  • Compared to synchronous mode, asynchronous mode does not ensure that the data at the source and target sites are identical
  • Primarily used when there are long distances between sites
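To see why distance matters for synchronous mode, here is a rough Python model of the write latency the host observes. It is a simplification under stated assumptions: light in fiber covers roughly 200 km per millisecond, the local and remote writes are modelled as sequential, the 0.5 ms write times are made up, and switch and protocol overhead is ignored. The function names are hypothetical.

```python
# Light in optical fiber covers roughly 200 km per millisecond (approximation).
FIBER_KM_PER_MS = 200.0


def sync_write_latency_ms(local_write_ms, remote_write_ms, distance_km):
    """Host waits for the local write, the remote write and one network round trip."""
    round_trip_ms = 2 * distance_km / FIBER_KM_PER_MS
    return local_write_ms + round_trip_ms + remote_write_ms


def async_write_latency_ms(local_write_ms):
    """Host is acknowledged as soon as the source storage has the write."""
    return local_write_ms


# Assumed 0.5 ms per write at each site, for illustration only.
for km in (10, 100, 1000):
    sync_ms = sync_write_latency_ms(0.5, 0.5, km)
    async_ms = async_write_latency_ms(0.5)
    print(f"{km:>5} km  sync {sync_ms:.1f} ms  async {async_ms:.1f} ms")
```

Under these assumptions the synchronous penalty grows linearly with distance while the asynchronous figure stays flat, which is the trade-off behind the "close proximity" and "long distances" bullets.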
“Rolling” Disasters
• An issue in real-time replication schemes
  • What happens if the disaster event corrupts my primary data?
  • Replication would then carry that corruption to my remote, replicated data
• Requires a protected, consistent and segregated copy of the production data at the DR site
• Number of copies is specific to the RPO
• Hopefully, these copies are not needed
• If no corruption, use the replicated copy!
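A small Python sketch of the idea: keep point-in-time copies at the DR site, spaced by the RPO, and fall back to the newest one taken before the corruption began. The helper names, the timestamps and the notion of a "retention window" (how far back you want to be able to roll) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def copies_needed(retention, rpo):
    """Copies required to cover the retention window with one copy per RPO interval."""
    return -(-retention // rpo)  # ceiling division on timedeltas


def latest_clean_copy(copy_times, corruption_time):
    """Newest copy taken before the corruption started, or None if there is none."""
    clean = [t for t in copy_times if t < corruption_time]
    return max(clean, default=None)


# Hypothetical schedule: a protected copy every 6 hours.
copies = [datetime(2024, 3, 1, hour) for hour in (0, 6, 12, 18)]
corrupted_at = datetime(2024, 3, 1, 14, 30)

print(latest_clean_copy(copies, corrupted_at))                  # the 12:00 copy
print(copies_needed(timedelta(hours=24), timedelta(hours=6)))   # 4
```

If `latest_clean_copy` is never called in anger, the replicated copy was good and these extra copies were simply insurance.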
Ensuring Application Recovery: Consistency Groups
• Where does it reside?
  • Host
    • Allows for heterogeneous storage
    • Database-only solutions as well
  • Fabric/Appliance
    • Also allows for heterogeneous storage
  • Array
    • Storage array and software provided by single vendor
    • e.g., EMC SRDF, HDS TrueCopy, IBM PPRC-XD
    • There is an array-based data mover that supports heterogeneous storage
Note:
• A NAS appliance can be considered “host + array”
• A NAS “head” can be considered “host”
Replication Options
Host-based Items Of Note
• Consumes host resources
• Can affect production application performance
• Issues with consistency groups
• OS dependent
  • Windows, Solaris, HP-UX, Linux, AIX
• May require additional host software
• More complexity in setting up mirrors and snapshots in the remote site for rolling disaster protection
• As your replication suite of applications grows, so does the management complexity
Fabric/Appliance-based
• Has also been coined “virtualization”
  • But not server virtualization -- we’ll cover that later
• Intercepts or directs I/O and makes the back-end transparent
• Can be used for replication but also for other activities, such as data pooling, consolidation and migration
• Three basic approaches
  • In-band (fabric)
  • In-band (array)
  • Out-of-band
• Each vendor has its own unique virtualization and replication approach -- therefore the examples are generic
• Lots of new products announced recently in this space
• Some overlapping terminology as well
Three Approaches
[Diagram: host zone and storage zone layouts for the three approaches: In-band fabric, In-band array, Out-of-band fabric]
Fabric/Appliance-based Items of Note
• Introduces additional resources between the servers and the storage in the production environment
• May require specialized drivers on the production hosts
• Issues with consistency groups similar to host-based
• New product offerings -- smaller install base today
• Scaling could be an issue (I/O spread over busses)
• Requires high-availability architectures
• Approaches vary greatly
• Terminology and pros/cons confusing
• I sometimes can argue for or against all three approaches!
Array-based Replication
[Diagram: frame-based, block-level replication for DR. Labels: Workstations, LAN, SAN, WAN, CE; Production, Production Replication, Development, Test; volumes A, B, C, D with copies A¹, B¹, D¹, A², B²]
Array-based Items Of Note
• Requires same/similar storage arrays at each site
  • Can you replicate your old array to the newest model?
• Requires specialized software from the storage array vendor
• May require special configurations on the array
• May limit array options (e.g., RAID, disk sizes)
• Usually requires additional cache
• Snapshots vs. mirrors for protection from rolling disasters can affect total cost
Other Replication Notes
• Not everything in your most mission critical application “package” needs to be replicated
  • Look into update opportunities
• “Reverse Replication” is not just turning a switch
  • In fact, it often requires a total re-copy and/or re-synchronization effort
  • And can take days, not hours
The “Virtual Server” DR Plan
• Tiered model offers different recovery times for different network services
• ‘Business Critical’ applications recovered first
• Lower tiered applications use different recovery methods
• Allows for ‘expensive’ DR options to be used on only critical systems
• Less critical systems on virtual servers can be recovered faster than on physical machines, with less ongoing cost
• Cost of recovery for an application can be weighed against the needs of the business
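A tiered recovery order is easy to derive once every application carries a tier and an RTO. The Python sketch below uses a hypothetical inventory and a hypothetical `recovery_order` helper; the tie-break on RTO within a tier is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    name: str
    tier: int       # 1 = business critical
    rto_hours: int


def recovery_order(apps):
    """Most critical tier first; within a tier, tightest RTO first."""
    return sorted(apps, key=lambda a: (a.tier, a.rto_hours))


# Hypothetical inventory for illustration.
apps = [
    Application("intranet wiki", 3, 72),
    Application("order entry", 1, 4),
    Application("email", 2, 24),
    Application("payments", 1, 2),
]
print([a.name for a in recovery_order(apps)])
# ['payments', 'order entry', 'email', 'intranet wiki']
```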
BC/DR Testing
• Efficient and effective DR testing is probably one of the most overlooked (or cut) line items in today’s IT budgets
• You need both process and technology to be successful
Validate Your BC/DR Plan
• The non-technical can be a show-stopper
  • Who can actually declare a disaster?
    • Especially when primary site is still up
  • Who is the primary contact?
    • Person or role?
  • New methods of communications
    • Corporate email
    • VoIP
    • Cellular
  • Travel issues
  • Staff availability
  • How/what to test after initial recovery
Testing the Plan!
• Do I have the data?
  • Application consistency groups
    • Point in time, complete and usable
    • One of the primary issues uncovered
    • What applications are interdependent on other applications?
    • Matched to the RPO
• The network is key
  • TCP/IP addresses, DNS, DHCP, routing end users, etc.
• The database is key, too
  • DBAs have granular recovery options as well, including database rollbacks via recovery and archive logs
• Don’t forget tracking that all-important RTO
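The interdependency question can be answered mechanically once the dependencies are written down. As a sketch, the Python below feeds a hypothetical dependency map to the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to get an order in which everything an application needs is recovered before the application itself. The application names are made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each application lists what must be up first.
depends_on = {
    "order entry": {"database", "dns"},
    "database": {"storage"},
    "dns": set(),
    "storage": set(),
    "reporting": {"database"},
}

order = list(TopologicalSorter(depends_on).static_order())
print(order)  # dependencies always appear before the applications that need them
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is itself a useful finding to surface in a test rather than during a real recovery.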
Testing the Plan! (continued)
• What about rolling disasters?
  • When replicating, what if the recovery data is corrupted?
• Consider using “labs”
  • Either yours or outsourced
  • But don’t eliminate testing at your actual recovery site as well
• Challenge your recovery methodology
  • Remove a key player or key team to validate documentation
• Does your production change control ensure DR is also considered and included?
Other Testing Considerations
• When to test
  • Who’s minding the shop?
• How to test
  • Protecting the real, production data
• Justifying the cost
  • Directly relates to RTOs and RPOs
  • Compare the cost to test to your total investment in DR today plus the cost of the inability to recover due to lack of testing
  • There are some viable options to reduce the cost of testing
  • Consider regulatory and associated penalties
• Where’s your plan?
  • It better not be in your primary data center!
Summary: Tiered DR
• Business drivers are first and foremost
• Technology is second
• Testing your plan is arguably more important than just having a plan
  • False sense of security
• Benchmark SLAs against reality
• Provide Recovery Classes to your Application Owners
• Get outside help from an independent expert with real experience
• Stay flexible -- your plan and tactics will change over time
Thank you! Questions?
Bill [email protected]