Datrium CloudShift Mobility and DR Orchestration

Business Report

385 Moffett Park Dr. Sunnyvale, CA 94089 844-478-8349 www.Datrium.com

Introduction

Datrium CloudShift is a cloud-based mobility and DR orchestration service for on-premises and cloud environments. CloudShift provides end-to-end orchestration for workload protection, backup and replication to cloud or other on-premises sites, DR plan definition, workflow execution, testing, compliance checks and report generation.

CloudShift DR plans operate on three types of sites: Protected Site, Backup Site and Failover Site. Designating the Backup and Failover sites separately enables CloudShift to pass the economic benefits of cloud elasticity on to the user via just-in-time creation of a Software-Defined Data Center (SDDC) in VMware Cloud on AWS. With CloudShift, VMware Cloud on AWS becomes a foundational piece for a complete cloud DR solution with truly transformational economic benefits.

Within CloudShift, the Protected Site is a DVX executing workloads covered by a DR plan. The Backup Site is a physical DVX or a Cloud DVX instance receiving backups from the Protected Site. The Failover Site is a physical site or a cloud-based site which is designated to take over workload execution following a disaster.

Use Case                Protected Site   Backup Site   Failover Site
Prem → Prem → Prem      On-premises      On-premises   On-premises
Prem → Cloud → Prem     On-premises      Cloud DVX     On-premises
Prem → Cloud → Cloud    On-premises      Cloud DVX     VMware Cloud on AWS
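The three use cases can be read as a small data model. The following is a minimal Python sketch for illustration only; it is not a CloudShift API, and the names SiteKind, UseCase and needs_cloud_failover are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class SiteKind(Enum):
    """Kinds of sites named in the use case table (labels taken from the text)."""
    ON_PREMISES = "On-premises"
    CLOUD_DVX = "Cloud DVX"
    VMC_ON_AWS = "VMware Cloud on AWS"


@dataclass(frozen=True)
class UseCase:
    """One row of the table: which kind of site plays each of the three roles."""
    name: str
    protected: SiteKind
    backup: SiteKind
    failover: SiteKind

    def needs_cloud_failover(self) -> bool:
        # Hypothetical helper: only this case fails over into an SDDC.
        return self.failover is SiteKind.VMC_ON_AWS


USE_CASES = [
    UseCase("Prem → Prem → Prem", SiteKind.ON_PREMISES,
            SiteKind.ON_PREMISES, SiteKind.ON_PREMISES),
    UseCase("Prem → Cloud → Prem", SiteKind.ON_PREMISES,
            SiteKind.CLOUD_DVX, SiteKind.ON_PREMISES),
    UseCase("Prem → Cloud → Cloud", SiteKind.ON_PREMISES,
            SiteKind.CLOUD_DVX, SiteKind.VMC_ON_AWS),
]

cloud_failover_cases = [u.name for u in USE_CASES if u.needs_cloud_failover()]
```

Keeping the backup and failover roles as separate fields mirrors the separate site designations described above.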

Legacy storage protection architectures rely on tiers of specialized primary and secondary storage appliances and the accompanying backup software. In many scenarios, DR needs are addressed by dedicated DR orchestration software separate from the backup software. These architectures have evolved during the client-server era and present RPO/RTO, resource utilization, and risk mitigation challenges for modern hybrid cloud environments. To provide a complete data protection solution, it is necessary to tie together many different products from multiple vendors with inherent operational complexity.

2.1 RPO/RTO and Compute Resource Challenges

Traditional backup software runs once a day and delivers a 24-hour RPO and RTO. Because backups are resource-intensive and affect production, they are usually confined to a single daily “backup window” during off-hours. While this might be sufficient for some backup scenarios, DR generally has more demanding RPO and RTO requirements. Application owners demand better SLAs for DR than legacy backup software can meet, because of its performance bottlenecks and the impact of backups on production workloads.

Recovery from backups involves a full data copy from the backup array to the primary storage array, resulting in an inferior RTO that might even exceed the RPO. This data copy may go on for many days following a disaster. A 24-hour backup RPO and RTO do not meet modern DR requirements, forcing administrators to deploy dedicated DR solutions in parallel with backups. A common method for implementing DR with lower RPO and RTO is based on primary array LUN or Volume mirroring.

Array-based LUN or Volume mirroring is more efficient for protecting entire sites than backup software replication because it involves replicating data in its native storage format by the primary storage controllers without rehydrating and transforming the data multiple times by the backup software. Because of its lower resource usage and a smaller impact on production workloads, array LUN mirroring could run on a more aggressive schedule (e.g. every 30 minutes).
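To see why the schedule matters, the worst-case data loss window can be approximated as the snapshot interval plus the time needed to finish copying a snapshot to the remote site. The sketch below is a simplified model under that assumption; the function name and the copy durations are hypothetical and do not describe any particular product.

```python
def worst_case_rpo_minutes(interval_min: float, copy_min: float) -> float:
    """Upper bound on data loss for periodic snapshot replication.

    If a failure hits just before the newest snapshot finishes copying,
    the last usable copy is the previous snapshot, taken one interval
    earlier. The loss window is therefore interval + copy time.
    """
    return interval_min + copy_min


# Hypothetical inputs: a daily backup that takes 2 hours to copy,
# and a 30-minute mirror cycle that takes 5 minutes to copy.
daily_backup = worst_case_rpo_minutes(interval_min=24 * 60, copy_min=120)
lun_mirror = worst_case_rpo_minutes(interval_min=30, copy_min=5)

print(f"daily backup: up to {daily_backup:.0f} minutes of data at risk")
print(f"30-minute mirror: up to {lun_mirror:.0f} minutes of data at risk")
```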

While commonly used for DR, array LUN replication does not eliminate backups because LUN or Volume replicas don’t provide full backup capabilities to satisfy regulatory and operational requirements for data protection: no extended backup storage (arrays can accommodate only a modest number of LUN snapshots and replicas), no backup catalog, no visibility inside a LUN, no recovery of individual VMs or files. The lack of visibility inside the LUN is one of the main reasons for the existence of parallel backup and DR stacks. The types of data protection provided by the legacy backup and DR solutions are often complementary and make up for mutual deficiencies. For example, moving a mission-critical application between array LUNs might adversely affect LUN-based DR (application lost upon the LUN recovery from a snapshot), but it is generally handled correctly by backup software policies that are attached to user visible entities as opposed to storage array LUNs.

Legacy Data Protection Architectures

2.2 Complexity and Inefficiency of Juggling Multiple Products

Over time, DR orchestration software has evolved to coordinate DR recovery based on native array mirroring. DR orchestration products such as VMware Site Recovery Manager are complex distributed systems that integrate with array mirroring via installable 3rd party array-specific agents (Storage Replication Adapters, or SRAs).

The following diagram illustrates the number of data transfers for a typical data protection architecture integrating best-of-breed backup and DR products. The data protection part alone involves 5 different data transfers with a majority requiring IO intensive data transformations. Restoring from backups or a DR failover involves a number of additional data transformations and data transfers not shown in this diagram.

Backup software keeps its data on a backup appliance (a specialized array - a Purpose Built Backup Appliance per Gartner nomenclature). As a part of the backup process, the backup software copies recent changes from the primary array to the backup array with the help of hypervisor changed block tracking APIs. Primary storage and backup arrays have different filesystems. In addition, backup software normally utilizes its own client file system layered on top of the backup array and managing snapshots of protected entities.

This is an example of a common backup and DR stack deployed on both primary and backup sites. These products come from different vendors and require 4 independent management consoles.

2.3 Stretched Clusters and CDP Alternatives

Stretched Clusters and Continuous Data Protection (CDP) are the basis for an alternative mechanism for Disaster Recovery. Stretched Clusters aim to provide zero RPO by synchronously replicating every write from the primary to the secondary site. Stretched Clusters impose strict requirements on inter-site network latency, typically in the 1-5 ms range. Because replication is synchronous, network or secondary site hiccups have an impact on the primary site.

Because each write is replicated over the network in its entirety, Stretched Clusters also have high network bandwidth requirements. Similarly to LUN replication, Stretched Clusters require DR orchestration software to coordinate recovery on the secondary site. For example, VMware SRM was extended to support Stretched Clusters from several vendors in a way reminiscent of SRM array support. Similar to array LUN mirroring, Stretched Clusters do not eliminate the need for backups for operational recovery. The resulting architecture is very similar to that described in the previous section.

CDP addresses the rigidity of Stretched Cluster network requirements by relaxing replication from synchronous to semi-synchronous. Each write is still replicated to the DR site, but the DR site is allowed to transiently lag behind the primary; network bandwidth requirements remain high. CDP solutions have gained some popularity for providing high levels of data protection for a few carefully chosen workloads. They are seldom used as a complete DR solution for an entire enterprise site. CDP products are available as 3rd party software and do not eliminate the need for backup storage appliances.

2.4 Data Integrity Risks

Multiple transfers of data with extensive data transformations between different complex products from multiple vendors carry inherent data integrity risks. How can the administrator be sure that the backup created by reading the blocks changed between two vSphere VM snapshots stored on a Dell EMC storage array and eventually copied into a Commvault backup stored on a Data Domain appliance and subsequently replicated over a WAN to a remote DR site actually represents the original point-in-time application state? Similarly, DR orchestration software that relies on 3rd party storage and replication has little chance of detecting in-transit data corruption due to misconfiguration or a software or hardware fault. There are no global end-to-end data integrity checks or APIs that can possibly apply across all the multiple hardware and software products from different vendors.

Example backup and DR stack:

Primary Storage Array        Dell EMC Unity 450F
Backup Array                 Data Domain DD6300
Backup Software              Commvault
DR Orchestration Software    VMware SRM + Array SRAs + database

In the end, administrators are left with a complex web of solutions integrating components from 3 or more vendors leading to increased complexity, ample opportunity for misconfiguration, and staggering levels of resource inefficiencies due to multiple rounds of data transformation with no end-to-end integrity checks.

A study by Dell EMC revealed that businesses using three or more vendors to supply data protection solutions lost three times as much data as those that unified their data protection strategy around a single vendor.

CloudShift integrates all aspects of backup and DR into a single centrally managed system. The resulting solution has all the benefits of best-of-breed backup and DR products without the associated complexities and inefficiencies of navigating a web of management consoles and excessive resource usage due to multiple data copies with expensive data transformations.

CloudShift Integrates All Backup & DR Components Into One System

3.1 Low RPO/RTO and Minimal Resource Requirements

CloudShift leverages backups via native storage-level DVX snapshots with RPO measured in minutes, not hours or days. DVX unifies primary and secondary storage environments and natively supports forever incremental replication with no data transformations. This enables very aggressive backup and replication schedules with low resource usage and minimal impact on the executing workloads.

Use cases: Prem → Prem → Prem; Prem → Cloud → Prem; Prem → Cloud → Cloud

1 https://www.emc.com/about/news/press/2014/20141202-01.htm

With on-premises DVX, backups are replicated across the WAN once in a native forever incremental format with no rehydration or any other expensive data transformations. A DR failover requires no additional data copy - VMs are restarted directly from backups for any available restore point that resides on the DR site. Due to the efficiencies native to DVX, CPU, storage, and network resource usage for data protection is improved by over 5x.

Since no data copy is required for recovery and protected workloads are restarted directly from backups on the DR site, the resulting RTO is near zero, similar to RTO of array LUN mirroring used with 3rd party DR orchestration products. However, unlike array LUN mirroring, DVX also provides a full-featured backup solution: backups are accessed via a searchable catalogue and are kept on a cost-effective medium with all modern backup data reduction technologies applied at all times. In addition, primary copies and local backups share the same storage pool drastically cutting down physical storage requirements for data protection.

3.2 Simplicity of a Single Data Stack

CloudShift completely eliminates the need for parallel hardware and software backup and DR stacks by integrating all components and aspects of the backup and DR into a single system with unified management. From an architectural perspective, a protected DVX and one or more accompanying DVX systems deployed at another location or in the cloud are managed by a unified cloud orchestration service.

DVX integrates primary and secondary storage making it possible to use a single management console to establish backup and replication policies and to configure, test, and execute DR plans. Both backup policies and DR plans operate on exactly the same abstractions: backups for VMs and groups of VMs. Because snapshots are at the storage level, CloudShift delivers consistent point-in-time backups across many VMs executing on different servers. Such functionality is not available from 3rd party backup software that relies on hypervisor APIs to take snapshots and copy snapshot state into backups.

The system's built-in health checks can pinpoint problems anywhere in the backup and DR stack. For example, replication failures due to network connectivity losses automatically flag all affected DR plans. CloudShift also automatically performs DR plan compliance checks to ensure that changes in the execution environment do not invalidate DR plans.
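The idea of flagging every DR plan touched by a replication failure can be sketched in a few lines. This is an illustrative model only, not CloudShift's implementation; DRPlan, flag_affected_plans and the plan and group names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DRPlan:
    """Hypothetical DR plan: a name plus the protection groups it depends on."""
    name: str
    protection_groups: set[str]
    issues: list[str] = field(default_factory=list)


def flag_affected_plans(plans: list[DRPlan], failed_groups: set[str]) -> list[str]:
    """Record an issue on each plan that depends on a group whose replication failed.

    Returns the names of the flagged plans, in input order.
    """
    affected = []
    for plan in plans:
        broken = sorted(plan.protection_groups & failed_groups)
        if broken:
            plan.issues.append("replication failed for: " + ", ".join(broken))
            affected.append(plan.name)
    return affected


plans = [
    DRPlan("erp-failover", {"pg-erp-db", "pg-erp-app"}),
    DRPlan("web-failover", {"pg-web"}),
]
flagged = flag_affected_plans(plans, failed_groups={"pg-erp-db"})
```

Because backup policies and DR plans refer to the same groups of VMs, a single set intersection is enough in this model to connect a storage-level failure to the plans it puts at risk.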

DR Orchestration As-a-Service

Use cases: Prem → Prem → Prem; Prem → Cloud → Prem; Prem → Cloud → Cloud

3.3 End-to-End Data Integrity Checks

A single data stack backup and DR solution eliminates the risks associated with multiple data transformations and misconfigurations. Because Datrium controls protected, backup and recovery site endpoints and orchestrates all movements of data, it also automatically performs end-to-end integrity checks to verify backup fidelity regardless of data location or past replication history. Datrium employs an efficient scheme to calculate cryptographic hashes of backups and primary storage to continuously validate data integrity across the entire distributed environment, both on-premises and in the cloud.
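The source does not describe Datrium's hashing scheme, so the following is only a generic sketch of how a cryptographic fingerprint can confirm that a replica matches its source. It assumes SHA-256 over fixed-size chunks; fingerprint, verify_replica and the chunk size are hypothetical choices.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size, chosen only for the example


def fingerprint(data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash each chunk, then hash the sequence of chunk digests.

    Any change to any byte changes the resulting fingerprint.
    """
    top = hashlib.sha256()
    for offset in range(0, len(data), chunk_size):
        top.update(hashlib.sha256(data[offset:offset + chunk_size]).digest())
    return top.hexdigest()


def verify_replica(source: bytes, replica: bytes) -> bool:
    """Return True when the replica's fingerprint matches the source's."""
    return fingerprint(source) == fingerprint(replica)


original = bytes(range(256)) * 40      # 10,240 bytes of sample data
good_copy = bytes(original)
bad_copy = bytearray(original)
bad_copy[5000] ^= 0x01                 # flip one bit to simulate corruption

print(verify_replica(original, good_copy))
print(verify_replica(original, bytes(bad_copy)))
```

In practice only the fingerprints would need to travel between sites, so a check of this kind does not require copying the data again.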

DR orchestration software products are complex distributed systems that are generally composed of dedicated DR orchestration servers and their internal databases often augmented with array specific software agents shipped by storage array vendors. These servers and databases are provisioned per site and need to be licensed, secured, monitored, managed, and upgraded which requires additional maintenance and extra operational skills. The initial installation and configuration of DR products often require purchasing additional professional services making the overall solution costly. DR roll-out and upgrade processes are lengthened due to the intricacies of the interactions of multiple cross-vendor products and components.

Datrium CloudShift is delivered as-a-service: there is nothing to install and nothing to manage. The CloudShift orchestration engine runs as an AWS-based service and leverages the public cloud infrastructure to achieve high availability for its internal operation. DR plans and execution states are replicated across multiple availability zones with an automatic failover to a healthy availability zone without any data loss in case of a disaster affecting the public cloud. Monitoring and upgrades are automated and performed by Datrium as a part of the service offering.

The CloudShift service is activated online making it immediately operational and allowing users to focus on designing and testing their DR plans instead of managing the internal complexities of the DR orchestration software itself. CloudShift includes all necessary network connectivity and encryption software and establishes a secure bidirectional channel between protected sites and the orchestration engine. No external VPN is required.

Replacing an on-premises DR site with a Cloud-based DR site has significant CAPEX and OPEX implications. However, prior to CloudShift, the practicality of existing solutions was severely limited by the lack of hypervisor interoperability between private and public clouds and the associated costs of the public cloud infrastructure.

Elimination of the DR Site

Use cases: Prem → Cloud → Prem; Prem → Cloud → Cloud

5.1 The Same VMs On-Premises and in the Cloud

While the VMware ESX hypervisor dominates on-premises private cloud deployments, public clouds use several other incompatible hypervisors: AWS relies on Xen for all but the most recent EC2 instance generation, which is based on KVM, as is Google Cloud; Azure relies on Microsoft's proprietary hypervisor. Translating between VM formats is a brittle and time-consuming process that goes beyond VM disk format conversion. Complex vSphere enterprise environments rely on many other virtualization abstractions which have no immediate analogues in the public cloud: clusters, resource pools, datastores, virtual switches, port groups, etc.

VMware Cloud on AWS finally makes the transition between private and public clouds robust by presenting an execution environment in AWS that is similar to the on-premises execution environment. No VM conversion needs to take place, VMs retain their native vSphere format, and users get access to the familiar abstractions and management tools following a failover to the cloud - the same management tools that are used on-premises prior to the failover.

As a part of a DR plan creation, users map their on-premises virtual infrastructure abstractions (networks, resource pools, folders, datastores, IP addresses, etc.) to the corresponding entities in VMware Cloud following a process that is identical to that of Prem→Prem DR. The native on-premises VM geometry is fully preserved as are all virtual hardware devices. The existing in-guest OS drivers continue to function the same way following a migration to the cloud eliminating all risks of VM conversion between different hypervisor types and the associated virtual hardware and guest OS driver changes.
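A mapping of this kind lends itself to a simple completeness check: every on-premises resource a protected VM uses should have a counterpart at the failover site. The sketch below is illustrative only; unmapped_resources and all resource names are hypothetical and do not reflect CloudShift's data model.

```python
def unmapped_resources(vm_requirements: dict[str, set[str]],
                       mapping: dict[str, str]) -> dict[str, list[str]]:
    """Return, per VM, the on-premises resources with no failover-site mapping."""
    gaps = {}
    for vm, resources in vm_requirements.items():
        missing = sorted(r for r in resources if r not in mapping)
        if missing:
            gaps[vm] = missing
    return gaps


# Hypothetical on-premises -> failover-site mapping defined in a DR plan.
mapping = {
    "network/prod-vlan10": "network/sddc-prod",
    "datastore/ds-01": "datastore/sddc-ds",
    "folder/finance": "folder/finance",
}

# Hypothetical resources used by each protected VM.
vm_requirements = {
    "erp-db-01": {"network/prod-vlan10", "datastore/ds-01", "folder/finance"},
    "web-01": {"network/dmz-vlan20", "datastore/ds-01"},
}

gaps = unmapped_resources(vm_requirements, mapping)
print(gaps)
```

Re-running such a check whenever the environment changes is one way to catch a plan that has silently become invalid, for example after a VM is moved to a network that was never mapped.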

5.2 Ahead-of-Time Deployment of a Cloud DR Site

This diagram shows an example of a deployed Cloud DR site that maps to an SDDC in VMware Cloud on AWS. In cases where a DR site has a secondary function of executing non-DR workloads during normal operation, an SDDC can be provisioned prior to failover. This configuration also potentially enables steady-state, as-you-go VM hydration from backups in S3 to the SDDC, lowering failover RTO.

If the sole purpose of the Cloud DR site is to take over in the event of a disaster and it otherwise remains unutilized, further significant cost savings are possible through just-in-time deployment.

5.3 Just-in-Time Deployment of a Cloud DR Site

While replacing the on-premises DR site with a virtual site hosted in the public cloud is attractive for many reasons, by itself this does not necessarily reduce the total costs of the overall DR solution because of the recurring charges for maintaining a cloud DR site. The DR costs are merely shifted from on-premises capital and operational expenses to the recurring costs of maintaining an always-on cloud DR site. A careful TCO analysis is needed to ensure that the overall cloud DR solution is price competitive with the on-premises DR solution.

DR-related activities generally don’t contribute to a company's top-line performance, but they are necessary to keep the business running. Optimizing the costs of DR is, therefore, an important TCO consideration. Just-in-time deployment of a cloud DR site presents an attractive alternative to continuously maintaining a warm stand-by cloud DR site. With just-in-time deployment, the recurring costs of a cloud DR site are eliminated in their entirety until a failover occurs and cloud resources are provisioned.

Dedicated on-premises DR sites are normally minimally utilized resulting in resource wastage: real estate, power, cooling, capital expenditures for compute resources and costs of skilled labor to keep DR sites operational.

The on-demand nature of public clouds enables CloudShift to drastically reduce the operating costs of disaster recovery by deploying the bulk of the DR infrastructure programmatically following a DR event. During steady state operation, CloudShift maintains a minimal low-cost AWS cloud footprint to accommodate cloud backups with no ongoing charges from VMware Cloud on AWS. The backups are sent to Cloud DVX and, after some processing, land in a cost-effective compressed and deduplicated form in an S3 bucket. In just-in-time mode of deployment, a cloud DR site is created only following a disaster. VMware Cloud Software-Defined Data Center (SDDC), a Cloud DR site with a significantly larger footprint and associated costs, is deployed only as a part of executing a DR plan following a DR event.

To make this possible, CloudShift leverages the space and cost efficiencies of Datrium Cloud DVX. Protected on-premises DVX replicates VMs or Protection Groups in their forever incremental format to Cloud DVX which in turn stores them in a compressed and deduplicated native format within the low-cost S3. During normal operation, the costs of data protection are limited to the costs of the Cloud DVX backup service and the cost of the S3 medium.
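The combination of forever incremental replication and deduplicated storage can be illustrated with a content-addressed chunk store: the sender transfers only chunks the target does not already hold. This is a generic sketch of that technique, not a description of Cloud DVX internals; ChunkStore, replicate and the sample data are hypothetical.

```python
import hashlib


class ChunkStore:
    """Hypothetical deduplicated target: chunks are keyed by their SHA-256 digest."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def missing(self, digests) -> list[str]:
        """Return the digests the store does not hold yet."""
        return [d for d in digests if d not in self.chunks]

    def put(self, digest: str, data: bytes) -> None:
        self.chunks[digest] = data


def replicate(snapshot: list[bytes], target: ChunkStore) -> int:
    """Send only the chunks the target lacks; return how many were transferred."""
    by_digest = {hashlib.sha256(chunk).hexdigest(): chunk for chunk in snapshot}
    to_send = target.missing(by_digest)
    for digest in to_send:
        target.put(digest, by_digest[digest])
    return len(to_send)


store = ChunkStore()
first = replicate([b"block-a", b"block-b", b"block-a"], store)   # duplicates collapse
second = replicate([b"block-a", b"block-b", b"block-c"], store)  # only the new chunk moves

print(first, second)
```

After the first transfer, each later snapshot costs only as much network and storage as the data that actually changed.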

Following a DR event, CloudShift deploys a new SDDC in VMware Cloud on AWS and orchestrates the hydration of backups from S3 to SDDC as a part of a DR plan execution. This just-in-time hydration process utilizes a fast high-bandwidth network link from VMware Cloud to AWS S3. The recurring VMware Cloud charges kick in only following the SDDC deployment. The just-in-time deployment of SDDC reduces DR TCO by at least an order of magnitude.

CloudShift supports efficient failback following an on-premises site recovery. If upon recovery, the on-premises site retains some pre-disaster data, only data changes incurred while executing in the Cloud DR site are transferred back to on-premises.

Ahead-of-time vs. just-in-time provisioning of SDDC is a trade-off between costs and RTO. With ahead-of-time SDDC provisioning, SDDC creation and hydration time could be mostly hidden since hydration could be an ongoing process during normal operation. Just-in-time SDDC provisioning drastically lowers the costs, but increases the RTO by deploying SDDC and recovering VMs from S3 only following a failover.
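The cost side of this trade-off is simple arithmetic: backup storage is paid for all year in both modes, while SDDC charges accrue only for the hours an SDDC exists. The sketch below uses placeholder prices that are entirely hypothetical and are not AWS, VMware or Datrium rates; annual_dr_cost is likewise an invented helper.

```python
HOURS_PER_YEAR = 24 * 365


def annual_dr_cost(backup_per_month: float,
                   sddc_per_hour: float,
                   sddc_hours_per_year: float) -> float:
    """Yearly cost = always-on backup footprint + SDDC time actually used."""
    return 12 * backup_per_month + sddc_per_hour * sddc_hours_per_year


# Placeholder inputs, for illustration only.
BACKUP_PER_MONTH = 1_000.0   # backup service plus object storage
SDDC_PER_HOUR = 20.0         # running SDDC

# Ahead-of-time: the SDDC runs all year.
ahead_of_time = annual_dr_cost(BACKUP_PER_MONTH, SDDC_PER_HOUR, HOURS_PER_YEAR)

# Just-in-time: the SDDC exists only during one 72-hour DR event or test.
just_in_time = annual_dr_cost(BACKUP_PER_MONTH, SDDC_PER_HOUR, 72)

print(f"ahead-of-time: {ahead_of_time:,.0f} per year")
print(f"just-in-time:  {just_in_time:,.0f} per year")
```

The result depends entirely on the inputs, so real prices and a realistic number of SDDC hours should be substituted before drawing conclusions. The model also ignores the RTO cost of creating and hydrating the SDDC after a failover.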

Summary

Datrium CloudShift is a cloud-based DR and workload mobility orchestration service that leverages the execution and operational efficiencies of a single integrated data stack to orchestrate all aspects of Disaster Recovery. CloudShift is dramatically simpler and significantly less resource intensive than legacy DR solutions resulting in lower RPO and RTO for cloud and on-premises environments.
