
7 Technology Circle, Suite 100, Columbia, SC 29203

Phone: 866.359.5411 | E-Mail: [email protected] | URL: www.unitrends.com

Deduplication, Incremental Forever, and the Olsen Twins

Oh, and look at the blog, too!

• Don’t Get Duped by Deduplication: Introducing Adaptive Deduplication
• 7 Shortcuts to Losing Your Data (and Probably Your Job)
• Six Fairy Tales of VMware and Hyper-V Backup


Introduction

What do deduplication, incremental forever, and the Olsen twins have to do with each other? It’s all about duplicate data. Mary-Kate and Ashley Olsen are fraternal twins. Identical twins share 100% of their DNA; fraternal twins share about 50% of their DNA - the same as any other pair of siblings.

If we created a DNA database for a set of identical twins, we would only need to store that DNA information once, since we know the DNA is identical. However, if we created a DNA database for the Olsen twins, we would either need to store two completely separate sets of data or we would need techniques for understanding and categorizing which data is unique and which data is identical. Those techniques for understanding and categorizing data are what deduplication is.
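
To make the idea concrete, here is a minimal, illustrative sketch (in Python, not from the original paper) of the core mechanism: each piece of data is keyed by a cryptographic hash, identical data is stored only once, and duplicates become mere references.

```python
# Minimal conceptual sketch: a store that keeps each unique piece of data
# exactly once, keyed by a cryptographic hash.
import hashlib

class DedupStore:
    def __init__(self):
        self.chunks = {}        # hash -> data, stored once
        self.references = []    # ordered list of hashes (the "recipe" for the data set)

    def add(self, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.chunks:      # unique data is stored...
            self.chunks[digest] = data
        self.references.append(digest)     # ...duplicates only add a reference

# "Identical twins": the same data added twice is stored only once.
store = DedupStore()
store.add(b"GATTACA" * 1000)
store.add(b"GATTACA" * 1000)
print(len(store.chunks), "unique chunk(s) for", len(store.references), "references")
```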

File-level deduplication, block-level deduplication, byte-level deduplication, and incremental forever are all techniques that eliminate duplicate data. Theoretically, when 100% of the data is duplicated, each technique achieves the same data reduction; practically, the time and computational resources they require differ significantly. It’s when only some data is duplicated, as with the DNA of fraternal twins, that the data reduction achieved varies - and the time and computational resources required vary as well.

In this paper, we’ll compare and contrast the advantages and disadvantages of each of these techniques and explain why incremental forever, when combined with byte-level deduplication, is the superior methodology for reducing redundant data in the most efficient manner possible. Further, we’ll discuss the advantages and disadvantages of physical and virtual backup appliances versus dedicated deduplication devices.

Deduplication

The purpose of deduplication is to reduce the amount of storage necessary for a given data set. Deduplication is a wide-ranging topic and is covered at length in our adaptive deduplication white paper. In this chapter, we’re going to explore the basic types of storage-oriented deduplication. However, we’re going to start by discussing a key concept in storage-oriented deduplication - content awareness.

Content Awareness

Content-aware deduplication simply means that the deduplication algorithms understand the content - sometimes called the “semantics” - of the data being deduplicated. Why is this important? The reason has to do with efficiency - not the efficiency of data reduction, but the resources (e.g., processor, memory, I/O) required for a given level of data reduction.


The figure below is going to be referenced repeatedly in this chapter; we use it to illustrate many of the key concepts that are discussed. Panels 1 and 2 simply depict two backups with a minor change (depicted via the yellow block) shown in panel 2.

Content awareness simply refers to the fact that the deduplication algorithm can see what’s inside the backup - as depicted in every panel below except for panel 3. In panel 3, the backup appears as the proverbial “black box” - the deduplication algorithm can’t see inside of it.

We’re going to come back to this figure repeatedly in the sections that follow to help explain the differences among various types of deduplication.

File-Level Deduplication

File-level deduplication is a content-aware deduplication technique and is depicted in panel 5. File-level deduplication operates by comparing files and storing only the first unique copy of each file. File-level deduplication was the first “content-aware” type of deduplication because, by definition, its algorithms must be aware of the data at least to the extent that they recognize what a “file” is.

[Figure: six panels referenced throughout this chapter. Panels 1 and 2: two backups, with a minor change (the yellow block) in panel 2. Panel 3: the backup as a non-content-aware “black box.” Panel 4: the backup underlying that black box, as seen by block-level deduplication. Panel 5: file-level deduplication. Panel 6: byte-level deduplication.]


File-level deduplication tends to be fairly effective. The reason is that data tends to get “colder” over time - there is an increasing tendency to create files that are not only never updated but never even accessed after they are created. File-level deduplication works best when files don’t change; put another way, the colder the data, the better file-level deduplication works.

The problem with file-level deduplication is that it doesn’t handle changes within files - and remember that “files” include not only files within file systems but also what is commonly called structured data (e.g., databases and images). A change within a file, database, image, or other entity causes the entire entity to be marked as unique - and thus no deduplication occurs.
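
A hypothetical sketch of the behavior described above: whole files are hashed and compared, so exact duplicates are stored once, but a single changed byte marks the entire file as unique. The file names and sizes are invented for illustration.

```python
# Sketch of file-level deduplication: files are compared as whole units,
# so the first unique file is stored and exact duplicates are skipped.
import hashlib

def file_level_dedup(files: dict[str, bytes]) -> dict[str, bytes]:
    stored = {}                      # hash of the whole file -> file contents
    for name, contents in files.items():
        digest = hashlib.sha256(contents).hexdigest()
        stored.setdefault(digest, contents)
    return stored

cold = {"report_v1.doc": b"A" * 4096, "copy_of_report.doc": b"A" * 4096}
warm = {"report_v1.doc": b"A" * 4096, "report_v2.doc": b"A" * 4095 + b"B"}

print(len(file_level_dedup(cold)))   # 1 stored copy: the exact duplicate is deduplicated
print(len(file_level_dedup(warm)))   # 2 stored copies: a one-byte change marks
                                     # the entire file as unique
```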

Block-Level Deduplication

Block-level deduplication is a non-content-aware deduplication technique depicted in panels 3 and 4. Panel 3 shows what the backup looks like from the standpoint of the deduplication algorithm; panel 4 shows the backup underlying the “black box” that keeps the deduplication algorithm from being content aware. Both panels illustrate the large number of passes and the processing required to find “blocks” of data that are duplicates.

Depending upon its implementation, block-level deduplication can work very well to reduce data. The problem with block-level deduplication isn’t the theoretical ability to perform data reduction - it’s the practical consequences of doing so. Imagine a block-level deduplication device with 20TB of physical storage. If your block size is 4K, you need a table with roughly 5 billion entries to keep track of each block. If a 256-bit hash key is used to represent each block (hash keys are typically used in deduplication algorithms to provide a unique representation of the data), that means 32 bytes per entry - on the order of 160GB of dedicated memory for the tracking table. Each entry represents computational resources that must be assigned to deduplicate, track, and rehydrate that data - and the better performing block-level deduplication systems typically require the tracking table to reside in physical memory.
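
The back-of-the-envelope arithmetic above can be checked directly; the following sketch simply recomputes the tracking-table figures from the assumptions stated in the text (20TB of storage, 4K blocks, 256-bit hash keys).

```python
# Back-of-the-envelope sketch of the tracking-table arithmetic above;
# the block size and hash width are the assumptions stated in the text.
TB = 10**12

physical_storage = 20 * TB      # 20TB of physical storage
block_size = 4 * 1024           # 4K blocks
hash_bytes = 256 // 8           # a 256-bit hash key is 32 bytes

entries = physical_storage // block_size
table_bytes = entries * hash_bytes

print(f"{entries / 1e9:.1f} billion tracking-table entries")   # ~4.9 billion
print(f"{table_bytes / 1e9:.0f} GB just for the hash keys")    # ~156 GB, i.e. roughly
                                                               # the 160GB figure above
```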

All of this is absolutely possible - there are vendors that do an excellent job of creating dedicated block-level deduplication devices and optimizing the rough calculations above to make the best use of available computational resources. The problem is that, even with these optimizations, block-level deduplication devices that perform acceptably tend to be expensive. And the performance has to be quite good, because if you’re using a dedicated block-level deduplication device you’re already incurring performance penalties when you transfer the backup data from your backup server and software to the deduplication device.


One last note: there are vendors who use FUD (Fear, Uncertainty, and Doubt) when talking about block-level deduplication and something known as “hash collisions” - the possibility that two blocks of data could resolve to the same hash key even though the data is different. When using a secure hash such as SHA-256, one report notes that the probability of a hash collision is approximately 2^-256, or about 10^-77. How does this compare to other sources of failure within a computer? The best ECC-based physical memory available today has an error rate that is roughly 50 orders of magnitude (not 50x, but 50 orders of magnitude) greater.
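
For readers who want to sanity-check that claim, here is a hedged sketch using the standard birthday approximation; the block count is an assumption carried over from the 20TB example above, not a figure from the original paper.

```python
# Hedged sketch (assumed figures): even applying the birthday bound across
# billions of stored blocks, a SHA-256 collision remains vanishingly unlikely
# compared with ordinary hardware failure rates.
from math import log10

hash_bits = 256
blocks = 5 * 10**9                       # e.g., 20TB of 4K blocks, as above

# Birthday approximation: P(any collision) ~= n*(n-1)/2 / 2^hash_bits
pairs = blocks * (blocks - 1) // 2
p_collision = pairs / 2**hash_bits

print(f"~10^{log10(p_collision):.0f}")   # roughly 10^-58
```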

Byte-Level Deduplication

Byte-level deduplication combines the resource efficiency of file-level deduplication with the data reduction effectiveness of block-level deduplication. Byte-level deduplication requires knowledge of the content of the backup itself and uses that knowledge to optimize data reduction with fewer resources. Byte-level deduplication is depicted in panel 6 of the figure above.

The primary benefit of byte-level deduplication comes from the deduplication algorithm being content aware. Content awareness means the algorithm can detect boundaries that exist in the data. For example, in panel 6, the backup header will almost always change, so a byte-level deduplication algorithm can be built to not even try to deduplicate the header. Beyond that, byte-level deduplication algorithms can evaluate objects more intelligently, resulting in dramatically fewer tracking-table entries per terabyte of physical storage. This reduces the physical resources - processor speed, core count, and physical memory - required to handle the same amount of physical storage.
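
The following is an illustrative sketch only (the 512-byte header and newline-delimited records are assumptions, not any vendor’s actual backup format) of how content awareness lets a deduplicator skip fields that always change and split data at boundaries it understands rather than at arbitrary fixed-size blocks.

```python
# Illustrative sketch: a content-aware deduplicator that knows the backup
# format skips the ever-changing header and splits the payload at record
# boundaries instead of blindly hashing fixed-size blocks.
import hashlib

HEADER_SIZE = 512   # assumed: a backup header that changes on every run

def content_aware_chunks(backup: bytes) -> list[str]:
    payload = backup[HEADER_SIZE:]          # don't even try to dedupe the header
    records = payload.split(b"\n")          # boundaries known from the content
    return [hashlib.sha256(r).hexdigest() for r in records if r]

backup_1 = b"H" * HEADER_SIZE + b"alpha\nbeta\ngamma\n"
backup_2 = b"X" * HEADER_SIZE + b"alpha\nbeta (edited)\ngamma\n"

unchanged = set(content_aware_chunks(backup_1)) & set(content_aware_chunks(backup_2))
print(len(unchanged), "of 3 records unchanged and therefore stored only once")  # 2 of 3
```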

There have been meetings in which we’ve described this to buyers, and they shake their heads and wonder why anyone would not use byte-level deduplication. The reason is less technological than a consequence of how deduplication evolved in the marketplace. The pioneers in data deduplication tended to use block-level deduplication because they saw their devices - properly - as storage devices, not backup appliances. However, because backups tend to contain redundant data, backup was the first and continues to be the foremost use case for deduplication. Thus the marketing departments of deduplication device vendors began advertising their products as “backup appliances.”

A second generation of vendors then came on the scene with specialized deduplication devices that performed byte-level deduplication by recognizing the backup content of the biggest backup software vendors. They were able to offer lower costs because the physical resources required for byte-level deduplication were lower. However, their efficacy varied far more across different backup software than that of the first generation of vendors. The focus of this second generation is currently on doing a better job of handling “generic” data - in other words, attempting to merge attributes of byte-level and block-level deduplication under catchy-sounding buzzwords.


Next-generation backup appliances offer integrated byte-level deduplication at a lower cost and without the performance penalties associated with moving data from the backup server to a secondary, dedicated deduplication device. The ability to handle more with less is even more important when deduplication is integrated into a real backup appliance (as opposed to a dedicated deduplication appliance), because an integrated backup appliance already performs quite a bit of work associated with backup, archiving, and data protection - efficiency is what allows it to do all of this at once. This is also the reason that Unitrends invests so heavily in a grid-based, federated monitoring and management approach to backup appliances - so that new backup appliances can be added to handle increasing amounts of data in a scalable fashion without ever requiring a forklift upgrade.

Incremental Forever

In terms of data reduction, incremental forever isn’t a panacea by any means - and yet it is a valuable tool for increasing backup effectiveness and reducing backup costs. Incremental forever is a backup strategy in which a single master (or full) backup occurs once and, from that point on, only incremental backups occur. What makes this work is that the backup solution must occasionally perform a “synthetic” backup: a master that is created “synthetically” on the backup device rather than from the original data. Synthetic backups are created by combining the last available master or synthetic backup with all of the incremental backups that have occurred since.
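
Conceptually, synthetic creation can be sketched as follows; the data structures and names here are illustrative assumptions, not the Unitrends implementation.

```python
# Conceptual sketch: a synthetic master is built on the backup device by
# laying the incrementals, oldest to newest, over the last master -
# no data is re-read from the original source.
def synthesize_master(master: dict[str, bytes],
                      incrementals: list[dict[str, bytes]]) -> dict[str, bytes]:
    synthetic = dict(master)                  # start from the last master/synthetic
    for incremental in incrementals:          # apply changes in chronological order
        synthetic.update(incremental)         # changed/new items overwrite old copies
    return synthetic

master = {"a.txt": b"v1", "b.txt": b"v1"}
incrementals = [{"a.txt": b"v2"}, {"c.txt": b"v1"}]
print(synthesize_master(master, incrementals))
# {'a.txt': b'v2', 'b.txt': b'v1', 'c.txt': b'v1'}
```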

Incremental backups reduce the effectiveness of data deduplication because there is less redundant data left to deduplicate. This is the reason that deduplication device vendors don’t talk about backup strategy very often. And yet, that isn’t the whole story.

Different backup solutions that support incremental forever also support different synthetic creation policies. First-generation incremental forever systems required the user to create a synthetic creation policy to manage their RTO (Recovery Time Objective). Why? In these systems, the longer the period of time between two synthetics, the longer it takes to “reassemble” the backup out of all the intervening incremental backups - and thus the longer it takes to recover.

The more synthetic backups that are required, the more duplicate data exists - and thus data deduplication is critically important in these first-generation systems.

Next-generation backup solutions, such as those offered by Unitrends, don’t have this issue because data deduplication is integrated into the synthetic creation process.


Since synthetic backups and incremental backups are essentially lists of deduplicated “chunks” of data, there’s no RTO penalty associated with the length of time between consecutive synthetic backups.

Backup Appliances Versus Dedicated Deduplication Devices

Why did Gartner recently forecast that by 2014 “deduplication will cease to be available as a stand-alone; rather, it will become a feature of broader data storage and data management solutions”? The primary reason is that the technology is being subsumed into the backup application itself. The secondary reason is that there are fundamental limits to what can be achieved with dedicated deduplication devices. Let’s review a few of the strengths and weaknesses of dedicated deduplication devices.

The strength of dedicated deduplication devices is that they appear to an existing backup application as just another storage device. Thus dedicated deduplication devices are a quick fix to the pain of having to add large amounts of storage to increase the amount of backup data that can be retained - without changing anything else. This is also the reason that dedicated deduplication device vendors often ask customers to calculate their savings by simply taking the backup retention (the number of backups) and assuming that every backup is a master (also termed a full) backup: because master backups contain the most redundant data, that assumption allows the largest possible marketing claims in terms of data reduction.

Unfortunately, there are quite a few more weaknesses of dedicated deduplication devices. These weaknesses include:

• Capital expense costs. Dedicated deduplication devices can end up costing more than simpler RAID-, NAS-, and even SAN-based storage devices. Never forget that, at the end of the day, what you’re trying to do is get the most bang for your buck in terms of backup storage. Also make sure to take into account the cost of add-ons - things like replication licenses.

• Operational expense costs. Dedicated deduplication devices often increase labor costs because they increase the complexity - and the sheer number of vendors - involved in a solution.

• License costs. You’re already paying for support and license renewals for your backup software, backup server, and backup storage - adding a dedicated deduplication device can increase your license costs as well.

• Data reduction. Dedicated deduplication devices can achieve incredible data reduction ratios - but all too often only when comparatively “dumb” backup strategies are used that inflate the amount of data being backed up (in the worst case, a backup policy that backs up everything every time, regardless of whether the data has changed). Make sure you understand the impact of an optimal backup strategy on the actual data reduction ratio achieved (see the sketch after this list).

• Performance. Deduplication vendors tout their “ingest rates” - how quickly they can receive data from a backup server and backup software.


Always remember that your true performance must be measured comprehensively, including the backup server, backup software, and backup storage performance characteristics. The backup server and backup software in a non-integrated solution perform far more tasks than simply storing data, and the LAN-based link between the backup server/software and the deduplication device often introduces latency and bandwidth constraints that limit backup performance.
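
As promised in the “Data reduction” bullet above, here is a hedged, illustrative calculation (all figures are assumptions) showing how a full-every-day strategy inflates the apparent deduplication ratio relative to incremental forever.

```python
# Hedged, illustrative arithmetic only: the same 1% daily change rate yields
# a far larger *apparent* dedup ratio when every backup is a full than when
# an incremental-forever strategy is used.
protected_tb = 10          # size of the protected data set, in TB (assumed)
retention = 30             # number of retained backups (assumed)
daily_change = 0.01        # 1% of the data changes each day (assumed)

# "Dumb" strategy: a full backup every day, deduplicated afterwards.
logical_fulls = protected_tb * retention
stored = protected_tb + protected_tb * daily_change * (retention - 1)
print(f"Full-every-day dedup ratio: {logical_fulls / stored:.0f}:1")       # ~23:1

# Incremental forever: far less redundant data is sent in the first place.
logical_incr = protected_tb + protected_tb * daily_change * (retention - 1)
print(f"Incremental-forever data sent: {logical_incr:.1f} TB vs {logical_fulls} TB")
```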

One way to integrate the strengths of backup appliances with existing dedicated deduplication devices is either to use the existing dedicated deduplication device as an archive device (in a D2D2D, or Disk-to-Disk-to-Disk, archival scheme) or to use it to augment the primary backup storage for a backup appliance. Unitrends supports using external storage, including dedicated deduplication devices, with its integrated physical and virtual backup appliances (our Recovery and vRecovery series). Unitrends also supports using dedicated deduplication devices as part of the primary backup storage in its VMware-based vRecovery virtual appliance.

Conclusion

So what in the heck does this have to do with the Olsen twins again? Well, what we’ve shown in this paper is that regardless of whether your data resembles the DNA of identical twins - with large amounts of redundant data - or of fraternal twins - with a mix of unique and redundant data - your backup solution must be able to cope and give you the tools to handle that data in an optimal fashion.

With the right backup solution, it’s possible to reduce the amount of redundant data that must be backed up in the first place while also providing advanced deduplication techniques, without the cost and pain of dedicated deduplication devices. And if dedicated deduplication devices are mandated as either primary or secondary backup storage, we’ve discussed how Unitrends supports that as well.


7 Technology Circle, Suite 100, Columbia, SC 29203

Phone: 866.359.5411 | E-Mail: [email protected] | URL: www.unitrends.com

Copyright © 2011 Unitrends. All Rights Reserved.

Deduplication, Incremental Forever, and the Olsen Twins (DIFATOTWP-20110921-01)

About Unitrends

Unitrends offers a family of affordable, all-in-one, on-premise backup appliances that support virtual and physical system backup and disaster recovery via disk-based archiving as well as electronic vaulting to private and public clouds. Unitrends is customer-obsessed, not technology-obsessed, and is focused on enabling its customers to focus on their business rather than on backup.

For more information, please visit www.unitrends.com or email us at [email protected].