Rangoli: Space management in deduped environments

P.C. Nagesh and Atish Kathpal

Advanced Technology Group, NetApp, India

Outline

What is the space management problem?

Intuition behind our solutions

Evaluation and Summary

Space management Objectives

Cluster architecture depiction from OpenStack

Low Free

Ensure adequate free space on volumes

Back end volumes as data containers

Space management problem

[Figure labels: Logical View, Physical View, Volume metadata]

Freeing up a deduped volume is hard!

Illustrative example

How do you reclaim 50 GB free space?

Logical View

Which files to    Reclamation
2,4               10
1,2,3             21
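
Why freeing a deduped volume is hard can be seen in a small sketch. The file-to-block map below is hypothetical (it is not the data behind the table above), and `reclaimed_blocks` is an illustrative helper name. The only rule it encodes is that a block is freed when no file remaining on the volume still references it.

```python
from collections import Counter

# Hypothetical volume: file id -> set of physical block ids it references.
files = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"c", "e"},
    4: {"d", "f"},
}

def reclaimed_blocks(files, to_migrate):
    """Blocks freed on the source once the files in `to_migrate` leave the volume."""
    migrating = set(to_migrate)
    # Count references held by the files that stay behind.
    remaining_refs = Counter()
    for fid, blocks in files.items():
        if fid not in migrating:
            remaining_refs.update(blocks)
    moved = set().union(*(files[f] for f in migrating))
    # A moved block is freed only if nobody left on the volume points at it.
    return {b for b in moved if remaining_refs[b] == 0}

freed_24 = reclaimed_blocks(files, [2, 4])
freed_123 = reclaimed_blocks(files, [1, 2, 3])
print(len(freed_24), len(freed_123))
```

In this toy volume, files 2 and 4 reference five blocks between them but free only two, because blocks `b` and `c` are still held by files 1 and 3. The logical size of a file set says little about the space its migration reclaims.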

Outline

Intuitive solutions and alternatives

Dedupe-unaware strategy

– Low space reclamation, too many unnecessary side effects

Naïve-du:

– Migrate files with the most unique content

Intuitive solution

– Move shared files together

– Pick good “migration bins”
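
For contrast with the bin-based idea, the Naïve-du ranking can be sketched as follows. This is a minimal illustration under assumptions: the file-to-block map is hypothetical, `naive_du_order` is an invented name, and "unique content" is taken to mean blocks referenced by exactly one file.

```python
from collections import Counter

# Hypothetical volume: file id -> set of physical block ids it references.
files = {
    1: {"a", "b", "c", "x"},
    2: {"b", "c", "d"},
    3: {"c", "e"},
    4: {"d", "f"},
}

def naive_du_order(files):
    """Rank files by how many blocks only they reference (most unique first)."""
    refs = Counter(b for blocks in files.values() for b in blocks)
    unique = {
        fid: sum(1 for b in blocks if refs[b] == 1)
        for fid, blocks in files.items()
    }
    order = sorted(files, key=lambda fid: unique[fid], reverse=True)
    return order, unique

order, unique = naive_du_order(files)
print(order, unique)
```

File 2 ranks last because every one of its blocks is shared, even though moving it together with its sharing partners could free those blocks. That gap is what grouping files into migration bins is meant to close.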

Side effects

Physical space bloat (PSB)

– Due to loss of disk sharing

– Percentage increase in physical space consumption

Migration utility

– Bandwidth wastage

– Amount of reclaimed space per 100 bytes of data transfer

It is a source-centric strategy
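
The two side-effect metrics above can be written out as formulas. The symbols are assumptions introduced here for illustration: $P_{\text{before}}$ and $P_{\text{after}}$ are the total physical space consumed before and after the migration, $R$ is the space reclaimed on the source volume, and $T$ is the amount of data transferred.

```latex
\mathrm{PSB} = 100 \times \frac{P_{\text{after}} - P_{\text{before}}}{P_{\text{before}}}
\qquad
\mathrm{MU} = 100 \times \frac{R}{T}
```

As a worked example under these definitions, transferring $T = 50$ GB to reclaim $R = 40$ GB gives $\mathrm{MU} = 80$, that is, 80 bytes reclaimed per 100 bytes transferred.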

Rangoli: Solution overview

1. Compute disk sharing relationships

– Graphical representation

2. Identify groups of highly shared files

– Good migration bins

3. Compute and report the exact metrics

– PSB and Migration Utility

Output the best migration bins (combined with any higher-level logic)

Fingerprint Database

Inode  FBN  Fingerprint
1      3    a23b1234
2      5    234c1234

Outline

Evaluation

Evaluation objectives

Comparison against alternate strategies

Datasets from diverse workloads

VM images

Home directories

Engineering document repositories

Migration Utility (VMDK dataset)

Higher is better

– More space reclamation per unit of data migrated

[Chart: x-axis Space Reclamation (%) with ticks 1, 5, 10, 20; series Naïve-du, MinHash, Rangoli]

Physical space bloat (Debian dataset)

Lower is better

– Smaller percentage increase in physical space consumption

[Chart: x-axis Space Reclamation (%) with ticks 5, 10, 20, 30; series Naïve-du, MinHash, Rangoli]

Summary

Inferences:

– Rangoli offers a scalable solution for space reclamation in deduped environments

– Better than the alternatives, in some cases by up to 35X

Future work:

– Explore destination-aware strategies

– Combine space reclamation with other desired features such as load balancing and performance considerations

Acknowledgements

Our thanks to Gaurav Makkar, Kaladhar Voruganti, Kiran Srinivasan, Parag Deshmukh, and our anonymous reviewers for their insights and valuable feedback.

This work was done as part of an Independent Research Project at the Advanced Technology Group, NetApp, Bangalore, India.

Scalability

28 minutes of end-to-end running time for the largest dataset tested

– Synthetic dataset of 4 TB with 12 million files and 85% dedupe

– Running times measured on a laptop-grade machine

Solution details

Step 1: FPDB Processing

Algorithm:

– A linear scan of the fingerprint-sorted FPDB

– Output is the bipartite graph

– Time is linearly proportional to the data set size

Inode  FBN  Fingerprint
1      3    a23b12349870
2      5    a23b12349870

[Figure labels: File Id, Amount of disk sharing]
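
The linear scan in Step 1 can be sketched as below. This is an illustration under assumptions, not the authors' code: the FPDB rows are hypothetical, `build_bipartite_graph` is an invented name, and the bipartite graph is assumed to connect file inodes on one side to block fingerprints on the other. Because the input is already sorted by fingerprint, one pass with `itertools.groupby` sees all references to a block together.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical FPDB rows (inode, file block number, fingerprint),
# already sorted by fingerprint.
fpdb = [
    (2, 5, "234c1234"),
    (1, 3, "a23b1234"),
    (2, 7, "a23b1234"),
    (3, 1, "ff001234"),
]

def build_bipartite_graph(sorted_fpdb):
    """Single pass over fingerprint-sorted rows; returns both sides of the graph."""
    file_to_blocks = {}   # inode -> set of fingerprints it references
    block_to_files = {}   # fingerprint -> set of inodes sharing that block
    for fingerprint, rows in groupby(sorted_fpdb, key=itemgetter(2)):
        inodes = {inode for inode, _fbn, _fp in rows}
        block_to_files[fingerprint] = inodes
        for inode in inodes:
            file_to_blocks.setdefault(inode, set()).add(fingerprint)
    return file_to_blocks, block_to_files

file_to_blocks, block_to_files = build_bipartite_graph(fpdb)
print(block_to_files)
```

Any fingerprint whose inode set has more than one member marks disk sharing between those files.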

Step 2: Migration binning

Algorithm:

– Seek partitions with minimal edge

– Offers good but not necessarily optimal partitions

– Time is dependent on the complexity of disk sharing in the dataset, given by the number of edges

– Quick union-find with weighted path compression as the data structure for bin management

[Figure labels: File Id, Amount of disk sharing]
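
A minimal sketch of bin management with weighted quick-union and path compression follows. The slides do not spell out the partitioning rule, so the rule used here is an assumption: files are merged into one bin whenever they share at least `min_shared_blocks` blocks, so only lightly shared edges end up cut. `UnionFind`, `make_bins`, and the edge list are hypothetical.

```python
class UnionFind:
    """Weighted quick-union with path compression."""

    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.size = {x: 1 for x in items}

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Weighted union: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra

def make_bins(file_ids, edges, min_shared_blocks=1):
    """edges: iterable of (file_a, file_b, shared_blocks). Assumed merge rule."""
    uf = UnionFind(file_ids)
    for a, b, shared in edges:
        if shared >= min_shared_blocks:
            uf.union(a, b)
    bins = {}
    for f in file_ids:
        bins.setdefault(uf.find(f), set()).add(f)
    return list(bins.values())

# Hypothetical sharing graph: the 3-4 edge is light, so it is the one cut.
edges = [(1, 2, 10), (2, 3, 8), (3, 4, 1), (4, 5, 6)]
bins = make_bins([1, 2, 3, 4, 5], edges, min_shared_blocks=5)
print(bins)
```

Each edge costs a near-constant-time `find`/`union`, which is consistent with the running time depending on the number of edges.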

Step 3: Compute the metrics

Algorithm:

– Disk sharing within a bin contributes to savings in data migration

– Data sharing across migration bins contributes to losses

We can compute the metrics for any arbitrary bins too.

The metrics computed are actuals, not estimates.
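
The accounting in Step 3 can be sketched for a single bin as follows. This is an illustration under stated assumptions: the bin is moved to an empty destination that dedupes within the bin, all blocks are the same size, and the file map and the name `bin_metrics` are hypothetical. Blocks shared inside the bin are transferred once (savings). Blocks shared across the bin boundary are transferred but not freed on the source (losses).

```python
# Hypothetical volume: file id -> set of block fingerprints it references.
files = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"c", "e"},
    4: {"d", "f"},
}

def bin_metrics(files, migration_bin, block_size=1):
    """Exact metrics for migrating one bin off the source volume."""
    migrating = set(migration_bin)
    inside = set().union(*(files[f] for f in migrating))
    outside = set().union(*(b for f, b in files.items() if f not in migrating))

    transferred = len(inside) * block_size           # in-bin sharing is sent once
    reclaimed = len(inside - outside) * block_size   # freed on the source
    duplicated = len(inside & outside) * block_size  # sharing lost across the boundary
    before = len(inside | outside) * block_size      # physical space before migration

    utility = 100.0 * reclaimed / transferred if transferred else 0.0
    bloat = 100.0 * duplicated / before if before else 0.0
    return {
        "transferred": transferred,
        "reclaimed": reclaimed,
        "migration_utility": utility,
        "psb_percent": bloat,
    }

metrics = bin_metrics(files, [1, 2, 3])
print(metrics)
```

For the bin {1, 2, 3}, five blocks are transferred and four are reclaimed. Block `d` is still needed by file 4, so it ends up stored twice and accounts for the bloat. Because the function takes any set of files, it applies equally to arbitrary bins.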

Datasets and experimental details

Dataset         Size    No. of files  Dedup
Home Directory  74 GB   78 K          49%
Debian          261 GB  448           59%
VMDK            2.4 TB  2.4 K         62%
EngWebBurt      1.3 TB  4 M           51%
Synthetic 1     2.6 TB  8 M           77%
Synthetic 2     4 TB    12 M          85%

Real world datasets from different workloads

Synthetic datasets for testing scalability

Physical space bloat

Migration Utility

Scalability

Dataset         Size    No. of files  Dedup  Total running time in parallel
Home Directory  74 GB   78 K          49%    24 sec
Debian          261 GB  448           59%    1 min
VMDK            2.4 TB  2.4 K         62%    13 min
EngWebBurt      1.3 TB  4 M           51%    9 min
Synthetic 1     2.6 TB  8 M           77%    20 min
Synthetic 2     4 TB    12 M          85%    28 min

Scales to large datasets

– ~30 minutes for a 4 TB, 12 M file dataset with 85% dedupe
