Rangoli: Space management in deduped environments

P.C. Nagesh and Atish Kathpal, Advanced Technology Group, NetApp, India

Post on 01-Sep-2019

Page 2

Outline

What is the space management problem?

Intuition behind our solutions

Evaluation and Summary

Page 3

Space management objectives

[Figure: cluster architecture depiction from OpenStack; callout: "Low Free Space"]

Ensure adequate free space on volumes

Back-end volumes as data containers

Page 4

Space management problem

[Figure: logical view, physical view, and volume metadata (data block, ref count) of a deduped volume]

Freeing up a deduped volume is hard!

Page 5

Illustrative example

How do you reclaim 50 GB free space?

[Figure: logical view of files 1 to 6]

Which files to move?   Space reclamation
6                      10
2,4                    10
1,2,3                  21
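
The question on this slide can be made concrete with a small sketch. It assumes each file is described by the set of physical blocks it references; the layout below is hypothetical and is not the one drawn on the slide. A block is freed on the source volume only when every file that references it has been moved.

```python
from typing import Dict, Iterable, Set


def reclaimed_blocks(file_blocks: Dict[int, Set[str]], moved: Iterable[int]) -> int:
    """Count blocks freed on the source volume when the `moved` files leave it.

    A block is freed only if no file that stays behind still references it.
    """
    moved = set(moved)
    still_referenced: Set[str] = set()
    candidates: Set[str] = set()
    for file_id, blocks in file_blocks.items():
        if file_id in moved:
            candidates |= blocks
        else:
            still_referenced |= blocks
    return len(candidates - still_referenced)


# Hypothetical layout: file id -> physical blocks it references.
layout = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"c", "e"},
    4: {"f", "g"},
}

print(reclaimed_blocks(layout, [1]))        # only block "a" is exclusive to file 1
print(reclaimed_blocks(layout, [1, 2]))     # "a", "b", "d"; "c" is still held by file 3
print(reclaimed_blocks(layout, [1, 2, 3]))  # the whole shared group goes: 5 blocks
```

In this toy layout, moving file 1 alone frees one block while moving files 1, 2, and 3 together frees five, which is why the choice of files matters more than their logical size.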

Page 7

Intuitive solutions and alternatives

[Figure: disk sharing among files 1 to 6]

Dedupe-unaware strategy: low space reclamation, too many unnecessary side effects

Naïve-du: migrate files with the most unique content

Intuitive solution: move shared files together. Pick good “migration bins”.
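
One plausible reading of the Naïve-du baseline is sketched below. The selection rule (rank files by the number of blocks no other file references, then pick until a target is met) is an assumption made for illustration, as are the function name and the toy layout; the slides do not spell out the exact rule.

```python
from collections import Counter
from typing import Dict, List, Set


def naive_du_pick(file_blocks: Dict[int, Set[str]], target_blocks: int) -> List[int]:
    """Pick files in decreasing order of unique (unshared) block count
    until their unique blocks add up to at least `target_blocks`."""
    ref_count = Counter(b for blocks in file_blocks.values() for b in blocks)
    unique = {
        f: sum(1 for b in blocks if ref_count[b] == 1)
        for f, blocks in file_blocks.items()
    }
    picked: List[int] = []
    freed = 0
    # Ties are broken by file id so the result is deterministic.
    for f in sorted(unique, key=lambda f: (-unique[f], f)):
        if freed >= target_blocks:
            break
        picked.append(f)
        freed += unique[f]
    return picked


# Hypothetical layout: file id -> physical blocks it references.
layout = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"c", "e"},
    4: {"f", "g"},
}

print(naive_du_pick(layout, 3))
```

Because this rule looks at each file in isolation, it never notices that files 1 and 2 share block "b" and would free it if moved together. That blind spot is the motivation for grouping shared files into migration bins.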

Page 8

Side effects

Physical space bloat (PSB)

– Due to loss of disk sharing

– Percentage increase in physical space consumption

Migration utility

– Bandwidth wastage

– Amount of reclaimed space per 100 bytes of data transferred

It is a source-centric strategy

Page 9

Rangoli: Solution overview

1. Compute disk sharing relationships

– Graphical representation

2. Identify groups of highly shared files

– Good migration bins

3. Compute and report the exact metrics

– PSB and Migration Utility

Output the best migration bins (combined with any higher level logic)

NetApp Confidential - Internal Use Only

[Figure: files grouped into migration bins 1,2,3; 6; and 4,5]

Fingerprint Database

Inode   FBN   Fingerprint
1       3     a23b1234
2       5     234c1234

Page 11

Evaluation

Evaluation objectives

Comparison against alternate strategies

Datasets from diverse workloads

– VM images

– Home directories

– Engineering document repositories

Page 12

Migration Utility (VMDK dataset)

Higher the better

– More space reclamation per unit of data migration

[Figure: Migration Utility (%), axis 0 to 100, versus Space Reclamation (%) at 1, 5, 10, and 20, comparing Naïve-du, MinHash, and Rangoli]

Page 13

Physical space bloat (Debian dataset)

Lower the better

– Less percentage increase in physical space consumption

[Figure: PSB (%), axis 0 to 40, versus Space Reclamation (%) at 5, 10, 20, and 30, comparing Naïve-du, MinHash, and Rangoli]

Page 14

Summary

Inferences:

– Rangoli offers a scalable solution for space reclamation in deduped environments

– Better than the alternatives by up to 35X in some cases

Future work:

– Explore destination-aware strategies

– Combine space reclamation with other desired features such as load balancing and performance considerations

Page 15

Acknowledgements

Our thanks to Gaurav Makkar, Kaladhar Voruganti, Kiran Srinivasan, Parag Deshmukh, and our anonymous reviewers for their insights and valuable feedback.

This work was done as part of an Independent Research Project at the Advanced Technology Group, NetApp, Bangalore, India.

Page 17

Scalability

28 minutes of end-to-end running time for the largest dataset tested

– Synthetic dataset of 4 TB with 12 million files and 85% dedupe

– Running times measured on a laptop-grade machine

Page 18

Solution details

Page 19

Step 1: FPDB Processing

Algorithm:

– A linear scan of the fingerprint-sorted FPDB (fingerprint database)

– Output is the bipartite graph

– Time is linearly proportional to the data set size

Inode   FBN   Fingerprint
1       3     a23b12349870
2       5     a23b12349870

[Figure: graph of file IDs 1 to 6 with edges weighted by the amount of disk sharing]
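
The scan described in Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes FPDB records are `(inode, FBN, fingerprint)` tuples already sorted by fingerprint, and it reads "bipartite graph" as fingerprints on one side and files on the other. Only the first two records come from the slide; the rest, and all function names, are made up for the example.

```python
from collections import defaultdict
from itertools import combinations, groupby
from operator import itemgetter
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, int, str]  # (inode, FBN, fingerprint)


def bipartite_graph(fpdb_sorted: Iterable[Record]) -> Dict[str, List[int]]:
    """Single pass over a fingerprint-sorted FPDB.

    Equal fingerprints are adjacent, so each run of records yields one
    fingerprint node and the files (inodes) that reference it.
    """
    graph: Dict[str, List[int]] = {}
    for fingerprint, run in groupby(fpdb_sorted, key=itemgetter(2)):
        graph[fingerprint] = sorted({inode for inode, _, _ in run})
    return graph


def sharing_weights(graph: Dict[str, List[int]]) -> Dict[Tuple[int, int], int]:
    """Collapse the bipartite graph into file-to-file edges weighted by the
    number of shared fingerprints. Cost per fingerprint is quadratic in the
    number of files sharing it, so this step is not part of the linear scan."""
    weights: Dict[Tuple[int, int], int] = defaultdict(int)
    for inodes in graph.values():
        for a, b in combinations(inodes, 2):
            weights[(a, b)] += 1
    return dict(weights)


fpdb = [
    (1, 3, "a23b12349870"),  # from the slide
    (2, 5, "a23b12349870"),  # from the slide
    (2, 6, "ffee00000000"),  # hypothetical
    (3, 0, "ffee00000000"),  # hypothetical
    (3, 1, "0c0c00000000"),  # hypothetical
]
graph = bipartite_graph(sorted(fpdb, key=itemgetter(2)))
print(graph)
print(sharing_weights(graph))
```

Counting shared fingerprints is equivalent to counting shared blocks when all blocks have the same size; a real tool would presumably weight edges by bytes.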

Page 20

Step 2: Migration binning

Algorithm:

– Seek partitions with minimal edge cuts

– Offers good but not necessarily optimal partitions

– Time depends on the complexity of disk sharing in the dataset, given by the number of edges

– Weighted quick-union with path compression as the data structure for bin management

[Figure: the file graph (file IDs 1 to 6, edges weighted by the amount of disk sharing) partitioned into migration bins]
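
A minimal sketch of bin management with weighted quick-union and path compression is given below. The union-find part is the standard data structure; the merge rule around it (visit edges from heaviest to lightest and merge two bins only while the result stays under a file-count cap) is an assumption for illustration, since the slides only say that partitions with small edge cuts are sought. The edge weights and the cap are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


class UnionFind:
    """Weighted quick-union with path compression."""

    def __init__(self, items: Iterable[int]) -> None:
        self.parent = {x: x for x in items}
        self.size = {x: 1 for x in self.parent}  # files per bin, valid at roots

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression: point nodes at the root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> int:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:  # weighting: smaller tree goes under larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


def migration_bins(files: List[int], edges: Dict[Tuple[int, int], int],
                   max_files_per_bin: int) -> List[List[int]]:
    """Greedy binning (assumed rule): merge along the heaviest edges first so
    that the edges left cut between bins tend to be the light ones."""
    uf = UnionFind(files)
    for (a, b), _weight in sorted(edges.items(), key=lambda e: (-e[1], e[0])):
        ra, rb = uf.find(a), uf.find(b)
        if ra != rb and uf.size[ra] + uf.size[rb] <= max_files_per_bin:
            uf.union(ra, rb)
    bins: Dict[int, List[int]] = {}
    for f in files:
        bins.setdefault(uf.find(f), []).append(f)
    return sorted(sorted(members) for members in bins.values())


# Hypothetical sharing weights between files 1 to 6.
edges = {(1, 2): 5, (1, 3): 3, (2, 3): 3, (4, 5): 4, (3, 4): 1}
print(migration_bins([1, 2, 3, 4, 5, 6], edges, max_files_per_bin=3))
```

With these weights the only edge cut is the light one between files 3 and 4, and file 6, which shares nothing, stays in a bin of its own. A cap expressed in bytes rather than file count would be a natural variation.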

Page 21

Step 3: Compute the metrics

Algorithm:

– Disk sharing within a bin contributes to savings in data migration

– Data sharing across migration bins contributes to losses

We can compute the metrics for any arbitrary bins too.

The metrics computed are actuals, not estimates.

[Figure: files 1 to 6 grouped into migration bins]
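
The two bullets above suggest how the metrics fall out of block sets, sketched here under stated assumptions. Blocks shared inside the bin are transferred once (the saving); blocks also referenced by files that stay behind end up on both volumes (the loss). The exact denominators are assumptions: physical space bloat is taken relative to the source volume's physical size before migration, and migration utility is reclaimed blocks per 100 blocks transferred. The function name and layout are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def bin_metrics(file_blocks: Dict[int, Set[str]],
                bin_files: Iterable[int]) -> Tuple[float, float]:
    """Return (physical space bloat %, migration utility %) for one bin."""
    bin_files = set(bin_files)
    moved: Set[str] = set()
    kept: Set[str] = set()
    for file_id, blocks in file_blocks.items():
        if file_id in bin_files:
            moved |= blocks  # within-bin sharing: each block counted once
        else:
            kept |= blocks
    if not moved:
        raise ValueError("bin references no blocks")
    physical_before = len(moved | kept)
    reclaimed = len(moved - kept)   # blocks no remaining file needs
    duplicated = len(moved & kept)  # cross-bin sharing: now stored twice
    psb = 100.0 * duplicated / physical_before
    utility = 100.0 * reclaimed / len(moved)
    return psb, utility


# Hypothetical layout: file id -> physical blocks it references.
layout = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"c", "e"},
    4: {"f", "g"},
}

print(bin_metrics(layout, [1, 2]))  # one shared block ("c") is duplicated
print(bin_metrics(layout, [4]))     # nothing shared: no bloat, full utility
```

Because the inputs are the actual block sets, the values are exact for whatever bin is passed in, which matches the slide's point that the metrics are actuals and can be computed for arbitrary bins.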

Page 22

Datasets and experimental details

Dataset          Size     No. of files   Dedup
Home Directory   74 GB    78 K           49%
Debian           261 GB   448            59%
VMDK             2.4 TB   2.4 K          62%
EngWebBurt       1.3 TB   4 M            51%
Synthetic 1      2.6 TB   8 M            77%
Synthetic 2      4 TB     12 M           85%

Real-world datasets from different workloads

Synthetic datasets for testing scalability

Page 23

Physical space bloat

Page 24

Migration Utility

Page 25

Scalability

Dataset          Size     No. of files   Dedup   Total running time (parallel mode)
Home Directory   74 GB    78 K           49%     24 sec
Debian           261 GB   448            59%     1 min
VMDK             2.4 TB   2.4 K          62%     13 min
EngWebBurt       1.3 TB   4 M            51%     9 min
Synthetic 1      2.6 TB   8 M            77%     20 min
Synthetic 2      4 TB     12 M           85%     28 min

Scales to large datasets

– ~30 minutes for a 4 TB, 12M-file dataset with 85% dedupe