SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems. Daehee Kim, Sejun Song, Baek-Young Choi, University of Missouri-Kansas City

Page 1: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems

Daehee Kim, Sejun Song, Baek-Young Choi

University of Missouri-Kansas City

Page 2: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Cloud Storage (Dropbox, Google Drive, …)

[Figure: individuals and organizations (sales, marketing, employees) use cloud storage anywhere, anytime, e.g. for remote backup]

Server: large storage consumption
Network: high network bandwidth consumption
Client: high uploading overhead

Page 3: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Data deduplication

Deduplication granularity
  File-level
  Sub-file-level
    Fixed-size chunk
    Variable-size chunk

Deduplication location
  Server-based: traditionally on high-capacity servers
  Client-based: limited by the client capacity

Page 4: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

File-Level Deduplication

[Figure: the index of each file is looked up in the index table. A unique file is written to storage; a duplicate file is not stored (X). Arrows are labelled control and data.]
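
The file-level scheme on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `file_level_dedup`, the use of SHA-1 as the file index, and the in-memory `set`/`dict` standing in for the index table and the storage are all assumptions made for the example.

```python
import hashlib

def file_level_dedup(files, index_table, storage):
    """Store each unique file once; return the names of duplicate files.

    files: iterable of (name, bytes) pairs.
    index_table: set of file indexes seen so far (assumed: SHA-1 hex digests).
    storage: dict mapping index -> file content.
    """
    duplicates = []
    for name, data in files:
        index = hashlib.sha1(data).hexdigest()  # one index per whole file
        if index in index_table:
            duplicates.append(name)             # duplicate: nothing is stored
        else:
            index_table.add(index)              # unique: record index, store data
            storage[index] = data
    return duplicates

index_table, storage = set(), {}
dups = file_level_dedup(
    [("a.txt", b"hello"), ("b.txt", b"world"), ("copy_of_a.txt", b"hello")],
    index_table,
    storage,
)
print(dups)  # ['copy_of_a.txt']
```

Only one index entry exists per file, which is why file-level deduplication keeps the index small; the cost is that a single changed byte makes the whole file look unique.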

Page 5: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Sub-File Level: Fixed-Size Chunk (Fixed-Size Block Deduplication)

e.g. granularity: 15-byte fixed size

File1: nice people, good papers, and good conference, ……
  chunks: [nice people, go] [od papers, and ] [good conference] ……

File2: welcome, nice people, good papers, and good conference, ……
  chunks: [welcome, nice p] [eople, good pap] [ers, and good c] ……

Offset shifting problem: no redundancies found
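
The offset shifting problem can be reproduced with the slide's own strings. The sketch below assumes 15-byte chunks as in the example; the helper name `fixed_chunks` is hypothetical, and chunks are compared directly instead of through hashed indexes to keep the example short.

```python
def fixed_chunks(data: bytes, size: int = 15):
    """Split data at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]

file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1   # same content with 9 bytes inserted in front

chunks1 = fixed_chunks(file1)
chunks2 = fixed_chunks(file2)
shared = set(chunks1) & set(chunks2)

print(chunks1[0])   # b'nice people, go'
print(chunks2[0])   # b'welcome, nice p'
print(len(shared))  # 0
```

Inserting "welcome, " shifts every later boundary by 9 bytes, so no chunk of File2 lines up with a chunk of File1 even though File2 contains all of File1.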

Page 6: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Sub-File Level: Variable-Size Chunk (Variable-Size Block Deduplication)

Chunk boundaries are based on content, not on fixed offsets

e.g. matching pattern: “go”

File1: nice people, good papers, and good conference, ……
  chunks: [nice people, go] [od papers, and go] ……

File2: welcome, nice people, good papers, and good conference, ……
  chunks: [welcome, nice people, go] [od papers, and go] ……

The chunk [od papers, and go] is identical in File1 and File2
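
A toy version of content-defined chunking, using the slide's literal pattern "go" as the boundary marker, shows why the inserted prefix no longer hides the redundancy. The helper name `pattern_chunks` is an assumption for this sketch; practical variable-size chunkers typically pick boundaries with a rolling hash (for example Rabin fingerprints) over a sliding window, which is where the extra processing time comes from.

```python
def pattern_chunks(data: bytes, pattern: bytes = b"go"):
    """Cut a chunk boundary right after every occurrence of pattern."""
    chunks, start = [], 0
    pos = data.find(pattern)
    while pos != -1:
        end = pos + len(pattern)
        chunks.append(data[start:end])
        start = end
        pos = data.find(pattern, start)
    if start < len(data):
        chunks.append(data[start:])   # trailing bytes after the last boundary
    return chunks

file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1

chunks1 = pattern_chunks(file1)
chunks2 = pattern_chunks(file2)
shared = set(chunks1) & set(chunks2)

print(chunks1)  # [b'nice people, go', b'od papers, and go', b'od conference']
print(chunks2)  # [b'welcome, nice people, go', b'od papers, and go', b'od conference']
```

Only the first chunk differs between the two files; every chunk after the first boundary is found as a duplicate.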

Page 7: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Deduplication: Comparisons

Deduplication ratio: File-level < Fixed size << Variable size (better toward variable size)
Processing time: File-level < Fixed size <<<< Variable size (worse toward variable size)
Index overhead: File-level << Fixed size, Variable size (worse toward variable size)

Toward variable size: good for server-based
Toward file-level: good for client-based

Current cloud storage systems (client-based)
  JustCloud, Mozy: file-level deduplication
  Dropbox: large fixed-size block deduplication (4 MB)

Page 8: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Objective

Develop an efficient client-based deduplication that achieves
  High deduplication ratio
  Less index overhead
  Low processing time
  Low network traffic

Page 9: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Outline

Motivation, Background, and Goal
Observations and Approach
Design
Evaluation
Conclusion

Page 10: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Observations

A structured file can be decomposed into various objects
  Fast decomposition without the offset shifting problem
  e.g. compressed files (zip, rar, ..), document files (pdf, doc, ppt, docx, pptx), emails

[ Example ] email
  meta
  body (text)
  attachments: text, pdf, docx, images, …

[ Example ] pdf
  Page: <</Type/Page/ …>>
  Image object: <</Type/.. Image/.. Filter/.. Length>><stream>Encoded image<endstream>
  Text object: <</Filter/ .. /Length >><stream>Encoded text<endstream>
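
The email example above (meta, body, attachments) can be sketched with Python's standard `email` package. This is an illustration under assumptions, not the SAFE email parser: the function names `decompose_email` and `build_email`, the addresses, and the choice of SHA-1 for object indexes are all made up for the example.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage

def decompose_email(raw: bytes):
    """Split a raw email into (name, bytes) objects: meta, body, attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    meta = "\n".join(f"{k}: {v}" for k, v in msg.items()).encode("utf-8")
    objects = [("meta", meta)]
    body = msg.get_body(preferencelist=("plain",))
    if body is not None:
        objects.append(("body", body.get_content().encode("utf-8")))
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True) or b""   # decoded attachment bytes
        objects.append((part.get_filename() or "attachment", payload))
    return objects

def build_email(subject: str, body: str, attachment: bytes) -> bytes:
    """Hypothetical helper that builds a test message with one attachment."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg.set_content(body)
    msg.add_attachment(attachment, maintype="application",
                       subtype="pdf", filename="report.pdf")
    return msg.as_bytes()

pdf = b"%PDF-1.4 example attachment bytes"
objs1 = decompose_email(build_email("Report", "First version attached.", pdf))
objs2 = decompose_email(build_email("Fwd: Report", "Forwarding the report.", pdf))

names1 = [name for name, _ in objs1]
digest1 = {name: hashlib.sha1(data).hexdigest() for name, data in objs1}
digest2 = {name: hashlib.sha1(data).hexdigest() for name, data in objs2}
print(names1)                                          # ['meta', 'body', 'report.pdf']
print(digest1["report.pdf"] == digest2["report.pdf"])  # True
```

The two messages differ as whole files, so file-level deduplication would store both in full, but the decoded attachment hashes to the same index in each and would be stored once.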

Page 11: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Observations

A large number of structured files exist in cloud-based storage systems

[ dataset ]

Page 12: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Our Approach (SAFE)

Apply file-level deduplication for redundant files
  Speeds up processing and keeps index sizes small

Apply object-based deduplication for structured files
  Decompose a file into objects
  Find redundancies based on the decomposed objects
  Combine small-sized metadata into one object (to reduce index sizes)
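
The two stages above can be sketched as follows. This is a rough illustration under stated assumptions, not the SAFE implementation: ZIP containers stand in for structured files (docx and pptx files are ZIP archives), the 64-byte `SMALL` threshold for "small-sized" members is invented, and the names `decompose`, `safe_dedup`, and `make_zip` are hypothetical.

```python
import hashlib
import io
import zipfile

SMALL = 64  # assumed threshold (bytes) below which members are merged

def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def decompose(data: bytes):
    """Return the objects of a ZIP-based file, or None if it is unstructured."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    objects, small = [], []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in sorted(zf.namelist()):
            content = zf.read(name)
            if len(content) < SMALL:
                small.append(name.encode() + b"\0" + content)
            else:
                objects.append(content)
    if small:
        objects.append(b"\n".join(small))  # one combined object, one index entry
    return objects

def safe_dedup(data, file_index, object_index, store):
    """File-level check first, then object-level; returns bytes newly stored."""
    fid = sha1(data)
    if fid in file_index:          # redundant file: stop here
        return 0
    file_index.add(fid)
    objects = decompose(data)
    if objects is None:            # unstructured file: store it whole
        store[fid] = data
        return len(data)
    stored = 0
    for obj in objects:            # structured file: store only unique objects
        oid = sha1(obj)
        if oid not in object_index:
            object_index.add(oid)
            store[oid] = obj
            stored += len(obj)
    return stored

def make_zip(members: dict) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()

image = b"\x89PNG" + b"x" * 500
text2 = b"<p>second draft</p>" * 10
doc1 = make_zip({"image.png": image, "text.xml": b"<p>first draft</p>" * 10, "meta.xml": b"<a/>"})
doc2 = make_zip({"image.png": image, "text.xml": text2, "meta.xml": b"<a/>"})

file_index, object_index, store = set(), set(), {}
first = safe_dedup(doc1, file_index, object_index, store)
second = safe_dedup(doc2, file_index, object_index, store)   # only the new text is stored
third = safe_dedup(doc1, file_index, object_index, store)    # caught at file level
print(first, second, third)
```

A real system would also keep a per-file list of object indexes so that files can be reassembled; that bookkeeping is left out here.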

Page 13: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Outline

Motivation, Background, and Goal
Observations and Approach
Design
Evaluation
Conclusion

Page 14: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

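SAFE Architecture

[Figure: SAFE architecture. Inputs: files (file parser) and emails (email parser). Components: file-level dedup, object-level dedup, structure library, object manager, store manager. File-level dedup separates redundant files (end) from unique files. A unique file is checked (Structured?): an unstructured file is passed on as a whole, while a structured file is decomposed into objects (e.g. img, pdf, meta). Object-level dedup reduces all object indexes to unique object indexes, and (index, object) pairs are stored.]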

Page 15: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

SAFE in Cloud Storage

[Figure: SAFE runs on the client and performs file-level dedup followed by object-level dedup. Labels on the client-server exchange: indexes (objects), indexes (unique objects), unique objects.]
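
One plausible reading of the labels in this figure is a three-step exchange: the client sends object indexes, the server answers with the indexes it does not yet hold, and the client uploads only those objects. The sketch below illustrates that reading only; the `Server` class, its method names, and the use of SHA-1 are assumptions, and the real protocol may differ.

```python
import hashlib

class Server:
    """Hypothetical in-memory stand-in for the storage server."""

    def __init__(self):
        self.store = {}

    def filter_unique(self, indexes):
        # Reply with the indexes the server has not stored yet.
        return [i for i in indexes if i not in self.store]

    def upload(self, objects):
        self.store.update(objects)

def client_upload(objects, server):
    """Send indexes first, then only unique objects; return payload bytes sent."""
    by_index = {hashlib.sha1(o).hexdigest(): o for o in objects}
    unique = server.filter_unique(list(by_index))
    server.upload({i: by_index[i] for i in unique})
    return sum(len(by_index[i]) for i in unique)

server = Server()
sent1 = client_upload([b"img", b"text-v1"], server)
sent2 = client_upload([b"img", b"text-v2"], server)   # b"img" is already on the server
print(sent1, sent2)  # 10 7
```

Because the duplicate check happens before any object leaves the client, redundant data costs only the size of its index on the network.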

Page 16: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Outline

Motivation, Background, and Goal
Observations and Approach
Design
Evaluation
Conclusion

Page 17: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Setup

Collected real data sets
  Structured files (docx, pptx, and pdf)
  From the file systems and emails of five graduate students in the same department
  File system: 4 GB, emails: 2.5 GB

Compared deduplication schemes
  File-level (like JustCloud, Mozy)
  Fixed block (4 MB, like Dropbox)
  Variable block (8 KB average chunk size)

Page 18: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Evaluation Metrics

Performance
  Deduplication ratio
    Space savings by removing redundancies
    ((InputData - ConsumedStorage) / InputData) * 100
  Network traffic
    Size of data transferred to storage over the network (bytes)

Overhead
  Processing time
    Processing time relative to File-level
  Index size
    Index size relative to File-level
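
The deduplication ratio formula above translates directly into code. The function name and the sample numbers are hypothetical and only show how the formula is applied; they are not measurements from the evaluation.

```python
def deduplication_ratio(input_bytes: float, consumed_bytes: float) -> float:
    """Space savings in percent: ((InputData - ConsumedStorage) / InputData) * 100."""
    return (input_bytes - consumed_bytes) / input_bytes * 100

# Hypothetical example: 4000 MB of input that occupies 2400 MB after deduplication.
ratio = deduplication_ratio(4000, 2400)
print(f"{ratio:.1f}%")  # 40.0%
```

A ratio of 0% means nothing was saved, and the ratio approaches 100% as the stored data shrinks toward zero.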

Page 19: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Deduplication Ratio

[Figure: deduplication ratio for file system datasets and email datasets; annotations: x2, x1.5]

Deduplication ratio
  is about 30% to 60% in SAFE
  is 2 times higher in SAFE than in File-level
  is even higher in SAFE than in Block-V for file system datasets
  is as good in SAFE as in variable-size block deduplication (Block-V) for email datasets

Page 20: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Network Traffic

[Figure: network traffic for file system datasets and email datasets; annotations: 15%, 30%]

Network traffic
  is the lowest in SAFE for both datasets
  is 15% and 30% lower in SAFE than in file-level deduplication (File) and fixed-size block deduplication (Block-F) for both datasets

Page 21: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Processing Time

[Figure: processing time for file system datasets and email datasets; annotation on both panels: hundreds of times]

Processing time
  is as fast in SAFE as in File-level
  is hundreds of times faster in SAFE than in Block-V

Page 22: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Index Size

[Figure: index size for file system datasets and email datasets]

Index size
  is proportional to the number of unique blocks (40 B per index)
    e.g. for 4000 emails, index sizes are 0.1 MB (File-level) and 1.3 MB (SAFE)
  is 2 to 3 times smaller in SAFE (1.3 MB) than in Block-V (3.7 MB)
    Block-V has an 8 KB block size on average
  is 2 times larger for file system datasets than for email datasets
    SAFE has multiple decomposed objects per file
    e.g. the file system dataset has more pdf files (a pdf file can be decomposed into more objects than a docx file)
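
Since each index entry takes 40 B, the reported index sizes can be turned back into approximate entry counts. The sketch assumes 1 MB = 10^6 bytes, and the slide's sizes are rounded, so the counts are rough estimates only; the helper name `entries` is hypothetical.

```python
ENTRY_BYTES = 40   # from the slide: 40 B per index
MB = 10**6         # assumption: decimal megabytes

def entries(index_size_mb: float) -> int:
    """Approximate number of index entries for a given index size."""
    return round(index_size_mb * MB / ENTRY_BYTES)

file_level = entries(0.1)   # about 2,500 entries
safe = entries(1.3)         # about 32,500 entries
block_v = entries(3.7)      # about 92,500 entries

print(file_level, safe, block_v)
print(round(block_v / safe, 2))  # 2.85, i.e. "2 to 3 times"
```

Under these assumptions SAFE indexes roughly 13 objects for every file-level entry on the email dataset, while Block-V indexes close to three times as many blocks as SAFE has objects.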

Page 23: SAFE : Structure-Aware File and Email  Deduplication  for  Cloud-based Storage Systems

Conclusions

Developed an efficient structure-aware client-based deduplication (SAFE)
  High deduplication ratio: as good as Block-V
  Low network traffic: as good as Block-V
  Less index overhead: 2 to 3 times less than Block-V
  Low processing time: hundreds of times faster than Block-V

Future work
  Extend to incorporate more structured file types