SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems
Daehee Kim, Sejun Song, Baek-Young Choi
University of Missouri-Kansas City
Cloud Storage (Dropbox, Google Drive, ...)
[Figure: individuals and company employees (e.g. Sales, Marketing) accessing cloud storage anywhere, anytime]
  Server: large storage consumption
  Network: high network bandwidth consumption
  Client: high uploading overhead (e.g. remote backup)
Data Deduplication
Deduplication granularity
  File-level
  Sub-file-level
    Fixed-size chunk
    Variable-size chunk
Deduplication location
  Server-based: traditionally on high-capacity servers
  Client-based: limited by the client capacity
File-Level Deduplication
[Figure: the index of each file is looked up in the index table; a unique file goes to storage, a duplicate file is not stored (X). Legend: control, data]
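File-level deduplication as described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `FileLevelStore` is hypothetical, and SHA-1 as the index function is an assumption (the slides do not name a hash).

```python
import hashlib


def file_index(data: bytes) -> str:
    """Index of a whole file (SHA-1 is an assumed choice of hash)."""
    return hashlib.sha1(data).hexdigest()


class FileLevelStore:
    """Hypothetical store: one index-table entry per unique file."""

    def __init__(self):
        self.index_table = {}  # index -> stored file bytes

    def put(self, data: bytes):
        idx = file_index(data)
        if idx in self.index_table:
            return idx, False  # duplicate: nothing new is stored
        self.index_table[idx] = data
        return idx, True  # unique: stored


store = FileLevelStore()
_, stored_first = store.put(b"report v1")
_, stored_again = store.put(b"report v1")  # identical file
_, stored_other = store.put(b"report v2")  # one byte differs
```

Because the index covers the whole file, a single changed byte makes the file unique again, which is why the ratio of file-level deduplication is limited.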
Sub-File Level: Fixed-Size Chunk (Fixed-Size Block Deduplication)
e.g. granularity: 15-byte fixed size
File1: nice people, good papers, and good conference, ……
  chunks: "nice people, go" | "od papers, and " | "good conference" | ……
File2: welcome, nice people, good papers, and good conference, ……
  chunks: "welcome, nice p" | "eople, good pap" | "ers, and good c" | ……
Offset shifting problem: no redundancies found
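The offset shifting problem can be reproduced with the slide's own strings. The sketch below cuts both files into 15-character chunks (characters stand in for bytes here) and checks how many chunks they share; the helper name `fixed_chunks` is made up for this illustration.

```python
def fixed_chunks(text: str, size: int = 15):
    """Cut text into consecutive chunks of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


file1 = "nice people, good papers, and good conference"
file2 = "welcome, " + file1  # same content with a prefix inserted

chunks1 = fixed_chunks(file1)
chunks2 = fixed_chunks(file2)

# Chunks that appear in both files and could be deduplicated.
shared = set(chunks1) & set(chunks2)
```

Inserting "welcome, " moves every later boundary by nine characters, so `shared` comes out empty even though File2 contains all of File1.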
Sub-File Level: Variable-Size Chunk (Variable-Size Block Deduplication)
Based on content, not fixed offset
e.g. matching pattern: "go"
File1: nice people, good papers, and good conference, ……
  chunks: "nice people, go" | "od papers, and go" | ……
File2: welcome, nice people, good papers, and good conference, ……
  chunks: "welcome, nice people, go" | "od papers, and go" | ……
The chunk "od papers, and go" in File1 = the chunk "od papers, and go" in File2
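Content-defined boundaries can be illustrated with the same strings. This sketch places a boundary right after every occurrence of the pattern "go", as in the slide's example. The literal pattern is only a teaching device: practical variable-size chunkers typically pick boundaries with a rolling hash over a sliding window, which is not shown here. The function name is hypothetical.

```python
def chunk_by_pattern(text: str, pattern: str = "go"):
    """Split text so that each chunk ends right after `pattern`."""
    chunks, start = [], 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            break
        end = pos + len(pattern)
        chunks.append(text[start:end])
        start = end
    if start < len(text):
        chunks.append(text[start:])  # trailing data without a boundary
    return chunks


file1 = "nice people, good papers, and good conference"
file2 = "welcome, " + file1

chunks1 = chunk_by_pattern(file1)
chunks2 = chunk_by_pattern(file2)
shared = set(chunks1) & set(chunks2)
```

Only the first chunk of File2 is affected by the inserted prefix; every chunk after the first boundary matches File1 again.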
Deduplication: Comparisons
Current cloud storage systems (client-based)
  JustCloud, Mozy: file-level deduplication
  Dropbox: large fixed-size block deduplication (4 MB)
Deduplication ratio (higher is better): File-level < Fixed size << Variable size
Processing time (higher is worse): File-level < Fixed size <<<< Variable size
Index overhead (higher is worse): File-level << Fixed size, Variable size
Low-overhead schemes (file-level, fixed size) are good for client-based deduplication; variable size is good for server-based deduplication
Objective
Develop an efficient client-based deduplication that achieves
  High deduplication ratio
  Less index overhead
  Low processing time
  Low network traffic
Outline
  Motivation, Background, and Goal
  Observations and Approach
  Design
  Evaluation
  Conclusion
Observations
A structured file can be decomposed into various objects
  Fast decomposition without the shifting problem
  e.g. compressed files (zip, rar, ..), document files (pdf, doc, ppt, docx, pptx), emails
[Figure: an email decomposed into meta, body (text), and attachments (text, pdf, docx, images, …)]
[Example: objects in a pdf file]
  Page object: <</Type/Page/ …>>
  Image object: <</Type/.. Image/.. Filter/.. Length>><stream>Encoded image<endstream>
  Text object: <</Filter/ .. /Length >><stream>Encoded text<endstream>
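To make decomposition concrete: docx and pptx files are ZIP containers, so their members can be listed and hashed with the Python standard library. The sketch below builds two tiny in-memory archives that stand in for two documents sharing one image. The member names and payloads are invented for the illustration, SHA-1 is an assumed index function, and this is not the structure library the slides refer to.

```python
import hashlib
import io
import zipfile


def make_archive(members: dict) -> bytes:
    """Build a small ZIP container in memory (stand-in for a docx file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def decompose(archive: bytes) -> dict:
    """Map each member (object) of the container to an index of its content."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return {name: hashlib.sha1(zf.read(name)).hexdigest()
                for name in zf.namelist()}


image = b"\x89PNG illustrative logo bytes"
doc_a = make_archive({"word/document.xml": b"<w:t>draft A</w:t>",
                      "word/media/image1.png": image})
doc_b = make_archive({"word/document.xml": b"<w:t>draft B</w:t>",
                      "word/media/image1.png": image})

objs_a = decompose(doc_a)
objs_b = decompose(doc_b)
```

The two archives differ as whole files, so file-level deduplication finds nothing, yet the shared image yields the same object index in both. Object boundaries come from the file structure, so inserted text does not shift them.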
Observations
A large number of structured files exist in cloud-based storage systems
[Figure: dataset]
Our Approach (SAFE)
Apply file-level deduplication for redundant files
  Speeds up processing and keeps index sizes small
Apply object-based deduplication for structured files
  Decompose a file into objects
  Find redundancies based on decomposed objects
  Combine small-sized metadata into an object (to reduce index sizes)
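The last point, combining small metadata into one object, can be sketched as follows. The 64-byte threshold and the function name are assumptions made for this example; the slides do not say how "small" is defined or how the combined object is laid out.

```python
def combine_small_objects(objects, threshold: int = 64):
    """Merge all objects shorter than `threshold` bytes into one object.

    The threshold is an assumed value. A real implementation would also
    record the original lengths so the small objects can be split again.
    """
    large = [o for o in objects if len(o) >= threshold]
    small = [o for o in objects if len(o) < threshold]
    if small:
        large.append(b"".join(small))  # one index entry instead of many
    return large


objects = [b"x" * 100, b"a" * 10, b"b" * 20, b"y" * 200]
combined = combine_small_objects(objects)
```

Here four objects become three, so the index needs one entry fewer. The trade-off is that a change in any small piece changes the combined object as a whole.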
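SAFE Architecture
[Figure: architecture diagram]
  Inputs: Files (File parser), Emails (Email parser)
  File-level dedup: a redundant file ends processing; a unique file is checked ("Structured?")
  Unstructured file: passed to the Store manager
  Structured file: decomposed into objects (e.g. img, pdf, meta) using the Structure Library
  Object manager: object-level dedup, from all object indexes to unique object indexes
  Store manager: stores (index, object) pairs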
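The flow in the architecture figure can be approximated by the sketch below. It is a simplified reading of the diagram, not the authors' code: the class `SafeSketch`, the toy "|"-separated format, and SHA-1 indexes are all assumptions, and the decomposer table merely stands in for the Structure Library.

```python
import hashlib


def sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class SafeSketch:
    """Hypothetical two-stage pipeline: file-level, then object-level dedup."""

    def __init__(self, decomposers):
        self.decomposers = decomposers  # extension -> function(bytes) -> list of bytes
        self.file_indexes = set()       # indexes of files seen so far
        self.object_store = {}          # object index -> object bytes

    def add(self, name: str, data: bytes):
        """Return (classification, number of newly stored bytes)."""
        fid = sha1(data)
        if fid in self.file_indexes:
            return "redundant file", 0  # file-level dedup ends here
        self.file_indexes.add(fid)
        decompose = self.decomposers.get(name.rsplit(".", 1)[-1].lower())
        objects = decompose(data) if decompose else [data]
        stored = 0
        for obj in objects:
            oid = sha1(obj)
            if oid not in self.object_store:  # object-level dedup
                self.object_store[oid] = obj
                stored += len(obj)
        return ("structured file" if decompose else "unstructured file"), stored


# Toy structured format: objects separated by "|" (purely illustrative).
safe = SafeSketch({"toy": lambda d: d.split(b"|")})
r1 = safe.add("a.toy", b"meta-a|IMAGE|text one")
r2 = safe.add("copy.toy", b"meta-a|IMAGE|text one")
r3 = safe.add("b.toy", b"meta-b|IMAGE|text two")
r4 = safe.add("notes.txt", b"plain")
```

The exact copy is caught cheaply at file level, while the third file is only charged for the objects that differ; the shared IMAGE object is stored once.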
SAFE in Cloud Storage
[Figure: SAFE runs on the client (file-level dedup, then object-level dedup)]
  Client to server: indexes (objects)
  Server to client: indexes (unique objects)
  Client to server: unique objects
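The exchange in this figure resembles a common client-side deduplication handshake: send indexes first, then upload only what the server lacks. The sketch below models both ends in one process. The `Server` class and its method names are hypothetical, there is no real networking, and SHA-1 indexes are an assumption.

```python
import hashlib


class Server:
    """Hypothetical storage server holding objects keyed by index."""

    def __init__(self):
        self.objects = {}

    def filter_unique(self, indexes):
        """Return the indexes the server does not hold yet."""
        return [i for i in indexes if i not in self.objects]

    def upload(self, items: dict):
        self.objects.update(items)


def client_upload(server: Server, objects) -> int:
    """Upload objects, returning the payload bytes actually transferred."""
    indexed = {hashlib.sha1(o).hexdigest(): o for o in objects}
    wanted = server.filter_unique(list(indexed))  # step 1: send indexes only
    payload = {i: indexed[i] for i in wanted}     # step 2: keep unique objects
    server.upload(payload)                        # step 3: send unique objects
    return sum(len(o) for o in payload.values())


server = Server()
sent_first = client_upload(server, [b"logo", b"text-1"])
sent_second = client_upload(server, [b"logo", b"text-2"])
```

On the second upload only the new text object crosses the network, which is where the traffic savings of client-based deduplication come from. The indexes themselves also cost some traffic, which this sketch does not count.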
Setup
Collected real data sets
  Structured files (docx, pptx, and pdf)
  From the file systems and emails of five graduate students in the same department
  File system: 4 GB, emails: 2.5 GB
Compared deduplications
  File-level (like JustCloud, Mozy)
  Fixed block (4 MB, like Dropbox)
  Variable block (8 KB average chunk size)
Evaluation Metrics
Performance
  Deduplication ratio: space savings by removing redundancies
    ((InputData – ConsumedStorage) / InputData) * 100
  Network traffic: size of data transferred to storage over the network (bytes)
Overhead
  Processing time: processing time relative to File-level
  Index size: index size relative to File-level
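The metrics above translate directly into code. The function names are chosen for this sketch, and the numbers in the usage lines are made up to show the arithmetic; they are not measurements from the evaluation.

```python
def dedup_ratio(input_bytes: float, consumed_bytes: float) -> float:
    """Space savings in percent: ((InputData - ConsumedStorage) / InputData) * 100."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return ((input_bytes - consumed_bytes) / input_bytes) * 100


def relative_to_file_level(value: float, file_level_value: float) -> float:
    """Overhead (processing time or index size) relative to File-level."""
    return value / file_level_value


# Made-up example: 1000 bytes of input, 600 bytes left after deduplication.
ratio = dedup_ratio(1000, 600)

# Made-up example: a scheme taking 12 s where File-level takes 4 s.
relative_time = relative_to_file_level(12.0, 4.0)
```

A ratio of 0 means no redundancy was removed, and a relative overhead of 1 means the scheme costs the same as file-level deduplication.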
Deduplication Ratio
[Figure: deduplication ratio for file system datasets and email datasets; annotations: x2, x1.5]
Deduplication ratio
  is 2 times higher in SAFE than in File-level
  is even higher in SAFE than in Block-V for file system datasets
  is about 30% to 60% in SAFE
  is as good in SAFE as in variable-size block deduplication (Block-V) for email datasets
Network Traffic
[Figure: network traffic for file system datasets and email datasets; annotations: 15%, 30%]
Network traffic
  is the lowest in SAFE for both datasets
  is 15% and 30% lower in SAFE than in file-level deduplication (File) and fixed-size block deduplication (Block-F) for both datasets
Processing Time
[Figure: processing time for file system datasets and email datasets; annotation: hundreds of times]
Processing time
  is as fast in SAFE as in File-level
  is hundreds of times faster in SAFE than in Block-V
Index Size
[Figure: index size for file system datasets and email datasets]
Index size
  is proportional to the number of unique blocks (40 B per index)
    e.g. for 4000 emails, index sizes are 0.1 MB (File-level) and 1.3 MB (SAFE)
  is 2 to 3 times smaller in SAFE (1.3 MB) than in Block-V (3.7 MB)
    Block-V has an 8 KB block size on average
  is 2 times larger for file system datasets than for email datasets
    SAFE has multiple decomposed objects for a file
    e.g. the file system dataset has more pdf files (a pdf file can be decomposed into more objects than a docx file)
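The index-size figures can be sanity-checked with simple arithmetic. The sketch assumes 1 MB = 1,000,000 bytes (the slides do not say whether MB is decimal or binary) and uses the 40 B per index stated above; the entry counts it derives are implied estimates, not numbers reported in the slides.

```python
BYTES_PER_INDEX = 40   # from the slide: 40 B per index
MB = 1_000_000         # assumption: decimal megabytes


def implied_entries(index_size_mb: float) -> int:
    """Number of index entries implied by an index of the given size."""
    return round(index_size_mb * MB / BYTES_PER_INDEX)


file_level_entries = implied_entries(0.1)  # File-level, 4000 emails
safe_entries = implied_entries(1.3)        # SAFE
block_v_entries = implied_entries(3.7)     # Block-V

# How many times larger the Block-V index is than the SAFE index.
block_v_over_safe = 3.7 / 1.3
```

Under these assumptions the Block-V index is a little under three times the size of the SAFE index, consistent with the "2 to 3 times" statement, and the File-level index implies fewer entries than the 4000 emails, as expected when some emails are duplicates.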
Conclusions
Developed an efficient structure-aware client-based deduplication (SAFE)
  High deduplication ratio: as good as Block-V
  Less index overhead: 2 to 3 times less than Block-V
  Low processing time: hundreds of times faster than Block-V
  Low network traffic: as good as Block-V
Future work
  Extend to incorporate more structured file types
Thank you!
Questions?
{daehee.kim, sjsong, choiby}@umkc.edu