Understanding and Surpassing Dropbox: Efficient Incremental Synchronization in Cloud Storage Services
Shenglong Li 1 Quanlu Zhang 1 Zhi Yang 1 Yafei Dai 1
1Peking University
(lishenglong, zql, yangzhi, dyf)@net.pku.edu.cn
June 18, 2016
Presented by: Fajar Purnama (HICC LAB) Kumamoto University, GLOBECOM2015 June 18, 2016 1 / 29
Outline
1 Introduction
  Background
  Objective
  Contribution
2 Related Work
  Measurement of cloud storage services
  Similarity detection technique
  State of The Art
3 Understanding Incremental Sync Of Cloud Storage Services
  Rsync Algorithm
  Sync Mechanism on Dropbox
  Detail Measurement and Analysis
4 System Design and Implementation
  System Architecture
  Delta Sharing
  Chunk-Based Rsync with Similarity Detection
  Efficient conflict resolution
5 Evaluation
  Modification Benchmark
  File Conflict
  Comparison with other cloud services
  Evaluation of Additional Overhead
6 Conclusion
Introduction Background
Cloud Storage Services
With increasing user demand for high data reliability and convenient data access, cloud storage services have become extremely prevalent and reached phenomenal levels of success. They are widely used in file sharing scenarios.
Seafile
Introduction Objective
Understanding and Surpassing Dropbox
Data synchronization is the heart of cloud storage services, and incremental data synchronization is applied to minimize network traffic.
Whether "modified data = uploaded data" holds for the active client.
Whether "downloaded data (passive client) = uploaded data (active client)" holds.
Whether both active and passive clients remain efficient during a file conflict.
Create an improved prototype based on the findings.
Introduction Contribution
3
Measurement on Dropbox
Conduct intensive measurements on Dropbox in file sharing scenarios.
Mechanism on Dropbox
Unravel the sync mechanisms employed in Dropbox on both active and passive clients.
Minbox
Design several novel mechanisms that resolve the traffic problems, and apply them in an efficient incremental synchronization system.
Related Work Measurement of cloud storage services
Measurement of cloud storage services
Drago first uncovers the Dropbox system architecture and data sync mechanism through an ISP-level large-scale measurement.
Li reveals the traffic overuse problem in Dropbox when a user frequently modifies files in the synced folder, and proposes an efficient batched sync mechanism to avoid massive metadata interaction.
Li focuses on quantifying and understanding traffic usage effectiveness through measurements of several popular cloud storage services on different devices.
Related Work Similarity detection technique
Similarity detection technique
Xia proposes a new similarity detection
algorithm to better exploit similarity with low
RAM overhead and high throughput.
Google deploys SimHash to improve space
efficiency and query quality for web crawling.
Mark Manasse implements MinHash using a
shingle sampling technique to extract features.
Related Work State of The Art
1
While these previous works cover the data sync
mechanism as one of the key operations, none of
them tries to fully understand the mechanism of
incremental sync technique in file sharing scenario,
and measure the network traffic with different
write behaviours. Moreover, we reveal network
traffic waste problems that were not explored before
and design several sync mechanisms to solve them.
2
Our system design and implementation are
different from these works. Specifically, we design
an efficient chunk-based delta encoding
mechanism embedding similarity detection
technique, which combines locality-sensitive hash
and content defined chunking technique to
optimize the computation overhead while
guaranteeing precision. Moreover, this mechanism
can integrate with other deduplication techniques
seamlessly.
Understanding Incremental Sync Of Cloud Storage Services Rsync Algorithm
An Incremental Data Sync Algorithm
The whole point of rsync is that, when a file is modified on a remote host, only the modified part is sent to the client instead of the whole file.
When a file is modified, the client retrieves the signature of the old file, which consists of strong checksums (e.g., BLAKE2, MD5) and weak checksums (e.g., Adler-32, a type of rolling checksum).
The client first computes weak checksums of the blocks in the changed file.
If a checksum matches one of the retrieved checksums, the client calculates the block's strong checksum to verify that the two blocks are indeed the same.
If it does not match, the client rolls one byte forward and calculates the weak checksum again to find matching blocks, which makes it possible to detect content that has been shifted (skewed).
Finally, all the differing parts, called the delta, can be found and sent back to the server. The changed file is generated on the server by merging the delta with the original file, which is called patching.
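The steps above can be sketched in a few lines of Python. This is a minimal illustration and not librsync or Dropbox code: the 4-byte block size, the simplified Adler-style weak checksum, and the names `signature`, `delta`, and `patch` are assumptions made for the example. For brevity the weak checksum is recomputed at every offset, whereas a real implementation would update it incrementally as the window rolls.

```python
import hashlib

BLOCK = 4  # tiny block size for the demo (assumption); real tools use far larger blocks


def weak(block: bytes) -> int:
    """Simplified Adler-32-style weak checksum (plain sum and position-weighted sum)."""
    a = sum(block) % 65536
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % 65536
    return (b << 16) | a


def strong(block: bytes) -> bytes:
    """Strong checksum used to confirm a weak-checksum match."""
    return hashlib.blake2b(block, digest_size=32).digest()


def signature(old: bytes) -> dict:
    """Signature of the old file: weak checksum -> [(strong checksum, block index)]."""
    sig = {}
    for idx in range(len(old) // BLOCK):
        block = old[idx * BLOCK:(idx + 1) * BLOCK]
        sig.setdefault(weak(block), []).append((strong(block), idx))
    return sig


def delta(sig: dict, new: bytes) -> list:
    """Scan the new file; emit ("copy", block index) or ("data", literal bytes)."""
    ops, literal, i = [], bytearray(), 0
    while i + BLOCK <= len(new):
        block = new[i:i + BLOCK]
        match = None
        for s, idx in sig.get(weak(block), []):
            if s == strong(block):  # verify with the strong checksum
                match = idx
                break
        if match is None:
            literal.append(new[i])  # no match: roll one byte forward
            i += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            i += BLOCK
    literal.extend(new[i:])
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old: bytes, ops: list) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, value in ops:
        out += old[value * BLOCK:(value + 1) * BLOCK] if kind == "copy" else value
    return bytes(out)


old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick brown cat jumps over the lazy dog!!"
ops = delta(signature(old), new)
rebuilt = patch(old, ops)
```

Here `signature` runs on the side that holds the old file, `delta` on the side that holds the new file, and `patch` again on the side that holds the old file, so only the signature and the delta cross the network.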
Illustration of Rsync
[Figure: rsync between server and client. Old file -> signature; signature + new file -> delta (patch); old file + delta (patch) -> new file.]
Understanding Incremental Sync Of Cloud Storage Services Sync Mechanism on Dropbox
Dropbox Index Server and Amazon Data Server
[Figure: Client, Index Server, and Data Server.]
1. Request file location
2. Sends file location
3. Sync file using rsync on certain file location
Understanding Incremental Sync Of Cloud Storage Services Detail Measurement and Analysis
Active and Passive Client
[Figure: the active client adds or modifies 1B in a 10MB file (10MB + 1B) and syncs with the Dropbox servers, which sync with the passive clients. On each link, is the transferred data 10MB or 1B?]
Experiment 1: Replacement at Different Positions
1. Divide both files into 4MB chunks. For example:
4MB 4MB 2MB
4MB 4MB 2MB
2. Check whether each pair of chunks is identical; if not, execute rsync on that pair.
4MB 4MB 2MB
Same Same rsync
4MB 4MB 2MB
For Figure 1, rsync is executed on a single pair of chunks, depending on where the replacement falls (the first, the second, or the last chunk):
4MB 4MB 2MB
4MB 4MB 2MB
Uplink of A and downlink of B (both based on the delta) should be the same size:
Downlink for client A stays the same because the active client already stores the signature data.
Passive clients have to send the signature to the data server, which is why they also show uplink traffic.
The uplink is smaller when the end of the file is modified because the last chunk is only 2MB.
Rsync uses 4KB blocks. Librsync uses a 256-bit strong checksum and a 32-bit weak checksum, so the signature of a 4MB chunk is ((256b+32b)/8)*(4MB/4KB) = 36KB.
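The signature arithmetic above can be checked with a short calculation. This is only a sketch of the formula on the slide; the function name `signature_bytes` and its parameters are introduced here for illustration.

```python
KB = 1024
MB = 1024 * KB


def signature_bytes(chunk_bytes, block_bytes=4 * KB, strong_bits=256, weak_bits=32):
    """Signature size of one chunk: one strong and one weak checksum per block."""
    blocks = -(-chunk_bytes // block_bytes)  # ceiling division
    return blocks * (strong_bits + weak_bits) // 8


print(signature_bytes(4 * MB) / KB)  # 36.0
print(signature_bytes(2 * MB) / KB)  # 18.0
```

The 2MB result explains why the uplink is smaller when only the last chunk of the 10MB file needs rsync.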
Experiment 1: Insertion at Different Positions
4MB 4MB 2MB
4MB 4MB 2MB
Rsync on every block. Signature sent = 36KB + 36KB + 18KB = 90KB
4MB 4MB 2MB
4MB 4MB 2MB
Rsync on block 2 and 3. Signature sent = 36 KB + 18KB = 54KB
4MB 4MB 2MB
4MB 4MB 2MB
Rsync on last block only. Signature sent = 18KB
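The three cases above follow from fixed 4MB chunking: a small insertion shifts every byte after it, so the chunk containing the insertion point and all later chunks stop being identical. The sketch below reproduces the signature totals under that assumption; the helper names are hypothetical, and the change is assumed to be small enough not to cross a chunk boundary by itself.

```python
KB = 1024
MB = 1024 * KB
CHUNK = 4 * MB        # fixed chunk size from the slides
BLOCK = 4 * KB        # rsync block size from the slides
SIG_PER_BLOCK = 36    # bytes per block: (256 + 32) / 8


def chunk_sizes(file_size):
    """Sizes of the fixed-length chunks of a file, e.g. 10MB -> 4MB, 4MB, 2MB."""
    sizes = []
    while file_size > 0:
        sizes.append(min(CHUNK, file_size))
        file_size -= sizes[-1]
    return sizes


def chunks_to_rsync(file_size, offset, insertion):
    """Indices of old chunks that no longer match after a small change at `offset`."""
    first = offset // CHUNK
    if insertion:
        # Everything after the insertion point is shifted, so later chunks differ too.
        return list(range(first, len(chunk_sizes(file_size))))
    return [first]  # a replacement only touches the chunk it falls in


def signature_kb(file_size, offset, insertion):
    """Total signature size (in KB) for the chunks that need rsync."""
    sizes = chunk_sizes(file_size)
    total = sum(-(-sizes[i] // BLOCK) * SIG_PER_BLOCK
                for i in chunks_to_rsync(file_size, offset, insertion))
    return total / KB


for label, offset in (("beginning", 0), ("middle", 5 * MB), ("end", 9 * MB)):
    print(label, signature_kb(10 * MB, offset, insertion=True), "KB")
```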
Experiment 2: Modification with different amounts of data
Replace or insert different amounts of data, ranging from 4KB to 4MB, in the middle of a 4MB file and an 8MB file.
When the replaced content is larger than 100KB, the amount of data transferred is less than the amount modified, due to data compression in Dropbox.
Data insertion can waste traffic for larger amounts of data because the insertion skews the content across the fixed-length chunks, even though rsync by itself would normally handle such shifts.
Experiment 3: File conflict
In Figure 3, A and B modify the file at the same time and both sync to the server.
B's update reaches the server first, and A syncs it from the server.
But when A's modified data reaches the server and is synced, B treats it as a new file and re-downloads the whole file.
Figure 4 is more complicated, but the case is similar to Figure 3 with 3 file conflicts.
System Design and Implementation System Architecture
System Architecture
System Design and Implementation Delta Sharing
Delta Sharing
Normally, every passive client executes rsync again to sync an update promptly, which is wasteful.
Since passive clients tend to stay online, the delta generated by the active client can be reused.
In other words, passive clients do not have to execute rsync; they retrieve the delta from the delta server.
Passive clients do not have to maintain the online state themselves since it can be marked through the index server.
If a passive client has been offline for long, the previous mechanism (executing rsync) is used.
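The decision a passive client makes can be sketched as follows. This is an illustration under stated assumptions, not Minbox's protocol: the `DeltaServer` class, the version numbering, and the rule "fall back to rsync when any required delta is missing" are hypothetical stand-ins for "offline for long".

```python
from dataclasses import dataclass, field


@dataclass
class DeltaServer:
    """Hypothetical store of deltas uploaded by active clients."""
    deltas: dict = field(default_factory=dict)  # (path, base_version) -> delta bytes

    def put(self, path, base_version, delta):
        self.deltas[(path, base_version)] = delta

    def get(self, path, base_version):
        return self.deltas.get((path, base_version))


def plan_passive_sync(server, path, local_version, latest_version):
    """Return the steps a passive client would take to reach latest_version."""
    steps, version = [], local_version
    while version < latest_version:
        delta = server.get(path, version)
        if delta is None:
            # A needed delta is unavailable (e.g. the client was offline too long):
            # fall back to the original mechanism and run rsync itself.
            return [("rsync", local_version, latest_version)]
        steps.append(("apply_delta", version, delta))  # reuse the active client's delta
        version += 1
    return steps


server = DeltaServer()
server.put("report.txt", 1, b"delta-1-to-2")
print(plan_passive_sync(server, "report.txt", 1, 2))
print(plan_passive_sync(server, "report.txt", 0, 2))
```

In the first call the passive client only downloads the shared delta; in the second it has no usable delta chain and executes rsync.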
System Design and Implementation Chunk-Based Rsync with Similarity Detection
Similarity Detection Mechanism
Algorithm
Algorithm Summary
Use a locality-sensitive hash to detect similar chunks.
To reduce computation overhead while guaranteeing detection precision, the ImpMinHash algorithm is employed.
The non-deduplicated chunks are sliced into sub-blocks using Rabin fingerprints.
Then the smallest cyclic redundancy check (CRC) checksums of the sub-blocks are selected to identify the chunk. Finally, the Jaccard index is used to compute the similarity between chunks.
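The pipeline above can be approximated with the Python standard library. This is a generic MinHash-style sketch and not the ImpMinHash algorithm itself, whose details are not given on the slide: `zlib.adler32` over a sliding window stands in for the Rabin fingerprint, and `WINDOW`, `DIVISOR`, and `K` are assumed values.

```python
import zlib

WINDOW = 16    # bytes hashed to decide a sub-block boundary (assumption)
DIVISOR = 64   # roughly one boundary every 64 bytes (assumption)
K = 8          # number of smallest CRC checksums kept per chunk (assumption)


def sub_blocks(chunk: bytes) -> list:
    """Content-defined slicing: cut after byte i when the window hash hits a pattern."""
    blocks, start = [], 0
    for i in range(len(chunk)):
        window = chunk[max(0, i - WINDOW + 1):i + 1]
        if zlib.adler32(window) % DIVISOR == DIVISOR - 1:
            blocks.append(chunk[start:i + 1])
            start = i + 1
    if start < len(chunk):
        blocks.append(chunk[start:])  # trailing sub-block without a boundary
    return blocks


def sketch(chunk: bytes) -> set:
    """Identify a chunk by the K smallest CRC32 checksums of its sub-blocks."""
    crcs = {zlib.crc32(block) for block in sub_blocks(chunk)}
    return set(sorted(crcs)[:K])


def jaccard(a: set, b: set) -> float:
    """Jaccard index |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(chunk_a: bytes, chunk_b: bytes) -> float:
    """Estimated similarity of two chunks from their sketches."""
    return jaccard(sketch(chunk_a), sketch(chunk_b))


text = b"incremental synchronization in cloud storage services " * 40
print(similarity(text, text))
print(similarity(text, text[:900] + b"a small edit" + text[900:]))
```

Because the sub-block boundaries depend on content rather than position, an insertion only disturbs the sub-blocks near it, so most checksums in the sketch tend to survive the edit.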
System Design and Implementation Efficient conflict resolution
Efficient conflict resolution
Evaluation Modification Benchmark
Modification Benchmark
Replaying experiment 1 shows that, unlike Dropbox, Minbox's passive client has no uplink traffic.
Replaying experiment 2 gives the results in Figure 8 and Figure 9, where Minbox, which implements the similarity detection algorithm, outperforms Dropbox.
MinboxFD (native) uses fixed-length 4MB chunking, while MinboxVD uses content-defined chunking (CDC) with an average chunk size of 4MB.
In most cases, MinboxVD performs best by taking advantage of CDC to avoid the impact of content skewing.
However, for large modification workloads on the 8MB file, MinboxFD outperforms MinboxVD because MinboxVD slices out new chunks that are not similar to the original chunks.
After matching these chunks, MinboxVD may generate more redundant delta than MinboxFD.
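The effect of content skewing on the two chunking strategies can be seen in a scaled-down experiment. This is not Minbox code: the 64-byte chunk size, the `zlib.adler32` window hash used as the boundary test, and the function names are assumptions chosen so that the example runs quickly.

```python
import random
import zlib

FIXED = 64     # fixed chunk length, scaled down from 4MB (assumption)
WINDOW = 16    # bytes hashed to decide a CDC boundary (assumption)
DIVISOR = 64   # average CDC chunk length of roughly 64 bytes (assumption)


def fixed_chunks(data: bytes) -> list:
    """Fixed-length chunking: boundaries depend only on position."""
    return [data[i:i + FIXED] for i in range(0, len(data), FIXED)]


def cdc_chunks(data: bytes) -> list:
    """Content-defined chunking: boundaries depend only on the last WINDOW bytes."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        if zlib.adler32(data[i + 1 - WINDOW:i + 1]) % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def unseen(old_chunks: list, new_chunks: list) -> list:
    """Chunks of the new file that do not appear anywhere in the old file."""
    known = set(old_chunks)
    return [c for c in new_chunks if c not in known]


rng = random.Random(0)
old = bytes(rng.randrange(256) for _ in range(4096))
new = b"X" + old  # insert a single byte at the beginning

changed_fixed = unseen(fixed_chunks(old), fixed_chunks(new))
changed_cdc = unseen(cdc_chunks(old), cdc_chunks(new))
print(len(changed_fixed), "fixed-length chunks changed")
print(len(changed_cdc), "content-defined chunks changed")
```

With fixed-length chunking the one-byte insertion shifts every later boundary, so every chunk looks new; with CDC the boundaries move with the content and only the chunks next to the insertion change.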
Evaluation File Conflict
File Conflict
Dropbox downloads the whole file while Minbox only needs to download the delta.
Minbox therefore achieves high network efficiency.
Evaluation Comparison with other cloud services
Comparison with other cloud services
Figure 12 compares Minbox with Seafile, which also uses CDC.
Seafile has to send the whole modified chunk, and the client downloads the whole file in each case, while Minbox only deals with the rsync part.
Compared with others such as Google Drive and OneDrive, Minbox takes advantage of the incremental sync mechanism.
For file conflicts, the others download the whole file, while Minbox uses rsync.
Evaluation Evaluation of Additional Overhead
Evaluation of Additional Overhead
Finally, it is necessary to discuss the overhead of
ImpMinHash in Minbox. We generate the
ImpMinHash of a 4MB file and record the
signature size and computation time. The result is
that ImpMinHash has the same size as MinHash,
which takes few bytes compared with the Rsync
signature. For the computation time of the signature,
ImpMinHash consumes two additional CPU ticks
in comparison to Rsync.
Conclusion
Efficient Incremental Synchronization in Cloud Storage Services
Understanding Dropbox
This paper conducts comprehensive measurements on Dropbox in file sharing scenarios and unravels the incremental sync mechanism inside Dropbox.
Surpassing Dropbox
It also reveals the significant network traffic waste in Dropbox, and designs and implements an efficient incremental sync system to solve these problems.
In the evaluation, Minbox significantly reduces network traffic during sync and solves the file conflict problem with little overhead.