Post on 20-Mar-2018
A Secure Data Deduplication Scheme for Cloud Storage
Jan Stanek, Alessandro Sorniotti, Elli Androulaki, and Lukas Kencl
Abstract. As more corporate and private users outsource their data to cloud storage providers, recent data breach incidents make end-to-end encryption an increasingly prominent requirement. Unfortunately, semantically secure encryption schemes render various cost-effective storage optimization techniques, such as data deduplication, ineffective. We present a novel idea that differentiates data according to their popularity. Based on this idea, we design an encryption scheme that guarantees semantic security for unpopular data and provides weaker security and better storage and bandwidth benefits for popular data. This way, data deduplication can be effective for popular data, whilst semantically secure encryption protects unpopular content. We show that our scheme is secure under the Symmetric External Decisional Diffie-Hellman Assumption in the random oracle model.
1 Introduction
With the rapidly increasing amounts of data produced worldwide, networked and multi-user storage systems are becoming very popular. However, concerns over data security still prevent many users from migrating data to remote storage. The conventional solution is to encrypt the data before it leaves the owner's premises. While sound from a security perspective, this approach prevents the storage provider from effectively applying storage efficiency functions, such as compression and deduplication, which would allow optimal usage of the resources and consequently lower service cost. Client-side data deduplication in particular ensures that multiple uploads of the same content only consume the network bandwidth and storage space of a single upload. Deduplication is actively used by a number of cloud backup providers (e.g. Bitcasa) as well as various cloud services (e.g. Dropbox). Unfortunately, encrypted data is pseudorandom and thus cannot be deduplicated: as a consequence, current schemes have to entirely sacrifice either security or storage efficiency.
In this paper, we present a scheme that permits a more fine-grained trade-off. The intuition is that outsourced data may require different levels of protection, depending on how popular it is: content shared by many users, such as a popular song or video, arguably requires less protection than a personal document, the copy of a payslip or the draft of an unsubmitted scientific paper.
Around this intuition we build the following contributions: (i) we present E, a novel threshold cryptosystem (which can be of independent interest), together with a security model and formal security proofs; (ii) we introduce a scheme that uses E as a building block and leverages popularity to achieve both security and storage efficiency; and (iii) we discuss its overall security.
Czech Technical University in Prague. (jan.stanek|lukas.kencl)@fel.cvut.cz.
IBM Research - Zurich, Rüschlikon, Switzerland. (aso|lli)@zurich.ibm.com.
2 Problem Statement
Storage efficiency functions such as compression and deduplication afford storage providers better utilization of their storage backends and the ability to serve more customers with the same infrastructure. Data deduplication is the process by which a storage provider only stores a single copy of a file owned by several of its users. There are four different deduplication strategies, depending on whether deduplication happens at the client side (i.e. before the upload) or at the server side, and whether deduplication happens at a block level or at a file level. Deduplication is most rewarding when it is triggered at the client side, as it also saves upload bandwidth. For these reasons, deduplication is a critical enabler for a number of popular and successful storage services (e.g. Dropbox, Memopal) that offer cheap, remote storage to the broad public by performing client-side deduplication, thus saving both network bandwidth and storage costs. Indeed, data deduplication is arguably one of the main reasons why the prices for cloud storage and cloud backup services have dropped so sharply.
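To make the client-side, file-level strategy concrete, the following is a minimal sketch rather than any provider's actual protocol. The names `DedupStore` and `client_upload` are hypothetical, and SHA-256 is assumed as the content fingerprint. The client sends the fingerprint first and transmits the file only if the server does not already hold it.

```python
import hashlib


class DedupStore:
    """Toy storage server that keeps one copy per distinct file content."""

    def __init__(self):
        self.blobs = {}   # fingerprint -> stored bytes (single copy)
        self.owners = {}  # fingerprint -> set of user ids owning the file

    def has(self, fingerprint):
        return fingerprint in self.blobs

    def put(self, user, fingerprint, data):
        # First upload of this content: store it and record the owner.
        self.blobs[fingerprint] = data
        self.owners.setdefault(fingerprint, set()).add(user)

    def link(self, user, fingerprint):
        # Content already stored: only record an additional owner.
        self.owners[fingerprint].add(user)


def client_upload(store, user, data):
    """Upload `data` for `user`; return the number of payload bytes sent."""
    fingerprint = hashlib.sha256(data).hexdigest()
    if store.has(fingerprint):
        store.link(user, fingerprint)
        return 0  # duplicate: no payload crosses the network
    store.put(user, fingerprint, data)
    return len(data)
```

In this sketch a second upload of the same file costs neither bandwidth nor storage. Because a fingerprint alone is enough to claim ownership, such a naive check is the kind of design that proof-of-ownership schemes are meant to harden.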
Unfortunately, deduplication loses its effectiveness in conjunction with end-to-end encryption. End-to-end encryption in a storage system is the process by which data is encrypted at its source prior to ingress into the storage system. It is becoming an increasingly prominent requirement due to both the number of security incidents linked to leakage of unencrypted data and the tightening of sector-specific laws and regulations. Clearly, if semantically secure encryption is used, file deduplication is impossible, as no one (apart from the owner of the decryption key) can decide whether two ciphertexts correspond to the same plaintext. Trivial solutions, such as forcing users to share encryption keys or using deterministic encryption, fall short of providing acceptable levels of security.
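The effect can be illustrated with a small sketch. The cipher below is a toy construction assumed purely for illustration (a SHA-256 based keystream with a fresh random nonce per encryption). It is not a vetted scheme and the function names are hypothetical. It only mimics the property that matters here: encrypting the same plaintext twice yields different ciphertexts, so the storage provider cannot match them.

```python
import hashlib
import os


def _keystream(key, nonce, length):
    """Toy keystream: SHA-256 over key, nonce and a block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def randomized_encrypt(key, plaintext):
    """Encrypt with a fresh random nonce; the nonce is prepended to the output."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def randomized_decrypt(key, blob):
    """Split off the 16-byte nonce and undo the XOR with the keystream."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))
```

Two uploads of the same file, even under the same key, produce ciphertexts that differ with overwhelming probability, which is why a server comparing ciphertexts finds nothing to deduplicate.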
As a consequence, storage systems are expected to undergo major restructuring to maintain the current disk/customer ratio in the presence of end-to-end encryption. The design of storage efficiency functions in general, and of deduplication functions in particular, that do not lose their effectiveness in the presence of end-to-end security is therefore still an open problem.
2.1 Related Work
Several deduplication schemes have been proposed by the research community, showing how deduplication allows very appealing reductions in the usage of storage resources [5, 6].
Most works do not consider security as a concern for deduplicating systems; recently, however, Harnik et al. have presented a number of attacks that can lead to data leakage in storage systems in which client-side deduplication is in place. To thwart such attacks, the concept of proof of ownership has been introduced [8, 9]. None of these works, however, can provide real end-user confidentiality in the presence of a malicious or honest-but-curious cloud provider.
Convergent encryption is a cryptographic primitive introduced by Douceur et al. [10, 11], attempting to combine data confidentiality with the possibility of data deduplication. Convergent encryption of a message consists of encrypting the plaintext using a deterministic (symmetric) encryption scheme with a key which is deterministically derived solely from the plaintext. Clearly, when two users independently attempt to encrypt the same file, they will generate the same ciphertext, which can be easily deduplicated. Unfortunately, convergent encryption does not provide semantic security, as it is vulnerable to content-guessing attacks. Later, Bellare et al. formalized convergent encryption under the name message-locked encryption. As expected, their security analysis highlights that message-locked encryption offers confidentiality for unpredictable messages only, clearly failing to achieve semantic security.
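The construction and its weakness can be sketched in a few lines. This is an illustrative toy, not the scheme of Douceur et al.: the key is assumed to be the SHA-256 hash of the plaintext, the cipher is a hash-based keystream standing in for a real deterministic symmetric cipher, and all function names are hypothetical.

```python
import hashlib


def _keystream(key, length):
    """Toy deterministic keystream: SHA-256 over the key and a block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(plaintext):
    """Derive the key from the plaintext itself, then encrypt deterministically."""
    key = hashlib.sha256(plaintext).digest()
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return key, ciphertext


def convergent_decrypt(key, ciphertext):
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


def guess_plaintext(ciphertext, candidates):
    """Content-guessing attack: encrypt each candidate and compare ciphertexts."""
    for candidate in candidates:
        if convergent_encrypt(candidate)[1] == ciphertext:
            return candidate
    return None
```

Two users who encrypt the same file obtain identical ciphertexts, which is what enables deduplication. The same determinism lets anyone holding a ciphertext and a short list of plausible plaintexts confirm which one was encrypted, which is why only unpredictable messages remain confidential.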
Xu et al. present a proof-of-ownership (PoW) scheme allowing client-side deduplication in a bounded leakage setting. They provide a security proof in the random oracle model for their solution, but do not address the problem of low min-entropy files.
Recently, Bellare et al. presented DupLESS, a server-aided encryption scheme for deduplicated storage. Similarly to ours, their solution uses a modified convergent encryption scheme with the aid of a secure component for key generation. While DupLESS offers the possibility to securely use server-side deduplication, our scheme targets secure client-side deduplication.
3 Overview of the Solution
Deduplication-based systems require solutions tailored to the type of data they are expected to handle. We focus our analysis on scenarios where the outsourced dataset contains few instances of some data items and many instances of others. Concrete examples of such datasets include (but are not limited to) those handled by Dropbox-like backup tools and hypervisors handling linked clones of VM images. Other scenarios, where such premises do not hold, require different solutions and are out of the scope of this paper.
The main intuition behind our scheme is that there are scenarios in which data requires different degrees of protection that depend on how popular a datum is. Let us start with an example: imagine that a storage system is used by multiple users to perform full backups of their hard drives. The files that undergo backup can be divided into those uploaded by many users and those uploaded by one or very few users only. Files falling in the former category will benefit strongly from deduplication because of their popularity and may not be particularly sensitive from a confidentiality standpoint. Files falling in the latter category may instead contain user-generated content which requires confidentiality, and would by definition not allow reclaiming a lot of space via deduplication. The same can be said about common blocks of shared VM images, mail attachments sent to several recipients, reused code snippets, etc.
This intuition can be implemented cryptographically using a multi-layered cryptosystem. All files are initially declared unpopular and are encrypted with two layers, as illustrated in Figure 1: the inner layer is applied using a convergent cryptosystem, whereas the outer layer is applied using a semantically secure threshold cryptosystem. Uploaders of an unpopular file attach a decryption share to the ciphertext. In this way, when sufficient distinct copies of an unpopular file have been uploaded, the threshold layer can be removed. This step has two consequences: (i) the security notion for the now popular file is downgraded from semantic to standard convergent, and (ii) the properties of the remaining convergent encryption layer allow deduplication to happen naturally. Security is thus traded for storage efficiency as for every file that transits from unpopular to p