secure distributed deduplication systems


Post on 12-Apr-2017






  • 0018-9340 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2015.2401017, IEEE Transactions on Computers


    Secure Distributed Deduplication Systems with Improved Reliability

    Jin Li, Xiaofeng Chen, Xinyi Huang, Shaohua Tang and Yang Xiang, Senior Member, IEEE, and Mohammad Mehedi Hassan, Member, IEEE, and Abdulhameed Alelaiwi, Member, IEEE

    Abstract: Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. However, only one copy of each file is stored in the cloud, even if the file is owned by a huge number of users. As a result, a deduplication system improves storage utilization while reducing reliability. Furthermore, the challenge of privacy for sensitive data also arises when the data are outsourced by users to the cloud. Aiming to address the above security challenges, this paper makes the first attempt to formalize the notion of a distributed reliable deduplication system. We propose new distributed deduplication systems with higher reliability in which the data chunks are distributed across multiple cloud servers. The security requirements of data confidentiality and tag consistency are also achieved by introducing a deterministic secret sharing scheme in distributed storage systems, instead of using convergent encryption as in previous deduplication systems. Security analysis demonstrates that our deduplication systems are secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement the proposed systems and demonstrate that the incurred overhead is very limited in realistic environments.

    Keywords: Deduplication, distributed storage system, reliability, secret sharing


    Jin Li is with the School of Computer Science, Guangzhou University, China, e-mail:

    Xiaofeng Chen is with the State Key Laboratory of Integrated Service Networks (ISN), Xidian University, Xi'an, China,

    Xinyi Huang is with the School of Mathematics and Computer Science, Fujian Normal University, China, e-mail:

    Shaohua Tang is with the Department of Computer Science, South China University of Technology, China, e-mail:

    Yang Xiang is with the School of Information Technology, Deakin University, Australia, e-mail:

    M. M. Hassan and A. Alelaiwi are with the College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.

    1 INTRODUCTION

    With the explosive growth of digital data, deduplication techniques are widely employed to back up data and minimize network and storage overhead by detecting and eliminating redundancy among data. Instead of keeping multiple data copies with the same content, deduplication eliminates redundant data by keeping only one physical copy and referring other redundant data to that copy. Deduplication has received much attention from both academia and industry because it can greatly improve storage utilization and save storage space, especially for applications with a high deduplication ratio such as archival storage systems.

    A number of deduplication systems have been proposed based on various deduplication strategies, such as client-side or server-side deduplication and file-level or block-level deduplication. A brief review is given in Section 6. In particular, with the advent of cloud storage, data deduplication techniques have become more attractive and critical for managing the ever-increasing volumes of data in cloud storage services, which motivates enterprises and organizations to outsource data storage to third-party cloud providers, as evidenced by many real-life case studies [1]. According to an IDC analysis report, the volume of data in the world is expected to reach 40 trillion gigabytes in 2020 [2]. Today's commercial cloud storage services, such as Dropbox, Google Drive and Mozy, have been applying client-side deduplication to save network bandwidth and storage cost.

    There are two types of deduplication in terms of granularity: (i) file-level deduplication, which discovers redundancies between different files and removes them to reduce capacity demands, and (ii) block-level deduplication, which discovers and removes redundancies between data blocks. A file can be divided into smaller fixed-size or variable-size blocks. Using fixed-size blocks simplifies the computation of block boundaries, while using variable-size blocks (e.g., based on Rabin fingerprinting [3]) provides better deduplication efficiency.
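The difference between fixed-size and variable-size blocks can be illustrated with a small sketch. It is not taken from the paper: the function names, window length and divisor are assumptions chosen for illustration, and `window_hash` is a toy polynomial hash standing in for a real Rabin fingerprint. The sketch cuts a variable-size block wherever the hash of the last few bytes hits a fixed pattern, so block boundaries depend on content rather than on byte offsets.

```python
import hashlib
import random
from typing import List

def fixed_size_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Cut every `size` bytes; boundaries depend only on byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def window_hash(window: bytes) -> int:
    """Toy polynomial hash of a short window (stand-in for a Rabin fingerprint)."""
    h = 0
    for b in window:
        h = (h * 257 + b) % 1_000_003
    return h

def content_defined_chunks(data: bytes, window: int = 16, divisor: int = 64) -> List[bytes]:
    """Cut after byte i whenever the hash of the last `window` bytes is 0 mod `divisor`.

    The cut rule looks only at local content, so boundaries move with the data
    when bytes are inserted or removed elsewhere in the file.
    """
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        if window_hash(data[i - window + 1:i + 1]) % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks

def shared_chunks(old: List[bytes], new: List[bytes]) -> int:
    """Count chunks of `old` whose SHA-256 fingerprint already occurs in `new`."""
    seen = {hashlib.sha256(c).digest() for c in new}
    return sum(hashlib.sha256(c).digest() in seen for c in old)

if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.randrange(256) for _ in range(4096))
    shifted = b"!" + original  # one byte inserted at the front
    for name, chunker in [("fixed-size", fixed_size_chunks),
                          ("content-defined", content_defined_chunks)]:
        old, new = chunker(original), chunker(shifted)
        print(name, "chunks reusable after the insertion:",
              shared_chunks(old, new), "of", len(old))
```

With this cut rule, a one-byte insertion at the front can change at most the first content-defined block, whereas every fixed-size boundary shifts by one byte. Production chunkers usually also enforce minimum and maximum block sizes, which this sketch omits.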

    Though deduplication can save storage space for cloud storage service providers, it reduces the reliability of the system. Data reliability is a critical issue in a deduplication storage system because only one copy of each file is stored on the server and shared by all its owners. If such a shared file/chunk were lost, a disproportionately large amount of data would become inaccessible, because all the files that share this file/chunk would be unavailable. If the value of a chunk is measured by the amount of file data that would be lost if that single chunk were lost, then the amount of user data lost when a chunk in the storage system is corrupted grows with the commonality of the chunk. Thus, how to guarantee high data reliability in a deduplication system is a critical problem. Most previous deduplication systems have only been considered in a single-server setting. However, many deduplication systems and cloud storage systems are intended by users and applications to provide higher reliability, especially archival storage systems, where data are critical and should be preserved over long time periods. This requires that deduplication storage systems provide reliability comparable to other highly available systems.

    Furthermore, the challenge of data privacy also arises as more and more sensitive data are outsourced by users to the cloud. Encryption mechanisms have usually been used to protect confidentiality before data are outsourced to the cloud. Most commercial storage service providers are reluctant to apply encryption to the data because it makes deduplication impossible. The reason is that traditional encryption mechanisms, including public key encryption and symmetric key encryption, require different users to encrypt their data with their own keys. As a result, identical data copies of different users lead to different ciphertexts. To solve the problems of confidentiality and deduplication, the notion of convergent encryption [4] has been proposed and widely adopted to enforce data confidentiality while realizing deduplication. However, these systems achieved confidentiality of outsourced data at the cost of decreased error resilience. Therefore, how to protect both confidentiality and reliability while achieving deduplication in a cloud storage system is still a challenge.
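The conflict between per-user keys and deduplication, and the way convergent encryption resolves it, can be sketched as follows. This is an illustrative toy under stated assumptions, not the construction of [4]: `keystream` and `xor_cipher` form a hypothetical SHA-256-based stream cipher used only to keep the example dependency-free (a real system would use a vetted cipher such as AES), and computing the duplicate-check tag as a hash of the ciphertext is likewise an assumption.

```python
import hashlib
import os

def keystream(key: bytes, length: int) -> bytes:
    """Expand `key` into `length` bytes with SHA-256 in counter mode (toy, not secure)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XOR with the keystream (the operation is its own inverse)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def convergent_encrypt(data: bytes):
    """Derive the key from the content itself, so equal plaintexts give equal ciphertexts."""
    key = hashlib.sha256(data).digest()
    ciphertext = xor_cipher(key, data)
    # Assumed tag format: a hash of the ciphertext that a server can compare.
    tag = hashlib.sha256(ciphertext).hexdigest()
    return key, ciphertext, tag

if __name__ == "__main__":
    document = b"quarterly report, identical for Alice and Bob"

    # Conventional encryption: each user picks an independent random key.
    alice_ct = xor_cipher(os.urandom(32), document)
    bob_ct = xor_cipher(os.urandom(32), document)
    print("per-user keys, ciphertexts equal:", alice_ct == bob_ct)

    # Convergent encryption: both users derive the same key from the content.
    _, _, tag_a = convergent_encrypt(document)
    key_b, ct_b, tag_b = convergent_encrypt(document)
    print("convergent, tags equal:", tag_a == tag_b)
    print("decrypts correctly:", xor_cipher(key_b, ct_b) == document)
```

Because the key is a function of the content, the server can detect duplicates without seeing plaintext, but a single stored ciphertext remains a single point of failure, which is the reliability gap discussed above.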

    1.1 Our Contributions

    In this paper, we show how to design secure deduplication systems with higher reliability in cloud computing. We introduce distributed cloud storage servers into deduplication systems to provide better fault tolerance. To further protect data confidentiality, the secret sharing technique is utilized, which is also compatible with distributed storage systems. In more detail, a file is first split and encoded into fragments by using the technique of secret sharing, instead of encryption mechanisms. These shares are distributed across multiple independent storage servers. Furthermore, to support deduplication, a short cryptographic hash value of the content is also computed and sent to each storage server as the fingerprint of the fragment stored at that server. Only the data owner who first uploads the data is required to compute and distribute such secret shares, while all subsequent users who own the same data copy do not need to compute and store these shares again. To recover data copies, users must access a minimum number of storage servers through authentication and obtain the secret shares to reconstruct the data. In other words, the secret shares of data are only accessible to the authorized users who own the corresponding data copy.

    Another distinguishing feature of our proposal is that data integrity, including tag consistency, can be achieved. The traditional deduplication methods cannot be directly extended and applied in distributed and multi-server systems. To explain further, if the same short value is stored at the different cloud storage servers to support a duplicate check by using a traditional deduplication method, it cannot resist a collusion attack launched by multiple servers. In other words, any of the servers can obtain shares of the data stored at the other servers with the same short value as proof of ownership. Furthermore, the tag consistency, which was first formalized by [5] to prevent the duplicate/ciphertext repl
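The share-and-fingerprint workflow described in Section 1.1 can be illustrated with a minimal sketch. The paper's actual construction is not shown in this excerpt, so everything below is an assumption made for illustration: the code uses a byte-wise Shamir-style (k, n) scheme over GF(257) whose polynomial coefficients are derived from a hash of the content rather than from fresh randomness, which makes the shares deterministic, and `server_tag` is a hypothetical way to give each server a different fingerprint instead of one shared short value.

```python
import hashlib
from typing import Dict, List

P = 257  # smallest prime above 255, so every byte value is a field element

def share(data: bytes, n: int, k: int) -> Dict[int, List[int]]:
    """Split `data` into n shares (for servers 1..n); any k of them reconstruct it."""
    content_key = hashlib.sha256(data).digest()
    shares: Dict[int, List[int]] = {x: [] for x in range(1, n + 1)}
    for pos, secret in enumerate(data):
        # Coefficients come from the content, so equal files yield equal shares.
        coeffs = [secret]
        for j in range(1, k):
            seed = hashlib.sha256(content_key + pos.to_bytes(8, "big") + bytes([j])).digest()
            coeffs.append(int.from_bytes(seed, "big") % P)
        for x in shares:
            y = 0
            for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
                y = (y * x + c) % P
            shares[x].append(y)
    return shares

def reconstruct(subset: Dict[int, List[int]]) -> bytes:
    """Recover the data by Lagrange interpolation at 0 from at least k shares."""
    xs = list(subset)
    out = bytearray()
    for pos in range(len(subset[xs[0]])):
        secret = 0
        for xi in xs:
            num, den = 1, 1
            for xj in xs:
                if xj != xi:
                    num = (num * -xj) % P
                    den = (den * (xi - xj)) % P
            # pow(den, P - 2, P) is the modular inverse of den (Fermat's little theorem).
            secret = (secret + subset[xi][pos] * num * pow(den, P - 2, P)) % P
        out.append(secret)
    return bytes(out)

def server_tag(data: bytes, server_id: int) -> str:
    """Hypothetical per-server fingerprint: a distinct value for each server."""
    return hashlib.sha256(server_id.to_bytes(4, "big") + data).hexdigest()

if __name__ == "__main__":
    data = b"confidential file contents"
    n, k = 4, 3
    first_upload = share(data, n, k)
    second_upload = share(data, n, k)  # another owner of the same file
    print("identical shares (deduplicable):", first_upload == second_upload)
    survivors = {x: first_upload[x] for x in (1, 2, 4)}  # server 3 is unavailable
    print("recovered without server 3:", reconstruct(survivors) == data)
    print("tags differ per server:", len({server_tag(data, x) for x in first_upload}) == n)
```

In this sketch the loss of any n - k servers is tolerated, and a later uploader of the same file would produce the same shares, so the servers need not store them twice. As with any deterministic scheme, confidentiality here depends on the content being hard to guess.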

