

SECURE AND EFFICIENT CLOUD DATA DEDUPLICATION

P. SARANYA 1, PARTYUSH GUPTA 2, SHIVANSH JOSHI 3

1 Assistant Professor, 2,3 Student
SRM Institute of Science and Technology,

Chennai, India
[email protected]

June 23, 2018

Abstract

Data de-duplication removes replicated or duplicate copies of data to improve storage efficiency. To better ensure information security, this paper makes an attempt to formally address the issue of authorized data de-duplication. Unlike conventional de-duplication systems, the proposed scheme offers additional advantages: it is implemented with two novel de-duplication improvements for duplicated data in the hybrid cloud. Security analysis shows that the proposed model has a sound security structure, and its security mechanisms ensure no data loss in cloud data storage. Convergent encryption is applied in the authorization process for secure key management, and data segmentation is applied when uploading data to the hybrid cloud. To evaluate the data-comparison scheme built on our model, we conducted experiments on a test bed, which showed that the proposed scheme introduces insignificant overhead compared with other approaches.

Key Words: Data de-duplication, Hybrid-cloud, Simhashing, Security


International Journal of Pure and Applied Mathematics, Volume 120, No. 6, 2018, pp. 1795-1812. ISSN: 1314-3395 (on-line version). Special Issue. url: http://www.acadpubl.eu/hub/


1 INTRODUCTION

Cloud storage service management faces complexity in handling the expanding volume of information. Data de-duplication is a significant policy for streamlining cloud data processing and is therefore highly important; security policies must be taken into account alongside it. Data de-duplication clears surplus information by retaining a single unique copy of the data (removing duplicates) and pointing the other duplicate references to it. De-duplication must be designed with system security and data protection in mind, guarding against threats from both internal and external components. The conventional approach of encryption with private data access conflicts with de-duplication: traditional encryption requires different customers to encrypt their data with their own keys, so identical copies of the same document are very likely to yield different ciphertexts. Once a file's existence is established, subsequent clients holding the same file are given a pointer from the server without needing to transfer the same document again. A client can download the encrypted record with the pointer from the server, and it can be decoded only by the corresponding data owners with their convergent keys. In order to save cost and manage data productively, the data is moved to the storage cloud service provider (S-CSP) in the public cloud with specified privileges, and the de-duplication system is applied to store just a single copy of a given record.

2 RELATED WORK

Data de-duplication is used to improve storage utilization and can likewise be applied to network data transfers to reduce the number of bytes that must be sent. Rather than keeping numerous data copies with the same content, de-duplication eliminates the redundant copies. Under traditional encryption, however, identical data copies of different clients produce distinct ciphertexts, making de-duplication impossible. Convergent encryption has been proposed to enforce data confidentiality while making de-duplication achievable [3]. It encrypts/decrypts a data copy with a convergent key, which is obtained by computing the cryptographic hash of the content of the data copy. After key generation and data encryption, clients retain the keys and send the ciphertext to the cloud. Since the encryption operation is deterministic and derived from the data content, identical data copies produce the same convergent key and hence the same ciphertext. To prevent unauthorized access, a secure proof-of-ownership protocol is additionally needed to give confirmation that the client holds the data, so that the server can retain a single (unique) record and index the other duplicate records to it, overcoming the repetition.

FADE [5] is a secure overlay distributed storage framework that achieves fine-grained, policy-based access control and assured deletion of records. It associates outsourced files with file-access policies and assuredly erases files to make them unrecoverable when their access policies are revoked. To accomplish these security objectives, the implementation is built upon a set of cryptographic key operations self-maintained by a quorum of key managers independent of the untrusted cloud. Extensive experimental research shows that FADE provides security protection for outsourced data while introducing only negligible performance and monetary cost overhead. While we can now outsource data backup to third-party cloud storage services to reduce data administration costs, security concerns arise in guaranteeing the privacy and integrity of the outsourced data.

The random space perturbation (RASP) technique [5] combines order-preserving encryption, dimensionality expansion, and random noise projection to give strong resilience against attacks on the perturbed data and queries. With the wide deployment of public cloud computing frameworks, using clouds to host data query services has become an appealing solution for its advantages in scalability and cost saving. RASP also preserves multidimensional ranges, which enables existing indexing techniques to be applied to speed up range query processing. The kNN-R algorithm is designed to work with the RASP range-query algorithm to process kNN queries. An additional aim is to ensure that data safety and confidentiality, the capability of the query services, and the benefits of using clouds are all preserved.

Cyber-physical systems (CPSs), comprising a tight blend of computational and physical elements together with communication networks, have attracted intensive attention recently due to their wide applications in various areas. Two problems arise in coordinating collaborations in a CPS - with whom the collaboration should be done, and how to incorporate the collaborations - and these relate to computing-resource and network design. One line of work formulates an offline finite-queue-aware CPS service maximization problem to crowdsource nodes' computing tasks. Another exhibits the characteristics and usage of Dekey using the Ramp secret sharing scheme; in this pattern, each client has their own key for encrypting the convergent keys and lets others access them in the cloud. The issue is managing a large number of keys, as the count varies with the number of users.

Two-level encryption is applied to data blocks, which are twisted (obfuscated). Control keys are generated from a key derivation tree, which is encoded with an All-Or-Nothing transform, then distributed into a DHT [4] network, carried with secret key sharing. With this, only authorized users can access the control key and decrypt the uploaded data within the owner-specified time limit. The goal of this security management is low-cost deletion, resistance to hopping and sniffing attacks, and de-duplication under data dynamics.

3 RESEARCH AND METHODOLOGY

In this paper, the system framework is improved with security. Differential privilege keys drive efficient encryption of each document. Repetition of data cannot be identified without comparison against client data, and an unauthorized client is not allowed to perform duplicate checking on a message and cannot decode the message even by colluding with the CSP. Security analysis shows that the proposed framework is safe under the definitions specified in the proposed system. Convergent encryption is adopted to safeguard information secrecy while making de-duplication possible: it encrypts/decrypts a data copy with a convergent key, which is obtained by computing the cryptographic hash of the content of the data copy. After key generation and data encryption are executed, the client retains the key and sends the encoded data to the cloud. Since the encryption operation is deterministic and derived from the data content, identical data copies produce the same convergent key and hence the same ciphertext. To prevent unauthorized access, a secure proof-of-ownership protocol is also needed to provide assurance that the client indeed holds the same record when a copy is found. The features are the following:

• The client is allowed to perform the duplicate check for documents, subject to the privileges attained by the client.

• An advanced scheme is used to strengthen the baseline security of the encrypted data with differential privilege keys.

• It reduces the storage needed for the duplicate-check labels, improves the security of the cloud data, and ensures information confidentiality.
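The convergent-encryption property the scheme relies on - identical content yields an identical key, identical ciphertext, and hence an identical duplicate-check tag - can be sketched as follows. This is a toy illustration only: the keystream construction, function names, and tag format are our own assumptions, not the paper's implementation, and a real deployment would use a proper block cipher.

```python
import hashlib

def convergent_key(data: bytes) -> bytes:
    # The convergent key is the cryptographic hash of the content itself.
    return hashlib.sha256(data).digest()

def keystream_encrypt(data: bytes, key: bytes) -> bytes:
    # Toy deterministic keystream cipher (XOR with a hash-derived stream);
    # applying it twice with the same key decrypts.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

def tag(ciphertext: bytes) -> str:
    # Duplicate-check tag: identical plaintexts yield identical tags.
    return hashlib.sha256(ciphertext).hexdigest()

doc = b"quarterly report"
c1 = keystream_encrypt(doc, convergent_key(doc))
c2 = keystream_encrypt(doc, convergent_key(doc))
assert c1 == c2                                   # same content -> same ciphertext
assert tag(c1) == tag(c2)                         # server can deduplicate by tag
assert keystream_encrypt(c1, convergent_key(doc)) == doc  # owner decrypts
```

Because the key is derived from the content rather than chosen per user, two clients who hold the same file independently produce the same ciphertext, which is exactly what makes de-duplication compatible with encryption.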

System Structure


Figure 1 System Structure

The owner uploads the file; after encryption, the content of the file is checked for de-duplication against the existing files. If the content is the same, it is not updated in the database; if the content is different, the file is uploaded to the database and to the cloud. A new user can then view the files on the cloud and, to access a file, can request access from the file's owner. The request goes to the owner, who can accept or reject it. If the request is accepted, a key is generated and sent by mail to the user who requested the file. The generated key then decrypts the file, and the file is downloaded into the user's database.
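The upload path described above - hash the content, and skip the upload when the hash is already known - can be sketched with a small in-memory stand-in for the database. The DedupStore class and its return messages are hypothetical, for illustration only:

```python
import hashlib

class DedupStore:
    """Toy stand-in for the database/cloud store in Figure 1:
    a file is stored only if its content is not already present."""
    def __init__(self):
        self.blobs = {}  # content hash -> file bytes

    def upload(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blobs:
            return "duplicate: not re-uploaded"
        self.blobs[digest] = data
        return "uploaded"

store = DedupStore()
print(store.upload(b"report v1"))   # uploaded
print(store.upload(b"report v1"))   # duplicate: not re-uploaded
print(store.upload(b"report v2"))   # uploaded
```

Only one physical copy of each distinct content survives, while the metadata layer (not shown) would record which users hold pointers to it.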

Figure 2 Cloud Operation

In the User module, each user is authenticated with access keys as stated in the proposed ontology system. Every user has to register with their own credentials to access the cloud, entering details such as email address, username, password, display name, and the remaining fields given in the registration portal. The display-name attribute is used by the server to show the name of the user. A user can log in to the cloud, upload files, and access data in the cloud. To execute authorized de-duplication, a key is bound to the record and resolved by the privilege; in the traditional method, the tag serves as the token distinguishing the characteristics. De-duplication relies on identical content, while encryption aims to make all content appear unpredictable: the same content encrypted with two different keys results in different ciphertexts. Accordingly, combining the space efficiency of de-duplication with the secrecy aspects of encryption is problematic. After upload to cloud storage, the user can download a file based on a key or token. Once the key request is accepted, the user at the sender side can forward the key, or can reject the request. Along with the key, a request id is generated at the moment of sending the key request, and the receiver can then decode the message. There are many ways that machines can attempt to identify duplicate content.

• Bag of Words - comparison of the words, and their frequencies, on one page against those of another page.

• Shingling - this method improves on "Bag of Words" by comparing short phrases, which provides some context for the words.

• Hashing - hashing improves the process by removing the need to store full copies of all the content; the phrases are hashed into numbers, which can then be compared to identify duplication.

• MinHash - helps streamline the process of storing content hashes.

• SimHash - further streamlines the storing of content hashes and the detection of duplicates.

Bag of Words

Consider a set of documents; for each one, duplicates have to be identified. Every document is taken individually and compared with the others to find matching records, marking "match" and "not match" across the document set. Naively, we may start by regarding each document as a bag of words. So a document like "a bump on the log in the hole in the bottom of the sea" is changed to the set of words it contains (Figure 3): { a, in, of, on, the, log, sea, bump, hole, bottom }. To compare two separate documents, we count how many words are the same and how many are not across the two sets of words. This measure is commonly known as the Jaccard similarity coefficient; documents sharing most of their words are considered very similar.

Figure 3 Jaccard similarity coefficient
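The word-set comparison above can be sketched as a small Jaccard function; the helper name and whitespace tokenization are our own illustrative choices:

```python
def jaccard(a: str, b: str) -> float:
    # Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| over word sets.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

d1 = "a bump on the log in the hole in the bottom of the sea"
d2 = "a frog on the bump on the log in the hole in the bottom of the sea"
print(round(jaccard(d1, d2), 2))  # 0.91: ten shared words out of eleven total
```

A coefficient near 1 marks the pair as near-duplicates; a coefficient near 0 marks them as dissimilar.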

Next, if we consider a different document, "your mother drives you in the car", we can see that few words are shared relative to the total count of words, so it is said to be very dissimilar. This works well, but it still has issues in the overall result: since the approach only identifies shared words between the sets, it can also flag documents as similar that merely use the same words with a different meaning, such as "Your mother drives you in the car" and "In mother Russia, car drives you!" (Fig 4).


The documents taken for this comparison are totally dissimilar in meaning, and yet they use mostly the same words distributed over the two sentences. This showed that a better method is needed.

Shingling

The issue with the previous bag-of-words method is that it does not consider the context of the words: the words surrounding any given word play no part in matching the word sets, as the example discussed above shows. That context is very important to consider.

Figure 4 Comparing to a different document

In this way, rather than simply treating each document as a set of words, we transform each document into a set of overlapping phrases of a few words at a time. This process is called shingling, as each phrase relates to the nearby set of words. So for "Your mother drives you in the car," we would get a set like this: { 'your mother drives', 'mother drives you', 'drives you in', 'you in the', 'in the car' }. Using this method, we compare the number of phrases common to two documents against the overall count of phrases. Returning to our example, the two sentences share many words but, as sets of phrases, they are quite different, while genuinely similar documents are still found to be similar. Results showed a clear improvement over the bag-of-words approach.

Hashing

The problem with the "bag of phrases" method is that the document being processed is effectively stored many times over: if we use phrases of k words, then each word appears in up to k phrases, making the storage requirement O(nk). To reduce this, we turn each phrase into a representative number using a hash function, so that rather than a bag of phrases we have a bag of numbers on which we can perform the same comparison. The signature of every document then becomes the ordered list of minimum hashes m0 through mk-1; this approach achieves an estimation of the Jaccard similarity by comparing, pairwise, each of the documents' minimum hashes and taking the fraction that are identical.

Figure 5 Minimum hashing

Rather than intersecting bags of words, we now intersect these sets of numbers. To avoid any misunderstanding, we will refer to them as "phrase hashes".
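The shingling and phrase-hashing steps described above can be sketched as follows. The value k = 3 and the function names are illustrative choices, and Python's built-in hash is only stable within one run, which is enough for in-memory comparison:

```python
def shingles(text: str, k: int = 3):
    # Overlapping k-word phrases ("shingles") keep local word order.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def phrase_hashes(text: str, k: int = 3):
    # Hash each shingle into a number, so we compare sets of integers
    # instead of storing every phrase verbatim.
    return {hash(s) for s in shingles(text, k)}

print(sorted(shingles("your mother drives you in the car")))
# ['drives you in', 'in the car', 'mother drives you',
#  'you in the', 'your mother drives']

print(shingles("your mother drives you in the car")
      & shingles("in mother russia car drives you"))  # set(): no shared phrases
```

Note how the two "mother/car" sentences, which defeated bag-of-words, share no shingles at all.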


Figure 6 Hashing

Minimum Hashing

In standard form it is defined as follows: let h be a hash function that maps the members of P and S to distinct integers, and for any set Q define hmin(Q) to be the member z of Q with the minimum value of h(z). Then hmin(P) = hmin(S) exactly when the minimum hash value of the union P ∪ S lies in the intersection P ∩ S. Therefore, Pr[hmin(P) = hmin(S)] = J(P, S).
While phrase hashing reduces what is stored per document, the store still grows as O(n) in the document length, and comparison is costly: numerous pages result in a very large data store, and checking two documents of lengths m and n for overlapping phrases costs at least O(m + n). Minimum hashing (MinHash) is a technique capable of using constant storage independent of the document length while producing a good estimate of our similarity measure. It reduces each document to a fixed-size list of hashes as a coarse overview of the document. This is achieved by introducing a set of randomly seeded hash functions: for every hash function i, we scan through the whole document's phrase hashes and record the minimum hash (mi).
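A minimal MinHash sketch of the above, simulating the i-th hash function by seeding SHA-256 with i; the seeding scheme and the signature length of 8 are our own assumptions:

```python
import hashlib

def minhash_signature(phrases, num_hashes: int = 8):
    # For each simulated hash function i, keep only the minimum hash
    # seen over the document's phrases: a fixed-size O(1) signature.
    def h(i, p):
        return int.from_bytes(
            hashlib.sha256(f"{i}:{p}".encode()).digest()[:8], "big")
    return [min(h(i, p) for p in phrases) for i in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of agreeing minimums estimates J(P, S),
    # since Pr[hmin(P) = hmin(S)] = J(P, S).
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Every document is reduced to num_hashes integers regardless of its length, so storage per record is constant, and comparing two signatures is a constant-time scan.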


Figure 7 Jaccard similarity

From the method implemented above, we succeed in two ways: the storage needed for every record is now O(1), and the pairwise document comparison is likewise O(1), an improvement over O(n) and O(m + n) respectively. While this improves the complexity of comparing any two documents, we still face the problem of finding any (or all) copies of a query document within a collection of known documents: a straightforward approach needs a linear traversal of all the known records in O(n) time.

Simhashing

Simhashing is a method used to resolve this problem. Instead of the collection of phrase hashes generated for a document, this method generates a single hash with an interesting property: two similar sets of inputs generate similar resultant hashes. Most hash functions are chosen for the property that marginally different inputs have entirely different outputs, which makes simhash one of a kind.


Figure 8 Simhashing

To determine the similarity of two simhash values, this method counts the number of bits that differ between the two as the measure of dissimilarity; conversely, the number of bits that are the same is the measure of similarity. It can be computed briefly as the bit population count of a XOR b, popcount(a ^ b).
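A simhash over phrase hashes, plus the popcount-of-XOR distance just described, can be sketched as follows. The 64-bit width and the bit-voting construction follow the standard simhash recipe; the function names are ours:

```python
import hashlib

def simhash(phrases, bits: int = 64) -> int:
    # Each phrase hash votes +1/-1 per bit position; the sign of the
    # total vote sets that bit, so similar phrase sets flip few votes
    # and yield nearby fingerprints.
    votes = [0] * bits
    for p in phrases:
        h = int.from_bytes(hashlib.sha256(p.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    # Number of differing bits: popcount(a XOR b).
    return bin(a ^ b).count("1")
```

Identical inputs give distance 0, and near-identical phrase sets typically differ in only a few bit positions, which is what the near-duplicate threshold below exploits.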


Figure 9 Similarity index

This comes with the obvious win that, instead of storing 128 or so hashes per document, we need to store only a single integer as a fingerprint. But the greater win comes in discovering near-duplicate documents. Suppose, for instance, that in order to be considered duplicates, documents may have at most two bits that differ. We can then divide our 64-bit hash into four 16-bit ranges called A, B, C and D. If two documents differ by at most two bits, then the differing bits appear in at most two of these ranges; the other side of this observation is that the remaining bit ranges must be identical.

Figure 10 Simhashes

It might be A and C that are identical, or B and C, but let us imagine for a moment that we know it will always be A and B that are identical. We could then keep all the simhash values in sorted order, and for a query find the range of elements that share the same AB prefix and compare only against those elements; this range can be found in logarithmic time, which is the big win. Since we do not know a priori which bit ranges will match, we simply maintain several sorted orders covering all the possible pairs of ranges that may match. For this illustration, any of these pairs may match: AB, AC, AD, BC, BD or CD. So instead of keeping one sorted list, we keep six sorted lists, each with the simhashes rearranged in one of these block orders: ABCD, ACDB, ADBC, BCAD, BDAC and CDAB.


Figure 11 Simhashes reordered

For a given query, we then check a fixed number d of these sorted lists, do an O(d ln(n)) lookup plus a few comparisons, and we have found every near-duplicate of our query. When sharded across distributed storage, each of these queries can also be executed in parallel. This method has yielded clear performance improvements.
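The block-permutation index above can be sketched as follows. For brevity this sketch keys hash tables by each pair of 16-bit blocks instead of keeping six sorted lists, but the guarantee is the same: two 64-bit simhashes differing in at most two bits leave at least two blocks untouched, so they collide in at least one pair table. All names are illustrative.

```python
from itertools import combinations

BLOCKS = 4       # split a 64-bit simhash into blocks A, B, C, D
BLOCK_BITS = 16

def blocks_of(h: int):
    # [A, B, C, D], most-significant block first.
    return [(h >> (BLOCK_BITS * i)) & 0xFFFF for i in reversed(range(BLOCKS))]

def build_tables(hashes):
    # One table per pair of blocks (AB, AC, AD, BC, BD, CD): key each
    # hash by that pair of block values.
    tables = {}
    for pair in combinations(range(BLOCKS), 2):
        table = {}
        for h in hashes:
            b = blocks_of(h)
            table.setdefault((b[pair[0]], b[pair[1]]), []).append(h)
        tables[pair] = table
    return tables

def near_duplicates(tables, query: int, max_bits: int = 2):
    # Probe each pair table with the query's block values, then verify
    # candidates with an exact Hamming-distance check.
    qb = blocks_of(query)
    found = set()
    for pair, table in tables.items():
        for h in table.get((qb[pair[0]], qb[pair[1]]), []):
            if bin(h ^ query).count("1") <= max_bits:
                found.add(h)
    return found

index = build_tables([0x0123456789ABCDEF])
q = 0x0123456789ABCDEF ^ 0b11          # flip two bits in one block
print(near_duplicates(index, q) == {0x0123456789ABCDEF})  # True
```

Each table probe touches only the candidates sharing a 32-bit prefix, so the expensive linear scan over all known documents is avoided.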

4 CONCLUSION AND FUTURE ENHANCEMENTS

In this paper, the idea of authorized data or file de-duplication was proposed to assure information security by granting clients differential privileges in the data-comparison scheme. We also presented several new de-duplication enhancements supporting authorized data comparison in a hybrid cloud architecture, in which the data-comparison tokens of documents are created by the private cloud server with private keys. Security analysis shows that our techniques are secure against the internal and external threats specified in the proposed security model. As a proof of concept, we implemented a prototype of our proposed authorized data-comparison scheme and conducted test-bed experiments on it. We demonstrated that our authorized de-duplication scheme incurs insignificant overhead compared with convergent encryption and network transfer. Finally, cloud data storage security is still full of challenges and of central importance, and many research issues remain to be identified.

References

[1] M. Bellare, S. Keelveedhi, and T. Ristenpart. Server-aided encryption for deduplicated storage. In USENIX Security Symposium, 2013.

[2] M. Bellare, C. Namprempre, and G. Neven. Security proofs for identity-based identification and signature schemes, 2009.

[3] W. K. Ng, Y. Wen, and H. Zhu. Private data deduplication protocols in cloud storage. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, 2012.

[4] P. Anderson and L. Zhang. Fast and secure laptop backups with encrypted de-duplication. In Proc. of USENIX LISA, 2010.

[5] J. Li, X. Chen, M. Li, J. Li, P. Lee, and W. Lou. Secure deduplication with efficient and reliable convergent key management. In IEEE Transactions on Parallel and Distributed Systems, 2013.

[6] R. D. Pietro and A. Sorniotti. Boosting efficiency and security in proof of ownership for deduplication. In ACM Symposium on Computer and Communications Security, 2012.

[7] T. Jiang, X. Chen, Q. Wu, J. Ma, W. Susilo, and W. Lou. Towards efficient fully randomized message-locked encryption. In Information Security and Privacy - 21st Australasian Conference, ACISP 2016.

[8] J. Stanek, A. Sorniotti, E. Androulaki, and L. Kencl. A secure data deduplication scheme for cloud storage. Technical Report, 2013.

[9] J. Paulo and J. Pereira. A survey and classification of storage deduplication systems. ACM Computing Surveys, 2014.

[10] P. Puzio, R. Molva, M. Onen, and S. Loureiro. ClouDedup: Secure deduplication with encrypted data for cloud storage. In Proc. IEEE Int. Conference on Cloud Computing Technology and Science, 2013.

[11] J. W. Yuan and S. C. Yu. Secure and constant cost public cloud storage auditing with deduplication. In Proc. IEEE Int. Conference on Communications and Network Security, 2013.
