secure mining of association rules

14
 Secure Mining of Association Rules in Horizontally Distributed Databases Tamir Tassa Abstract—We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms—one that computes the union of private subsets that each of the interacting players hold, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol in [18]. In addition, it is simpler and is signicantly more efcient in terms of communication rounds, communication cost and computational cost. Index Terms Privacy preserving data mining, distributed computation, frequent item sets, association rules Ç 1 INTRODUCTION W E study here the problem of secure mining of associa- tion rules in horizo ntall y parti tione d data bases. In that setting, there are several sites (or players) that hold homogeneous databases, i.e., databases that share the same schema but hold information on differ ent entities. The goal is to nd all association rules with support at least  s  and condence at least c, for some given minimal suppor t size s and condence level  c, that hold in the unied database, while minimizing the information disclosed about the pri- vate databases held by those players. The information that we would like to protect in this context is not only individ- ual transactions in the different databases, but also more global information such as what association rules are sup- ported locally in each of those databases. That goal denes a problem of secure multi-party com- putation. In such problems, there are  M  players that hold private inputs,  x 1 ; ... ; x M , and they wish to securely com- pute  y ¼ f ðx 1 ; ... ; x M Þ  for some public function  f . If there existed a trusted third party, the players could surrender to him their inputs and he would perform the function evalua- tion and send to them the resulting output. In the absence of such a trusted third party, it is needed to devise a proto- col that the players can run on their own in order to arrive at the required output  y. Such a protocol is considered per- fectly secure if no player can learn from his view of the pro- tocol more than what he would have learnt in the idealized setting where the computation is carried out by a trusted third party. Yao [32] was the rst to propose a generic solu- tion for this pr oblem in the case of two pl ayers. Other generic solutions, for the multi-party case, were later pro- posed in [3], [5], [15]. In our problem, the inputs are the partial databases, and the required output is the list of association rules that hold in the unied database with support and condence no smaller than the given thresholds s  and c, respectively. As the above mentioned generic solutions rely upon a descrip- tion of the function  f  as a Bool ean ci rc uit, they can be applied only to small inputs and functions which are realiz- able by simple circuits. In more complex settings, such as ours, other methods are required for carrying out this com- putation. In such cases, some relaxations of the notion of perfect security might be inevitable when looking for practi- cal prot ocol s, pr ovided that the excess information is deemed benign (see examples of such protocols in, e.g., [18], [28], [29], [31], [34]). Kantarcioglu and Clifton studied that problem in [18] and devised a protocol for its solution. The main part of the protocol is a sub-protocol for the secure computation of the union of private subsets that are held by the different play- ers. (The private subset of a given player, as we explain  below, includes the item sets that are s -frequent in his par- tial database.) That is the most costly part of the protocol and its implementat ion relies upon crypt ograp hic primi - tives such as commutative encryption, oblivious transfer, and hash functions. This is also the only part in the protocol in which the players may extract from their view of the pro- tocol informat ion on ot her da ta ba ses , beyond what is implied by the nal output and their own input. While such leakage of information renders the protocol not perfectly secure, the perimeter of the excess information is explicitly  bounded in [18] and it is argued there that such information leakage is innocuous, whence acceptable from a practical point of view. Herein we propose an alternative protocol for the secure computation of the union of private subsets. The proposed protocol improves upon that in [18] in terms of simplicity and efciency as well as privacy. In particular, our protocol does not depend on commutative encryption and oblivious transfer (wha t simpli es it signicantly and contri but es towards much reduced communication and computational costs). While our solution is still not perfectly secure, it leaks  The author is with the Departme nt of Mathematics and Co mputer Science, The Open University, 1 University Road, Ra’anana 43537, Israel.  Manuscript received 18 Feb. 2012; revised 23 May 2012; accepted 4 Mar. 2013; date of publication 6 Mar. 2013; date of current version 18 Mar. 2014. Recommended for acceptance by Y. Koren. For information on obtaining reprints of this article, please send e-mail to: [email protected] , and reference the Digital Object Identier below. Digital Object Identier no. 10.1109/TKDE.2013.41 970 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014 1041-4347 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standard s/publications/rights/index.html for more information.

Upload: loguthalir

Post on 08-Oct-2015

9 views

Category:

Documents


0 download

DESCRIPTION

Secure Mining of Association Rules

TRANSCRIPT

  • Secure Mining of Association Rules inHorizontally Distributed Databases

    Tamir Tassa

    AbstractWe propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading

    protocol is that of Kantarcioglu and Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of

    Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel

    secure multi-party algorithmsone that computes the union of private subsets that each of the interacting players hold, and another

    that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect

    to the protocol in [18]. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost

    and computational cost.

    Index TermsPrivacy preserving data mining, distributed computation, frequent item sets, association rules

    1 INTRODUCTION

    WE study here the problem of secure mining of associa-tion rules in horizontally partitioned databases. Inthat setting, there are several sites (or players) that holdhomogeneous databases, i.e., databases that share the sameschema but hold information on different entities. The goalis to find all association rules with support at least s andconfidence at least c, for some given minimal support size sand confidence level c, that hold in the unified database,while minimizing the information disclosed about the pri-vate databases held by those players. The information thatwe would like to protect in this context is not only individ-ual transactions in the different databases, but also moreglobal information such as what association rules are sup-ported locally in each of those databases.

    That goal defines a problem of secure multi-party com-putation. In such problems, there are M players that holdprivate inputs, x1; . . . ; xM , and they wish to securely com-pute y fx1; . . . ; xM for some public function f . If thereexisted a trusted third party, the players could surrender tohim their inputs and he would perform the function evalua-tion and send to them the resulting output. In the absenceof such a trusted third party, it is needed to devise a proto-col that the players can run on their own in order to arriveat the required output y. Such a protocol is considered per-fectly secure if no player can learn from his view of the pro-tocol more than what he would have learnt in the idealizedsetting where the computation is carried out by a trustedthird party. Yao [32] was the first to propose a generic solu-tion for this problem in the case of two players. Othergeneric solutions, for the multi-party case, were later pro-posed in [3], [5], [15].

    In our problem, the inputs are the partial databases, andthe required output is the list of association rules that holdin the unified database with support and confidence nosmaller than the given thresholds s and c, respectively. Asthe above mentioned generic solutions rely upon a descrip-tion of the function f as a Boolean circuit, they can beapplied only to small inputs and functions which are realiz-able by simple circuits. In more complex settings, such asours, other methods are required for carrying out this com-putation. In such cases, some relaxations of the notion ofperfect security might be inevitable when looking for practi-cal protocols, provided that the excess information isdeemed benign (see examples of such protocols in, e.g., [18],[28], [29], [31], [34]).

    Kantarcioglu and Clifton studied that problem in [18]and devised a protocol for its solution. The main part of theprotocol is a sub-protocol for the secure computation of theunion of private subsets that are held by the different play-ers. (The private subset of a given player, as we explainbelow, includes the item sets that are s-frequent in his par-tial database.) That is the most costly part of the protocoland its implementation relies upon cryptographic primi-tives such as commutative encryption, oblivious transfer,and hash functions. This is also the only part in the protocolin which the players may extract from their view of the pro-tocol information on other databases, beyond what isimplied by the final output and their own input. While suchleakage of information renders the protocol not perfectlysecure, the perimeter of the excess information is explicitlybounded in [18] and it is argued there that such informationleakage is innocuous, whence acceptable from a practicalpoint of view.

    Herein we propose an alternative protocol for the securecomputation of the union of private subsets. The proposedprotocol improves upon that in [18] in terms of simplicityand efficiency as well as privacy. In particular, our protocoldoes not depend on commutative encryption and oblivioustransfer (what simplifies it significantly and contributestowards much reduced communication and computationalcosts). While our solution is still not perfectly secure, it leaks

    The author is with the Department of Mathematics and Computer Science,The Open University, 1 University Road, Raanana 43537, Israel.

    Manuscript received 18 Feb. 2012; revised 23 May 2012; accepted 4 Mar.2013; date of publication 6 Mar. 2013; date of current version 18 Mar. 2014.Recommended for acceptance by Y. Koren.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2013.41

    970 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014

    1041-4347 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

  • excess information only to a small number (three) of possi-ble coalitions, unlike the protocol of [18] that discloses infor-mation also to some single players. In addition, we claimthat the excess information that our protocol may leak isless sensitive than the excess information leaked by the pro-tocol of [18].

    The protocol that we propose here computes a parameter-ized family of functions, which we call threshold functions,in which the two extreme cases correspond to the problemsof computing the union and intersection of private subsets.Those are in fact general-purpose protocols that can be usedin other contexts as well. Another problem of secure multi-party computation that we solve here as part of our discus-sion is the set inclusion problem; namely, the problem whereAlice holds a private subset of some ground set, and Bobholds an element in the ground set, and they wish to deter-mine whether Bobs element is within Alices subset, withoutrevealing to either of them information about the otherpartys input beyond the above described inclusion.

    1.1 Preliminaries

    1.1.1 Definitions and Notations

    Let D be a transaction database. As in [18], we view D as abinary matrix of N rows and L columns, where each row isa transaction over some set of items, A fa1; . . . ; aLg, andeach column represents one of the items in A. (In otherwords, the i; jth entry of D equals 1 if the ith transactionincludes the item aj, and 0 otherwise.) The database D ispartitioned horizontally between M players, denotedP1; . . . ; PM . Player Pm holds the partial database Dm thatcontains Nm jDmj of the transactions in D, 1 m M.The unified database is D D1 [ [DM , and it includesN :PMm1Nm transactions.

    An item set X is a subset of A. Its global support,suppX, is the number of transactions in D that contain it.Its local support, suppmX, is the number of transactions inDm that contain it. Clearly, suppX

    PMm1 suppmX. Let

    s be a real number between 0 and 1 that stands for arequired support threshold. An item set X is called s-fre-quent if suppX sN . It is called locally s-frequent at Dmif suppmX sNm.

    For each 1 k L, let Fks denote the set of all k-item sets(namely, item sets of size k) that are s-frequent, and Fk;ms bethe set of all k-item sets that are locally s-frequent at Dm,1 m M. Our main computational goal is to find, for agiven threshold support 0 < s 1, the set of all s-frequentitem sets, Fs :

    SLk1 F

    ks . We may then continue to find all

    s; c-association rules, i.e., all association rules of supportat least sN and confidence at least c. (Recall that if X and Yare two disjoint subsets of A, the support of the correspond-ing association rule X ) Y is suppX [ Y and its confi-dence is suppX [ Y =suppX.)

    1.1.2 The Fast Distributed Mining Algorithm

    The protocol of [18], as well as ours, are based on the FastDistributed Mining (FDM) algorithm of Cheung et al. [8],which is an unsecured distributed version of the Apriorialgorithm. Its main idea is that any s-frequent item set mustbe also locally s-frequent in at least one of the sites. Hence,in order to find all globally s-frequent item sets, each player

    reveals his locally s-frequent item sets and then the playerscheck each of them to see if they are s-frequent also globally.The FDM algorithm proceeds as follows:

    1. Initialization: It is assumed that the players havealready jointly calculated Fk1s . The goal is to pro-ceed and calculate Fks .

    2. Candidate Sets Generation: Each player Pm com-putes the set of all k 1-item sets that are locallyfrequent in his site and also globally frequent;namely, Pm computes the set F

    k1;ms \ Fk1s . He then

    applies on that set the Apriori algorithm in order togenerate the set Bk;ms of candidate k-item sets.

    3. Local Pruning: For each X 2 Bk;ms , Pm computessuppmX. He then retains only those item sets thatare locally s-frequent. We denote this collection ofitem sets by Ck;ms .

    4. Unifying the candidate item sets: Each playerbroadcasts his Ck;ms and then all players computeCks :

    SMm1 C

    k;ms .

    5. Computing local supports. All players compute thelocal supports of all item sets in Cks .

    6. Broadcast mining results: Each player broadcaststhe local supports that he computed. From that,everyone can compute the global support of everyitem set in Cks . Finally, F

    ks is the subset of C

    ks that con-

    sists of all globally s-frequent k-item sets.In the rst iteration, when k 1, the set C1;ms that the mthplayer computes (Steps 2-3) is just F 1;ms , namely, the set ofsingle items that are s-frequent in Dm. The complete FDMalgorithm starts by nding all single items that are globallys-frequent. It then proceeds to nd all 2-item sets that areglobally s-frequent, and so forth, until it nds the longestglobally s-frequent item sets. If the length of such item setsis K, then in the K 1th iteration of the FDM it willnd no K 1-item sets that are globally s-frequent, inwhich case it terminates.

    1.1.3 A Running Example

    Let D be a database of N 18 item sets over a set of L 5items, A f1; 2; 3; 4; 5g. It is partitioned between M 3players, and the corresponding partial databases are:

    D1 f12; 12345; 124; 1245; 14; 145; 235; 24; 24g;D2 f1234; 134; 23; 234; 2345g;D3 f1234; 124; 134; 23g :

    For example, D1 includes N1 9 transactions, the third ofwhich (in lexicographic order) consists of three items1, 2and 4.

    Setting s 1=3, an item set is s-frequent in D if it is sup-ported by at least 6 sN of its transactions. In this case,

    F 1s f1; 2; 3; 4g;F 2s f12; 14; 23; 24; 34g;F 3s f124g;F 4s F 5s ; ;

    andFs F 1s [ F 2s [ F 3s . For example, the item set 34 is indeedglobally s-frequent since it is contained in 7 transactions ofD.However, it is locally s-frequent only inD2 andD3.

    TASSA: SECURE MINING OF ASSOCIATION RULES IN HORIZONTALLY DISTRIBUTED DATABASES 971

  • In the first round of the FDM algorithm, the three playerscompute the sets C1;ms of all 1-item sets that are locally fre-quent at their partial databases:

    C1;1s f1; 2; 4; 5g ; C1;2s f1; 2; 3; 4g ; C1;3s f1; 2; 3; 4g :Hence, C1s f1; 2; 3; 4; 5g. Consequently, all 1-item sets haveto be checked for being globally frequent; that check revealsthat the subset of globally s-frequent 1-item sets isF 1s f1; 2; 3; 4g.

    In the second round, the candidate item sets are:

    C2;1s f12; 14; 24g;C2;2s f13; 14; 23; 24; 34g;C2;3s f12; 13; 14; 23; 24; 34g :

    (Note that 15; 25; 45 are locally s-frequent at D1 but they arenot included in C2;1s since 5 was already found to be globallyinfrequent.) Hence, C2s f12; 13; 14; 23; 24; 34g. Then, afterverifying global frequency, we are left with F 2s f12; 14; 23;24; 34g.

    In the third round, the candidate item sets are:

    C3;1s f124g ; C3;2s f234g ; C3;3s f124g :So, C3s f124; 234g and, then, F 3s f124g. There are nomore frequent item sets.

    1.2 Overview and Organization of the Paper

    The FDM algorithm violates privacy in two stages: In Step 4,where the players broadcast the item sets that are locallyfrequent in their private databases, and in Step 6, wherethey broadcast the sizes of the local supports of candidateitem sets. Kantarcioglu and Clifton [18] proposed secureimplementations of those two steps. Our improvement iswith regard to the secure implementation of Step 4, which isthe more costly stage of the protocol, and the one in whichthe protocol of [18] leaks excess information. In Section 2 wedescribe Kantarcioglu and Cliftons secure implementationof Step 4. We then describe our alternative implementationand proceed to analyze the two implementations in terms ofprivacy and efficiency and compare them. We show thatour protocol offers better privacy and that it is simpler andis significantly more efficient in terms of communicationrounds, communication cost and computational cost.

    In Sections 3 and 4 we discuss the implementation ofthe two remaining steps of the distributed protocol: Theidentification of those candidate item sets that are globallys-frequent, and then the derivation of all s; c-associationrules. In Section 5 we describe shortly an alternative proto-col, that was already considered in [9], [18], which offersfull security at enhanced costs. Section 6 describes ourexperimental evaluation which illustrates the significantadvantages of our protocol in terms of communication andcomputational costs. Section 7 includes a review of relatedwork. We conclude the paper in Section 8.

    Like in [18], we assume that the players are semi-honest;namely, they follow the protocol but try to extract as muchinformation as possible from their own view. (See [17], [26],[34] for a discussion and justification of that assumption.)We too, like [18], assume that M > 2. (The case M 2 is

    discussed in [18], Section 5]; the conclusion is that the prob-lem of secure computation of frequent item sets and associa-tion rules in the two-party case is unlikely to be of any use.)

    2 SECURE COMPUTATION OF ALL LOCALLYFREQUENT ITEM SETS

    Here we discuss the secure implementation of Step 4 in theFDM algorithm, namely, the secure computation of theunion Cks

    SMm1 C

    k;ms . We describe the protocol of [18]

    (Section 2.1) and then our protocol (Sections 2.2 and 2.3).We analyze the privacy of the two protocols in Section 2.4,their communication cost in Section 2.5, and their computa-tional cost in Section 2.6.

    2.1 The Protocol of Kantarcioglu and Clifton for theSecure Computation of All Locally FrequentItem Sets

    2.1.1 Overview

    Protocol 1 is the protocol that was suggested by Kantarciogluand Clifton [18] for computing the unified list of all locallyfrequent item sets, Cks

    SMm1 C

    k;ms , without disclosing the

    sizes of the subsets Ck;ms nor their contents. The protocol isapplied when the players already know Fk1s the set of allk 1-item sets that are globally s-frequent, and they wishto proceed and compute Fks . We refer to it hereinafter as Pro-tocol UNIFI-KC (Unifying lists of locally Frequent Item setsKantarcioglu and Clifton).

    The input that each player Pm has at the beginning ofProtocol UNIFI-KC is the collection Ck;ms , as defined in Steps2-3 of the FDM algorithm. Let ApFk1s denote the set of allcandidate k-item sets that the Apriori algorithm generatesfrom Fk1s . Then, as implied by the definition of C

    k;ms (see

    Section 1.1.2), Ck;ms , 1 m M, are all subsets of ApFk1s .The output of the protocol is the union Cks

    SMm1 C

    k;ms . In

    the first iteration of this computation k 1, and the playerscompute all s-frequent 1-item sets (here F 0s f;g). In thenext iteration they compute all s-frequent 2-item sets,and so forth, until the first k L in which they find no s-fre-quent k-item sets.

    After computing that union, the players proceed toextract from Cks the subset F

    ks that consists of all k-item sets

    that are globally s-frequent; this is done using the protocolthat we describe later on in Section 3. Finally, by applyingthe above described procedure from k 1 until the firstvalue of k L for which the resulting set Fks is empty, theplayers may recover the full set Fs :

    SLk1 F

    ks of all globally

    s-frequent item sets.Protocol UNIFI-KC works as follows: First, each player

    adds to his private subset Ck;ms fake item sets, in order tohide its size. Then, the players jointly compute the encryp-tion of their private subsets by applying on those subsets acommutative encryption,1 where each player adds, in histurn, his own layer of encryption using his private secretkey. At the end of that stage, every item set in each subset isencrypted by all of the players; the usage of a commutativeencryption scheme ensures that all item sets are, eventually,encrypted in the same manner. Then, they compute the

    1. An encryption algorithm is called commutative if EK1 EK2 EK2 EK1 for any pair of keys K1 and K2.

    972 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014

  • union of those subsets in their encrypted form. Finally, theydecrypt the union set and remove from it item sets whichare identified as fake. We now proceed to describe the pro-tocol in detail.

    (Notation agreement: Since all protocols that we presentherein involve cyclic communication rounds, the indexM 1 always means 1, while the index 0 always means M.)

    2.1.2 Detailed Description

    In Phase 0 (Steps 2-4), the players select the neededcryptographic primitives: They jointly select a com-mutative cipher, and each player selects a corre-sponding private random key. In addition, theyselect a hash function h to apply on all item sets priorto encryption. It is essential that h will not experiencecollisions on ApFk1s in order to make it invertibleon ApFk1s . Hence, if such collusions occur (anevent of a very small probability), a different hashfunction must be selected. At the end, the playerscompute a lookup table with the hash values of allcandidate item sets in ApFk1s ; that table willbe used later on to find the preimage of a givenhash value.

    In Phase 1 (Steps 6-19), all players compute a com-posite encryption of the hashed sets Ck;ms ,1 m M. First (Steps 6-12), each player Pm hashesall item sets in Ck;ms and then encrypts them usingthe key Km. (Hashing is needed in order to preventleakage of algebraic relations between item sets, see[18], Appendix].) Then, he adds to the resulting setfaked item sets until its size becomes jApFk1s j, inorder to hide the number of locally frequent itemsets that he has. (Since Ck;ms ApFk1s , the size ofCk;ms is bounded by jApFk1s j, for all 1 m M.)We denote the resulting set by Xm. Then (Steps 13-19), the players start a loop of M 1 cycles, where ineach cycle they perform the following operation:Player Pm sends a permutation of Xm to the nextplayer Pm1; Player Pm receives from Pm1 a permu-tation of the set Xm1 and then computes a new Xmas Xm EKmXm1. At the end of this loop, Pmholds an encryption of the hashed Ck;m1s using allM keys. Due to the commutative property ofthe selected cipher, Player Pm holds the setfEM E2E1hx : x 2 Ck;m1s g.

    In Phase 2 (Steps 21-26), the players merge the listsof encrypted item sets. At the completion of thisstage P1 holds the union set C

    ks SMm1 C

    k;ms hashed

    and then encrypted by all encryption keys, togetherwith some fake item sets that were used for the sakeof hiding the sizes of the sets Ck;ms ; those fake itemsets are not needed anymore and will be removedafter decryption in the next phase.

    The merging is done in two stages, where inthe first stage the odd and even lists are mergedseparately. As explained in [18, Section 3.2.1], notall lists are merged at once since if they were,then the player who did the merging (say P1)would be able to identify all of his own encrypteditem sets (as he would get them from PM ) and

    then learn in which of the other sites they are alsolocally frequent.

    In Phase 3 (Steps 28-34), a similar round of decryp-tions is initiated. At the end, the last player who per-forms the last decryption uses the lookup table T that

    TASSA: SECURE MINING OF ASSOCIATION RULES IN HORIZONTALLY DISTRIBUTED DATABASES 973

  • was constructed in Step 4 in order to identify andremove the fake item sets and then to recover Cks .Finally, he broadcastsCks to all his peers.

    Going back to the running example in Sec-tion 1.1.3, the set F 2s , consisting of all 2-item sets thatare globally s-frequent, includes the item setsf12; 14; 23; 24; 34g. Applying on it the Apriori algo-rithm, we find that ApF 2s f124; 234g. Therefore,each of the three players proceed to look for 3-itemsets from ApF 2s that are locally s-frequent in hispartial database. Since C3;1s f124g, P1 will hash andencrypt the item set 124 and will add to it one fakeitem set, since jApF 2s j 2. As C3;2s f234g andC3;3s f124g, also P2 and P3 will each use one fakeitem set. At the completion of the protocol, the threeplayers will conclude that C3s f124; 234g. Then, byapplying the protocol in Section 3, they will find outthat only the first of these two candidate item sets isglobally frequent, whence F 3s f124g.

    2.2 A Secure Multiparty Protocol for Computing theOR of Private Binary Vectors

    Protocol UNIFI-KC securely computes of the union of pri-vate subsets of some publicly known ground set (ApFk1s ).Such a problem is equivalent to the problem of computingthe OR of private vectors. Indeed, if the ground set isV fv1; . . . ;vng, then any subset B of V may be describedby the characteristic binary vector b b1; . . . ; bn 2 ZZn2where bi 1 if and only if vi 2 B. Let bm be the binary vec-tor that characterizes the private subset held by player Pm,1 m M. Then the union of the private subsets isdescribed by the OR of those private vectors, b : WMm1 bm.

    Such a simple function can be evaluated securely by thegeneric solutions suggested in [3], [5], [15]. We present herea protocol for computing that function which is much sim-pler to understand and program and much more efficientthan those generic solutions. It is also much simpler thanProtocol UNIFI-KC and employs less cryptographic primi-tives. Our protocol (Protocol 2) computes a wider range offunctions, which we call threshold functions.

    Definition 2.1 Let b1; . . . ; bM be M bits and 1 t M be aninteger. Then

    Ttb1; . . . ; bM 1 if

    PMm1 bm t

    0 ifPM

    m1 bm < t

    8 2, there is no need to invoke neither of thesecure protocols of [13] or [12]. Indeed, as M > 2, the exis-tence of other semi-honest players can be used to verify theinclusion in Eq. (4) much more easily. This is done in Proto-col 3 (SETINC) which we proceed to describe next.

    Protocol SETINC involves three players: P1 has a vectors s1; . . . ; sn of elements in some ground set V; PM ,on the other hand, has a vector Q Q1; . . . ;Qn of sub-sets of that ground set. The required output is a vectorb b1; . . . ; bn that describes the corresponding setinclusions in the following manner: bi 0 if si 2 Qiand bi 1 if si =2 Qi, 1 i n. The computation in theprotocol involves a third player P2. (When Protocol SETINCis called from Protocol THRESHOLD, the ground set isV ZZM1 and the inputs si and Qi of the two playersare as in Eq. (4), 1 i n.)

    The protocol starts with players P1 and PM agreeing on akeyed hash function hK (e.g., HMAC [4]), and a corre-sponding secret key K (Step 1). Consequently (Steps 2-3), P1converts his sequence of elements s s1; . . . ; sn into asequence of corresponding signatures s0 s01; . . . ;s0n, where s0i hKi; si and PM does a similar con-versions to the subsets that he holds. Then, in Steps 4-5, P1sends s0 to P2, and PM sends to P2 the subsets Q0i,1 i n, where the elements within each subset are ran-domly permuted. Finally (Steps 6-7), P2 performs the rele-vant inclusion verifications on the signature values. If hefinds out that for a given 1 i n, s0i 2 Q0i, he mayinfer, with high probability, that si 2 Qi (see more onthat below), whence he sets bi 0. If, on the other hand,s0i =2 Q0i, then, with certainty, si =2 Qi, and thus hesets bi 1.

    Two comments are in order:

    1. If the index i had not been part of the input to thehash function (Steps 2-3), then two equal compo-nents in P1s input vector, say si sj, wouldhave been mapped to two equal signatures,s0i s0j. Hence, in that case player P2 wouldhave learnt that in P1s input vector the ith and jthcomponents are equal. To prevent such leakage ofinformation, we include the index i in the input tothe hash function.

    2. An event in which s0i 2 Q0i while si =2 Qiindicates a collusion; specifically, it implies thatthere exist u0 2 Qi and u00 2 V nQi for whichhKi; u0 hKi; u00. Hash functions are designedso that the probability of such collusions is negli-gible, whence the risk of a collusion can beignored. However, it is possible for player PM tocheck upfront the selected random key K inorder to verify that for all 1 i n, the setsQ0i fhKi; u : u 2 Qig and Q00i fhKi; u :u 2 V nQig are disjoint.

    We refer hereinafter to the combination of ProtocolsTHRESHOLD and SETINC as Protocol THRESHOLD-C; namely, it isProtocol THRESHOLD where the verifications of the inequal-ities in Steps 6-8, which are equivalent to the verification ofthe set inclusions in Eq. (4), are carried out by ProtocolSETINC. Then our claims are as follows:

    Theorem 2.2. Protocol THRESHOLD-C is correct, i.e., it computesthe threshold function.

    Proof. Protocol THRESHOLD operates correctly if the inequal-ity verifications in Step 7 are carried out correctly, sincesi sMi mod M 1 equals the ith component aiin the sum vector a PMm1 bm. The inequality verifica-tion is correct if Protocol SETINC is correct. The latter pro-tocol is indeed correct if the randomly selected key K issuch that for all 1 i n, the sets Q0i fhKi; u :u 2 Qig and Q00i fhKi; u : u 2 V nQig are dis-joint. As discussed earlier, such a verification can be car-ried out upfront, and most all selections of K areexpected to pass that test. tu

    Theorem 2.3. Assume that the M > 2 players are semi-honest.Let C fP1; P2; . . . ; PMg be a coalition of players.

    a. If P2 =2 C and at least one of P1 and PM is not in Ceither, then Protocol THRESHOLD-C is perfectly privatewith respect to C.

    b. If P2 2 C but P1; PM =2 C, the protocol is computation-ally private with respect to C.

    c. Otherwise, the coalition C includes at least two of thethree players P1; P2; PM . In such cases, it may learnthe sum a PMm1 bm, but no information on the pri-vate vectors bm, 1 m M, beyond what is impliedby that sum and the coalitions input vectors.

    (A multiparty computation is perfectly private withrespect to a subset of players if it does not enable those play-ers to learn information on the inputs of other playersbeyond what is implied by the final output and their owninputs, even if they are computationally unbounded. Such acomputation is computationally private if it achieves thesame goal when the players are polynomially-bounded.)

    TASSA: SECURE MINING OF ASSOCIATION RULES IN HORIZONTALLY DISTRIBUTED DATABASES 975

  • Proof. The view of each single player P consists of bm; forall 1 m 6 M, where bm; is Ps additive share in anM-out-of-M secret sharing scheme for Pms private inputbm. In addition, P1s view includes s2; . . . ; sM1 (Step 4in Protocol THRESHOLD), which are additive shares in anM-out-of-M secret sharing for the sum a; and P2sview includes the signatures s0 and Q0 (Steps 4 and 5 inProtocol SETINC).

    (a) If the coalition C does not include P2 and, in addi-tion, it does not include at least one of P1 and PM , thenCs collaborative view consists of incomplete sets of addi-tive shares in b1; . . . ;bM and a. As the M-out-of-M secretsharing scheme is perfect, those additive shares are inde-pendent and each one is a uniformly distributed randomvector in ZZnM1. Hence, the coalition C may simulate itsview during the protocol, whence the protocol is per-fectly private with respect to C.

    (b) If P2 2 C and P1; PM =2 C, we may repeat the samearguments as before, since P2s additional view of s

    0 andQ0 can also be simulated by independent and uniformlydistributed random vectors; indeed, assuming that thehash function h is secure and that the key K is chosenuniformly at random, then s0 and Q0 are indistinguish-able from vectors chosen uniformly at random from Hn

    and Htn, respectively, where H is the hash range.However, while the coalitions discussed above in

    (a) cannot learn any information about the inputs ofother players, even if their members are computation-ally unbounded, here the privacy guarantee assumesthat P2 is polynomially bounded. If P2 is computation-ally unbounded, he may scan the exponential numberof possible keys K in order to find what key P1 andPM used. To do so, he will compute for each possibleK the hashed values hKi; u for all 1 i n andu 2 V f0; 1; . . . ;Mg. Then, he will check whether thesignature values that he got from P1 and PM (namely,s0 and Q0) are consistent with the values which hecomputed. After finding the true K (assuming that itis the only one that will pass the check), P2 will beable to recover si from s0i, and Qi from Q0i, forall 1 i n. Since Qi reveals sMi (see Eq. (4)), P2may proceed to compute ai si sMi, 1 i n.Hence, if P2 is computationally unbounded, he maybe able to deduce the value of the sum a. (If duringhis check, P2 finds more than one possible K, he willcompute the vector a that corresponds to each ofthem, and then infer that the true a is one of thosevectors.)

    (c) If P1; PM 2 C, then by adding s (known to P1) andsM (known to PM ), they will get the sum a. No furtherinformation on the input vectors b1; . . . ;bM may bededuced from the inputs of the players in such a coali-tion; specifically, every set of vectors b1; . . . ;bM that isconsistent with the sum a is equally likely. (Put differ-ently, such a coalition may simulate its view by selectingat random any set of vectors b1; . . . ;bM whose sumequals a and then generate for each such vector randomadditive shares.)

    Coalitions C that include either P1 and P2 or P2 andPM can also recover the vector a. Indeed, P2 knows s

    0

    and Q0, while P1 or PM knows hK , and K. Hence, if P2

    colludes with either P1 or PM , he may recover from s0

    and Q0 the preimages s and Q. Thus, such a coalition canrecover s and sM , and consequently, it can recover a. Asargued before, the shares available for such coalitions donot reveal any further information about the input vec-tors b1; . . . ;bM . tuThe susceptibility of Protocol THRESHOLD-C to coalitions is

    not very significant because of two reasons:

    The entries of the sum vector a do not revealinformation about specific input vectors. Namely,knowing that ai p only indicates that p out ofthe M bits bmi, 1 m M, equal 1, but it revealsno information regarding which of the M bitsare those.

    There are only three players that can collude in orderto learn information beyond the intention of the pro-tocol. Such a situation is far less severe than a situa-tion in which any player may participate in acoalition, since if it is revealed that a collusion tookplace, there is a small set of suspects.

    2.3 An Improved Protocol for the SecureComputation of All Locally Frequent Item Sets

    As before, we denote by Fk1s the set of all globally frequentk 1-item sets, and by ApFk1s the set of k-item sets thatthe Apriori algorithm generates when applied on Fk1s . Allplayers can compute the set ApFk1s and decide on anordering of it. (Since all item sets are subsets ofA fa1; . . . ; aLg, they may be viewed as binary vectors inf0; 1gL and, as such, they may be ordered lexicographi-cally.) Then, since the sets of locally frequent k-item sets,Ck;ms , 1 m M, are subsets of ApFk1s , they may beencoded as binary vectors of length nk : jApFk1s j. Thebinary vector that encodes the union Cks :

    SMm1 C

    k;ms is the

    OR of the vectors that encode the sets Ck;ms , 1 m M.Hence, the players can compute the union by invoking Pro-tocol THRESHOLD-C on their binary input vectors. Thisapproach is summarized in Protocol 4 (UNIFI).

    In the running example in Section 1.1.3, F 2s f12; 14;23; 24; 34g and ApF 2s f124; 234g. The private sets oflocally frequent item sets are C3;1s f124g, C3;2s f234g,and C3;3s f124g. Those private sets will be encoded asb1 1; 0, b2 0; 1, and b3 1; 0. The OR of these vec-tors is b 1; 1 and, therefore, C3s f124; 234g.

    Comment. Replacing T1 with TM in Step 2 of Protocol 4will result in computing the intersection of the private sub-sets rather than their union.

    976 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014

  • 2.4 Privacy

    We begin by analyzing the privacy offered by ProtocolUNIFI-KC. That protocol does not respect perfect privacysince it reveals to the players information that is not impliedby their own input and the final output. In Step 11 of Phase1 of the protocol, each player augments the set Xm by fakeitem sets. To avoid unnecessary hash and encryption com-putations, those fake item sets are random strings in theciphertext domain of the chosen commutative cipher. Theprobability of two players selecting random strings that willbecome equal at the end of Phase 1 is negligible; so is theprobability of Player Pm to select a random string thatequals EKmhx for a true item set x 2 ApFk1s . Hence,every encrypted item set that appears in two different listsindicates with high probability a true item set that is locallys-frequent in both of the corresponding sites. Therefore,Protocol UNIFI-KC reveals the following excess information:

    1. P1 may deduce for any subset of the odd players, thenumber of item sets that are locally supported by allof them.

    2. P2 may deduce for any subset of the even players,the number of item sets that are locally supported byall of them.

    3. P1 may deduce the number of item sets that are sup-ported by at least one odd player and at least oneeven player.

    4. If P1 and P2 collude, they reveal for any subset of theplayers the number of item sets that are locally sup-ported by all of them.

    As for the privacy offered by Protocol UNIFI, we considertwo cases: If there are no collusions, then, by Theorem 2.3,Protocol UNIFI offers perfect privacy with respect to allplayers Pm, m 6 2, and computational privacy with respectto P2. This is a privacy guarantee better than that offered byProtocol UNIFI-KC, since the latter protocol does revealinformation to P1 and P2 even if they do not collude withany other player.

    If there are collusions, both Protocols UNIFI-KC andUNIFI allow the colluding parties to learn forbidden infor-mation. In both cases, the number of suspects is smallinProtocol UNIFI-KC only P1 and P2 may benefit from a collu-sion while in Protocol UNIFI only P1, P2 and PM can extractadditional information if two of them collude (see Theo-rem 2.3). In Protocol UNIFI-KC, the excess informationwhich may be extracted by P1 and P2 is about the numberof common frequent item sets among any subset of the play-ers. Namely, they may learn that, say, P2 and P3 have manyitem sets that are frequent in both of their databases (butnot which item sets), while P2 and P4 have very few itemsets that are frequent in their corresponding databases. Theexcess information in Protocol UNIFI is different: If any twoout of P1, P2 and PM collude, they can learn the sum of allprivate vectors. That sum reveals for each specific item setin ApFk1s the number of sites in which it is frequent, butnot which sites. Hence, while the colluding players in Proto-col UNIFI-KC can distinguish between the different playersand learn about the similarity or dissimilarity betweenthem, Protocol UNIFI leaves the partial databases totallyindistinguishable, as the excess information that it leaks iswith regard to the item sets only.

    To summarize, given that Protocol UNIFI reveals noexcess information when there are no collusions, and, inaddition, when there are collusions, the excess informationstill leaves the partial databases indistinguishable, it offersenhanced privacy preservation in comparison to ProtocolUNIFI-KC.

    2.5 Communication Cost

    Here and in the next section we analyze the communicationand computational costs of Protocols UNIFI-KC and UNIFI.In doing so, we use the following notations and terms:

    K Ks is the size of the longest s-frequent item setin D.

    The kth iteration refers to the iteration in which Fks iscomputed from Fk1s . Both protocols have K 1 iter-ations (where in the last iteration, the players findthat FK1s ; and consequently terminate theprotocol).

    nk is the number of candidate item sets of size k inthe kth iteration. n1 L and nk : jApFk1s j for all1 < k K 1.

    n :PK1k1 nk. k is the number of k-item sets that were s-frequent in

    at least one of the sites.

    :PK1k1 k. B is the size in bits of representing one frequent item

    set. (We use the same notation for frequent item setsof any length for simplicity. In practice, the length offrequent item sets is typically a small number.)

    In evaluating the communication cost, we consider threeparameters: Total number of communication rounds, totalnumber of messages sent, and the overall size of the mes-sages sent. For example, in Step 15 of Protocol UNIFI-KC,every player Pm sends a message to Pm1. Those M mes-sages are sent simultaneously. Hence, each time this step isexecuted, the counter of communication rounds is increasedby 1, the number of messages sent is increased by M, andthe total message size is increased by the sum of sizes ofthose messages.

    2.5.1 Communication Cost of Protocol UNIFI-KC

    Let t denote the number of bits required to represent an itemset. Clearly, t must be at least log2 nk for all 1 k L. How-ever, as Protocol UNIFI-KC hashes the item sets and thenencrypts them, t should be at least the recommendedciphertext length in commutative ciphers. RSA [25], Pohlig-Hellman [24] and ElGamal [10] ciphers are examples ofcommutative encryption schemes. As the recommendedlength of the modulus in all of them is at least 1,024 bits, wetake t 1;024.

    We begin by analyzing the communication costs of Pro-tocol UNIFI-KC in each of the K 1 iterations separately.Each iteration, consists of four phases.

    During Phase 1 of Protocol UNIFI-KC, there are M 1rounds of communication. In each such round, each of theM players sends to the next player a message; the length ofthat message in the kth iteration is tnk. Hence, the communi-cation cost of this phase in the kth iteration is M 1Mmessages of total size of M 1Mtnk bits.

    TASSA: SECURE MINING OF ASSOCIATION RULES IN HORIZONTALLY DISTRIBUTED DATABASES 977

  • During Phase 2 of the protocol all odd players send theirencrypted item sets to P1 and all even players send theirencrypted item sets to P2. Then P2 unifies the item sets hegot and sends them to P1. Hence, this phase takes two morerounds. The communication cost, in the kth iteration, of thefirst of those two rounds is M 2 messages of total size ofM 2tnk. The communication cost of the second round isa single message whose size is m1tnk where m1 2 1;M=2.(The size of the unified list will equal the lower bound inthat range, i.e., tnk bits, if all lists of all even players coin-cide; it will equal the upper bound of Mtnk=2 if all thoselists are pairwise disjoint.)

    In Phase 3, a similar round of decryptions is initiated.The unified list of all encrypted true and fake item sets maycontain in the kth iteration at least nk item sets but no morethan Mnk item sets. Hence, that phase involves M 1rounds with communication cost of M 1 messages with atotal size of m2M 1tnk, where m2 2 1;M.

    Finally, in Step 34, PM broadcasts Cks to all other players.

    That step adds one more communication round with M 1messages of total size of M 1kB, where k jCks j.

    Adding up the above costs over all iterations,1 k K 1, we find that Protocol UNIFI-KC entails:

    2M 1K 1 communication rounds. M2 2M 3K 1 messages. gMtn M 1B bits of communication, where

    M2 M 2 gM 2M2 M2 2 : (5)

    2.5.2 Communication Cost of Protocol UNIFI

    Protocol UNIFI consists of four communication rounds (ineach of the iterations): One for Step 2 of Protocol THRESHOLDthat it invokes; one for Step 4 of that protocol; one forSteps 4-5 in Protocol SETINC which is used for the inequalityverifications in Protocol THRESHOLD; and one for Step 7 inProtocol SETINC.

    In the kth iteration, the length of the vectors in ProtocolTHRESHOLD is nk; each entry in those vectors represents anumber between 0 and M 1, whence it may be encodedby log2M bits. Therefore:

    The communication cost of Step 2 in Protocol THRESH-OLD is M 1M messages of total size of M1Mlog2Mnk bits. (Since each of the M playerssends a vector of size log2Mnk bits to each of theother M 1 players.)

    The communication cost of Step 4 in Protocol THRESH-OLD is M 2 messages of total size of M 2log2Mnk bits.

    The communication cost of Step 4 in Protocol SETINCis a single message of size jhjnk, where jhj is the sizein bits of the hash functions output. The communi-cation cost of Step 5 in that protocol is also a singlemessage of size jhjnk; indeed, when Protocol SETINCis called from Protocol THRESHOLD-C, the size of thesets Qi and Q0i is t 1 (see Eq. (4)), because theOR function corresponds to t 1.

    The communication cost of Step 7 in Protocol SETINCis M 1 messages of total size of M 1kB, sinceP2 can send to all his peers the actual k k-item setsthat were s-frequent in at least one of the sites.

    Adding up the above costs over all iterations, 1 k K 1, we find that Protocol UNIFI entails:

    4K 1 communication rounds. M2 M 1K 1 messages. M2 2log2 Mn 2njhj M 1B bits of

    communication.

    2.5.3 Comparison

    Comparing the costs of the two protocols as derived in Sec-tions 2.5.1 and 2.5.2 we find that Protocol UNIFI reduces thenumber of rounds by a factor of 2M 1=4 with respect toProtocol UNIFI-KC. The number of messages in the two pro-tocols is roughly the same. As for the bit communicationcost, Protocol UNIFI offers a significant improvement. Theimprovement factor in the bit communication cost, as offeredby Protocol UNIFI with respect to Protocol UNIFI-KC, is

    gMtn M 1BM2 2log2Mn 2njhj M 1B

    ; (6)

    where the range of possible values of gM is given in Eq.(5). The communication cost of the fourth phase, which isM 1B in both protocols, may be neglected, as vali-dated in our experimental evaluation. The reason for thisis that it depends on (the overall number of item setsthat were s-frequent in at least one site), which is muchsmaller than n (the overall number of Apriori-generatedcandidate item sets), and, in addition, it depends onM 1 rather than QM2 as the other costs. Therefore, theratio in Eq. (6) may be approximated by

    gMtM2 2log2M 2jhj

    :

    As discussed earlier, a plausible setting of t would bet 1;024. A typical value of jhj is 160. Hence, For M 4 weget an improvement factor that ranges between 53 and 82,while for M 8 we get an improvement factor that rangesbetween 142 and 247.

    2.6 Computational Cost

    In Protocol UNIFI-KC each of the players needs to performhash evaluations as well as encryptions and decryptions.As the cost of hash evaluations is significantly smallerthan the cost of commutative encryption, we focus on thecost of the latter operations. In Steps 8-10 of the protocol,player Pm performs jCk;ms j nk jApFk1s j encryptions(in the kth iteration). Then, in Steps 13-19, each player per-forms M 1 encryptions of sets that include nk items.Hence, in Phase 1 in the kth iteration, each player performsbetween M 1nk and Mnk encryptions. In Phase 3, eachplayer decrypts the set of items ECks . EC

    ks is the union of

    the encrypted sets from all M players, where each of thosesets has nk itemstrue and fake ones. Clearly, the size ofECks is at least nk. On the other hand, since most of theitems in the M sets are expected to be fake ones, and theprobability of collusions between fake items is negligible,it is expected that the size of ECks would be close to Mnk.So, in all its iterations (1 k K 1), Protocol UNIFI-KCrequires each player to perform an overall number of closeto 2Mn (but no less than Mn) encryptions or decryptions,

    978 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014

  • where, as before n PK1k1 nk. Since commutative encryp-tion is typically based on modular exponentiation, theoverall computational cost of the protocol is QMt3n bitoperations per player.

    In Protocol THRESHOLD, which Protocol UNIFI calls,each player needs to generate M 1n (pseudo)randomlog2 M-bit numbers (Step 1). Then, each player per-forms M 1n additions of such numbers in Step 1 aswell as in Step 3. Player P1 has to perform also M 2nadditions in Step 5. Therefore, the computational cost foreach player is QMn log2 M bit operations. In addition,Players 1 and M need to perform n hash evaluations.Compared to a computational cost of QMt3n bit opera-tions per player, we see that Protocol UNIFI offers a sig-nificant improvement with respect to Protocol UNIFI-KCalso in terms of computational cost.

    3 IDENTIFYING THE GLOBALLY s-FREQUENT ITEMSETS

    Protocols UNIFI-KC and UNIFI yield the set Cks that consistsof all item sets that are locally s-frequent in at least onesite. Those are the k-item sets that have potential to be alsoglobally s-frequent. In order to reveal which of those itemsets is globally s-frequent there is a need to securely com-pute the support of each of those item sets. That computa-tion must not reveal the local support in any of the sites.Let x be one of the candidate item sets in Cks . Then x isglobally s-frequent if and only if

    Dx : suppx sN XM

    m1suppmx sNm 0 : (7)

    We describe here the solution that was proposed by Kantar-cioglu and Clifton. They considered two possible settings. Ifthe required output includes all globally s-frequent itemsets, as well as the sizes of their supports, then the values ofDx can be revealed for all x 2 Cks . In such a case, those val-ues may be computed using a secure summation protocol(e.g., [6]), where the private addend of Pm issuppmx sNm. The more interesting setting, however, isthe one where the support sizes are not part of the requiredoutput. We proceed to discuss it.

    As jDxj N , an item set x 2 Cks is s-frequent if and onlyif Dx mod q N , for q 2N 1. The idea is to verify thatinequality by starting an implementation of the secure sum-mation protocol of [6] on the private inputs Dmx :suppmx sNm, modulo q. In that protocol, all playersjointly compute random additive shares of the requiredsum Dx and then, by sending all shares to, say, P1, he mayadd them and reveal the sum. If, however, PM withholdshis share of the sum, then P1 will have one random share,s1x, of Dx, and PM will have a corresponding share,sMx; namely, s1x sMx Dx mod q. It is then pro-posed that the two players execute the generic secure circuitevaluation of [32] in order to verify whether

    s1x sMx mod q N : (8)Those circuit evaluations may be parallelized for all x 2 Cks .

    We observe that inequality (8) holds if and only if

    s1x 2 Qx : fj sMx mod q : 0 j Ng : (9)

    As s1x is known only to P1 while Qx is known only toPM , the verification of the set inclusion in (9) can also be car-ried out by means of Protocol SETINC. However, the groundset V in this case is ZZq2N1, which is typically a large set.(Recall that when Protocol SETINC is invoked from UNIFI, theground set V is ZZM1, which is usually a small set.) Hence,Protocol SETINC is not useful in this case, and, consequently,Yaos generic protocol remains, for the moment, the proto-col of choice to securely verify inequality (8). Yaos protocolis designed for the two-party case. In our setting, as M > 2,there exist additional semi-honest players. An interestingquestion which arises in this context is whether the exis-tence of such additional semi-honest players may be used toverify inequalities like (8), even when the modulus is large,without resorting to costly protocols such as oblivioustransfer.

    4 IDENTIFYING ALL s; cs; c-ASSOCIATION RULESOnce the set Fs of all s-frequent item sets is found, we mayproceed to look for all s; c-association rules (rules withsupport at least sN and confidence at least c), as describedin [18]. For X;Y 2 Fs, where X \ Y ;, the correspondingassociation rule X ) Y has confidence at least c if and onlyif suppX [ Y =suppX c, or, equivalently,

    CX;Y :XM

    m1suppmX [ Y c suppmX 0 : (10)

    Since jCX;Y j N , then by taking q 2N 1, the players canverify inequality (10), in parallel, for all candidate associa-tion rules, as described in Section 3.

    In order to derive from Fs all s; c-association rules in anefficient manner we rely upon the following straightfor-ward lemma.

    Lemma 4.1. If X ) Y is an s; c-rule and Y 0 Y , thenX ) Y 0 is also an s; c-rule.

    Proof. The rule X ) Y 0 has the required support countsince suppX [ Y 0 suppX [ Y sN . It is also c-confident since suppX[Y

    0suppX suppX[Y suppX c. Hence, it is an

    s; c-rule too. tuWe first find all s; c-rules with 1-consequents;

    namely, all s; c-rules X ) Y with a consequent (righthand side) Y of size 1. To that end, we scan all item setsZ 2 Fs of size jZj 2, and for each such item set wescan all jZj partitions Z X [ Y where jY j 1 andX Z n Y . The association rule X ) Y that correspondsto such a given partition Z X [ Y is tested to seewhether it satisfies inequality (10). We may test all thosecandidate rules in parallel and at the end we get the fulllist of all s; c-rules with 1-consequents.

    We then proceed by induction; assume that we found alls; c-rules with j-consequents for all 1 j 1. To findall s; c-rules with -consequents, we rely upon Lemma 4.1.Namely, if Z 2 Fs and Z X [ Y where X \ Y ; andjY j , then X ) Y is an s; c-rule only if X ) Y 0 werefound to be s; c-rules for all Y 0 Y . Hence, we may createall candidate rules with -consequents and test them againstinequality (10) in parallel.

    TASSA: SECURE MINING OF ASSOCIATION RULES IN HORIZONTALLY DISTRIBUTED DATABASES 979

  • It should be noted that in practice, one usually aims atfinding association rules of the form X ) Y where jY j 1,or at least jY j for some small constant . However, theabove procedure may be continued until all candidate asso-ciation rules, with no upper bounds on the consequent size,are found.

    5 A FULLY SECURE PROTOCOL

    As noted in [18, Section 6], the players may dispense thelocal pruning and union computation in the FDM algo-rithm (Steps 2-4) and, instead, test all candidate item setsin ApFk1s to see which of them are globally s-frequent.Such a protocol is fully secure, as it reveals only the set ofglobally s-frequent item sets but no further informationabout the partial databases. However, as discussed in [18],such a protocol would be much more costly since itrequires each player to compute the local support ofjApFk1s j item sets (in the kth round) instead of only jCks jitem sets (where Cks

    SMm1 C

    k;ms ). In addition, the players

    will have to execute the secure comparison protocol of [32]to verify inequality (8) for jApFk1s j rather than only jCks jitem sets. Both types of added operations are very costly:the time to compute the support size depends linearly onthe size of the database, while the secure comparison pro-tocol entails a costly oblivious transfer sub-protocol. Since,as shown in [9], jApFk1s j is much larger than jCks j, theadded computing time in such a protocol is expected todominate the cost of the secure computation of the unionof all locally s-frequent item sets. Hence, the enhancedsecurity offered by such a protocol is accompanied byincreased implementation costs.

    6 EXPERIMENTAL EVALUATION

    In Section 6.1 we describe the synthetic database that weused for our experimentation. In Section 6.2 we explain howthe database was split horizontally into partial databases. InSection 6.3 we describe the experiments that we conducted.The results are given in Section 6.4.

    6.1 Synthetic Database Generation

    The databases that we used in our experimental evaluationare synthetic databases that were generated using the sametechniques that were introduced in [1] and then used also insubsequent studies such as [8], [18], [23]. Table 1 gives theparameter values that were used in generating the syntheticdatabase. The reader is referred to [8], [18], [23] for adescription of the synthetic generation method and themeaning of each of those parameters. The parameter valuesthat we used here are similar to those used in [8], [18], [23].

    6.2 Distributing the Database

    Given a generated synthetic database D of N transactionsand a number of players M, we create an artificial split of Dinto M partial databases, Dm, 1 m M, in the followingmanner: For each 1 m M we draw a random numberwm from a normal distribution with mean 1 and variance0.1, where numbers outside the interval 0:1; 1:9 areignored. Then, we normalize those numbers so thatPM

    m1 wm 1. Finally, we randomly split D into m partialdatabases of expected sizes of wmN , 1 m M, as follows:Each transaction t 2 D is assigned at random to one of thepartial databases, so that Prt 2 Dm wm, 1 m M.6.3 Experimental Setup

    We compared the performance of two secure implementa-tions of the FDM algorithm (Section 1.1.2). In the first imple-mentation (denoted FDM-KC), we executed the unificationstep (Step 4 in FDM) using Protocol UNIFI-KC, where thecommutative cipher was 1,024-bit RSA [25]; in the secondimplementation (denoted FDM) we used our ProtocolUNIFI, where the keyed-hash function was HMAC [4]. Inboth implementations, we implemented Step 5 of the FDMalgorithm in the secure manner that was described in Sec-tion 3. We tested the two implementations with respect tothree measures:

    1. Total computation time of the complete protocols(FDM-KC and FDM) over all players. That measureincludes the Apriori computation time, and the timeto identify the globally s-frequent item sets, asdescribed in Section 3. (The latter two proceduresare implemented in the same way in both ProtocolsFDM-KC and FDM.)

    2. Total computation time of the unification protocolsonly (UNIFI-KC and UNIFI) over all players.

    3. Total message size.We ran three experiment sets, where each set tested the

    dependence of the above measures on a different parameter:

    Nthe number of transactions in the unifieddatabase,

    Mthe number of players, and sthe threshold support size.In our basic configuration, we took N 500;000, M 10,

    and s 0:1. In the first experiment set, we kept M and sfixed and tested several values of N . In the second experi-ment set, we kept N and s fixed and varied M. In the thirdset, we kept N and M fixed and varied s. The results in eachof those experiment sets are shown in Section 6.4.

    All experiments were implemented in C# (.net 4) andwere executed on an Intel(R) Core(TM)i7-2620M personalcomputer with a 2.7 GHz CPU, 8 GB of RAM, and the 64-bitoperating system Windows 7 Professional SP1.

    6.4 Experimental Results

    Fig. 1 shows the values of the three measures that werelisted in Section 6.3 as a function of N . In all of those experi-ments, the value of M and s remained unchangedM 10and s 0:1. Fig. 2 shows the values of the three measures asa function of M; here, N 500;000 and s 0:1. Fig. 3 showsthe values of the three measures as a function of s; here,N 500;000 and M 10.

    TABLE 1Parameters for Generating the Synthetic Database

    980 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014

  • From the first set of experiments, we can see that N haslittle effect on the runtime of the unification protocols,UNIFI-KC and UNIFI, nor on the bit communication cost.However, since the time to identify the globally s-frequentitem sets (see Section 3) does grow linearly with N , and thatprocedure is carried out in the same manner in FDM-KCand FDM, the advantage of Protocol FDM over FDM-KC interms of runtime decreases with N . While for N 100;000,Protocol FDM is 22 times faster than Protocol FDM-KC, forN 500;000 it is five times faster. (The total computationtimes for larger values of N retain the same pattern thatemerges from Fig. 1; for example, with N 106 the totalcomputation times for FDM-KC and FDM were 744.1 and238.5 seconds, respectively, which gives an improvementfactor of 3.1.)

    The second set of experiments shows how the computa-tion and communication costs increase with M. In particu-lar, the improvement factor in the bit communication cost,as offered by Protocol UNIFI with respect to Protocol UNIFI-KC, is in accord with our analysis in Section 2.5.3. Finally,the third set of experiments shows that higher supportthresholds entail smaller computation and communicationcosts since the number of frequent item sets decreases.

    7 RELATED WORK

    Previous work in privacy preserving data mining has con-sidered two related settings. One, in which the data ownerand the data miner are two different entities, and another,in which the data is distributed among several parties who

    Fig. 1. Computation and communication costs versus the number oftransactions N.

    Fig. 2. Computation and communication costs versus the number ofplayers M.

    TASSA: SECURE MINING OF ASSOCIATION RULES IN HORIZONTALLY DISTRIBUTED DATABASES 981

  • aim to jointly perform data mining on the unified corpus ofdata that they hold.

    In the first setting, the goal is to protect the datarecords from the data miner. Hence, the data owneraims at anonymizing the data prior to its release. Themain approach in this context is to apply data perturba-tion [2], [11]. The idea is that the perturbed data can beused to infer general trends in the data, without reveal-ing original record information.

    In the second setting, the goal is to perform data miningwhile protecting the data records of each of the data ownersfrom the other data owners. This is a problem of securemulti-party computation. The usual approach here is cryp-tographic rather than probabilistic. Lindell and Pinkas [22]showed how to securely build an ID3 decision tree whenthe training set is distributed horizontally. Lin et al. [21]

    discussed secure clustering using the EM algorithm overhorizontally distributed data. The problem of distributedassociation rule mining was studied in [19], [31], [33] in thevertical setting, where each party holds a different set ofattributes, and in [18] in the horizontal setting. Also thework of [26] considered this problem in the horizontal set-ting, but they considered large-scale systems in which, ontop of the parties that hold the data records (resources) thereare also managers which are computers that assist theresources to decrypt messages; another assumption made in[26] that distinguishes it from [18] and the present study isthat no collusions occur between the different networknodesresources or managers.

    The problem of secure multiparty computation of theunion of private sets was studied in [7], [14], [20], as wellas in [18]. Freedman et al. [14] present a privacy-preserv-ing protocol for set intersections. It may be used to com-pute also set unions through set complements, sinceA [B A \B. Kissner and Song [20] present a methodfor representing sets as polynomials, and give several pri-vacy-preserving protocols for set operations using theserepresentations. They consider the threshold set unionproblem, which is closely related to the threshold func-tion (Definition 2.1). The communication overhead of thesolutions in those two works, as well as in [18]s and inour solutions, depends linearly on the size of the groundset. However, as the protocols in [14], [20] use homomor-phic encryption, while that of [18] uses commutativeencryption, their computational costs are significantlyhigher than ours. The work of Brickell and Shmatikov [7]is an exception, as their solution entails a communicationoverhead that is logarithmic in the size of the ground set.However, they considered only the case of two players,and the logarithmic communication overhead occurs onlywhen the size of the intersection of the two sets isbounded by a constant.

    The problem of set inclusion can be seen as a simplifiedversion of the privacy-preserving keyword search. In that prob-lem, the server holds a set of pairs fxi; pigni1, where xi aredistinct keywords, and the client holds a single value w. Ifw is one of the servers keywords, i.e., w xi for some1 i n, the client should get the corresponding pi. In casew differs from all xi, the client should get notified of that.The privacy requirements are that the server gets no infor-mation about w and that the client gets no information aboutother pairs in the servers database. This problem wassolved by Freedman et al. [13]. If we take all pi to be theempty string, then the only information the client gets iswhether or not w is in the set fx1; . . . ; xng. Hence, in thatcase the privacy-preserving keyword search problemreduces to the set inclusion problem. Another solution forthe set inclusion problem was recently proposed in [30],using a protocol for oblivious polynomial evaluation.

    8 CONCLUSION

    We proposed a protocol for secure mining of associationrules in horizontally distributed databases that improvessignificantly upon the current leading protocol [18] in termsof privacy and efficiency. One of the main ingredients inour proposed protocol is a novel secure multi-party

    Fig. 3. Computation and communication costs versus the supportthreshold s.

    982 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014

  • protocol for computing the union (or intersection) of privatesubsets that each of the interacting players hold. Anotheringredient is a protocol that tests the inclusion of an elementheld by one player in a subset held by another. Those proto-cols exploit the fact that the underlying problem is of inter-est only when the number of players is greater than two.

    One research problem that this study suggests wasdescribed in Section 3; namely, to devise an efficient proto-col for inequality verifications that uses the existence of asemi-honest third party. Such a protocol might enable tofurther improve upon the communication and computa-tional costs of the second and third stages of the protocol of[18], as described in Sections 3 and 4. Other research prob-lems that this study suggests is the implementation of thetechniques presented here to the problem of distributedassociation rule mining in the vertical setting [31], [33], theproblem of mining generalized association rules [27], andthe problem of subgroup discovery in horizontally parti-tioned data [16].

    ACKNOWLEDGMENTS

    The author thanks Keren Mendiuk for her help in conduct-ing the experiments.

    REFERENCES[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Associa-

    tion Rules in Large Databases, Proc 20th Intl Conf. Very LargeData Bases (VLDB), pp. 487-499, 1994.

    [2] R. Agrawal and R. Srikant, Privacy-Preserving Data Mining,Proc. ACM SIGMOD Conf., pp. 439-450, 2000.

    [3] D. Beaver, S. Micali, and P. Rogaway, The Round Complexity ofSecure Protocols, Proc. 22nd Ann. ACM Symp. Theory of Computing(STOC), pp. 503-513, 1990.

    [4] M. Bellare, R. Canetti, and H. Krawczyk, Keying Hash Functionsfor Message Authentication, Proc. 16th Ann. Intl Cryptology Conf.Advances in Cryptology (Crypto), pp. 1-15, 1996.

    [5] A. Ben-David, N. Nisan, and B. Pinkas, FairplayMP - A Systemfor Secure Multi-Party Computation, Proc. 15th ACM Conf. Com-puter and Comm. Security (CCS), pp. 257-266, 2008.

    [6] J.C. Benaloh, Secret Sharing Homomorphisms: Keeping Shares ofa Secret Secret, Proc. Advances in Cryptology (Crypto), pp. 251-260,1986.

    [7] J. Brickell and V. Shmatikov, Privacy-Preserving Graph Algo-rithms in the Semi-Honest Model, Proc. 11th Intl Conf. Theory andApplication of Cryptology and Information Security (ASIACRYPT),pp. 236-252, 2005.

    [8] D.W.L. Cheung, J. Han, V.T.Y. Ng, A.W.C. Fu, and Y. Fu, A FastDistributed Algorithm for Mining Association Rules, Proc. FourthIntl Conf. Parallel and Distributed Information Systems (PDIS),pp. 31-42, 1996.

    [9] D.W.L Cheung, V.T.Y. Ng, A.W.C. Fu, and Y. Fu, Efficient Min-ing of Association Rules in Distributed Databases, IEEE Trans.Knowledge and Data Eng., vol. 8, no. 6, Dec. 1996.

    [10] T. ElGamal, A Public Key Cryptosystem and a Signature SchemeBased on Discrete Logarithms, IEEE Trans. Information Theory,vol. IT-31, no. 4, July 1985.

    [11] A.V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, PrivacyPreserving Mining of Association Rules, Proc. Eighth ACMSIGKDD Intl Conf. Knowledge Discovery and Data Mining (KDD),pp. 217-228, 2002.

    [12] R. Fagin, M. Naor, and P. Winkler, Comparing Information with-out Leaking It, Comm. ACM, vol. 39, pp. 77-85, 1996.

    [13] M. Freedman, Y. Ishai, B. Pinkas, and O. Reingold, KeywordSearch and Oblivious Pseudorandom Functions, Proc. Second IntlConf. Theory of Cryptography (TCC), pp. 303-324, 2005.

    [14] M.J. Freedman, K. Nissim, and B. Pinkas, Efficient Private Match-ing and Set Intersection, Proc. Intl Conf. Theory and Applications ofCryptographic Techniques (EUROCRYPT), pp. 1-19, 2004.

    [15] O. Goldreich, S. Micali, and A. Wigderson, How to Play AnyMental Game or a Completeness Theorem for Protocols with Hon-est Majority, Proc. 19th Ann. ACM Symp. Theory of Computing(STOC), pp. 218-229, 1987.

    [16] H. Grosskreutz, B. Lemmen, and S. Ruping, Secure DistributedSubgroup Discovery in Horizontally Partitioned Data, Trans.Data Privacy, vol. 4, no. 3, pp. 147-165, 2011.

    [17] W. Jiang and C. Clifton, A Secure Distributed Framework forAchieving k-Anonymity, The VLDB J., vol. 15, pp. 316-333, 2006.

    [18] M. Kantarcioglu and C. Clifton, Privacy-Preserving DistributedMining of Association Rules on Horizontally Partitioned Data,IEEE Trans. Knowledge and Data Eng., vol. 16, no. 9, pp. 1026-1037,Sept. 2004.

    [19] M. Kantarcioglu, R. Nix, and J. Vaidya, An Efficient Approxi-mate Protocol for Privacy-Preserving Association Rule Mining,Proc. 13th Pacific-Asia Conf. Advances in Knowledge Discovery andData Mining (PAKDD), pp. 515-524, 2009.

    [20] L. Kissner and D.X. Song, Privacy-Preserving Set Operations,Proc. 25th Ann. Intl Cryptology Conf. (CRYPTO), pp. 241-257, 2005.

    [21] X. Lin, C. Clifton, and M.Y. Zhu, Privacy-Preserving Clusteringwith Distributed EM Mixture Modeling, Knowledge and Informa-tion Systems, vol. 8, pp. 68-81, 2005.

    [22] Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, Proc.Crypto, pp. 36-54, 2000.

    [23] J.S. Park, M.S. Chen, and P.S. Yu, An Effective Hash Based Algo-rithm for Mining Association Rules, Proc. ACM SIGMOD Conf.,pp. 175-186, 1995.

    [24] S.C. Pohlig and M.E. Hellman, An Improved Algorithm for Com-puting Logarithms over gfp and Its Cryptographic Signifi-cance, IEEE Trans. Information Theory, vol. IT-24, no. 1, pp. 106-110, Jan. 1978.

    [25] R.L. Rivest, A. Shamir, and L.M. Adleman, A Method for Obtain-ing Digital Signatures and Public-Key Cryptosystems, Comm.ACM, vol. 21, no. 2, pp. 120-126, 1978.

    [26] A. Schuster, R. Wolff, and B. Gilburd, Privacy-Preserving Associ-ation Rule Mining in Large-Scale Distributed Systems, Proc. IEEEIntl Symp. Cluster Computing and the Grid (CCGRID), pp. 411-418,2004.

    [27] R. Srikant and R. Agrawal, Mining Generalized AssociationRules, Proc. Intl Conf. Very Large Data Bases (VLDB), pp. 407-419,1995.

    [28] T. Tassa and D. Cohen, Anonymization of Centralized and Dis-tributed Social Networks by Sequential Clustering, IEEE Trans.Knowledge and Data Eng., vol. 25, no. 2, pp. 311-324, Feb. 2013.

    [29] T. Tassa and E. Gudes, Secure Distributed Computation of Ano-nymized Views of Shared Databases, Trans. Database Systems,vol. 37, article 11, 2012.

    [30] T. Tassa, A. Jarrous, and J. Ben-Yaakov, Oblivious Evaluation ofMultivariate Polynomials, J. Mathematical Cryptology, vol. 7, pp.1-29, 2013.

    [31] J. Vaidya and C. Clifton, Privacy Preserving Association RuleMining in Vertically Partitioned Data, Proc. Eighth ACM SIGKDDIntl Conf. Knowledge Discovery and Data Mining (KDD), pp. 639-644, 2002.

    [32] A.C. Yao, Protocols for Secure Computation, Proc. 23rd Ann.Symp. Foundations of Computer Science (FOCS), pp. 160-164, 1982.

    [33] J. Zhan, S. Matwin, and L. Chang, Privacy Preserving Collabora-tive Association Rule Mining, Proc. 19th Ann. IFIP WG 11.3 Work-ing Conf. Data and Applications Security, pp. 153-165, 2005.

    [34] S. Zhong, Z. Yang, and R.N. Wright, Privacy-Enhancing k-Ano-nymization of Customer Data, Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), pp. 139-147,2005.

    Tamir Tassa received the PhD degree in appliedmathematics from the Tel Aviv University in1993. He is an associate professor at the Depart-ment of Mathematics and Computer Science atThe Open University of Israel. Previously, heserved as a lecturer and a researcher in theSchool of Mathematical Sciences at Tel Aviv Uni-versity, and in the Department of Computer Sci-ence at Ben Gurion University. During the years1993-1996, he served as an assistant professorof computational and applied mathematics at the

    University of California, Los Angeles. His research interests includecryptography, privacy-preserving data publishing and data mining.

    TASSA: SECURE MINING OF ASSOCIATION RULES IN HORIZONTALLY DISTRIBUTED DATABASES 983

    /ColorImageDict > /JPEG2000ColorACSImageDict > /JPEG2000ColorImageDict > /AntiAliasGrayImages false /CropGrayImages true /GrayImageMinResolution 150 /GrayImageMinResolutionPolicy /OK /DownsampleGrayImages true /GrayImageDownsampleType /Bicubic /GrayImageResolution 300 /GrayImageDepth -1 /GrayImageMinDownsampleDepth 2 /GrayImageDownsampleThreshold 1.50000 /EncodeGrayImages true /GrayImageFilter /DCTEncode /AutoFilterGrayImages false /GrayImageAutoFilterStrategy /JPEG /GrayACSImageDict > /GrayImageDict > /JPEG2000GrayACSImageDict > /JPEG2000GrayImageDict > /AntiAliasMonoImages false /CropMonoImages true /MonoImageMinResolution 1200 /MonoImageMinResolutionPolicy /OK /DownsampleMonoImages true /MonoImageDownsampleType /Bicubic /MonoImageResolution 600 /MonoImageDepth -1 /MonoImageDownsampleThreshold 1.50000 /EncodeMonoImages true /MonoImageFilter /CCITTFaxEncode /MonoImageDict > /AllowPSXObjects false /CheckCompliance [ /None ] /PDFX1aCheck false /PDFX3Check false /PDFXCompliantPDFOnly false /PDFXNoTrimBoxError true /PDFXTrimBoxToMediaBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXSetBleedBoxToMediaBox true /PDFXBleedBoxToTrimBoxOffset [ 0.00000 0.00000 0.00000 0.00000 ] /PDFXOutputIntentProfile (None) /PDFXOutputConditionIdentifier () /PDFXOutputCondition () /PDFXRegistryName () /PDFXTrapped /False

    /Description >>> setdistillerparams> setpagedevice