
Evaluating privacy threats in released database views by symmetric indistinguishability

Chao Yao∗, Lingyu Wang†, X. Sean Wang‡,

Claudio Bettini§, Sushil Jajodia¶

Abstract

A privacy violation occurs when the association between an individual identity and data considered private by that individual is obtained by an unauthorized party. Uncertainty and indistinguishability are two independent aspects that characterize the degree of this association being revealed. Indistinguishability refers to the property that the attacker cannot see the difference among a group of individuals, while uncertainty refers to the property that the attacker cannot tell which private value, among a group of values, an individual actually has. This paper investigates the notion of indistinguishability as a general form of anonymity, applicable, for example, not only to generalized private tables, but to relational views and to sets of views obtained by multiple queries over a private database table. It is shown how indistinguishability is highly influenced by certain symmetries among individuals, in the released data, with respect to their private values. The paper provides both theoretical results and practical algorithms for checking if a specific set of views over a private table provides sufficient indistinguishability.

1 Introduction

In many data applications, it is necessary to measure privacy disclosure in released data to protect individual privacy while satisfying application requirements. A privacy violation occurs when the association between an individual identity and the data considered private by that individual is obtained by an unauthorized party. Uncertainty and indistinguishability are two independent aspects that characterize the degree of this association being revealed. Indistinguishability refers to the property that the attacker cannot see the difference among a group of individuals, while uncertainty refers to the property that the attacker cannot tell which private value, among a group of values, an individual actually has.

∗Bloomberg L.P.
†Concordia University
‡University of Vermont
§University of Milano
¶George Mason University

The measurement metrics used in prior work have mainly been based on uncertainty of private property values, i.e., the uncertainty about which private value an individual has. These metrics can be classified into two categories: non-probabilistic and probabilistic. The non-probabilistic metrics are based on whether the private value of an individual can be uniquely inferred from the released data [1, 23, 8, 18, 6, 17] or whether the cardinality of the set of possible private values inferred for an individual is large enough [29, 32]. The probabilistic metrics are based on some characteristics of the probability distribution of the possible private values inferred from the released data [3, 2, 11, 10, 16, 4] (see Section 5 for more details).

However, uncertainty is only one aspect of privacy and it alone does not provide adequate protection. For example, we may reveal employee John’s salary to be in a large interval (say, 100K to 300K annually). There may be enough uncertainty. However, if we also reveal that the salaries of all other employees are in ranges that are totally different from John’s range (say, all are subranges of 50K to 100K), then John’s privacy may still be violated.

To adequately protect privacy, we need to consider the other aspect, namely, indistinguishability. Indeed, the privacy breach in the above example can be viewed as due to the fact that from the released data, an individual is different from all other individuals in terms of their possible private values. In other words, the example violates a privacy requirement, namely, the “protection from being brought to the attention of others” [12]. What we need is to have each individual belong to a group of individuals who are indistinguishable from each other in terms of their possible private values derived from the released data. In this way, an individual is hidden in a crowd that consists of individuals who have similar/same possible private values. For instance, in the above salary example, to protect John’s privacy, we may want to make sure that attackers can only derive from the released data that a large group of employees have the same range as John’s for their possible salaries.

The notion of k-anonymization [27, 28, 24, 5, 20] aims at achieving a certain degree of indistinguishability. More specifically, the idea of k-anonymization is to recode, mostly by generalization, publicly available quasi-IDs in a single released table, so that at least k individuals will have the same recoded quasi-IDs. (Quasi-IDs are combinations of attributes whose values can be used to identify groups of (and possibly single) individuals through external sources [27, 28].)

While k-anonymity is an interesting notion, it only applies to anonymized tables. What can we say about privacy disclosure, in terms of the indistinguishability of individuals associated to specific private values, when a sequence of queries on private data is answered? How can we check for indistinguishability when a view based on projection and selection, possibly including sensitive attributes, is released? How is indistinguishability reduced when the information of two or more views can be combined? While uncertainty has been extensively studied [1, 8, 18, 6, 29, 17], even in the case of the release of multiple views [32], the answers to the above questions are essentially open and this paper focuses on these issues.

We illustrate the problem of indistinguishability for multiple views by Example 1, which also introduces the set of private data that will be considered throughout the paper in the examples illustrating definitions and verification procedures.

Example 1 As a running example, we consider data from the database of a medical institution including general census data on the individual patient as well as the total amount that has been charged to the patient by the institution and the last problem regarding her health condition, in the form of a diagnosis. The private table Tbl is reported in Figure 1. The public attributes (denoted with PA in the paper) are Zip, Age, Race, Gender, and Charge, while the sensitive attribute is Diagnosis. We use $t_1, \ldots, t_{12}$ to denote the tuples in the table and we also assume that PA is a quasi-ID such that each tuple can re-identify a single individual; hence $t_i[PA]$ identifies a particular individual for each i. In the sequel, we use $t_i[PA]$ and the individual identified by $t_i[PA]$ interchangeably.

      Zip    Age  Race   Gender  Charge  Diagnosis
t1    22030  39   White  Male    1K      Cold
t2    22030  39   White  Male    12K     AIDS
t3    22030  55   White  Male    5K      Obesity
t4    22030  53   Black  Male    5K      AIDS
t5    22031  28   Black  Female  8K      Chest Pain
t6    22031  37   White  Female  10K     Hypertension
t7    22031  49   Black  Female  1K      Obesity
t8    22031  52   White  Male    8K      Cold
t9    22032  30   Asian  Male    10K     Hypertension
t10   22032  40   Asian  Male    9K      Chest Pain
t11   22033  30   White  Male    10K     Hypertension
t12   22033  40   White  Male    9K      Chest Pain

Figure 1: A private table about patients (Tbl)

Consider the set of two views in Figure 2 obtained from the private table in Figure 1. The first view is obtained by the query

$$\Pi_{Gender,Diagnosis}\,\sigma_{Zip='22030'}(Tbl)$$

and returns the projection on Gender and Diagnosis of the first 4 tuples of the private table. (Note that we assume projection preserves duplicates.) The second view is obtained by the query

$$\Pi_{Race,Gender}\,\sigma_{Age \le '49' \text{ and } Diagnosis='Obesity'}(Tbl)$$

which returns the projection on the attributes Race, Gender of the seventh tuple in the private table.


The first view by itself leaves the four patients living in Zip 22030 indistinguishable, hence each of them may be associated with any one of the Diagnosis values. It would be considered 4-anonymous according to the k-anonymity definition and also 4-SIND, i.e., having indistinguishability in a group of 4 individuals, according to our proposal (SIND is defined in Section 2.3).

Consider the second view by itself. From the view, with the knowledge of the public part of the private table $\Pi_{PA}(Tbl)$, we can only know that one of the two black females has Obesity, but these two individuals are indistinguishable, and therefore, this view provides indistinguishability in a group of 2 individuals, i.e., 2-SIND. It is not clear how the k-anonymity notion may apply here.

Now consider the case in which an adversary obtains both views at the same time. He now knows, among other things, that neither of the two 39-year-old patients at Zip 22030 has Obesity (from the second view and since they are not females), and that one of the patients identified by $t_3[PA]$ and $t_4[PA]$ must be the one with Obesity, as there must be one patient having Obesity (from the above fact and the first view). This conclusion should reduce the anonymity of the involved individuals. However, it is not clear how to use the notion of k-anonymity to evaluate this effect. In contrast, we will show how it can be evaluated according to our definition: when the two views are considered together, we only achieve 2-SIND, even for the four patients at Zip 22030.

Gender  Diagnosis
Male    Cold
Male    AIDS
Male    Obesity
Male    AIDS

(a) The first view in the released dataset.

Race   Gender
Black  Female

(b) The second view in the released dataset.

Figure 2: A released dataset composed of two relational views
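The two queries of Example 1 and the adversary's combined inference can be reproduced with a small Python sketch. The encoding of Tbl as tuples and the names `view1`, `view2`, and `candidates` are assumptions made for illustration only; the consistency test is a simplified necessary condition restricted to the four Zip 22030 patients, not a verification procedure from the paper.

```python
from itertools import permutations

# Rows of Figure 1: (name, Zip, Age, Race, Gender, Charge, Diagnosis).
TBL = [
    ("t1", "22030", 39, "White", "Male", "1K", "Cold"),
    ("t2", "22030", 39, "White", "Male", "12K", "AIDS"),
    ("t3", "22030", 55, "White", "Male", "5K", "Obesity"),
    ("t4", "22030", 53, "Black", "Male", "5K", "AIDS"),
    ("t5", "22031", 28, "Black", "Female", "8K", "Chest Pain"),
    ("t6", "22031", 37, "White", "Female", "10K", "Hypertension"),
    ("t7", "22031", 49, "Black", "Female", "1K", "Obesity"),
    ("t8", "22031", 52, "White", "Male", "8K", "Cold"),
    ("t9", "22032", 30, "Asian", "Male", "10K", "Hypertension"),
    ("t10", "22032", 40, "Asian", "Male", "9K", "Chest Pain"),
    ("t11", "22033", 30, "White", "Male", "10K", "Hypertension"),
    ("t12", "22033", 40, "White", "Male", "9K", "Chest Pain"),
]


def view1(tbl):
    """Projection on (Gender, Diagnosis) of the Zip = 22030 rows, duplicates kept."""
    return [(g, d) for (_n, z, _a, _r, g, _c, d) in tbl if z == "22030"]


def view2(tbl):
    """Projection on (Race, Gender) of rows with Age <= 49 and Diagnosis = Obesity."""
    return [(r, g) for (_n, _z, a, r, g, _c, d) in tbl if a <= 49 and d == "Obesity"]


V1, V2 = view1(TBL), view2(TBL)

# Adversary's side: only the public attributes of the Zip 22030 patients are used.
public_22030 = [row[:6] for row in TBL if row[1] == "22030"]
released_bag = [d for (_g, d) in V1]

# Try every way of handing the released diagnoses to the four patients and keep
# those that do not contradict the second view (simplified necessary condition:
# a patient aged <= 49 with Obesity must show up in the second view).
candidates = set()
for assignment in set(permutations(released_bag)):
    consistent = all(
        not (row[2] <= 49 and d == "Obesity") or (row[3], row[4]) in V2
        for row, d in zip(public_22030, assignment)
    )
    if consistent:
        candidates.update(
            row[0] for row, d in zip(public_22030, assignment) if d == "Obesity"
        )

print(V1)
print(V2)
print(sorted(candidates))  # patients at Zip 22030 who may still have Obesity
```

Under these assumptions only t3 and t4 remain as possible Obesity patients at Zip 22030, which matches the reasoning in the example.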

In this paper, we define three variants of indistinguishability, and the corresponding privacy metrics, that can be applied to general situations, including anonymized tables and relational views. We show that k-anonymization is a special case of one kind of indistinguishability under a certain assumption (see Section 2.4).

The first notion of indistinguishability is based on probability. This definition probably provides the best characterization and it is theoretically elegant. While it is useful as a reference notion, it has the drawback of not admitting practical automatic methods for its verification. This is mainly due to the difficulty in obtaining required a priori distributions of private values. The second and the third are based on certain symmetries between individuals and their private values in the released data. More specifically, the second definition requires symmetry for all possible private values while the third definition requires symmetry only with respect to certain subsets of possible private values.


Based on the formal framework, we study the problem of deciding whether a set of database views provides enough indistinguishability. We study the worst-case computational complexity of deciding each kind of indistinguishability and identify several subcases that admit polynomial verification algorithms. We present in detail some of these algorithms and show that they also provide an effective solution for the general case, if a conservative approach is considered. We also provide preliminary results for the dynamic private table situation, i.e., how to ensure indistinguishability when the private table is updated. We present a transformation from the case of dynamic private tables into the case of static tables.

Finally, we should note that uncertainty and indistinguishability are two independent aspects for providing privacy; we have shown that uncertainty by itself is insufficient, and it has been observed that the form of indistinguishability provided by k-anonymity does not ensure uncertainty either [22, 21]. The same applies to general indistinguishability; referring to our earlier example of John’s salary, if in the released data many employees have the same single possible salary value, then these employees are all indistinguishable from each other in terms of their salaries, but there is not enough uncertainty to protect their privacy (all their salaries are the same and revealed!).

While the notion of indistinguishability proposed in this paper takes into account the private values associated to individuals, it is not the purpose of this paper to provide an integrated solution for indistinguishability and uncertainty; we have illustrated above that there are several issues still open concerning indistinguishability alone. Extensions and integration with other solutions will be discussed in Section 6.

We summarize the contributions of this paper as follows: (1) We identify indistinguishability as a more generally applicable notion of anonymity, provide formal definitions of different kinds of indistinguishability, and study their properties. (2) We analyze the computational complexity and introduce practical verification methods for deciding whether a set of database views provides enough indistinguishability.

The rest of the paper is organized as follows. We give formal definitions of indistinguishability and privacy metrics in Section 2. We then focus on checking database views against these privacy metrics in Section 3. We investigate the metrics and checking methods in the special case in which private tables are subject to updates in Section 4. In Section 5 we review the related work, and we conclude the paper in Section 6, where we also present some interesting research directions.

2 Theoretical and practical characterizations of indistinguishability

In this section we define a formal framework that will be used to provide a theoretical characterization of the indistinguishability property. We will start with a probabilistic characterization, and then introduce two alternative characterizations that admit more practical methods to check if a set of views provides a certain degree of indistinguishability. The relationships between these alternative characterizations will be explained.

2.1 Preliminaries

We consider releasing data from a single private table Tbl with schema Attr(Tbl). The attributes in Attr(Tbl) are partitioned into two sets, PA and SA. The set PA consists of the Public Attributes; SA consists of the Sensitive Attributes, i.e., attributes whose values, if released together with the identity of the user, lead to a privacy violation. For simplicity and without loss of generality, we assume SA only has one attribute.

We assume that the projection on PA, $\Pi_{PA}(Tbl)$, is publicly known. We believe this assumption is realistic in many situations. In other situations where this is not true, we may view our approach as providing a conservative privacy measure.

Given a relation $r_{PA}$ on PA, we will use $I^{PA}$ to denote the set $\{I \mid \Pi_{PA}(I) = r_{PA}\}$, i.e., the set of the relations on Attr(Tbl) whose PA-projection coincides with $r_{PA}$. The domain of SA is denoted by Dom(SA). A tuple of an instance in $I^{PA}$ is denoted by t or (b, p), where b is in $\Pi_{PA}(Tbl)$ and p is in Dom(SA). The set $I^{PA}$ corresponds to all private table instances that are possible when only $\Pi_{PA}(Tbl)$ is known.

Furthermore, we assume PA is a key in Attr(Tbl), which means that each composite value on PA appears at most once in the private table. Intuitively, each composite value on PA is assumed to uniquely identify an individual (possibly using also external public information), and hence the tuples in Tbl actually describe associations of the sensitive attribute values with individuals. Such associations are the private information to be protected.

We assume that the data in Tbl are being released with a publicly-known function M. We also use v to denote the result of M() on the private table, i.e., v = M(Tbl). Examples of function M() include an anonymization procedure, and a set of queries (views) on a single table on Attr(Tbl). The table in Figure 1 will be used throughout the paper as an example of a private table from which different sets of views will be considered for release.

2.2 Probabilistic Indistinguishability

An accurate characterization of indistinguishability between two individuals is captured by a probabilistic definition. In this setting, given a relation $r_{PA}$ on PA, we denote with $I^{PA}$ the set of relations I on Attr(Tbl) whose PA-projection coincides with $r_{PA}$, i.e., $I^{PA} = \{I \mid \Pi_{PA}(I) = r_{PA}\}$.

Given the domain of SA, and possibly other knowledge, there will be different probabilities of certain sensitive values to appear associated with certain combinations of PA values. This intuition is captured by considering the probability P(I) of a relation instance I in $I^{PA}$ to be equal to Tbl.


Then, given $b \in \Pi_{PA}(Tbl)$, the probability that a tuple (b, p), with p being a value for SA, is a tuple in the actual Tbl, denoted $P_b(p)$, can be expressed as

$$P_b(p) = \sum_{I:\,(b,p) \in I} P(I).$$

Intuitively, the probability that (b, p) is a tuple in the original private table is given by the sum of the probabilities that a certain possible instance I of $I^{PA}$, containing that tuple, is actually the private table.

As v = M(Tbl) is released, we denote by $I^v$ the subset of possible instances in $I^{PA}$ that yield v. The a posteriori probability of (b, p) appearing in an instance of $I^v$, denoted $P^v_b(p)$, can be calculated based on $I^v$ by the Bayesian formula as

$$P^v_b(p) = \frac{\sum_{I:\, I \in I^v \wedge (b,p) \in I} P(I)}{\sum_{I:\, I \in I^v} P(I)}.$$

Based on the above probability model, we introduce the concept of probabilistic indistinguishability.

Definition 1 (Probabilistic Indistinguishability) Given a released data view v and two tuples $b_i$ and $b_j$ in $\Pi_{PA}(Tbl)$, we say $b_i$ and $b_j$ are indistinguishable with respect to v if $P_{b_i}(p) = P_{b_j}(p)$ and $P^v_{b_i}(p) = P^v_{b_j}(p)$ for each p in Dom(SA).

Intuitively, the definition says that the two individuals “identified” by $b_i$ and $b_j$ are indistinguishable if for each possible private value, they have the same a priori probability of being associated with that value, and considering the released dataset v, they also have the same a posteriori probability of being associated with that value in an instance of $I^v$. In other words, the definition requires that the two PA-tuples, $b_i$ and $b_j$, have the same distribution for their sensitive values before and after the release of v. Hence, indistinguishability upon release of v means that knowing v does not make $b_i$ and $b_j$ different in terms of their private values, although the probability distributions of the private values may have changed due to the released data.

Example 2 Suppose $M(Tbl) = \Pi_{Diagnosis}\,\sigma_{Zip='22033'}(Tbl)$, where Tbl is the table in Figure 1. Suppose that PA-tuples $t_{11}[PA]$ and $t_{12}[PA]$ have the same and independent uniform distribution for their Diagnosis values over Dom(Diagnosis). After the view is released, although the Diagnosis distributions of $t_{11}[PA]$ and $t_{12}[PA]$ are changed, they are still the same, i.e., the uniform distribution over {Hypertension, Chest Pain}, since each of $t_{11}[PA]$ and $t_{12}[PA]$ can have one of the two problems with equal probability. Hence, we say $t_{11}[PA]$ and $t_{12}[PA]$ are indistinguishable with respect to the released data.
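The probabilities in Example 2 can be computed by brute-force enumeration. The following Python sketch assumes a five-value Dom(Diagnosis) and an independent uniform prior, and it considers only the two Zip 22033 patients, since the view does not constrain any other tuple. The names `prior`, `a_priori`, and `a_posteriori` are hypothetical helpers mirroring $P(I)$, $P_b(p)$, and $P^v_b(p)$.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed domain of the sensitive attribute (the paper does not fix it).
DOMAIN = ["Cold", "AIDS", "Obesity", "Chest Pain", "Hypertension"]
PA = ["t11", "t12"]  # the two PA-tuples with Zip = 22033

# Released view: the bag of diagnoses of the Zip 22033 patients.
V = Counter(["Hypertension", "Chest Pain"])


def release(instance):
    """M(I): projection on Diagnosis, duplicates preserved."""
    return Counter(instance.values())


def prior(instance):
    """P(I) under an independent uniform prior (an assumption of this sketch)."""
    return Fraction(1, len(DOMAIN)) ** len(instance)


all_instances = [dict(zip(PA, values)) for values in product(DOMAIN, repeat=len(PA))]
iv = [inst for inst in all_instances if release(inst) == V]  # instances yielding v


def a_priori(b, p):
    """P_b(p): total prior mass of the instances containing (b, p)."""
    return sum(prior(inst) for inst in all_instances if inst[b] == p)


def a_posteriori(b, p):
    """P^v_b(p): Bayesian update restricted to the instances yielding v."""
    total = sum(prior(inst) for inst in iv)
    return sum(prior(inst) for inst in iv if inst[b] == p) / total


pind = all(
    a_priori("t11", p) == a_priori("t12", p)
    and a_posteriori("t11", p) == a_posteriori("t12", p)
    for p in DOMAIN
)

for p in DOMAIN:
    print(p, a_posteriori("t11", p), a_posteriori("t12", p))
print("PIND(t11, t12):", pind)
```

With these assumptions both patients end up with probability 1/2 for Hypertension and for Chest Pain and 0 for every other value, so the pair is PIND.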

We abbreviate Probabilistic Indistinguishability as PIND. PIND is a binary relation, i.e., $PIND(b_i, b_j)$ is true if and only if $b_i$ is indistinguishable from $b_j$. Obviously, due to the use of equality in the definition of PIND, PIND is reflexive, symmetric and transitive. That is, PIND is an equivalence relation. Thus, all the PA-tuples that are indistinguishable from each other form a partition of the PA-tuples. The cardinality of each set in the partition, which is called a PIND equivalence class, provides an indication of how much privacy protection can be achieved based on indistinguishability. Hence, we introduce the following metric.


Definition 2 (k-PIND) Given a released dataset v, if each PIND equivalence class has a cardinality of at least k, we then say v provides k-PIND.

Unfortunately, it is very unlikely that a practical algorithm to evaluate k-PIND can be devised. This is due in part to the difficulty of estimating an a priori probability distribution for the association of sensitive values to PA-tuple values (or to individuals, in general), and due also to the complexity of calculating the a posteriori probability distribution (shown to be intractable later in the paper). Hence, in the following subsections we propose two alternative notions, based on “local” properties and hence more practical for automatic evaluation. We also show under which conditions verifying these properties also ensures that PIND holds.

2.3 Symmetric Indistinguishability

Given a released dataset v and the corresponding subset of possible instances in $I^{PA}$ that yield v, denoted by $I^v$, the definition of Symmetric Indistinguishability requires that for each possible instance (i.e., for each possible private table yielding the same dataset), if the two PA-tuples identifying the indistinguishable individuals swap their private values while keeping other tuples unchanged, the obtained table is still one of the instances (i.e., a possible private table).

Definition 3 (Symmetric Indistinguishability) Given a released dataset v and two tuples $b_i$ and $b_j$ in $\Pi_{PA}(Tbl)$, we say the tuples are Symmetrically Indistinguishable with respect to v if for each $p_i, p_j \in Dom(SA)$ and instance I in $I^v$ with $(b_i, p_i)$ and $(b_j, p_j)$ in I, there exists an instance I′ in $I^v$ with $I' = (I \setminus \{(b_i, p_i), (b_j, p_j)\}) \cup \{(b_i, p_j), (b_j, p_i)\}$.

We abbreviate Symmetric Indistinguishability as SIND. In the sequel, we say two PA-tuples $t_1[PA]$ and $t_2[PA]$ can swap their private values in an instance, or simply $t_1[PA]$ swaps with $t_2[PA]$, if the resulting instance can still yield v. Note that such a swap is required for all the instances yielding v, hence this definition is in terms of v, not the current table Tbl (although we used the projection $\Pi_{PA}(Tbl)$ in the definition, this projection is not Tbl itself and it is assumed publicly known). In other words, for two individuals (tuples in $\Pi_{PA}(Tbl)$) to be SIND is to be able to swap their private values in all the possible private tables, including Tbl, that would yield the same released dataset. Note that it is neither sufficient nor necessary for two tuples to have the same private value in Tbl to be SIND, as shown in Example 3.

Example 3 Consider the released view v in Figure 3 obtained from the table in Figure 1.

      Zip    Diagnosis
t9    22032  Hypertension
t10   22032  Chest Pain
t11   22033  Hypertension
t12   22033  Chest Pain

Figure 3: A released view $\Pi_{Zip,Diagnosis}\,\sigma_{Zip='22032' \text{ or } '22033'}(Tbl)$.

The two PA-tuples $t_9[PA]$ and $t_{10}[PA]$ are SIND, because they can swap their Diagnosis values in any instance that yields v while still yielding the same v. Similarly, the two PA-tuples $t_{11}[PA]$ and $t_{12}[PA]$ are also SIND. However, $t_9[PA]$ and $t_{11}[PA]$ are not SIND, even though they have the same Diagnosis value Hypertension in the current private table. To show this, consider an instance obtained by swapping the Diagnosis values of $t_9$ and $t_{10}$ in Tbl (while other tuples remain unchanged). So now $t_9$ has Chest Pain while $t_{10}$ has Hypertension. Denote the new instance Tbl′. Clearly, Tbl′ also yields the view v. However, in Tbl′, if we swap the Diagnosis value of $t_9$ (i.e., Chest Pain) with that of $t_{11}$ (i.e., Hypertension), then both Zip = 22032 tuples in the resulting view will have Hypertension. Therefore, the new instance obtained from Tbl′ does not yield v, and hence $t_9[PA]$ and $t_{11}[PA]$ are not SIND.
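The two swaps discussed in Example 3 can be replayed directly. This Python sketch keeps only the four tuples selected by the view and represents an instance as a dictionary from tuple name to Diagnosis; `view` and `swap` are hypothetical helper names introduced for illustration.

```python
from collections import Counter

# Zip codes of the four PA-tuples selected by the view of Figure 3.
ZIP = {"t9": "22032", "t10": "22032", "t11": "22033", "t12": "22033"}


def view(instance):
    """Projection on (Zip, Diagnosis), duplicates preserved (bag semantics)."""
    return Counter((ZIP[b], p) for b, p in instance.items())


def swap(instance, bi, bj):
    """Return a copy of the instance in which bi and bj exchange private values."""
    swapped = dict(instance)
    swapped[bi], swapped[bj] = instance[bj], instance[bi]
    return swapped


tbl = {
    "t9": "Hypertension",
    "t10": "Chest Pain",
    "t11": "Hypertension",
    "t12": "Chest Pain",
}
v = view(tbl)

# Step 1: swap t9 and t10 inside their Zip group; the result still yields v.
tbl_prime = swap(tbl, "t9", "t10")
print(view(tbl_prime) == v)

# Step 2: in Tbl', swap t9 and t11 across Zip groups; v is no longer produced.
broken = swap(tbl_prime, "t9", "t11")
print(view(broken) == v)
print(broken)
```

The second comparison fails because both Zip 22032 tuples now carry Hypertension, which is exactly the witness instance used in the example.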

The definition of SIND requires a complete symmetry between two PA-tuples in terms of their private values. The sets of possible private values of the SIND tuples are the same, because in each possible instance two SIND PA-tuples can swap their private values without changing the views. Furthermore, the definition based on swapping makes SIND between two PA-tuples independent of other PA-tuples. That is, even if attackers can guess the private values of all other PA-tuples, they still cannot distinguish between these two PA-tuples because the two PA-tuples can still swap their private values without affecting the views.

Symmetric indistinguishability as defined in Definition 3 implies PIND under a zero-prior-knowledge assumption, i.e., when no a priori distribution of the association of individuals with specific sensitive values is known. More technically, by zero-prior-knowledge we mean that for each pair of distinct $b_i$ and $b_j$ in $\Pi_{PA}(Tbl)$, $P_{b_i}(p)$ and $P_{b_j}(p)$ are equal and independent for each sensitive value p.

Referring to our running example in Figure 1, this assumption means that each patient has the same probability to have a particular diagnosis. Also, the probability of one patient to be associated with a diagnosis is independent from the probability of another patient.

The following proposition says SIND implies PIND under the zero-prior-knowledge assumption. However, PIND does not imply SIND. Intuitively, this is because even though two PA-tuples have the same probability distribution for their private values, this does not guarantee that they are able to switch their private values without affecting the released data v, as discussed earlier.

Proposition 1 Under the zero-prior-knowledge assumption, given a released dataset v, if two values $b_i$ and $b_j$ in $\Pi_{PA}(Tbl)$ are SIND with respect to v, then $b_i$ and $b_j$ are PIND with respect to v; the converse implication does not hold.


Proof We show that if $b_i$ and $b_j$ are SIND, then $P^v_{b_i}(p) = P^v_{b_j}(p)$ holds for any $p \in Dom(SA)$. The probability of a tuple is equal to the sum of probabilities of all the instances in $I^v$ that include the tuple. Let $\mathcal{I} \subseteq I^v$ be the set of instances containing $(b_i, p)$, and $\mathcal{I}' \subseteq I^v$ the set of instances containing $(b_j, p)$. We have two facts. First, there exists a bijection between $\mathcal{I}$ and $\mathcal{I}'$. Second, if $I \in \mathcal{I}$ maps to $I' \in \mathcal{I}'$, we have $P(I) = P(I')$.

To justify the claims, we map any $I \in \mathcal{I}$ to $I' \in \mathcal{I}'$ by switching the private values p and p′ associated with $b_i$ and $b_j$ in I, respectively. The only difference between I and I′ is that I has $(b_i, p)$ and $(b_j, p')$, whereas I′ has $(b_i, p')$ and $(b_j, p)$. Under the zero-prior-knowledge assumption, the probability of an instance is equal to the product of the probabilities of all tuples in the instance. Hence, $P_{b_i}(p) = P_{b_j}(p)$ and $P_{b_i}(p') = P_{b_j}(p')$ jointly imply $P(I) = P(I')$.

The mapping between $\mathcal{I}$ and $\mathcal{I}'$ is injective. Any two different instances $I_1 \in \mathcal{I}$ and $I_2 \in \mathcal{I}$ must have at least one PA value associated with different private values. If this PA value is neither $b_i$ nor $b_j$, then $I_1$ and $I_2$ cannot map to the same instance in $\mathcal{I}'$, because the mapping preserves this difference (the switching does not change private values of any PA values other than $b_i$ or $b_j$). Similarly, if $I_1$ and $I_2$ have $b_i$ or $b_j$ associated with different private values, then they will also map to different instances in $\mathcal{I}'$. The mapping is also surjective, because the definition of SIND requires that switching the private values of $b_i$ and $b_j$ in any instance $I' \in \mathcal{I}'$ yields a valid instance $I \in \mathcal{I}$.

We use a counterexample to show that PIND does not imply SIND. Consider the query in Figure 3. Under the zero-prior-knowledge assumption, each of the four tuples $t_9$, $t_{10}$, $t_{11}$, and $t_{12}$ is equally likely to have Hypertension or Chest Pain as its private value. Hence, all the four tuples are PIND. However, $t_9$ and $t_{10}$ are not SIND to $t_{11}$ and $t_{12}$. Consider an instance where $t_9$ has Hypertension, $t_{10}$ has Chest Pain, $t_{11}$ has Chest Pain and $t_{12}$ has Hypertension. Switching the private values of $t_9$ and $t_{11}$ results in a new instance where both $t_9$ and $t_{10}$ have Chest Pain. This instance cannot yield the same result for the query in Figure 3. Thus, $t_9$ and $t_{11}$ are not SIND whereas they are PIND. □

Proposition 2 Symmetric Indistinguishability (SIND) is an equivalence binary relation.

Proof The relation is clearly reflexive and symmetric. We prove transitivity as follows. Intuitively we show that if (1) PA-tuple $b_1$ can switch with PA-tuple $b_2$ and (2) $b_2$ can switch with $b_3$, then (3) $b_1$ can switch with $b_3$. Indeed, by Definition 3, if (1) holds then for each I in $I^v$, if $b_1$ is associated with $p_1$ and $b_2$ is associated with $p_2$ in I, there exists I′ in $I^v$ that associates $b_1$ with $p_2$ and $b_2$ with $p_1$. Similarly, for (2), for each I in $I^v$, if $b_2$ is associated with $p_2$ and $b_3$ is associated with $p_3$ in I, there exists I′ in $I^v$ that associates $b_2$ with $p_3$ and $b_3$ with $p_2$. Consider $I_1$ in $I^v$, in which $b_1$, $b_2$ and $b_3$ are associated with $p_1$, $p_2$ and $p_3$, respectively. Since (1) holds, we find $I_1'$ in $I^v$ obtained by swapping the private values of $b_1$ and $b_2$ in $I_1$. Now since $I_1'$ is in $I^v$, and (2) holds, we can find $I_1''$ in $I^v$ obtained by swapping the private values of $b_2$ and $b_3$ in $I_1'$. In $I_1''$, $b_1$ has $p_2$, $b_2$ has $p_3$ and $b_3$ has $p_1$. Since $I_1''$ is in $I^v$ and (1) holds, we can find $I_1'''$ in $I^v$ obtained by swapping the private values of $b_1$ and $b_2$ in $I_1''$. Now inspect $I_1'''$: $b_1$ has $p_3$, $b_2$ has $p_2$ again, and $b_3$ has $p_1$. The only difference between $I_1'''$ and $I_1$ is that the private values of $b_1$ and $b_3$ are swapped. Since $I_1$ has been taken as an arbitrary instance in $I^v$, the above reasoning holds for all instances, and hence, by Definition 3, (3) holds. □

Since SIND is an equivalence relation, similarly to PIND, it will induce a partition on the set of individuals identifying SIND equivalence classes. A metric k-SIND is defined in a similar fashion as k-PIND.

Definition 4 (k-SIND) Given a released dataset v, if each SIND equivalence class has a cardinality of at least k, we then say v provides k-SIND.

By Proposition 1, it easily follows that, under the zero-prior-knowledge assumption, an individual who is part of a k-SIND equivalence class is also part of a k-PIND one, and more generally, that if a released dataset provides k-SIND, it also provides k-PIND. The opposite does not hold.

Example 4 Consider the released view in Figure 3 obtained from the table in Figure 1. Under the zero-prior-knowledge assumption, observing the view we know that the private values of $t_9[PA]$, $t_{10}[PA]$, $t_{11}[PA]$ and $t_{12}[PA]$ have the same a posteriori distribution over {Hypertension, Chest Pain}. Thus, this view provides 4-PIND. But the view only provides 2-SIND, because for each instance, it is only possible to switch private values between $t_9[PA]$ and $t_{10}[PA]$, and between $t_{11}[PA]$ and $t_{12}[PA]$.
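The gap between 4-PIND and 2-SIND in Example 4 can be checked by exhaustive enumeration. The Python sketch below is a brute-force illustration, not one of the verification algorithms of the paper: it assumes a three-value domain, a uniform independent prior (so posteriors reduce to counting instances), and only the four tuples selected by the view. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

ZIP = {"t9": "22032", "t10": "22032", "t11": "22033", "t12": "22033"}
ACTUAL = {
    "t9": "Hypertension",
    "t10": "Chest Pain",
    "t11": "Hypertension",
    "t12": "Chest Pain",
}
DOMAIN = ["Hypertension", "Chest Pain", "Cold"]  # assumed Dom(SA)
NAMES = list(ZIP)


def view(instance):
    """Projection on (Zip, Diagnosis), duplicates preserved."""
    return Counter((ZIP[b], p) for b, p in instance.items())


V = view(ACTUAL)
# I^v: every assignment of domain values to the four tuples that yields v.
IV = [
    inst
    for inst in (dict(zip(NAMES, vals)) for vals in product(DOMAIN, repeat=len(NAMES)))
    if view(inst) == V
]


def posterior(b):
    """A posteriori distribution of b under a uniform prior (instance counting)."""
    return {p: Fraction(sum(1 for inst in IV if inst[b] == p), len(IV)) for p in DOMAIN}


def pind(bi, bj):
    # Priors are equal by the zero-prior-knowledge assumption.
    return posterior(bi) == posterior(bj)


def sind(bi, bj):
    """Definition 3: swapping bi and bj must keep every instance inside I^v."""
    for inst in IV:
        swapped = dict(inst)
        swapped[bi], swapped[bj] = inst[bj], inst[bi]
        if view(swapped) != V:
            return False
    return True


def classes(related):
    """Group tuples into equivalence classes of the given relation."""
    groups = []
    for b in NAMES:
        for group in groups:
            if related(group[0], b):
                group.append(b)
                break
        else:
            groups.append([b])
    return groups


k_pind = min(len(g) for g in classes(pind))
k_sind = min(len(g) for g in classes(sind))
print(classes(pind), k_pind)
print(classes(sind), k_sind)
```

The enumeration is exponential in the number of tuples, so it is only meant to make the definitions concrete on this toy view.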

2.4 Relationship with k-Anonymity

In this subsection, we discuss the relationship between k-SIND and k-anonymity. In the k-anonymity literature (e.g., [27, 28, 24, 5, 20]), the released data is an anonymized table. Anonymization is a function from quasi-IDs to recoded quasi-IDs, and the anonymization process (the function M in Section 2.1) is to replace quasi-IDs with recoded quasi-IDs. We assume that the anonymization algorithm and the input quasi-IDs are known. In fact, we make a stronger assumption, called “mapping assumption”, which says that (1) each quasi-ID maps to one recoded quasi-ID and (2) given a recoded quasi-ID, attackers know which set of quasi-IDs map to it.

Example 5 Consider the table and the corresponding anonymized view in Figure 4. The tuples on (Zip, Race) are quasi-IDs. Under the mapping assumption, attackers know which quasi-ID maps to which recoded quasi-ID. For instance, (22031, White) maps to (2203∗, ∗) but not (220∗∗, White). (In contrast, without the mapping assumption, only from the anonymized table, (22031, White) may map to either (2203∗, ∗) or (220∗∗, White).)

Note that the above mapping assumption always holds when quasi-ID attribute values are uniformly generalized using the same degree of generalization (e.g., 2 digits of the zip code are obfuscated in all records), and this is a quite common practice. Under the above assumption, considering all attributes PA as quasi-IDs, we have the following result about the relationship between k-SIND and k-anonymity.


Zip    Race   Diagnosis
22021  White  Cold
22031  White  Obesity
22032  White  AIDS
22033  Black  Headache

Zip    Race   Diagnosis
220∗∗  White  Cold
220∗∗  White  Obesity
2203∗  ∗      AIDS
2203∗  ∗      Headache

Figure 4: A private table, and an anonymized version that is 2-SIND but only has 1-anonymity.

Proposition 3 Under the mapping assumption, given a released dataset in the form of a table v such that PA attributes are quasi-ID, if v provides k-anonymity, with k ≥ 2, then v provides k-SIND, while the opposite does not hold.

Intuitively, if v provides k-anonymity, then at least k PA-tuple values map to each recoded quasi-ID in v. In any instance yielding v, suppose two PA-tuples $b_1$ and $b_2$ map to the same recoded quasi-ID. Then, swapping the private values of $b_1$ and $b_2$ in the original table gives an instance yielding the same v. Therefore, v provides k-SIND.

Example 6 shows that a view may be k-SIND but not k-anonymous.

Example 6 The released view in Figure 5 is 2-SIND since the first two tuples form a 2-SIND set, and so do the last two tuples. However, the view is not 2-anonymous because the value 2202∗ for the PA attribute appears only once in the released view.

Zip    Age  Diagnosis
22021  39   Cold
22041  42   Cold
22032  35   Chest Pain
22033  26   Headache

Zip    Diagnosis
2202*  Cold
2204*  Cold
2203*  Chest Pain
2203*  Headache

Figure 5: A private table and a 2-SIND released view.
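The failure of 2-anonymity in Example 6 can be reproduced by counting how often each recoded quasi-ID occurs in the released view. The following is a minimal sketch in Python; the function name `anonymity_level` and the list encoding of the released view of Figure 5 are assumptions made for illustration only.

```python
from collections import Counter

def anonymity_level(view_rows, quasi_id_len):
    """Largest k such that every recoded quasi-ID occurs at least k times."""
    counts = Counter(row[:quasi_id_len] for row in view_rows)
    return min(counts.values())

# Released view of Figure 5, encoded as (recoded Zip, Diagnosis) pairs.
figure5_view = [
    ("2202*", "Cold"),
    ("2204*", "Cold"),
    ("2203*", "Chest Pain"),
    ("2203*", "Headache"),
]

# Only the first column (the recoded Zip) is treated as the quasi-ID.
level = anonymity_level(figure5_view, quasi_id_len=1)
print(level)
```

The count only captures k-anonymity; deciding k-SIND additionally requires reasoning about the instances that yield the view, as discussed in Section 3.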

By definition, k-anonymity is applicable only to a single anonymized table, but not to other kinds of released data such as multiple relational views, while this paper shows that the notion of indistinguishability applies to sets of views too.

2.5 Restricted Symmetric Indistinguishability

Since SIND requires symmetry in terms of all possible private values, it is a rather strict metric. We define another metric based on the symmetry in terms of not all possible private values but only a subset that includes the actual private values in the current private table. If PA-tuples are symmetric in terms of this subset of private values, even though they are not symmetric in terms of other values, we may still take them as indistinguishable. The intuition here is that we intend to provide more protection on a specific set of private values.

Suppose each PA-tuple is associated with a set of private values including its current private value. These sets form a collection. More specifically, we call a collection P of Dom(SA) value sets P1, ..., Pn a private value collection, where n = |ΠPA(Tbl)| and ΠPA(Tbl) = {b1, ..., bn}, if for each s, where s = 1, ..., n, ΠP σPA=bs(Tbl) ∈ Ps.

If two PA-tuples are symmetric with respect to a private value collection, then we consider them indistinguishable according to Restricted Symmetric Indistinguishability, which is abbreviated as RSIND.

Definition 5 (RSIND) Given a released dataset v on the private table Tbl and a private value collection P1, ..., Pn, we say two PA-tuples bi and bj are RSIND with respect to that collection if the following conditions are satisfied: (1) Pi = Pj and (2) for each pi in Pi and each pj in Pj, if (bi, pi) and (bj, pj) are in an instance I of Iv, then I′ is in Iv, where I′ = (I − {(bi, pi), (bj, pj)}) ∪ {(bi, pj), (bj, pi)}.

In this definition, unlike SIND, which requires swapping all possible sensitive attribute values, RSIND only requires swapping private values in a subset including the actual private values. RSIND becomes SIND if Pi = Dom(SA) for each i.

Example 7 Consider the two views in Figure 6, which are obtained from the private table Tbl in Figure 1. From the views, we can deduce that in Tbl, t1[PA] cannot be associated with Obesity but can be associated with Cold and AIDS, while t2[PA] can be associated with all three diagnoses. Clearly, t1[PA] and t2[PA] are not SIND. However, there exists a private value collection P1, ..., P4 with P1 = P2 = {Cold, AIDS} and P3 = P4 = {Cold, AIDS, Obesity}, such that t1[PA] and t2[PA] are RSIND with respect to that collection. Indeed, P1 and P2 are identical, and they both include the current private values of t1[PA] and t2[PA]: Cold and AIDS. In any instance yielding the views, if t1[PA] and t2[PA] have Cold and AIDS, or AIDS and Cold, respectively, then swapping their private values results in an instance yielding the same views.
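Example 7 can be checked by brute force on a small scale. The sketch below enumerates the instances that yield the two views of Figure 6 and tests the swap condition of Definition 5 directly. It rests on assumptions that are not stated in the text: only t1, ..., t4 are selected by the first view, Dom(SA) is restricted to the three diagnoses that appear, and all function names are illustrative.

```python
from itertools import permutations

# Multiset released by the first view of Figure 6 (over t1..t4, an assumption).
FIRST_VIEW = ("Cold", "AIDS", "Obesity", "AIDS")
DOMAIN = frozenset(FIRST_VIEW)  # assumed Dom(SA) for this toy example

def possible_instances():
    """All assignments of diagnoses to (t1, t2, t3, t4) yielding both views."""
    result = set()
    for values in permutations(FIRST_VIEW):
        if values[0] != "Obesity":  # second view: t1 does not have Obesity
            result.add(values)
    return result

def rsind(i, j, p_i, p_j, instances):
    """Definition 5: P_i = P_j, and each swap of values drawn from them stays in I_v."""
    if p_i != p_j:
        return False
    for inst in instances:
        if inst[i] in p_i and inst[j] in p_j:
            swapped = list(inst)
            swapped[i], swapped[j] = inst[j], inst[i]
            if tuple(swapped) not in instances:
                return False
    return True

def sind(i, j, instances):
    """SIND is RSIND with P_i = P_j = Dom(SA)."""
    return rsind(i, j, DOMAIN, DOMAIN, instances)

I_v = possible_instances()
small = frozenset({"Cold", "AIDS"})
print(sind(0, 1, I_v))                    # t1[PA] and t2[PA] under SIND
print(rsind(0, 1, small, small, I_v))     # t1[PA] and t2[PA] w.r.t. {Cold, AIDS}
print(rsind(2, 3, DOMAIN, DOMAIN, I_v))   # t3[PA] and t4[PA]
```

Under these assumptions the first check fails and the other two succeed, in line with Example 7. The enumeration is exponential in general and is meant only to make the definitions concrete.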

Given a private value collection P1, ..., Pn, RSIND is also a binary equivalence relation, and hence induces a partition over the PA-tuples; each set in the partition is called an RSIND equivalence class with respect to the collection.

Definition 6 (k-RSIND) Given a released dataset v, if there exists a private value collection P such that each RSIND equivalence class in the induced partition has a cardinality of at least k, we then say v provides k-RSIND.

The following result formally states that the k-SIND property is stronger than the k-RSIND one.

Proposition 4 A released dataset v that provides k-SIND also provides k-RSIND.


Proof If v provides k-SIND, we can set P1, ..., Pn to be the collection of all possible private values, i.e., Ps = {p | ∃I ∈ Iv, (bs, p) ∈ I}, where s = 1, ..., n. Then, any two tuples that are SIND are RSIND with respect to P1, ..., Pn; hence the SIND partition is the RSIND partition with respect to P1, ..., Pn. Clearly, the cardinality of each equivalence class in this RSIND partition is at least k since each equivalence class in the SIND partition is so. □

By Propositions 3 and 4 we also derive that, under the mapping assumption, a released table v that provides k-anonymity also provides k-RSIND.

From the definition of RSIND, we can derive some interesting properties. Given a set of tuples T in the private table Tbl, each private value collection with respect to which the PA-tuples in ΠPA(T) are RSIND from each other must include all the actual private values of T. Moreover, if there exists a collection with respect to which the tuples in ΠPA(T) are RSIND from each other, then they are also RSIND from each other with respect to that set of actual private values. More formally, we have Proposition 5.

Proposition 5 Given a private value collection P = P1, ..., Pn, a released dataset v, and a set T of tuples in the private table, if the tuples in ΠPA(T) are RSIND from each other with respect to P, then we have the following two facts. First, for each bi in ΠPA(T), if ΠP(T) is its set of private values, ΠP(T) ⊆ Pi. Second, for each bi in ΠPA(T), if we replace Pi with ΠP(T) to get a new private value collection P′, then all the PA-tuples in ΠPA(T) are still RSIND with respect to P′.

Proof To prove the first claim, let the tuples in ΠPA(T) be RSIND from each other w.r.t. P. By the definition of private value collection, Pi ∈ P must include the private value of bi. The definition of RSIND further requires that for any value bj ∈ ΠPA(T), Pj = Pi. Therefore, Pi must include all the private values of the tuples in T.

To prove the second claim, let I ⊆ Iv be the set of instances that have their private values in P, and let I′ ⊆ Iv be the set of instances that have their private values in P′. By the first claim, since each set P′i in P′ is a subset of Pi in P, we have I′ ⊆ I. Since the tuples in ΠPA(T) are RSIND from each other, by the definition of RSIND, each two PA-tuples in ΠPA(T) can swap their private values in each instance in I. Then, since we have I′ ⊆ I, each two PA-tuples in ΠPA(T) must be able to swap their private values in each instance in I′. Therefore, by the definition of RSIND, the tuples in ΠPA(T) are still RSIND w.r.t. P′. □

Example 8 illustrates the property of RSIND stated in Proposition 5.

Example 8 Consider Figure 6. t3[PA] and t4[PA] are RSIND with respect to the private value collection. Hence, in the collection, the corresponding sets of t3[PA] and t4[PA] are identical and contain both their actual private values, Obesity and AIDS. If we take their actual private values as a collection, which means dropping Cold from their corresponding sets, t3[PA] and t4[PA] are still RSIND.


(a) The first view

     Diagnosis
t1   Cold
t2   AIDS
t3   Obesity
t4   AIDS

(b) The SIND partition

t1[PA]  t2[PA]  t3[PA]  t4[PA]

(c) The RSIND partition w.r.t. a collection

t1[PA]  {Cold, AIDS}
t2[PA]  {Cold, AIDS}
t3[PA]  {Cold, AIDS, Obesity}
t4[PA]  {Cold, AIDS, Obesity}

Figure 6: Two released views ΠDiagnosis σZip=′22030′(Tbl) and ΠDiagnosis σt1 and Diagnosis=′Obesity′(Tbl) = ∅. The second view makes t1 not have Obesity, which others may have. The views do not provide 2-SIND, but do provide 2-RSIND.

Proposition 5 also implies the following property of RSIND. If the tuples in ΠPA(T) are RSIND from each other, then by Proposition 5, the tuples in ΠPA(T) are RSIND from each other w.r.t. ΠP(T). By a repeated use of the definition of RSIND, for each set of tuples T′ such that ΠPA(T′) = ΠPA(T) and the private values in T′ are a permutation of the private values (with duplicates preserved) in T, we know there exists an instance I in Iv with T′ ⊆ I. This explains why we say these tuples are indistinguishable in terms of the actual private values.

Example 9 Consider the SIND partition of Figure 6(b) as an RSIND partition (note again that there are many RSIND partitions with different private value collections and the SIND partition is one of them). We have that t2[PA], t3[PA] and t4[PA] are RSIND from each other with respect to P2 = P3 = P4 = {Obesity, AIDS}. Then for each of the three different (repeated) permutations of t2[PA], t3[PA], and t4[PA] with Obesity, AIDS and AIDS values (i.e., 〈(t2[PA], Obesity), (t3[PA], AIDS), (t4[PA], AIDS)〉, 〈(t2[PA], AIDS), (t3[PA], Obesity), (t4[PA], AIDS)〉, and 〈(t2[PA], AIDS), (t3[PA], AIDS), (t4[PA], Obesity)〉), there exists at least one instance in Iv that contains that permutation.

The size of each set in a private value collection matters in measuring privacy disclosure, which is not reflected in k-RSIND. Generally, the more private values in the collection, the better the indistinguishability we achieve, since we ignore fewer private values that may make PA-tuples distinguishable. Also, more private values may mean better uncertainty, but as explained in Section 5, this is an orthogonal issue.

3 Checking Database Views

In this section, we focus on checking released data that are in the form of a view set for indistinguishability. A view set is a pair (V, v), where V is a list of selection-projection queries (q1, q2, ..., qn) on Tbl, and v is a list of relations (r1, r2, ..., rn) that are the results, with duplicates preserved, of the corresponding queries. We may abbreviate (V, v) to v if V is understood. In this paper, “view”, “query” and “query result” are used interchangeably when no confusion arises. Note that all query results preserve duplicates, hence are multisets, and all relational operations in this paper are on multisets.

3.1 k-PIND verification

As anticipated in Section 2.2, checking if a view set is k-PIND is intractable. This is implied by the intractability of verifying whether a private value can be associated with a particular PA-tuple by just looking at the view set.

Lemma 1 Given a view dataset v containing only selection and projection, a PA-tuple b, and a private value p in Dom(SA), it is NP-hard to decide whether there exists an instance I ∈ Iv such that (b, p) is in I.

Proof Sketch. The Boolean auditing problem has been shown to be coNP-hard [18]. We show a reduction to our problem from its complement (the Complement Boolean Auditing Problem), defined as follows: Given n 0-1 variables {x1, ..., xn}, a family of subsets S = {S1, ..., Sm} of {1, ..., n}, m integers b1, ..., bm, and any i ≤ n, among all 0-1 solutions to the system of equations $\sum_{i \in S_j} x_i = b_j$, $j = 1, \ldots, m$, can every xi (1 ≤ i ≤ n) be both 0 and 1?

Consider a query that selects n tuples (the selection condition is on the public attributes) and projects them on the private attribute of Boolean type. Knowing the sum of n 0-1 variables is equivalent to knowing the result of this query, because the sum of 0-1 variables is equal to the number of 1s in the query result. Hence, we can reduce the Complement Boolean Auditing Problem to our problem by having the same database schema, a private attribute of the Boolean type, and queries having the same selection condition on the public attributes and projection on the private attribute. It then follows that each instance yielding the view set v in the Complement Boolean Auditing Problem immediately yields the view set v′ in our problem and vice versa. Therefore, our problem is NP-hard. □
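To make the question behind the reduction concrete, the sketch below enumerates all 0-1 solutions of a small system of sum equations and records which values each variable can take. It is an exponential brute-force illustration only (the problem is hard in general); the function name `attainable_values` and the 0-based indexing are assumptions made for this example.

```python
from itertools import product

def attainable_values(n, subsets, sums):
    """For each variable, collect the values it takes over all 0-1 solutions."""
    attainable = [set() for _ in range(n)]
    for x in product((0, 1), repeat=n):
        # Keep x only if every sum query sum_{i in S_j} x_i = b_j is satisfied.
        if all(sum(x[i] for i in s) == b for s, b in zip(subsets, sums)):
            for i, value in enumerate(x):
                attainable[i].add(value)
    return attainable

# Two sum queries over three Boolean private values: x0 + x1 = 1 and x1 + x2 = 2.
determined = attainable_values(3, [{0, 1}, {1, 2}], [1, 2])

# A single sum query x0 + x1 = 1 leaves every variable open.
open_case = attainable_values(3, [{0, 1}], [1])

print(determined)
print(open_case)
```

In the first system every variable is forced to a single value, so each private value is disclosed; in the second, every variable can still be either 0 or 1.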

We can now have a complexity result on the problem of checking k-PIND.

Theorem 1 Given a view set v, checking whether v provides k-PIND is coNP-hard.

Proof Sketch. We reduce the complement of the problem considered in Lemma 1 (that is, determining if a tuple (b, p) appears in at least one instance in Iv) to the problem of checking k-PIND. Given any table Tbl and view set v, we construct another table Tbl′ and view set v′, such that v′ violates 2-PIND iff (b, p) appears in at least one instance in Iv. Because it is NP-hard to determine the latter by Lemma 1, it is coNP-hard to determine if v′ satisfies 2-PIND.


We construct Tbl′ and v′ as follows. For each tuple t ∈ Tbl, (1) insert both t and a new tuple t′ into Tbl′, such that t′ has a public value b′ and a private value p′ that are distinct from those of all other tuples, including the newly inserted ones (the domains of PA and P can be expanded if necessary); (2) modify each query q that selects t before inserting it into v′, such that the modified query selects both t and t′; (3) insert into v′ a new Boolean query q′ whose selection condition requires the PA-tuple to be t[PA] and the P value to be p′; clearly, q′ returns true. (4) Finally, for the given PA-tuple b, suppose step (1) has inserted a corresponding tuple t′ whose PA-tuple is b′; we insert into v′ a new Boolean query whose selection condition requires the PA-tuple to be b′ and the private value to be p; clearly, this query returns false.

The view set v′ has the following properties. The queries inserted by step (2) and step (3) ensure that any private values a tuple t ∈ Tbl can have will also be possible for t ∈ Tbl′ to have, including the private value p′ of t′. Moreover, the query inserted by step (3) implies that the unique value p′ can only be taken by tuples t and t′. Hence, the set of private values that t can have w.r.t. v′ will be different from the set of private values of any other tuple except t′. Consequently, t can only be indistinguishable from t′ w.r.t. v′. The query inserted by step (4) implies that t′ cannot have the private value p. Under the zero-knowledge assumption, t cannot be indistinguishable from t′, and hence v′ violates 2-PIND, iff t can have the private value p. Because it is NP-hard to decide the latter by Lemma 1, it is coNP-hard to determine k-PIND even for k = 2. □

Considering that it is difficult to get the a priori probability distribution and to calculate the a posteriori probability, we will not further study how to check for k-PIND but will concentrate on the other two metrics.

3.2 k-SIND verification methods

In general, checking for k-SIND turns out to be intractable as well, even though there are interesting tractable subcases. We first provide the intractability result and then present in Section 3.2.2 the basic mechanism adopted for two subcases, discussed in Sections 3.2.3 and 3.2.4, respectively. In Section 3.2.4, we also present two sufficient conditions for k-SIND.

3.2.1 Complexity

Checking for k-SIND is intractable. Indeed, this also follows from the fact that it is intractable to know whether a private value can be associated with a particular PA-tuple by just looking at the view set.

Theorem 2 Given a view set v, checking whether v provides k-SIND is coNP-hard.

Proof Sketch. This theorem is proved by a reduction similar to that in Theorem 1. Referring to the construction of Tbl′ in the proof of Theorem 1, each tuple t can only be SIND to the new tuple t′. Add a new Boolean query to select the public value b′ of t′ and the private value p of t, which returns false. Then, b and b′ cannot be SIND iff b can be associated with p, because switching the private values of t and t′ will result in an invalid instance w.r.t. the view set. Therefore, determining k-SIND is coNP-hard. □

3.2.2 A fundamental property used for k-SIND verification

Proposition 6 introduces an important property of SIND that will be used in the proposed verification methods.

Proposition 6 Given a view set v and two tuples b1 and b2 in ΠPA(Tbl), b1 and b2 are SIND with respect to v if and only if, for each pair of SA values p1 and p2 associated with b1 and b2, respectively, in an instance in Iv, and each query q in v, we have q({(b1, p1), (b2, p2)}) = q({(b1, p2), (b2, p1)}).

Proof We recall that all query results are multisets and relational operations are multiset operations. Assume b1 and b2 are SIND. Then, for each view q in v, q({(b1, p1), (b2, p2)} ∪ Io) = q({(b1, p2), (b2, p1)} ∪ Io), where Io is an instance such that {(b1, p1), (b2, p2)} ∪ Io ∈ Iv. Since q only contains selection and projection, q({(b1, p1), (b2, p2)} ∪ Io) = q({(b1, p1), (b2, p2)}) ∪ q(Io) and q({(b1, p2), (b2, p1)} ∪ Io) = q({(b1, p2), (b2, p1)}) ∪ q(Io). Thus, we have q({(b1, p1), (b2, p2)}) = q({(b1, p2), (b2, p1)}). A similar reasoning can be applied to prove the other direction. □

We call the equation in Proposition 6 the swap equation. This proposition suggests that SIND for selection-projection views has the property of being “local”. Indeed, given two PA-tuples, in order to check SIND we do not need to see other PA-tuples.

More specifically, this proposition says that given v and two SIND PA-tuples b1 and b2, for each query q in v, if the tuples (b1, p1) and (b2, p2) are in an instance that yields v, and we swap their private values to get the two new tuples, i.e., (b1, p2) and (b2, p1), then we know that q yields the same result on {(b1, p1), (b2, p2)} as on {(b1, p2), (b2, p1)}. This is a necessary and sufficient condition.

As a simple example, consider two PA-tuples b1 and b2, and suppose we know that, in all the instances in Iv, they are associated either with p1 and p2, respectively, or with p2 and p3, respectively. Then b1 and b2 are SIND if and only if

\[
q(\{(b_1, p_1), (b_2, p_2)\}) = q(\{(b_1, p_2), (b_2, p_1)\})
\ \ \&\ \
q(\{(b_1, p_2), (b_2, p_3)\}) = q(\{(b_1, p_3), (b_2, p_2)\}).
\]

To satisfy the swap equation

\[
q(\{(b_1, p_1), (b_2, p_2)\}) = q(\{(b_1, p_2), (b_2, p_1)\})
\]

there are only two possibilities: one is

\[
q((b_1, p_1)) = q((b_1, p_2)) \ \ \&\ \ q((b_2, p_2)) = q((b_2, p_1))
\]

and the other is

\[
q((b_1, p_1)) = q((b_2, p_1)) \ \ \&\ \ q((b_1, p_2)) = q((b_2, p_2)).
\]

If a view has a projection on SA and p1 is distinct from p2, we can easily prove that we only need to check the latter condition. Moreover, if the projection of q contains SA, and b1 and b2 have more than one possible private value (i.e., there is an instance in Iv such that b1 and b2 have two different values), then a necessary and sufficient condition for b1 and b2 being SIND is that q((b1, p)) = q((b2, p)) holds for each possible value p.
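The swap equation of Proposition 6 can be evaluated mechanically for a concrete selection-projection query. The sketch below is a minimal illustration that represents query results as multisets via `collections.Counter`; the helper names, the dictionary encoding of tuples, and the sample view (projection on Salary with the condition Charge >= Salary, values in thousands) are assumptions chosen for the example.

```python
from collections import Counter

def run_query(projection, condition, tuples):
    """Selection-projection query with duplicates preserved (a multiset)."""
    return Counter(
        tuple(t[a] for a in projection) for t in tuples if condition(t)
    )

def swap_equation_holds(projection, condition, b1, b2, p1, p2, sa="Salary"):
    """Check q({(b1,p1),(b2,p2)}) == q({(b1,p2),(b2,p1)}) for one query q."""
    original = [dict(b1, **{sa: p1}), dict(b2, **{sa: p2})]
    swapped = [dict(b1, **{sa: p2}), dict(b2, **{sa: p1})]
    return (run_query(projection, condition, original)
            == run_query(projection, condition, swapped))

# Hypothetical view: project on Salary, select tuples with Charge >= Salary.
projection = ("Salary",)
condition = lambda t: t["Charge"] >= t["Salary"]

b1 = {"Charge": 100}
b2 = {"Charge": 90}

# Both salaries pass the selection for either tuple, so the swap is invisible.
print(swap_equation_holds(projection, condition, b1, b2, 80, 75))
# A salary of 95 passes for b1 but not for b2, so the swap changes the result.
print(swap_equation_holds(projection, condition, b1, b2, 95, 75))
```

Because the check involves only the two tuples being swapped, it reflects the "local" character of SIND noted above; the hard part in general is finding which value pairs (p1, p2) are actually possible.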

Example 10 Consider the view q in Figure 3 with the projection on SA. Clearly, q((t9[PA], H)) = q((t10[PA], H)) and q((t9[PA], C)) = q((t10[PA], C)), where H = Hypertension and C = ChestPain. Since H and C are the only possible values by looking at the view, we know t9[PA] and t10[PA] are SIND.

The specific k-SIND verification methods that will be presented in the paper rely on the result of Proposition 6.

3.2.3 k-SIND verification of views based on selection on PA attributes only

The case in which each query in the view set has a selection condition only on the attributes in PA is quite common, especially in statistical databases, and hence it is extensively studied with uncertainty measures [1, 18, 17]. In this case, checking for k-SIND can be done in polynomial time in the size of the private table and the number of views.

We assume the projection of each view contains the sensitive attribute SA; otherwise, no sensitive information is involved, since we are considering the case in which the selection condition does not contain SA. From Proposition 6, we can derive the following result.

Proposition 7 Given a view set v with selection conditions only on the attributes in PA, two PA-tuples b1 and b2 are SIND if for each query q = ΠXσC(Tbl) in v, we have ΠX−{SA}σC(b1) = ΠX−{SA}σC(b2). The inverse (“only if”) holds if b1 and b2 have at least two distinct possible private values.

Proof To prove the if part, we show the conditions imply that b1 and b2 are SIND w.r.t. each view q (then by Proposition 8, b1 and b2 are SIND w.r.t. the view set). If ΠX−{SA}σC(b1) = ΠX−{SA}σC(b2) holds, there are the following two cases. First, if q selects neither b1 nor b2, then b1 and b2 can take any possible values, and they will still fail the selection condition C of q after they switch their private values. Thus, q({(b1, p1), (b2, p2)}) = q({(b1, p2), (b2, p1)}) holds for any p1 and p2. Second, if q selects both b1 and b2, and b1 and b2 have the same value in the projection on attributes in PA of q, then both of them will still satisfy the selection condition of q after they switch their private values, because the selection condition is only on the attributes in PA. Moreover, q((b1, p)) = q((b2, p)) holds for any p, because b1 and b2 have the same value in the projection. Thus, we also have q({(b1, p1), (b2, p2)}) = q({(b1, p2), (b2, p1)}). Therefore, b1 and b2 are SIND.

To prove the only if part, we show that if the condition is not satisfied, then b1 and b2 are not SIND. The negation of the condition is that there exists a view q such that one of b1 and b2 makes the selection condition false and the other makes it true, or they both make the condition true but they have different values in the projection on the attributes in PA of q. Clearly, in both cases q((b1, p)) = q((b2, p)) cannot hold. Because q contains projection on SA, by the conclusion from Proposition 6, b1 and b2 cannot be SIND if q((b1, p)) = q((b2, p)) is not true (assuming b1 and b2 have at least two distinct possible private values). □

(a) By the first view

(22030, White)       t1[PA], t2[PA], t3[PA]
(22030, Black)       t4[PA]
Others not selected  t5[PA], t6[PA], ..., t12[PA]

(b) By the second view

(White, Male)        t1[PA], t2[PA], t3[PA], t8[PA], t11[PA], t12[PA]
(White, Female)      t6[PA]
Others not selected  t4[PA], t5[PA], t7[PA], t9[PA], t10[PA]

(c) The final partition

t1[PA], t2[PA], t3[PA]
t4[PA]
t5[PA], t7[PA], t9[PA], t10[PA]
t6[PA]
t8[PA], t11[PA], t12[PA]

Figure 7: The partition of PA-tuples by views

It is quite common in anonymization techniques to avoid having the same private value for the tuples in the same equivalence class (with respect to indistinguishability); this is to avoid so-called homogeneity attacks due to the lack of uncertainty [22]. Then, if the private values of two SIND tuples are different, the main condition of Proposition 7 becomes both necessary and sufficient, since the only-if part holds. When this is not the case it is only a sufficient condition, but it can still be used for conservative checking.

Based on Proposition 7, we present an efficient k-SIND checking method through partitioning. The basic idea is that for each view, we partition tuples such that each set of the partition is SIND with respect to this view, and we then intersect these partitions. Example 11 illustrates this procedure.


Example 11 Consider the two views ΠRace,Diagnosis σZip=′22030′(Tbl) and ΠGender,Diagnosis σRace=′White′(Tbl). We partition the PA-tuples as in Figure 7(a) by the first view and as in Figure 7(b) by the second view; the final result in Figure 7(c) is the intersection of the two partitions shown in (a) and (b). For each view, the selected tuples that have the same values on the projection are grouped in the same set of the partition, (Zip, Race) for the first and (Race, Gender) for the second; the tuples that are not selected are grouped into another set in the partition. If two PA-tuples are in the same block of the final partition, they are SIND. In this case, we only have 1-SIND.

An optimized partition and intersection procedure is shown in Figure 8, which keeps partitioning PA-tuples using one view at a time.

Procedure Checking a view set for k-SIND with the selection only on PA attributes
Input: v, Tbl, and integer k
Output: True (provides k-SIND) or False
Let S be {{b | b ∈ ΠPA(Tbl)}}
For each view q = ΠXσC(Tbl) in v
    Let S′ = ∅
    For each set s in S
        Let R = ∅
        For each PA-tuple b in s
            If a set r is found in R such that ΠX−{SA}σC(b) = ΠX−{SA}σC(b′) for any b′ ∈ r
                Let r = r ∪ {b}
            Else Let R = R ∪ {{b}}
        Let S′ = S′ ∪ R
    Let S = S′
If there exists s in S such that |s| < k
    Return False
Else Return True

Figure 8: A partitioning procedure for checking a view set for k-SIND with the selection only on PA attributes
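The partitioning idea of Figure 8 can be sketched compactly by hashing each PA-tuple on its per-view result, which yields the same final partition as refining view by view. The code below is an illustrative variant, not the procedure itself. The table is hypothetical: Figure 1 is not reproduced here, so the Zip values other than 22030, the non-White Race values and several Gender values are assumptions chosen only to be consistent with Figure 7.

```python
from collections import defaultdict

def check_k_sind(pa_tuples, views, k):
    """Group PA-tuples by their signature over all views; test every block size."""
    blocks = defaultdict(list)
    for name, b in pa_tuples.items():
        # Per view: projected PA values if b is selected, None otherwise.
        signature = tuple(
            tuple(b[a] for a in proj) if cond(b) else None
            for proj, cond in views
        )
        blocks[signature].append(name)
    partition = list(blocks.values())
    return all(len(block) >= k for block in partition), partition

def row(zip_code, race, gender):
    return {"Zip": zip_code, "Race": race, "Gender": gender}

# Hypothetical PA-tuples, consistent with the partitions of Figure 7.
PA = {
    "t1": row("22030", "White", "Male"),   "t2": row("22030", "White", "Male"),
    "t3": row("22030", "White", "Male"),   "t4": row("22030", "Black", "Male"),
    "t5": row("22031", "Black", "Male"),   "t6": row("22031", "White", "Female"),
    "t7": row("22031", "Black", "Female"), "t8": row("22031", "White", "Male"),
    "t9": row("22032", "Black", "Male"),   "t10": row("22032", "Black", "Female"),
    "t11": row("22032", "White", "Male"),  "t12": row("22032", "White", "Male"),
}

# Each view: (projection attributes without SA, selection condition on PA).
VIEWS = [
    (("Race",), lambda b: b["Zip"] == "22030"),
    (("Gender",), lambda b: b["Race"] == "White"),
]

ok2, partition = check_k_sind(PA, VIEWS, k=2)
print(ok2)
print(partition)
```

With this data the final partition has the five blocks of Figure 7(c), two of which are singletons, so the check for k = 2 fails. Note that, by Proposition 7, the test is exact only when the tuples have at least two distinct possible private values; otherwise it is conservative.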

Theorem 3 The procedure in Figure 8 checks view sets for k-SIND in O(nS) time, where S is the size of Tbl and n is the number of views in the view set.

Proof Regarding complexity, the procedure searches, for each view q, the set of PA-tuples yielding the same result for q. Such searching can be done using a hash data structure, hence in constant time. For each partition, it is necessary to scan all the PA-tuples in Tbl once. Thus the computing time is O(nS), where S is the size of Tbl and n is the number of views in the view set. □

3.2.4 k-SIND verification of single view released datasets

When the view set v only contains a single view with projection and any kind of selection condition, checking for k-SIND can be done in polynomial time in the size of the private table. That is, the data complexity of checking a single view for k-SIND is polynomial time.

To check k-SIND, we need first to be able to check whether two given PA-tuples b1 and b2 are SIND.

Based on Proposition 6, then, we need to check whether, for each (b1, p1) and (b2, p2) in an instance yielding v, we have q({(b1, p1), (b2, p2)}) = q({(b1, p2), (b2, p1)}). Clearly, checking whether this equation holds is trivial; the difficulty is to find the set Rb1b2 of all such pairs of p1 and p2, which is generally intractable by Lemma 1. However, if the view set contains a single view, Rb1b2 can be computed in polynomial time.

Now we first show how to find Rb1b2. The basic idea is to use bipartite graph matching to describe the constraints imposed by v on the possible instances. Indeed, each tuple in Tbl can yield at most one tuple in v and each tuple in v must be yielded by one tuple in Tbl; if the tuples in Tbl and in v map to the two sets of nodes in a bipartite graph, respectively, then each instance yielding v can map to a one-to-one matching in the bipartite graph (i.e., a “matching” is a mapping from one set of nodes to the other set of nodes in a bipartite graph); hence, each pair (b1, p1) and (b2, p2) in a possible instance must exist in a matching. Example 12 illustrates this procedure.

Example 12 Consider the Tbl in Figure 9. Its schema is 〈Zip, Charge, Salary〉, where Salary is the sensitive attribute. In Tbl there are three tuples. We then consider a released single view obtained by ΠSalary σCharge>=Salary(Tbl), providing the result {95K, 75K}.

     Zip    Charge  Salary
t1   22030  100K    95K
t2   22030  90K     75K
t3   22030  70K     85K

Figure 9: The private table in which Salary is the private attribute

We construct a bipartite graph G as follows. There are four sets of nodes U, W, X and Y in the bipartite graph as in Figure 10. One collection of nodes in G consists of U and X; the other consists of W and Y.

Each PA-tuple b in Tbl maps to a node in U, denoted by N(b). There are three PA-tuples in Tbl, t1[PA], t2[PA] and t3[PA], mapping to the nodes in U, u1, u2 and u3, as shown in the figure. These nodes are marked by the corresponding PA-tuples. In the figure, Zip values are omitted since the Zip values are the same.

Each tuple in v maps to a node in Y. The two tuples in v, (Salary : 95K) and (Salary : 75K), map to the nodes y1 and y2, respectively. Whether there is an edge between a node in U and a node in Y is determined by whether it is possible that the corresponding PA-tuple in Tbl has a private value to yield the corresponding tuple in v. More specifically, because (Charge : 100K) can have a Salary value 95K yielding the tuple (Salary : 95K) in v, there is an edge (u1, y1) connecting u1 and y1. On the contrary, (Charge : 90K) cannot have any Salary value such that it yields (Salary : 95K) in v; thus, no edge exists between u2 and y1. For the same reason, there are edges (u1, y2) and (u2, y2), and there are no edges connecting u3 and the nodes in Y. In addition, each edge is marked by the set of possible values that make the PA-tuple yield the tuple in v.

Each node ui in U has one and only one node in W adjacent to it. This represents the case where a tuple in a possible instance does not satisfy the selection condition C of v, and hence does not yield any tuple in v. The PA-tuple (Charge : 100K) can have a Salary value greater than 100K so that the resulting tuple does not satisfy C, and hence does not yield any tuple in v. So there is the node w1 in W adjacent to u1, and the edge is marked by the set of Salary values that make the resulting tuple not satisfy the selection condition C of v. Similarly, w2 and w3 are only adjacent to u2 and u3, respectively, and both edges are marked by the corresponding set of Salary values.

Finally, since a node in U may not match with the corresponding node in W in a matching, we construct a set X of dummy nodes to collect the unmatched nodes in W. The cardinality of X is |W| + |Y| − |U| and each node in X is adjacent to each node in W. Here there are two dummy nodes, x1 and x2, adjacent to each of the three nodes in W.

Since each tuple in Tbl either yields a tuple in the view or makes C false, and each tuple in the view must be yielded by one and only one tuple in Tbl, each possible instance in Iv maps to a matching in the bipartite graph G.

Actually, each matching corresponds to a set of possible instances because each edge is marked by a set of Salary values. For the matching in the figure, t1[PA] and t2[PA] each have only one Salary value. But t3[PA] can choose any Salary value greater than 70K. The current table is one of the instances mapping to this matching.

Therefore, given two PA-tuples b1 and b2, and two edges e1 and e2 incident to N(b1) and N(b2), if there exists a matching containing e1 and e2, then SA(e1) × SA(e2) is a subset of the possible private value pairs, where SA(e) denotes the set of Salary values associated with the edge e. For instance, from the matching in the figure, SA(u2, y2) × SA(u3, w3), which is {75K} × {> 70K}, is a subset of Rt2[PA]t3[PA] for t2[PA] and t3[PA]. The set of private value pairs is the union of all such subsets.

Checking whether two edges are in a matching can be done by removing these edges and their incident nodes to see whether there exists a matching in the induced graph. After we get all the sets of private value pairs, we can check whether two PA-tuples are SIND.
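The edge-removal test just described can be sketched on the graph of Example 12. The code below uses Kuhn's augmenting-path algorithm, a simple method chosen here for brevity rather than the faster matching algorithm cited in the complexity analysis. It assumes that "a matching" means a perfect matching between U ∪ X and W ∪ Y; the function names and the adjacency encoding are illustrative.

```python
def has_perfect_matching(left, right, adj):
    """Kuhn's algorithm; adj maps each left node to its right neighbours."""
    if len(left) != len(right):
        return False
    match = {}  # right node -> left node currently matched to it

    def augment(u, seen):
        for v in adj[u]:
            if v in right and v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return all(augment(u, set()) for u in left)

def in_common_matching(e1, e2, left, right, adj):
    """Remove both edges with their end nodes; test the induced graph."""
    (u1, v1), (u2, v2) = e1, e2
    if u1 == u2 or v1 == v2:
        return False  # edges sharing a node cannot be in one matching
    rest_left = [u for u in left if u not in (u1, u2)]
    rest_right = {v for v in right if v not in (v1, v2)}
    return has_perfect_matching(rest_left, rest_right, adj)

# Graph of Example 12: U = {u1,u2,u3}, X = {x1,x2}, W = {w1,w2,w3}, Y = {y1,y2}.
W = ["w1", "w2", "w3"]
ADJ = {
    "u1": ["w1", "y1", "y2"],  # Charge 100K can yield 95K or 75K, or fail C
    "u2": ["w2", "y2"],        # Charge 90K can yield 75K, or fail C
    "u3": ["w3"],              # Charge 70K can only fail C
    "x1": W,
    "x2": W,
}
LEFT = list(ADJ)
RIGHT = {"w1", "w2", "w3", "y1", "y2"}

print(in_common_matching(("u2", "y2"), ("u3", "w3"), LEFT, RIGHT, ADJ))
print(in_common_matching(("u1", "y2"), ("u2", "w2"), LEFT, RIGHT, ADJ))
```

The first pair of edges lies in a common matching, as used in the text for t2[PA] and t3[PA]. The second does not: once u1 is matched to y2 and u2 to w2, no remaining node can cover y1.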


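[Figure 10 image: bipartite graph for the view ΠSalary σCharge>=Salary(Tbl). Node set U holds u1, u2, u3 (Charge 100K, 90K, 70K); W holds w1, w2, w3 (edges marked >100K, >90K, >70K); X holds the dummy nodes x1, x2; Y holds y1, y2 (Salary 95K, 75K).]

Figure 10: The bipartite graph for the specific view. The edges in solid line form a matching and correspond to a set of possible instances.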

The procedure illustrated in Example 12 is reported in Figure 11.

Procedure Checking if two tuples are SIND wrt a single view v
Input: v, b1 and b2
Output: True (b1 and b2 are SIND) or False
Construct a bipartite graph G by v.
For each pair of edges e1 and e2 incident to N(b1) and N(b2), respectively
    If there exists a matching containing e1 and e2
        For each p1 ∈ P(e1) and each p2 ∈ P(e2)
            If q({(b1, p1), (b2, p2)}) ≠ q({(b1, p2), (b2, p1)})
                Return False.
Return True;

Figure 11: A procedure for checking if two PA-tuples are SIND with respect to a single view

One step in the procedure of Figure 11 needs to check whether for each p1 ∈ P(e1) and each p2 ∈ P(e2), we have q({(b1, p1), (b2, p2)}) = q({(b1, p2), (b2, p1)}). This step can be done in constant time, as shown by Example 13.

Example 13 Referring to Example 12, we show the last step of the procedure considering t2[PA] and t3[PA]. One subset of private value pairs for t2[PA] and t3[PA] is from (u2, y2) and (u3, w3), which are in a matching. This subset is {75K} × {> 70K}, which means P(e1) is {75K} and P(e2) is {> 70K}. Since we must have q({(t2[PA], p1), (t3[PA], p2)}) = q({(t2[PA], p2), (t3[PA], p1)}) for each pair p1 and p2 in {75K} × {> 70K}, we must have q(t2[PA], 75K) = q(t3[PA], 75K) and ¬C(t3[PA], p) → ¬C(t2[PA], p), recalling that {> 70K} is obtained from {p | ¬C(t3[PA], p)} = {> 70K} and that C is the selection condition. The former one is easy. The latter one is equivalent to requiring that C(t3[PA], p) ∨ ¬C(t2[PA], p) is true for all Salary values p in the domain. Both expressions can be checked in constant time since the size of C is considered constant for the purpose of this work.

To check a single view v for k-SIND, we can use the above procedure to check whether each pair of PA-tuples are SIND, and then partition PA-tuples to check whether v provides k-SIND by a procedure similar to that in Figure 8.

Now we analyze the computational complexity of this method.

Theorem 4 The procedure in Figure 11 checks a single view for k-SIND in O(S^{13/2}) time, where S is the number of tuples in Tbl.

Proof Given two PA-tuples b1 and b2, in the worst case, checking SIND between them requires checking, in the bipartite graph associated with v, whether each pair of edges incident to the corresponding nodes N(b1) and N(b2) is contained in a matching. The number of edges incident to a node is bounded by the number of nodes in the graph, which is bounded by 2S, where S is the size of Tbl. Thus, there are O(S^2) pairs. It is known that finding a matching for each pair takes O(M^{5/2}) time [14], where M is the number of nodes in the graph, which is bounded by 2S. So the running time of the procedure in Figure 11 is O(S^2) · O(S^{5/2}) = O(S^{9/2}). Further, in the worst case, checking for k-SIND needs to check all pairs of PA-tuples, the number of which is O(S^2). Therefore, the total checking time is O(S^{13/2}). □
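The matching test at the core of Figure 11 can be sketched as follows. This is only an illustration under stated assumptions: the bipartite graph is passed in as a hypothetical adjacency dictionary `adj` (left node to set of right nodes), "a matching containing e1 and e2" is read as a matching that covers every left node, and a simple augmenting-path search is used in place of the O(n^{5/2}) algorithm of [14]. The function names are ours, not the paper's.

```python
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]  # (left node, right node)


def has_left_saturating_matching(adj: Dict[Hashable, Set[Hashable]]) -> bool:
    """Augmenting-path test: can every left node be matched to a distinct right node?"""
    match_of_right: Dict[Hashable, Hashable] = {}

    def try_augment(u: Hashable, seen: Set[Hashable]) -> bool:
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if v not in match_of_right or try_augment(match_of_right[v], seen):
                match_of_right[v] = u
                return True
        return False

    return all(try_augment(u, set()) for u in adj)


def matching_contains(adj: Dict[Hashable, Set[Hashable]], e1: Edge, e2: Edge) -> bool:
    """True if some left-saturating matching uses both e1 and e2 (assumed reading)."""
    (u1, v1), (u2, v2) = e1, e2
    if u1 == u2 or v1 == v2:
        return False  # edges sharing an endpoint cannot both be in a matching
    if v1 not in adj.get(u1, set()) or v2 not in adj.get(u2, set()):
        return False  # not edges of the graph
    # Force e1 and e2 into the matching by deleting their four endpoints.
    rest = {u: vs - {v1, v2} for u, vs in adj.items() if u not in (u1, u2)}
    return has_left_saturating_matching(rest)
```

Forcing the two edges and then testing the remaining graph keeps the sketch short; each call costs one matching computation, which corresponds to the per-pair cost counted in the proof above.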

3.2.5 Conservative k-SIND verification for the general case

In the case we are releasing multiple views and some of these views are obtained by selection on the sensitive attribute, the procedures illustrated above do not apply, and we know that the general problem of checking a view set for k-SIND is intractable. However, it is still possible to perform a conservative-style checking, i.e., applying a procedure that will always catch k-SIND violation if it occurs, but may not recognize when k-SIND holds in some cases. Since we know that checking each single view requires polynomial time, we can do a conservative checking based on Proposition 8.

Proposition 8 Given a view set v, if given PA-tuples b1 and b2 are SIND with respect to each view in v, then b1 and b2 are SIND with respect to v.

Proof Given a view set v, let Iv be the set of instances yielding v. For each view vi ∈ v, denote the set of instances yielding vi as Ivi. Clearly, we have Iv = ∩i Ivi, because each instance in Iv must yield all the views in v. Suppose two PA-tuples b1 and b2 are SIND w.r.t. each view in v. For any instance I in Iv, let (b1, p1) and (b2, p2) be the tuples having b1 and b2 in I, respectively. Because b1 and b2 are SIND w.r.t. each view vi and I is in each Ivi, by the definition of SIND, the instance I′ = (I − {(b1, p1), (b2, p2)}) ∪ {(b1, p2), (b2, p1)} must also be in Ivi. Thus, I′ is in ∩i Ivi = Iv. By the definition of SIND, b1 and b2 are SIND w.r.t. v. □

We can get a SIND partition over the PA-tuples with respect to each single view in v, and then intersect these partitions to get the final partition. All the PA-tuples in the same set of the final partition must be SIND by the above proposition. Then if the cardinality of each set of the final partition is at least k, the view set provides k-SIND. This intersection of the partitions for each single view is exactly the same as the example in Figure 7. Thus, we can use the same procedure as in Figure 8 except that checking each view for SIND between PA-tuples applies the procedure in Figure 11.
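The intersection step can be illustrated with a short sketch. It assumes that each per-view SIND partition is already available as a list of sets and that every partition covers the same PA-tuples; the function names are hypothetical and this is not the procedure of Figure 8 itself.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Set

Partition = List[Set[Hashable]]


def intersect_partitions(partitions: Sequence[Partition]) -> Partition:
    """Group PA-tuples that fall into the same block in every input partition."""
    signature = defaultdict(list)  # PA-tuple -> block index in each partition
    for partition in partitions:
        for block_id, block in enumerate(partition):
            for item in block:
                signature[item].append(block_id)
    groups = defaultdict(set)
    for item, sig in signature.items():
        groups[tuple(sig)].add(item)
    return list(groups.values())


def conservative_k_sind(partitions: Sequence[Partition], k: int) -> bool:
    """True means k-SIND holds; False only means the conservative test is inconclusive."""
    return all(len(block) >= k for block in intersect_partitions(partitions))
```

With one partition consisting of a single block {t1, t2, t3, t4} and a second partition {t1, t2}, {t3, t4}, the intersection has two blocks of size two, so the test succeeds for k = 2 and is inconclusive for k = 3.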

Example 14 Consider Example 1 presented in the introduction of this paper. We can prove that the released view set reported in Figure 2 provides 2-SIND by the conservative method. It is easily seen that the first view provides 4-SIND with a single equivalence class composed of t1[PA], . . . , t4[PA]. In the second view, t1[PA] is SIND with t2[PA], but not with t3[PA] nor with t4[PA]. Indeed, in the first case (t3[PA]), there is an instance, Tbl itself, that would no longer be in Iv after swapping the private values of t1 and t3. In the second case (t4[PA]), the same applies considering the instance in Iv obtained by swapping the private values of t3 and t4. It is easily seen that t3[PA] and t4[PA] are SIND with respect to the second view since the selection condition (age ≤ 50) excludes them from the view. Hence, by intersecting the partitions we conclude that the dataset provides 2-SIND, with the first SIND partition including t1[PA] and t2[PA], and the second one containing t3[PA] and t4[PA].

Clearly, checking in this way can be done in polynomial time. Since, by Proposition 8, SIND with respect to each view in v is a sufficient but not necessary condition for SIND with respect to v, this is a conservative checking method. Example 15 gives a counter-example showing that the condition is not necessary.

Example 15 Consider Example 7 in Section 2.5. The PA-tuples t3[PA] and t4[PA] are not SIND with respect to each view, since, only from the first view, t3[PA] can have any value of {Cold, AIDS, Obesity, AIDS} and t4[PA] can have any value of {AIDS, Obesity, AIDS}, and from the second view t3[PA] cannot have Cold and t4[PA] can have any possible Diagnosis value. If we combine the two views, however, t3[PA] and t4[PA] are SIND.

In some cases, checking each single view is still costly, especially when the private table is large. In these cases, we can use a conservative checking method for each single view q. The basic idea is that if two PA-tuples have the same characteristics in the selection condition and have the same value on the PA attributes in the projection of q, then they are SIND. In particular, we have the following result.

Proposition 9 Given a view set containing a single view ΠXσC(Tbl) and two PA-tuples b1 and b2, if ΠX−{SA}(b1) = ΠX−{SA}(b2) and C(b1, p) and C(b2, p) have the same set of SA values that make them true, then b1 and b2 are SIND.

Proof We show that if ΠX−{SA}(b1) = ΠX−{SA}(b2) is true and the same set of SA values satisfy C(b1, p) and C(b2, p), then for any instance I in Iv, the tuples (b1, p1) ∈ I and (b2, p2) ∈ I satisfy ΠXσC({(b1, p1), (b2, p2)}) = ΠXσC({(b1, p2), (b2, p1)}). Indeed, we have both ΠXσC((b1, p1)) = ΠXσC((b2, p1)) and ΠXσC((b1, p2)) = ΠXσC((b2, p2)). Without loss of generality, we prove ΠXσC((b1, p1)) = ΠXσC((b2, p1)). Because the same set of SA values satisfy C(b1, p) and C(b2, p), the two conditions C(b1, p1) and C(b2, p1) must either both evaluate to true or both evaluate to false. If they evaluate to false, ΠXσC((b1, p1)) = ΠXσC((b2, p1)) trivially holds. If they evaluate to true, ΠXσC((b1, p1)) = ΠXσC((b2, p1)) also holds, because ΠX−{SA}(b1) = ΠX−{SA}(b2) is true. □

This proposition says that if two PA-tuples b1 and b2 have the same values on the PA attributes in the projection of q, and after substituting PA with b1 and b2, respectively, in the selection condition, the two substituted conditions have the same set of SA values making the conditions true, then they are SIND. We can see that this method does not look for the possible private values. Thus, a procedure similar to the checking method for the case where v selects only PA attributes can be applied. That is, generate a partition for each view by the corresponding attribute values of PA-tuples and intersect these partitions.

Example 16 Consider the view ΠZipσCharge>Salary(Tbl) on the table Tbl in Figure 9. Each distinct Charge value c has a different set of Salary values making the selection condition true when you substitute Charge with c in the condition. Thus, if two PA-tuples have the same (Zip, Charge) value, then we take them as SIND; otherwise, we do not.
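A minimal sketch of the grouping suggested by Proposition 9 is given below. It makes assumptions beyond the text: PA-tuples are plain dictionaries, the sensitive domain is a small finite list that can be enumerated (the paper instead reasons about the condition C symbolically), and the names `conservative_sind_partition`, `public_proj_attrs` and `sa_domain` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence

PATuple = Dict[str, Hashable]


def conservative_sind_partition(
    pa_tuples: Sequence[PATuple],
    public_proj_attrs: Sequence[str],                # X - {SA}
    condition: Callable[[PATuple, Hashable], bool],  # C(b, p)
    sa_domain: Sequence[Hashable],                   # assumed finite and enumerable
) -> List[List[PATuple]]:
    """Group PA-tuples that agree on X - {SA} and on the SA values satisfying C."""
    groups = defaultdict(list)
    for b in pa_tuples:
        projected = tuple(b[a] for a in public_proj_attrs)
        satisfying = frozenset(p for p in sa_domain if condition(b, p))
        groups[(projected, satisfying)].append(b)
    return list(groups.values())
```

For the view of Example 16 one would pass `public_proj_attrs=["Zip"]` and `condition=lambda b, p: b["Charge"] > p`. Over a coarse finite domain two different Charge values may admit the same set of Salary values, in which case this sketch groups them together, as Proposition 9 allows.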

We believe this conservative checking is practical, since we do not check what possible private values each individual has. In fact, it is similar in spirit to k-anonymization methods [27, 24, 5, 20]. Indeed, this checking looks only at the public values, and similarly k-anonymization recodes only the public values of tuples to achieve k-anonymity.

3.3 k-RSIND verification methods

In this section, we consider the problem of verifying if a released set of views provides restricted SIND as defined in Section 2.5. Technically, we have to check whether there exists an RSIND partition such that the cardinality of each set in the partition is at least k. By Proposition 5, this is equivalent to looking for the PA-tuples in each set that are RSIND with each other with respect to their current private values.


To do this, given a set T of tuples in Tbl, we need to check whether for each pair of PA-tuples b1 and b2 in ΠPA(T), each pair of SA values p1 and p2 in ΠSA(T), and each instance I in Iv that contains (b1, p1) and (b2, p2), there exists an instance I′ in Iv such that I′ = (I − {(b1, p1), (b2, p2)}) ∪ {(b1, p2), (b2, p1)}. By a reasoning similar to the one in Proposition 6, this swap is equivalent to having, for each query q in v, q({(b1, p1), (b2, p2)}) = q({(b1, p2), (b2, p1)}).

For each pair of PA-tuples in ΠPA(T) and each pair of private values in ΠSA(T), this swap equation needs to be checked. Then, for n tuples, it needs to be checked O(n^4) times (there are O(n^2) pairs of PA-tuples and O(n^2) pairs of private values), where n is the cardinality of T. Obviously, this is costly.

However, in most cases, if each two PA-tuples in ΠPA(T) can swap their current values, then the two PA-tuples can swap every pair of two private values in ΠSA(T). For instance, given T = {(b1, p1), (b2, p2), (b3, p3)}, if b1 and b2 can swap for p1 and p2, b2 and b3 for p2 and p3, and b3 and b1 for p3 and p1, then any two PA-tuples, for example, b1 and b2, can swap for all pairs of the current private values, p1 and p2, p2 and p3, and p3 and p1.

For convenience, we introduce another concept.

Definition 7 Given two PA-tuples b1 and b2, and (b1, p1) and (b2, p2) in the private table, we say b1 and b2 are CSIND (currently SIND) if for each query q in v, we have either (1) q contains the projection on SA, and q((b1, p1)) = q((b2, p1)) and q((b1, p2)) = q((b2, p2)), or (2) q does not contain the projection on SA, but q((b1, p1)) = q((b1, p2)) and q((b2, p1)) = q((b2, p2)).

Intuitively, if b1 and b2 are CSIND, we can swap their private values in the current table without affecting the view set.
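Definition 7 translates almost directly into code. The sketch below assumes a hypothetical `View` class for projection-selection queries in which rows are dictionaries, the selection condition is a Python predicate, and the result of a view on a single tuple is either the projected row or `None` when the tuple is not selected. None of these names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class View:
    proj: Tuple[str, ...]        # projected attribute names (X)
    cond: Callable[[Row], bool]  # selection condition (C)
    sa: str = "SA"               # name of the sensitive attribute

    def on(self, b: Row, p: object) -> Optional[tuple]:
        """Result of the view on the single tuple (b, p); None if not selected."""
        row = {**b, self.sa: p}
        return tuple(row[a] for a in self.proj) if self.cond(row) else None


def is_csind(views: Iterable[View], b1: Row, p1: object, b2: Row, p2: object) -> bool:
    """Check Definition 7 for (b1, p1) and (b2, p2) against every view."""
    for q in views:
        if q.sa in q.proj:   # case (1): q projects on SA
            ok = q.on(b1, p1) == q.on(b2, p1) and q.on(b1, p2) == q.on(b2, p2)
        else:                # case (2): q does not project on SA
            ok = q.on(b1, p1) == q.on(b1, p2) and q.on(b2, p1) == q.on(b2, p2)
        if not ok:
            return False
    return True
```

Each call evaluates every view on four single tuples, so the cost per pair of PA-tuples is linear in the number of views.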

Clearly, if each pair of tuples in ΠPA(T) is CSIND, then each pair of PA-tuples in ΠPA(T) satisfies the swap equation for all the private values in ΠSA(T), hence, the tuples in ΠPA(T) are RSIND from each other. Indeed, if each q contains the projection on SA, this is a necessary and sufficient condition; otherwise, it is a sufficient condition.

Proposition 10 Given a view set v and a set of tuples T in the private table, there exists a possible private value collection P such that ΠPA(T) is a subset of an RSIND set with respect to P if for each pair of PA-tuples b1 and b2 in ΠPA(T), b1 and b2 are CSIND. The converse (“only if”) holds if each query in v contains the projection on SA.

Proof There are two cases. First, if the view q contains a projection on SA, then by the definition of CSIND, q((b1, p1)) = q((b2, p1)) and q((b1, p2)) = q((b2, p2)) hold for any two tuples (b1, p1) and (b2, p2) in T. Thus, for each value pi in ΠSA(T), we have q((b1, pi)) = q((b2, pi)) = . . . = q((bn, pi)), where b1, ..., bn are all the PA values in ΠPA(T). So for each pair of PA-tuples b1 and b2 in ΠPA(T) and each pair of SA values pi and pj, we have q((b1, pi)) = q((b2, pi)) and q((b1, pj)) = q((b2, pj)), hence q({(b1, pi), (b2, pj)}) = q({(b1, pj), (b2, pi)}). Then any b1 and b2 are RSIND w.r.t. ΠSA(T). This concludes the if part of the proof for this case. For the only if part, if any two PA-tuples b1 and b2 in ΠPA(T) are RSIND w.r.t. ΠSA(T), by the definition of RSIND and similar reasons as in the proof of Proposition 6, clearly, for any two tuples (b1, p1) and (b2, p2) in T, we have q((b1, p1)) = q((b2, p1)) and q((b1, p2)) = q((b2, p2)).

Second, if a view q does not contain a projection on SA, then by the definition of CSIND, we have q((b1, p1)) = q((b1, p2)) and q((b2, p1)) = q((b2, p2)) for any two tuples (b1, p1) and (b2, p2) in T. Thus, for each value bi in ΠPA(T), we have q((bi, p1)) = q((bi, p2)) = . . . = q((bi, pn)) = f(bi). Consequently, for each pair of PA-tuples b1 and b2 in ΠPA(T) and each pair of SA values pi and pj, we have q({(b1, pi), (b2, pj)}) = q({(b1, pj), (b2, pi)}). This concludes the if part of the proof for this case. However, the only if part does not hold in this case, because q({(b1, pi), (b2, pj)}) = q({(b1, pj), (b2, pi)}) can also be satisfied if q((b1, pi)) = q((b2, pi)) and q((b1, pj)) = q((b2, pj)) both hold. □

Therefore, we apply the following checking method. If we can find a maximal partition over the PA-tuples ΠPA(Tbl) such that each pair of PA-tuples in each set in the partition are CSIND, then this partition is an RSIND partition. Here, “maximal” means that the union of any two sets in the partition cannot result in a set in which each pair of PA-tuples are still CSIND. In this way, we can find an RSIND partition by checking whether each pair of tuples in the current table is able to swap their private values. This provides a conservative checking algorithm for k-RSIND as follows.

We construct a graph G with one node for each tuple. If two tuples are CSIND, which can be easily checked based on the current private table, an edge is drawn between the corresponding nodes. Then, a complete subgraph of G is a subset of an RSIND set. Therefore, the problem of finding an RSIND partition becomes the problem of finding a maximal clique partition. If each query of v contains the projection on SA, the above checking algorithm is a precise (not conservative) algorithm.

Considering the special case where each query of v contains the projection on SA, we obtain a negative result about the complexity of the general problem of checking k-RSIND.

Theorem 5 Given a released view set v, it is NP-hard to decide whether v provides k-RSIND.

Proof We showed that finding an RSIND partition is equivalent to finding a maximal clique partition. Since we know that the problem of finding a clique partition with each block’s size of at least k is NP-hard [15], we can conclude that deciding whether a view set v provides k-RSIND is also NP-hard. □

Nevertheless, we can use the heuristic algorithms in [15] to find a clique partition with each block of size at least k. This will result in a conservative algorithm even for the special case where each query in v contains the projection on SA.
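To make the conservative nature of such a heuristic concrete, the sketch below uses a plain first-fit greedy clique partition. This is a stand-in chosen for brevity and is not the branch-and-price-and-cut method of [15]; the predicate `compatible` is a hypothetical placeholder for the pairwise CSIND test, and the function names are ours.

```python
from typing import Callable, Hashable, List, Sequence, Set


def greedy_clique_partition(
    nodes: Sequence[Hashable],
    compatible: Callable[[Hashable, Hashable], bool],
) -> List[Set[Hashable]]:
    """First-fit: put each node into the first block it is compatible with entirely."""
    blocks: List[Set[Hashable]] = []
    for node in nodes:
        for block in blocks:
            if all(compatible(node, other) for other in block):
                block.add(node)
                break
        else:
            blocks.append({node})  # no existing block fits; open a new one
    return blocks


def conservative_k_rsind(
    nodes: Sequence[Hashable],
    compatible: Callable[[Hashable, Hashable], bool],
    k: int,
) -> bool:
    """True means a clique partition with blocks of size >= k was found.

    False is inconclusive: another partition might still satisfy the bound.
    """
    return all(len(b) >= k for b in greedy_clique_partition(nodes, compatible))
```

Every block returned is a clique by construction, and no two blocks can be merged into a larger clique, because a node opens a new block only when it conflicts with some member of each earlier block. The result depends on the order of `nodes`, which is why a negative answer cannot be trusted.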

For example, consider the views in Figure 6. We construct a graph as in Figure 12. Each edge represents that the PA-tuples corresponding to the two adjacent nodes are CSIND. An RSIND partition maps to a maximal clique partition in the graph.

[Graph with nodes t1[B], t2[B], t3[B] and t4[B]; the edges are not recoverable from the extracted text.]

Figure 12: An RSIND partition for the views in Figure 6 maps to a maximal clique partition.

4 Indistinguishability verification in the presence of updates

In this section we illustrate preliminary results on the problem of verifying indistinguishability in the case updates are performed on the private table between the times when two queries are issued.

Example 17 Consider the table in Figure 1 and the view ΠDiagnosis σZip=′22030′(Tbl), which is considered to provide sufficient indistinguishability and hence has already been released. After the release of this view, a new tuple t with Zip value 22030 and Diagnosis value ChestPain is inserted. After this insertion, the same view on the modified table is asked to be released again. Each of the two releases is safe independently, but it is not safe if attackers combine the two releases with the modification. More specifically, attackers can deduce the private value of the new tuple t from the difference between the result of the second query and that of the first query if they know t is the new tuple. Recall that we assume ΠPA(Tbl) is public information and hence attackers know the insertion of ΠPA(t).

We use (qτ, rτ) to denote a query qτ on the table Tbl at the time τ, denoted Tblτ, and the corresponding result rτ = qτ(Tblτ). We want to determine whether a view set 〈(q0, r0), ..., (qc, rc)〉 provides k-SIND (k-RSIND) for Tbl at each time τ, where τ = 0, 1, ..., c. More specifically, let v′τ be the released data that includes the released views 〈(q0, r0), ..., (qc, rc)〉 and information about the modifications (insertion, deletion and updates) at each time (without of course explicit knowledge about the private values of the inserted, deleted and updated tuples). Then, we require v′τ provides k-SIND (or k-RSIND) for each τ = 0, ..., c.

Let IτPA consist of all the PA-tuples appearing in Tblτ, i.e., IτPA = ΠPA(Tblτ). Let v′τ be the v′ for time τ. Then Iv′τ = {Iτ | Iτ is a relation on Attr(Tbl) such that ΠPA(Iτ) = IτPA and there exist relations I0, ..., Ic on Attr(Tbl) such that for each 0 ≤ i < c, Ii and Ii+1 are consistent with the modifications known to the public, and ri = qi(Ii) for each i = 0, ..., c}. If v′τ provides k-SIND (k-RSIND, respectively), then we say 〈(q0, r0), ..., (qc, rc)〉 provides k-SIND (k-RSIND, respectively) at time τ. And if for every τ, v′τ provides k-SIND (k-RSIND, respectively), then we say 〈(q0, r0), ..., (qc, rc)〉 provides k-SIND (k-RSIND, respectively).

If attackers do not know anything about modifications and assume any sequence of modifications is equally possible between two queries, then the combination of two queries is useless for the inference of private values. This is because the states of the two tables are completely uncorrelated, and this situation is equivalent to the one in which two queries are executed on two different table instances.

Thus, we assume for two consecutive views (qτ, rτ) and (qτ+1, rτ+1), it is public information that ΠPA(Tblτ+1) is partitioned into 3 sets PAo, PAu and PAi, where the private values for PAo-tuples do not change between Tblτ and Tblτ+1, the private values for PAu-tuples may be updated, and PAi-tuples are the newly inserted ones in Tblτ+1. The common PA-tuples between Tblτ and Tblτ+1 are PAo ∪ PAu. We denote by PAd the set of PA-tuples that are deleted from Tblτ, i.e., PAd = ΠPA(Tblτ) − ΠPA(Tblτ+1), which is also known because ΠPA(Tblτ) and ΠPA(Tblτ+1) are known.

Under the above assumptions, we present the general checking method for k-SIND and k-RSIND in the case of updates to the private table. We will transform this case into the case of static tables, and then apply the checking methods in the previous section for the case of static tables. Given a view set v and the table instances Tbl0...Tblc, we construct a table instance Tbls and a view set vs as follows.

• Tbls has the same schema as the private table except that there are two additional public attributes Start and End.

• At time 0, copy all the tuples in Tbl0 into Tbls. Mark the Start value of each tuple by 0 and the End value by ∞.

• At time τ + 1, where 0 ≤ τ ≤ c − 1, copy all the inserted and the updated tuples from τ to τ + 1 into Tbls, and keep the deleted and the old tuples that are updated in Tbls. For each inserted tuple whose PA-tuple is in PAi and each new updated tuple whose PA-tuple is in PAu, mark the Start value by τ + 1 and the End value by ∞; for each deleted tuple whose PA-tuple is in PAd and each old updated tuple whose PA-tuple is in PAo, mark the End value by τ.

• For each view (qτ, rτ) in v, where qτ = ΠXτ σCτ and rτ = qτ(Tblτ), transform it into a view (q′τ, r′τ) in vs, where q′τ = ΠXτ σCτ & Start≤τ & End≥τ and r′τ = q′τ(Tbls).
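The construction above can be sketched in a few lines. The sketch rests on several assumptions that are ours: each table instance is a dictionary from PA-tuple to private value, the public set PAu for each step is supplied explicitly as `maybe_updated`, the "old updated tuple" is read as the previous version of a PA-tuple in PAu, and view results keep duplicates. All function and field names are hypothetical.

```python
import math
from typing import Callable, Dict, Hashable, List, Sequence, Set

Snapshot = Dict[Hashable, object]  # PA-tuple -> private (SA) value


def build_static_table(snapshots: Sequence[Snapshot],
                       maybe_updated: Sequence[Set[Hashable]]) -> List[dict]:
    """Return the rows of Tbl^s as dicts with keys pa, sa, start, end.

    maybe_updated[t] is the public set PAu for the step from time t to t + 1.
    """
    rows = [{"pa": pa, "sa": sa, "start": 0, "end": math.inf}
            for pa, sa in snapshots[0].items()]
    open_row = {row["pa"]: row for row in rows}  # currently valid row per PA-tuple
    for t in range(len(snapshots) - 1):
        old, new = snapshots[t], snapshots[t + 1]
        # Deleted tuples and old versions of possibly updated tuples end at t.
        for pa in (set(old) - set(new)) | (maybe_updated[t] & set(old)):
            open_row.pop(pa)["end"] = t
        # Inserted tuples and new versions of possibly updated tuples start at t + 1.
        for pa in (set(new) - set(old)) | (maybe_updated[t] & set(new)):
            row = {"pa": pa, "sa": new[pa], "start": t + 1, "end": math.inf}
            rows.append(row)
            open_row[pa] = row
    return rows


def transformed_view(rows: Sequence[dict], t: int,
                     cond: Callable[[dict], bool],
                     project: Callable[[dict], object]) -> list:
    """q'_t: the original selection plus Start <= t and End >= t (duplicates kept)."""
    return sorted(project(r) for r in rows
                  if r["start"] <= t <= r["end"] and cond(r))
```

Evaluating `transformed_view` at time t on the constructed rows should return the same values as the original view on the instance at time t, which is the property the construction is designed to preserve.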

We have the following conclusion about this construction.

Theorem 6 Two PA-tuples b1 and b2 are SIND (CSIND) at time τ with respect to 〈(q0, r0), ..., (qc, rc)〉, if and only if there exist two tuples with the Start values less than or equal to τ and the End values greater than or equal to τ, and their projection on PA equal to b1 and b2, such that they are SIND (CSIND) with respect to vs on the table Tbls.

Proof Sketch. From the construction procedure, we can see that for each view (qi, ri) on Tbli in the view set, if we remove Start and End attributes from the tuples selected by the selection condition of q′i regarding Start and End, the resulting tuples are the same as the tuples in Tbli. And each old tuple that stays the same across a series of consecutive Tbl instances and hence is queried by the corresponding consecutive views is also queried by the corresponding constructed consecutive views. Thus, the modification constraints are kept in v′ and Tbls. Therefore, each series of instances 〈I0...Ic〉 yielding v maps one-to-one to an instance Is yielding v′ such that for each time τ, Iτ = ΠA σStart≤τ & End≥τ(Is). Therefore, by the definition of SIND, this theorem holds. □

By Theorem 6, checking a view set for SIND or RSIND (we use CSIND to check for RSIND) on a dynamic table can be transformed into checking a view set for indistinguishability on a static table. Example 18 illustrates the application of this result.

Example 18 Consider a view ΠDiagnosis σZip=′22030′(Tbl) being released twice, where Tbl is the private table in Figure 1. Between the two releases, the private table is modified as follows. The tuple t1 is deleted, the tuple t12’s Diagnosis value is updated from ChestPain to Cold, and a new tuple t13 is inserted. We construct a table as in Figure 13. The tuple t1 exists in the first release and not in the second release, hence both its Start and End values are 0; the tuple t12 is modified, hence there are the two tuples t12 and t′12 with different private values and different Start and End values; the tuple t13 is inserted, hence its Start value is 1 and its End value is ∞. Then, the view is transformed into two queries ΠDiagnosis σZip=′22030′ & Start≤0 & End≥0(Tbls) and ΠDiagnosis σZip=′22030′ & Start≤1 & End≥1(Tbls).

        ...   Diagnosis    Start   End
t1      ...   Cold         0       0
t2      ...   AIDS         0       ∞
...     ...   ...          ...     ...
t12     ...   Chest Pain   0       0
t′12    ...   Cold         1       ∞
t13     ...   AIDS         1       ∞

Figure 13: The constructed static table.

5 Related Work

The work on k-anonymity [27, 28, 24, 5, 20] is probably the closest to the investigation proposed in this paper, since its main goal is to guarantee the indistinguishability of single individuals in groups of at least k individuals. We have devoted Section 2.4 to analyse the relationship between k-anonymity and the notion of indistinguishability we propose. Apart from minor technical differences, our proposed notion can be considered a generalization of k-anonymity, since it can be used to evaluate indistinguishability of individuals not only with respect to an anonymized table, but with respect to an arbitrary set of views obtained by projection-selection queries, possibly involving both public and sensitive attributes.

Recently, there has been other work aiming to achieve good uncertainty while gaining k-anonymity by imposing additional requirements on anonymization. In particular, Li et al. [21] proposed t-closeness for measuring privacy disclosure as an improvement over l-diversity, originally proposed by Machanavajjhala et al. [22]; the main idea is to measure the difference between the overall distribution of sensitive attribute values in the released data and the one in specific equivalence classes (QuasiID groups). We can see some relationship between this work and ours since we also consider the values of sensitive attributes in order to evaluate the indistinguishability of individuals (as opposed to k-anonymity which only focuses on QuasiID attributes). In particular, the definition of probabilistic indistinguishability (PIND) considers the a-posteriori probability of associating an individual with a given sensitive value, based on the release of a set of views. However, both l-diversity and t-closeness have been defined and evaluated on single anonymized tables, while we consider also projection-selection relational views and sets of views. The problem of guaranteeing uncertainty over sensitive attribute values has also been considered when multiple views are released by Yao et al. [32]. However, that work is actually focused on uncertainty only, and can be considered as orthogonal to the one presented in this paper, which is focused on the aspect of identity protection. It may be integrated in a general approach, as briefly discussed in Section 6, but this will be subject of future work.

The idea of exploiting symmetries among sensitive values, originally proposed in the preliminary version of this paper [31], seems to have influenced another recent proposal of anonymization techniques. Indeed, Koudas et al. [19] considered the problem of anonymizing data in order to answer aggregate queries and proposed a technique based on the permutation of sensitive attribute values to achieve anonymization, as an alternative to generalization.

Another approach presenting an alternative to generalization, based on the analysis of associations of sensitive attribute values with QuasiID groups identifying sets of individuals, has been proposed by Xiao and Tao [30]; they propose the publication of two different tables to separate sensitive attribute values from the specific sequences of public attribute values (QuasiID values) that may re-identify individuals. While the analysis of the above mentioned associations can be considered related to the analysis of symmetries made in this paper, that proposal does not provide tools to evaluate the indistinguishability of individuals based on the release of multiple views (except for the two provided by their anonymization technique).

Previous work exists about the issue of privacy or secrecy disclosure through the release of general database views. The conditions of perfect secrecy are studied in [25, 9] using a probability model, and in [33] using query conditional containment. In this paper, we addressed the case in which some partial disclosure is tolerated (actually desired), and hence the “undesired” disclosure needs to be evaluated.

Except for the study of k-anonymity and of the derived notions mentioned above, the privacy metrics used in prior work have mainly been based only on the uncertainty of private property values, i.e., the uncertainty about what private value an individual is associated with. These metrics can be classified into two categories: non-probabilistic and probabilistic. The non-probabilistic metrics are mainly used in the fields of inference problem of statistical databases [1, 18, 17, 29], multilevel databases [23, 6] and general purpose databases [8, 6, 32]. The most often used one is the following: if the private value of an individual cannot be uniquely inferred, the release of data about the individual is considered safe [1, 23, 8, 18, 6, 17]. The other one is the cardinality of the set of possible private values for each individual, among which attackers cannot determine which one is the actual one [29, 32] (the metric used in [32] is an uncertainty metric in spite of the notion of k-anonymity introduced). Probabilistic metrics are also used. Authors use the probability value associated with the actual value [13, 17] or the variance of the probability distribution of private values [1, 26]. Most work in privacy-preserving data mining uses probability-based metrics. Their metrics are based only on the characteristics of the a posteriori probability distribution of private values [3, 2, 11], or on both the a priori and the a posteriori distribution [2, 10, 16, 4]. The work in [7] uses indistinguishability based on probability “distance” as a privacy metric.

6 Conclusions and Future Research Directions

In this paper, we identified a requirement of privacy in data release, namely indistinguishability, in addition to uncertainty. We first gave three definitions of indistinguishability, namely, PIND, SIND, and RSIND. Then we concentrated on checking database views against these indistinguishability metrics. We believe that in most cases PIND is impractical due to the difficulty of obtaining the a priori probability distribution and calculating the a posteriori probability distribution. Generally, checking for k-SIND is intractable. We presented two cases where polynomial algorithms are possible. Furthermore, we presented two conservative checking methods. Checking for RSIND is straightforward, but checking for k-RSIND is intractable and can be done in a conservative way with heuristic polynomial algorithms.

Our work can be extended in several directions. In our running example we used a discrete domain for the sensitive attribute (Diagnosis). In general the sensitive attributes can be drawn also from an infinite or a continuous domain; it should not be difficult to extend our study to infinite discrete, or continuous domains.

We also assumed that each PA-tuple uniquely identifies an individual. However, even if the PA attributes act as a quasi-identifier, it is entirely possible that a PA-tuple re-identifies a set of individuals and not a single one. Note that our definitions of indistinguishability hold also in this case, and we believe that the proposed verification methods apply as well.


In addition to these minor extensions, there are four major extensions that we foresee.

Firstly, our definitions require “perfect” indistinguishability, i.e., they require complete symmetry in terms of their private attribute values between individuals. But indistinguishability could have a degree. That is, the fact that two individuals are mostly symmetric in terms of sensitive attribute values may lead us to consider them as indistinguishable (ignoring at this level the diversity problem). This is an interesting future research topic, and the introduction of the notion of RSIND is towards this direction.

Secondly, we may want a crowd, i.e., a set of indistinguishable individuals, to have some additional properties in order to introduce some “diversity” among them. For example, we may require that k indistinguishable individuals have at least l distinct Zip values, which introduces a territorial diversity into the indistinguishable crowd. The extension of the privacy metric to include uncertainty may be inspired by the work on diversity on single released tables [22] or by the work in [32] that is specifically focused on multiple views. Alternatively, an approach similar to t-closeness [21] may be extended to the case of multiple views, measuring the difference between the distribution of private values within each SIND/RSIND partition and the distribution of the private values in the whole private table.

Thirdly, it would be interesting to study methods to modify views to achieve sufficient indistinguishability and uncertainty according to an integrated metric obtained along the lines illustrated above.

Finally, a deeper investigation is needed for the case in which updates to the original private data are considered, following the preliminary results illustrated in Section 4.

References

[1] Nabil R. Adam and John C. Wortmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4):515–556, December 1989.

[2] Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 2001.

[3] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD Conference), pages 439–450, 2000.

[4] Shipra Agrawal and Jayant R. Haritsa. A framework for high-accuracy privacy-preserving mining. In Proceedings of the 21st International Conference on Data Engineering (ICDE), pages 193–204, 2005.


[5] Roberto J. Jr. Bayardo and Rakesh Agrawal. Data privacy through optimalk-anonymization. In Proceedings of the 21st International Conference onData Engineering (ICDE), pages 217–228, 2005.

[6] Alexander Brodsky, Csilla Farkas, and Sushil Jajodia. Secure databases:Constraints, inference channels, and monitoring disclosures. IEEE Trans-actions on Knowledge and Data Engineering, 12(6):900–919, 2000.

[7] Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, andHoeteck Wee. Toward privacy in public databases. In Theory of Cryptog-raphy, Second Theory of Cryptography Conference (TCC), pages 363–385,2005.

[8] Harry S. Delugach and Thomas H. Hinke. Wizard: A database inferenceanalysis and detection system. IEEE Transactions on Knowledge and DataEngineering, 8(1):56–66, 1996.

[9] Alin Deutsch and Yannis Papakonstantinou. Privacy in database publish-ing. In Database Theory - ICDT 2005, 10th International Conference, pages230–245, 2005.

[10] Alexandre V. Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant.Limiting privacy breaches in privacy preserving data mining. In Proceed-ings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium onPrinciples of Database Systems (PODS), pages 211–222, 2003.

[11] Alexandre V. Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke. Privacy preserving mining of association rules. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 217–228, 2002.

[12] Ruth Gavison. Privacy and the limits of the law. In Deborah G. Johnson and Helen Nissenbaum, editors, Computers, Ethics, and Social Values, pages 332–351. Freeman, San Francisco, 1995.

[13] John Hale and Sujeet Shenoi. Catalytic inference analysis: Detecting inference threats due to knowledge discovery. In Proceedings of the 1997 IEEE Symposium on Security and Privacy, pages 188–199, 1997.

[14] John E. Hopcroft and Richard M. Karp. An $n^{5/2}$ algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.

[15] Xiaoyun Ji and John E. Mitchell. Branch-and-price-and-cut on clique partition problem with minimum clique size requirement. In IMA Special Workshop: Mixed-Integer Programming, 2005.

[16] Murat Kantarcioglu, Jiashun Jin, and Chris Clifton. When do data mining results violate privacy? In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 599–604, 2004.

[17] Krishnaram Kenthapadi, Nina Mishra, and Kobbi Nissim. Simulatable auditing. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 118–127, 2005.

[18] Jon M. Kleinberg, Christos H. Papadimitriou, and Prabhakar Raghavan. Auditing boolean attributes. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 86–91, 2000.

[19] N. Koudas, D. Srivastava, T. Yu, and Q. Zhang. Aggregate query answering on anonymized tables. In Proceedings of the 23rd International Conference on Data Engineering (ICDE), 2007.

[20] Kristen LeFevre, David J. DeWitt, and Raghu Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD Conference), pages 49–60, 2005.

[21] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd International Conference on Data Engineering (ICDE), 2007.

[22] Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), pages 24–35, 2006.

[23] Donald G. Marks. Inference in MLS database systems. IEEE Transactions on Knowledge and Data Engineering, 8(1):46–55, 1996.

[24] Adam Meyerson and Ryan Williams. On the complexity of optimal k-anonymity. In Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 223–228, 2004.

[25] Gerome Miklau and Dan Suciu. A formal analysis of information disclosure in data exchange. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD Conference), pages 575–586, 2004.

[26] Krishnamurty Muralidhar and Rathindra Sarathy. Security of random data perturbation methods. ACM Transactions on Database Systems (TODS), 24(4):487–493, 1999.

[27] Pierangela Samarati. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027, 2001.

[28] Latanya Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):571–578, 2002.

[29] Lingyu Wang, Duminda Wijesekera, and Sushil Jajodia. Cardinality-based inference control in sum-only data cubes. In Proceedings of 7th European Symposium on Research in Computer Security (ESORICS), pages 55–71, 2002.

[30] Xiaokui Xiao and Yufei Tao. Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), pages 139–150, 2006.

[31] Chao Yao, Lingyu Wang, Xiaoyang Sean Wang, and Sushil Jajodia. Indistinguishability: The other aspect of privacy. In Proc. of Secure Data Management, Third VLDB Workshop, volume LNCS 4165, pages 1–17. Springer, 2006.

[32] Chao Yao, Xiaoyang Sean Wang, and Sushil Jajodia. Checking for k-anonymity violation by views. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), pages 910–921, 2005.

[33] Zheng Zhang and Alberto O. Mendelzon. Authorization views and conditional query containment. In Database Theory - ICDT 2005, 10th International Conference, pages 259–273, 2005.
