TRANSCRIPT
Anonymity for Continuous Data Publishing
Benjamin C. M. Fung
Concordia University Montreal, QC, Canada
http://www.ciise.concordia.ca/~fung
Ke Wang
Simon Fraser University Burnaby, BC, Canada
Ada Wai-Chee Fu
The Chinese University of Hong Kong
Jian Pei
Simon Fraser University Burnaby, BC, Canada
The 11th International Conference on Extending Database Technology (EDBT 2008)
Privacy-Preserving Data Publishing: k-anonymity [SS98]
2-anonymous patient table
Birthplace Job Disease
UK Professional Flu
UK Professional Diabetes
France Professional Diabetes
France Professional Flu
Raw patient table
Quasi-Identifier (QID) Sensitive
Birthplace Job Disease
UK Engineer Flu
UK Lawyer Diabetes
France Engineer Diabetes
France Lawyer Flu
(Hospital)
Privacy Requirements
k-anonymity [SS98]: Every QID group contains at least k records.
Confidence bounding [WFY05, WFY07]: Bound the confidence of inferring a sensitive value from a QID group within h%.
l-diversity [MGKV06]: Every QID group contains at least l well-represented distinct sensitive values.
Patient table
QID Sensitive
Birthplace Job Disease
UK Professional Flu
UK Professional Diabetes
UK Professional Diabetes
UK Professional Diabetes
France Professional Diabetes
France Professional Diabetes
France Professional Flu
France Professional Flu
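These requirements can be checked mechanically. A minimal Python sketch (the helper names and record layout are illustrative, not from the paper), using the patient table above with QID = {Birthplace, Job}:

```python
from collections import defaultdict

def qid_groups(records, qid_attrs):
    """Partition records into groups sharing the same QID values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qid_attrs)].append(rec)
    return groups

def is_k_anonymous(records, qid_attrs, k):
    """Every QID group contains at least k records."""
    return all(len(g) >= k for g in qid_groups(records, qid_attrs).values())

def is_l_diverse(records, qid_attrs, sensitive, l):
    """Every QID group contains at least l distinct sensitive values
    (a simplified reading of 'well-represented')."""
    return all(len({r[sensitive] for r in g}) >= l
               for g in qid_groups(records, qid_attrs).values())

# The patient table from the slide:
table = [
    {"Birthplace": "UK", "Job": "Professional", "Disease": "Flu"},
    {"Birthplace": "UK", "Job": "Professional", "Disease": "Diabetes"},
    {"Birthplace": "UK", "Job": "Professional", "Disease": "Diabetes"},
    {"Birthplace": "UK", "Job": "Professional", "Disease": "Diabetes"},
    {"Birthplace": "France", "Job": "Professional", "Disease": "Diabetes"},
    {"Birthplace": "France", "Job": "Professional", "Disease": "Diabetes"},
    {"Birthplace": "France", "Job": "Professional", "Disease": "Flu"},
    {"Birthplace": "France", "Job": "Professional", "Disease": "Flu"},
]
qid = ["Birthplace", "Job"]
print(is_k_anonymous(table, qid, 4))           # True: each QID group has 4 records
print(is_l_diverse(table, qid, "Disease", 2))  # True: each group has Flu and Diabetes
```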
Continuous Data Publishing Model
At time T1: collect a set of raw data records D1 and publish a k-anonymous version of D1, denoted release R1.
At time T2: collect a new set of raw data records D2 and want to publish all data collected so far: publish a k-anonymous version of D1 ∪ D2, denoted release R2.
R1 (anonymized D1, published at T1):
Birthplace Job Disease
(a1) Europe (UK) Lawyer Flu
(a2) Europe (UK) Lawyer Flu
(a3) Europe (UK) Lawyer Flu
(a4) Europe (France) Lawyer Diabetes
(a5) Europe (France) Lawyer Diabetes
R2 (anonymized D1 ∪ D2, published at T2):
Birthplace Job Disease
(b1) UK Professional (Lawyer) Flu
(b2) UK Professional (Lawyer) Flu
(b3) UK Professional (Lawyer) Flu
(b4) France Professional (Lawyer) Diabetes
(b5) France Professional (Lawyer) Diabetes
(b6) France Professional (Lawyer) Diabetes
(b7) France Professional (Doctor) Flu
(b8) France Professional (Doctor) Flu
(b9) UK Professional (Doctor) Diabetes
(b10) UK Professional (Lawyer) Diabetes
Correspondence Attacks
An attacker could "crack" the k-anonymity by comparing R1 and R2.
Background knowledge: the QID of a target victim (e.g., Alice was born in France and is a lawyer) and the timestamp of the target victim.
Correspondence knowledge: every record in R1 has a corresponding record in R2; every record timestamped T2 has a record in R2 but not in R1.
Our Contributions
What exactly are the records that can be excluded (cracked) based on R1 and R2? We systematically characterize the set of records cracked by correspondence attacks.
Propose the notion of BCF-anonymity to measure anonymity after excluding the cracked records.
Develop an efficient algorithm to identify a BCF-anonymized R2, and study its data quality.
Extend the proposed approach to deal with more than two releases and other privacy notions.
Problem Statements
Detection problem: determine the number of cracked records in the worst case by applying the correspondence knowledge to the k-anonymized R1 and R2.
Anonymization problem: given R1, D1, and D2, generalize R2 = D1 ∪ D2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible with respect to a specified information metric.
R1 Birthplace Job Disease
(a1) Europe Lawyer Flu
(a2) Europe Lawyer Flu
(a3) Europe Lawyer Flu
(a4) Europe Lawyer Diabetes
(a5) Europe Lawyer Diabetes
R2 Birthplace Job Disease
(b1) UK Professional Flu
(b2) UK Professional Flu
(b3) UK Professional Flu
(b4) France Professional Diabetes
(b5) France Professional Diabetes
(b6) France Professional Diabetes
(b7) France Professional Flu
(b8) France Professional Flu
(b9) UK Professional Diabetes
(b10) UK Professional Diabetes
Forward-Attack (F-Attack)
Alice: {France, Lawyer} with timestamp T1. The attacker attempts to identify her record in R1.
a1, a2, a3 cannot all originate from [France, Lawyer]; otherwise, R2 would contain at least three records [France, Professional, Flu].
(R1 and R2 as shown earlier.)
F-Attack
CG(qid1, qid2) = {(g1, g2), (g1', g2')}, where g1, g1' are the subgroups of qid1 in R1 and g2, g2' are the corresponding subgroups of qid2 in R2, paired by sensitive value.
R1 Birthplace Job Disease
(a1) Europe Lawyer Flu
(a2) Europe Lawyer Flu
(a3) Europe Lawyer Flu
(a4) Europe Lawyer Diabetes
(a5) Europe Lawyer Diabetes
R2 Birthplace Job Disease
(b1) UK Professional Flu
(b2) UK Professional Flu
(b3) UK Professional Flu
(b4) France Professional Diabetes
(b5) France Professional Diabetes
(b6) France Professional Diabetes
(b7) France Professional Flu
(b8) France Professional Flu
(b9) UK Professional Diabetes
(b10) UK Professional Diabetes
F-Attack
Crack size of g1 wrt P: c = |g1| - min(|g1|, |g2|) = 3 - min(3, 2) = 1.
Crack size of g1' wrt P: c = |g1'| - min(|g1'|, |g2'|) = 2 - min(2, 3) = 0.
F(P, qid1, qid2) = Σ c over all pairs in CG(qid1, qid2).
Definition: F-Anonymity
F(qid1, qid2) denotes the maximum F(P, qid1, qid2) for any target P that matches (qid1, qid2).
F(qid1) denotes the maximum F(qid1, qid2) over all qid2 in R2.
The F-anonymity of (R1, R2), denoted FA(R1, R2), is the minimum of (|qid1| - F(qid1)) over all qid1 in R1.
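The computation above can be sketched as follows, assuming the corresponding subgroup pairs in CG(qid1, qid2) have already been identified (function names are illustrative, not from the paper):

```python
def f_crack_size(pairs):
    """F(P, qid1, qid2): sum over the corresponding subgroup pairs
    (g1, g2) in CG(qid1, qid2) of |g1| - min(|g1|, |g2|).
    `pairs` is a list of (|g1|, |g2|) sizes."""
    return sum(g1 - min(g1, g2) for g1, g2 in pairs)

# Running example: qid1 = [Europe, Lawyer], qid2 = [France, Professional].
# Flu pair: |g1| = 3 (a1-a3) vs |g2| = 2 (b7, b8):  3 - min(3, 2) = 1
# Diabetes pair: |g1'| = 2 (a4, a5) vs |g2'| = 3 (b4-b6): 2 - min(2, 3) = 0
print(f_crack_size([(3, 2), (2, 3)]))  # 1

def f_anonymity_of_group(qid1_size, f_values):
    """|qid1| - F(qid1), where F(qid1) is the worst F(P, qid1, qid2)
    over all qid2 in R2; FA(R1, R2) is the minimum of this over qid1."""
    return qid1_size - max(f_values)

print(f_anonymity_of_group(5, [1]))  # 5 - 1 = 4
```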
(R1 and R2 as shown earlier.)
Cross-Attack (C-Attack)
Alice: {France, Lawyer} with timestamp T1. The attacker attempts to identify her record in R2.
At least one of b4, b5, b6 must have timestamp T2; otherwise, R1 would contain at least three records [Europe, Lawyer, Diabetes].
(R1 and R2 as shown earlier.)
C-Attack
Crack size of g2 wrt P: c = |g2| - min(|g1|, |g2|) = 2 - min(3, 2) = 0.
Crack size of g2' wrt P: c = |g2'| - min(|g1'|, |g2'|) = 3 - min(2, 3) = 1.
C(P, qid1, qid2) = Σ c over all pairs in CG(qid1, qid2).
Definition: C-Anonymity
C(qid1, qid2) denotes the maximum C(P, qid1, qid2) for any target P that matches (qid1, qid2).
C(qid2) denotes the maximum C(qid1, qid2) over all qid1 in R1.
The C-anonymity of (R1, R2), denoted CA(R1, R2), is the minimum of (|qid2| - C(qid2)) over all qid2 in R2.
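The C-attack crack size mirrors the F-attack computation with the roles of g1 and g2 swapped; a minimal sketch (illustrative names, assuming the subgroup pairs are given):

```python
def c_crack_size(pairs):
    """C(P, qid1, qid2): sum over the corresponding subgroup pairs
    (g1, g2) in CG(qid1, qid2) of |g2| - min(|g1|, |g2|);
    `pairs` is a list of (|g1|, |g2|) sizes."""
    return sum(g2 - min(g1, g2) for g1, g2 in pairs)

# Running example: (|g1|, |g2|) = (3, 2) gives 2 - min(3, 2) = 0;
# (|g1'|, |g2'|) = (2, 3) gives 3 - min(2, 3) = 1.
print(c_crack_size([(3, 2), (2, 3)]))  # 1
```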
(R1 and R2 as shown earlier.)
Backward-Attack (B-Attack)
Alice: {UK, Lawyer} with timestamp T2. The attacker attempts to identify her record in R2.
At least one of b1, b2, b3 must have timestamp T1; otherwise, one of a1, a2, a3 would have no corresponding record in R2.
(R1 and R2 as shown earlier.)
B-Attack
Target person P: {UK, Lawyer} with timestamp T2.
Crack size of g2 wrt P: c = max(0, |G1| - (|G2| - |g2|)), where g2 = {b1, b2, b3}, G1 = {a1, a2, a3}, G2 = {b1, b2, b3, b7, b8}.
c = max(0, 3 - (5 - 3)) = 1.
(R1 and R2 as shown earlier.)
B-Attack
Crack size of g2' wrt P: c = max(0, |G1'| - (|G2'| - |g2'|)), where g2' = {b9, b10}, G1' = {a4, a5}, G2' = {b4, b5, b6, b9, b10}.
c = max(0, 2 - (5 - 2)) = 0.
B(P, qid2) = Σ c over all g2 in qid2.
Definition: B-Anonymity
B(qid2) denotes the maximum B(P, qid2) for any target P that matches qid2.
The B-anonymity of (R1, R2), denoted BA(R1, R2), is the minimum of (|qid2| - B(qid2)) over all qid2 in R2.
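The B-attack crack size can be sketched as a one-line function over the three group sizes (illustrative names, not from the paper):

```python
def b_crack_size(G1, G2, g2):
    """max(0, |G1| - (|G2| - |g2|)): at least this many records in g2
    must carry timestamp T1, so a T2 victim cannot be among them.
    Arguments are the sizes |G1|, |G2|, |g2|."""
    return max(0, G1 - (G2 - g2))

# g2 = {b1,b2,b3}, G1 = {a1,a2,a3}, G2 = {b1,b2,b3,b7,b8}:
print(b_crack_size(3, 5, 3))  # max(0, 3 - (5 - 3)) = 1
# g2' = {b9,b10}, G1' = {a4,a5}, G2' = {b4,b5,b6,b9,b10}:
print(b_crack_size(2, 5, 2))  # max(0, 2 - (5 - 2)) = 0
```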
In brief…
Cracked records either do not originate from Alice's QID or do not have Alice's timestamp. Such cracked records are not related to Alice; excluding them allows the attacker to focus on a smaller set of candidate records.
Definition: BCF-Anonymity
A BCF-anonymity requirement states that BA(R1, R2) ≥ k, CA(R1, R2) ≥ k, and FA(R1, R2) ≥ k, where k is a user-specified threshold.
We now present an algorithm for anonymizing R2 = D1 ∪ D2.
BCF-Anonymizer:
1. generalize every value of Aj ∈ QID in R2 to ANYj;
2. let the candidate list contain all ANYj;
3. sort the candidate list by Score in descending order;
4. while the candidate list is not empty do
5.   if the first candidate w in the candidate list is valid then
6.     specialize w into {w1, …, wz} in R2;
7.     compute Score for all wi and add them to the candidate list;
8.     sort the candidate list by Score in descending order;
9.   else
10.    remove w from the candidate list;
11.  end if
12. end while
13. output R2
Taxonomy tree for Birthplace: ANY → {Europe, …, America}; Europe → {France, UK, …}.
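The specialization loop can be sketched in Python. This is a simplified, single-attribute illustration: the Score function and the validity check are stand-ins (a plain per-attribute group-size check instead of the paper's BA/CA/FA ≥ k test), and all names are illustrative:

```python
from collections import Counter

# Illustrative taxonomy for one attribute (the paper uses a taxonomy
# tree per QID attribute):
TAXONOMY = {
    "ANY": ["Europe", "America"],
    "Europe": ["France", "UK"],
}

def covers(value, raw, taxonomy):
    """True if generalized `value` covers leaf `raw` (depth <= 2 here)."""
    return raw == value or raw in taxonomy.get(value, [])

def specialize(records, attr, parent, taxonomy):
    """Step 6: replace `parent` by the child covering each raw value."""
    children = taxonomy[parent]
    for rec in records:
        if rec[attr] == parent:
            for child in children:
                if covers(child, rec["raw_" + attr], taxonomy):
                    rec[attr] = child
                    break
    return children

def make_validity_check(k):
    """Stand-in for step 5: accept a specialization only if every group
    on `attr` still has >= k records afterwards (the real check
    requires BA, CA, and FA of (R1, R2) to stay >= k)."""
    def is_valid(records, attr, w, taxonomy):
        trial = [dict(rec) for rec in records]        # tentative specialization
        specialize(trial, attr, w, taxonomy)
        return all(c >= k for c in Counter(r[attr] for r in trial).values())
    return is_valid

def bcf_anonymize(records, attr, taxonomy, is_valid, score=lambda w: 0):
    for rec in records:                # steps 1-2: start fully generalized
        rec[attr] = "ANY"
    candidates = ["ANY"]
    while candidates:                  # steps 3-12
        candidates.sort(key=score, reverse=True)
        w = candidates.pop(0)          # invalid or leaf candidates are dropped
        if w in taxonomy and is_valid(records, attr, w, taxonomy):
            candidates += specialize(records, attr, w, taxonomy)
    return records                     # step 13

records = [{"raw_Birthplace": v} for v in ["UK", "UK", "UK", "France", "France"]]
out = bcf_anonymize(records, "Birthplace", TAXONOMY, make_validity_check(2))
print(Counter(r["Birthplace"] for r in out))  # UK: 3, France: 2
```

With k = 4 the same data stays at "Europe", since specializing Europe would leave a France group of only 2 records.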
Anti-Monotonicity of BCF-Anonymity
Theorem: Each of FA, CA, and BA is non-increasing with respect to a specialization on R2.
This guarantees that the produced BCF-anonymized R2 is maximally specialized (suboptimal): any further specialization leads to a violation.
Empirical Study
Study the threat of correspondence attacks. Evaluate the information usefulness of a BCF-anonymized R2.
Adult dataset (US Census data): 8 categorical attributes; 30,162 records in the training set; 15,060 records in the testing set.
Experiment Settings
D1 contains all records in the testing set. Three cases of D2 at timestamp T2:
200D2: D2 contains the first 200 records in the training set, modelling a small set of new records at T2.
2000D2: D2 contains the first 2,000 records in the training set, modelling a medium set of new records at T2.
allD2: D2 contains all 30,162 records in the training set, modelling a large set of new records at T2.
Anonymization
• BCF-Anonymized R2: our method.
• k-Anonymized R2: not safe from correspondence attacks.
• k-Anonymized D2: anonymize D2 separately from D1.
Related Work
Byun et al. (VLDB-SDM06) is an early study of the continuous data publishing scenario. Its anonymization relies on delaying record release, and the delay can be unbounded. In our method, records collected at timestamp Ti are always published in the corresponding release Ri without delay.
Xiao and Tao (SIGMOD07) present the first study to address both record insertions and deletions in data re-publication. Their anonymization relies on generalization and adding counterfeit records.
Related Work
Wang and Fung (SIGKDD06) study the problem of anonymizing sequential releases, where each subsequent release (R1, R2, …) publishes a different subset of attributes (A, B, C, D) for the same set of records.
Conclusion & Contributions
Systematically characterize different types of correspondence attacks and concisely compute their crack sizes.
Define the BCF-anonymity requirement.
Present an anonymization algorithm to achieve BCF-anonymity while preserving information usefulness.
Extendable to multiple releases.
For more information: http://www.ciise.concordia.ca/~fung
Acknowledgement: Reviewers of EDBT; Concordia University Faculty Start-up Grants; Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants and PGS Doctoral Award.
References
[BSBL06] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In VLDB Workshop on Secure Data Management (SDM), 2006.
[MGKV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, Atlanta, GA, April 2006.
[PXW07] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In SSDBM, Banff, Canada, 2007.
[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, March 1998.
[WF06] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In ACM SIGKDD, Philadelphia, PA, August 2006, pp. 414-423.
[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466-473, November 2005.
[WFY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: an alternative to k-anonymization. Knowledge and Information Systems (KAIS), 11(3):345-368, April 2007.
[XY07] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In ACM SIGMOD, June 2007.