data anonymization (1)
![Page 1: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/1.jpg)
Data Anonymization (1)
![Page 2: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/2.jpg)
Outline
• Problem
• Concepts
• Algorithms on domain generalization hierarchy
• Algorithms on numerical data
![Page 3: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/3.jpg)
The Massachusetts Governor Privacy Breach
Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birth date, Sex
Voter List: Name, Address, Date Registered, Party affiliation, Date last voted, Zip, Birth date, Sex
Quasi Identifier (shared by both tables): Zip, Birth date, Sex
• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex. Name linked to Diagnosis
• This quasi-identifier uniquely identifies 87% of the US population
Sweeney, IJUFKS 2002
![Page 4: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/4.jpg)
Definition
• Table: columns are attributes, rows are records
• Quasi-identifier (QI): a list of attributes that can potentially be used to identify individuals
• K-anonymity: every combination of QI values that occurs in the table occurs in at least k records
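The k-anonymity definition can be checked mechanically by counting how often each quasi-identifier value combination occurs. The following is a minimal sketch; the function name `is_k_anonymous` and the toy table are illustrative assumptions, not part of the slides.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """Return True if every QI value combination occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy table: "zip" and "sex" form the quasi-identifier.
records = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]

two_anonymous = is_k_anonymous(records, ["zip", "sex"], 2)
```

In this toy table the group (0214*, M) contains a single record, so the table is 1-anonymous but not 2-anonymous.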
![Page 5: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/5.jpg)
Basic techniques
• Generalization
  • Zip: {02138, 02139} → 0213*
  • Domain generalization hierarchy: A0 → A1 → … → An
  • E.g. {02138, 02139} → 0213* → 021* → 02* → 0**
  • This hierarchy is a tree structure
• Suppression
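Generalization along a ZIP-code hierarchy can be sketched as masking trailing digits. This is only an illustration: the function name `generalize_zip` is hypothetical, and the fixed-width output (one `*` per masked digit) is an assumption that differs slightly from the shorthand used on the slide.

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a ZIP code with '*'.

    Level 0 is the original value (A0); each higher level moves one step
    up the domain generalization hierarchy.
    """
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level

# Both values collapse to the same generalized value at level 1.
a = generalize_zip("02138", 1)
b = generalize_zip("02139", 1)
```

Suppression corresponds to removing a record entirely (or, in this encoding, masking every digit).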
![Page 6: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/6.jpg)
Balance
Better privacy guarantee ↔ lower data utility
There are many schemes satisfying the k-anonymity specification. We want to minimize the distortion of the table in order to maximize data utility.
• Suppression is required if we cannot find a k-anonymity group for a record.
![Page 7: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/7.jpg)
Criteria
• Minimal generalization: a minimal generalization that satisfies the k-anonymity specification
• Minimal table distortion: a minimal generalization with minimal utility loss; use precision to evaluate the loss [Sweeney papers]
• Application-specific utility
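A precision-style loss can be sketched as follows. This assumes the commonly cited form of Sweeney's precision metric, in which each cell contributes its generalization level divided by the height of its attribute's hierarchy, and precision is one minus the average of those ratios. The function and argument names are hypothetical.

```python
def precision(levels, hierarchy_heights):
    """Precision of a generalized table (assumed form of Sweeney's metric).

    levels[j][i]         : generalization level of attribute i in record j
    hierarchy_heights[i] : height of attribute i's generalization hierarchy
    """
    n_records = len(levels)
    n_attrs = len(hierarchy_heights)
    loss = sum(
        levels[j][i] / hierarchy_heights[i]
        for j in range(n_records)
        for i in range(n_attrs)
    )
    return 1 - loss / (n_records * n_attrs)

# Hypothetical table: ZIP (hierarchy height 4) generalized one level,
# sex (hierarchy height 2) left untouched.
prec = precision([[1, 0], [1, 0]], [4, 2])
```

An untouched table scores 1, and a table generalized to the top of every hierarchy scores 0.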
![Page 8: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/8.jpg)
Complexity
• Finding the optimal generalization is NP-hard (Bayardo ICDE05)
• So all proposed algorithms are approximate algorithms
![Page 9: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/9.jpg)
Shared features in different solutions
• Always satisfy the k-anonymity specification
  • If some records do not, suppress them
• Differences are in the utility loss/cost function
  • Sweeney's precision metric
  • Discernibility & classification metrics
  • Information-privacy metric
Algorithms
• Assume the domain generalization hierarchy is given
• Efficiency
• Utility maximization
![Page 10: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/10.jpg)
Metrics to be optimized
Two cost metrics that we want to minimize (Bayardo ICDE05)
• Discernibility: based on the # of items in the k-anonymity group
• Classification: the dataset has a class label column, and the goal is to preserve the classification model; based on the # of records in minority classes in the group
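The two cost metrics can be sketched as below. The formulas did not survive in the transcript, so this assumes their usual statement: discernibility charges each record the size of its group (and the size of the whole table if its group is smaller than k and thus suppressed), and the classification metric charges one unit per record that is suppressed or not in its group's majority class. All names are hypothetical.

```python
from collections import Counter

def discernibility(groups, k):
    """groups: list of equivalence classes, each a list of class labels."""
    total = sum(len(g) for g in groups)
    cost = 0
    for g in groups:
        if len(g) >= k:
            cost += len(g) ** 2      # each record is indistinguishable from |g| records
        else:
            cost += total * len(g)   # suppressed records are charged the table size
    return cost

def classification_penalty(groups, k):
    """Count records that are suppressed or outside their group's majority class."""
    penalty = 0
    for g in groups:
        if len(g) < k:
            penalty += len(g)
        else:
            majority_size = Counter(g).most_common(1)[0][1]
            penalty += len(g) - majority_size
    return penalty

# Hypothetical grouping: one group of three records, one undersized group.
groups = [["a", "a", "b"], ["a"]]
dm = discernibility(groups, k=2)
cm = classification_penalty(groups, k=2)
```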
![Page 11: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/11.jpg)
Metrics
• A combination of information loss and anonymity gain (Wang ICDE04)
  • Information loss, anonymity gain
  • Information-privacy metric
![Page 12: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/12.jpg)
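Metrics: information loss
• The dataset has class labels
• Entropy: for a set S labeled by different classes, entropy is used to calculate the impurity of the labels

$$\mathrm{Info}(S) = -\sum_i p_i \log p_i$$

where $p_i$ is the percentage of label $i$ in $S$
• Information loss of a generalization G: {c1, c2, …, cn} → p

$$I(G) = \mathrm{Info}(S_p) - \sum_i \frac{N_{c_i}}{N_p}\,\mathrm{Info}(R_{c_i})$$

where $N_{c_i}$ and $N_p$ are the numbers of records with value $c_i$ and with the generalized value $p$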
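The entropy-based information loss can be sketched in a few lines. This assumes base-2 logarithms (the slide does not state the base) and represents each child value c_i simply by the list of class labels of its records; the function names are hypothetical.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels (base 2 is an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(children):
    """I(G) for merging child values c_1..c_n into one parent value p.

    children: list of label lists, one per child value.
    """
    parent = [label for child in children for label in child]
    n_p = len(parent)
    weighted = sum(len(child) / n_p * info(child) for child in children)
    return info(parent) - weighted

# Merging two pure children with different labels loses one full bit.
loss_pure = information_loss([["y", "y"], ["n", "n"]])
# Merging two children that are already evenly mixed loses nothing.
loss_mixed = information_loss([["y", "n"], ["y", "n"]])
```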
![Page 13: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/13.jpg)
Anonymity gain
• A(VID): # of records with the VID
• A_G(VID) ≥ A(VID): generalization improves or does not change A(VID)
• Anonymity gain:

$$P(G) = x - A(\mathrm{VID}), \qquad x = \begin{cases} A_G(\mathrm{VID}) & \text{if } A_G(\mathrm{VID}) \le K \\ K & \text{otherwise} \end{cases}$$

• As long as k-anonymity is satisfied, further generalization of the VID yields no additional gain
![Page 14: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/14.jpg)
Information-privacy combined metric
• IP = info loss / anonymity gain = I(G)/P(G)
• We want to minimize IP
• If P(G) == 0, use I(G) only
• Either a small I(G) or a large P(G) will reduce IP
• If the P(G)s are the same, pick the one with minimum I(G)
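The anonymity gain and the combined metric translate directly into code. This sketch follows the definitions on the slides; the function names and the example numbers are hypothetical.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), where x is A_G(VID) capped at k."""
    x = a_after if a_after <= k else k
    return x - a_before

def ip_score(info_loss, gain):
    """IP = I(G) / P(G); fall back to I(G) alone when the gain is zero."""
    return info_loss if gain == 0 else info_loss / gain

# Hypothetical candidate: the VID count rises from 2 to 9 under k = 5,
# so only the part of the increase up to k counts as gain.
gain = anonymity_gain(a_before=2, a_after=9, k=5)
score = ip_score(info_loss=0.6, gain=gain)
```

A greedy algorithm would evaluate this score for each candidate generalization and pick the one with the smallest value.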
![Page 15: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/15.jpg)
Domain-hierarchy based algorithms
• Sweeney's algorithm
• Bayardo's tree pruning algorithm
• Wang's top-down and bottom-up algorithms
• They are all dimension-by-dimension methods
![Page 16: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/16.jpg)
Multidimensional techniques
• Categorical data?
  • Categories are mapped to numbers ("numerized")
  • Bayardo 05 paper
  • Order matters? (no research on that)
• Numerical data
  • K-anonymization → n-dimensional space partitioning
  • Many existing techniques can be applied
![Page 17: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/17.jpg)
Single-dimensional vs. multidimensional
![Page 18: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/18.jpg)
The evolving procedure
• Categorical (domain hierarchy) [Sweeney, top-down/bottom-up]
• Numerized categories, single-dimensional [Bayardo05]
• Numerized/numerical, multidimensional [Mondrian, spatial indexing, …]
![Page 19: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/19.jpg)
Method 1: Mondrian
• Numerize categorical data
• Apply a top-down partitioning process
[Figure: partitioning steps 1, 2.1, and 2.2]
![Page 20: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/20.jpg)
Allowable cut
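In the Mondrian setting, a cut is usually called allowable when both resulting partitions still contain at least k records. The following is a simplified sketch of top-down median partitioning under that assumption. It picks the dimension with the widest raw (not normalized) range, which is a simplification, and all names are hypothetical.

```python
def mondrian(points, k):
    """Recursively split numeric points (tuples) into groups of size >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def spread(d):
        values = [p[d] for p in points]
        return max(values) - min(values)

    # Try dimensions from the widest to the narrowest range.
    for d in sorted(range(len(points[0])), key=spread, reverse=True):
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[d] < median]
        right = [p for p in points if p[d] >= median]
        # Allowable cut: both sides keep at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    # No allowable cut exists, so this partition becomes one anonymity group.
    return [points]

# Hypothetical 2-D data: (age, salary in thousands).
points = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60)]
groups = mondrian(points, k=2)
```

Each returned group would then be published with one generalized range per attribute.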
![Page 21: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/21.jpg)
Method 2: spatial indexing
• Multidimensional spatial techniques
  • Kd-tree (similar to the Mondrian algorithm)
  • R-tree and its variations: R-tree, R+-tree
[Figure: tree with a leaf layer and an upper layer]
![Page 22: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/22.jpg)
Compacting bounds
• Example
  • Uncompacted: age [1-80], salary [10k-100k]
  • Compacted: age [20-40], salary [10k-50k]
• The original Mondrian does not consider compacting bounds
• For the R+-tree, it is done automatically
• Information is better preserved
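Compacting bounds amounts to shrinking each published range to the minimum and maximum values actually present in the group. A minimal sketch, with a hypothetical function name and toy records chosen to reproduce the ranges in the example:

```python
def compact_bounds(group, attributes):
    """Tightest per-attribute (min, max) range covering every record in the group."""
    return {
        a: (min(r[a] for r in group), max(r[a] for r in group))
        for a in attributes
    }

# Hypothetical group whose partition region was age [1-80], salary [10k-100k].
group = [
    {"age": 20, "salary": 10_000},
    {"age": 35, "salary": 50_000},
    {"age": 40, "salary": 30_000},
]
bounds = compact_bounds(group, ["age", "salary"])
```

The compacted ranges are never wider than the partition region, so the published table loses less information.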
![Page 23: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/23.jpg)
Benefits of using R+-Tree
• Scalable: originally designed for indexing large disk-based data
• Multi-granularity k-anonymity: layers
• Better performance
• Better quality
![Page 24: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/24.jpg)
Performance
[Figure: performance plot; the only surviving label is "Mondrian"]
![Page 25: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/25.jpg)
Utility Metrics
• Discernibility penalty
• KL divergence: describes the difference between a pair of distributions (one of them the anonymized data distribution)
• Certainty penalty
  • T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
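The certainty penalty formula itself did not survive in the transcript. The symbol legend suggests the usual form, in which each record contributes, for each of its m attributes, the width of its generalized range t.Ai divided by the width of the attribute's total range T.Ai. The sketch below assumes that form; the names are hypothetical.

```python
def certainty_penalty(generalized, total_ranges):
    """Sum over records t and attributes i of |t.Ai| / |T.Ai| (assumed form).

    generalized  : list of records, each a list of (low, high) ranges
    total_ranges : list of (low, high) ranges, one per attribute
    """
    penalty = 0.0
    for record in generalized:
        for (low, high), (t_low, t_high) in zip(record, total_ranges):
            penalty += (high - low) / (t_high - t_low)
    return penalty

# Hypothetical table with two attributes: age in [0, 100], score in [0, 10].
total_ranges = [(0, 100), (0, 10)]
generalized = [
    [(20, 40), (0, 10)],   # age narrowed to 20-40, score fully generalized
    [(0, 100), (5, 5)],    # age fully generalized, score kept exact
]
cp = certainty_penalty(generalized, total_ranges)
```

An exact value contributes 0 and a fully generalized value contributes 1, so a lower total means better utility.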
![Page 26: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/26.jpg)
![Page 27: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/27.jpg)
Other issues
• Sparse high-dimensionality
  • Transactional data → boolean matrix
  • "On the anonymization of sparse high-dimensional data", ICDE08
• Relates to the clustering problem of transactional data!
  • The above paper uses matrix-based clustering → item-based clustering (?)
![Page 28: Data Anonymization (1)](https://reader036.vdocuments.net/reader036/viewer/2022062301/56813f53550346895daa1370/html5/thumbnails/28.jpg)
Other issues
• Effect of numerizing categorical data
  • The ordering of categories may have a certain impact on quality
• General-purpose utility metrics vs. special task-oriented utility metrics
• Attacks on the k-anonymity definition