Privacy Preserving Data Mining:An Overview and Examination of Euclidean
Distance Preserving Data Transformation
Chris Giannella
cgiannel AT acm DOT org
Talk Outline
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary
An Example Problem
● The U.S. Census Bureau collects lots of data
● If released in raw form, this data would provide
– a wealth of valuable information regarding broad population patterns
– access to private information regarding individuals
● How to allow analysts to extract population patterns without learning private information?
Privacy-Preserving Data Mining
“The study of how to produce valid mining models and patterns without disclosing private information.”
– F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006.
Several broad approaches … this talk
data transformation (the “census model”)
DT Objectives
• Minimize risk of disclosing private information
• Maximize the analytical utility of the transformed data
DT is also studied in the field of Statistical Disclosure Control.
Some things DT does not address…
• Preventing unauthorized access to the private data (e.g. hacking).
• Securely communicating private data.
DT and cryptography are quite different.
(Moreover, standard encryption does not solve the DT problem.)
Assessing Transformed Data Utility
How accurately does a transformation preserve certain kinds of patterns, e.g.:
● data mean, covariance
● Euclidean distance between data records
● underlying generating distribution?
How useful are the patterns at drawing conclusions/inferences?
Assessing Privacy Disclosure Risk
• Some efforts in the literature to develop rigorous definitions of disclosure risk – none is widely accepted
• This talk will take an ad-hoc approach: for a specific attack, how closely can any private data record be estimated?
Talk Outline
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary
Some DT approaches
● Discussed in this talk:
– Additive independent noise
– Euclidean distance preserving transformation
● My current research
● Others:
– Data swapping/shuffling, multiplicative noise, micro-aggregation, k-anonymization, replacement with synthetic data, etc.
Additive Independent Noise
• For each private data record (x1, …, xn), add independent random noise to each entry:
– (y1, …, yn) = (x1 + e1, …, xn + en)
– ei is generated independently as N(0, d·Var(i)), where Var(i) is the sample variance of attribute i
– Increasing d reduces privacy disclosure risk
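As a minimal sketch of this scheme (illustrative names and data, not the implementation behind the talk), the following perturbs each attribute with zero-mean Gaussian noise whose variance is d times that attribute's sample variance:

```python
import numpy as np

def add_independent_noise(X, d, seed=0):
    """Return X perturbed entrywise with N(0, d * Var(attribute)) noise.

    X: (num_records, num_attributes) array of private records.
    d: noise scale; larger d lowers disclosure risk but also data utility.
    """
    rng = np.random.default_rng(seed)
    var = X.var(axis=0)                               # per-attribute variance
    noise = rng.normal(0.0, np.sqrt(d * var), size=X.shape)
    return X + noise

# Two toy records with attributes (Wages, Rent, Tax)
X = np.array([[98563.0, 1889.0, 2899.0],
              [83821.0, 1324.0, 2578.0]])
Y = add_independent_noise(X, d=0.1)
```

Because each ei is drawn independently, record-level structure such as Euclidean distance between records is only approximately preserved, which is the utility cost discussed on the next slide.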
Additive Independent Noise
• Difficult to set d so as to produce both
• low privacy disclosure risk
• high data utility
• Some enhancements on the basic idea exist
• E.g. Muralidhar et al.
Talk Outline
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation (EDPDT)
● Wrap-up summary
EDPDT – High Data Utility!
● Many data clustering algorithms use Euclidean distance to group records, e.g.
– k-means clustering, hierarchical agglomerative clustering
● If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data.
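A quick numpy check of this claim (my illustration, not material from the talk): multiplying the records by any orthogonal matrix leaves every pairwise Euclidean distance unchanged, so a distance-based clustering algorithm sees identical geometry before and after the transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                    # 50 records as columns, 3 attributes

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal matrix M
Y = Q @ X                                       # transformed data

def pairwise_dists(A):
    # Euclidean distance between every pair of columns of A
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

# Every pairwise distance is preserved exactly (up to float rounding)
assert np.allclose(pairwise_dists(X), pairwise_dists(Y))
```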
EDPDT – Unclear Privacy Disclosure Risk
● Focus of the research ... approach
– Develop attacks combining the transformed data with plausible prior knowledge.
– How well can these attacks estimate private data records?
Two Different Prior Knowledge Assumptions
● Known input: The attacker knows a small subset of the private data records.
– Focus of this talk.
● Known sample: The attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
– happy to discuss “off-line”.
Known Input Prior Knowledge
Underlying assumption: individuals know
a) whether there is a record for them among the private data records, and
b) the attributes of the private data records.
Each individual knows one private record.
A small group of malicious individuals could cooperate to produce a small subset of the private data records.
Known Input Attack
Given: {Y1, …, Ym} (transformed data records), {X1, …, Xk} (known private data records)
1) Determine the transformation constraints, i.e. which transformed records came from which known private records.
2) Choose T randomly from the set of all distance preserving transformations that satisfy the constraints.
3) Apply T⁻¹ to the transformed data.
Known Input Attack – General Case
Y = MX
● Each column of X (Y) is a private (transformed) data record.
● M is an orthogonal matrix.
[Yknown Yunknown] = M [Xknown Xunknown]
Attack:
Choose T randomly from {T an orthogonal matrix : T Xknown = Yknown}.
Produce T⁻¹(Yunknown).
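One concrete way to realize this attack (my sketch, not necessarily the implementation used in the experiments below) is orthogonal Procrustes: when the known records span the attribute space (k ≥ n, full rank), the orthogonal T with T·Xknown = Yknown is unique and can be recovered from an SVD, and T⁻¹ = Tᵀ then exposes the unknown records exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, u = 3, 4, 6                              # attributes, known, unknown records
M, _ = np.linalg.qr(rng.normal(size=(n, n)))   # secret orthogonal transformation
X_known = rng.normal(size=(n, k))
X_unknown = rng.normal(size=(n, u))
Y_known = M @ X_known                          # attacker sees Y_known, Y_unknown,
Y_unknown = M @ X_unknown                      # and the private X_known

# Orthogonal Procrustes: T = U V^T minimizes ||T X_known - Y_known||_F
U, _, Vt = np.linalg.svd(Y_known @ X_known.T)
T = U @ Vt                                     # here T recovers M exactly

X_est = T.T @ Y_unknown                        # T orthogonal, so T^{-1} = T^T
assert np.allclose(X_est, X_unknown)           # unknown records fully disclosed
```

This matches the experimental picture on the next slide: once enough private records are known, the constraint set collapses to a single transformation and the remaining records are recovered with high accuracy.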
Known Input Attack -- Experiments
18,000-record, 16-attribute real data set.
Given k known private data records, we computed Pk, the probability that the attack estimates one unknown private record with > 85% accuracy.
P2 = 0.16
P4 = 1
…
P16 = 1
Wrap-Up Summary
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
Thanks to …
● You:
– for your attention
● Kun Liu:
– joint research & some material used in this presentation
● Krish Muralidhar:
– some material used in this presentation
● Hillol Kargupta:
– joint research
Distance Preserving Perturbation

X (private data; one record per column, ID is a label, not an attribute):

        1001    1002
Tax     2,899   2,578
Rent    1,889   1,324
Wages  98,563  83,821

M (orthogonal):

 0.4527  0.8887  0.0726
 0.2559 -0.0514 -0.9653
-0.8542  0.4556 -0.2507

Y = M X (transformed data):

         1001     1002
Tax     10,151    8,432
Rent   -94,502  -80,324
Wages  -26,326  -22,613
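The numbers on this slide can be checked directly (the Tax/Rent/Wages row order is inferred from the layout; the transformed values match M X up to the 4-decimal rounding of M's entries):

```python
import numpy as np

M = np.array([[ 0.4527,  0.8887,  0.0726],
              [ 0.2559, -0.0514, -0.9653],
              [-0.8542,  0.4556, -0.2507]])

# Columns = records 1001 and 1002; rows = Tax, Rent, Wages.
X = np.array([[ 2899.0,  2578.0],
              [ 1889.0,  1324.0],
              [98563.0, 83821.0]])
Y_slide = np.array([[ 10151.0,   8432.0],
                    [-94502.0, -80324.0],
                    [-26326.0, -22613.0]])

assert np.allclose(M @ M.T, np.eye(3), atol=1e-3)   # M is (nearly) orthogonal
assert np.allclose(M @ X, Y_slide, atol=10)         # reproduces the slide's Y

# Euclidean distance between the two records is preserved:
d_x = np.linalg.norm(X[:, 0] - X[:, 1])
d_y = np.linalg.norm(Y_slide[:, 0] - Y_slide[:, 1])
assert abs(d_x - d_y) < 5
```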