Privacy Preserving Data Mining:An Overview and Examination of Euclidean
Distance Preserving Data Transformation
Chris Giannella
cgiannel AT acm DOT org
Talk Outline
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary
An Example Problem
● The U.S. Census Bureau collects lots of data
● If released in raw form, this data would provide
– a wealth of valuable information regarding broad population patterns
– access to private information regarding individuals
● How to allow analysts to extract population patterns without learning private information?
Privacy-Preserving Data Mining
“The study of how to produce valid mining models and patterns without disclosing private information.”
– F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006.
Several broad approaches … this talk
data transformation (the “census model”)
DT Objectives
• Minimize risk of disclosing private information
• Maximize the analytical utility of the transformed data
DT is also studied in the field of Statistical Disclosure Control.
Some things DT does not address…
• Preventing unauthorized access to the private data (e.g. hacking).
• Securely communicating private data.
DT and cryptography are quite different.
(Moreover, standard encryption does not solve the DT problem.)
Assessing Transformed Data Utility
How accurately does a transformation preserve certain kinds of patterns, e.g.:
● data mean, covariance
● Euclidean distance between data records
● underlying generating distribution?
How useful are the patterns at drawing conclusions/inferences?
Assessing Privacy Disclosure Risk
• Some efforts in the literature to develop rigorous definitions of disclosure risk – none is widely accepted
• This talk will take an ad-hoc approach: for a specific attack, how closely can any private data record be estimated?
Talk Outline
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
● Wrap-up summary
Some DT approaches
● Discussed in this talk:
– Additive independent noise
– Euclidean distance preserving transformation
● My current research
● Others:
– Data swapping/shuffling, multiplicative noise, micro-aggregation, k-anonymization, replacement with synthetic data, etc.
Additive Independent Noise
• For each private data record (x1, …, xn), add independent random noise to each entry:
– (y1, …, yn) = (x1 + e1, …, xn + en)
– ei is generated independently as N(0, d·Var(i)), where Var(i) is the sample variance of attribute i
– Increasing d reduces privacy disclosure risk
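As a minimal sketch of this scheme (illustrative names and data, not the implementation behind the talk), the following perturbs each attribute with zero-mean Gaussian noise whose variance is d times that attribute's sample variance:

```python
import numpy as np

def add_independent_noise(X, d, seed=0):
    """Return X perturbed entrywise with N(0, d * Var(attribute)) noise.

    X: (num_records, num_attributes) array of private records.
    d: noise scale; larger d lowers disclosure risk but also data utility.
    """
    rng = np.random.default_rng(seed)
    var = X.var(axis=0)                               # per-attribute variance
    noise = rng.normal(0.0, np.sqrt(d * var), size=X.shape)
    return X + noise

# Two toy records with attributes (Wages, Rent, Tax)
X = np.array([[98563.0, 1889.0, 2899.0],
              [83821.0, 1324.0, 2578.0]])
Y = add_independent_noise(X, d=0.1)
```

Because each ei is drawn independently, record-level structure such as Euclidean distance between records is only approximately preserved, which is the utility cost discussed on the next slide.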
Additive Independent Noise
• Difficult to set d so as to produce both
• low privacy disclosure risk
• high data utility
• Some enhancements on the basic idea exist
• E.g. Muralidhar et al.
Talk Outline
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation (EDPDT)
● Wrap-up summary
EDPDT – High Data Utility!
● Many data clustering algorithms use Euclidean distance to group records, e.g.
– k-means clustering, hierarchical agglomerative clustering
● If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data.
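A quick numpy check of this claim (my illustration, not material from the talk): multiplying the records by any orthogonal matrix leaves every pairwise Euclidean distance unchanged, so a distance-based clustering algorithm sees identical geometry before and after the transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                    # 50 records as columns, 3 attributes

Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal matrix M
Y = Q @ X                                       # transformed data

def pairwise_dists(A):
    # Euclidean distance between every pair of columns of A
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

# Every pairwise distance is preserved exactly (up to float rounding)
assert np.allclose(pairwise_dists(X), pairwise_dists(Y))
```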
EDPDT – Unclear Privacy Disclosure Risk
● Focus of the research ... approach
– Develop attacks combining the transformed data with plausible prior knowledge.
– How well can these attacks estimate private data records?
Two Different Prior Knowledge Assumptions
● Known input: The attacker knows a small subset of the private data records.
– Focus of this talk.
● Known sample: The attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
– happy to discuss “off-line”.
Known Input Prior Knowledge
Underlying assumption: individuals know
a) whether there is a record for them among the private data records, and
b) the attributes of the private data records.
Each individual knows one private record.
A small group of malicious individuals could cooperate to produce a small subset of the private data records.
Known Input Attack
Given: {Y1, …, Ym} (transformed data records), {X1, …, Xk} (known private data records)
1) Determine the transformation constraints, i.e. which transformed records came from which known private records.
2) Choose T randomly from the set of all distance preserving transformations that satisfy the constraints.
3) Apply T⁻¹ to the transformed data.
Known Input Attack – General Case
Y = MX
● Each column of X (Y) is a private (transformed) data record.
● M is an orthogonal matrix.
[Yknown Yunknown] = M [Xknown Xunknown]
Attack:
Choose T randomly from {T an orthogonal matrix : T Xknown = Yknown}.
Produce T⁻¹(Yunknown).
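One concrete way to realize this attack (my sketch, not necessarily the implementation used in the experiments below) is orthogonal Procrustes: when the known records span the attribute space (k ≥ n, full rank), the orthogonal T with T·Xknown = Yknown is unique and can be recovered from an SVD, and T⁻¹ = Tᵀ then exposes the unknown records exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, u = 3, 4, 6                              # attributes, known, unknown records
M, _ = np.linalg.qr(rng.normal(size=(n, n)))   # secret orthogonal transformation
X_known = rng.normal(size=(n, k))
X_unknown = rng.normal(size=(n, u))
Y_known = M @ X_known                          # attacker sees Y_known, Y_unknown,
Y_unknown = M @ X_unknown                      # and the private X_known

# Orthogonal Procrustes: T = U V^T minimizes ||T X_known - Y_known||_F
U, _, Vt = np.linalg.svd(Y_known @ X_known.T)
T = U @ Vt                                     # here T recovers M exactly

X_est = T.T @ Y_unknown                        # T orthogonal, so T^{-1} = T^T
assert np.allclose(X_est, X_unknown)           # unknown records fully disclosed
```

This matches the experimental picture on the next slide: once enough private records are known, the constraint set collapses to a single transformation and the remaining records are recovered with high accuracy.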
Known Input Attack -- Experiments
18,000-record, 16-attribute real data set.
Given k known private data records, we computed Pk, the probability that the attack estimates one unknown private record with > 85% accuracy.
P2 = 0.16
P4 = 1
…
P16 = 1
Wrap-Up Summary
● Introduction
– Privacy preserving data mining – what problem is it aimed to address?
– Focus of this talk: data transformation
● Some data transformation approaches
● My current research: Euclidean distance preserving data transformation
Thanks to …
● You:
– for your attention
● Kun Liu:
– joint research & some material used in this presentation
● Krish Muralidhar:
– some material used in this presentation
● Hillol Kargupta:
– joint research
Distance Preserving Perturbation

X (private data; one record per column, ID is a label, not an attribute):

        1001    1002
Tax     2,899   2,578
Rent    1,889   1,324
Wages  98,563  83,821

M (orthogonal):

 0.4527  0.8887  0.0726
 0.2559 -0.0514 -0.9653
-0.8542  0.4556 -0.2507

Y = M X (transformed data):

         1001     1002
Tax     10,151    8,432
Rent   -94,502  -80,324
Wages  -26,326  -22,613
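The numbers on this slide can be checked directly (the Tax/Rent/Wages row order is inferred from the layout; the transformed values match M X up to the 4-decimal rounding of M's entries):

```python
import numpy as np

M = np.array([[ 0.4527,  0.8887,  0.0726],
              [ 0.2559, -0.0514, -0.9653],
              [-0.8542,  0.4556, -0.2507]])

# Columns = records 1001 and 1002; rows = Tax, Rent, Wages.
X = np.array([[ 2899.0,  2578.0],
              [ 1889.0,  1324.0],
              [98563.0, 83821.0]])
Y_slide = np.array([[ 10151.0,   8432.0],
                    [-94502.0, -80324.0],
                    [-26326.0, -22613.0]])

assert np.allclose(M @ M.T, np.eye(3), atol=1e-3)   # M is (nearly) orthogonal
assert np.allclose(M @ X, Y_slide, atol=10)         # reproduces the slide's Y

# Euclidean distance between the two records is preserved:
d_x = np.linalg.norm(X[:, 0] - X[:, 1])
d_y = np.linalg.norm(Y_slide[:, 0] - Y_slide[:, 1])
assert abs(d_x - d_y) < 5
```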