li xiong cs573 data privacy and security privacy preserving data mining – secure multiparty...

38
Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Upload: garey-white

Post on 17-Dec-2015

224 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Li Xiong

CS573 Data Privacy and Security

Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Page 2: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Outline• Privacy preserving two-party decision tree

mining using SMC protocols (Lindell & Pinkas ’00)

• Primitive SMC protocols– Secure sum– Secure union (encryption based)– Secure max (probabilistic random response based) – Secure union (probabilistic and randomization

based)• Secure data mining using sub protocols• Random response for privacy preserving data

mining or data sanitization

Page 3: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Random response protocols

• Multi-round probabilistic protocols• Randomization probability associated with

each round• Random response with randomization

probability

Page 4: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

• Multiple rounds• Randomization Probability at round r :

– Pr(r) =

• Local algorithm at round r and node i:

4

Max Protocol – multi-round random response

gi-1(r)>=vi gi-1(r)<vi

gi(r) gi-1(r) w/ prob Pr: rand [gi-1(r), vi)

w/ prob 1-Pr: vi

10 * rdP

igi-1(r) gi(r)

vi

Page 5: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

5

Max Protocol - Illustration

Start18 3532

32 4035

D2

D3

D2

D4

30

20 40

10

18 3532

32 4035

0

Page 6: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

6

Min/Max Protocol - Correctness

• Precision bound:– Converges with r– Smaller p0 and d provides faster convergence

2

)1(

01*1)Pr(1

rrrr

jdPj

Page 7: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

7

Min/Max Protocol - Cost

• Communication cost – single round: O(n)– Minimum # of rounds given precision guarantee (1-e):

Page 8: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

8

Min/Max Protocol - Security

• Probability/confidence based metric: P(C|IR,R)

– Different types of exposures based on claim• Data value: vi=a• Data ownership: Vi contains a

– Change of beliefs• P(C|IR,R) – P(C|R)• P(C|IR, R) / P(C|R)

• Relationship to privacy in anonymization– Change of beliefs P(C|D*, BR) – P(C|BR)

0.50 1

Absolute Privacy Provable Exposure

Page 9: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

9

Min/Max Protocol – Security (Analysis)

• Upper bound for average expected change of beliefs: max r 1/2r-1 * (1-P0*dr-1)

• Larger p0 and d provides better privacy

Page 10: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

10

• Loss of privacy decreases with increasing number of nodes• Probabilistic protocol achieves better privacy (close to 0)• When n is large, anonymous protocol is actually okay!

Min/Max Protocol – Security (Experiments)

Page 11: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Union

• Commutative encryption based approach– Number of rounds: 2 rounds– Each round: encryption and decryption

• Multi-round random-response approach?

Page 12: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Vector

• Each database has a boolean vector of the data items

• Union vector is a logical OR of all vectors

0

10b1

b2

bL

…p1

0

01

p2

0

10

pc

OR OR OR… =

0

11

VG

Privacy Preserving Indexing of Documents on the Network, Bawa, 2003

Page 13: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Group Vector Protocol

0

0

0

vG’

0

0

1…

vG’

r=1, Pex=1/2, Pin=1/2

Pex=1/2r, Pin=1-Pex

for(i=1; i<L; i++) if (Vs[i]=1 and VG’[i]=0) Set VG’[i]=1 with prob. Pin

if (Vs[i]=0 and VG’[i]=1) Set VG’[i]=0 with prob. Pex

Processing of VG’ at ps of round r…

0

1

0

v1

0

0

1

v2

0

1

0

vc

r=2, Pex=1/4, Pin=3/4 0

0

1

vG’

0

1

1

vG’

0

1

1

vG’

0

0

1

vG’

0

1

1…

vG’

p1 p2 pc

Page 14: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

14

Random Shares based Secure Union• Phase 1: random item addition

– Multiple rounds with permutated ring– Each node sends a random share of its item set and a random share of a random

item set• Phase 2: random item removal

– Each node subtracts its random items set

Page 15: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

15

Random Shares based Secure Union - Analysis

• Item exposure attack– An adversary makes a claim C on a particular item a

node i contributes to the final result (C: vi in xi)• Set exposure attack

– An adversary makes a claim C on the whole set of items a node i contributes to the final union result X (C: xi = ai).

• Change of beliefs (posterior probability and prior probability)– P(C|IR,X) - P(C|X)– P(C|IR,X)/P(C|X)

Page 16: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

16

Exposure Risk – Set Exposure

• Disclosure decreases with increasing number of generated random items and increasing number of participating nodes

• Set exposure risk is or close to 0 for probabilistic and crypto approach

Page 17: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

17

Exposure Risk – Risk Exposure

• Item exposure risk decreases with increasing number of generated random items and participating nodes

• Item exposure risk for probabilistic approach is quite high

Page 18: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

18

Cost Comparison

• Commutative protocol and anonymous communication protocol efficient but sensitive to union size

• Probabilistic protocol efficient but sensitive to domain size• Estimated runtime for the general circuit-based protocol implemented

by FairplayMP framework is 15 days, 127 days and 1.4 years for the domain sizes tested

Page 19: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Open issues

Tradeoff between accuracy, efficiency, and security How to quantify security How to design adjustable protocols

Can we generalize the random-response algorithms and randomization algorithms for operators based on their properties Operators: sum, union, max, min … Properties: commutative, associative, invertible,

randomizable

Page 20: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

• Secure Sum

• Secure Comparison

• Secure Union

• Secure Logarithm

• Secure Poly. Evaluation

• Association Rule Mining

• Decision Trees

• EM Clustering

• Naïve Bayes Classifier

Data Mining on Horizontally Partitioned DataSpecific Secure Tools

Page 21: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

• Secure Comparison

• Secure Set Intersection

• Secure Dot Product

• Secure Logarithm

• Secure Poly. Evaluation

• Association Rule Mining

• Decision Trees

• K-means Clustering

• Naïve Bayes Classifier

• Outlier Detection

Data Mining on Vertically Partitioned DataSpecific Secure Tools

Page 22: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Summary of SMC Based PPDDM

• Mainly used for distributed data mining.• Efficient/specific cryptographic solutions for

many distributed data mining problems are developed.

• Random response or randomization based protocols offer tradeoff between accuracy, efficiency, and security

• Mainly semi-honest assumption(i.e. parties follow the protocols)

Page 23: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Ongoing research

• New models that can trade-off better between efficiency and security

• Game theoretic / incentive issues in PPDM

Page 24: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Outline• Privacy preserving two-party decision tree

mining using SMC protocols (Lindell & Pinkas ’00)

• Primitive SMC protocols– Secure sum– Secure union (encryption based)– Secure max (probabilistic random response based) – Secure union (probabilistic and randomization

based)• Secure data mining using sub protocols• Random response for privacy preserving data

mining or data collection

Page 25: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Data Collection Model

Data Publisher

Step 1: Data Collection

IndividualData

Data Miner

Step 2: Data Publishing

Data cannot be shared directly because of privacy concern

Page 26: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Randomized Response

)5.0(

)(

YesP

P'(Yes) P(Yes) P(No)(1 )

P'(No) P(Yes)(1 ) P(No)

Do you smoke?

Head

Tail No

Yes

The true answer is “Yes”

Biased coin:

5.0

)(

HeadP

Page 27: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Randomized Response

• Multiple attributes encoded in bits

)5.0(

)(

YesP

Head

TailFalse answer !E: 001

True answer E: 110Biased coin:

5.0

)(

HeadP

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003

Page 28: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Generalization for Multi-Valued Categorical Data

True Value: Si

Si

Si+1

Si+2

Si+3

q1

q2

q3

q4

P '(s1)

P '(s2)

P '(s3)

P '(s4)

q1 q4 q3 q2

q2 q1 q4 q3

q3 q2 q1 q4

q4 q3 q2 q1

P(s1)

P(s2)

P(s3)

P(s4)

M

Page 29: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

A Generalization• RR Matrices [Warner 65], [R.Agrawal 05], [S. Agrawal 05]

• RR Matrix can be arbitrary

• Can we find optimal RR matrices?

M

a11 a12 a13 a14

a21 a22 a23 a24

a31 a32 a33 a34

a41 a42 a43 a44

OptRR:Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008

Page 30: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

What is an optimal matrix?

• Which of the following is better?

M1 1 0 0

0 1 0

0 0 1

M2

13

13

13

13

13

13

13

13

13

Page 31: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

What is an optimal matrix?

• Which of the following is better?

M1 1 0 0

0 1 0

0 0 1

M2

13

13

13

13

13

13

13

13

13

Privacy: M2 is betterUtility: M1 is better

So, what is an optimal matrix?

Page 32: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix’s privacy and utility are both better than M (i, e, no other matrix dominates M).– Privacy Quantification– Utility Quantification

• A number of privacy and utility metrics have been proposed. – Privacy: how accurately one can estimate individual info. – Utility: how accurately we can estimate aggregate info.

Page 33: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Optimization Methods

• Approach 1: Weighted sum: w1 Privacy + w2 Utility

• Approach 2– Fix Privacy, find M with the optimal Utility.– Fix Utility, find M with the optimal Privacy.– Challenge: Difficult to generate M with a fixed

privacy or utility.• Proposed Approach: Multi-Objective

Optimization

Page 34: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Optimization algorithm• Evolutionary Multi-Objective Optimization (EMOO)• The algorithm

– Start with a set of initial RR matrices– Repeat the following steps in each iteration

• Mating: selecting two RR matrices in the pool• Crossover: exchanging several columns between the two RR

matrices• Mutation: change some values in a RR matrix• Meet the privacy bound: filtering the resultant matrices • Evaluate the fitness value for the new RR matrices.

Note : the fitness values is defined in terms of privacy and utility metrics

Page 35: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Illustration

Page 36: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Output of Optimization

Privacy

Utility

Worse

Better

M1M2

M4

M3

M5

M7

M6

M8

The optimal set is often plotted in the objective space as Pareto front.

Page 37: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

For First attribute of Adult data

Page 38: Li Xiong CS573 Data Privacy and Security Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Summary• Privacy preserving data mining

– Secure multi-party computation protocols– Random response techniques for computation

and data collection

• Knowledge sensitive data mining