Transcript

Mining Multiple Private Databases

Top-k Queries Across Multiple Private Databases (2005)

Li Xiong (Emory University)

Subramanyam Chitti (GA Tech)

Ling Liu (GA Tech)

Presented by: Cesar Gutierrez

About Me

ISYE Senior and CS minor

Graduating December, 2008

Humanitarian Logistics and/or Supply Chain

Originally from Lima, Peru

Travel, paintball and politics

Outline

Intro. & Motivation
Problem Definition
Important Concepts & Examples
Private Algorithm
Conclusion

Introduction

↓ of information-sharing restrictions due to technology

↑ need for distributed data-mining tools that preserve privacy

Trade-off: Accuracy, Efficiency, Privacy

Motivating Scenarios

CDC needs to study insurance data to detect disease outbreaks
  Disease incidents
  Disease seriousness
  Patient background

Legal and commercial problems prevent the release of policyholders' information

Motivating Scenarios (cont'd)

Industrial trade group collaboration
  Useful pattern: "manufacturing using chemical supplies from supplier X have high failure rates"
  Trade secret: "manufacturing process Y gives low failure rate"

Problem & Assumptions

Model: n nodes, horizontal partitioning

Assume semi-honesty:
  Nodes follow the specified protocol
  Nodes attempt to learn additional information about other nodes

...

Challenges

Why not use a Trusted Third Party (TTP)?
  Difficult to find one that is trusted
  Increased danger from single point of compromise

Why not use secure multi-party computation techniques?
  High communication overhead
  Feasible for small inputs only

Recall Our 3-D Goal

Privacy

Accuracy

Efficiency

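Private Max

[Figure: four nodes (1 to 4) holding the values 30, 20, 40 and 10. Beginning at the node marked "start", the values passed from node to node are 30, 30, 40 and 40.]

Actual data sent on first pass

Static starting point known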
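
The single-pass protocol on this slide can be sketched roughly as follows. This is a minimal illustration, assuming the nodes form a ring and each node forwards the larger of the value it received and its own value. The function name `naive_ring_max` and its parameters are hypothetical and do not come from the paper.

```python
def naive_ring_max(values, start=0):
    """Pass a running maximum once around the ring, beginning at `start`.

    Returns the final maximum and the list of values each node forwarded.
    """
    n = len(values)
    running = float("-inf")
    sent = []  # value each node hands to its successor, in visiting order
    for step in range(n):
        node = (start + step) % n
        running = max(running, values[node])
        sent.append(running)
    return running, sent


# Node values as shown on the slide.
result, sent = naive_ring_max([30, 20, 40, 10])
print(result, sent)
```

The first entry of `sent` is the starting node's real value, which is why a known, static starting point leaks data to its successor.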

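Multi-Round Max

[Figure: nodes (labels D2, D3 and D4 visible) holding the values 30, 20, 40 and 10. The global value starts at 0 at the node marked "Start"; the values shown passing between nodes over successive rounds are 18, 35, 32 and 32, 40, 35.]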

Randomly perturbed data passed to successor during multiple passes

No successor can determine the actual data of its predecessor

Randomized Starting Point
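
A rough sketch of the multi-round idea is given below. It rests on several assumptions that the slide does not state: the global value starts at 0 (so data is non-negative), a node whose value exceeds the incoming one either forwards its real value or, with some probability, a random value between the two, and that probability starts at `p0` and is multiplied by `d` after each round. The function name and all parameter names are hypothetical.

```python
import random


def multi_round_max(values, p0=1.0, d=0.5, rounds=10, rng=None):
    """Estimate max(values) by passing a perturbed running value around a ring."""
    if rng is None:
        rng = random.Random()
    n = len(values)
    start = rng.randrange(n)  # randomized starting point
    g = 0                     # initial global value (assumes non-negative data)
    for r in range(1, rounds + 1):
        p_r = p0 * d ** (r - 1)  # assumed decay of the randomization probability
        for step in range(n):
            v = values[(start + step) % n]
            if v > g:
                if rng.random() < p_r:
                    # Hide the real value behind a random one between g and v.
                    g = rng.uniform(g, v)
                else:
                    g = v
            # Otherwise the node forwards g unchanged.
    return g


print(multi_round_max([30, 20, 40, 10], rng=random.Random(0)))
```

Under these assumptions the forwarded value never exceeds the true maximum and never decreases, so later rounds, where randomization is less likely, move it toward the correct answer. With `p0=0` the sketch reduces to the single-pass protocol.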

Evaluation Parameters

Parameter   Description
n           # of nodes in the system
k           KNN parameter
Po          Initial randomization probability in neighbor selection
d           Dampening factor in neighbor selection
r           # of rounds in neighbor selection

Large k = "avoid information leaks"
Large d = more randomization = more privacy
Small d = more accurate (deterministic)
Large r = "as accurate as ordinary classifier"
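
To make the interplay of Po, d and r concrete, the sketch below assumes that the randomization probability in round j is Po * d^(j-1). This decay rule is an assumption made for illustration, since the slide only names the parameters, and the function name `randomization_schedule` is hypothetical.

```python
def randomization_schedule(p0, d, rounds):
    """Randomization probability per round under the assumed rule p0 * d**(j - 1)."""
    return [p0 * d ** (j - 1) for j in range(1, rounds + 1)]


# Compare a small and a large dampening factor over four rounds.
for d in (0.25, 0.75):
    print(d, randomization_schedule(1.0, d, 4))
```

Under this assumption a larger d keeps the probability high for more rounds (more randomization, more privacy), while a smaller d lets the protocol become deterministic sooner. Adding rounds pushes the final probability toward zero in either case.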

Accuracy Results

Varying Rounds

Privacy Results

Conclusion

Problems Tackled
  Preserving efficiency and accuracy while introducing provable privacy to the system
  Improving a naive protocol
  Reducing privacy risk in an efficient manner

Critique

Dependency on other research papers in order to obtain a full understanding

Few/No Illustrations

A real-life example would have created a better understanding of the charts

