mining multiple private databases topk queries across multiple private databases (2005) li xiong...

Mining Multiple Private Databases

Topk Queries Across Multiple Private Databases (2005)

Li Xiong (Emory University)

Subramanyam Chitti (GA Tech)

Ling Liu (GA Tech)

Presented by: Cesar Gutierrez

2

About Me

ISYE Senior and CS minor

Graduating December, 2008

Humanitarian Logistics and/or Supply Chain

Originally from Lima, Peru

Travel, paintball and politics

3

Outline

Intro. & Motivation

Problem Definition

Important Concepts & Examples

Private Algorithm

Conclusion

4

Introduction

↓ in information-sharing restrictions due to technology

↑ in need for distributed data-mining tools that preserve privacy

Trade-off: Accuracy, Efficiency, Privacy

5

Motivating Scenarios

CDC needs to study insurance data to detect disease outbreaks

Disease incidents

Disease seriousness

Patient background

Legal/commercial problems prevent release of policyholders' information

6

Motivating Scenarios (cont'd)

Industrial trade group collaboration

Useful pattern: "manufacturing using chemical supplies from supplier X have high failure rates"

Trade secret: "manufacturing process Y gives low failure rate"

7

Problem & Assumptions

Model: n nodes, horizontal partitioning

Assume semi-honesty:

Nodes follow the specified protocol

Nodes attempt to learn additional information about other nodes

...

8

Challenges

Why not use a Trusted Third Party (TTP)?

Difficult to find one that is trusted

Increased danger from single point of compromise

Why not use secure multi-party computation techniques?

High communication overhead

Feasible for small inputs only

9

Recall Our 3-D Goal

Privacy

Accuracy

Efficiency

10

Private Max

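[Figure: four nodes (1 to 4) in a ring with local values 30, 20, 40 and 10; the values passed from node to node are 30, 30, 40, 40, beginning at the start node]

Actual data sent on first pass

Static starting point known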
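The naive single-pass protocol in the figure can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `naive_ring_max` and the assignment of the values 30, 20, 40, 10 to ring positions are assumptions made for the example.

```python
def naive_ring_max(values, start=0):
    """Single pass around a ring; each node forwards max(received, own value).

    Hypothetical helper for illustration. Returns the final maximum and the
    list of values sent from each node to its successor.
    """
    n = len(values)
    running = float("-inf")  # nothing has been seen before the start node
    sent = []
    for step in range(n):
        node = (start + step) % n
        running = max(running, values[node])
        sent.append(running)  # the successor sees this value in the clear
    return running, sent


result, sent = naive_ring_max([30, 20, 40, 10])
print(result, sent)  # 40 [30, 30, 40, 40]
```

Because the start node is fixed and known, its successor learns the start node's actual value (30 here) directly from the first message, which is the privacy weakness the slide points out.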

11

Multi-Round Max

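[Figure: ring of four nodes (labels D2, D3, D4 survive in the extraction) with local values 30, 20, 40 and 10; the pass starts from 0 and the values forwarded over successive rounds include 18, 32, 35 and 40]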

Randomly perturbed data passed to successor during multiple passes

No successor can determine the actual data of its predecessor

Randomized Starting Point
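A rough sketch of the multi-round idea follows. It rests on assumptions not stated on the slide: that a node holding a value larger than the one it receives either forwards a random value between the two (with probability `p_r`) or its true value, and that `p_r` decays geometrically as `p0 * d**(r - 1)`. The function name `multi_round_max` and its parameters are hypothetical, and values are assumed non-negative so the pass can start from 0 as in the figure.

```python
import random


def multi_round_max(values, p0=1.0, d=0.5, rounds=10, rng=None):
    """Sketch of a randomized multi-round max over a ring (assumed behaviour)."""
    rng = rng or random.Random()
    n = len(values)
    start = rng.randrange(n)  # randomized starting point
    g = 0                     # running global value; assumes values >= 0
    for r in range(1, rounds + 1):
        p_r = p0 * d ** (r - 1)  # assumed geometric decay of randomization
        for step in range(n):
            v = values[(start + step) % n]
            if g >= v:
                continue  # forward the received value unchanged
            if rng.random() < p_r:
                g = rng.uniform(g, v)  # perturbed value between g and v
            else:
                g = v                  # reveal the true local value
    return g


print(multi_round_max([30, 20, 40, 10], rounds=20))
```

Under these assumptions the forwarded value never exceeds the true maximum, and as `p_r` shrinks in later rounds the result approaches it, so a successor seeing an intermediate value cannot tell whether it is its predecessor's real data or a random value below it.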

12

Evaluation Parameters

Parameter: Description

n: # of nodes in the system

k: KNN parameter

Po: Initial randomization probability in neighbor selection

d: Dampening factor in neighbor selection

r: # of rounds in neighbor selection

Large k = "avoid information leaks"

Large d = more randomization = more privacy

Small d = more accurate (deterministic)

Large r = "as accurate as ordinary classifier"
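To make the effect of d and r concrete, the snippet below tabulates how quickly randomization fades. It assumes the randomization probability in round r is `p0 * d**(r - 1)`, which the slide does not state; both function names are hypothetical.

```python
def randomization_probability(p0, d, r):
    """Assumed probability of randomizing in round r (1-indexed)."""
    return p0 * d ** (r - 1)


def rounds_until_below(p0, d, eps):
    """Smallest round whose assumed probability is at most eps.

    Requires 0 < d < 1 and eps > 0 so that the loop terminates.
    """
    r = 1
    while randomization_probability(p0, d, r) > eps:
        r += 1
    return r


for d in (0.25, 0.5):
    print(d, rounds_until_below(1.0, d, 0.001))
# 0.25 6
# 0.5 11
```

Under this assumption a larger d keeps randomization alive for more rounds (more privacy) and therefore needs a larger r to reach the same accuracy, which matches the trade-off listed above.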

13

Accuracy Results

14

Varying Rounds

15

Privacy Results

16

Conclusion

Problems Tackled

Preserving efficiency and accuracy while introducing provable privacy to the system

Improving a naive protocol

Reducing privacy risk in an efficient manner

17

Critique

Dependency on other research papers in order to obtain a full understanding

Few/no illustrations

A real-life example would have created a better understanding of the charts
