Mining Multiple Private Databases
Top-k Queries Across Multiple Private Databases (2005)
Li Xiong (Emory University)
Subramanyam Chitti (GA Tech)
Ling Liu (GA Tech)
Presented by: Cesar Gutierrez
About Me
ISYE Senior and CS minor
Graduating December, 2008
Humanitarian Logistics and/or Supply Chain
Originally from Lima, Peru
Travel, paintball and politics
Outline
Intro. & Motivation
Problem Definition
Important Concepts & Examples
Private Algorithm
Conclusion
Introduction
Fewer restrictions on information sharing, due to technology
Greater need for distributed data-mining tools that preserve privacy
Trade-off: accuracy, efficiency, privacy
Motivating Scenarios
CDC needs to study insurance data to detect disease outbreaks
  Disease incidents
  Disease seriousness
  Patient background
Legal and commercial constraints prevent the release of policyholders' information
Motivating Scenarios (cont'd)
Industrial trade group collaboration
  Useful pattern: "manufacturing using chemical supplies from supplier X have high failure rates"
  Trade secret: "manufacturing process Y gives low failure rate"
Problem & Assumptions
Model: n nodes, horizontal partitioning
Assume semi-honesty:
  Nodes follow the specified protocol
  Nodes attempt to learn additional information about other nodes
...
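To make the horizontal-partitioning model concrete, here is a minimal non-private baseline in Python: every node holds rows of the same kind, and the global top-k is what you would get by pooling them. This is only a reference for what the private protocol is supposed to compute, not the protocol itself. The function names and the sample partitions are hypothetical.

```python
import heapq

def local_topk(rows, k):
    """Top-k values held by a single node (its horizontal partition)."""
    return heapq.nlargest(k, rows)

def global_topk(partitions, k):
    """Non-private reference answer: merge every node's local top-k."""
    candidates = [v for part in partitions for v in local_topk(part, k)]
    return heapq.nlargest(k, candidates)

# Four nodes, each holding its own rows (illustrative values only).
partitions = [[30, 5], [20, 25], [40], [10, 8]]
print(global_topk(partitions, 2))  # [40, 30]
```

Pooling the data this way reveals every node's values to whoever does the merge, which is exactly what the semi-honest setting is meant to avoid.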
Challenges
Why not use a Trusted Third Party (TTP)?
  Difficult to find one that is trusted
  Increased danger from a single point of compromise
Why not use secure multi-party computation techniques?
  High communication overhead
  Feasible for small inputs only
Private Max
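[Figure: four nodes (1 to 4) arranged in a ring, holding the values 30, 20, 40 and 10. Starting from a fixed node, the values passed along the ring are 30, 30, 40, 40.]
Actual data sent on first pass
Static starting point known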
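The single-pass scheme on this slide can be sketched as follows, using the four values from the figure (30, 20, 40, 10). This is an illustrative reading of the slide, assuming each node simply forwards the larger of what it received and what it holds. The function name and return shape are hypothetical.

```python
def naive_ring_max(values, start=0):
    """Pass a running maximum once around the ring, beginning at `start`.

    Returns the final maximum and the list of messages sent, in order.
    """
    n = len(values)
    running = None
    messages = []
    for step in range(n):
        node = (start + step) % n
        own = values[node]
        # The first node has nothing to compare against, so it sends its own value.
        running = own if running is None else max(running, own)
        messages.append(running)
    return running, messages

result, sent = naive_ring_max([30, 20, 40, 10])
print(result, sent)  # 40 [30, 30, 40, 40]
```

The first message is the starting node's actual value, and because the starting point is fixed, its successor knows exactly whose value it is. These are the two weaknesses the slide lists.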
Multi-Round Max
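[Figure: ring of nodes (labels D2, D3, D4 visible) holding the values 30, 20, 40 and 10, with the global value starting at 0. Intermediate values passed between nodes over successive rounds include 18, 32, 35 and 40.]
Randomly perturbed data passed to successor during multiple passes
No successor can determine the actual data of its predecessor
Randomized starting point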
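A rough sketch of a multi-round, randomized variant is given below. The slide does not spell out the update rule, so the following are assumptions: a node whose value exceeds the incoming global value sends, with probability `p0 * d**(r - 1)` in round `r`, a random value between the incoming value and its own, and otherwise sends its own value. The initial global value is taken to be 0, and all names are hypothetical.

```python
import random

def multi_round_max(values, p0=1.0, d=0.5, rounds=5, rng=None):
    """Randomized ring max over several rounds (illustrative sketch only)."""
    rng = rng or random.Random()
    n = len(values)
    start = rng.randrange(n)   # randomized starting point
    g = 0.0                    # initial global value, assumed <= every data value
    trace = []                 # every message passed, in order
    for r in range(1, rounds + 1):
        p_r = p0 * d ** (r - 1)            # assumed randomization probability
        for step in range(n):
            v = values[(start + step) % n]
            if g < v:
                if rng.random() < p_r:
                    # Perturb: send something between the incoming value and v.
                    g = g + rng.random() * (v - g)
                else:
                    # Reveal: send the node's own (larger) value.
                    g = v
            trace.append(g)
    return g, trace

final, trace = multi_round_max([30, 20, 40, 10], rng=random.Random(0))
print(final, len(trace))
```

Under these assumptions the passed value never decreases and never exceeds the true maximum, and a receiver cannot tell whether a message is a real value or a perturbed one. With more rounds the randomization probability shrinks, so the result converges toward the true maximum.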
Evaluation Parameters
Parameter   Description
n           # of nodes in the system
k           KNN parameter
Po          Initial randomization probability in neighbor selection
d           Dampening factor in neighbor selection
r           # of rounds in neighbor selection

Large k = "avoid information leaks"
Large d = more randomization = more privacy
Small d = more accurate (deterministic)
Large r = "as accurate as ordinary classifier"
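To see how Po, d and r interact, the snippet below tabulates a per-round randomization probability. The slide does not give the formula. The geometric schedule `p0 * d**(round - 1)` used here is an assumption, chosen because it matches the notes above: a larger d keeps randomization high for longer, while a smaller d becomes deterministic sooner. The function name is hypothetical.

```python
def randomization_schedule(p0, d, rounds):
    """Assumed probability of sending a perturbed value in each round."""
    return [p0 * d ** (r - 1) for r in range(1, rounds + 1)]

print(randomization_schedule(1.0, 0.5, 4))   # [1.0, 0.5, 0.25, 0.125]
print(randomization_schedule(1.0, 0.25, 4))  # [1.0, 0.25, 0.0625, 0.015625]
```

With d = 0.25 the probability is already below 2% by the fourth round, whereas with d = 0.5 it is still 12.5%. That is the privacy versus accuracy trade-off described for d.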
Conclusion
Problems Tackled
  Preserving efficiency and accuracy while introducing provable privacy to the system
  Improving a naive protocol
  Reducing privacy risk in an efficient manner