Mining Multiple Private Databases
Topk Queries Across Multiple Private Databases (2005)
Li Xiong (Emory University)
Subramanyam Chitti (GA Tech)
Ling Liu (GA Tech)
Presented by: Cesar Gutierrez
About Me
ISYE Senior and CS minor
Graduating December, 2008
Humanitarian Logistics and/or Supply Chain
Originally from Lima, Peru
Travel, paintball and politics
Outline
Intro. & Motivation
Problem Definition
Important Concepts & Examples
Private Algorithm
Conclusion
Introduction
↓ in information-sharing restrictions due to technology
↑ need for distributed data-mining tools that preserve privacy
Trade-off: Accuracy, Efficiency, Privacy
Motivating Scenarios
CDC needs to study insurance data to detect disease outbreaks
  Disease incidents
  Disease seriousness
  Patient background
Legal/commercial problems prevent release of policyholders' information
Motivating Scenarios (cont'd)
Industrial trade group collaboration
  Useful pattern: "manufacturing using chemical supplies from supplier X have high failure rates"
  Trade secret: "manufacturing process Y gives low failure rate"
Problem & Assumptions
Model: n nodes, horizontal partitioning
Assume semi-honesty:
  Nodes follow the specified protocol
  Nodes attempt to learn additional information about other nodes
...
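To make the model concrete, here is a minimal sketch of horizontal partitioning: every node holds different rows of the same schema, and the query result is defined over the union of all rows. The node names, the sample values, and the helper `reference_topk` are hypothetical and only illustrate the setting. The function pools all data in one place, which is exactly what a private protocol must avoid, so it serves only as the non-private baseline that a private answer should match.

```python
import heapq

# Hypothetical data: each node stores different rows of the SAME schema
# (here a single numeric attribute). This is horizontal partitioning.
private_databases = {
    "node1": [30, 12, 7],
    "node2": [20, 18],
    "node3": [40, 5, 33],
    "node4": [10],
}


def reference_topk(databases, k):
    """Non-private baseline: pool every row and return the k largest values.

    A privacy-preserving protocol should return the same answer without
    any node revealing its rows to the others.
    """
    pooled = [value for rows in databases.values() for value in rows]
    return heapq.nlargest(k, pooled)


top3 = reference_topk(private_databases, 3)
global_max = reference_topk(private_databases, 1)[0]
```

With k = 1 the top-k query reduces to a max query over all nodes.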
Challenges
Why not use a Trusted Third Party (TTP)?
  Difficult to find one that is trusted
  Increased danger from a single point of compromise
Why not use secure multi-party computation techniques?
  High communication overhead
  Feasible for small inputs only
Private Max
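[Figure: four nodes (1 to 4) in a ring with local values 30, 20, 40, 10; beginning at "start", the running maximum passed from node to node is 30, 30, 40, 40]
Actual data sent on first pass
Static starting point known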
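The naive ring protocol on this slide can be sketched as follows. This is an illustrative reconstruction from the figure values, not the authors' code; the function name `naive_ring_max` and its `start` parameter are assumptions. Each node replaces the incoming value with the larger of that value and its own local value, then forwards it to its successor.

```python
def naive_ring_max(local_values, start=0):
    """Pass a running maximum once around a ring of nodes.

    local_values[i] is the private value of node i. Returns the final
    maximum and the list of messages each node sent to its successor.
    """
    n = len(local_values)
    messages = []
    running = None
    for step in range(n):
        node = (start + step) % n
        value = local_values[node]
        # The first node has nothing to compare against, so it sends its
        # actual value: this is the privacy leak noted on the slide.
        running = value if running is None else max(running, value)
        messages.append(running)
    return running, messages


result, sent = naive_ring_max([30, 20, 40, 10])
```

Here `sent` is `[30, 30, 40, 40]`: the first message exposes the starting node's real value, and because the starting point is fixed, every node knows whose value it is.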
Multi-Round Max
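[Figure: nodes in a ring with local values 30, 20, 40, 10, beginning from "Start" with 0; values passed between nodes over the rounds include 18, 32, 35 and 32, 35, 40]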
Randomly perturbed data passed to successor during multiple passes
No successor can determine actual data from its predecessor
Randomized Starting Point
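A minimal sketch of a multi-round randomized max protocol in this spirit is shown below. Several details are assumptions rather than statements from the slides: the randomization probability in round r is taken to be p0 * d^(r-1), a node that randomizes sends a random integer between the incoming value and its own value, values are non-negative integers, and the global value starts at 0. The function name `multi_round_max` and its parameters are hypothetical.

```python
import random


def multi_round_max(local_values, p0=1.0, d=0.5, rounds=10, seed=None):
    """Randomized multi-round max over a ring (illustrative sketch).

    Assumes non-negative integer values and a global value that starts
    at 0. Returns the final global value and every message sent.
    """
    rng = random.Random(seed)
    n = len(local_values)
    start = rng.randrange(n)   # randomized starting point
    g = 0                      # assumed lowest value in the domain
    trace = []
    for r in range(1, rounds + 1):
        # Assumed schedule: randomization probability decays each round.
        p_random = p0 * d ** (r - 1)
        for step in range(n):
            v = local_values[(start + step) % n]
            if v > g:
                if rng.random() < p_random:
                    # Hide the real value: send a random value in [g, v).
                    g = rng.randrange(g, v)
                else:
                    # Contribute the real value.
                    g = v
            # If v <= g the node forwards g unchanged.
            trace.append(g)
    return g, trace


value, messages = multi_round_max([30, 20, 40, 10], seed=0)
```

The global value never decreases and never exceeds the true maximum, so a successor cannot tell whether a received value is real or randomized. As the randomization probability shrinks over the rounds, the result converges to the true maximum.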
Evaluation Parameters
Parameter   Description
n           # of nodes in the system
k           KNN parameter
P0          Initial randomization probability in neighbor selection
d           Dampening factor in neighbor selection
r           # of rounds in neighbor selection
Large k = "avoid information leaks"
Large d = more randomization = more privacy
Small d = more accurate (deterministic)
Large r = "as accurate as ordinary classifier"
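The effect of d and r can be illustrated with a small calculation. The slide does not give the formula, so the schedule below is an assumption: the randomization probability in round r is taken to be P0 * d^(r-1). The helper name `randomization_schedule` is hypothetical.

```python
def randomization_schedule(p0, d, rounds):
    """Assumed schedule: probability of randomizing in rounds 1..rounds."""
    return [p0 * d ** (r - 1) for r in range(1, rounds + 1)]


small_d = randomization_schedule(1.0, 0.5, 4)
large_d = randomization_schedule(1.0, 0.9, 4)
```

Under this assumption a larger d keeps the randomization probability high for more rounds (more privacy), a smaller d drops it quickly (more deterministic, more accurate), and more rounds push it toward zero.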
Conclusion
Problems Tackled
  Preserving efficiency and accuracy while introducing provable privacy to the system
  Improving a naive protocol
  Reducing privacy risk in an efficient manner