Stochastic Protection of Confidential Information in SDB:
A hybrid of Query Restriction and Data Perturbation
(to appear in Operations Research)
Manuel Nunez, Robert Garfinkel, and Ram Gopal
JHU, October 5, 2006
Motivation
Two general goals for a statistical database:
  Protect confidential records
  Provide useful information
The goals are often in conflict -> tradeoff
Problems faced by Census Bureaus, etc.
An Example
Record Name Job Age Co. Salary
1 Robinson Manager 27 A 55
2 Reese Trainee 42 B 31
3 Furillo Manager 63 C 107
4 Campanella Trainee 28 B 28
5 Cox Manager 55 B 63
6 Snider Manager 57 A 82
7 Koufax Trainee 21 D 29
8 Newcombe Trainee 32 C 31
9 Hodges Manager 35 D 60
10 Branca Trainee 36 D 27
11 Loes Manager 47 B 37
12 Roe Trainee 28 D 42
13 Reiser Manager 64 A 94
14 Gilliam Manager 46 C 51
Types of Protection
Random data perturbation
  Add noise to data, answer all queries
Query restriction/inference control
  Provide exact answers to some queries, but refuse to answer others
  Keep track of answered queries (auditing)
Camouflage/interval methods
  Provide interval answers to queries
  Answer all queries
Exact Disclosure
DB is a = [2, 5, 8]; two target SUM queries:
  q1 = [1, 1, 1], answer is 15
  q2 = [1, 1, 0], answer is 7
User can solve the system
\[ x_1 + x_2 + x_3 = 15, \qquad x_1 + x_2 = 7 \]
and learn that a3 = 8
Another Perspective
Notice that a linear combination of q1 and q2 yields the canonical vector e3 = [0, 0, 1]
Namely, q1 – q2 = e3
In general, a group of linear queries is “exactly safe” if none of the canonical vectors can be expressed as a linear combination of the queries in the group
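The "exactly safe" condition can be checked mechanically: a canonical vector e_i is a linear combination of the answered queries exactly when appending it to them does not raise the rank. The following is a minimal sketch of that test on the three-record example; the function names `rank` and `disclosed_subjects` are illustrative, not from the talk.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][c] != 0:
                f = m[i][c] / m[rk][c]
                m[i] = [x - f * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk

def disclosed_subjects(queries, n):
    """0-based subjects i whose canonical vector e_i lies in the span of the queries."""
    base = rank(queries)
    out = []
    for i in range(n):
        e_i = [1 if k == i else 0 for k in range(n)]
        if rank(queries + [e_i]) == base:  # e_i adds nothing new: it is in the span
            out.append(i)
    return out

q1, q2 = [1, 1, 1], [1, 1, 0]
print(disclosed_subjects([q1, q2], 3))  # [2]: subject 3 is exactly disclosed
print(disclosed_subjects([q1], 3))      # []: q1 alone is exactly safe
```

Exact rational arithmetic is used so that the rank comparison is not affected by floating-point round-off.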
Degrees of Disclosure
Exact disclosure
  User is unable to learn the exact confidential value of any subject
Interval disclosure
  User is unable to learn that a confidential value is within an interval pre-specified by the subject
Stochastic disclosure
  User is unable to randomly estimate a confidential value with high probability
Previous Work on Query Restriction: J.O.C.
Interval disclosure
SUM and MIN (MAX) queries
Determine a heuristic restriction of the polytope that describes the user’s knowledge.
Use it to decide whether to answer the next query: would it result in a “safe” polytope?
Collusion, but not auditing, is a problem
“Success” is a function of the number of answered queries
QR continued
Queries arrive online
Decisions are made without knowing what comes next
If all queries were known, finding a maximum cardinality set to answer is NP-Hard (Chin & Ozsoyoglu).
Previous work on Camouflage (CVC): Operations Research
Hide the confidential vector in the interior of a “safe” polytope Π.
Answer every query q = f(x) with the interval [min f(x), max f(x)], taken over x ∈ Π
Answers are deterministically correct
Depending on the query type, finding polynomial, minimum access algorithms is not trivial!
CVC Continued
A set of linear queries can be predetermined to safely yield exact answers via a network flow formulation
Collusion is not a problem.
The same cannot be said of “insider information”
Current extensions of CVC
Dealing with insider threats
“Data” vs. “Process”
Finding the best (smallest) camouflaging set based on the threat type and level
Is it necessarily a polytope?
Hybrid of Query Restriction and Data Perturbation
Provide an algorithm to determine which queries (from a given set) can be exactly answered without compromising confidentiality (safe subset)
Provide a protection mechanism to answer all other queries
Maintain consistency of exact answers and protected answers
Our Approach
Given a target set of weighted queries, follow a 3-phase process:
1. Find the maximum weight query subset that can be safely answered exactly
2. For other target queries, answer safe approximate queries exactly (optional)
3. Answer all other non-target queries using a consistent perturbed DB
Importance of Consistency
Suppose q is answered exactly.
In the absence of consistency, a user who wants to determine ai can ask a series of queries q´ = q + ei to get a set of i.i.d. estimates of ai.
As the number of such queries gets large, the error in the resulting estimate of ai goes to zero.
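The averaging argument can be written out under one simple noise model. The model is an assumption made here for illustration: the k-th answer to q´ carries independent zero-mean noise ε_k with variance σ², and K denotes the number of repeated queries.

```latex
% Assumed model: the k-th perturbed answer to q' = q + e_i is r_k = q'^T a + \varepsilon_k.
% Subtracting the exact answer q^T a isolates a_i:
\hat{a}_i^{(k)} = r_k - q^{T} a = a_i + \varepsilon_k ,
\qquad
\bar{a}_i = \frac{1}{K} \sum_{k=1}^{K} \hat{a}_i^{(k)} = a_i + \frac{1}{K} \sum_{k=1}^{K} \varepsilon_k ,
\qquad
\operatorname{Var}(\bar{a}_i) = \frac{\sigma^{2}}{K} \longrightarrow 0 \quad (K \to \infty).
```

A consistent perturbed DB blocks this attack because repeating q´ returns the same answer each time, so no new independent estimate is gained.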
What is Given to the User?
Guaranteed exact answers to safe target queries
  Public answers imply no threat from user collusion
Approximate answers to unsafe target queries
  This way, we ensure some degree of information for all target queries
Access to a perturbed DB for all other non-target queries
Model Assumptions
DB has n subjects
Only one confidential field: a ∈ R^n (could be a stacking of any number of such fields)
Every subject is identifiable by the record index
Set of subject indexes: N, |N| = n
Queries have nonnegative weights
Phase 1: Query Restriction
Set of target queries: T
Query weights: w
Index set to queries in T: M, |M| = m
Sum of weights for a subset K of M:
\[ W(K) = \sum_{j \in K} w_j \]
Phase 1 Optimization Problem
Problem OPT:
\[ z = \max \{ W(K) : K \in F \} \]
where F is a family of “safe” subsets
But before defining a safe set, let’s talk about matroids …
Matroids
Modeling theory founded by H. Whitney, 1935
Many applications in combinatorial optimization:
  Maximal spanning tree
  Matroid intersection
  Maximal partition/matching
  Etc.
Quick Definition
A matroid is a pair (M, F): M is a finite set, F is a family of subsets of M
Elements of F are called “independent” sets
Two properties:
  If K is in F, then all subsets of K are in F
  If K and L are in F and |K| = |L| + 1, then one element of K can be added to L to create a new independent set
The rank of K, r(K), is the cardinality of the largest independent set in K
Example: MST
All sub-trees are independent sets
The matroid is the collection of sub-trees
The rank of a subgraph is the number of links of the largest tree in the subgraph
[Figure: example graph with link weights 100, 200, 300, 400, 500, 600]
Example: Sets of L.I. vectors
Find a linear basis from a matrix
The matroid consists of subsets of linearly independent columns
A basis is an independent set of maximum cardinality
The rank of a submatrix is the column-rank of the submatrix
Non example
Consider an Assignment Problem
A set of cells is independent if no row or column appears more than once.
Seems to be almost a matroid, but it’s not!
Main Matroid Result
Given a set of non-negative weights assigned to the elements of M
If (M, F) is a matroid, then the Greedy algorithm will find an independent set (i.e. a set in F) that maximizes the sum of the weights
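The greedy algorithm only needs an independence oracle. Below is a minimal sketch on the spanning-tree matroid, where a set of links is independent if it contains no cycle. The graph is hypothetical: its nodes and topology are made up for illustration, and only the link weights echo the MST example.

```python
def greedy_max_weight(elements, weight, is_independent):
    """Scan elements by decreasing weight; keep each one that preserves independence."""
    chosen = []
    for e in sorted(elements, key=weight, reverse=True):
        if is_independent(chosen + [e]):
            chosen.append(e)
    return chosen

def is_forest(edges):
    """Independence oracle for the graphic matroid: True if the edges contain no cycle."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v, _w in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # u and v already connected: this edge closes a cycle
            return False
        parent[ru] = rv
    return True

# Hypothetical 4-node graph; each edge is (node, node, weight).
edges = [("a", "b", 100), ("b", "c", 300), ("a", "c", 200),
         ("c", "d", 400), ("b", "d", 500), ("a", "d", 600)]

tree = greedy_max_weight(edges, weight=lambda e: e[2], is_independent=is_forest)
print(tree, sum(e[2] for e in tree))
```

Swapping `is_forest` for a linear-independence test gives the same greedy procedure on the matroid of linearly independent vectors.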
Matroid Intersection
Given k matroids (M, F1), …, (M, Fk) and weights for the elements of M, the goal is to find a common independent set that maximizes the sum of the weights
Problem: the intersection of matroids is not a matroid
For general k, the problem is NP-Hard
Yet, a modified greedy algorithm works for the intersection of 2 matroids
Matroid and Inference
Given target query set T, let M be the indexes to the queries
A subset K of M is safe w.r.t. subject i if the user cannot learn subject i’s confidential record using linear combinations of queries with index set K
Let Fi be the safe subsets of M w.r.t. subject i
Then, (M, Fi) is a matroid!
A safe set is safe w.r.t. all subjects, that is, it is in the matroid intersection
Examples of Safe Sets
Four target queries:
Subject   Q1  Q2  Q3  Q4
1          1   0   1   0
2          0   1   1   1
3          0   1   0   1
4          1   0   1   1
Weight    40  20  20  30
Independent (Safe) Sets
              Safe w.r.t. subject
K             1   2   3   4   All   W(K)
Empty                                  0
1                                     40
2                                     20
3                                     20
4                                     30
1, 2                                  60
1, 3                                  60
1, 4                                  70
2, 3                                  40
2, 4                                  50
3, 4                                  50
1, 2, 3                               80
1, 2, 4                               90
1, 3, 4                               90
2, 3, 4                               70
1, 2, 3, 4                           110
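The per-subject safety of each subset K can be recomputed from the four-query example. The sketch below assumes the reading used in the definition of safety: K is safe w.r.t. subject i when the canonical vector e_i is not a linear combination of the queries in K. Helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][c] != 0:
                f = m[i][c] / m[rk][c]
                m[i] = [x - f * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk

# Rows are subjects 1..4, columns are Q1..Q4, as in the four-query example.
table = [[1, 0, 1, 0],
         [0, 1, 1, 1],
         [0, 1, 0, 1],
         [1, 0, 1, 1]]
weights = [40, 20, 20, 30]
n, m = len(table), len(weights)
queries = [[table[i][j] for i in range(n)] for j in range(m)]  # one vector per query

def safe_subjects(K):
    """0-based subjects i for which e_i is NOT in the span of the queries in K."""
    vecs = [queries[j] for j in K]
    base = rank(vecs)
    return [i for i in range(n)
            if rank(vecs + [[int(k == i) for k in range(n)]]) > base]

best = None
for size in range(m + 1):
    for K in combinations(range(m), size):
        safe = safe_subjects(K)
        total = sum(weights[j] for j in K)
        marks = " ".join("x" if i in safe else "." for i in range(n))
        name = ", ".join(str(j + 1) for j in K) or "Empty"
        print(f"{name:<12} {marks}  all={len(safe) == n}  W(K)={total}")
        if len(safe) == n and (best is None or total > best[1]):
            best = (K, total)

print("heaviest safe set:", [j + 1 for j in best[0]], "weight", best[1])
```

Under this reading, Q3 - Q1 = e2 and Q4 - Q2 = e4, so every subset containing {1, 3} or {2, 4} is unsafe, and the heaviest safe subset found by enumeration is {1, 4} with weight 70.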
Rank Evaluation
              Rank
K             r1  r2  r3  r4  Min
Empty
1
2
3
4
1, 2
1, 3
1, 4
2, 3
2, 4
3, 4
1, 2, 3
1, 2, 4
1, 3, 4
2, 3, 4
1, 2, 3, 4
Approximate Solutions to OPT
Matroid intersection greedy (MIG) algorithm:
  Start with the full index set M1 = M
  At iteration t+1, remove one index from Mt to create set Mt+1
  Remove the index that minimizes the ratio
  \[ \frac{w_j}{f_j}, \quad j \in M_t \]
  Stop when Mt becomes a safe set
More About MIG
The denominator fj roughly counts in how many additional matroids the set Mt+1 will become safe
In other words, the best index to remove has low weight and makes the set Mt+1 safe for many matroids
MIG finishes in no more than m iterations, and each iteration can be done in O(m^3 n^2) operations
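The removal loop can be sketched as follows. Since the talk only says that fj "roughly counts" newly safe matroids, this sketch assumes a concrete definition: fj is the number of subjects for which Mt is unsafe but Mt without j is safe. Ties and the case fj = 0 are broken by smaller weight, which is also an assumption. The two data sets are the query tables from the "Examples of Safe Sets" and "Example" slides.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][c] != 0:
                f = m[i][c] / m[rk][c]
                m[i] = [x - f * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk

def unsafe_count(queries, K, n):
    """Number of subjects i whose e_i lies in the span of the queries indexed by K."""
    vecs = [queries[j] for j in K]
    base = rank(vecs)
    return sum(1 for i in range(n)
               if rank(vecs + [[int(k == i) for k in range(n)]]) == base)

def mig(table, weights):
    """Return (kept, removed) 0-based query indexes; rows of table are subjects."""
    n, m = len(table), len(weights)
    queries = [[table[i][j] for i in range(n)] for j in range(m)]
    current, removed = list(range(m)), []
    while True:
        before = unsafe_count(queries, current, n)
        if before == 0:                      # current set is safe for every subject
            return current, removed

        def ratio(j):
            rest = [k for k in current if k != j]
            f_j = before - unsafe_count(queries, rest, n)  # ASSUMED definition of f_j
            return weights[j] / f_j if f_j > 0 else float("inf")

        j = min(current, key=lambda k: (ratio(k), weights[k]))
        current.remove(j)
        removed.append(j)

ex1 = mig([[1, 0, 1, 0], [0, 1, 1, 1], [0, 1, 0, 1], [1, 0, 1, 1]], [40, 20, 20, 30])
ex2 = mig([[1, 1, 1, 4], [1, 1, 0, 8], [1, 0, 1, 8], [1, 1, 1, 2]], [19, 10, 10, 20])
print(ex1, ex2)
```

With this assumed fj, both runs remove queries 2 and 3 and keep {1, 4}. The kept weight is 70 on the first data set and 39 on the second, where the Example slide reports a better safe set K* = {2, 3, 4} of weight 40, so the heuristic is not always optimal.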
Approximation Error
Set obtained from MIG: K, M \ K is safe
Z is the optimal value of OPT
Nemhauser + Wolsey bounds:
\[ W(M) - W(K) \le Z \le W(M) - \frac{W(K)}{H(m-1)} \]
H(d) is the harmonic number:
\[ H(d) = \sum_{i=1}^{d} \frac{1}{i} \]
Example
K = {2, 3}, M\K = {1, 4}
K* = {2, 3, 4}, W(K*) = 40
Bounds: 20 < Z < 40.4

Subject   Q1  Q2  Q3  Q4
1          1   1   1   4
2          1   1   0   8
3          1   0   1   8
4          1   1   1   2
Weight    19  10  10  20
Phase 2: Additional Safe Answers
Set S is the chosen set of exact answer queries
What to do about a query q in T\S?
  Answer a query “close to” q
  Order queries in T\S according to weight
For instance, if q is a sum query, answer a safe query with smaller query size
Or, answer the closest query to q that is a linear combination of the queries in S
Phase 3: Constrained Perturbation
Goal: Answer all queries with perturbed data (a plus a perturbation), making sure that answers are consistent with the target queries
Two almost equivalent methods:
  Perturb and project onto the query hyperplane
  Perturb in the hyperplane direction
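The "perturb and project" idea can be sketched as follows: draw noise, then remove its component along every exactly answered query so that those answers are unchanged on the perturbed data. Gaussian noise, the parameter `sigma`, and the function names are assumptions made for illustration. The talk does not specify the noise distribution.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = [float(x) for x in v]
        for b in basis:
            c = dot(w, b)
            w = [x - c * y for x, y in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:                      # skip vectors already in the span
            basis.append([x / norm for x in w])
    return basis

def consistent_perturbation(a, exact_queries, sigma, rng):
    """Return a + noise, where noise is orthogonal to every exactly answered query."""
    noise = [rng.gauss(0.0, sigma) for _ in a]
    for b in orthonormal_basis(exact_queries):
        c = dot(noise, b)
        noise = [x - c * y for x, y in zip(noise, b)]  # drop component along b
    return [x + d for x, d in zip(a, noise)]

a = [2, 5, 8]            # confidential values from the three-record example
q1 = [1, 1, 1]           # assume q1 is the only query answered exactly
a_pert = consistent_perturbation(a, [q1], sigma=1.0, rng=random.Random(0))
print(a_pert, dot(q1, a_pert))  # the SUM over all records stays at 15 up to round-off
```

Any query that is a linear combination of the exact queries also keeps its exact answer on the perturbed data, which is the consistency property the goal asks for.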
Extending Protection
What to do to provide interval protection?
What to do to provide stochastic protection from exact answers and from the perturbation?
Program G3LP
Let Q be a matrix whose columns are the exact answer queries
Consider linear program G3LP, i ∈ N:
\[ z_i = \max \; u_i - l_i \quad \text{s.t.} \quad Q^T u = Q^T a, \quad Q^T l = Q^T a, \quad l, u \ge 0 \]
Interval Disclosure
If zi* = ui* − li* is optimal to G3LP, then the user will know
\[ a_i \in [l_i^*, u_i^*] \]
Interval disclosure occurs when
\[ z_i^* < \gamma_i \]
where γi is chosen by subject i
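A small worked instance may help. It assumes G3LP reads "maximize u_i − l_i subject to Qᵀu = Qᵀa, Qᵀl = Qᵀa, l, u ≥ 0", and it reuses the three-record DB a = [2, 5, 8] with only the query q1 = [1, 1, 1] answered exactly. Both are assumptions made for illustration.

```latex
% Q has the single column q_1 = (1,1,1)^T, so Q^T a = 15.
% G3LP for subject i = 3:
z_3 = \max \; u_3 - l_3
\quad \text{s.t.} \quad
u_1 + u_2 + u_3 = 15, \qquad l_1 + l_2 + l_3 = 15, \qquad l, u \ge 0 .
% An optimal solution puts all mass of u on record 3 and none of l on it:
u^{*} = (0, 0, 15), \qquad l^{*} = (15, 0, 0), \qquad z_3^{*} = 15 - 0 = 15 .
```

Under these assumptions the user learns only that a_3 lies in [0, 15]. Adding the second query q2 = [1, 1, 0] would force u_3 = l_3 = 8, shrinking the interval to a single point.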
Stochastic Disclosure
Let Xi be a random estimation of ai
Let l and u be known bounds on ai
For α > 0 and β > 0, ai is protected if
\[ \Pr(X_i \in [r, s]) < \beta \quad \text{for all } [r, s] \subseteq [l, u] \text{ with } s - r \le \alpha \]
That is, ai cannot be randomly estimated in any interval of range α or smaller with probability β or higher
Protection against stochastic threat from deterministic answers
Before the perturbation phase, systematically remove queries from the exact answer set until the following condition holds for all subjects:
\[ u_i^* - l_i^* \ge \alpha / \beta \]
continued
The problem of choosing which queries to remove is also hard.
A greedy heuristic gives bounds similar to those of Phase 1.
Stochastic threat from Perturbation
Based on the perturbation, confidence intervals on ai can be obtained from Chebyshev’s inequality.
The solution is to generate a sequence of i.i.d. perturbations until a safe one is found.
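For reference, the form of Chebyshev's inequality behind such confidence intervals is shown below. It assumes the user's estimate X_i has mean a_i and standard deviation σ_i. The symbols σ_i and k are introduced here for illustration only.

```latex
% Chebyshev's inequality for an estimate X_i with mean a_i and standard deviation \sigma_i:
\Pr\bigl( |X_i - a_i| \ge k \sigma_i \bigr) \le \frac{1}{k^{2}}, \qquad k > 0 ,
% so the interval of half-width k \sigma_i around X_i covers a_i with probability at least
\Pr\bigl( a_i \in [X_i - k \sigma_i, \; X_i + k \sigma_i] \bigr) \ge 1 - \frac{1}{k^{2}} .
% Example: k = 2 gives coverage of at least 75%.
```

A perturbation would then be rejected and redrawn when the interval implied by its variance is too narrow for the protection level required.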