
Stochastic Protection of Confidential Information in SDB:

A hybrid of Query Restriction and Data Perturbation

(to appear in Operations Research)

Manuel Nunez, Robert Garfinkel, and Ram Gopal

JHU, October 5, 2006

Motivation

Two general goals for a statistical database:
  Protect confidential records
  Provide useful information

The goals are often in conflict, which forces a tradeoff

Problems faced by Census Bureaus, etc.

An Example

Record  Name        Job      Age  Co.  Salary
1       Robinson    Manager  27   A    55
2       Reese       Trainee  42   B    31
3       Furillo     Manager  63   C    107
4       Campanella  Trainee  28   B    28
5       Cox         Manager  55   B    63
6       Snider      Manager  57   A    82
7       Koufax      Trainee  21   D    29
8       Newcombe    Trainee  32   C    31
9       Hodges      Manager  35   D    60
10      Branca      Trainee  36   D    27
11      Loes        Manager  47   B    37
12      Roe         Trainee  28   D    42
13      Reiser      Manager  64   A    94
14      Gilliam     Manager  46   C    51

Types of Protection

Random data perturbation
  Add noise to data, answer all queries

Query restriction/inference control
  Provide exact answers to some queries, but refuse to answer others
  Keep track of answered queries (auditing)

Camouflage/interval methods
  Provide interval answers to queries
  Answer all queries

Exact Disclosure

DB is [2, 5, 8], with two target SUM queries:
  q1 = [1, 1, 1], answer is 15
  q2 = [1, 1, 0], answer is 7

User can solve the system

\[
\begin{aligned}
x_1 + x_2 + x_3 &= 15 \\
x_1 + x_2 &= 7
\end{aligned}
\]

and learn that a3 = 8

Another Perspective

Notice that a linear combination of q1 and q2 yields the canonical vector e3 = [0, 0, 1]. Namely, q1 - q2 = e3

In general, a group of linear queries is “exactly safe” if none of the canonical vectors can be expressed as a linear combination of the queries in the group
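The "exactly safe" condition can be checked mechanically. The following is a minimal sketch, assuming queries are given as integer coefficient vectors; the helper names `rank` and `disclosed_subjects` are illustrative and not taken from the talk. It tests whether each canonical vector e_i lies in the span of the queries by checking whether adding e_i leaves the rank unchanged.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    ncols = len(m[0]) if m else 0
    rk = 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk

def disclosed_subjects(queries, n):
    """1-based subjects i whose canonical vector e_i is a linear combination of the queries."""
    base = rank(queries)
    exposed = []
    for i in range(n):
        e_i = [1 if k == i else 0 for k in range(n)]
        if rank(queries + [e_i]) == base:  # e_i adds nothing new, so it is in the span
            exposed.append(i + 1)
    return exposed

q1, q2 = [1, 1, 1], [1, 1, 0]
print(disclosed_subjects([q1, q2], 3))  # [3], because q1 - q2 = e3
print(disclosed_subjects([q1], 3))      # [], q1 alone is exactly safe
```

A group of queries is exactly safe precisely when the returned list is empty.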

Degrees of Disclosure

Exact disclosure
  Protected when the user is unable to learn the exact confidential value of any subject

Interval disclosure
  Protected when the user is unable to learn that a confidential value lies within an interval pre-specified by the subject

Stochastic disclosure
  Protected when the user is unable to form a random estimate of a confidential value that is accurate with high probability

Previous Work on Query Restriction: J.O.C.

Interval disclosure
SUM and MIN (MAX) queries
Determine a heuristic restriction of the polytope that describes the user’s knowledge
Use it to decide whether to answer the next query: would it result in a “safe” polytope?
Collusion, but not auditing, is a problem
“Success” is a function of the number of answered queries

QR continued

Queries arrive online
Decisions are made without knowing what comes next
If all queries were known, finding a maximum cardinality set to answer is NP-Hard (Chin & Ozsoyoglu)

Previous work on Camouflage (CVC): Operations Research

Hide the confidential vector in the interior of a “safe” polytope Π
Answer all queries q = f(x) with the interval [min f(x), max f(x)], where min and max are taken over x ∈ Π
Answers are deterministically correct
Depending on the query type, finding polynomial, minimum access algorithms is not trivial!

CVC Continued

A set of linear queries can be predetermined to safely yield exact answers via a network flow formulation
Collusion is not a problem
The same cannot be said of “insider information”

Current extensions of CVC

Dealing with insider threats
“Data” vs. “Process”
Finding the best (smallest) camouflaging set based on the threat type and level
Is it necessarily a polytope?

Hybrid of Query Restriction and Data Perturbation

Provide an algorithm to determine which queries (from a given set) can be exactly answered without compromising confidentiality (safe subset)
Provide a protection mechanism to answer all other queries
Maintain consistency of exact answers and protected answers

Our Approach

Given a target set of weighted queries, follow a 3-phase process:

1. Find the maximum weight query subset that can be safely answered exactly

2. For other target queries, answer safe approximate queries exactly (optional)

3. Answer all other non-target queries using a consistent perturbed DB

Importance of Consistency

Suppose q is answered exactly
In the absence of consistency, a user who wants to determine ai can ask a series of queries q´ = q + ei to get a set of i.i.d. estimates of ai
As the number of such queries gets large, the error in the resulting estimate of ai goes to zero
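A small simulation illustrates the averaging attack. This is a sketch under stated assumptions: the DB [2, 5, 8] and query q = [1, 1, 0] are reused from the Exact Disclosure slide, the noise is taken to be zero-mean Gaussian added independently to each answer, and the function name `estimate_by_averaging` is hypothetical.

```python
import random

def estimate_by_averaging(a, q, i, noise_sd, n_queries, seed=0):
    """Estimate a[i] from an exact answer to q and noisy answers to q' = q + e_i."""
    rng = random.Random(seed)
    exact_q = sum(w * x for w, x in zip(q, a))  # q is answered exactly
    q_prime = [w + (1 if k == i else 0) for k, w in enumerate(q)]
    true_q_prime = sum(w * x for w, x in zip(q_prime, a))
    # Assumed inconsistent protection: every repetition gets fresh, independent noise.
    noisy = [true_q_prime + rng.gauss(0.0, noise_sd) for _ in range(n_queries)]
    # Average of the noisy answers minus the exact answer estimates a[i].
    return sum(noisy) / n_queries - exact_q

a = [2, 5, 8]
q = [1, 1, 0]
for n_queries in (10, 1000, 100000):
    print(n_queries, estimate_by_averaging(a, q, 2, noise_sd=5.0, n_queries=n_queries))
```

The printed estimates of a3 tighten around 8 as the number of queries grows, which is the leak that a single consistent perturbed DB is meant to close.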

What is Given to the User?

Guaranteed exact answers to safe target queries
  Public answers imply no threat from user collusion

Approximate answers to unsafe target queries
  This way, we ensure some degree of information for all target queries

Access to a perturbed DB for all other non-target queries

Model Assumptions

DB has n subjects
Only one confidential field: a ∈ R^n (could be a stacking of any number of such fields)
Every subject is identifiable by the record index
Set of subject indexes: N, |N| = n
Queries have nonnegative weights

Phase 1: Query Restriction

Set of target queries: T
Query weights: w
Index set to queries in T: M, |M| = m
Sum of weights for a subset K of M:

\[ W(K) = \sum_{j \in K} w_j \]

Phase 1 Optimization Problem

Problem OPT:

\[ Z = \max \{\, W(K) : K \in F \,\} \]

where F is a family of “safe” subsets

But before defining a safe set, let’s talk about matroids …

Matroids

Modeling theory founded by H. Whitney, 1935
Many applications in combinatorial optimization:
  Maximal spanning tree
  Matroid intersection
  Maximal partition/matching
  Etc.

Quick Definition

A matroid is a pair (M, F): M is a finite set, F is a family of subsets of M
Elements of F are called “independent” sets
Two properties:
  If K is in F, then all subsets of K are in F
  If K and L are in F and |K| = |L| + 1, then one element of K can be added to L to create a new independent set

Rank of K, r(K), is the cardinality of the largest independent set in K

Example: MST

All sub-trees are independent sets
The matroid is the collection of sub-trees
The rank of a subgraph is the number of links of the largest tree in the subgraph

[Figure: example graph with link weights 100, 200, 300, 400, 500, 600]

Example: Sets of L.I. vectors

Find a linear basis from a matrix
The matroid consists of subsets of linearly independent columns
A basis is an independent set of maximum cardinality
Rank of a submatrix is the column-rank of the submatrix

Non example

Consider an Assignment Problem
A set of cells is independent if no row or column appears more than once
Seems to be almost a matroid but it’s not!
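A tiny instance makes the failure concrete. The 2 x 2 weight matrix below is a made-up illustration, not data from the talk: a greedy pass over the cells gets stuck with a poor assignment, which could not happen if the independent cell sets formed a matroid.

```python
from itertools import permutations

def greedy_assignment(w):
    """Take cells by decreasing weight, skipping cells whose row or column is already used."""
    cells = sorted(
        ((w[r][c], r, c) for r in range(len(w)) for c in range(len(w[0]))),
        reverse=True,
    )
    used_rows, used_cols, total = set(), set(), 0
    for weight, r, c in cells:
        if r not in used_rows and c not in used_cols:
            used_rows.add(r)
            used_cols.add(c)
            total += weight
    return total

def best_assignment(w):
    """Optimal total weight by brute force over all row-to-column permutations."""
    n = len(w)
    return max(sum(w[r][p[r]] for r in range(n)) for p in permutations(range(n)))

w = [[10, 9],
     [9, 1]]  # hypothetical weights
print(greedy_assignment(w))  # 11: greedy takes the 10, then is forced to take the 1
print(best_assignment(w))    # 18: the two 9s
```

The exchange property is what breaks: with K = {(row 1, col 2), (row 2, col 1)} and L = {(row 1, col 1)}, neither cell of K can be added to L.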

Main Matroid Result

Given a set of non-negative weights assigned to the elements of M

If (M, F) is a matroid, then the Greedy algorithm will find an independent set (i.e. a set in F) that maximizes the sum of the weights
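For the spanning-tree matroid, the Greedy algorithm is Kruskal's procedure run on decreasing weights. The sketch below assumes a hypothetical four-node graph; only the link weights 100 to 600 echo the MST slide, the topology is invented for illustration.

```python
def greedy_max_forest(nodes, edges):
    """Greedy on the spanning-tree matroid: scan links by decreasing weight and keep
    each link that does not close a cycle (tracked with union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    chosen = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:           # adding (u, v) keeps the set independent
            parent[ru] = rv
            chosen.append((u, v, w))
    return chosen

nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 100), ("A", "C", 300), ("A", "D", 200),
         ("B", "C", 400), ("B", "D", 500), ("C", "D", 600)]  # assumed topology
tree = greedy_max_forest(nodes, edges)
print(tree)                          # [('C', 'D', 600), ('B', 'D', 500), ('A', 'C', 300)]
print(sum(w for _, _, w in tree))    # 1400
```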

Matroid Intersection

Given k matroids (M, F1), …, (M, Fk) and weights for the elements of M, the goal is to find a common independent set that maximizes the sum of the weights
Problem: the intersection of matroids is generally not a matroid
For general k, the problem is NP-Hard
Yet, a modified greedy algorithm works for the intersection of 2 matroids

Matroid and Inference

Given target query set T, let M be the indexes to the queries
A subset K of M is safe w.r.t. subject i if the user cannot learn subject i’s confidential record using linear combinations of queries with index set K
Let Fi be the safe subsets of M w.r.t. subject i

Then, (M, Fi) is a matroid!
A safe set is safe w.r.t. all subjects, that is, it is in the matroid intersection

Examples of Safe Sets

Four target queries:

Subject  Q1  Q2  Q3  Q4
1        1   0   1   0
2        0   1   1   1
3        0   1   0   1
4        1   0   1   1
Weight   40  20  20  30

Independent (Safe) Sets

              Safe w.r.t. subject
K             1    2    3    4    All   W(K)
Empty                                   0
1                                       40
2                                       20
3                                       20
4                                       30
1, 2                                    60
1, 3                                    60
1, 4                                    70
2, 3                                    40
2, 4                                    50
3, 4                                    50
1, 2, 3                                 80
1, 2, 4                                 90
1, 3, 4                                 90
2, 3, 4                                 70
1, 2, 3, 4                              110
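The safety of each subset of the four target queries can be recomputed by brute force. This is a sketch, assuming each query is the column of the Examples of Safe Sets table read top to bottom over subjects 1 to 4, and that "safe" means no canonical vector is in the span of the chosen queries; `rank`, `is_safe` and `best_safe_subset` are illustrative names.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    ncols = len(m[0]) if m else 0
    rk = 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk

def is_safe(vectors, n):
    """True if no canonical vector e_i (i = 1..n) lies in the span of the vectors."""
    base = rank(vectors)
    return all(rank(vectors + [[int(k == i) for k in range(n)]]) > base for i in range(n))

# Query j as a vector over subjects 1..4 (columns of the table), with its weight.
queries = {1: [1, 0, 0, 1], 2: [0, 1, 1, 0], 3: [1, 1, 0, 1], 4: [0, 1, 1, 1]}
weights = {1: 40, 2: 20, 3: 20, 4: 30}

def best_safe_subset(queries, weights, n):
    """Enumerate all non-empty subsets and keep the heaviest safe one."""
    best, best_w = (), 0
    for size in range(1, len(queries) + 1):
        for K in combinations(sorted(queries), size):
            if is_safe([queries[j] for j in K], n):
                w = sum(weights[j] for j in K)
                if w > best_w:
                    best, best_w = K, w
    return best, best_w

print(best_safe_subset(queries, weights, 4))  # ((1, 4), 70)
```

Under these assumptions every triple is unsafe, because Q3 - Q1 = e2 and Q4 - Q2 = e4, so the best safe set is a pair.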

Rank Evaluation

              Rank
K             r1   r2   r3   r4   Min
Empty
1
2
3
4
1, 2
1, 3
1, 4
2, 3
2, 4
3, 4
1, 2, 3
1, 2, 4
1, 3, 4
2, 3, 4
1, 2, 3, 4

Approximate Solutions to OPT

Matroid intersection greedy (MIG) algorithm:
  Start with the full index set M_1 = M
  At iteration t+1, remove one index from M_t to create the set M_{t+1}
  Remove the index that minimizes the ratio

\[ \frac{w_j}{f_j}, \quad j \in M_t \]

  Stop when M_t becomes a safe set

More About MIG

The denominator fj roughly counts in how many additional matroids the set M_{t+1} will become safe
In other words, the best index to remove has a low weight and makes the set M_{t+1} safe for many matroids
MIG will finish in no more than m iterations, and each iteration can be done in O(m^3 n^2) operations
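The removal loop can be sketched as follows. The exact definition of fj is not given on the slides, so this sketch assumes a simplified stand-in: fj is the number of subjects for which the current set is unsafe but becomes safe once query j is removed, with an infinite ratio when that count is zero. The data are the four queries and weights from the Example slide of the Approximation Error discussion; all function names are illustrative.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    ncols = len(m[0]) if m else 0
    rk = 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                factor = m[i][col] / m[rk][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk

def unsafe_subjects(vectors, n):
    """0-based subjects i whose canonical vector e_i lies in the span of the vectors."""
    base = rank(vectors)
    return {i for i in range(n)
            if rank(vectors + [[int(k == i) for k in range(n)]]) == base}

def mig(queries, weights, n):
    """Simplified MIG: drop the index with the smallest w_j / f_j until the set is safe."""
    current, removed = set(queries), []
    while True:
        bad = unsafe_subjects([queries[j] for j in sorted(current)], n)
        if not bad:
            return sorted(current), removed

        def ratio(j):
            rest = [queries[k] for k in sorted(current - {j})]
            f_j = len(bad - unsafe_subjects(rest, n))  # assumed meaning of f_j
            return weights[j] / f_j if f_j else float("inf")

        # Ties are broken by weight, then by lowest index.
        j_star = min(sorted(current), key=lambda j: (ratio(j), weights[j]))
        current.remove(j_star)
        removed.append(j_star)

queries = {1: [1, 1, 1, 1], 2: [1, 1, 0, 1], 3: [1, 0, 1, 1], 4: [4, 8, 8, 2]}
weights = {1: 19, 2: 10, 3: 10, 4: 20}
print(mig(queries, weights, 4))  # ([1, 4], [2, 3])
```

With this reading of fj the loop removes queries 2 and 3 and keeps {1, 4} (weight 39), whereas removing query 1 alone would keep {2, 3, 4} (weight 40), so the heuristic is close to but not at the optimum on this instance.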

Approximation Error

Set obtained from MIG: K, where M \ K is safe
Z is the optimal value of OPT
Nemhauser + Wolsey bounds:

\[ W(M) - W(K) \le Z \le W(M) - \frac{W(K)}{H(m-1)} \]

H(d) is the harmonic number:

\[ H(d) = \sum_{i=1}^{d} \frac{1}{i} \]
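The bounds are cheap to evaluate once MIG has run. The sketch below assumes the bound reads W(M) - W(K) <= Z <= W(M) - W(K)/H(m-1), with K the removed set; the input numbers are hypothetical and the function names are illustrative.

```python
from fractions import Fraction

def harmonic(d):
    """Harmonic number H(d) = 1 + 1/2 + ... + 1/d, computed exactly."""
    return sum(Fraction(1, i) for i in range(1, d + 1))

def opt_bounds(total_weight, removed_weight, m):
    """Lower and upper bounds on Z from the total weight W(M), the removed weight W(K),
    and the number of target queries m (requires m >= 2 so that H(m-1) > 0)."""
    lower = Fraction(total_weight - removed_weight)
    upper = total_weight - Fraction(removed_weight) / harmonic(m - 1)
    return lower, upper

# Hypothetical instance: W(M) = 100, W(K) = 30, m = 5.
lower, upper = opt_bounds(100, 30, 5)
print(float(lower), float(upper))  # 70.0 85.6
```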

Example

K = {2, 3}, M \ K = {1, 4}
K* = {2, 3, 4}, W(K*) = 40
Bounds: 20 < Z < 40.4

Subject  Q1  Q2  Q3  Q4
1        1   1   1   4
2        1   1   0   8
3        1   0   1   8
4        1   1   1   2
Weight   19  10  10  20

Phase 2: Additional Safe Answers

Set S is the chosen set of exact answer queries
What to do about a query q in T \ S?
  Answer a query “close to” q
  Order queries in T \ S according to weight

For instance, if q is a sum query, answer a safe query with smaller query size
Or, answer the closest query to q that is a linear combination of the queries in S

Phase 3: Constrained Perturbation

Goal: Answer all queries with perturbed data (the confidential vector a plus a perturbation), making sure that answers are consistent with target queries
Two almost equivalent methods:
  Perturb and project onto the query hyperplane
  Perturb along the hyperplane direction

Perturb & Project
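One way to realize perturb and project is an orthogonal projection of the noise onto the directions that leave the exact answers unchanged. The sketch below is an assumption about the mechanics, not the authors' implementation: it draws Gaussian noise, removes the component lying in the span of the exactly answered queries (assumed linearly independent), and adds the remainder to a. All names are illustrative.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def solve(A, b):
    """Solve the square system A y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def perturb_and_project(a, exact_queries, noise_sd, seed=0):
    """Return a perturbed DB whose answers to the exact queries equal those of a."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, noise_sd) for _ in a]
    # Normal equations (Q^T Q) y = Q^T noise give the component of the noise in span(Q).
    gram = [[dot(qi, qj) for qj in exact_queries] for qi in exact_queries]
    y = solve(gram, [dot(q, noise) for q in exact_queries])
    projected = [
        e - sum(y[k] * exact_queries[k][i] for k in range(len(exact_queries)))
        for i, e in enumerate(noise)
    ]
    return [x + e for x, e in zip(a, projected)]

a = [2, 5, 8]
q = [1, 1, 0]                        # answered exactly: 7
a_tilde = perturb_and_project(a, [q], noise_sd=1.0)
print(a_tilde, dot(q, a_tilde))      # perturbed values, with q still answering 7 up to rounding
```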

Directional Perturbation
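A directional variant can generate the noise directly inside the hyperplanes, so no projection step is needed. The sketch below assumes this means drawing random coefficients on a basis of directions d with q . d = 0 for every exactly answered query q; the basis construction and all names are illustrative choices.

```python
from fractions import Fraction
import random

def null_space(rows, n):
    """Basis of {d : row . d = 0 for every row}, via exact reduced row echelon form."""
    m = [[Fraction(x) for x in r] for r in rows]
    pivots, rk = [], 0
    for col in range(n):
        p = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if p is None:
            continue
        m[rk], m[p] = m[p], m[rk]
        m[rk] = [x / m[rk][col] for x in m[rk]]       # scale pivot to 1
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                f = m[i][col]
                m[i] = [x - f * y for x, y in zip(m[i], m[rk])]
        pivots.append(col)
        rk += 1
    basis = []
    for free in (c for c in range(n) if c not in pivots):
        d = [Fraction(0)] * n
        d[free] = Fraction(1)
        for r, pc in enumerate(pivots):
            d[pc] = -m[r][free]
        basis.append(d)
    return basis

def directional_perturbation(a, exact_queries, noise_sd, seed=0):
    """Add a random combination of hyperplane directions to a."""
    rng = random.Random(seed)
    out = [float(x) for x in a]
    for d in null_space(exact_queries, len(a)):
        c = rng.gauss(0.0, noise_sd)
        out = [x + c * float(di) for x, di in zip(out, d)]
    return out

print(null_space([[1, 1, 0]], 3))  # two directions: (-1, 1, 0) and (0, 0, 1)
a_tilde = directional_perturbation([2, 5, 8], [[1, 1, 0]], noise_sd=1.0)
print(a_tilde[0] + a_tilde[1])     # still 7 up to rounding
```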

Extending Protection

What to do to provide interval protection?

What to do to provide stochastic protection from exact answers and from the perturbation?

Program G3LP

Let Q be a matrix whose columns are the exact answer queries
Consider linear program G3LP, i ∈ N:

\[
\begin{aligned}
\max\ & z_i = u_i - l_i \\
\text{s.t. } & Q^T u = Q^T a \\
& Q^T l = Q^T a \\
& u, l \ge 0
\end{aligned}
\]

Interval Disclosure

If zi* = ui* - li* is optimal to G3LP, then the user will know

\[ a_i \in [\, l_i^*, u_i^* \,] \]

Interval disclosure occurs when zi* falls below a threshold chosen by subject i
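As a small worked instance, read G3LP as maximizing u_i - l_i subject to Q^T u = Q^T a, Q^T l = Q^T a and u, l >= 0, and take the DB a = [2, 5, 8] from the Exact Disclosure slide. Which queries are answered exactly in each case is an illustrative assumption.

```latex
% Case 1: only q_1 = [1, 1, 1] is answered exactly, so Q^T a = 15.
\begin{align*}
\max\ u_i - l_i \quad \text{s.t.}\quad & u_1 + u_2 + u_3 = 15,\quad l_1 + l_2 + l_3 = 15,\quad u, l \ge 0 \\
\Longrightarrow\quad & u_i^* = 15,\quad l_i^* = 0,\quad z_i^* = 15 \qquad (i = 1, 2, 3)
\end{align*}

% Case 2: q_1 and q_2 = [1, 1, 0] are both answered exactly, so Q^T a = (15, 7).
\begin{align*}
u_1 + u_2 = 7,\ u_1 + u_2 + u_3 = 15 \quad\Longrightarrow\quad & u_3^* = l_3^* = 8,\quad z_3^* = 0 \\
& u_1^* = 7,\quad l_1^* = 0,\quad z_1^* = 7
\end{align*}
```

In the first case the user only learns that each a_i lies in [0, 15]; in the second the interval for a_3 collapses to the single point 8, which is the exact disclosure seen earlier.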

Stochastic Disclosure

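Let Xi be a random estimate of ai

Let l and u be known bounds on ai

For a range width and a probability level, both > 0, ai is protected if

\[ \Pr\big( X_i \in [r, s] \big) < \text{(probability level)} \quad \text{for all } [r, s] \subseteq [l, u] \text{ with } s - r \le \text{(range width)} \]

That is, ai cannot be randomly estimated in any interval of the given range or smaller with the given probability or higher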

Protection against stochastic threat from deterministic answers

Before the perturbation phase, systematically remove queries from the exact answer set until the following condition holds for all subjects:

\[ u_i^* - l_i^* \ge \frac{\text{(range width)}}{\text{(probability level)}} \]
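The link between the interval width u_i* - l_i* and stochastic protection is easy to see if one assumes the user's estimate is uniform on [l, u]; that uniformity is an assumption made here for illustration, and the parameter names `width` and `prob` stand for the range and probability parameters of the Stochastic Disclosure slide.

```python
def max_interval_probability(l, u, width):
    """Largest Pr(X in [r, s]) over sub-intervals of length <= width, for X uniform on [l, u]."""
    return min(1.0, width / (u - l))

def is_protected(l, u, width, prob):
    """Protected if no interval of the given width captures X with probability >= prob."""
    return max_interval_probability(l, u, width) < prob

print(is_protected(0, 15, width=3, prob=0.25))  # True: 3/15 = 0.2 is below 0.25
print(is_protected(0, 7, width=3, prob=0.25))   # False: 3/7 is about 0.43
```

Under this assumption protection holds once u - l exceeds width / prob, so removing exact-answer queries, which widens [l_i*, u_i*], moves each subject toward protection.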

continued

The problem of which queries to remove is also hard
A greedy heuristic gives similar bounds to those of Phase 1

Stochastic threat from Perturbation

Based on the perturbation, confidence intervals on ai can be obtained from Chebyshev’s inequality
Solution is to generate a sequence of i.i.d. perturbations until a safe one is found
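Chebyshev's inequality, Pr(|X - mean| >= k sigma) <= 1/k^2, gives a distribution-free confidence interval. The sketch below assumes the user sees an unbiased perturbed value and knows the standard deviation sigma of the perturbation; the function name is illustrative, and how the resulting interval is judged "safe" is left open, as on the slide.

```python
import math

def chebyshev_half_width(sigma, confidence):
    """Half-width h such that Pr(|X - mean| < h) >= confidence for any
    distribution with standard deviation sigma (0 <= confidence < 1)."""
    k = 1.0 / math.sqrt(1.0 - confidence)  # from 1 - 1/k^2 = confidence
    return k * sigma

# With sigma = 5, a 75% interval around the perturbed value has half-width 2 * sigma.
print(chebyshev_half_width(5.0, 0.75))  # 10.0
```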

Numerical results

Results are very encouraging. Large numbers of queries are answered exactly
Development of a test bank was difficult because of the problem of finding optimal solutions. A class of interesting problems was found for which those solutions were easily determined
