From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee


Post on 21-Jan-2016


Page 1: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith,

Larry Stockmeyer, Hoeteck Wee

From Idiosyncratic to Stereotypical:

Toward Privacy in Public Databases

Page 2: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

Shuchi Chawla

Database Privacy

Census data – a prototypical example
  – Individuals provide information
  – Census bureau publishes sanitized records
  – Privacy is legally mandated; what utility can we achieve?

Our Goal:
  – What do we mean by preservation of privacy?
  – Characterize the trade-off between privacy and utility
      – disguise individual identifying information
      – preserve macroscopic properties
  – Develop a “good” sanitizing procedure with theoretical guarantees

Page 3: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


An outline of this talk

A mathematical formalism
  – What do we mean by privacy?
  – Prior work
  – An abstract model of datasets
  – Isolation; Good sanitizations

A candidate sanitization
  – A brief overview of results
  – General argument for privacy of n-point datasets

Open issues and concluding remarks

Page 4: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Privacy… a philosophical view-point

[Ruth Gavison] … includes protection from being brought to the attention of others …

  – Matches intuition; inherently desirable
  – Attention invites further loss of privacy
  – Privacy is assured to the extent that one blends in with the crowd

Appealing definition; can be converted into a precise mathematical statement!

Page 5: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Database Privacy

Statistical approaches
  – Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means.
  – Additionally, erase values that reveal too much

Query-based approaches
  – Involve a permanent trusted third party
  – Query monitoring: disallow queries that breach privacy
  – Perturbation: Add noise to the query output [Dinur Nissim ’03, Dwork Nissim ’04]

Statistical perturbation + adversarial analysis
  – [Evfimievski et al ’03] combine statistical techniques with analysis similar to query-based approaches
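
To make the "Perturbation" bullet concrete, here is a minimal sketch of output perturbation for a counting query. It is an illustration only: the function name `noisy_count`, the zero-mean Gaussian noise, and the `noise_scale` parameter are assumptions made here, not details taken from the cited papers.

```python
import random

def noisy_count(rows, predicate, noise_scale=2.0, rng=None):
    """Answer a counting query, then add zero-mean noise to the output.

    Gaussian noise with standard deviation noise_scale is an assumption
    for illustration; the cited work analyzes its own noise choices.
    """
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + rng.gauss(0.0, noise_scale)

# Hypothetical data: ages of seven individuals.
ages = [23, 35, 41, 52, 67, 29, 38]
answer = noisy_count(ages, lambda a: a >= 40, rng=random.Random(0))
```

The trusted party never releases `true_count`; only the perturbed `answer` leaves the database.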

Page 6: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Everybody’s First Suggestion

Learn the distribution, then output:
  – A description of the distribution, or,
  – Samples from the learned distribution

Want to reflect facts on the ground
  – Statistically insignificant facts can be important for allocating resources

Page 7: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


A geometric view

Abstraction:
  – Points in a high dimensional metric space – say $\mathbb{R}^d$; drawn i.i.d. from some distribution
  – Points are unlabeled; you are your collection of attributes
  – Distance is everything

Real Database (RDB) – private
  – n unlabeled points in d-dimensional space.

Sanitized Database (SDB) – public
  – n’ new points, possibly in a different space.

Page 8: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


The adversary or Isolator

Using SDB and auxiliary information (AUX), the adversary outputs a point q

q “isolates” a real point x if it is much closer to x than to x’s neighbors,
  – i.e., if $B(q, c\delta)$ contains fewer than T RDB points, where $\delta = |q - x|$
  – c – privacy parameter; e.g. 4
  – large T and small c is good

T-radius of x – distance to its T-th nearest neighbor

x is “safe” if $\delta_x > (\text{T-radius of } x)/(c-1)$
  – then $B(q, c\delta_x)$ contains x’s entire T-neighborhood

[Figure: points q and x at distance $\delta$, with radii labeled $c\delta$ and $(c-1)\delta$]
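
The isolation test can be sketched in a few lines. This is a minimal illustration under stated assumptions: points are tuples of floats, distance is Euclidean, the ball B(q, cδ) is taken to be closed, and the helper names `t_radius` and `isolates` are hypothetical.

```python
import math

def t_radius(x, rdb, T):
    """Distance from x to its T-th nearest neighbor in rdb.

    Assumes x itself is one of the points in rdb, so the sorted
    distance list starts with 0.0 (x to itself).
    """
    ds = sorted(math.dist(x, y) for y in rdb)
    return ds[T]

def isolates(q, x, rdb, c, T):
    """True if the ball B(q, c*delta) holds fewer than T points of rdb,
    where delta = |q - x|. A closed ball is an assumption here."""
    delta = math.dist(q, x)
    inside = sum(1 for y in rdb if math.dist(q, y) <= c * delta)
    return inside < T

# Hypothetical 2-d database: a tight cluster plus one outlier.
rdb = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
outlier_isolated = isolates((10.1, 10.0), (10.0, 10.0), rdb, c=4, T=2)
cluster_isolated = isolates((0.5, 0.0), (0.0, 0.0), rdb, c=4, T=2)
```

In this toy database the outlier has no crowd to blend into, so a guess close to it isolates it, while the same-quality guess near the cluster does not.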

Page 9: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


A good sanitization

Sanitizing algorithm compromises privacy if the adversary is able to considerably increase his probability of isolating a point by looking at its output

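A rigorous (and too ideal) definition:

  $\forall D\ \forall I\ \exists I'$ w.o.p. $RDB \in_R D^n$ $\forall$ aux $z$ $\forall x \in RDB$:

  $|\Pr[I(SDB, z) \text{ isolates } x] - \Pr[I'(z) \text{ isolates } x]| \le \varepsilon/n$

Definition of $\varepsilon$ can be forgiving, say, $2^{-\Omega(d)}$ or (1 in 1000)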

Quantification over x : If aux reveals info about some x, the privacy of some other y should still be preserved

Provides a framework for describing the power of a sanitization method, and hence for comparisons

Page 10: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


The Sanitizer

The privacy of x is linked to its T-radius
  – Randomly perturb it in proportion to its T-radius
  – $x' = \mathrm{San}(x) \in_R S(x, \text{T-rad}(x))$

Intuition: We are blending x in with its crowd

If the number of dimensions (d) is large, there are “many” pre-images for x’. The adversary cannot conclusively pick any one.

We are adding random noise with mean zero to x, so several macroscopic properties should be preserved.
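
A minimal sketch of such a sanitizer follows. The slide does not say whether S(x, r) is a ball or a sphere surface; this sketch assumes a uniform sample from the solid ball of radius T-rad(x) centered at x. The function names are hypothetical.

```python
import math
import random

def t_radius(x, rdb, T):
    """Distance from x to its T-th nearest neighbor (x is assumed to be in rdb)."""
    return sorted(math.dist(x, y) for y in rdb)[T]

def sanitize_point(x, rdb, T, rng):
    """Perturb x by a uniform draw from the ball of radius T-rad(x) around x.

    Treating S(x, r) as a solid ball is an assumption of this sketch.
    """
    r = t_radius(x, rdb, T)
    d = len(x)
    # Random direction: normalize a vector of independent Gaussians.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    # Radius r * U^(1/d) makes the point uniform over the ball's volume.
    scale = r * rng.random() ** (1.0 / d) / norm
    return tuple(xi + scale * gi for xi, gi in zip(x, g))

def sanitize(rdb, T, rng=None):
    """Sanitize every point independently, each in proportion to its own T-radius."""
    rng = rng or random.Random()
    return [sanitize_point(x, rdb, T, rng) for x in rdb]

rng = random.Random(1)
rdb = [tuple(rng.random() for _ in range(5)) for _ in range(20)]
sdb = sanitize(rdb, T=1, rng=rng)
```

Points in dense regions move very little, while isolated points are moved far, which is the "blending in with the crowd" intuition.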

Page 11: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Results on privacy… An overview

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Gaussian | $2^{o(d)}$ | n sanitized points | Distribution
Gaussian | $2^{\Omega(d)}$ | Work in progress |

Page 12: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Results on utility… An overview

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | - | Optimal diameter as well as approximations increase by at most a factor of 3
Distributional | Find k maximum likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability as long as means are pairwise sufficiently far

Page 13: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


A special case - one sanitized point

RDB = {x1,…,xn}

The adversary is given n-1 real points x2,…,xn and one sanitized point x’1 ; T = 1; c=4; “flat” prior

Recall: $x'_1 \in_R S(x_1, |x_1 - y|)$, where y is the nearest neighbor of x1

Main idea: Consider the posterior distribution on x1
  – Show that the adversary cannot isolate a large probability mass under this distribution

Page 14: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


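A special case - one sanitized point

Let $Z = \{\, p \in \mathbb{R}^d \mid p \text{ is a legal pre-image for } x'_1 \,\}$

$Q = \{\, p \mid \text{if } x_1 = p \text{ then } x_1 \text{ is isolated by } q \,\}$
  – for $p \in Q$: $|p - q| \le \frac{1}{3}\,|p - x'_1|$

We show that $\Pr[\,Q \cap Z \mid x'_1\,] \le 2^{-\Omega(d)} \Pr[\,Z \mid x'_1\,]$

$\Pr[\,x_1 \in Q \cap Z \mid x'_1\,]$ = prob. mass contribution from $Q \cap Z$ / contribution from $Z$ = $2^{1-d}/(1/4)$

[Figure: sanitized point x’1, real points x2,…,x6, the adversary’s point q, and the regions Z, Q and Q∩Z]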
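
The ratio on this slide simplifies to $2^{1-d}/(1/4) = 2^{3-d}$. The short sketch below evaluates it exactly for a few dimensions to show how quickly it shrinks; the function name `isolation_bound` is hypothetical, and the expression is taken directly from the slide rather than re-derived.

```python
from fractions import Fraction

def isolation_bound(d):
    """Evaluate 2^(1-d) / (1/4), the ratio stated on the slide, exactly."""
    return Fraction(2) ** (1 - d) / Fraction(1, 4)

# The bound is vacuous (>= 1) for d <= 3 and halves with every extra dimension.
bounds = {d: isolation_bound(d) for d in (3, 10, 20)}
```

For d = 10 the expression is 1/128, and for d = 20 it is 1/131072, which illustrates why the argument needs a high-dimensional setting.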

Page 15: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


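Contribution from Z

$\Pr[x_1 = p \mid x'_1] \propto \Pr[x'_1 \mid x_1 = p] \propto 1/r^d$, where $r = |x'_1 - p|$
  – Increase in r ⇒ x’1 gets randomized over a larger area, proportional to $r^d$. Hence the inverse dependence.

$\Pr[x'_1 \mid x_1 \in S] \propto \int_S 1/r^d \propto$ solid angle subtended at x’1

Z subtends a solid angle equal to at least half a sphere at x’1

[Figure: x’1 and the real points x2,…,x6, the region Z, and a region S containing a point p at distance r from x’1]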

Page 16: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


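Contribution from $Q \cap Z$

The ellipsoid is roughly as far from x’1 as its longest radius

Contribution from ellipsoid is $\approx 2^{-d} \times$ total solid angle

Therefore, $\Pr[x_1 \in Q \cap Z] \,/\, \Pr[x_1 \in Z] \approx 2^{-d}$

[Figure: x’1, real points x2,…,x6, the point q, and the regions Z, Q and Q∩Z, with the ellipsoid’s radius and its distance from x’1 both labeled r]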

Page 17: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


The general case… n sanitized points

Initial intuition is wrong:
  – Privacy of x1 given x1’ and all the other points in the clear does not imply privacy of x1 given x1’ and sanitizations of others!
  – Sanitization is non-oblivious – other sanitized points reveal information about x, if x is their nearest neighbor

Where we are now
  – Consider some example of safe sanitization (not necessarily using perturbations)
      – Density regions? Histograms?
  – Relate perturbations to the safe sanitization
  – Uniform distribution; histogram over fixed-size cells ⇒ exponentially low probability of isolation
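
As an illustration of the histogram idea, the sketch below releases only per-cell counts over a grid of fixed-size cells. The cell width, the axis-aligned grid anchored at the origin, and the name `histogram_sanitize` are assumptions made for this example; the slide does not specify how cells are chosen.

```python
import math
from collections import Counter

def histogram_sanitize(rdb, cell_width):
    """Map each point to its grid cell and publish only the count per cell.

    Cells are axis-aligned cubes of side cell_width anchored at the origin
    (an assumption of this sketch). Exact coordinates are never released.
    """
    cells = (tuple(math.floor(v / cell_width) for v in x) for x in rdb)
    return dict(Counter(cells))

# Hypothetical 2-d database.
rdb = [(0.1, 0.2), (0.3, 0.4), (0.9, 0.1), (1.5, 1.5)]
sdb = histogram_sanitize(rdb, cell_width=0.5)
```

Unlike the perturbation-based sanitizer, each point's contribution here does not depend on the other points, so the released cell counts avoid the non-oblivious leakage described above.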

Page 18: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Future directions

Extend the privacy argument to other “nice” distributions
  – For what distributions is there no meaningful privacy-utility trade-off?

Characterize acceptable auxiliary information
  – Think of auxiliary information as an a priori distribution

The low-dimensional case – Is it inherently impossible?

Discrete-valued attributes
  – Our proofs require a “spread” in all attributes

Extend the utility argument to other interesting macroscopic properties – e.g. correlations

Page 19: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Conclusions

Our work so far:
  – A first step towards understanding the privacy-utility trade-off
  – A general and rigorous definition of privacy
  – A work in progress!

How does this compare to other frameworks, e.g. query-based approaches?
  – Query-based approaches: directly identify good and bad functions
  – Our approach: summarize “good” functions by a “sanitized database”

Page 20: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Questions?