
Page 1:

Raef Bassily
Computer Science & Engineering, Pennsylvania State University

New Tools for Privacy-Preserving Statistical Analysis

IBM Research Almaden, February 23, 2015

Page 2:

Privacy in Statistical Databases

[Diagram: users x1, x2, …, xn contribute their data to a curator A; government, researchers, and businesses (or a malicious adversary) send queries and receive answers.]

• Two conflicting goals: Utility vs. Privacy

• Balancing these goals is tricky:
   • No control over external sources of information (the internet, social networks, anonymized datasets).
   • Ad-hoc anonymization schemes are unreliable: [Narayanan-Shmatikov'08], [Korolova'11], [Calandrino et al.'12], …

Need algorithms with robust, provable privacy guarantees.

Page 3:

This work

Gives efficient algorithms for statistical data analyses with optimal accuracy under rigorous, provable privacy guarantees.

Page 4:

Differential privacy [DMNS’06, DKMMN’06]

[Diagram: two neighboring datasets x = (x1, x2, …, xn) and x' = (x1, x2', …, xn), each processed by algorithm A using its own local random coins.]

Datasets x and x' are called neighbors if they differ in one record.

Require: neighboring datasets induce close distributions on outputs.

Def.: A randomized algorithm A is (ε, δ)-differentially private if, for all neighbor datasets x and x' and for all events S,
Pr[A(x) ∈ S] ≤ e^ε · Pr[A(x') ∈ S] + δ.

“Almost same” conclusions will be reached from the output regardless of whether any individual opts into or opts out of the data set.

Think of ε as a small constant (e.g., 0.1) and δ as negligible (e.g., δ ≪ 1/n).

Worst-case definition: DP gives the same guarantee regardless of the attacker's side information.

Two regimes:
• Pure ε-differential privacy (δ = 0).
• Approximate (ε, δ)-differential privacy (δ > 0).

(A sketch of a basic ε-DP mechanism follows.)
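To make the definition concrete, here is a minimal sketch (an illustration of mine, not from the talk) of the classic Laplace mechanism, which satisfies pure ε-DP for a counting query of sensitivity 1:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Release a counting query under pure epsilon-DP (Laplace mechanism).

    Changing a single record changes the count by at most 1 (sensitivity 1),
    so adding Laplace noise of scale 1/epsilon yields epsilon-DP.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: a noisy answer to "how many records exceed 5?"
print(laplace_count([3, 7, 1, 9, 4], lambda x: x > 5, epsilon=0.5))
```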

Page 5:

Two models for private data analysis

Centralized model:
[Diagram: individuals send x1, x2, …, xn to a trusted curator running algorithm A; A is differentially private w.r.t. datasets of size n.]

Local model:
[Diagram: individuals hold x1, x2, …, xn; each user i applies a local randomizer Qi to xi and sends only the report yi to an untrusted curator B. Each Qi is differentially private w.r.t. datasets of size 1. A sketch of a basic local randomizer follows.]
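As a minimal illustration of the local model (my sketch; the talk's local-model protocols come later), classical randomized response: each user's randomizer is ε-DP with respect to its single bit, yet the server can still debias the aggregate.

```python
import math
import random

def rr_report(bit, epsilon):
    """Local randomizer Qi: report the true bit with probability
    e^eps / (1 + e^eps), otherwise flip it. This is eps-DP with
    respect to the single record held by the user."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if random.random() < p else 1 - bit

def estimate_fraction(reports, epsilon):
    """Debias the aggregate: E[report] = (1 - p) + fraction * (2p - 1)."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    avg = sum(reports) / len(reports)
    return (avg - (1 - p)) / (2 * p - 1)

bits = [1, 0, 1, 1, 0] * 200
reports = [rr_report(b, epsilon=1.0) for b in bits]
print(estimate_fraction(reports, epsilon=1.0))   # close to 0.6
```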

Page 6:

This talk

1. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
2. Generic framework for relaxing Differential Privacy

Page 7:

This talk

1. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
2. Generic framework for relaxing Differential Privacy

Page 8:

Example of Convex ERM: Support Vector Machines

• Goal: classify data points of different "types" by finding a hyperplane separating two different "types" of data points.

[Figure: points labeled "Tested +ve" and "Tested -ve" separated by a hyperplane.]

• Many applications. Medical studies: disease classification based on protein structures.

• The coefficients of the hyperplane are the solution of a convex optimization problem defined by the dataset.

• The solution is given by a linear combination of only a few data points, called support vectors.

Page 9:

Convex empirical risk minimization

[Figure: the convex constraint set C.]

• Dataset D = (d1, …, dn).
• Convex constraint set C ⊆ ℝᵖ.
• Loss function L(θ; D) = Σᵢ ℓ(θ; dᵢ), where ℓ(·; d) is convex for all d.

Page 10:

Convex empirical risk minimization

[Figure: the set C with the actual minimizer marked.]

• Dataset D = (d1, …, dn).
• Convex constraint set C ⊆ ℝᵖ.
• Loss function L(θ; D) = Σᵢ ℓ(θ; dᵢ), where ℓ(·; d) is convex for all d.
• Goal: Find a "parameter" θ ∈ C that minimizes L(θ; D).

Page 11:

Convex empirical risk minimization

[Figure: the set C with the actual minimizer θ* and the output θ̂ marked; the gap between them in loss is the excess risk.]

• Dataset D = (d1, …, dn).
• Convex constraint set C ⊆ ℝᵖ.
• Loss function L(θ; D) = Σᵢ ℓ(θ; dᵢ), where ℓ(·; d) is convex for all d.
• Goal: Find a "parameter" θ ∈ C that minimizes L(θ; D).
• Output θ̂ ∈ C such that the excess risk L(θ̂; D) − min_{θ∈C} L(θ; D) is small.

Page 12:

Other examples

• Median: ℓ(θ; d) = |θ − d|.
• Linear regression: ℓ(θ; (x, y)) = (⟨θ, x⟩ − y)².

(Non-private baselines for both are sketched below.)
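Both examples have simple non-private ERM solutions, shown here on toy data for reference; the difficulty addressed in this section is achieving comparable accuracy privately.

```python
import numpy as np

# Median: minimizes L(theta; D) = sum_i |theta - d_i| over theta in R.
data = np.array([1.0, 2.0, 3.5, 7.0, 9.0])
theta_median = np.median(data)

# Linear regression: minimizes L(theta; D) = sum_i (<theta, x_i> - y_i)^2.
X = np.random.randn(100, 3)
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * np.random.randn(100)
theta_reg, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_median, theta_reg)
```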

Page 13:

Why is privacy hard to maintain in ERM?

• Dual form of SVM: the solution typically contains a subset of the exact data points in the clear.

• Median: the minimizer is always a data point.

Page 14:

Private convex ERM [Chaudhuri-Monteleoni'08, Chaudhuri-Monteleoni-Sarwate'11]

• Studied by [Chaudhuri et al.'11, Rubinstein et al.'11, Kifer-Smith-Thakurta'12, Smith-Thakurta'13, …]
• Privacy: A is differentially private in its input dataset.
• Utility is measured by the (worst-case) expected excess risk E[L(θ̂; D)] − min_{θ∈C} L(θ; D), where the expectation is over A's random coins.

[Diagram: dataset D, convex set C, and loss ℓ are given to a differentially private algorithm A, which uses random coins and outputs θ̂.]

Page 15:

Contributions [B, Smith, Thakurta '14]

• Best previous work [Chaudhuri et al.'11, Kifer et al.'12] addresses a special case (smooth loss functions); applying it to many problems (e.g., SVM, median, …) introduces large additional error.
• This work improves on previous excess risk bounds.

1. New algorithms with optimal excess risk, assuming only that:
   • the loss function is Lipschitz, and
   • the parameter set C is bounded.
   (Separate set of algorithms for strongly convex loss.)
2. Matching lower bounds.

Page 16:

Results (dataset size = n, C ⊆ ℝᵖ)

Normalized bounds: the loss is 1-Lipschitz on a parameter set C of diameter 1.

Privacy    | Excess risk              | Technique
ε-DP       | Õ(p/(εn))                | Exponential sampling (inspired by [McSherry-Talwar'07])
(ε, δ)-DP  | Õ(√(p log(1/δ))/(εn))    | Noisy stochastic gradient descent (rigorous analysis of & improvements to [McSherry-Williams'10], [Jain-Kothari-Thakurta'12], and [Chaudhuri-Sarwate-Song'13])

Page 17:

Results (dataset size = n, C ⊆ ℝᵖ)

Normalized bounds: the loss is 1-Lipschitz on a parameter set C of diameter 1.

Privacy    | Excess risk              | Technique
ε-DP       | Õ(p/(εn))                | Exponential sampling (inspired by [McSherry-Talwar'07])
(ε, δ)-DP  | Õ(√(p log(1/δ))/(εn))    | Noisy stochastic gradient descent (rigorous analysis of & improvements to [McSherry-Williams'10], [Jain-Kothari-Thakurta'12], and [Chaudhuri-Sarwate-Song'13])

Page 18:

Exponential sampling

• Define a probability distribution over C: p(θ) ∝ exp(−ε · L(θ; D) / (2Δ)), where Δ bounds the sensitivity of L to any single record.
• Output a sample from C according to p.

An instance of the exponential mechanism [McSherry-Talwar'07] (toy sketch below).

Efficient construction based on rapidly mixing MCMC:
• Uses [Applegate-Kannan'91] as a subroutine.
• Provides a purely multiplicative convergence guarantee.
• Does not follow directly from existing results.

Tight utility analysis via a "peeling" argument that exploits the structure of convex functions: the level sets A1, A2, … are decreasing in volume. This yields the optimal excess risk bound.
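As a toy illustration of the sampler (a finite-candidate sketch of mine; the talk's construction samples from the continuous distribution over C using the MCMC machinery above):

```python
import numpy as np

def exponential_mechanism(candidates, data, loss, epsilon, sensitivity):
    """Sample theta with probability proportional to
    exp(-epsilon * L(theta; data) / (2 * sensitivity)).

    `sensitivity` must bound the change in L(theta; .) when one record
    changes, uniformly over theta."""
    totals = np.array([sum(loss(t, d) for d in data) for t in candidates])
    logits = -epsilon * totals / (2.0 * sensitivity)
    probs = np.exp(logits - logits.max())   # stabilize, then normalize
    probs /= probs.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]

# Example: private median on [0, 1]. The per-record loss |t - d| lies in
# [0, 1], so the total loss has sensitivity 1.
grid = np.linspace(0.0, 1.0, 101)
theta = exponential_mechanism(grid, [0.1, 0.2, 0.25, 0.8],
                              lambda t, d: abs(t - d),
                              epsilon=1.0, sensitivity=1.0)
print(theta)
```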

Page 19:

Noisy stochastic gradient descent

• Run SGD with noisy gradient queries for sufficiently many iterations (schematic sketch below).

• Our contributions:
   • Tight privacy analysis: stochastic privacy amplification.
   • Running SGD for sufficiently many iterations (T = n² iterations) yields optimal excess risk.

Remarks:
• The stochastic part is only for efficiency.
• Empirically, [CSS'13] showed that a few iterations are enough in some cases.
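A schematic sketch of the algorithm's skeleton (my illustration; the paper's version calibrates the noise scale, step sizes, and privacy accounting precisely, which this sketch only imitates in shape):

```python
import numpy as np

def noisy_sgd(data, grad, project, epsilon, delta, dim, lipschitz=1.0, eta=0.05):
    """Noisy projected SGD, schematic version.

    grad(theta, d): per-record gradient, assumed bounded by `lipschitz`.
    project(theta): Euclidean projection onto the convex set C.
    The Gaussian noise scale follows the shape of the analysis (grows
    with sqrt(T log(1/delta)), shrinks with n * epsilon); the exact
    constants come from the paper's amplification argument.
    """
    n = len(data)
    T = n * n   # T = n^2 iterations, as in the talk
    sigma = lipschitz * np.sqrt(T * np.log(1.0 / delta)) / (n * epsilon)
    theta = project(np.zeros(dim))
    for _ in range(T):
        d = data[np.random.randint(n)]                     # stochastic sample
        noisy_grad = grad(theta, d) + np.random.normal(0.0, sigma, size=dim)
        theta = project(theta - eta * noisy_grad)          # noisy step + projection
    return theta

# Toy usage: private mean of points in [0, 1]^2 (squared loss, box projection).
pts = np.random.rand(40, 2)
theta_hat = noisy_sgd(pts, lambda th, d: th - d,
                      lambda th: np.clip(th, 0.0, 1.0),
                      epsilon=1.0, delta=1e-6, dim=2)
print(theta_hat)
```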

Page 20:

Generalization error

For a distribution P over records, the generalization error at θ is the expected loss on a fresh sample, E_{d∼P}[ℓ(θ; d)].

For any distribution P, the empirical guarantees above translate into generalization-error guarantees for the output of a DP algorithm:
• an ε-DP algorithm whose expected generalization error matches its excess-risk bound up to sampling error;
• an (ε, δ)-DP algorithm with the correspondingly smaller bound;
• for generalized linear models, we get the optimal rate.

Page 21:

This talk

1. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
2. Generic framework for relaxing Differential Privacy

Page 22:

A conundrum

[Diagram: users holding items such as Finance.com, Fashion.com, and WeirdStuff.com report to a server, which wants to answer: how many users like Business.com?]

How can the server compute aggregate statistics about users without storing user-specific information?

Page 23:

Succinct histograms

[Diagram: users 1, 2, …, n, each holding an item (e.g., Finance.com, Fashion.com, WeirdStuff.com), report to an untrusted server.]

• Set of items (e.g., websites) = [d] = {1, …, d}. Set of users = [n].
• The frequency of an item a is f(a) = (# of users holding a) / n.

Goal: produce a succinct histogram: a list of frequent items ("heavy hitters") and estimates of their frequencies, while providing rigorous privacy guarantees to the users.

[Figure: a frequency histogram over items 1, 2, 3, …, d−2, d−1, d; only the heavy hitters are retained.]

Succinct histogram = a list of pairs (heavy hitter, frequency estimate); the frequency of every unlisted item is implicitly estimated as 0. (A non-private version is sketched below.)
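For concreteness, the (non-private) object being computed looks like this, for a hypothetical threshold eta; the protocols that follow approximate it under local differential privacy:

```python
from collections import Counter

def succinct_histogram(items, eta):
    """List every item with frequency >= eta together with its frequency;
    the frequency of every unlisted item is implicitly 0."""
    n = len(items)
    return {a: c / n for a, c in Counter(items).items() if c / n >= eta}

print(succinct_histogram(["a", "b", "a", "a", "c"], eta=0.4))   # {'a': 0.6}
```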

Page 24:

Local model of Differential Privacy

[Diagram: each user i applies a local randomizer Qi to its item vi and sends the report zi; the server aggregates z1, z2, …, zn into a succinct histogram.]

Def.: An algorithm Q is ε-local differentially private (LDP) if for any pair v, v' ∈ [d] and for all events S,
Pr[Q(v) ∈ S] ≤ e^ε · Pr[Q(v') ∈ S].

• vi is the item of user i; zi is the differentially private report of user i. (A baseline randomizer illustrating the definition is sketched below.)

LDP protocols for frequency estimation are used:
• in the Chrome web browser (RAPPOR) [Erlingsson-Korolova-Pihur'14]
• as a basis for other estimation tasks [Dwork-Nissim'04]
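A natural baseline here (my sketch, not the talk's protocol) is generalized randomized response: it satisfies ε-LDP and gives unbiased frequency estimates, but each report costs about log(d) bits and the error grows with d, both of which the protocol in this talk avoids.

```python
import numpy as np

def grr_report(v, d, epsilon, rng=np.random):
    """eps-LDP randomizer over [d]: keep the true item with probability
    e^eps / (e^eps + d - 1), else report a uniform random other item."""
    keep = np.exp(epsilon) / (np.exp(epsilon) + d - 1)
    if rng.random() < keep:
        return v
    other = rng.randint(d - 1)
    return other if other < v else other + 1   # uniform over [d] minus {v}

def grr_frequencies(reports, d, epsilon):
    """Unbiased estimates from E[count(a)/n] = q + f(a) * (p - q)."""
    n = len(reports)
    p = np.exp(epsilon) / (np.exp(epsilon) + d - 1)
    q = 1.0 / (np.exp(epsilon) + d - 1)
    counts = np.bincount(reports, minlength=d)
    return (counts / n - q) / (p - q)

reports = [grr_report(v, d=10, epsilon=1.0) for v in [3] * 600 + [7] * 400]
print(grr_frequencies(np.array(reports), d=10, epsilon=1.0)[[3, 7]])
```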

Page 25:

Performance measures

[Diagram: as before, users apply local randomizers Q1, …, Qn to items v1, …, vn, and the server builds a succinct histogram from the reports z1, …, zn.]

• Error is measured by the worst-case estimation error, max over items a of |f̂(a) − f(a)|.
• A protocol is efficient if it runs in time poly(log(d), n).
• Communication complexity is measured by the number of bits transmitted per user.

Note: d is very large (e.g., the number of all possible URLs), and log(d) is the number of bits needed to describe a single URL.

Page 26:

Contributions [B, Smith '15]

1. Efficient ε-LDP protocol with optimal error:
   • runs in time poly(log(d), n);
   • estimates all frequencies up to error O(√(log d) / (ε√n)).
2. Matching lower bound on the error.
3. Generic transformation reducing the communication complexity to 1 bit per user.

• Previous protocols either ran in time poly(d) [Mishra-Sandler'06, Hsu-Khanna-Roth'12, EKP'14] (too slow) or had larger error [HKR'12] (too much error).
• The best previous lower bound on the error was Ω(1/(ε√n)).

Page 27:

Design paradigm

• Reduction from a simpler problem with a unique heavy hitter (the UHH problem).
• UHH: at least an η fraction of users hold the same item v*, while the rest hold ⊥ (i.e., "no item").
• An efficient protocol with optimal error for UHH yields an efficient protocol with optimal error for the general problem.

Page 28:

Construction for the UHH problem

• Each user has either v* or ⊥; v* is unknown to the server.
• Goal: find v* and estimate f(v*).

[Diagram: each user holding v* encodes it with an error-correcting code, applies a noising operator, and sends a noisy report zi; the server sums the reports, rounds, and decodes to recover v*. Similar to [Duchi et al.'13].]

Key idea: the fraction of users holding v* governs the signal-to-noise ratio; decoding succeeds when f(v*) exceeds the optimal error threshold.

Page 29:

Construction for the general setting

Key insight:
• Decompose the general scenario into multiple instances of UHH via hashing (sketch below).
• Run parallel copies of the UHH protocol on these instances.

[Diagram: each user i hashes its item vi into one of K channels, 1, 2, …, K, and participates in the UHH protocol for that channel.]

• This guarantees that w.h.p., every heavy hitter (any item whose frequency is above the threshold) is allocated a "collision-free" copy of the UHH protocol.
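A sketch of the decomposition step (illustrative; the channel count K and the per-channel UHH protocol are stand-ins for the talk's construction):

```python
import random
from collections import defaultdict

def split_into_uhh_instances(items, num_channels, seed=0):
    """Hash each user's item into one of K channels.

    Within channel k, a user whose item hashes elsewhere behaves as
    "no item", so each channel is an instance of the UHH problem. If K
    is large relative to the number of heavy hitters, then w.h.p. each
    heavy hitter lands in its own channel (a birthday-bound argument).
    """
    salt = random.Random(seed).getrandbits(32)
    channels = defaultdict(list)
    for user, v in enumerate(items):
        channels[hash((salt, v)) % num_channels].append((user, v))
    return channels   # run one copy of the UHH protocol per channel

instances = split_into_uhh_instances(["a", "b", "a", "a", "c"], num_channels=8)
```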

Page 30:

Recap: Construction of succinct histograms

[Diagram: many parallel copies of the efficient private protocol for a unique heavy hitter (UHH) are combined into an efficient private protocol for estimating all heavy hitters.]

• Time poly(log(d), n).
• All frequencies estimated up to the optimal error.

Page 31:

Transforming to a protocol with 1-bit reports

• Generate a public random string si for each user.
• User i sends a single biased bit Bi.
• Conditioned on Bi = 1, the public string si has the same distribution as the output of the local randomizer Qi on vi.

[Diagram: Gen(Qi, vi, si) maps the user's item vi and public string si to the bit Bi. If Bi = 1, then the report of user i is taken to be si; otherwise user i is ignored.]

Key idea: what matters is only the distribution of the output of each local randomizer (sketch below).

• The public string does not depend on private data, so it can be generated by the untrusted server.
• This transformation works for any local protocol, not only heavy hitters.
• For our heavy-hitters protocol, it gives essentially the same error and computational efficiency (Gen can be computed in time O(log(d))).
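A minimal sketch of the idea, instantiated (as an assumption of mine, not the talk's choice) with binary randomized response as the local randomizer Q; the general transformation and its exact privacy accounting are in the paper:

```python
import math
import random

def q_prob(s, v, epsilon):
    """Q(s | v) for binary randomized response over {0, 1}."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    return p if s == v else 1 - p

def one_bit_report(v, epsilon, rng=random):
    """Simulate the randomizer Q with a public string plus one sent bit.

    The public string s is drawn from Q(. | 0), a fixed reference input,
    so it carries no information about v and could be drawn by the
    untrusted server. The user sends B with
    Pr[B = 1] = Q(s|v) / (e^eps * Q(s|0)), which is <= 1 by eps-LDP.
    Conditioned on B = 1, s is distributed exactly as Q(. | v).
    """
    s = 0 if rng.random() < q_prob(0, 0, epsilon) else 1   # public string
    accept = q_prob(s, v, epsilon) / (math.exp(epsilon) * q_prob(s, 0, epsilon))
    return s, int(rng.random() < accept)   # server keeps s iff the bit is 1

print(one_bit_report(1, epsilon=1.0))
```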

Page 32:

This talk

1. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
2. Generic framework for relaxing Differential Privacy

Page 33:

Attacker's side information

[Diagram: users x1, …, xi, …, xn send data to curator A, which answers queries from an attacker who also draws on side information: the internet, social networks, anonymized datasets, ….]

The attacker's side information is the main reason privacy is hard.

Page 34:

Attacker's side information

[Diagram: an omniscient attacker queries the curator A while already knowing everything except xi.]

• Differential privacy is robust against arbitrary side information.
• Attackers typically have limited knowledge.

Contributions [B, Groce, Katz, Smith '13]:
• A rigorous framework for formalizing and exploiting limited adversarial information: coupled-worlds privacy.
• Algorithms with higher accuracy than is possible under differential privacy.

Page 35:

Exploiting the attacker's uncertainty [BGKS'13]

[Diagram: as before, but the attacker's side information is restricted to a class Δ.]

Given some restricted class Δ of attacker's knowledge: for any side information in Δ, the output of A must "look the same" to the attacker regardless of whether any single individual is in or out of the computation.

Page 36:

Distributional Differential Privacy [BGKS'13]

[Diagram: two datasets, one containing xi and one without it, each processed by A with local random coins.]

Def. (informal): A is (ε, δ, Δ)-DDP if, for any distribution D ∈ Δ on the dataset, for any index i, for any value v of a data entry, and for any event S,
Pr[A(x) ∈ S | xi = v] ≤ e^ε · Pr[A(x−i) ∈ S | xi = v] + δ,
and symmetrically, where x−i denotes the dataset with entry i removed.

This implies that for any distribution in Δ, almost the same inferences will be made about Alice whether or not Alice's data is present in the dataset.

Page 37:

What can we release exactly and privately?

Under modest distributional assumptions, we can release several exact statistics while satisfying DDP:
• Sums, whenever the data distribution has a small uniform component.
• Histograms constructed from a random sample from the population.
• Stable functions: functions with a small probability that the output changes when any single entry of the dataset changes.

Page 38:

Conclusions

• Privacy is a pressing concern in "Big Data", but hard to define intuitively.

• Differential privacy is a sound, rigorous approach: robust against arbitrary side information.

• This work:
   • the first efficient differentially private algorithms with optimal accuracy guarantees for essential tasks in statistical data analysis;
   • a generic definitional framework for privacy, relaxing DP.