TRANSCRIPT
Raef Bassily, Computer Science & Engineering, Pennsylvania State University
New Tools for Privacy-Preserving Statistical Analysis
IBM Research Almaden
February 23, 2015
Privacy in Statistical Databases

[Diagram: individuals x1, x2, …, xn send their data to a curator; users (government, researchers, businesses, or a malicious adversary) send queries to the curator and receive answers.]
• Two conflicting goals: Utility vs. Privacy
• Balancing these goals is tricky:
  - No control over external sources of information (the internet, social networks, anonymized datasets).
  - Ad-hoc anonymization schemes are unreliable: [Narayanan-Shmatikov’08], [Korolova’11], [Calandrino et al.’12], …
• We need algorithms with robust, provable privacy guarantees.
This work
Gives efficient algorithms for statistical data analyses with optimal accuracy under rigorous, provable privacy guarantees.
Differential privacy [DMNS’06, DKMMN’06]
[Diagram: algorithm A, with its local random coins, run on dataset x = (x1, x2, …, xn) and on a neighbor x’ = (x1, x2’, …, xn).]

Datasets x and x’ are called neighbors if they differ in one record.

Require: neighbor datasets induce close distributions on outputs.

Def.: A randomized algorithm A is (ε, δ)-differentially private if, for all neighbor datasets x and x’, and for all events S,
    Pr[A(x) ∈ S] ≤ e^ε · Pr[A(x’) ∈ S] + δ.

"Almost the same" conclusions will be reached from the output regardless of whether any individual opts into or out of the data set. Think of ε as a small constant and δ as cryptographically small (much smaller than 1/n).

Worst-case definition: DP gives the same guarantee regardless of the attacker’s side information.

Two regimes:
• ε-differential privacy (δ = 0)
• (ε, δ)-differential privacy (δ > 0)
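To make the definition concrete, here is a minimal sketch (not from the talk) of the textbook Laplace mechanism: a counting query changes by at most 1 between neighbor datasets, so adding Laplace(1/ε) noise to the count satisfies ε-differential privacy. Function names are illustrative.

```python
import random

def laplace_noise(scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon):
    # A count has sensitivity 1 between neighbor datasets,
    # so Laplace(1/epsilon) noise yields epsilon-DP.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Averaging many independent runs recovers the true count, while any single release reveals little about any one record.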
Two models for private data analysis
Centralized model:
[Diagram: individuals send x1, x2, …, xn to a trusted curator, who runs algorithm A.]
A is differentially private w.r.t. datasets of size n.

Local model:
[Diagram: each individual i applies a local randomizer Qi to xi and sends the report yi to an untrusted curator, who runs algorithm B on (y1, y2, …, yn).]
Each Qi is differentially private w.r.t. datasets of size 1.
This talk
1. Differentially private algorithms for:
   • Convex empirical risk minimization in the centralized model
   • Estimating succinct histograms in the local model
2. Generic framework for relaxing differential privacy
Example of Convex ERM: Support Vector Machines
• Goal: classify data points of different "types" by finding a hyperplane separating two different "types" of data points.
• Many applications. Medical studies: disease classification based on protein structures (tested positive vs. tested negative).
• The coefficients of the hyperplane are the solution of a convex optimization problem defined by the data set.
• The hyperplane is given by a linear combination of only a few data points, called support vectors.
Convex empirical risk minimization

• Dataset D = (d1, …, dn).
• Convex constraint set C.
• Loss function L(θ; D) = Σi ℓ(θ; di), where ℓ(·; d) is convex for all d.
• Goal: find a "parameter" θ ∈ C that minimizes L(θ; D).
• Output θ̂ ∈ C such that the excess risk
      L(θ̂; D) − min over θ ∈ C of L(θ; D)
  is small.
Other examples
• Median
• Linear regression
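To fix ideas, here is a minimal non-private baseline for this ERM setup: minimize the average hinge loss (the SVM objective) over the unit ball C by projected stochastic subgradient descent. The step-size schedule and all names are illustrative assumptions, not the talk's private algorithm.

```python
import math
import random

def project_unit_ball(theta):
    # Project theta onto the unit Euclidean ball C.
    norm = math.sqrt(sum(t * t for t in theta))
    return [t / norm for t in theta] if norm > 1 else theta

def hinge_subgrad(theta, x, y):
    # Subgradient of the hinge loss max(0, 1 - y * <theta, x>).
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    if margin >= 1:
        return [0.0] * len(theta)
    return [-y * xi for xi in x]

def erm_pgd(data, dim, steps=300, lr=0.1):
    # Projected stochastic subgradient descent for hinge-loss ERM over C.
    theta = [0.0] * dim
    for t in range(1, steps + 1):
        x, y = random.choice(data)
        g = hinge_subgrad(theta, x, y)
        theta = [ti - (lr / math.sqrt(t)) * gi for ti, gi in zip(theta, g)]
        theta = project_unit_ball(theta)
    return theta
```

The dual-form observation above applies here: the converged θ is a combination of a few support vectors, which is exactly why releasing it naively leaks data points.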
Why is privacy hard to maintain in ERM?

• Dual form of SVM: the solution typically contains a subset of the exact data points in the clear.
• Median: the minimizer is always a data point.
Private convex ERM [Chaudhuri-Monteleoni ’08, Chaudhuri-Sarwate ’11]

• Studied by [Chaudhuri et al. ’11, Rubinstein et al. ’11, Kifer-Smith-Thakurta ’12, Smith-Thakurta ’13, …]
• Privacy: A is differentially private in its input dataset (for a fixed convex set C and loss L).
• Utility is measured by the (worst-case) expected excess risk, over A’s random coins.
• Best previous work [Chaudhuri et al. ’11, Kifer et al. ’12] addresses a special case (smooth loss functions); applying it to many problems (e.g., SVM, median, …) introduces large additional error.
Contributions [B, Smith, Thakurta ’14]

• This work improves on previous excess risk bounds.
1. New algorithms with optimal excess risk, assuming:
   • The loss function is Lipschitz.
   • The parameter set C is bounded.
   (A separate set of algorithms handles strongly convex losses.)
2. Matching lower bounds.
Results (dataset size n; normalized bounds: the loss is 1-Lipschitz on a parameter set C of diameter 1):

Privacy    | Excess risk                    | Technique
ε-DP       | optimal (matching lower bound) | Exponential sampling (inspired by [McSherry-Talwar ’07])
(ε, δ)-DP  | optimal (matching lower bound) | Noisy stochastic gradient descent (rigorous analysis of, and improvements to, [McSherry-Williams ’10], [Jain-Kothari-Thakurta ’12], and [Chaudhuri-Sarwate-Song ’13])
Exponential sampling
• Define a probability distribution over C that weights each θ proportionally to exp(−ε L(θ; D)/(2Δ)), where Δ is the sensitivity of the loss. This is an instance of the exponential mechanism [McSherry-Talwar ’07].
• Output a sample from C according to this distribution.
• Efficient construction based on rapidly mixing MCMC:
  - Uses [Applegate-Kannan ’91] as a subroutine.
  - Provides a purely multiplicative convergence guarantee.
  - Does not follow directly from existing results.
• Tight utility analysis via a "peeling" argument that exploits the structure of convex functions: the level sets A1, A2, … are decreasing in volume, which bounds the probability of outputting a point with large excess risk.
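Over a finite candidate set, the exponential mechanism underlying this algorithm can be sketched directly. (The talk's contribution is sampling efficiently from the continuous set C via MCMC; the finite grid here is a simplifying assumption for illustration.)

```python
import math
import random

def exponential_mechanism(candidates, loss, epsilon, sensitivity):
    # Sample a candidate with probability proportional to
    # exp(-epsilon * loss(c) / (2 * sensitivity)).
    losses = [loss(c) for c in candidates]
    m = min(losses)  # shift by the min for numerical stability; cancels in normalization
    weights = [math.exp(-epsilon * (l - m) / (2 * sensitivity)) for l in losses]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]  # guard against floating-point rounding
```

Low-loss candidates are exponentially more likely to be chosen, yet swapping one record shifts each probability by at most an e^ε factor.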
Noisy stochastic gradient descent

• Run SGD with noisy gradient queries for sufficiently many iterations.
• Our contributions:
  - Tight privacy analysis (stochastic privacy amplification).
  - Running SGD for many iterations (T = n² iterations) ⇒ optimal excess risk.
• Remarks:
  - The stochastic part is only for efficiency.
  - Empirically, [CSS ’13] showed that a few iterations are enough in some cases.
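A sketch of the noisy-SGD template: at each of T = n² steps, take a stochastic gradient at a random example, add Gaussian noise, step, and project back onto the constraint set. The noise scale `sigma` below is a heuristic placeholder, not the paper's exact privacy calibration, and all names are illustrative.

```python
import math
import random

def noisy_sgd(data, grad, dim, epsilon, delta, diameter=1.0):
    # Noisy SGD template over the Euclidean ball of the given diameter.
    n = len(data)
    T = n * n  # T = n^2 iterations, as in the talk
    # Heuristic per-coordinate noise scale (illustrative only).
    sigma = math.sqrt(T * math.log(1.0 / delta)) / (n * epsilon)
    theta = [0.0] * dim
    for t in range(1, T + 1):
        d = random.choice(data)          # stochastic gradient at one example
        g = grad(theta, d)
        noise = [random.gauss(0.0, sigma) for _ in range(dim)]
        eta = diameter / math.sqrt(t)    # decaying step size
        theta = [ti - eta * (gi + ni) for ti, gi, ni in zip(theta, g, noise)]
        norm = math.sqrt(sum(ti * ti for ti in theta))
        if norm > diameter:              # project back onto the ball
            theta = [ti * diameter / norm for ti in theta]
    return theta
```

The stochastic sampling of one example per step is what the "stochastic privacy amplification" analysis exploits: each record participates in only a fraction of the noisy updates.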
Generalization error

• For a distribution P over records, the generalization error at θ is the expected loss of θ on a fresh sample from P.
• For any distribution P, the excess-risk guarantees translate into generalization-error bounds for the output of the corresponding algorithms: there is an ε-DP algorithm, and an (ε, δ)-DP algorithm, achieving correspondingly small generalization error.
• For generalized linear models, the generalization error we get is optimal.
This talk

1. Differentially private algorithms for:
   • Convex empirical risk minimization in the centralized model
   • Estimating succinct histograms in the local model
2. Generic framework for relaxing differential privacy
A conundrum

[Diagram: n users, each holding an item such as Finance.com, Fashion.com, or WeirdStuff.com, report to an untrusted server, which wants to answer questions like "How many users like Business.com?"]

How can the server compute aggregate statistics about users without storing user-specific information?

• A set of items (e.g., websites): [d] = {1, …, d}.
• A set of users: [n].
• The frequency of an item a is f(a) = (♯ users holding a)/n.
Succinct histograms

[Diagram: a histogram of frequencies over items 1, 2, 3, …, d−2, d−1, d, with a few heavy hitters such as Finance.com, Fashion.com, and WeirdStuff.com.]

Goal: produce a succinct histogram, i.e., a list of frequent items ("heavy hitters") and estimates of their frequencies, while providing rigorous privacy guarantees to the users.

Succinct histogram = a short list of pairs (v, f̂(v)) for some k ≪ d items; every item not on the list implicitly gets the estimate 0.
Local model of Differential Privacy
Algorithm Q is ε-local differentially private (LDP) if, for any pair of items v, v’ ∈ [d] and all events S,
    Pr[Q(v) ∈ S] ≤ e^ε · Pr[Q(v’) ∈ S].

[Diagram: each user i holds item vi, applies local randomizer Qi, and sends the differentially private report zi; the server aggregates z1, …, zn into a succinct histogram.]

LDP protocols for frequency estimation are used:
• in the Chrome web browser (RAPPOR) [Erlingsson-Korolova-Pihur ’14]
• as a basis for other estimation tasks [Dwork-Nissim ’04]
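The simplest LDP frequency-estimation primitive is generalized randomized response, a classic building block rather than this talk's protocol: each user reports their true item with probability e^ε/(e^ε + d − 1) and a uniformly random other item otherwise, and the server debiases the empirical report frequencies. Names below are illustrative.

```python
import math
import random

def rr_report(item, d, epsilon):
    # Generalized randomized response over [d] = {0, ..., d-1}:
    # report the true item w.p. e^eps / (e^eps + d - 1), otherwise a
    # uniformly random *other* item.  Each report is eps-LDP.
    p_true = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    if random.random() < p_true:
        return item
    other = random.randrange(d - 1)
    return other if other < item else other + 1

def rr_estimate(reports, d, epsilon):
    # Debias the empirical report frequencies into frequency estimates.
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    q = 1.0 / (math.exp(epsilon) + d - 1)
    counts = [0] * d
    for z in reports:
        counts[z] += 1
    return [((c / n) - q) / (p - q) for c in counts]
```

Note that both reporting and estimation enumerate [d], which is infeasible when d is the set of all URLs; avoiding that is exactly what the succinct-histogram protocol is for.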
Performance measures

• Error is measured by the worst-case estimation error: the maximum over items a of |f̂(a) − f(a)|.
• A protocol is efficient if it runs in time poly(log(d), n).
• Communication complexity is measured by the number of bits transmitted per user.
• Note: d is very large (e.g., the number of all possible URLs); log(d) = ♯ of bits needed to describe a single URL.
Contributions [B, Smith ‘15]

1. An efficient ε-LDP protocol with optimal error:
   • Runs in time poly(log(d), n).
   • Estimates all frequencies up to the optimal error.
2. A matching lower bound on the error.
3. A generic transformation reducing the communication complexity to 1 bit per user.

• Previous protocols either ran in too much time [Mishra-Sandler ’06, Hsu-Khanna-Roth ’12, EKP ’14] or had larger error [HKR ’12].
• The best previous lower bound on the error was weaker.
Design paradigm

• Reduction from a simpler problem with a unique heavy hitter (the UHH problem).
• UHH: at least a certain fraction of the users have the same item, while the rest have ⊥ (i.e., "no item").
• An efficient protocol with optimal error for UHH ⇒ an efficient protocol with optimal error for the general problem.
Construction for the UHH problem

• Each user has either v* or ⊥; v* is unknown to the server.
• Goal: find v* and estimate f(v*).
• Pipeline: each user encodes v* with an error-correcting code, passes the codeword through a noising operator (similar to [Duchi et al. ’13]), and sends the noisy report; the server rounds the aggregate and runs the decoder.
• Key idea: what governs success is the signal-to-noise ratio; decoding succeeds when the fraction of users holding v* is large enough relative to the noise.
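A toy, non-private simulation of the encode/noise/decode pipeline: the real protocol uses an error-correcting code and an ε-LDP noising operator, but here the codewords are shared random ±1 vectors and the noise is plain Gaussian, purely to illustrate the signal-to-noise intuition. All names are hypothetical.

```python
import random

def make_codebook(d, m, seed=0):
    # Public random ±1 codewords of length m, one per item in [d].
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(m)] for _ in range(d)]

def uhh_simulation(codebook, v_star, n, frac, noise_std, rng):
    # Toy UHH pipeline: roughly a `frac` fraction of n users encode v_star
    # and add Gaussian noise; the rest contribute pure noise ("no item").
    # The server averages the reports and decodes by maximum correlation.
    m = len(codebook[0])
    avg = [0.0] * m
    for _ in range(n):
        signal = codebook[v_star] if rng.random() < frac else [0.0] * m
        for j in range(m):
            avg[j] += (signal[j] + rng.gauss(0.0, noise_std)) / n
    scores = [sum(c * a for c, a in zip(cw, avg)) for cw in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)
```

Averaging over n users shrinks the per-coordinate noise by a factor of √n, so decoding succeeds once the holding fraction is large relative to noise_std/√n, which is the signal-to-noise condition stated above.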
Construction for the general setting

Key insight:
• Decompose the general scenario into multiple instances of UHH via hashing: each user hashes their item into one of K buckets, and one copy of the UHH protocol is run per bucket.
• Run these parallel copies of the UHH protocol on the resulting instances.
• Hashing guarantees that, w.h.p., every heavy hitter (every item whose frequency is large enough) is allocated a "collision-free" copy of the UHH protocol.
Recap: construction of succinct histograms

Parallel copies of the efficient private protocol for a unique heavy hitter (UHH) combine into an efficient private protocol estimating all heavy hitters: it runs in time poly(log(d), n) and estimates all frequencies up to the optimal error.
Transforming to a protocol with 1-bit reports

• Generate a public random string si for each user.
• User i sends a single biased bit Bi = Gen(Qi, vi, si).
• Conditioned on Bi = 1, the public string si has the same distribution as the output of the local randomizer Qi. So the server takes si as user i’s report if Bi = 1, and ignores user i otherwise.
• This transformation works for any local protocol, not only heavy hitters. Key idea: all that matters is the distribution of the output of each local randomizer.
• The public string does not depend on private data, so it can be generated by the untrusted server.
• For our heavy-hitters protocol, this transformation gives essentially the same error and computational efficiency (Gen can be computed in time O(log(d))).
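The 1-bit reduction can be sketched as rejection sampling against a public reference distribution: conditioned on the user's bit being 1, the public string is distributed exactly as the local randomizer's output. The function below plays the role of Gen for a small finite-output randomizer; the discrete setup and all names are illustrative assumptions.

```python
import math
import random

def one_bit_report(v, q_dist, ref_dist, c, rng):
    # 1-bit reduction: the server publishes s ~ ref_dist; the user sends
    # B = 1 with probability q_dist(v)[s] / (c * ref_dist[s]).
    # Conditioned on B = 1, s is distributed exactly as Q(v).
    r, acc, s = rng.random(), 0.0, 0
    for s, p in enumerate(ref_dist):  # sample the public string s
        acc += p
        if r <= acc:
            break
    accept_p = q_dist(v)[s] / (c * ref_dist[s])
    b = 1 if rng.random() < accept_p else 0
    return s, b
```

Here c must upper-bound q_dist(v)[s]/ref_dist[s] over all v and s, so the acceptance probability never exceeds 1; the server keeps si as user i's report exactly when Bi = 1.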
This talk

1. Differentially private algorithms for:
   • Convex empirical risk minimization in the centralized model
   • Estimating succinct histograms in the local model
2. Generic framework for relaxing differential privacy
Attacker’s side information

[Diagram: a curator holding x1, …, xi, …, xn answers queries from an attacker, who also draws on the internet, social networks, and anonymized datasets.]

The attacker’s side information is the main reason privacy is hard.
[Diagram: the same setting, but with an omniscient attacker who knows everything except xi.]

Differential privacy is robust against arbitrary side information. But attackers typically have limited knowledge.

Contributions [B, Groce, Katz, Smith ’13]:
• A rigorous framework for formalizing and exploiting limited adversarial information: coupled-worlds privacy.
• Algorithms with higher accuracy than is possible under differential privacy.
Exploiting the attacker’s uncertainty [BGKS’13]

[Diagram: the curator answers queries from an attacker whose side information lies in a restricted class Δ.]

Given some restricted class Δ of attacker’s knowledge: for any side information in Δ, the output of A must "look the same" to the attacker regardless of whether any single individual is in or out of the computation.
Distributional Differential Privacy [BGKS’13]
[Diagram: A, with its local random coins, run on the dataset with entry xi present vs. with xi removed.]

A is (ε, δ)-DDP if, for any distribution on the data set from the class Δ, any index i, any value v of a data entry, and any event S, the output distributions of A with and without the i-th entry are (ε, δ)-close conditioned on xi = v.

This implies: for any distribution in Δ and any i, with probability at least 1 − δ, almost the same inferences will be made about Alice whether or not Alice’s data is present in the data set.
What can we release exactly and privately?

Under modest distributional assumptions, we can release several exact statistics while satisfying DDP:
• Sums, whenever the data distribution has a small uniform component.
• Histograms constructed from a random sample from the population.
• Stable functions: those with a small probability that the output changes when any single entry of the dataset changes.
Conclusions
• Privacy is a pressing concern in "Big Data", but hard to define intuitively.
• Differential privacy is a sound, rigorous approach: robust against arbitrary side information.
• This work:
  - The first efficient differentially private algorithms with optimal accuracy guarantees for essential tasks in statistical data analysis.
  - A generic definitional framework for privacy, relaxing DP.