
Page 1: Differential Privacy

Differential Privacy

方贤进

http://star.aust.edu.cn/~xjfang/ Email: [email protected]

September 9, 2015

Page 2: Differential Privacy

In cryptography, differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records.

(Source: Wikipedia)

Page 3: Differential Privacy

Outline

1. Context

2. The Mechanisms of Privacy Preserving

3. Applications under Differential Privacy

4. Our Current Research Work

Page 4: Differential Privacy

1. Context

Page 5: Differential Privacy

Context

Information technology and its applications are developing rapidly in this era:

Mobile Internet, cloud computing, Internet of Things (or Everything), electronic commerce, electronic payment, social network services (SNS), location-based services, ...

These technologies and applications produce Big Data.

Page 6: Differential Privacy

Context

[Diagram: big data feeds statistical analysis and data mining, which produce parametric or algorithmic models and released statistical results.]

One can perform data analysis and data mining to obtain valuable information and knowledge.

Page 7: Differential Privacy

Context

However, preventing the disclosure of individual privacy and sensitive information is a challenge when performing data release, data analysis, or data mining.

Page 8: Differential Privacy

A dataset:

Name    HIV indicator
Tom     0
Jack    1
Henry   1
Diego   0
Alice   1
......  ......

Query interface: f(n) = the count of the first n records whose HIV indicator is 1.

A simple example of a statistical database incurring disclosure of sensitive information: an attacker who knows Alice's position in the table (say, record 5) can compute f(5) - f(4) to learn her HIV status.

Page 9: Differential Privacy

An example of a statistical database

The original dataset:

Name   age  HIV+ indicator
Tom    52   0
Jack   43   1
Henry  21   1
Diego  41   0
Alice  54   1
......      ......

Histogram publication for the attribute age (range count query):

[Bar chart: HIV+ count per age bin 20-30, 30-40, 40-50, 50-60, 70-]

Note: Suppose the attacker knows Alice's age is 54, and is also aware of the condition of every other patient in bin 50-60. He can then infer whether Alice is HIV positive from the above histogram publication.

Page 10: Differential Privacy

Context

In summary, the following operations can all incur disclosure of individual identifiers or sensitive information:

Providing the full database for analytics or data mining;

Releasing the data after merely removing personal identifiers (such as name, ID);

Providing a query/statistics/analysis interface program.

Page 11: Differential Privacy

Context

Events of disclosure of sensitive information:

Patient-specific data of 135,000 state employees and their families was published, after anonymization, by the Group Insurance Commission (GIC) in Massachusetts.

Note: Latanya Sweeney broke this anonymized data and re-identified William Weld, the governor of Massachusetts at that time.

Page 12: Differential Privacy

Context

Events of disclosure of sensitive information:

① Privacy disclosure of AOL's search records: from the released search-record data, the identity of one search engine user was recovered.

Page 13: Differential Privacy

Context

Events of disclosure of sensitive information:

② Users' comment data for movies on Netflix;

③ The public Internet Movie Database (IMDB): by linking IMDB with the released Netflix data, users' information about browsing movies on the Netflix website was disclosed.

Page 14: Differential Privacy

2. Mechanisms of Privacy Preserving

Page 15: Differential Privacy

(1) k-anonymity
• Generalization of the values of sensitive attributes

Vulnerabilities:

Homogeneity attack

Background knowledge attack

Page 16: Differential Privacy

The extensions of k-anonymity:

• l-diversity • t-closeness • (α, k)-anonymity • m-invariance

Attacks on these mechanisms:

Composition attack

DeFinetti attack

Foreground knowledge attack

Page 17: Differential Privacy

(2) Differential Privacy (DP)
• In 2006, differential privacy was presented by Dwork to address privacy disclosure in statistical databases.
• It is a privacy-preserving model whose security does not depend on the adversary's background knowledge.
• It is founded on a rigorous mathematical basis, gives a precise definition of privacy preservation, and provides quantified evaluation of noise, error, and computational complexity.

Page 18: Differential Privacy

(2) Differential Privacy (DP)

• Let D be a sensitive dataset to be published or analyzed. DP requires that, prior to D's release, it be modified using a randomized algorithm G, such that the output of G does not reveal much information about any particular tuple in D.

• The formal definition of DP is as follows:

Page 19: Differential Privacy

(2) Differential Privacy (DP)

• ε-Differential Privacy: A randomized algorithm G satisfies ε-differential privacy if, for any two neighboring datasets D1 and D2 that differ in only one tuple, and for any possible output O of G, we have:

\Pr[G(D_1) = O] \le e^{\varepsilon} \cdot \Pr[G(D_2) = O]

where Pr[·] denotes the probability of an event.

Page 20: Differential Privacy

(2) Differential Privacy (DP)

• According to the above inequality, a deterministic algorithm cannot satisfy ε-differential privacy for any value of ε. Hence, a deterministic analysis algorithm must be randomly perturbed to satisfy the privacy requirement.

• Parameter ε bounds how much the probability of any fixed output may change when a tuple is deleted from the dataset. A lower ε indicates stronger privacy preservation.

• DP is commonly implemented via the Laplace mechanism and the exponential mechanism.

Page 21: Differential Privacy

2.1 Laplace Mechanism with Differential Privacy

Page 22: Differential Privacy

Laplace Mechanism
• The Laplace mechanism is limited to analysis tasks that return numeric results.
• The Laplace mechanism releases the result of a function F that takes a dataset as input and produces a set of numeric values as output.
• Given F, the Laplace mechanism transforms F into a differentially private algorithm G by adding i.i.d. (independent and identically distributed) noise (denoted η) to each output value of F. The noise η is sampled from a Laplace distribution Lap(λ) with pdf (probability density function):

\Pr[\eta = x] = \frac{1}{2\lambda} e^{-|x|/\lambda}

Page 23: Differential Privacy

How to compute noise from the Laplace distribution?

f(x) = \Pr[\eta = x] = \frac{1}{2\lambda} e^{-|x|/\lambda} =
\begin{cases} \frac{1}{2\lambda} e^{-x/\lambda}, & x \ge 0 \\[4pt] \frac{1}{2\lambda} e^{x/\lambda}, & x < 0 \end{cases}

Integrating the Laplace distribution gives its cumulative distribution function (CDF):

F(x) = \int_{-\infty}^{x} f(u)\,du =
\begin{cases} 1 - \frac{1}{2} e^{-x/\lambda}, & x \ge 0 \\[4pt] \frac{1}{2} e^{x/\lambda}, & x < 0 \end{cases}
= \frac{1}{2}\left(1 + \operatorname{sgn}(x)\left(1 - e^{-|x|/\lambda}\right)\right)

Page 24: Differential Privacy

How to compute noise from the Laplace distribution? (continued)

The inverse CDF is:

x = F^{-1}(p) = -\lambda \cdot \operatorname{sgn}(p - 0.5) \cdot \ln(1 - 2|p - 0.5|)

where p is a random number uniformly distributed between 0.0 and 1.0, and λ is called the scale parameter. In the privacy-preserving Laplace mechanism, λ is called the noise scale, and its value must satisfy λ ≥ ΔF/ε, where ΔF denotes the sensitivity of the function F and ε is called the privacy budget parameter.
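As an illustration, here is a minimal Python sketch of this inverse-CDF sampling procedure (the function names are ours, not from any particular DP library):

import math, random

def laplace_noise(scale):
    # Sample Lap(0, scale) noise by inverting the CDF:
    # x = F^{-1}(p) = -scale * sgn(p - 0.5) * ln(1 - 2|p - 0.5|)
    p = random.random()  # uniform in [0.0, 1.0)
    return -scale * math.copysign(1.0, p - 0.5) * math.log(1.0 - 2.0 * abs(p - 0.5))

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Noise scale lambda >= sensitivity/epsilon ensures epsilon-DP
    return true_value + laplace_noise(sensitivity / epsilon)

# Example: a count query (sensitivity 1) with privacy budget epsilon = ln 2
print(laplace_mechanism(16, 1.0, math.log(2)))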

Page 25: Differential Privacy

Laplace Mechanism
• Dwork et al. prove that the Laplace mechanism ensures ε-differential privacy if λ ≥ S(F)/ε, where S(F) is the sensitivity of the function F. Parameter ε is called the privacy budget. A lower ε indicates stronger privacy protection, but also noisier results.
• Let F be a function that maps a dataset to a fixed-size vector of real numbers. The sensitivity of F is defined as:

\Delta F = S(F) = \max_{D_1, D_2} \| F(D_1) - F(D_2) \|_1

where ||·||_1 denotes the L1 norm distance and D1, D2 are two arbitrary neighboring datasets. Intuitively, S(F) measures the maximum possible change in F's output when one arbitrary tuple in F's input is modified.

Page 26: Differential Privacy

An example of the Laplace mechanism for a count query

A dataset:

native place  HIV+ indicator
Anhui         0
Zhejiang      1
Anhui         1
Shanghai      0
Zhejiang      1
......        ......

F(n) = the count of records whose HIV+ indicator is positive, for each native place:

Native place  Count (F(n))
Anhui         16
Shanghai      8
Zhejiang      10

Obviously, the sensitivity S(F) is 1 for a count query. Let ε be 0.1. The count query is changed by adding Laplace noise Lap(1/ε), that is, F(n) + Lap(1/ε).

Page 27: Differential Privacy

An example of the Laplace mechanism for a count query

Obviously, the sensitivity S(F) is 1 for a count query. Let ε be ln(2). The count query is changed by adding Laplace noise Lap(λ) = Lap(1/ε), that is, F(n) + Lap(1/ε).

Noise scale λ = 1/ε = 1/ln 2 = 1.442695041

Native place  original count  Random p  noise     noisy count
Anhui         16              0.64657   4.158559  20.15855936
Shanghai      8               0.262234  -2.23545  5.764546932
Zhejiang      10              0.97904   0.454824  10.45482422

Sum of Squared Errors (SSE) = 22.49773144

Page 28: Differential Privacy

An example of the Laplace mechanism for a count query

Obviously, the sensitivity S(F) is 1 for a count query. Let ε be ln(3). The count query is changed by adding Laplace noise Lap(λ) = Lap(1/ε), that is, F(n) + Lap(1/ε).

Noise scale λ = 1/ε = 1/ln 3 = 0.910239227

Page 29: Differential Privacy

An example of the Laplace mechanism for a count query

native place  original count  Random p     noise scale (S(F)/ε = 1/ln 3)  noise     noisy count
Anhui         16              0.560526972  0.910239227                    0.11745   16.11744976
Shanghai      8               0.642245598  0.910239227                    0.304713  8.304712935
Zhejiang      10              0.477258612  0.910239227                    -0.04237  9.957628737

Sum of Squared Errors (SSE) = 0.108439743

[Bar chart: original count vs. noisy count for Anhui, Shanghai, and Zhejiang]

Comparison between histogram publications
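A short Python sketch that reproduces this kind of comparison with NumPy (the tables above used fixed random p values, so a rerun produces different noise; numpy.random.Generator.laplace is used here instead of the hand-rolled sampler):

import numpy as np

rng = np.random.default_rng()
counts = {"Anhui": 16, "Shanghai": 8, "Zhejiang": 10}

for eps in (np.log(2), np.log(3)):
    scale = 1.0 / eps  # sensitivity of a count query is 1, so lambda = 1/epsilon
    noisy = {k: v + rng.laplace(0.0, scale) for k, v in counts.items()}
    sse = sum((noisy[k] - counts[k]) ** 2 for k in counts)
    print(f"epsilon = {eps:.4f}, noisy counts = {noisy}, SSE = {sse:.4f}")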

Page 30: Differential Privacy

2.2 Exponential Mechanism with Differential Privacy

Page 31: Differential Privacy

Exponential Mechanism
• The exponential mechanism mainly targets tasks with categorical outputs (e.g., the query result is a selection or a scheme), for which injecting random noise no longer yields meaningful results.
• The exponential mechanism tackles this problem by performing random perturbation during the selection of the output:

\Pr[\omega \text{ is selected}] \propto \exp\left(\frac{\varepsilon f_s(\omega)}{2 \Delta f_s}\right)

where Ω is the set of all possible output values, ω ∈ Ω is a possible output value, f_s is the score function, Δf_s is the sensitivity of f_s, and ε is still the privacy budget:

\Delta f_s = S(f_s) = \max_{D_1, D_2} \max_{\omega' \in \Omega} \| f_s(D_1, \omega') - f_s(D_2, \omega') \|_1

The effectiveness of the exponential mechanism can be improved by exploiting special properties of the score function in certain applications.

Page 32: Differential Privacy

An example of the exponential mechanism with differential privacy

A sports event will be selected from the set {soccer, volleyball, basketball, badminton}. The participants vote to decide which event is held, while satisfying ε-differential privacy during the decision process.

In this case, let the score function f_s(·) be the number of votes.

Obviously, the sensitivity of the score function, Δf_s, is 1.

Page 33: Differential Privacy

\Pr[\text{a sports event } \omega \text{ is selected}] = n \cdot \exp\left(\frac{\varepsilon f_s(\omega)}{2 \Delta f_s}\right)

where n is the normalization constant. The votes are the sensitive information; the selection probabilities are the output with differential privacy:

Sports event  f_s = Votes  Pr[·], ε=0 (n=0.25)  Pr[·], ε=0.1 (n=0.0946)  Pr[·], ε=1 (n≈2.83E-7)
Soccer        30           0.25                 0.4240                   0.924128
Volleyball    25           0.25                 0.3302                   0.075857
Basketball    8            0.25                 0.1411                   0.000015
Badminton     2            0.25                 0.1045                   0.000001

(Each column is normalized so that the selection probabilities sum to 1.)
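A minimal Python sketch of this selection step (illustrative code, not from any DP library; the normalization constant n equals 1 over the sum of the unnormalized weights):

import math, random

votes = {"Soccer": 30, "Volleyball": 25, "Basketball": 8, "Badminton": 2}
sensitivity = 1.0  # one voter changes each score by at most 1

def exponential_mechanism(scores, sensitivity, epsilon):
    # Unnormalized weight of each candidate: exp(eps * score / (2 * sensitivity))
    weights = {w: math.exp(epsilon * s / (2.0 * sensitivity)) for w, s in scores.items()}
    total = sum(weights.values())
    probs = {w: v / total for w, v in weights.items()}
    winner = random.choices(list(probs), weights=list(probs.values()))[0]
    return winner, probs

for eps in (0.0, 0.1, 1.0):
    winner, probs = exponential_mechanism(votes, sensitivity, eps)
    print(eps, winner, {w: round(p, 6) for w, p in probs.items()})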

Page 34: Differential Privacy

3. Applications under Differential Privacy

Page 35: Differential Privacy

3.1 Differentially Private Data Release

• 3.1.1 Interactive query:

Queries and statistical analyses on the dataset are answered after adding randomized noise via DP algorithms, so only perturbed results are returned. This makes it hard for an adversary to infer any attribute value of any individual record in the dataset.

This can be challenging in practice, especially when multiple users need to pose a large number of queries for exploratory analysis.

Page 36: Differential Privacy

• The main objective in differentially private query processing is to maximize the accuracy of the query results while satisfying the privacy guarantees.

The matrix mechanism to optimize linear counting queries:

Chao Li, Gerome Miklau. Optimal error of query sets under the differentially-private matrix mechanism. In the 16th International Conference on Database Theory. 2013, ACM: Genoa, Italy. p. 272-283.

The Low-Rank Mechanism (LRM) for batch queries under differential privacy:

Ganzhao Yuan, Zhenjie Zhang, et al. Low-Rank Mechanism: Optimizing batch queries under differential privacy. In the 38th International Conference on Very Large Data Bases. 2012, VLDB Endowment: Istanbul, Turkey: 1352-1363.

Range count queries via wavelet transforms:

Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Differential privacy via wavelet transforms. TKDE, 2011, 23(8): 1200-1214.

Multiple count queries, data cubes, ......

Page 37: Differential Privacy

An example of a range count query: Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. Differential privacy via wavelet transforms. TKDE, 2011, 23(8): 1200-1214.

Suppose that we perform range count queries on a relational table T that contains d attributes A1, A2, ..., Ad, each of which is discrete or continuous. We define n as the number of tuples in T, and m as the size of the multi-dimensional domain on which T is defined, i.e.,

m = \prod_{i=1}^{d} |A_i|

To perform range count queries under ε-differential privacy, we transform the dataset T into a frequency matrix M with m entries, such that (1) the i-th (i ∈ [1, d]) dimension of M is indexed by the values of Ai, and (2) the entry in M with coordinate vector <x1, x2, ..., xd> stores the number of tuples t in T such that t = <x1, x2, ..., xd>.

Generally speaking, the frequency matrix M linearizes the domain of dataset T.

Page 38: Differential Privacy

e.g.

The following dataset has 3 attributes A, B, C, each with value domain {0, 1}:

d = 3, \quad m = \prod_{i=1}^{d} |A_i| = 8

Dataset T:

tid  A  B  C
1    0  0  1
2    0  1  1
3    0  0  0
4    0  0  1
5    1  1  0

Frequency matrix M (cells ordered by (A, B, C) = 000, 001, ..., 111):

M = (1, 2, 0, 1, 0, 0, 1, 0)
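A small Python sketch of building the frequency matrix for this example (itertools.product enumerates the m = 8 domain cells in the same (A, B, C) order as above):

from itertools import product

# Dataset T: tuples over the binary attributes (A, B, C)
T = [(0, 0, 1), (0, 1, 1), (0, 0, 0), (0, 0, 1), (1, 1, 0)]
domains = [(0, 1), (0, 1), (0, 1)]

# Frequency matrix M: one entry per domain cell, counting matching tuples
M = [sum(1 for t in T if t == cell) for cell in product(*domains)]
print(M)  # [1, 2, 0, 1, 0, 0, 1, 0]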

Page 39: Differential Privacy

A first-cut solution:

Notice that if we modify a tuple in T, then at most two entries in the frequency matrix M will change. Using the Laplace mechanism, we can ensure ε-differential privacy by adding Laplace noise Lap(2/ε) to each entry of M, that is, M* ← M + Lap(2/ε).

This noise-injection approach fails to provide accurate results for aggregate queries. Specifically, if we answer a range count query using a noisy frequency matrix M* generated with the above approach, then the noise in the query result has variance Θ(m/ε²) in the worst case. This is because: (1) each entry in M* has noise variance 8/ε² (by the pdf of Lap(2/ε)); (2) a range count query may cover up to m entries in M*. Hence, this mechanism can produce variance Θ(m/ε²) in the worst case, which renders the results meaningless.

Page 40: Differential Privacy

Improving the above solution using a wavelet transform:

Step 1: dataset T → frequency matrix M;

Step 2: apply a wavelet transform to the frequency matrix M. Generally speaking, a wavelet transform is an invertible linear function, i.e., M → matrix C, such that (i) each entry in C is a linear combination of the entries in M, and (ii) M can be losslessly reconstructed from C. The entries in C are referred to as the wavelet coefficients.

Step 3: add independent Laplace noise to each wavelet coefficient: C → C* (noisy coefficients).

Step 4: C* → M* (a noisy frequency matrix), which is returned as the output and used to answer range count queries.

The above solution satisfies ε-differential privacy while its noise variance is at most

O\left(\frac{(\log m)^d}{\varepsilon^2}\right)
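A one-dimensional Python sketch of these four steps using the unnormalized Haar wavelet (NumPy). This only illustrates the transform-perturb-reconstruct pipeline: the actual Privelet method calibrates a different noise magnitude to each coefficient level, whereas this sketch uses a single illustrative Laplace scale.

import numpy as np

def haar_forward(v):
    # Unnormalized Haar transform of a length-2^k vector: repeatedly split
    # into pairwise averages and pairwise differences.
    v = np.asarray(v, dtype=float)
    coeffs = []
    while len(v) > 1:
        coeffs.append((v[0::2] - v[1::2]) / 2.0)  # detail coefficients
        v = (v[0::2] + v[1::2]) / 2.0             # averages, recursed on
    coeffs.append(v)                              # overall average
    return coeffs

def haar_inverse(coeffs):
    # Lossless reconstruction: avg + diff and avg - diff restore each pair.
    v = coeffs[-1]
    for diff in reversed(coeffs[:-1]):
        out = np.empty(2 * len(v))
        out[0::2] = v + diff
        out[1::2] = v - diff
        v = out
    return v

rng = np.random.default_rng()
M = np.array([1, 2, 0, 1, 0, 0, 1, 0], dtype=float)  # frequency matrix from Page 38
C = haar_forward(M)                                  # Step 2: wavelet coefficients
eps = 1.0
C_star = [c + rng.laplace(0.0, 2.0 / eps, size=len(c)) for c in C]  # Step 3
M_star = haar_inverse(C_star)                        # Step 4: noisy frequency matrix
print(M_star.round(3))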

Page 41: Differential Privacy

• 3.1.2 Non-interactive data release

The dataset is transformed (by adding noise, transformation, or approximation via DP algorithms) and published in the form of marginal/contingency tables, multi-dimensional histograms, or a synthetic dataset that mimics the original dataset. Users submit query/analysis tasks against the released data and obtain noisy results.

Users can arbitrarily access the released data for query and analysis purposes. This is a promising way to perform privacy-preserving data sharing and analytics while providing a rigorous privacy guarantee.

Page 42: Differential Privacy

Contingency table release:

① Grigory Yaroslavtsev, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava. Accurate and efficient private release of datacubes and contingency tables. In Proc. of the 2013 IEEE 29th International Conference on Data Engineering (ICDE). 2013, IEEE Computer Society: Brisbane, Australia. p. 745-756.

② Wahbeh Qardaji, Weining Yang, Ninghui Li. PriView: Practical differentially private release of marginal contingency tables. In Proc. of the 2014 ACM SIGMOD International Conference on Management of Data. 2014, ACM: Snowbird, UT, USA. p. 1435-1446.

Page 43: Differential Privacy

Histogram publishing:

① Yonghui Xiao, Li Xiong, Liyue Fan, Slawomir Goryczka, Haoran Li. DPCube: Differentially private histogram release through multidimensional partitioning. Transactions on Data Privacy, 2014. 7(3): p. 195-222.

② Wahbeh Qardaji, Weining Yang, Ninghui Li. Understanding hierarchical methods for differentially private histograms. In the VLDB Endowment. 2013, VLDB Endowment: Riva del Garda, Trento, Italy. p. 1954-1965.

Page 44: Differential Privacy

Synthetic dataset release:

① Joo Ho Lee, In Yong Kim, Christine M. O'Keefe. On regression-tree-based synthetic data methods for business data. Journal of Privacy and Confidentiality, 2013. 5(1): 107-135.

② Haoran Li, Li Xiong, Xiaoqian Jiang. Differentially private synthesization of multi-dimensional data using Copula functions. Adv Database Technol., 2014: 475-486.

Page 45: Differential Privacy

Note: In probability theory and statistics, a copula is a multivariate probability distribution for which the marginal probability distribution of each variable is uniform. Copulas are used to describe the dependence between random variables.

Page 46: Differential Privacy

How to sample to generate a synthetic dataset?

Dataset D contains attributes A and B, both with binary domains. Suppose the following are their marginal and joint distributions (interior cells give the joint distribution; the row and column margins give the marginal distributions):

                 A = 0    A = 1      marginal of B
B = 0            1/n      0          1/n
B = 1            0        (n-1)/n    (n-1)/n
marginal of A    1/n      (n-1)/n    1

Sampling from this distribution yields a synthetic dataset such as:

Id   A   B
1    0   0
2    1   1
3    1   1
...  1   1
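A minimal Python sketch of this sampling step (the joint distribution above is encoded directly, with an illustrative n = 10):

import random

n = 10  # illustrative dataset size
# Joint distribution Pr[A, B] from the table above; all other cells have probability 0
joint = {(0, 0): 1.0 / n, (1, 1): (n - 1.0) / n}

cells = list(joint)
weights = [joint[c] for c in cells]
synthetic = [random.choices(cells, weights=weights)[0] for _ in range(n)]
print(synthetic)  # on average one (0, 0) row and n-1 (1, 1) rows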

Page 47: Differential Privacy

3.2 Data mining & Machine Learning under Differential Privacy

• Frequent-itemset mining: "PrivBasis: frequent itemsets mining with differential privacy", "Top-k Frequent Itemsets via Differentially Private FP-trees", and "a method for accurately mining top-k frequent patterns under differential privacy" (差分隐私保护下一种精确挖掘top-k频繁模式方法).

Raghav Bhaskar, Srivatsan Laxman, Adam Smith, Abhradeep Thakurta. Discovering frequent patterns in sensitive data. In the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010, ACM: Washington, DC, USA. p. 503-512.

① k patterns are selected from the frequent itemsets whose length is more than l;
② Laplace noise is added to the frequencies of the k patterns.

Page 48: Differential Privacy

• Classification

Private decision trees: Arik Friedman, Assaf Schuster. Data mining with differential privacy. In the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010, ACM: Washington, DC, USA. p. 493-502.

Noman Mohammed, Rui Chen, Benjamin C.M. Fung, Philip S. Yu. Differentially private data release for data mining. In the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2011, ACM: San Diego, California, USA. p. 493-501.

Note: this paper presented a differentially private anonymization algorithm based on generalization (DiffGen) that achieves ε-differential privacy and supports effective classification analysis.

3.2 Data mining & Machine Learning under Differential Privacy

Page 49: Differential Privacy

• Clustering

Clustering is an important data analysis technique that may disclose sensitive data. Clustering algorithms include k-means, k-center, and k-median clustering.

Kobbi Nissim, Sofya Raskhodnikova, Adam Smith. Smooth sensitivity and sampling in private data analysis. In the 39th Annual ACM Symposium on Theory of Computing (STOC'07). 2007, ACM: San Diego, California, USA. p. 75-84.

Note: this paper proposed a k-means cluster-center release method satisfying ε-differential privacy, called Pk-means, using the sample-and-aggregate framework. Pk-means provides a way to measure smooth bounds and smooth sensitivity, in which the setting of the privacy budget ε is crucial.

Cynthia Dwork. A firm foundation for private data analysis. Communications of the ACM, 2011. 54(1): p. 86-95.

The above clustering methods satisfy ε-differential privacy but have low utility. In particular, selecting the k cluster centers is NP-hard when the dataset is very large.

3.2 Data mining & Machine Learning under Differential Privacy

Page 50: Differential Privacy

• Regression analysis with differential privacy

3.2 Data mining & Machine Learning under Differential Privacy

Page 51: Differential Privacy

• Regression analysis with differential privacy

3.2 Data mining & Machine Learning under Differential Privacy

Smith, A. Privacy-preserving statistical estimation with optimal convergence rates. In the Forty-Third Annual ACM Symposium on Theory of Computing. 2011, ACM: San Jose, California, USA. p. 813-821.

This method, called LPLog, first computes the parameter vector of the objective (or prediction) function, then adds Laplace noise to the parameter vector, and finally evaluates the objective or prediction function for prediction or analysis. The sensitivity of the parameter-vector computation is too high, which leads to low prediction accuracy.

Page 52: Differential Privacy

• Regression analysis with differential privacy

3.2 Data mining & Machine Learning under Differential Privacy

Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. In the 25th Annual Conference on Learning Theory. 2012: Edinburgh, Scotland. p. 25.1-25.40.

Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 2011. 12(Mar): p. 1069-1109.

The above methods (the ERM model) perturb the objective function directly: noise is added to the average of the per-record objective function over the dataset, and the noisy parameter vector is then computed:

\bar{f}_D(w) = \frac{1}{n}\sum_{i=1}^{n} f(t_i, w) + \frac{1}{n} b^T w + \frac{1}{2}\Delta \|w\|^2, \qquad w^* = \arg\min_w \bar{f}_D(w)

where f_D(w) is the objective function to be perturbed, b is the noise vector, w is the parameter vector to be optimized, and t_i is a tuple in the dataset.

Drawback: computing the sensitivity of w* is computationally very expensive, and the method requires the objective function to be convex and twice differentiable.

Page 53: Differential Privacy

• Regression analysis with differential privacy

3.2 Data mining & Machine Learning under Differential Privacy

Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Marianne Winslett. Functional Mechanism: Regression Analysis under Differential Privacy. in Proc. of The 38th International Conference on Very Large Data Bases. 2012, VLDB Endowment: Istanbul, Turkey. p. 1364-1375.

The Functional Mechanism (FM) realizes linear and logistic regression under differential privacy. FM first perturbs f_D(w) with noise to obtain a perturbed objective function \bar{f}_D(w), where

f(t_i, w) = (y_i - x_i^T w)^2, \qquad f_D(w) = \sum_{i=1}^{n} f(t_i, w)

and t_i is a tuple composed of (x_i, y_i), in which x_i comprises d attributes and y_i is the class attribute.

Then w* is computed by the following formula:

w^* = \arg\min_w \bar{f}_D(w)

Page 54: Differential Privacy

• Regression analysis with differential privacy

3.2 Data mining & Machine Learning under Differential Privacy

Jun Zhang, Xiaokui Xiao, Yin Yang, Zhenjie Zhang, Marianne Winslett. PrivGene: Differentially private model fitting using genetic algorithms. In the 2013 ACM SIGMOD International Conference on Management of Data. 2013, ACM: New York, USA. p. 665-676.

Note: this paper proposes PrivGene, a general-purpose differentially private model-fitting solution based on genetic algorithms (GA). PrivGene performs the random perturbations using a novel technique called the Enhanced Exponential Mechanism (EEM), which improves over the exponential mechanism and can be used to perform three common analysis tasks involving model fitting: logistic regression, SVM classification, and k-means clustering.

Page 55: Differential Privacy

3.2 Data mining & Machine Learning under Differential Privacy

• SVM with differential privacy

B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 2012. 4(1): p. 65-100.

Smith, A. Privacy-preserving statistical estimation with optimal convergence rates. In the Forty-Third Annual ACM Symposium on Theory of Computing. 2011, ACM: San Jose, California, USA. p. 813-821. (PrivateSVM)

Lei Jing. Differentially private M-estimators. In the 23rd Annual Conference on Neural Information Processing Systems. 2011: Granada, Spain. p. 361-369. (ObjectiveSVM)

PrivateSVM perturbs the parameter vector w with Laplace noise to produce w*, so that the result satisfies ε-differential privacy. However, because the sensitivity of w* is too high, this method introduces a large amount of noise and therefore yields relatively low classification accuracy.

ObjectiveSVM is a classification method that adds noise to the objective function: random noise b is generated from a Laplace distribution and added to the risk function f_D(w) to obtain the perturbed objective function \bar{f}_D(w); w* is then computed as

w^* = \arg\min_w \bar{f}_D(w)

Although ObjectiveSVM achieves higher classification accuracy than PrivateSVM, its drawback is that the objective function must have specific properties, namely convexity and twice differentiability.

Page 56: Differential Privacy

4. Our Current Research Work

Page 57: Differential Privacy

Existing challenges in data release for high-dimensional data or large attribute domains:

1. The high dimensionality and large attribute domains result in a large number of histogram bins or contingency-table entries, which may have skewed distributions or extremely low counts, leading to significant perturbation or estimation errors.

2. The large domain space, m = \prod_{i=1}^{d} |A_i|, incurs high computational complexity in both time and space. For DP histogram methods that use the original histogram as input, it is infeasible to read all histogram bins into memory simultaneously due to memory constraints, and external algorithms need to be considered.

Page 58: Differential Privacy

For example:

A million-row dataset D (record count n = 10^6) with 10 attributes, each of which has 20 possible values, results in a domain space of size m = 20^10 ≈ 10^13 entries (roughly 10 TB); this is the output scalability problem.

On the other hand, when the dataset D is represented as the full-dimensional contingency table x for histogram or contingency-table publishing with differential privacy, the average count in each entry of x is n/m = 10^-7, which is very small. Once noise is added to x (or some transformation of it) to obtain x*, the noise completely dominates the original signal, making the published vector x* next to useless; that is, the signal-to-noise ratio is low.

Page 59: Differential Privacy

Our current research work I:

Differentially Private High-Dimensional Data Release via Bayesian Networks

Page 60: Differential Privacy

Research target 1:
• Let A be the set of attributes of a dataset D, d the size of A, and n the number of tuples in D.
• Research target 1 is to build a model that describes the conditional independence among the attributes of D using a k-degree Bayesian network (with small k).

Page 61: Differential Privacy

For example:

Dataset D:

age  workclass    education     occupation        income
35   Federal-gov  9th           Farming-fishing   <=50K
43   Private      11th          Transport-moving  <=50K
59   Private      HS-grad       Tech-support      <=50K
56   Local-gov    Bachelors     Tech-support      >50K
19   Private      HS-grad       Craft-repair      <=50K
54   Federal-gov  Some-college  ?                 >50K
39   Private      HS-grad       Exec-managerial   <=50K
49   Private      HS-grad       Craft-repair      <=50K
23   Local-gov    Assoc-acdm    Protective-serv   <=50K
20   Private      Some-college  Sales             <=50K
...  ...          ...           ...               ...

[Diagram: k-degree Bayesian network N over age, workclass, education, occupation, income; k = 3]

The AP (Attribute-Parent) pairs in N:

Node index  Xi          Πi
1           age         Φ
2           workclass   age
3           education   age
4           occupation  age, education, workclass
5           income      occupation, workclass

Page 62: Differential Privacy

Research target 2:
• Research target 2 is to create a synthetic dataset D' that approximates the original dataset D while satisfying ε-differential privacy.

Page 63: Differential Privacy

[Diagram: the original dataset D is approximated by a synthetic dataset D' produced by an ε-differential privacy algorithm, so that data analysis, data mining, or statistics performed on D and on D' achieve nearly the same results.]

Page 64: Differential Privacy

Crucial scientific issue 1
• How to evaluate the divergence between Pr[A] and Pr_N[A], where Pr[A] is the distribution of tuples over the attribute set A in dataset D, and Pr_N[A] is the approximation of Pr[A] defined by the Bayesian network N.

Page 65: Differential Privacy

The mutual information I(X, Π) between an attribute X and its parent set Π in dataset D is defined as:

I(X, \Pi) = \sum_{x \in dom(X)} \sum_{\pi \in dom(\Pi)} \Pr[X = x, \Pi = \pi] \log \frac{\Pr[X = x, \Pi = \pi]}{\Pr[X = x] \Pr[\Pi = \pi]}

where dom(X) and dom(Π) are the value domains of attributes X and Π respectively, Pr[X, Π] is the joint distribution of X and Π, and Pr[X] and Pr[Π] are the marginal distributions of X and Π.

The Kullback-Leibler divergence (KL-divergence) between Pr[A] and Pr_N[A] is used to measure the difference between the two probability distributions, and is given by:

KL(\Pr[A], \Pr\nolimits_N[A]) = -\sum_{i=1}^{d} I(X_i, \Pi_i) + \sum_{i=1}^{d} H(X_i) - H(A)

where H(X_i) and H(A) are the notions of entropy from information theory:

H(X) = -\sum_{x \in dom(X)} \Pr[X = x] \log_2(\Pr[X = x])
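A small Python sketch computing these quantities empirically from two columns of a dataset (log base 2; for simplicity Π is treated as a single column, though in the Bayesian network it is a tuple of parent attributes):

import math
from collections import Counter

def entropy(xs):
    # H(X) = -sum Pr[X=x] log2 Pr[X=x]
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, pis):
    # I(X, Pi) = sum Pr[x, pi] log2( Pr[x, pi] / (Pr[x] * Pr[pi]) )
    n = len(xs)
    px, ppi, pxy = Counter(xs), Counter(pis), Counter(zip(xs, pis))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (ppi[p] / n)))
               for (x, p), c in pxy.items())

# Toy columns: X is strongly correlated with Pi, so I(X, Pi) approaches H(X)
X  = [0, 0, 1, 1, 1, 0, 1, 1]
Pi = [0, 0, 1, 1, 1, 0, 1, 0]
print(mutual_information(X, Pi), entropy(X))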

Page 66: Differential Privacy

Crucial scientific issue 2
• How to design a k-degree Bayesian network N on dataset D while satisfying differential privacy and the following condition:

\Pr\nolimits_N[A] = \prod_{i=1}^{d} \Pr[X_i \mid \Pi_i] \approx \Pr[A] = \Pr[X_1, \ldots, X_d]

Page 67: Differential Privacy

Using the mutual information function I as the score function of the exponential mechanism under ε/2-differential privacy, an algorithm for creating a k-degree Bayesian network on dataset D is designed.

Δ is a scaling factor. In order to satisfy ε/2-differential privacy, the value of Δ is set as follows:

\Delta = \frac{2(d-1) S(I)}{\varepsilon}

where S(I) denotes the sensitivity of the mutual information function I(X, Π).

The core question is how to improve the score function in the exponential mechanism under differential privacy, and how to compute its sensitivity, so as to obtain a high-quality Bayesian network approximating the original distribution of dataset D.

Page 68: Differential Privacy

Crucial scientific issue 3
• Given the Bayesian network N constructed on dataset D, how to generate a synthetic dataset D' approximating the original dataset D while satisfying ε/2-differential privacy.

Page 69: Differential Privacy

Step 1: compute noisy conditional distributions;

Step 2: derive an approximate distribution of the tuples in dataset D.

To satisfy ε/2-differential privacy with the Laplace mechanism, the noise scale is λ = 4(d - k)/(nε).

Page 70: Differential Privacy

Sampling algorithm

Step 3: sample tuples from the approximate distribution to generate a synthetic dataset D*.

The Bayesian network N provides a means to perform sampling efficiently without materializing Pr*_N[A], which reduces time and space consumption: each X_i is sampled from the conditional distribution Pr*[X_i | Π_i] independently, without considering any attribute not in Π_i ∪ {X_i}.

For any j > i, X_j ∉ Π_i. If we sample X_i (i ∈ [1, d]) in increasing order of i, then by the time X_j (j ∈ [2, d]) is to be sampled, we must have sampled all attributes in Π_j. That is to say, sampling X_j does not require the full distribution Pr*_N[A].
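A Python sketch of this ancestral-sampling procedure. The AP pairs are given in increasing node order, as on Page 61; cond[i] maps a tuple of parent values to a distribution over X_i's values and, in the actual algorithm, would hold the noisy conditionals Pr*[X_i | Π_i]:

import random

def sample_tuple(ap_pairs, cond):
    # ap_pairs: list of (attribute index, tuple of parent indices), increasing order.
    # cond[i]: dict mapping a tuple of parent values -> {value: probability}.
    t = {}
    for i, parents in ap_pairs:
        dist = cond[i][tuple(t[p] for p in parents)]  # parents already sampled
        values, probs = zip(*dist.items())
        t[i] = random.choices(values, weights=probs)[0]
    return t

# Tiny illustrative network: X1 has no parents; X2 has parent X1
ap_pairs = [(1, ()), (2, (1,))]
cond = {
    1: {(): {0: 0.3, 1: 0.7}},
    2: {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
}
print([sample_tuple(ap_pairs, cond) for _ in range(5)])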

Page 71: Differential Privacy

Our current research work II:

Differentially Private Model Fitting Using the Clonal Selection Algorithm

Page 72: Differential Privacy

Existing approaches to differentially private model fitting generally follow the methodology of developing a differentially private version of an algorithm commonly used in the non-private setting.

The main challenge in directly using the Laplace or exponential mechanism is that the sensitivity of such algorithms is prohibitively high.

The paper "Functional mechanism: Regression analysis under differential privacy" proposed the functional mechanism to investigate linear and logistic regression under differential privacy. This solver, however, incurs prohibitive sensitivity.

Popular model-fitting algorithms for logistic regression, such as iteratively re-weighted least squares, are much more complex and likewise incur prohibitively high sensitivity.

Page 73: Differential Privacy

The Functional Mechanism (FM) for model fitting injects random noise into the fitting function f. Applying the functional mechanism to linear regression is relatively straightforward. The fitting function of logistic regression, however, still incurs prohibitively high sensitivity. The paper "Functional mechanism: Regression analysis under differential privacy" tackles this problem by applying FM to a truncated version of the fitting function consisting of the first few terms of its Taylor expansion. This approach imposes considerable information loss. Further, it is limited to fitting functions with a closed-form Taylor expansion. The fitting functions of SVM classification and k-means clustering cannot be handled this way, as neither of them is differentiable.

Page 74: Differential Privacy

The paper "PrivGene: Differentially Private Model Fitting Using Genetic Algorithms" proposed a general-purpose differentially private model-fitting solution based on genetic algorithms. PrivGene performs the random perturbations using a novel technique called the Enhanced Exponential Mechanism (EEM), which improves over the exponential mechanism and achieves high result quality for a broad class of model-fitting tasks.

The goal of EEM, thus, is to maximize the impact of f(D, ω) in the probability assignment, while satisfying the differential privacy requirements.

Page 75: Differential Privacy

PrivGene solution for logistic regression

Let D be a database containing n tuples from a domain T, such that each tuple has d attributes X1, X2, ..., Xd-1, Y, where attribute Y has the binary domain {0, 1}. For each t = (x, y) = (x1, x2, ..., xd-1, y), we assume without loss of generality that |xk| ≤ 1 for k ∈ {1, 2, ..., d-1}, i.e., T = [-1, 1]^(d-1) × {0, 1}. A logistic regression model built on D is parameterized by a vector α and a constant β (called the bias), as formalized in Definition 1.

Page 76: Differential Privacy

PrivGene solution for logistic regression

Algorithm 1 PrivGene(D, f, ε, m, m', r): returns ω
Input: D, f: sensitive dataset and its fitting function
       ε: privacy budget
       m, m': sizes of the candidate set Ω and the selected set Ω', respectively
       r: number of iterations
Output: ω: best parameter vector identified by PrivGene
1: Initialize candidate set Ω with m randomly generated vectors
2: for i = 1 to r - 1 do                        // iterations
3:   Compute Ω' = DP_Select(D, f, Ω, m', ε/r)   // select m' vectors from Ω into Ω'
4:   Set new candidate set Ω to empty
5:   for j = 1 to m/2 do
6:     Randomly choose two vectors ω1, ω2 from Ω'
7:     Compute (v1, v2) = crossover(ω1, ω2)
8:     Call Mutate(v1) and Mutate(v2)
9:     Add v1, v2 to Ω
10:  end for
11: end for
12: Compute ω = DP_Select(D, f, Ω, 1, ε/r)
13: return ω

Page 77: Differential Privacy

PrivGene solution for logistic regression

Algorithm 2 DP_Select(D, f, Ω, m', εs): returns Ω'
Input: D, f: sensitive dataset and its fitting function
       Ω: candidate set of parameter vectors
       m': number of parameter vectors to select from Ω
       εs: total amount of privacy budget used for selecting Ω'
Output: Ω': set of selected parameter vectors
1: Initialize Ω' to empty
2: For each ω ∈ Ω, compute f(D, ω)
3: for i = 1 to m' do
4:   Use privacy budget εs/m' to apply the exponential mechanism or the enhanced
     exponential mechanism to select a parameter vector ω* from Ω, aiming to maximize
     f(D, ω*). (ω* is selected with probability proportional to exp(ε · f(D, ω*)/Δ),
     where Δ is the dampening factor.)
5:   Remove ω* from Ω, and add ω* to Ω'
6: end for
7: return Ω'

Page 78: Differential Privacy

PrivGene solution for SVM classification

Let D ∈ T^n be a database containing n tuples sampled from a d-dimensional domain T = [-1, 1]^(d-1) × {-1, 1}. We denote the i-th (i ∈ [1, d-1]) dimension of T as Xi, and the last dimension of T as Y. For ease of exposition, we use x to denote a vector in [-1, 1]^(d-1), and use t = (x, y) to denote a tuple in D. A linear SVM classifier on D is defined as follows.

Page 79: Differential Privacy

PrivGene solution for SVM classification

To solve SVM classification with PrivGene, we define each parameter vector as a d-dimensional vector, such that the first d - 1 dimensions correspond to α and the last dimension corresponds to β. In addition, a corresponding tuple fitting function is defined.

Following the method used for logistic regression, the dampening factor Δ in algorithm DP_Select(D, f, Ω, m', εs) can then be obtained, which is a non-trivial problem.

Page 80: Differential Privacy

Differentially Private Model Fitting Using the Clonal Selection Algorithm

• Our idea is to use the clonal selection algorithm to perform the differentially private model fitting task, instead of optimizing the dampening factor Δ.

• We investigate the relationship between clonal selection and the exponential mechanism in order to obtain high-accuracy model fitting under differential privacy.

Page 81: Differential Privacy

Thank you!

Page 82: Differential Privacy

Any Questions?