Using Data Privacy for Better Adaptive Predictions
Vitaly Feldman IBM Research – Almaden
Foundations of Learning Theory, 2014
Joint work with: Cynthia Dwork (MSR SVC), Moritz Hardt (IBM Almaden), Omer Reingold (MSR SVC), Aaron Roth (UPenn CS)
Statistical inference
Genome Wide Association Studies
Given: DNA sequences with medical records
Discover:
• Find SNPs associated with diseases
• Predict chances of developing some condition
• Predict drug effectiveness
• Hypothesis testing
Given: samples S = (x₁, …, xₙ) drawn i.i.d. from an unknown distribution P over a domain X
Output: a solution f of value val(f), with a guarantee that the value holds w.p. ≥ 1 − β (over the choice of S)
Existing approaches
Theoretical ML
• Uniform convergence bounds for the solution class: for every f in a class F of bounded complexity, w.p. ≥ 1 − β the empirical value of f is within τ of its true value
• Output stability-based bounds
• But often too loose in practice (do not exploit additional structure) and complicated to derive
Practical ML
• Cross-validation
Statistics
• Model-specific fit and significance tests
• Bootstrapping etc.
Real world is interactive
Outcomes of analyses inform future manipulations on the same data
• Exploratory data analysis
• Model selection
• Feature selection
• Hyper-parameter tuning
• Public data – findings inform others
Samples are no longer i.i.d.!
Is the issue real?
Freedman’s paradox (1983):
Data: many explanatory variables and a response, all drawn as independent noise
Throw away uncorrelated variables: keep only the variables significantly correlated with the response
Perform least squares regression over the remaining variables: the fit and its significance tests come out looking strong, despite there being no signal
“Such practices can distort the significance levels of conventional statistical tests. The existence of this effect is well known, but its magnitude may come as a surprise, even to a hardened statistician.”
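Freedman's setup is easy to reproduce. The sketch below (illustrative, not from the talk; all parameter choices are arbitrary) regresses pure noise on variables pre-selected by their correlation with the response, and reads off an inflated fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))   # explanatory variables: pure noise
y = rng.standard_normal(n)        # response: independent pure noise

# Step 1: keep only variables whose correlation with y looks "significant"
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
keep = np.abs(corr) > 2 / np.sqrt(n)   # roughly a 5% two-sided cutoff
Xs = X[:, keep]

# Step 2: least-squares regression on the surviving variables only
yc = y - y.mean()
beta, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
r2 = 1 - np.sum((yc - Xs @ beta) ** 2) / np.sum(yc ** 2)
print(keep.sum(), round(r2, 2))   # a seemingly strong fit to pure noise
```

The second-stage R² is far larger than what an honest regression on noise would produce, because the screening step already looked at the data.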
Kaggle competitions
[Diagram: the competition data is split into a public and a private part; submissions receive a public score computed on the public split, and a private score computed on the held-out private data]
http://www.rouli.net/2013/02/five-lessons-from-kaggles-event.html
“If you based your model solely on the data which gave you constant feedback, you run the danger of a model that overfits to the specific noise in that data.” –Kaggle FAQ.
Adaptive statistical queries
For each query φᵢ : X → [0,1], a τ-valid answer vᵢ satisfies |vᵢ − E_{x∼P}[φᵢ(x)]| ≤ τ, where τ is the tolerance of the query. With probability ≥ 1 − β this holds for all t queries.
[Diagram: learning algorithm(s) adaptively issue queries φ₁, φ₂, …, φ_t to an SQ oracle [K93, FGRVX13] holding the sample S and receive answers v₁, v₂, …, v_t]
Can measure error/performance and test hypotheses
Can be used in place of samples in most algorithms!
SQ algorithms
• PAC learning algorithms (except parities)
• Convex optimization (Ellipsoid, iterative methods)
• Expectation maximization (EM)
• SVM (with kernel)
• PCA
• ICA
• ID3
• k-means
• Method of moments
• MCMC
• Naïve Bayes
• Neural networks (backprop)
• Perceptron
• Nearest neighbors
• Boosting
[K 93, BDMN 05, CKLYBNO 06, FPV 14]
For a query φ, respond with the empirical mean φ(S) = (1/n) Σ_{x∈S} φ(x). How many samples are needed to answer t queries? If φ₁, …, φ_t are fixed in advance then n = O(log(t/β)/τ²) suffices.
With 1 round of adaptivity (constant τ and β)?
Naïve answering
Chernoff bound (per query) + union bound (over the t queries)
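The danger of adaptivity can be seen in a few lines. The simulation below (illustrative, not from the talk) asks t "honest" random queries, then uses a single adaptive round, a majority vote weighted by the signs of the empirical answers, to build one query whose empirical mean is far from its true mean of 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 1000, 5001          # odd t avoids ties in the majority vote below
# Row j holds query phi_j's ±1 values on the n sample points. On any fresh
# point each query is an independent fair coin, so every query (and every
# fixed combination of queries) has true mean 0.
Q = rng.choice([-1.0, 1.0], size=(t, n))
emp = Q.mean(axis=1)       # honest empirical answers, each ≈ 0 ± 1/sqrt(n)

# One adaptive round: a pointwise majority vote over the queries, weighted
# by the signs of their empirical answers. Still a query with true mean 0.
agg = np.sign(np.sign(emp) @ Q)
print(round(agg.mean(), 2))   # empirical mean far above the true mean 0
```

Each individual query is within ~1/√n of its true mean, yet the single adaptively chosen combination overfits S almost completely.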
Our result
There exists an algorithm that can answer t adaptively chosen SQs such that with probability ≥ 1 − β the answers are τ-valid, using a number of samples n that grows only polylogarithmically with the number of queries t
The algorithm runs in time linear in |X| per query, i.e., exponential in the dimension of the data
Also:
Cannot be achieved efficiently: lower bound for poly-time algorithms under crypto assumptions [HU14]
Privacy-preserving data analysis
How to get utility from data while preserving privacy of individuals
Each sample point is created from the personal data of an individual (GTTCACG…TC, “YES”)
Differential Privacy [DMNS06]
(Randomized) algorithm A is ε-differentially private if for any two data sets S, S′ that differ in a single element:
for every set of outputs Z, Pr[A(S) ∈ Z] ≤ e^ε · Pr[A(S′) ∈ Z]
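A minimal check of the definition, using randomized response (a standard ε-DP mechanism, chosen here for illustration): each bit is reported truthfully with probability e^ε/(1 + e^ε), and the likelihood ratio between neighboring data sets never exceeds e^ε.

```python
import math

def rr_prob(report, true_bit, eps):
    """Probability that eps-randomized response on true_bit outputs report."""
    p_keep = math.exp(eps) / (1 + math.exp(eps))
    return p_keep if report == true_bit else 1 - p_keep

eps = 0.5
# Neighboring "data sets": a single individual's bit flips from 0 to 1.
# For every possible output, the likelihood ratio is bounded by e^eps,
# which is exactly the eps-DP condition.
ratios = [rr_prob(r, 0, eps) / rr_prob(r, 1, eps) for r in (0, 1)]
print(ratios, math.exp(eps))
```

The bound is tight here: the ratio for the matching report equals e^ε exactly.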
Properties of DP
• Privacy has a price
– Minimum data set size usually scales as 1/ε
• Composable adaptively:
If A₁ is ε₁-DP and A₂ is ε₂-DP then the composition (A₁, A₂) is (ε₁ + ε₂)-DP
Or better [DRV 10]: for every ε and δ, the adaptive composition of k ε-DP algorithms is (√(2k ln(1/δ))·ε + kε(e^ε − 1), δ)-DP
Let L be a (bounded) loss function and A an ε-DP algorithm.
DP implies generalization
For all distributions P over X and each i ∈ [n]:
E_{A, S∼Pⁿ}[L(A(S), xᵢ)] ≤ e^ε · E_{A, S∼Pⁿ, x∼P}[L(A(S), x)]
i.e., the expected loss on a point of S is at most e^ε times the expected loss on a fresh point.
DP composition implies that DP-preserving algorithms can reuse data adaptively
For a loss bounded by B:
LHS = ∫₀^B Pr_A[L(A(S), x) ≥ z] dz ≤ ∫₀^B e^ε · Pr_A[L(A(S′), x) ≥ z] dz = e^ε · RHS
(the inequality applies the ε-DP guarantee to each event {L(A(·), x) ≥ z})
Proof
For i ∈ [n] and x ∈ X, let S^{i←x} be S with the i-th element replaced by x. By ε-DP:
E_A[L(A(S), xᵢ)] ≤ e^ε · E_A[L(A(S^{i←x}), xᵢ)]
Taking expectation over S ∼ Pⁿ and x ∼ P gives the claim, since xᵢ is a fresh point for S^{i←x}.
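A toy contrast (illustrative only, not the talk's mechanism; the noise scale is picked by hand rather than via a composition analysis) between exact and noise-perturbed answers under a simple overfitting strategy: average the queries whose reported answer looked positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 1000, 5000
# Row j holds query phi_j's ±1 values on the n sample points; on a fresh
# point every query is a fair coin, so each true mean is 0.
Q = rng.choice([-1.0, 1.0], size=(t, n))
emp = Q.mean(axis=1)

def overfit_bias(answers):
    # Adaptive analyst: average the queries whose reported answer was
    # positive. The average is itself a query with true mean 0, so its
    # empirical mean on S measures how much the analyst overfit.
    return Q[answers > 0].mean(axis=0).mean()

exact_bias = overfit_bias(emp)                                # noticeably > 0
noisy_bias = overfit_bias(emp + rng.laplace(0, 0.1, size=t))  # much closer to 0
print(round(exact_bias, 3), round(noisy_bias, 3))
```

Perturbing the answers weakens the coupling between the selection step and the sample, which is the phenomenon the DP generalization bound formalizes.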
Counting queries
Counting query on a data set S: for a function φ : X → [0,1], the value φ(S) = (1/n) Σ_{x∈S} φ(x)
DP algorithms for approximate answering of counting queries have been actively studied for ~10 years
[Diagram: data analyst(s) adaptively issue counting queries φ₁, φ₂, …, φ_t to a query release algorithm holding S, receiving answers v₁, v₂, …, v_t with φᵢ : X → [0,1] and |vᵢ − φᵢ(S)| ≤ τ]
From private counting to SQs
Let A be an (adaptive) query-asking strategy
Let B be an algorithm that answers the counting queries of A such that:
1. B is ε-DP
2. For any data set S, w.p. ≥ 1 − β all answers are τ-accurate
Then for any distribution P over X, B applied to S ∼ Pⁿ w.p. ≥ 1 − β′ outputs τ′-valid responses to the SQs of A, provided that ε and τ are sufficiently small relative to τ′ and n is sufficiently large
Can be extended to (ε, δ)-DP with sufficiently small δ
Proof I
For i ∈ [t], let φᵢ denote the i-th query asked by A; it depends on the previous answers and the randomness of A (and nothing else). Let ψ be the query under consideration.
For i ∈ [t], let Φᵢ denote the event fixing the queries asked so far.
Key claim:
Pr_{A, S∼Pⁿ}[|ψ(P) − ψ(S)| ≥ 2τ] ≤ β/t
+ union bound and accuracy of B: every answer is within τ of ψ(S), hence within 3τ of ψ(P), w.p. ≥ 1 − β
Proof: moment bound
E_{A, S∼Pⁿ}[ψ(S)^k | Φ] ≤ e^{kε} · E_{S∼Pⁿ}[φ(S)^k]
where φ is the fixed query determined by the event Φ. It suffices to prove the bound for all k: a high moment plus Markov's inequality gives the concentration claim.
Key step: ε-DP approximately preserves conditional expectations.
Corollaries
There exists an algorithm that can answer t adaptively chosen SQs such that with probability ≥ 1 − β the answers are τ-valid, using a number of random samples that grows only polylogarithmically with t. The algorithm runs in time linear in |X| per query.
There exists an ε-DP algorithm that can answer t (adaptive) counting queries such that with probability ≥ 1 − β the answers are τ-accurate, provided that n is sufficiently large. The algorithm runs in time linear in |X| per query. There is also an (ε, δ)-DP version with improved parameters.
MWU + Sparse Vector [HR10]
• Initialize h to the uniform distribution over X
• For each query φᵢ:
– if φᵢ(h) agrees with φᵢ(S) + Laplace noise up to the threshold, answer with φᵢ(h)
– else, answer with φᵢ(S) + Laplace noise and update h via a multiplicative weights step
At most ~log|X|/τ² MWU updates are needed
Sparse Vector Technique [DNRRV09]: privacy loss is incurred only when the approximate comparison with the threshold fails
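A simplified sketch of the private multiplicative weights idea (the function name and parameter choices are illustrative, and the sparse-vector accounting that makes the real algorithm DP is omitted): maintain a public distribution h over X, answer from h whenever it already agrees with a noisy check against S, and spend an update only when it does not.

```python
import numpy as np

def private_mw_sketch(S_hist, queries, T=0.1, noise=0.05, rng=None):
    """S_hist: empirical distribution of S over a finite domain X.
    queries: counting queries given as [0,1]-valued arrays over X."""
    rng = rng or np.random.default_rng(0)
    h = np.full_like(S_hist, 1.0 / len(S_hist))  # public guess: uniform
    answers = []
    for phi in queries:
        noisy = phi @ S_hist + rng.laplace(0, noise)  # noisy check vs S
        if abs(phi @ h - noisy) <= T:
            answers.append(phi @ h)   # h already accurate: no update, and
        else:                         # (via sparse vector) no privacy loss
            answers.append(noisy)     # pay privacy, correct h toward S
            sign = 1.0 if phi @ h < noisy else -1.0
            h = h * np.exp(sign * T * phi)   # multiplicative weights step
            h = h / h.sum()
    return answers

S_hist = np.array([0.4, 0.3, 0.2, 0.1])
queries = [np.eye(4)[i % 4] for i in range(12)]
ans = private_mw_sketch(S_hist, queries)
print([round(a, 2) for a in ans])
```

Because h can change meaningfully only ~log|X|/τ² times, the privacy cost scales with the number of updates rather than the number of queries.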
Threshold validation queries
Threshold SQ: a query φ : X → [0,1] together with a threshold θ
A τ-valid response is YES only if E_P[φ] ≥ θ − τ, and NO only if E_P[φ] ≤ θ + τ
There exists an algorithm that can answer t adaptively chosen threshold SQs such that with probability ≥ 1 − β the answers are τ-valid, as long as at most a bounded number of comparisons failed, using a number of random samples that grows only logarithmically with t (and polynomially with the number of failed comparisons).
The algorithm is computationally efficient (polynomial time per query).
Conclusions
• Adaptive data manipulations can cause overfitting/false discovery
• Theoretical model of the problem based on SQs
• Using exact empirical means is risky
• DP provably preserves “freshness” of samples: adding noise can provably prevent overfitting
• In applications, not all of the data must be used with DP