![Page 1: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/1.jpg)
DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCH
Salil Vadhan
Harvard [email protected]
[[email protected] bounces mail from census.gov, google.com, …]
NAS CNSTAT Privacy workshop
June 6, 2019
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our funders.
with support from:
![Page 2: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/2.jpg)
Computer Science, Law, Social Science, Statistics
http://privacytools.seas.harvard.edu/
The Privacy Tools Project (2012-present)
![Page 3: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/3.jpg)
Our Goal
Achieve: privacy & utility
Via: computer science, social science, data science, law & policy
Chong, Vadhan, Gasser, Sweeney, King, Crosas, Airoldi, Altman, Nissim (Georgetown), Gaboardi (Buffalo), Honaker, O’Brien, Smith (BU), Dwork, Ullman (NEU), Wood
Program on Information Science, MIT Libraries
![Page 4: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/4.jpg)
Target: Data Repositories
Dataverse Repositories around the world: 27 installations
Harvard Dataverse Repository: 2400 dataverses with 75,000 datasets and 2.9 million downloads
![Page 5: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/5.jpg)
Datasets are restricted due to privacy concerns
Goal: enable wider sharing while protecting privacy
![Page 6: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/6.jpg)
Approach: Integrated Privacy Tools
Robot Lawyers
DataTags Interview
Sensitive Data Set
Deposit in repository
Sensitive Data Set
Restricted Access Data Set w/ DUA
PSI: Differential Privacy
Public Access Statistics
Tools we are working on
![Page 7: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/7.jpg)
PSI: Differential Privacy Tool
Robot Lawyers
DataTags Interview
Sensitive Data Set
Deposit in repository
Sensitive Data Set
Restricted Access Data Set w/ DUA
PSI: Differential Privacy
Public Access Statistics
Statistical summaries and exploratory data analysis with strong privacy guarantees
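To make "statistical summaries with strong privacy guarantees" concrete, here is a minimal sketch of two differentially private summaries (a count and a bounded mean) built on the standard Laplace mechanism. It is not PSI's implementation: the function names, the clamping bounds, and the choice to treat the number of records as public are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon, rng):
    """epsilon-DP count of records satisfying predicate.

    Changing one record moves the count by at most 1, so the
    Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def dp_mean(values, lower, upper, epsilon, rng):
    """epsilon-DP mean of values clamped to [lower, upper].

    Assumption: the number of records n is public, and neighboring
    datasets differ by replacing one record.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one record")
    clamped = [min(max(v, lower), upper) for v in values]
    # Replacing one clamped record moves the mean by at most this much.
    sensitivity = (upper - lower) / n
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 58, 62, 29, 47]  # made-up data
    print(dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng))
    print(dp_mean(ages, lower=0, upper=100, epsilon=0.5, rng=rng))
```

Clamping to a fixed range is what bounds the sensitivity of the mean; the bounds have to be chosen without looking at the data, for example from a codebook.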
![Page 8: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/8.jpg)
PSI: Differential Privacy Tool
![Page 9: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/9.jpg)
http://psiprivacy.org/about
![Page 10: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/10.jpg)
Goals of PSI
• Generality: applicable to datasets across social science.
• Accessible: no differential privacy expert needed to optimize algorithms for a particular dataset or application
• Workflow-compatible: fits into workflow of practicing social scientists, using familiar concepts & tools
• Tiered access: DP interface for wide access to rough statistical information; users can still apply for raw data (cf. Census PUMS vs RDCs)
Hope: Beta version deployed in Dataverse by end of 2019
![Page 11: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/11.jpg)
Intervene at time of deposit in repository
![Page 12: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/12.jpg)
Allow depositor to make DP releases
![Page 13: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/13.jpg)
Others can explore releases & make DP queries
![Page 14: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/14.jpg)
Regression, Inference, and Machine Learning
Theorem [KLNRS08,S11]: Differential privacy is possible for a vast array of machine learning and statistical estimation problems with little loss in asymptotic convergence rate as n → ∞.
• Optimizations & practical implementations for logistic regression, ERM, LASSO, SVMs in [RBHT09,CMS11,ST13,JT14].

| Sex | Blood | ⋯ | HIV? |
|-----|-------|---|------|
| F   | B     | ⋯ | Y    |
| M   | A     | ⋯ | N    |
| M   | O     | ⋯ | N    |
| M   | O     | ⋯ | Y    |
| F   | A     | ⋯ | N    |
| M   | B     | ⋯ | Y    |

DP
Hypothesis or model about world, e.g. rule for predicting disease from attributes
![Page 15: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/15.jpg)
Case Study: Small-Area Regressions
• Opportunity Atlas [Chetty, Friedman, Hendren, Jones, Porter 2018]: linear regressions on Census & IRS data to predict child income rank from parent income rank in each Census tract, broken up by race & gender.
• Challenging for DP:
  • small sample sizes (10’s to 1000’s)
  • sometimes very small variance in explanatory variable
• OA gets good results with a DP-inspired method.
• See John Friedman’s presentation tomorrow!
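A rough way to see why small sample sizes are hard for DP is to compare sampling error with privacy noise for a much simpler statistic than a regression: a Laplace-noised mean of values in a bounded range. The sketch below is only that simplified comparison, not the Opportunity Atlas method. The default range of 1.0 and standard deviation of 0.29 are illustrative assumptions (roughly what a uniformly distributed rank in [0, 1] would give).

```python
import math


def mean_error_breakdown(n, epsilon, value_range=1.0, population_sd=0.29):
    """Return (sampling_se, privacy_sd) for a Laplace-noised mean.

    sampling_se: standard error of the sample mean, sd / sqrt(n).
    privacy_sd:  standard deviation of Laplace noise with scale
                 value_range / (n * epsilon), which is sqrt(2) * scale.
    """
    sampling_se = population_sd / math.sqrt(n)
    privacy_sd = math.sqrt(2) * value_range / (n * epsilon)
    return sampling_se, privacy_sd


if __name__ == "__main__":
    for n in (10, 100, 1000):
        sampling_se, privacy_sd = mean_error_breakdown(n, epsilon=1.0)
        print(f"n={n:5d}  sampling SE={sampling_se:.4f}  privacy SD={privacy_sd:.4f}")
```

Sampling error shrinks like 1/√n while the privacy noise shrinks like 1/(nε), so privacy noise dominates for the smallest cells and becomes minor for large ones. Regression slopes add a second difficulty noted on the slide: when the explanatory variable has very little variance, the slope itself is highly sensitive to any one record.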
![Page 16: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/16.jpg)
Variance Decomposition for Tract-Level Estimates
Teenage Birth Rate For Black Women With Parents at 25th Percentile
[Bar chart. Vertical axis: Variance (0 to 100). Bars: Total Variance, Signal Variance, Sampling Noise Variance, Privacy Noise Variance.]
Privacy Noise: 3% increase in noise var.
Source: Chetty, Friedman, Hendren, Jones, Porter (2018)
![Page 17: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/17.jpg)
Well-chosen DP methods almost as good
[preliminary results, in collaboration with Alabi, McMillan, Sarathy, Smith]
![Page 18: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/18.jpg)
Some challenges
• Managing privacy-loss budget in a query system.
  • Global budget for all analysts: can be exhausted quickly
  • Per-analyst budgets: need to ensure no collusion, limit publication
• “Good” accuracy for small n (e.g. Opportunity Atlas)
  • requires high privacy loss (e.g. 8).
  • and carefully engineered DP algorithms
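The budget-management challenge can be illustrated with a toy accountant that uses basic sequential composition, where the privacy losses of successive queries simply add up. This is a sketch under that assumption only; the class and exception names are hypothetical, and real systems may use tighter composition theorems.

```python
class BudgetExhausted(Exception):
    """Raised when a query would exceed the privacy-loss budget."""


class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = float(total_epsilon)
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def charge(self, epsilon):
        """Reserve epsilon for one query, or refuse if it does not fit."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject
        # a query that exactly uses up the budget.
        if self.spent + epsilon > self.total + 1e-12:
            raise BudgetExhausted(
                f"requested {epsilon}, only {self.remaining} remaining"
            )
        self.spent += epsilon


if __name__ == "__main__":
    # One global budget shared by all analysts: four queries at 0.25 use it up.
    shared = PrivacyBudget(total_epsilon=1.0)
    for _ in range(4):
        shared.charge(0.25)
    try:
        shared.charge(0.25)
    except BudgetExhausted as err:
        print("refused:", err)
```

Giving each analyst a separate `PrivacyBudget` avoids this quick exhaustion, but the per-analyst guarantee only holds if analysts do not pool their answers, which is the collusion concern on the slide.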
![Page 19: DIFFERENTIAL PRIVACY FOR SOCIAL SCIENCE RESEARCHsites.nationalacademies.org/cs/groups/dbassesite/...Dataverse Repositories around the world: 27 installations Harvard Dataverse Repository:](https://reader035.vdocuments.net/reader035/viewer/2022063021/5fe5495940d4cf66566e6577/html5/thumbnails/19.jpg)
Theoretical possibility: rich synthetic data
Utility: preserves fraction of people with every set of attributes!
“fake” people
[Blum-Ligett-Roth ’08….]
Problem: uses computation time exponential in the number of attributes (necessary in the worst case [DNNRV09,UV11])
CDP
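| Sex | Blood | ⋯ | HIV? |
|-----|-------|---|------|
| F   | B     | ⋯ | Y    |
| M   | A     | ⋯ | N    |
| M   | O     | ⋯ | N    |
| M   | O     | ⋯ | Y    |
| F   | A     | ⋯ | N    |
| M   | B     | ⋯ | Y    |

| Sex | Blood | ⋯ | HIV? |
|-----|-------|---|------|
| M   | B     | ⋯ | N    |
| F   | B     | ⋯ | Y    |
| M   | O     | ⋯ | Y    |
| F   | A     | ⋯ | N    |
| F   | O     | ⋯ | N    |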
≪
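To give a feel for where the exponential cost comes from, here is a sketch of a much simpler synthetic-data approach than the one cited on the slide: a Laplace-noised histogram over all combinations of binary attributes, followed by sampling "fake" people from it. This is not the Blum-Ligett-Roth algorithm; the function name, the restriction to binary attributes, and the add-or-remove-one-person neighboring relation are assumptions made for illustration.

```python
import itertools
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram_synthetic(records, d, epsilon, m, rng):
    """Return m synthetic records over {0,1}^d drawn from a noisy histogram.

    Enumerates all 2**d attribute combinations, so time and memory
    grow exponentially in the number of attributes d.
    """
    cells = list(itertools.product((0, 1), repeat=d))
    counts = {cell: 0 for cell in cells}
    for record in records:
        counts[tuple(record)] += 1
    # Adding or removing one person changes a single cell by 1,
    # so Laplace noise with scale 1 / epsilon on every cell suffices.
    weights = [
        max(0.0, counts[cell] + laplace_noise(1.0 / epsilon, rng))
        for cell in cells
    ]
    if sum(weights) == 0:
        weights = [1.0] * len(cells)  # fall back to uniform sampling
    # Sampling from the noisy histogram is post-processing and
    # does not consume additional privacy budget.
    return rng.choices(cells, weights=weights, k=m)


if __name__ == "__main__":
    rng = random.Random(1)
    data = [(1, 0, 1)] * 50 + [(0, 1, 1)] * 50  # made-up records
    for row in noisy_histogram_synthetic(data, d=3, epsilon=1.0, m=5, rng=rng):
        print(row)
```

With d = 3 there are only 8 cells, but d = 40 binary attributes would already require about 10^12 cells, which is why this direct approach does not scale to rich datasets.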