![Page 1: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/1.jpg)
Pyramid
Enhancing Selectivity in Big Data Protection
Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang, Siddhartha Sen
Columbia University
![Page 2: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/2.jpg)
The “Collect-Everything” Mentality
2
● Companies collect enormous amounts of personal data
  ○ Clicks, location, browsing history, and many more
● Data has beneficial uses
  ○ Article recommendation
  ○ Ad targeting
  ○ Fraud detection
● But data raises substantial risks in the event of a breach
![Page 3: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/3.jpg)
The “Data Lake” Mentality
3
[Figure labels: wide-access data lake; sports, fashion, technology; ad targeting, article recommendation, fraud detection]
![Page 4: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/4.jpg)
Collection + Wide Access Lead to Exposure
4
[Figure labels: wide-access data lake; sports, fashion, technology; ad targeting, article recommendation, fraud detection]
![Page 5: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/5.jpg)
Question: Can Companies Be More Selective?
● We hypothesize that not all data that is collected is needed or used.
● If we can distinguish “needed” data from “unneeded” data, we can greatly improve protection.
  ○ E.g., store unneeded data offline
5
![Page 6: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/6.jpg)
Selective Data Systems
6
1. Limit in-use data
2. Avoid accessing unused data
3. Without impacting accuracy, performance
[Figure labels: unused data (tightly protected); in-use data (wide access)]
![Page 7: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/7.jpg)
How to achieve selectivity in machine learning?
● Access to the “working set” is not enough
● (Re)training models requires access to most/all data
● Training set minimization addresses this
  ○ E.g.: sampling, count featurization, active learning, ...
  ○ Can we retrofit these mechanisms for protection?
7
![Page 8: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/8.jpg)
Pyramid
● First selective data system
● Retrofits count featurization for protection
  ○ Keeps a small amount of recent raw data
  ○ Summarizes past data using differentially private count tables
  ○ Combines the raw data with count features and feeds that into ML models for training
● Reduces data exposure by two orders of magnitude with moderate performance degradation
8
![Page 9: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/9.jpg)
Outline
9
● Motivation
● Design
● Evaluation
● Conclusions
![Page 10: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/10.jpg)
Architecture
10
[Architecture diagram labels: Models M1, M2, M3, M4; Pyramid; Count Featurization; Differential Privacy; Count Tables; Recent Raw Data; Cold Raw Data Store]
![Page 11: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/11.jpg)
Architecture
11
[Architecture diagram labels: Models M1, M2, M3, M4; Pyramid; Count Featurization; Differential Privacy; Count Tables; Recent Raw Data; Cold Raw Data Store; Observation <l,x>]
![Page 12: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/12.jpg)
Count Featurization Example
12
| Feature | Value | Click | No Click |
| --- | --- | --- | --- |
| AdId | A1 | 0 | 0 |
| AdId | A2 | 0 | 0 |
| UserID | U1 | 0 | 0 |
| UserID | U2 | 0 | 0 |
| PageID | P1 | 0 | 0 |
| PageID | P2 | 0 | 0 |
![Page 13: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/13.jpg)
Count Featurization Example
13
<Label:Click | AdId:A1, UserID:U1, PageID:P1>

| Feature | Value | Click | No Click |
| --- | --- | --- | --- |
| AdId | A1 | 1 | 0 |
| AdId | A2 | 0 | 0 |
| UserID | U1 | 1 | 0 |
| UserID | U2 | 0 | 0 |
| PageID | P1 | 1 | 0 |
| PageID | P2 | 0 | 0 |
![Page 14: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/14.jpg)
Count Featurization Example
14
<Label:Click | AdId:A1, UserID:U1, PageID:P1>
<Label:No-Click | AdId:A2, UserID:U1, PageID:P2>

| Feature | Value | Click | No Click |
| --- | --- | --- | --- |
| AdId | A1 | 1 | 0 |
| AdId | A2 | 0 | 1 |
| UserID | U1 | 1 | 1 |
| UserID | U2 | 0 | 0 |
| PageID | P1 | 1 | 0 |
| PageID | P2 | 0 | 1 |
![Page 15: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/15.jpg)
Count Featurization Example
15
<Label:Click | AdId:A1, UserID:U1, PageID:P1>
<Label:No-Click | AdId:A2, UserID:U1, PageID:P2>
<Label:No-Click | AdId:A1, UserID:U2, PageID:P2>

| Feature | Value | Click | No Click |
| --- | --- | --- | --- |
| AdId | A1 | 1 | 1 |
| AdId | A2 | 0 | 1 |
| UserID | U1 | 1 | 1 |
| UserID | U2 | 0 | 1 |
| PageID | P1 | 1 | 0 |
| PageID | P2 | 0 | 2 |
![Page 16: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/16.jpg)
Count Featurization Example
16
| Feature | Value | Click | No Click |
| --- | --- | --- | --- |
| AdId | A1 | 1250 | 23751 |
| AdId | A2 | 1482 | 26765 |
| UserID | U1 | 105 | 1523 |
| UserID | U2 | 112 | 1288 |
| PageID | P1 | 1300 | 63700 |
| PageID | P2 | 3692 | 29874 |
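The table updates walked through on slides 12 to 15 can be sketched in Python as follows. This is a minimal illustration, not Pyramid's implementation: the function names `new_count_tables` and `update_count_tables` and the dictionary layout are assumptions made for the example, and no differential privacy noise is added here.

```python
from collections import defaultdict

def new_count_tables(features):
    """One table per feature (hypothetical layout): value -> [clicks, no-clicks]."""
    return {f: defaultdict(lambda: [0, 0]) for f in features}

def update_count_tables(tables, label, observation):
    """Increment the click or no-click count of every feature value observed."""
    column = 0 if label == "Click" else 1
    for feature, value in observation.items():
        tables[feature][value][column] += 1

tables = new_count_tables(["AdId", "UserID", "PageID"])

# The three observations shown on slides 13 to 15.
observations = [
    ("Click",    {"AdId": "A1", "UserID": "U1", "PageID": "P1"}),
    ("No-Click", {"AdId": "A2", "UserID": "U1", "PageID": "P2"}),
    ("No-Click", {"AdId": "A1", "UserID": "U2", "PageID": "P2"}),
]
for label, obs in observations:
    update_count_tables(tables, label, obs)
```

After the three updates, the tables hold the same counts as slide 15; with many more observations they grow into tables like the ones on slide 16.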
![Page 17: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/17.jpg)
Count Featurization Example
17
<AdId:A1, UserID:U2, PageID:P1>
<P(click|AdId=A1), P(click|UserID=U2), P(click|PageID=P1)>
<0.05, 0.08, 0.02>
ML Model

| Feature | Value | Click | No Click |
| --- | --- | --- | --- |
| AdId | A1 | 1250 | 23751 |
| AdId | A2 | 1482 | 26765 |
| UserID | U1 | 105 | 1523 |
| UserID | U2 | 112 | 1288 |
| PageID | P1 | 1300 | 63700 |
| PageID | P2 | 3692 | 29874 |
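The featurization step on this slide replaces each feature value with the click probability estimated from its count table. A minimal sketch, assuming a hypothetical helper `count_featurize` and a fallback of 0.0 for values with no counts (the slides do not say how unseen values are handled):

```python
def count_featurize(tables, observation):
    """Map each feature value to clicks / (clicks + no-clicks)."""
    features = []
    for feature, value in observation.items():
        clicks, no_clicks = tables[feature].get(value, (0, 0))
        total = clicks + no_clicks
        # Assumption: values with no counts get 0.0.
        features.append(clicks / total if total > 0 else 0.0)
    return features

# Count tables from slide 16: value -> (clicks, no-clicks).
tables = {
    "AdId": {"A1": (1250, 23751), "A2": (1482, 26765)},
    "UserID": {"U1": (105, 1523), "U2": (112, 1288)},
    "PageID": {"P1": (1300, 63700), "P2": (3692, 29874)},
}

x = {"AdId": "A1", "UserID": "U2", "PageID": "P1"}
x_prime = [round(p, 2) for p in count_featurize(tables, x)]
print(x_prime)  # [0.05, 0.08, 0.02]
```

The rounded vector matches the slide: 1250/25001 ≈ 0.05, 112/1400 = 0.08, and 1300/65000 = 0.02.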
![Page 18: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/18.jpg)
Architecture
18
[Architecture diagram labels: Models M1, M2, M3, M4; Pyramid; Count Featurization; Differential Privacy; AdID Count Table; UserID Count Table; PageID Count Table; Recent Raw Data; Cold Raw Data Store; Prediction Request: <x>; <x>, <x’>]
![Page 19: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/19.jpg)
Pyramid’s Protections
19
[Figure labels: Differentially Private Count Tables; Recent Raw Data; New Observations; Hot Window; Retention Window; Time; Months or years; Days or weeks]
Most data is offline and not present in Pyramid
![Page 20: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/20.jpg)
Protection Assumptions
● State is not managed out of band
● Models are retrained on request
● State from previous models does not persist
20
![Page 21: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/21.jpg)
Differential Privacy
● Randomizes output to protect privacy
● Privacy budget, ε, shared among queries
● Resilient to auxiliary information
● Resilient to post-processing
21
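As an illustration of the first two bullets, the sketch below adds Laplace noise to a list of counts and splits a total budget ε evenly across them. It is a textbook Laplace mechanism under simple sequential composition, not necessarily how Pyramid allocates its budget. The function names, the even split, and the sensitivity of 1 (each observation changes each count by at most 1) are assumptions for the example.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_counts(true_counts, epsilon, sensitivity=1.0, seed=None):
    """Release noisy counts; the budget epsilon is shared among all queries."""
    rng = random.Random(seed)
    # Assumption: even split, so k queries at epsilon/k each sum to epsilon.
    per_query_epsilon = epsilon / len(true_counts)
    scale = sensitivity / per_query_epsilon
    return [count + laplace_noise(scale, rng) for count in true_counts]

noisy = private_counts([1250, 23751], epsilon=1.0, seed=0)
```

A smaller ε, or more queries sharing the same ε, gives a larger noise scale and therefore noisier counts. Anything computed from `noisy` afterwards stays within the same privacy guarantee, which is the post-processing property listed above.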
![Page 22: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/22.jpg)
Private Count Tables
22
[Figure labels: Add New Observations; Time; Σ]

AdId value A1, per-day private count tables:

| Table | Click | No Click |
| --- | --- | --- |
| Day 1 | 6514.3 | 15432.2 |
| Day 2 | 6670.7 | 16682.3 |
| Day 3 | 6827.2 | 17897.6 |
| Current Day | -10.3 | 45.2 |
| Σ | 20012.2 | 50012.1 |

Other values shown on the slide: 5791.3, 18231.2; 25803.5, 68243.2; 19289.2, 52820.1
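One way to read this slide is that each day has its own noisy count table and the table used for featurization is the sum over the days still retained. The A1 values are consistent with that reading: the three per-day tables add up to the Σ table. A minimal sketch under that assumption follows; the class `WindowedCountTable`, its methods, and the retention length of 4 days are hypothetical.

```python
from collections import deque

class WindowedCountTable:
    """Per-day [click, no-click] counts for one feature value (hypothetical)."""

    def __init__(self, retention_days):
        # deque(maxlen=...) silently drops the oldest day once the window is full.
        self.days = deque(maxlen=retention_days)

    def start_day(self, initial_counts):
        """Open a new day's table, starting from its (assumed precomputed) noise draw."""
        self.days.append(list(initial_counts))

    def add_observation(self, clicked):
        """Add a new observation to the current day's table."""
        self.days[-1][0 if clicked else 1] += 1

    def total(self):
        """Sum the per-day tables across the retained window."""
        return [sum(day[i] for day in self.days) for i in (0, 1)]

table = WindowedCountTable(retention_days=4)
for day in [(6514.3, 15432.2), (6670.7, 16682.3), (6827.2, 17897.6)]:
    table.start_day(day)
three_day_total = [round(v, 1) for v in table.total()]
print(three_day_total)  # [20012.2, 50012.1]

# The current day starts from noise only, then receives new observations.
table.start_day((-10.3, 45.2))
table.add_observation(clicked=True)
```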
![Page 23: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/23.jpg)
Challenges Combining Count Featurization and Differential Privacy
23

| Challenge | Solution |
| --- | --- |
| Support large datasets with large numbers of features | Private Count-Median Sketch |
| Must choose optimal count tables to support future workloads | Feature Combination Selection |
| Some features are more sensitive to differential privacy | Weighted Noise Infusion |
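To make the first solution concrete, here is a minimal sketch of a count-median sketch whose cells start from Laplace noise. It assumes the common variant that updates like a count-min sketch and answers queries with the median across rows. The slides do not specify Pyramid's construction, so the class name, hashing scheme, and noise initialization are all assumptions.

```python
import hashlib
import random
import statistics

def laplace_noise(rng, scale):
    """Laplace(0, scale) sample; a scale of 0 disables noise."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

class PrivateCountMedianSketch:
    """Fixed-size table of depth x width cells, independent of the number of keys."""

    def __init__(self, depth, width, noise_scale, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        # Assumption: every cell is initialized with its own noise draw.
        self.cells = [[laplace_noise(rng, noise_scale) for _ in range(width)]
                      for _ in range(depth)]

    def _bucket(self, row, key):
        """Hash a key to one column per row (one hash function per row)."""
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.cells[row][self._bucket(row, key)] += count

    def estimate(self, key):
        """Median across rows; it tolerates noise that can be negative."""
        return statistics.median(
            self.cells[row][self._bucket(row, key)] for row in range(self.depth)
        )

sketch = PrivateCountMedianSketch(depth=5, width=64, noise_scale=0)
sketch.add("A1", 5)
```

Memory stays at `depth * width` cells however many distinct feature values arrive, which is what makes a sketch attractive for large datasets with many features. The trade-off is that hash collisions and noise both perturb the estimates.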
![Page 24: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/24.jpg)
Outline
24
● Motivation
● Design
● Evaluation
● Conclusions
![Page 25: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/25.jpg)
Evaluation Datasets
25
● Criteo
  ○ Ad click/no-click prediction
  ○ Estimating probability of a click
  ○ 45 million points w/ 39 features
● Movielens
  ○ Movie rating prediction
  ○ Estimate probability a user will rate a movie highly
  ○ 22 million ratings, 34K movies, 240K users
![Page 26: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/26.jpg)
Criteo: Training on just 0.4% of the data leads to only 3.1% loss in accuracy
26
[Plot: Normalized Logistic Loss (lower is better; y=1: baseline model trained on entire data) vs. Fraction of the raw data used for training (hence exposed); x-axis ticks: 100%, 10%, 1%, 0.1%, 0.01%; annotated point: 0.4%, 1.031]
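The y-axis of this plot is a ratio: the logistic loss of the model trained on a fraction of the raw data, divided by the loss of the baseline trained on all of it, so the baseline sits at 1 and the annotated 1.031 means a 3.1% higher loss. A minimal sketch of that metric follows. The function names and the toy labels and probabilities are illustrative assumptions, not data from the evaluation.

```python
import math

def logistic_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def normalized_loss(labels, model_probs, baseline_probs):
    """Loss relative to the baseline; 1.0 means identical to the baseline."""
    return logistic_loss(labels, model_probs) / logistic_loss(labels, baseline_probs)

# Toy example (made-up numbers): a slightly less confident model.
labels = [1, 0]
baseline_probs = [0.9, 0.1]
model_probs = [0.8, 0.2]
ratio = normalized_loss(labels, model_probs, baseline_probs)
```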
![Page 27: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/27.jpg)
Movielens: Training on just 0.8% of the data leads to only 5.4% loss in accuracy
27
[Plot: Normalized Logistic Loss (lower is better; y=1: baseline model trained on entire data) vs. Fraction of the raw data used for training (hence exposed); x-axis ticks: 100%, 10%, 1%, 0.1%, 0.01%; annotated point: 0.8%, 1.054]
![Page 28: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/28.jpg)
Conclusions
28
● Data collection and wide access increase exposure risks
● Selective data systems minimize in-use data and separate it from unused data
  ○ Training set minimization is a productive way to think about selectivity
● Pyramid retrofits count featurization for protection with differential privacy
  ○ Reduces exposure 2 orders of magnitude
[Figure labels: unused data (tightly protected); in-use data (wide access)]
![Page 29: Enhancing Selectivity Pyramid in Big Data Protectionriley/pdfs/PyramidTalk.pdfEnhancing Selectivity in Big Data Protection Mathias Lécuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang,](https://reader036.vdocuments.net/reader036/viewer/2022081607/5f01cd1e7e708231d4011a98/html5/thumbnails/29.jpg)
Limitations and Future Work
● Pyramid applicability:
  ○ Works well for classification problems
  ○ Most effective for categorical features
  ○ Supports some but not all workload evolutions
● Future: extend applicability by retrofitting other training set minimization mechanisms for protection
  ○ Vector quantization: can support continuous features
  ○ Sampling and herding: can support unsupervised tasks
  ○ Active learning: can permit selective data collection
29