![Page 1: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/1.jpg)
Data-Mining on GBytes of Encrypted Data
Valeria Nikolaenko
In collaboration with Dan Boneh (Stanford); Udi Weinsberg, Stratis Ioannidis, Marc Joye, Nina Taft (Technicolor).
![Page 2: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/2.jpg)
Outline
• Motivation
• Background on cryptographic tools
• Linear regression
• Our solution
• Experiments and performance
![Page 3: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/3.jpg)
Motivation
Users → Data → Data mining engine → Data model
Privacy concern!
Engine learns nothing more than the model!
![Page 4: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/4.jpg)
Data Mining
• Classification
• Regression
• Clustering
• Summarization
• Dependency modeling
Algorithms highlighted: linear regression (regression), matrix factorization.
Main challenge: make these algorithms privacy preserving and efficient.
![Page 5: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/5.jpg)
Contribution
• Design of a practical system for privacy preserving linear regression
• Implementation
• Experiments on real datasets
Comparison to state of the art (prior work vs. ours):
• Hall et al.’11: 2 days vs 3 min
• Graepel et al.’12: 10 min vs 2 sec
![Page 6: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/6.jpg)
![Page 7: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/7.jpg)
Computations on Encrypted Data
• 2009, C. Gentry – FHE (slow for our problems)
• 1979, A. Shamir; 1988, BGW – Secret sharing (huge communication overhead)
• 1982, A.C. Yao – Garbled circuits
• Our approach: hybrid of Yao and homomorphic encryption
![Page 8: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/8.jpg)
Yao’s Garbled Circuits
[Diagram: a circuit C(x) is turned into a “garbled circuit”. Each input wire $i = 1, \dots, n$ carries a bit (0 or 1) and has a pair of garbled values $G^0_i$, $G^1_i$. Example input: x = 010...1.]
Nothing is leaked about x, other than C(x)!
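To make the idea of garbled values concrete, here is a minimal sketch of garbling and evaluating a single AND gate. It is a toy illustration, not the construction used in the talk: the names `garble_and_gate` and `evaluate`, the use of SHA-256 as a key-derivation stand-in, and the zero-byte tag for recognizing the correct row are all assumptions made for this example.

```python
import hashlib
import random
import secrets

LABEL_BYTES = 16
TAG = b"\x00" * 8  # redundancy that lets the evaluator recognize the right row


def _pad(label_a: bytes, label_b: bytes) -> bytes:
    """Derive a one-time pad from two wire labels (SHA-256 as a stand-in KDF)."""
    return hashlib.sha256(label_a + label_b).digest()[: LABEL_BYTES + len(TAG)]


def _xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))


def garble_and_gate():
    """Return (labels, table); labels[wire][bit] is the garbled value for that bit."""
    labels = {
        wire: (secrets.token_bytes(LABEL_BYTES), secrets.token_bytes(LABEL_BYTES))
        for wire in ("a", "b", "out")
    }
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = labels["out"][bit_a & bit_b]
            pad = _pad(labels["a"][bit_a], labels["b"][bit_b])
            table.append(_xor(pad, out_label + TAG))
    random.SystemRandom().shuffle(table)  # hide which row belongs to which inputs
    return labels, table


def evaluate(table, label_a: bytes, label_b: bytes) -> bytes:
    """Given one label per input wire, recover the matching output label."""
    pad = _pad(label_a, label_b)
    for row in table:
        plain = _xor(pad, row)
        if plain.endswith(TAG):
            return plain[:LABEL_BYTES]
    raise ValueError("no row decrypted correctly")
```

The evaluator sees only one random-looking label per wire, so it can compute the output label without learning the input bits; the party that created the labels can map the output label back to 0 or 1.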
![Page 9: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/9.jpg)
Data Mining System Architecture
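Parties: User i, Evaluator, Crypto Service Provider (CSP)
• CSP: creates the “garbled circuit” Ĉ together with the garbled input values $\{G^0_1, G^0_2, \dots, G^0_n\}$ and $\{G^1_1, G^1_2, \dots, G^1_n\}$, and sends Ĉ to the Evaluator
• User i: holds the input $x_i$, for which the Evaluator obtains the garbled value $G^{x_i}_i$
• Evaluator: holds Ĉ and $\{G^{x_1}_1, G^{x_2}_2, \dots, G^{x_n}_n\}$, and computes $C(x_1, \dots, x_n)$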
![Page 10: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/10.jpg)
System Properties
✓ Evaluator learns the model, not the inputs
Problems:
• Not scalable with the number of users
• Users need to be online
![Page 11: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/11.jpg)
![Page 12: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/12.jpg)
Linear Regression I
• (f, r) – from users
• solve for β: $A\beta = b$
Model: $r = \langle f, \beta \rangle$
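As a hedged sketch of where the system $A\beta = b$ comes from, assume ordinary least squares over n user records $(f_i, r_i)$, with each $f_i$ treated as a row vector. The slides do not state the objective explicitly, and a regularized variant would add a term to A.

```latex
\begin{aligned}
\hat\beta &= \arg\min_{\beta} \sum_{i=1}^{n} \bigl(r_i - \langle f_i, \beta \rangle\bigr)^2 \\
0 &= \sum_{i=1}^{n} f_i^T \bigl(f_i \beta - r_i\bigr)
  && \text{(gradient set to zero, up to a factor of 2)} \\
\underbrace{\Bigl(\sum_{i=1}^{n} f_i^T f_i\Bigr)}_{A}\,\beta
  &= \underbrace{\sum_{i=1}^{n} f_i^T r_i}_{b}
\end{aligned}
```

Both A and b are sums of per-user terms, which is what makes the problem a good fit for aggregation before solving.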
![Page 13: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/13.jpg)
Linear Regression II
Users $(f_1, r_1), \dots, (f_n, r_n)$ → Training engine → β
Training engine learns nothing about f’s and r’s, other than β!
![Page 14: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/14.jpg)
![Page 15: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/15.jpg)
Construct Matrices
Phase 1: compute $A = \sum_i f_i^T f_i$ and $b = \sum_i f_i^T r_i$
• Users: user i contributes $f_i^T f_i$, for i = 1, …, n
• Evaluator: computes $A = \sum_i f_i^T f_i$ in the circuit
For additions can use homomorphic encryption:
$[HE(A_1), HE(A_2)] \to HE(A_1 + A_2)$
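For reference, Phase 1 on plaintext data is just an accumulation of per-user terms. The sketch below shows that accumulation in the clear, with no cryptography involved; the function name `accumulate` and the record layout (a feature list and a scalar response per user) are assumptions made for illustration.

```python
def accumulate(records):
    """Sum the per-user terms f^T f and f^T r over all (f, r) records.

    records: list of (f, r) pairs, where f is a list of d numbers and r is a number.
    Returns (A, b) with A a d x d matrix (list of rows) and b a length-d list.
    """
    d = len(records[0][0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for f, r in records:
        for j in range(d):
            b[j] += f[j] * r              # j-th entry of f^T r
            for k in range(d):
                A[j][k] += f[j] * f[k]    # (j, k) entry of the outer product f^T f
    return A, b


# Two users, two features each.
A, b = accumulate([([1.0, 2.0], 3.0), ([0.0, 1.0], 1.0)])
```

Only additions are applied across users, which is why an additively homomorphic scheme is enough for this phase.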
![Page 16: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/16.jpg)
Construct Matrices
Phase 1: compute $A = \sum_i f_i^T f_i$ and $b = \sum_i f_i^T r_i$
• Users: user i sends $HE(f_i^T f_i)$, for i = 1, …, n
• Evaluator: computes $HE(A) = HE(\sum_i f_i^T f_i)$, then decrypts in the circuit
For additions can use homomorphic encryption:
$[HE(A_1), HE(A_2)] \to HE(A_1 + A_2)$
Circuit – independent of the number of users
Users – offline
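The rule $[HE(A_1), HE(A_2)] \to HE(A_1 + A_2)$ can be illustrated with a toy additively homomorphic scheme. The slides do not name the scheme, so Paillier is used here purely as an example; the fixed small primes, the integer-valued features, and all function names are assumptions, and the parameters are far too small for real use.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier keys from two fixed Mersenne primes (insecure, demo only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)              # inverse of phi modulo n (Python 3.8+)
    return n, (phi, mu)               # public key n, secret key (phi, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n, using generator g = n + 1."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) * pow(r, n, n2)) % n2


def he_add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, sk, c):
    phi, mu = sk
    u = pow(c, phi, n * n)
    return ((u - 1) // n) * mu % n


def encrypt_outer(n, f):
    """User side: entry-wise encryption of the outer product f^T f."""
    return [[encrypt(n, x * y) for y in f] for x in f]


def aggregate(n, encrypted_matrices):
    """Evaluator side: entry-wise homomorphic sum, without any secret key."""
    total = encrypted_matrices[0]
    for matrix in encrypted_matrices[1:]:
        total = [[he_add(n, a, b) for a, b in zip(row_t, row_m)]
                 for row_t, row_m in zip(total, matrix)]
    return total
```

In this sketch the aggregation cost grows with the number of users, but it happens outside any circuit and users only need to upload their ciphertexts once, matching the two properties listed on the slide.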
![Page 17: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/17.jpg)
Solve Linear System
Phase 2: solve $A\beta = b$ for β
• Input: HE(A), HE(b)
• Decrypt A and b
• Cholesky: compute L, s.t. $A = LL^T$
• Solve for β: $L(L^T\beta) = b$
• Output: β
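The Cholesky steps above can be sketched in the clear as follows. This is plain floating-point Python for illustration only; inside a garbled circuit the same arithmetic would have to be expressed over circuit-friendly number representations, which this sketch does not attempt. The function names are assumptions.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L L^T, for symmetric positive-definite A."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve(A, b):
    """Solve A beta = b via L (L^T beta) = b."""
    L = cholesky(A)
    d = len(b)
    # Forward substitution: L y = b
    y = [0.0] * d
    for i in range(d):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T beta = y
    beta = [0.0] * d
    for i in reversed(range(d)):
        beta[i] = (y[i] - sum(L[k][i] * beta[k] for k in range(i + 1, d))) / L[i][i]
    return beta


beta = solve([[4.0, 2.0], [2.0, 3.0]], [10.0, 8.0])
```

The work depends only on the number of features d, not on the number of users, which is consistent with the circuit being independent of the number of users.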
![Page 18: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/18.jpg)
Privacy Preserving Regression System
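Parties: Users, Evaluator, Crypto Service Provider (CSP)
Setup (CSP):
• (pk, sk) ← HE.KeyGen; pk is given to the users and the Evaluator
• Create “garbled circuit” Ĉ
Phase 1:
• User i sends $HE_{pk}(f_i^T f_i)$ and $HE_{pk}(f_i^T r_i)$ to the Evaluator
• Evaluator computes $HE_{pk}(A = \sum_i f_i^T f_i)$ and $HE_{pk}(b = \sum_i f_i^T r_i)$
Phase 2:
• CSP sends Ĉ to the Evaluator
• Evaluator obtains the garbled inputs, chosen from $\{G^0_1, G^0_2, \dots, G^0_n\}$ and $\{G^1_1, G^1_2, \dots, G^1_n\}$
• Evaluator computes $C(HE_{pk}(A), HE_{pk}(b))$ and obtains β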
![Page 19: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/19.jpg)
System Properties
✓ Evaluator learns the model, not the inputs
✓ Scalable with the number of users
✓ Users can be offline
Extensions:
• Masking instead of decryption in circuit
• Protection against malicious Evaluator and CSP
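The slide does not spell out the masking extension. One plausible reading, stated here as an assumption, is that a value is blinded with a random additive mask before it is decrypted outside the circuit, and the circuit then only has to subtract the mask, which is much cheaper than decrypting. The arithmetic of that idea is sketched below; `MODULUS`, `mask`, and `unmask` are hypothetical names, and the homomorphic addition of the mask is replaced by plain modular addition.

```python
import secrets

MODULUS = 2**64  # hypothetical plaintext modulus, chosen only for this example


def mask(value, modulus=MODULUS):
    """Blind a value with a uniformly random additive mask."""
    m = secrets.randbelow(modulus)
    return (value + m) % modulus, m


def unmask(masked, m, modulus=MODULUS):
    """Remove the mask; this subtraction is the only step left for the circuit."""
    return (masked - m) % modulus


masked, m = mask(12345)
recovered = unmask(masked, m)
```

Because the mask is uniform over the modulus, the masked value on its own reveals nothing about the original value.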
![Page 20: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/20.jpg)
![Page 21: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/21.jpg)
Performance
• Hybrid vs. Pure-Yao: 100 times improvement in time!
• For up to 20 features, thousands of users:
– time < 3 min
– communication < 1 GB
• For 100 million users, 20 features: 8.75 hours
• Tested on real datasets
![Page 22: Data-Mining on GBytes of Encrypted Data · Data Mining • Classification • Regression • Clustering • Summarization: linear regression: matrix factorization • Dependency modeling](https://reader033.vdocuments.net/reader033/viewer/2022050423/5f91d022d8fba6464e4d6f62/html5/thumbnails/22.jpg)
Conclusion
• Privacy-preserving data-mining is efficient
• Our approach can be used in practice
• Current work: matrix factorization
• Future work:
– implement other data mining algorithms
– improve the implementation to support high parallelization