Privacy-preserving Efficient Subset of Features Selection for Regression Models
N. Gama, M. Georgieva
December 10, 2018
GWAS (find the best additional feature)
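[Figure: data matrix with one row per patient (Patient 1 to Patient n). Columns: covariates X (intercept, age, weight, gender), target Y, and the SNP matrix S whose columns are s_i (SNP_i), with m > 10000.]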
Question: Is the new feature important?
Naive method: compute stat_i for each i, which means computing more than 10000 logistic regressions.
Description of Idash 2018 Task 2
Goal:
Develop a secure parallel outsourcing solution to compute Genome Wide Association Studies (GWAS) based on linear/logistic regression using homomorphically encrypted data.

Challenge (informally):
Currently: a logreg model, 250 patients, with 3 or 4 physical features. Which new feature, among 10000 possible genes (SNP), would improve the model?

Semi-parallel approach:
Don't do 10000 logregs...
Logreg, IRLS, relevance of a feature
[Figure: data matrix with one row per patient (Patient 1 to Patient n). Columns: covariates X (intercept, age, weight, gender) and target Y.]
Single logistic regression:
Find $\theta$ s.t. $Y = \mathrm{sign}(X\theta)$

IRLS:
- Compute $\mathrm{grad} = X^T (Y - p)$, with $p = \sigma(X\theta)$
- Compute $\mathrm{Hessian} = X^T \mathrm{diag}(p(1 - p)) X$
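The two IRLS quantities above can be sketched in plaintext NumPy as follows. This is only an illustration under assumptions that are not in the slides: the target is coded in {0, 1} (so that Y − p makes sense), the data is synthetic, and the names `irls_logreg` and `true_theta` are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls_logreg(X, y, n_iter=25, tol=1e-10):
    """Plaintext Newton/IRLS for logistic regression, y coded in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                        # X^T (Y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])   # X^T diag(p(1-p)) X
        step = np.linalg.solve(hess, grad)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta, hess

# Synthetic data (assumption): 250 patients, intercept + 3 covariates.
rng = np.random.default_rng(0)
n = 250
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_theta = np.array([-0.3, 0.8, -0.5, 0.2])
y = (rng.random(n) < sigmoid(X @ true_theta)).astype(float)

theta, hess = irls_logreg(X, y)
grad_at_opt = X.T @ (y - sigmoid(X @ theta))  # should be close to zero
```

At convergence the gradient vanishes, which is the property the semi-parallel approach relies on when a new column is appended.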
Importance of the ith feature:
- the ith coefficient is big: $\theta_i$ (numerator)
- the ith error term is small: $(\mathrm{Hess}^{-1})_{i,i}$ (denominator)

stat = ratio
Semi-parallel GWAS (high level idea)
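Semi-parallel GWAS (optimized)
1 Do logreg(X, y) without S
2 Once the model has converged, add $s_i$

Gradient:
$[X \; s_i]^T (Y - p) = (0, \dots, 0, \langle s_i, Y - p \rangle)$
The last entries can be batch-computed for all i: $S^T (Y - p)$

Hessian:
$[X \; s_i]^T \mathrm{diag}(p(1 - p)) \, [X \; s_i]$, whose top-left block is the old Hessian.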
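A plaintext sketch of how the gradient and Hessian blocks above could be combined for all columns of S at once. The slides do not spell out the formula, so this assumes one Newton step from the converged model: the statistic is the one-step coefficient divided by the square root of the matching diagonal entry of the inverse augmented Hessian. The function names and synthetic data are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logreg(X, y, n_iter=25):
    """Step 1: plain Newton/IRLS on the covariates only (no S)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        theta += np.linalg.solve(hess, X.T @ (y - p))
    return theta

def semi_parallel_stats(X, y, S):
    """Step 2: one statistic per column s_i of S, computed in batch."""
    theta = fit_logreg(X, y)
    p = sigmoid(X @ theta)
    w = p * (1 - p)
    g = S.T @ (y - p)                          # all <s_i, Y - p> at once
    hess = X.T @ (X * w[:, None])              # old Hessian
    B = X.T @ (S * w[:, None])                 # cross blocks X^T W s_i
    d = np.einsum("ni,ni->i", S * w[:, None], S)   # s_i^T W s_i
    # Schur complement: 1 / (last diagonal entry of the inverse augmented Hessian)
    schur = d - np.einsum("ki,ki->i", B, np.linalg.solve(hess, B))
    return g / np.sqrt(schur)

# Synthetic data (assumption): 250 patients, 3 covariates, 20 binary SNPs.
rng = np.random.default_rng(0)
n, k, m = 250, 3, 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = (rng.random(n) < sigmoid(X @ np.array([-0.3, 0.8, -0.5, 0.2]))).astype(float)
S = rng.integers(0, 2, size=(n, m)).astype(float)
stats = semi_parallel_stats(X, y, S)
```

Only one logistic regression is fitted; everything that depends on S reduces to matrix products, which is what makes the batch form attractive.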
MPC versus FHE
FHE
- Long-term storage
- Unique cloud
- Slower and consumes more memory

MPC
- Faster than FHE
- More accurate
- All data owners must participate
Fixed points versus Floating point
Floating point:
- $x = m \cdot 2^{\tau}$, with $m \in 2^{-\rho}\mathbb{Z}$ and $\frac{1}{2} \le |m| < 1$
- $\tau = \lceil \log_2(x) \rceil$ is data dependent and not public (not FHE-friendly)
- The exponent is always in sync with the data
- ex: $(1.23 \cdot 10^{-4}) * (7.24 \cdot 10^{-4}) = (8.90 \cdot 10^{-8})$

Fixed point:
- $x = m \cdot 2^{\tau}$, with $m \in 2^{-\rho}\mathbb{Z}$ and $0 \le |m| < 1$
- $\tau$ is public, thus FHE-friendly
- Risk of overflow ($\tau$ too small)
- Risk of underflow ($\tau$ too large)
- ex: $(0.000123 \cdot 10^{0}) * (0.000724 \cdot 10^{0}) = (0.000000 \cdot 10^{0})$

Plaintext parameters:
- $\rho \in \mathbb{N}$: bits of precision of the plaintext (≈ 15 bits)
- $\tau \in \mathbb{Z}$: slot exponent (order of magnitude of the complex values in each slot)
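The underflow and overflow risks can be reproduced with a small plaintext model of the fixed-point grid. This is a sketch, not the encoding used by any particular library: `to_fixed` is a hypothetical helper that rounds to multiples of 2^(τ−ρ) and rejects values outside (−2^τ, 2^τ), with ρ = 15 taken from the slide.

```python
def to_fixed(x, tau, rho=15):
    """Round x to the grid 2^(tau - rho) * Z, requiring |x| < 2^tau."""
    if abs(x) >= 2.0 ** tau:
        raise OverflowError(f"|{x}| does not fit below 2^{tau}")
    step = 2.0 ** (tau - rho)
    return round(x / step) * step

a, b = 1.23e-4, 7.24e-4

# Exponent too large for the data: the product underflows to zero.
coarse = to_fixed(to_fixed(a, 0) * to_fixed(b, 0), 0)

# Exponents matched to each value's magnitude: the product survives.
fine = to_fixed(to_fixed(a, -12) * to_fixed(b, -10), -23)

# Exponent too small: overflow is detected instead of silently wrapping.
try:
    to_fixed(2.0, 0)
    overflowed = False
except OverflowError:
    overflowed = True
```

With τ = 0 the grid step is 2^−15, far coarser than the product 8.9 · 10^−8, so all precision is lost; with exponents close to the magnitude of each value the relative error stays small.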
Choice of slot exponent
The slot exponent τ that defines the plaintext interval must be carefully estimated.
| variable | avg | stdev | min | max |
|---|---|---|---|---|
| p | 0.440816 | 0.0975715 | 0.176397 | 0.853487 |
| w | 0.236977 | 0.0201871 | 0.125047 | 0.25 |
| $z_i^*$ | -3.33092 | 7.36068 | -30.9426 | 31.2008 |
| G | 0.0577846 | 0.0953495 | -0.011997 | 0.236977 |
| A | 0.0621965 | 0.301255 | -0.317312 | 2.236 |
| $(s_i^*)^2$ | 2.44243 | 4.11085 | 0.111961 | 14.5044 |
| $\log(\mathrm{stat}_i)$ | 0.200039 | 1.84459 | -13.7207 | 4.36158 |
| p-value | 0.310218 | 0.24083 | 0 | 0.999163 |

[The "dist" column of the slide shows one histogram per variable; the plots are not reproduced here.]
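One way to turn observed min/max ranges into a candidate slot exponent is to take the smallest τ such that every value satisfies |x| < 2^τ. The sketch below does this with `math.frexp`, which returns exactly the normalized mantissa and exponent of the floating-point definition. The helper `slot_exponent` and its `margin_bits` parameter are hypothetical; the slides do not say how τ was actually chosen, and the ranges are copied from the table for a few of the variables.

```python
import math

def slot_exponent(max_abs, margin_bits=0):
    """Smallest tau with max_abs < 2^tau, plus an optional safety margin."""
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    _, exponent = math.frexp(max_abs)  # max_abs = m * 2**exponent, 0.5 <= m < 1
    return exponent + margin_bits

# (min, max) pairs taken from the statistics table.
observed = {
    "p": (0.176397, 0.853487),
    "w": (0.125047, 0.25),
    "z*": (-30.9426, 31.2008),
    "A": (-0.317312, 2.236),
    "(s*)^2": (0.111961, 14.5044),
}

taus = {
    name: slot_exponent(max(abs(lo), abs(hi)))
    for name, (lo, hi) in observed.items()
}
```

In practice a margin of one or more bits may be prudent, since the statistics come from sample data and the encrypted inputs are not visible to the party choosing τ.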
Numerical stability
Not stable:
- Increase the precision of the algorithm, but that implies bigger parameters.

Stable:
- Use stable computation with negative feedback (e.g. gradient descent)
- Smaller parameters
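The negative-feedback argument can be illustrated in plaintext: when noise is injected into every gradient evaluation, gradient descent keeps correcting itself and the final error stays at the level of the per-step noise instead of accumulating. The toy quadratic problem, the noise level, and the step size below are assumptions chosen for illustration; they are not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem (assumption): minimize 0.5 * x^T H x - b^T x, i.e. solve H x = b.
H = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])
x_exact = np.linalg.solve(H, b)

noise_level = 1e-3   # stand-in for the noise of an approximate FHE evaluation
eta = 0.5            # step size
x = np.zeros(2)
for _ in range(60):
    noise = rng.uniform(-noise_level, noise_level, size=2)
    grad = H @ x - b + noise   # every gradient evaluation is noisy
    x = x - eta * grad         # the next step corrects earlier errors

error = np.max(np.abs(x - x_exact))
```

Here the error after 60 noisy steps remains on the order of `noise_level`, because the iteration contracts past errors at each step.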
FHE Solution
FHE parameters:
- $L \in \mathbb{N}$: level exponent of the ciphertext ($\alpha = 2^{-(L+\rho)}$: noise rate)
- $N = f(\lambda, \alpha)$: key size, with $\lambda$ the security parameter

The lwe-estimator script was used to assess the security (conforming to the HE security standardization white paper).
Plaintext algorithm in FHE solution
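[Figure: data matrix with one row per patient (Patient 1 to Patient n). Columns: covariates X (intercept, age, weight, gender), target Y, and the SNP matrix S whose columns are s_i (SNP_i), with m > 10000.]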
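Input:
- $X \in M_{n,k+1}(\mathbb{R})$: input matrix
- $y \in \mathbb{B}^n$: binary vector
- $S \in M_{n,m}(\mathbb{R})$: assumed binary

Output:
- $\mathrm{stat} \in \mathbb{R}^m$ with $\mathrm{stat}_i = \dfrac{z_i^*}{\sqrt{s_i^{*2}}}$

Key points of our solution:
- Make the plaintext algorithm FHE friendly
- Use hybrid homomorphic encryption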
Optimization of plaintext algorithm
Make the plaintext algorithm FHE friendly:
- Find simple geometric equivalents of the formula
- Find approximations with lower multiplicative depth
- Replace feature scaling of X with orthogonalization
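As an illustration of trading exactness for multiplicative depth, the sigmoid can be replaced by a low-degree polynomial fitted on a bounded interval. The slides do not say which approximation was used; the degree 3, the interval [−4, 4], and the least-squares fit below are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Least-squares cubic fit on a symmetric grid (interval is an assumption).
grid = np.linspace(-4.0, 4.0, 801)
coeffs = np.polynomial.polynomial.polyfit(grid, sigmoid(grid), deg=3)

def poly_sigmoid(t):
    """Cubic stand-in for the sigmoid; coefficients in ascending order."""
    c0, c1, c2, c3 = coeffs
    t2 = t * t                                 # first value-by-value product
    return c0 + t * (c1 + c2 * t + c3 * t2)    # second value-by-value product

max_err = np.max(np.abs(poly_sigmoid(grid) - sigmoid(grid)))
```

If `t` were encrypted, this form would need only two products between encrypted values (the rest are multiplications by constants), at the cost of an approximation error that is only controlled inside the fitting interval.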
Algorithm in plaintext
[Figure: plaintext algorithm listing, not reproduced. Its parts are annotated as follows.]
- continuous non-polynomial functions (approx numbers, or lookup tables)
- for loops (better with fast bootstrapping)
- individual non-linear operations in small dimension (lookup tables)
- multiplication with fresh ciphertexts (better with TFHE's external product)
- continuous function batched on a large vector, very large dimension (fully packed SIMD)

Which fully homomorphic scheme should we choose?
Each library has its own strengths
Strengths of HE libraries:
- BGV/HElib: SIMD finite field arithmetic
- B/FV, SEAL: SIMD vector mod t
- HEAAN: SIMD fixed point arithmetic
- TFHE: single evaluation, boolean logic, comparison, threshold, complex circuits
- etc.
How to get all the benefits without the limitations?
Solution: Chimera
Idea:
- Unified plaintext space over the Torus
- Switch between ciphertext representations
- Implement bridges between TFHE, B/FV and HEAAN

For this use case:
We use the switch between TFHE and HEAAN!
Chimera solution
1 Initial logreg on matrix X and vector y: adapt lib TFHE + logreg
2 Mass linear algebra computations: implement Chimera (version 2 of TFHE)
3 Batch logarithm computation: adapt lib HEAAN
Benchmarks (Idash Bootstrapped)
| Steps | Timing (4 cores) | Timing (96 cores) | RAM |
|---|---|---|---|
| KeyGen | 5.5 mins | 2.0 mins | 4.4 GB |
| Encryption | 7.2 mins | 1.3 mins | 8.6 GB |
| Cloud Computation | 3h06 | 10.2 mins | 7.8 GB |

Input ciphertext: 5 GB (enc X, y, S)
Final ciphertext: 640 KB (enc numerator + denominator)
Benchmarks (with new optimizations)

k = 3, n = 250, m = 10000

| Steps | Timing (4 cores) | Timing (96 cores) | RAM |
|---|---|---|---|
| KeyGen | 5.5 mins | 2.0 mins | 4.4 GB |
| Encryption | 7.2 mins | 1.3 mins | 8.6 GB |
| Cloud Computation | 35 mins | 3 mins | 7.8 GB |

k = 7, n = 250, m = 10000

| Steps | Timing (4 cores) | Timing (96 cores) | RAM |
|---|---|---|---|
| KeyGen | 5.5 mins | 2.0 mins | 4.4 GB |
| Encryption | 7.2 mins | 1.3 mins | 8.6 GB |
| Cloud Computation | 41 mins | 3.1 mins | 7.8 GB |

Initial ciphertext: 5 GB (enc X, y, S)
Final ciphertext: 640 KB (enc numerator + denominator)
Numerical Accuracy (FHE has noise)
[Figure: scatter plot of actual vs. computed values, drawn against the line y = x; both axes range from -10 to 10.]
Questions?