Privacy-preserving Efficient Subset of Features Selection for Regression Models

N. Gama, M. Georgieva

December 10, 2018

GWAS (find the best additional feature)

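[Figure: data matrix with one row per patient (Patient 1 to Patient n) and columns for the intercept, the covariates (weight, gender), the target, and m > 10000 candidate features.]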

Question: Is the new feature important?
Naive method: compute $\mathrm{stat}_i$ for each $i$, which means computing more than 10000 logregs.

Description of iDASH 2018 Task 2

Goal: Develop a secure parallel outsourcing solution to compute Genome Wide Association Studies (GWAS) based on linear/logistic regression using homomorphically encrypted data.

Challenge (informally): Currently: a logreg model, 250 patients, with 3 or 4 physical features. Which new feature, among 10000 possible genes (SNP), would improve the model?

Semi-parallel approach: Don’t do 10000 logregs...

Logreg, IRLS, relevance of a feature

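[Figure: patient data matrix with columns for the intercept, the covariates (weight, gender) and the target.]

Single logistic regression: find $\theta$ such that $Y = \mathrm{sign}(X\theta)$

IRLS:
- Compute $\mathrm{grad} = X^t (Y - p)$, with $p = \sigma(X\theta)$
- Compute $\mathrm{Hessian} = X^t \, \mathrm{diag}(p(1 - p)) \, X$

Importance of the $i$th feature:
- the $i$th coefficient is big: $\theta_i$ (numerator)
- the $i$th error term is small: $(\mathrm{Hess}^{-1})_{i,i}$ (denominator)
- stat = ratio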
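The naive method can be sketched in a few lines of NumPy: run Newton/IRLS iterations with the gradient and Hessian given above, then form one statistic per coefficient. This is only an illustration in the clear, not the encrypted algorithm. The function names, the fixed iteration count, the 0/1 encoding of the target, and the choice of dividing by the square root of $(\mathrm{Hess}^{-1})_{i,i}$ (a Wald-style score) are assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_and_hessian(X, y, theta):
    """Gradient X^t (y - p) and Hessian X^t diag(p(1-p)) X at theta."""
    p = sigmoid(X @ theta)
    grad = X.T @ (y - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    return grad, hess

def fit_logreg_irls(X, y, iters=25):
    """Plain Newton/IRLS iterations; y is assumed to hold 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad, hess = grad_and_hessian(X, y, theta)
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def coef_stats(X, y, theta):
    """Assumed statistic: theta_i over the square root of (Hess^-1)_{i,i}."""
    _, hess = grad_and_hessian(X, y, theta)
    return theta / np.sqrt(np.diag(np.linalg.inv(hess)))
```

Scoring m candidate features this way would require calling `fit_logreg_irls` m times, once per extended matrix, which is the cost the semi-parallel approach avoids.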

Semi-parallel GWAS (high level idea)

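Semi-parallel GWAS (optimized)
1. Do logreg(X, y) without S
2. Once the model has converged, add $s_i$

Gradient: the new component is $\langle s_i, Y - p \rangle$. These can be batch-computed for all $i$: $S^t (Y - p)$

Hessian: [Figure: the new Hessian, made of the old Hessian block ("Old Hess") bordered by new entries weighted by $p(1 - p)$.]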
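A minimal NumPy sketch of this idea, in the clear: given the coefficients `theta` of the model already fitted on X alone, all gradient components come from one matrix product, and the bordered Hessian of each candidate is handled through its Schur complement. The function name, the Schur-complement formulation and the final score (one Newton step for the new coefficient divided by the square root of its inverse-Hessian entry) are assumptions for illustration. The slides only state that the gradient is batch-computed and that the old Hessian is reused.

```python
import numpy as np

def semi_parallel_stats(X, y, S, theta):
    """Score every column s_i of S against a model fitted on X only.

    Assumes theta has converged, so that X^t (y - p) is (close to) zero.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    w = p * (1 - p)
    g = S.T @ (y - p)                       # all <s_i, y - p> in one product
    H = X.T @ (X * w[:, None])              # old Hessian, shared by every i
    B = X.T @ (S * w[:, None])              # border columns X^t diag(w) s_i
    c = np.sum(S * S * w[:, None], axis=0)  # corner entries s_i^t diag(w) s_i
    # Schur complement of the old Hessian in each bordered Hessian
    d = c - np.sum(B * np.linalg.solve(H, B), axis=0)
    return g / np.sqrt(d)
```

Under these assumptions, `stats[i]` matches what one Newton step on the model extended with column i would give when started from a zero coefficient, so only the initial fit on X needs to iterate.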

MPC versus FHE

FHE
- Long-term storage
- Unique cloud
- Slower and consumes more memory

MPC
- Faster than FHE
- More accuracy
- All data owners must participate

Fixed points versus Floating point

Floating point:
- $x = m \cdot 2^{\tau}$, with $m \in 2^{-\rho}\mathbb{Z}$ and $\frac{1}{2} \le |m| < 1$
- $\tau = \lceil \log_2(x) \rceil$ is data dependent and not public (not FHE-friendly)
- The exponent is always in sync with the data
- ex: $(1.23 \cdot 10^{-4}) \times (7.24 \cdot 10^{-4}) = (8.90 \cdot 10^{-8})$

Fixed point:
- $x = m \cdot 2^{\tau}$, with $m \in 2^{-\rho}\mathbb{Z}$ and $0 \le |m| < 1$
- $\tau$ is public, thus FHE-friendly
- Risk of overflow ($\tau$ too small)
- Risk of underflow ($\tau$ too large)
- ex: $(0.000123 \cdot 10^{0}) \times (0.000724 \cdot 10^{0}) = (0.000000 \cdot 10^{0})$

Plaintext parameters:
- $\rho \in \mathbb{N}$: bits of precision of the plaintext ($\approx 15$ bits)
- $\tau \in \mathbb{Z}$: slot exponent (order of magnitude of the complex values in each slot)
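The overflow and underflow risks can be reproduced with a small plaintext model of the fixed-point encoding. The helper `to_fixed` is hypothetical and only mimics the representation $x = m \cdot 2^{\tau}$ with $\rho$ bits of precision. It is not part of any FHE library, and the values of $\tau$ used below are chosen for illustration.

```python
def to_fixed(x, tau, rho=15):
    """Round x to m * 2**tau, with m a multiple of 2**-rho and |m| < 1."""
    m = round(x / 2.0**tau * 2**rho) / 2**rho
    if abs(m) >= 1:
        raise OverflowError(f"|x| = {abs(x)} does not fit slot exponent tau = {tau}")
    return m * 2.0**tau

product = 0.000123 * 0.000724

# tau too large: every significant bit of the product is lost (underflow).
print(to_fixed(product, tau=0))      # 0.0

# A slot exponent matched to the magnitude of the product keeps about rho bits.
print(to_fixed(product, tau=-23))

# tau too small: the value no longer fits in |m| < 1 (overflow).
try:
    to_fixed(40.0, tau=5)
except OverflowError as err:
    print("overflow:", err)
```

Since $\tau$ is public and fixed in advance, it has to be chosen from the expected range of each intermediate value rather than from the data itself.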

Choice of slot exponent

The slot exponent τ that defines the plaintext interval must be carefully estimated.

variable       avg         stdev       min         max
p              0.440816    0.0975715   0.176397    0.853487
w              0.236977    0.0201871   0.125047    0.25
z*_i           -3.33092    7.36068     -30.9426    31.2008
G              0.0577846   0.0953495   -0.011997   0.236977
A              0.0621965   0.301255    -0.317312   2.236
(s*_i)^2       2.44243     4.11085     0.111961    14.5044
log(stat_i)    0.200039    1.84459     -13.7207    4.36158
p-value        0.310218    0.24083     0           0.999163

[Figure: the "dist" column of the table showed one histogram per variable (p.histo, w.histo, zStar.histo, G.histo, A.histo, sStar2.histo, ri.histo, pval.histo).]
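As a rough sketch of how such statistics could be turned into slot exponents, the smallest $\tau$ for which a bound satisfies $|x| < 2^{\tau}$ can be read off with `math.frexp`, which returns the same normalized mantissa convention ($\frac{1}{2} \le |m| < 1$) as the floating-point representation described earlier. Using the observed maxima directly, with no safety margin, is an assumption made here for illustration. The slides only say that $\tau$ must be carefully estimated.

```python
import math

def slot_exponent(max_abs):
    """Smallest tau such that max_abs = m * 2**tau with |m| < 1."""
    _, tau = math.frexp(max_abs)
    return tau

# Largest absolute values taken from the min/max columns of the table.
observed_max = {
    "p": 0.853487,
    "w": 0.25,
    "z*_i": 31.2008,
    "(s*_i)^2": 14.5044,
}

for name, bound in observed_max.items():
    print(name, slot_exponent(bound))
```

A value such as `w`, whose maximum is exactly 0.25, shows why the bound matters: $\tau = -2$ would require $m = 1$, which is outside the allowed interval, so the sketch returns $\tau = -1$.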


Numerical stability

Not stable
- Increase the precision of the algorithm, but that implies bigger parameters.

Stable
- Use stable computation with negative feedback (e.g. gradient descent)
- Smaller parameters
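The effect of negative feedback can be seen on a toy one-dimensional problem: gradient descent on $(\theta - \mathrm{target})^2 / 2$ where every update is rounded to $\rho$ bits, as a crude stand-in for limited precision. The quadratic objective, the step size and the rounding model are all assumptions chosen for illustration. This is not the algorithm of the slides.

```python
def quantize(x, rho=15):
    """Round x to the nearest multiple of 2**-rho (models limited precision)."""
    return round(x * 2**rho) / 2**rho

def quantized_gradient_descent(target, steps=100, lr=0.5, rho=15):
    """Minimize (theta - target)**2 / 2 with a rounding error at every step."""
    theta = 0.0
    for _ in range(steps):
        grad = theta - target
        theta = quantize(theta - lr * grad, rho)
    return theta

result = quantized_gradient_descent(0.3)
print(abs(result - 0.3))   # stays on the order of 2**-15, despite 100 roundings
```

Each iteration halves the current error before adding a new rounding error, so the errors do not accumulate over the 100 steps and the precision $\rho$ does not need to grow with the number of iterations.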

FHE Solution

FHE parameters:
- $L \in \mathbb{N}$: level exponent of the ciphertext ($\alpha = 2^{-(L+\rho)}$: noise rate)
- $N = f(\lambda, \alpha)$: key size, with $\lambda$ the security parameter

The lwe-estimator script was used to assess the security (conforming to the HE security standardization white paper).


Plaintext algorithm in FHE solution

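[Figure: patient data matrix with columns for the intercept, the covariates (weight, gender), the target, and m > 10000 candidate features.]

Input:
- $X \in M_{n,k+1}(\mathbb{R})$: input matrix
- $y \in \mathbb{B}^n$: binary vector
- $S \in M_{n,m}(\mathbb{R})$: assumed binary

Output:
- $\mathrm{stat} \in \mathbb{R}^m$, with $\mathrm{stat}_i = \dfrac{z^*_i}{\sqrt{(s^*_i)^2}}$

Key points of our solution:
- Make the plaintext algorithm FHE friendly
- Use hybrid homomorphic encryption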


Optimization of plaintext algorithm

Make the plaintext algorithm FHE friendly:
- Find simple geometric equivalents of the formula
- Find approximations with lower multiplicative depth
- Replace feature scaling of X with orthogonalization

Algorithm in plaintext

[Figure: listing of the plaintext algorithm, annotated step by step with the kinds of operations it contains.]

- continuous non-polynomial functions (approximate numbers, or lookup tables)
- for loops (better with fast bootstrapping)
- individual non-linear operations in small dimension (lookup tables)
- multiplication with fresh ciphertexts (better with TFHE’s external product)
- continuous function batched on a large vector, in very large dimension (fully packed SIMD)

Which fully homomorphic scheme should we choose?

Each library has its own strengths

Strengths of HE libraries:
- BGV/HElib: SIMD finite field arithmetic
- B/FV, SEAL: SIMD vector mod t
- HEAAN: SIMD fixed point arithmetic
- TFHE: single evaluation, boolean logic, comparison, threshold, complex circuits
- etc...

How to get all the benefits without the limitations?

Solution: Chimera

Idea:
- Unified plaintext space over the Torus
- Switch between ciphertext representations
- Implement bridges between TFHE, B/FV and HEAAN

For this use-case: we use the switch between TFHE and HEAAN!


Chimera solution

1. Initial logreg on matrix X and vector y: adapt lib TFHE + logreg
2. Mass linear algebra computations: implement Chimera (version 2 of TFHE)
3. Batch logarithm computation: adapt lib HEAAN

Benchmarks (Idash Bootstrapped)

Steps               Timing (4 cores)   Timing (96 cores)   RAM
KeyGen              5.5 mins           2.0 mins            4.4 GB
Encryption          7.2 mins           1.3 mins            8.6 GB
Cloud Computation   3h06               10.2 mins           7.8 GB

Input ciphertext: 5GB (enc X, y, S)
Final ciphertext: 640KB (enc numerator + denominator)

Benchmarks (with new optimizations)

k = 3, n = 250, m = 10000

Steps               Timing (4 cores)   Timing (96 cores)   RAM
KeyGen              5.5 mins           2.0 mins            4.4 GB
Encryption          7.2 mins           1.3 mins            8.6 GB
Cloud Computation   35 mins            3 mins              7.8 GB

k = 7, n = 250, m = 10000

Steps               Timing (4 cores)   Timing (96 cores)   RAM
KeyGen              5.5 mins           2.0 mins            4.4 GB
Encryption          7.2 mins           1.3 mins            8.6 GB
Cloud Computation   41 mins            3.1 mins            7.8 GB

Initial ciphertext: 5GB (enc X, y, S)
Final ciphertext: 640KB (enc numerator + denominator)

Numerical Accuracy (FHE has noise)

[Figure: scatter plot of actual vs. computed values against the line y = x, with axis ticks from -10 to 10.]

Questions?
