integrating data for analysis, anonymization, and sharing supported by the nih grant u54hl108460 to...



integrating Data for Analysis, Anonymization, and SHaring

Supported by the NIH Grant U54HL108460 to the University of California, San Diego

Shuang Wang,1 Xiaoqian Jiang,1 Yuan Wu,1 Lijuan Cui,2 Samuel Cheng,2 and Lucila Ohno-Machado1

EXpectation Propagation LOgistic REgRession (EXPLORER): distributed privacy-preserving online model learning

1 Division of Biomedical Informatics, University of California–San Diego, La Jolla, California, USA
2 School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, Oklahoma, USA

Introduction EXPLORER framework

Summary of Conclusions

It has been shown over the last decade that data privacy cannot be maintained simply by removing patient identities. Training data held at one institution therefore cannot be exchanged or shared directly with other institutions for the purpose of learning a global logistic regression model. To address this challenge, numerous privacy-preserving distributed frequentist regression models for horizontally partitioned data have been studied, among which the Grid LOgistic REgression (GLORE) model [1] and the Secure Pooled Analysis acRoss K-site (SPARK) protocol [2] are the closest to the method presented here. Despite its simplicity and interpretability, the distributed frequentist logistic regression approach has limitations, as shown in Table 1.

Table 1: Comparing EXPLORER with GLORE and SPARK

References

[1] Wu, Y., Jiang, X., Kim, J., & Ohno-Machado, L. (2012). Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. JAMIA, 19(5), 758-764.
[2] El Emam, K., Samet, S., Arbuckle, L., Tamblyn, R., Earle, C., & Kantarcioglu, M. (2013). A secure distributed logistic regression protocol for the detection of rare adverse drug events. JAMIA, 20(3), 453-461.
[3] Wang, S., Jiang, X., Wu, Y., Cui, L., Cheng, S., & Ohno-Machado, L. (2013). EXpectation Propagation LOgistic REgRession (EXPLORER): Distributed privacy-preserving online model learning. JBI, 46(3), 480-496.

In summary, EXPLORER offers an alternative tool for privacy-preserving distributed statistical learning. We showed empirically on multiple data sets that its results are very similar to those of ordinary logistic regression. These promising results warrant further validation on larger data sets and further refinement of the methodology. The inability to openly share (i.e., transmit) patient data without onerous processes involving pairwise agreements between institutions may significantly slow analyses that could produce important results for healthcare improvement and biomedical research. EXPLORER mitigates this problem by relying on multiparty computation, without the need for extensive retraining of models or reliance on synchronous communication among sites.

[Figure: three EXPLORER sites (site 1, site 2, site 3), each holding a secure random matrix and sensitive information, shown across Step 1, Step 2, and Step 3; the final step is labeled "Recover".]

[Figure: Institution 1, Institution 2, and Institution 3, each holding its own patient data.]

Online learning capability to avoid the need for training on the entire database when a single record is updated.

Model learning from distributed sources without sharing raw data.

Client sites can dynamically shift from online to offline.

Strong privacy protection.

| Method | Privacy protection | Asynchronous communication | Online learning |
|---|---|---|---|
| GLORE or SPARK | ✔ | | |
| EXPLORER | ✔ | ✔ | ✔ |

We developed an EXpectation Propagation LOgistic REgRession (EXPLORER) model for distributed privacy-preserving online learning [3]. The framework provides a high level of protection for sensitive information, because the only information exchanged between the server and the clients is the encrypted posterior distribution of the coefficients. In our experiments, EXPLORER shows the same performance as the traditional frequentist logistic regression model while offering more flexibility in model updating: it can be updated one data point at a time rather than being retrained on the entire data set when new observations are recorded. EXPLORER also supports asynchronous communication, which relieves participants from coordinating with one another and prevents service breakdown caused by absent participants or interrupted communications.
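The exchange of posterior distributions can be illustrated with a small Python sketch. It assumes, for illustration only, that each site's contribution to a single coefficient is a one-dimensional Gaussian message stored in natural parameters. It leaves out encryption and the expectation propagation updates themselves, and `GaussianMessage`, `Server`, and `receive` are hypothetical names rather than part of EXPLORER.

```python
from dataclasses import dataclass, field


@dataclass
class GaussianMessage:
    """1-D Gaussian in natural parameters: precision and precision * mean."""
    precision: float
    precision_mean: float


@dataclass
class Server:
    """Hypothetical server keeping one message per site for one coefficient."""
    prior: GaussianMessage
    messages: dict = field(default_factory=dict)

    def receive(self, site_id, message):
        # Overwriting a site's previous message lets sites report in any
        # order and at any time; no other site has to re-send anything.
        self.messages[site_id] = message

    def posterior(self):
        # A product of Gaussians is Gaussian, and natural parameters add.
        precision = self.prior.precision + sum(
            m.precision for m in self.messages.values())
        precision_mean = self.prior.precision_mean + sum(
            m.precision_mean for m in self.messages.values())
        return precision_mean / precision, 1.0 / precision  # (mean, variance)


server = Server(prior=GaussianMessage(precision=1.0, precision_mean=0.0))
server.receive("site1", GaussianMessage(precision=4.0, precision_mean=8.0))
server.receive("site2", GaussianMessage(precision=5.0, precision_mean=5.0))
mean, variance = server.posterior()
```

Because the combination is a sum, the result does not depend on the order in which sites report, and a site that sends an updated message simply replaces its earlier one. This is the property that makes asynchronous, incremental updating natural in this kind of model.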


Experimental Results

Methodology

Secured Intermediate iNformation Exchange (SINE) protocol
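The protocol's details are not reproduced in this transcript. As a rough illustration of how random masks can hide each site's contribution while still letting an aggregate be recovered, here is a generic additive-masking sketch in Python. It is not the SINE protocol itself: the function names, the modulus, and the use of integer vectors are all assumptions made for the example.

```python
import random

MODULUS = 2**61 - 1  # hypothetical modulus; inputs are assumed to be integers below it


def mask_values(site_values, seed=0):
    """Return masked vectors whose element-wise sum equals the true sum (mod MODULUS)."""
    # random.Random stands in for pairwise secrets agreed between sites;
    # it is NOT cryptographically secure and is used only for illustration.
    rng = random.Random(seed)
    n_sites = len(site_values)
    dim = len(site_values[0])
    masked = [list(v) for v in site_values]
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            pair_mask = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                # Site i adds the shared mask and site j subtracts it.
                masked[i][k] = (masked[i][k] + pair_mask[k]) % MODULUS
                masked[j][k] = (masked[j][k] - pair_mask[k]) % MODULUS
    return masked


def recover_sum(masked):
    """Server-side step: summing the masked vectors cancels every pairwise mask."""
    dim = len(masked[0])
    return [sum(vec[k] for vec in masked) % MODULUS for k in range(dim)]


masked = mask_values([[1, 2], [3, 4], [5, 6]])
total = recover_sum(masked)
```

Each masked vector on its own looks random to the server, yet the masks cancel in the sum, so only the aggregate is revealed.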

| Dataset | Dataset description | # of covariates | # of samples |
|---|---|---|---|
| 1 | Simulated i.i.d. data | 5 | 500 |
| 2 | Simulated correlated data | 6 | 500 |
| 3 | Simulated binary data | 15 | 1500 |
| 4 | Myocardial infarction | 9 | 1253 |

Table 2: Summary of datasets used in our experiments

| Coefficient | Ordinary LR β | Ordinary LR Prob. | EXPLORER β | EXPLORER Prob. | Z-test statistic | Z-test p-value |
|---|---|---|---|---|---|---|
| β1 | -2.0743 | 1.000 | -2.0722 | 1.000 | -0.3573 | 0.7209 |
| β2 | 0.0452 | 0.267 | 0.0527 | 0.300 | -0.3669 | 0.7137 |
| β3 | 3.0195 | 1.000 | 3.0167 | 1.000 | 0.5919 | 0.5539 |
| β4 | -0.0402 | 0.200 | -0.0265 | 0.133 | -0.7003 | 0.4837 |
| β5 | -1.7313 | 1.000 | -1.7316 | 1.000 | 0.1305 | 0.8962 |

Table 3: Distributed forward feature selection on data set 1 over 30 trials

Table 4: Comparisons of H-L tests and AUCs for simulated dataset 2 with/without interaction using Ordinary LR and 4-site EXPLORER

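| Measure | With interaction: Ordinary LR | With interaction: EXPLORER | Without interaction: Ordinary LR | Without interaction: EXPLORER |
|---|---|---|---|---|
| H-L test statistic | 22.459 | 40.595 | 15.184 | 18.909 |
| H-L test p-value | 0.289 | 0.198 | 0.203 | 0.150 |
| AUC averaged value | 0.851 | 0.852 | 0.794 | 0.793 |
| AUC standard deviation | 0.045 | 0.045 | 0.058 | 0.058 |

AUC Z-test statistic: -0.049 with interaction, 0.075 without interaction
AUC Z-test p-value: 0.961 with interaction, 0.940 without interaction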

[Figure: Convergence speed of all 10 coefficients of data set 4 for an asynchronous 8-site EXPLORER setup]

| β | Ordinary LR value | Ordinary LR std. | EXPLORER value | EXPLORER std. | Z-test statistic | Z-test p-value |
|---|---|---|---|---|---|---|
| β0 | 1.577 | 0.157 | 1.558 | 0.153 | 0.468 | 0.64005 |
| β1 | -1.035 | 0.082 | -1.027 | 0.083 | -0.359 | 0.71924 |
| β2 | 1.728 | 0.072 | 1.734 | 0.072 | -0.301 | 0.76344 |
| β3 | 0.604 | 0.077 | 0.607 | 0.077 | -0.183 | 0.85511 |
| β4 | -1.120 | 0.096 | -1.116 | 0.096 | -0.139 | 0.88984 |
| β5 | 0.943 | 0.090 | 0.950 | 0.091 | -0.315 | 0.75262 |
| β6 | 2.239 | 0.103 | 2.249 | 0.102 | -0.379 | 0.70441 |
| β7 | -0.942 | 0.089 | -0.938 | 0.089 | -0.192 | 0.84777 |
| β8 | -1.535 | 0.081 | -1.537 | 0.081 | 0.073 | 0.94145 |
| β9 | 1.212 | 0.087 | 1.221 | 0.087 | -0.390 | 0.69690 |
| β10 | 1.048 | 0.080 | 1.055 | 0.080 | -0.322 | 0.74759 |
| β11 | 1.343 | 0.089 | 1.344 | 0.089 | -0.043 | 0.96579 |
| β12 | -0.066 | 0.080 | -0.064 | 0.080 | -0.131 | 0.89593 |
| β13 | -0.676 | 0.081 | -0.669 | 0.081 | -0.362 | 0.71716 |
| β14 | -3.354 | 0.111 | -3.361 | 0.109 | 0.226 | 0.82122 |
| β15 | 0.362 | 0.081 | 0.367 | 0.081 | -0.222 | 0.82458 |

Table 5: Learned model parameter β of dataset 3 using Ordinary LR and 2-site EXPLORER
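The p-values reported for the two-sample Z-tests appear consistent with a two-sided test against the standard normal distribution. The poster does not state this convention, so the Python sketch below treats it as an assumption and recomputes a few p-values from the reported test statistics in Table 5. Small differences are expected because the printed statistics are rounded; `two_sided_p_value` is a name introduced here for illustration.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    # P(|Z| >= |z|) for Z ~ N(0, 1), written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


# (test statistic, reported p-value) pairs copied from Table 5.
reported = {
    "β0": (0.468, 0.64005),
    "β8": (0.073, 0.94145),
    "β14": (0.226, 0.82122),
}

for name, (z, p_reported) in reported.items():
    print(name, "recomputed:", round(two_sided_p_value(z), 4), "reported:", p_reported)
```

All of the reported p-values are far above conventional significance thresholds, which is what supports the statement that the EXPLORER coefficients are statistically indistinguishable from those of ordinary logistic regression.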