

Privacy-Preserving Decision Tree Ensembles

Scenario · State-of-the-Art · Comparison and Challenges

Inductive Learning

[Diagram: Dataset → Decision Tree Algorithm → Optimized Decision Tree]

Inductive learning: learn the target model through iterative inductions over the training sample set.

How? Approximate an optimal hypothesis by optimizing an objective learning function.

Example: Minimizing a Loss Function

Training dataset: (x_1, c_1), …, (x_n, c_n), where c_i is the true label of x_i.
Target function: c = Γ(x).
The inductive learner produces a model y = g(x) which approximates Γ(x) such that the loss function L(c, y) is minimized.

The optimal model minimizes the average loss L(c, y) over all samples in the training set, weighted by their posterior probability P(c | x).

For many problems, c = Γ(x) is a non-deterministic function. The decision tree is one of the most fundamental inductive learning models.
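To make the loss-minimization view concrete, here is a minimal sketch (all names and data invented for illustration) that fits a one-split decision stump, the simplest decision tree, by exhaustively choosing the threshold that minimizes the empirical 0-1 loss L(c, y) over the training set:

```python
def fit_stump(xs, cs):
    """Fit a 1-D decision stump g(x) = [x > t] by minimizing empirical 0-1 loss."""
    best = None
    for t in sorted(set(xs)):
        for flip in (False, True):  # try both label orientations of the split
            preds = [(x > t) != flip for x in xs]
            loss = sum(p != c for p, c in zip(preds, cs)) / len(xs)
            if best is None or loss < best[0]:
                best = (loss, t, flip)
    return best  # (empirical loss, threshold, orientation)

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
cs = [False, False, False, True, True, True]
loss, t, flip = fit_stump(xs, cs)
```

On this toy set the stump reaches zero empirical loss with threshold t = 3.0; a full decision tree learner applies the same loss-driven split selection recursively at every node.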

Healthcare cost prediction [1], disease diagnosis [2] [3] , computer network analysis [4], credit risk assessment [5] [6]

[1] Sushmita, Shanu, et al. "Population cost prediction on public healthcare datasets." Proceedings of the 5th International Conference on Digital Health 2015. ACM, 2015.
[2] Azar, Ahmad Taher, and Shereen M. El-Metwally. "Decision tree classifiers for automated medical diagnosis." Neural Computing and Applications 23.7-8 (2013): 2387-2403.
[3] Singh, Anima, and John V. Guttag. "A comparison of non-symmetric entropy-based classification trees and support vector machine for cardiovascular risk stratification." Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE. IEEE, 2011.
[4] Antonakakis, Manos, et al. "From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware." USENIX Security Symposium. Vol. 12. 2012.
[5] Kim, Soo Y., and Arun Upneja. "Predicting restaurant financial distress using decision tree and AdaBoosted decision tree models." Economic Modelling 36 (2014): 354-362.
[6] Koh, Hian Chye, Wei Chin Tan, and Chwee Peng Goh. "A two-step method to construct credit scoring models with data mining techniques." International Journal of Business and Information 1.1 (2015).
[7] Agrawal, Rakesh, and Ramakrishnan Srikant. "Privacy-preserving data mining." ACM SIGMOD Record. Vol. 29. No. 2. ACM, 2000.
[8] Kargupta, Hillol, et al. "On the privacy preserving properties of random data perturbation techniques." Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003.
[9] Fan, Wei. "On the optimality of probability estimation by random decision trees." AAAI. Vol. 2004. 2004.
[10] Ho, Tin Kam. "Random decision forests." Document Analysis and Recognition, 1995. Proceedings of the Third International Conference on. Vol. 1. IEEE, 1995.
[11] Dwork, Cynthia. "Differential privacy." Encyclopedia of Cryptography and Security. Springer US, 2011. 338-340.
[12] Blum, Avrim, et al. "Practical privacy: the SuLQ framework." Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2005.
[13] Friedman, Arik, and Assaf Schuster. "Data mining with differential privacy." Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
[14] Rana, Santu, Sunil Kumar Gupta, and Svetha Venkatesh. "Differentially private random forest with high utility." Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 2015.
[15] Jagannathan, Geetha, Krishnan Pillaipakkamnatt, and Rebecca N. Wright. "A practical differentially private random decision tree classifier." Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on. IEEE, 2009.
[16] Lindell, Yehuda, and Benny Pinkas. "An efficient protocol for secure two-party computation in the presence of malicious adversaries." Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer Berlin Heidelberg, 2007.
[17] Beaver, Donald. "Commodity-based cryptography." Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM, 1997.
[18] Paillier, Pascal. "Public-key cryptosystems based on composite degree residuosity classes." Eurocrypt. Vol. 99. 1999.
[19] Cramer, Ronald, Rosario Gennaro, and Berry Schoenmakers. "A secure and optimally efficient multi-authority election scheme." Transactions on Emerging Telecommunications Technologies 8.5 (1997): 481-490.
[20] Rabin, Michael O. "How To Exchange Secrets with Oblivious Transfer." IACR Cryptology ePrint Archive 2005 (2005): 187.
[21] Yao, Andrew Chi-Chih. "How to generate and exchange secrets." Foundations of Computer Science, 1986. 27th Annual Symposium on. IEEE, 1986.
[22] Shamir, Adi. "How to share a secret." Communications of the ACM 22.11 (1979): 612-613.
[23] Lindell, Yehuda, and Benny Pinkas. "Privacy preserving data mining." Advances in Cryptology—CRYPTO 2000. Springer Berlin/Heidelberg, 2000.
[24] de Hoogh, Sebastiaan, et al. "Practical secure decision tree learning in a teletreatment application." International Conference on Financial Cryptography and Data Security. Springer, Berlin, Heidelberg, 2014.
[25] Wu, David J., et al. "Privately evaluating decision trees and random forests." Proceedings on Privacy Enhancing Technologies 2016.4 (2016): 335-355.
[26] De Cock, Martine, et al. "Efficient and Private Scoring of Decision Trees, Support Vector Machines and Logistic Regression Models based on Pre-Computation." IEEE Transactions on Dependable and Secure Computing (2017).
[27] Ohrimenko, Olga, et al. "Oblivious Multi-Party Machine Learning on Trusted Processors." USENIX Security Symposium. 2016.
[28] Vaidya, Jaideep, et al. "A random decision tree framework for privacy-preserving data mining." IEEE Transactions on Dependable and Secure Computing 11.5 (2014): 399-411.

Privacy-Preserving Training

[Diagram: the Primary Care Physician’s, Hospital’s, Insurance Provider’s, and Medical Specialist’s datasets combine into a Complete Dataset (HIPAA?)]

Privacy-Preserving Evaluation

[Diagram: Patient’s Private Medical Data → Decision Tree Model → Patient’s Sensitive Classification Result]

Randomization Techniques

Black Box

What does this mean?
Tree T1 is trained on dataset D1; Tree T2 is trained on dataset D2, where D2 = any dataset differing from D1 by, at most, one training example.
If no adversary can tell the difference between T1 and T2, then T1 is a differentially private decision tree.

How? Add noise to D1 before building the tree!

• Train using differentially private queries [12]

• Make each step of the training process differentially private [13]

• Add randomization
- Random forests [14]

- Random decision trees [15]
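The mechanisms above typically rest on the Laplace mechanism: answer each training query with noise calibrated to its sensitivity. As a hedged sketch (function names and data are invented, not taken from [12] or [13]), a counting query with sensitivity 1 can be made ε-differentially private like this:

```python
import math
import random

def dp_count(records, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy (Laplace mechanism).

    A count changes by at most 1 when one record changes (sensitivity 1),
    so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Inverse-CDF sample from Laplace(0, 1/epsilon); the distribution is
    # symmetric, so the sign convention below does not matter.
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return true_count + noise

rng = random.Random(0)
ages = [23, 45, 67, 34, 52, 71, 29]
noisy = dp_count(ages, lambda a: a >= 50, epsilon=1.0, rng=rng)  # true count is 3
```

Each released answer consumes ε of the privacy budget, which is why the number of queries issued while growing a tree must be tracked in [12]- and [13]-style training.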

“In the setting of multiparty computation, sets of two or more parties with private inputs wish to jointly compute some (predetermined) function of their inputs. The computation should be such that the outputs received by the parties are correctly distributed, and furthermore, that the privacy of each party's input is preserved as much as possible, even in the presence of adversarial behavior.” [16]

What does this mean?
Parties exchange random-looking messages such that the messages can still be used to compute the decision tree. The messages mean nothing on their own, yet the parties still obtain the trained model.

How? Building Blocks:
• Commodity-Based Cryptography [17]

• Homomorphic Encryption [18] [19]

• Oblivious Transfer [20]

• Yao’s Garbled Circuits [21]

• Shamir’s Secret Sharing Scheme [22]
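To see how "random-looking messages" can still yield a correct result, here is a toy sketch of additive secret sharing, a simpler cousin of Shamir's scheme [22] (the field modulus and the hospital scenario are illustrative assumptions, not from the poster):

```python
import random

PRIME = 2_147_483_647  # field modulus; any prime larger than the secrets works

def share(secret, n_parties, rng):
    """Split a secret into n additive shares; any n-1 of them look uniformly random."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

rng = random.Random(42)
# Three hospitals jointly sum their patient counts without revealing them:
counts = [120, 75, 300]
all_shares = [share(c, 3, rng) for c in counts]
# Party j locally adds the j-th share of every input; the partial sums combine:
partial_sums = [sum(s[j] for s in all_shares) % PRIME for j in range(3)]
total = reconstruct(partial_sums)  # == 495, yet no party saw another's count
```

Each party only ever sees uniformly random field elements, yet the reconstructed total is exact; real MPC protocols for decision tree training build comparisons and multiplications on top of primitives like this.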

Primary Concern: Privacy of the datasets

Leakage Points: (1) Training Process, (2) Tree Structure

Idea: Evaluation as a Service. The service provider holds a predictive ensemble model and charges per query made.

Privacy Concerns:
Server: Models
- as a source of revenue
- encode business knowledge
- encode underlying, potentially sensitive, training data
Client: Data, Classification Result

Differential Privacy [11]

• Training based on Garbled Circuits [23]

• Training based on Shamir’s Secret Sharing [24]

• Evaluation using Homomorphic Encryption [25]

• Evaluation using Commodity-Based Cryptography [26]

• Evaluation using SGX [27]

Secure Multiparty Computation

Seminal Work: Agrawal and Srikant – Privacy-Preserving Data Mining [7]

• Introduced the privacy-preserving data mining concept
• Techniques:
- Discretize values to protect individual, unique values
- Publish x_i + r, where r is drawn at random from a uniform or Gaussian distribution

• Has since been broken [8]

• Opened doors into the area
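A hedged sketch of the [7]-style value perturbation described above (variable names and data are invented): each value is published as x_i + r with r drawn from a zero-mean Gaussian, so aggregates survive while individual values are masked.

```python
import random

def perturb(values, sigma, rng):
    """Mask each value with independent zero-mean Gaussian noise (x_i + r)."""
    return [x + rng.gauss(0.0, sigma) for x in values]

rng = random.Random(7)
salaries = [52_000.0, 61_000.0, 48_000.0, 75_000.0, 58_000.0]
masked = perturb(salaries, sigma=5_000.0, rng=rng)
# Individual masked values are unreliable, but sample statistics survive:
true_mean = sum(salaries) / len(salaries)
masked_mean = sum(masked) / len(masked)
```

Noise-filtering attacks of the kind in [8] can recover much of the original data from such releases, which is the weakness noted above and part of what motivated the formal guarantees of differential privacy.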

Random Decision Trees
• Introduced by Fan [9]
• Splits at each node according to a randomly chosen feature
- Reduces the problem to protecting the leaf nodes
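Because the split features are chosen without looking at the data, the whole tree structure is public by construction and only the leaf statistics touch private records. A minimal sketch of this [9]-style idea (binary features and labels assumed, names hypothetical):

```python
import random

def build_random_tree(n_features, depth, rng):
    """Grow a tree whose split features are chosen at random, never from the data."""
    if depth == 0:
        return {"counts": [0, 0]}  # leaf: class histogram for binary labels
    return {"feature": rng.randrange(n_features),
            "left": build_random_tree(n_features, depth - 1, rng),
            "right": build_random_tree(n_features, depth - 1, rng)}

def add_example(tree, x, label):
    """Route a training example to its leaf and update the class counts there."""
    while "feature" in tree:
        tree = tree["right"] if x[tree["feature"]] else tree["left"]
    tree["counts"][label] += 1

def predict(tree, x):
    while "feature" in tree:
        tree = tree["right"] if x[tree["feature"]] else tree["left"]
    total = sum(tree["counts"])
    return tree["counts"][1] / total if total else 0.5  # P(label = 1)

rng = random.Random(3)
tree = build_random_tree(n_features=2, depth=2, rng=rng)
for x, y in [((1, 0), 1), ((1, 1), 1), ((0, 0), 0), ((0, 1), 0)]:
    add_example(tree, x, y)
```

Only the leaf histograms depend on the training set, so protecting them (for example with noisy counts, as in [15]) protects the whole model.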

Random Forests
• Introduced by Ho [10]
• Random subspace method to implement stochastic discrimination
• Ensemble method with bagging

Comparison of Approaches

Trade-Offs:
• Black-Box Access vs Accuracy Loss
- An attacker can combine a-priori information with the results from many protocol executions to reverse engineer private data, OR
- randomness can be introduced, at the cost of accuracy in the resulting model
• Efficiency Loss vs Data Access
- Multiple data holders need to exchange messages privately → cryptographic operations → efficiency loss, OR
- a single data holder must be assumed

[Chart: approaches [7] [28] [12] [13] [14] [25] [26] [24] positioned along these trade-off axes]

Open Research Challenges

• Risks of Reverse Engineering

• Computation Costs

• Incorporating Different Trust and Sensitivity Levels

• Combining Secure Multiparty Computation with Differential Privacy

• Dynamic and Flexible Collaborative Learning

Acknowledgement
This research has been partially supported by the National Science Foundation under Grants CNS-1115375, NSF 1547102, SaTC 1564097, and an RCN BD Fellowship, provided by the Research Coordination Network (RCN) on Big Data and Smart Cities. The first author was awarded partial GRA support from IISP.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the RCN or National Science Foundation.

[Diagram: Ensemble model. Classifiers 1 through m are trained on labeled data; at testing time their predictions on unlabeled data are combined into final predictions, with the combination learned from the labeled data.]

Privacy-Preserving Ensemble Learning

• Differential Privacy, Secure Multiparty Computation, Quantification of Privacy

• Ensemble Learning: Supervised, Unsupervised, Semi-supervised

• Distributed vs. Centralized Privacy-Preserving Ensemble Learning Architecture

• Decision Trees, Deep Neural Networks

Stacey Truex and Ling Liu