Technische Universität München
IEEE CBMS 2017
Fabian Prasser, Johanna Eicher, Raffael Bild, Helmut Spengler and Klaus A. Kuhn
Chair for Medical Informatics
Institute for Medical Statistics and Epidemiology
University Medical Center rechts der Isar
Technical University of Munich (TUM)
Munich, Germany
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification
Motivation and background
Modern big data initiatives in medical research
● High case numbers, detailed characterizations
● Secondary use, e.g. of routine clinical data for research
● Collaborative research, e.g. data sharing
Initiatives to improve transparency, reproducibility and re-usability of research
● NIH Statement on Sharing Research Data, Notice NOT-OD-03-032; 2003.
● NIH Genomic Data Sharing Policy, Notice NOT-OD-14-124; 2014.
● EMA Policy 0070 on Publication of Clinical Data for Medicinal Products for Human Use; 2014.
Role of data protection
● Compliance with legal requirements, e.g. anonymous data processing, risk management, data minimization
● Fostering and deepening trust, maintaining societal acceptance
Fabian Prasser et al.: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification, 22.06.2017
Data anonymization (a.k.a. de-identification)
One building block of data protection. Example:
[Figure: example of data transformation. Methods shown: generalization, deletion, aggregation, sampling, categorization, masking. Effect: reduction of uniqueness.]
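A minimal sketch of how generalization and masking, two of the methods named on this slide, reduce uniqueness. The toy records, the attributes (age, ZIP code) and the rules (10-year age bands, masking the last two ZIP digits) are invented for illustration and are not taken from the talk.

```python
from collections import Counter

# Hypothetical toy records: (age, ZIP code). Not data from the talk.
records = [(34, "81675"), (36, "81677"), (38, "81679"), (52, "80331"), (57, "80335")]

def generalize(record):
    """Generalize age to a 10-year band and mask the last two ZIP digits."""
    age, zip_code = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**")

def unique_fraction(rows):
    """Fraction of rows whose attribute combination occurs exactly once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1) / len(rows)

generalized = [generalize(r) for r in records]
uniqueness_before = unique_fraction(records)      # every record is unique
uniqueness_after = unique_fraction(generalized)   # records now share groups
```

In this toy case every input record is unique, while after the transformation each record shares its attribute combination with at least one other record.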
Challenges
Trade-off: privacy risks vs. usefulness of data processing
Usefulness: flexibility, scalability of data processing or quality of data
Models and methods are needed for analyzing and quantifying both aspects
Here: focus on measuring and optimizing usefulness for statistical classification
[Figure: privacy risk plotted against data quality. Labels: original data (highest risk), no data (no risk), transformation of data, potential solutions, example.]
Statistical classification
Task
● To predict the value of a class attribute from a set of values of feature attributes as accurately as possible
● Typically implemented with supervised learning where a model is created from a training set
Generic use case with quantifiable quality of outcome
● Alternative to general-purpose data quality models, e.g. entropy-based models, distance measures
● Performance of classifiers built from anonymous data is an indicator for the usefulness of data anonymization methods
● Early work suggested that anonymization destroys data mining utility
Data mining and predictive modeling are important in medicine
● Knowledge discovery: detection of unknown relationships between biomedical parameters
● Decision support: inference / prediction of parameters, e.g. diagnoses or outcomes
Evaluation of anonymous classifiers
Interwoven k-fold cross-validation
● Model is trained on anonymous output data but evaluated against input data
● Results for the different folds are combined to obtain the final result
Relative performance is reported
● Lower baseline: performance of trivial ZeroR method trained on input data
● Upper baseline: performance of non-trivial model trained on input data
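One plausible reading of "relative performance" is a rescaling of the anonymized model's accuracy between the two baselines. The formula below is an assumption about that normalization, not a quotation from the slides; the function name and the numbers in the usage line are hypothetical.

```python
def relative_accuracy(acc_output, acc_zero_r, acc_input):
    """Rescale accuracy so that the ZeroR baseline maps to 0 and the
    model trained on input data maps to 1 (assumed normalization)."""
    span = acc_input - acc_zero_r
    if span <= 0:
        raise ValueError("upper baseline must exceed lower baseline")
    return (acc_output - acc_zero_r) / span

# Hypothetical accuracies, for illustration only.
example = relative_accuracy(acc_output=0.75, acc_zero_r=0.5, acc_input=1.0)
```

Under this reading, a value near 1 means the classifier built from anonymized data performs about as well as one built from the original data, and a value near 0 means it is no better than always predicting the most frequent class.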
[Figure: interwoven cross-validation, shown for Fold 1 and Fold 2. Labels: input data, output data, temporary.]
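The interwoven cross-validation described on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: the `generalize` step stands in for the anonymization, the majority-vote lookup stands in for a real classifier, the toy rows are invented, and mapping test features through the same generalization before prediction is an assumption about how a model trained on output data is applied to input records.

```python
from collections import Counter, defaultdict

def generalize(features):
    """Hypothetical anonymization step: coarsen age to a 10-year band."""
    age, sex = features
    return (age // 10, sex)

def train_majority(rows):
    """Toy classifier: majority class per feature tuple, global majority as fallback."""
    by_features = defaultdict(Counter)
    overall = Counter()
    for features, label in rows:
        by_features[features][label] += 1
        overall[label] += 1
    fallback = overall.most_common(1)[0][0]
    table = {f: c.most_common(1)[0][0] for f, c in by_features.items()}
    return lambda features: table.get(features, fallback)

def interwoven_cv(input_rows, k):
    """Train on anonymized training folds, score against labels of the input test fold."""
    correct = 0
    for fold in range(k):
        test = [r for i, r in enumerate(input_rows) if i % k == fold]
        train = [r for i, r in enumerate(input_rows) if i % k != fold]
        # The model only ever sees output (anonymized) training data.
        model = train_majority([(generalize(f), y) for f, y in train])
        # Assumption: test features are mapped with the same generalization
        # so the model can be applied; the labels come from the input data.
        correct += sum(model(generalize(f)) == y for f, y in test)
    # Results of all folds are pooled into one accuracy value.
    return correct / len(input_rows)

# Hypothetical input rows: ((age, sex), class label).
input_rows = [
    ((31, "f"), "low"), ((38, "f"), "low"), ((45, "m"), "high"), ((47, "m"), "high"),
    ((33, "f"), "low"), ((36, "f"), "low"), ((42, "m"), "high"), ((49, "m"), "high"),
]
accuracy = interwoven_cv(input_rows, k=2)
```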
Optimizing output data
Anonymization is an optimization process (cf. trade-offs)
● Objective functions must be efficient
● Repeatedly training and evaluating a classifier is too expensive
Goal: Remove noise but preserve structure
Extension of the model by Iyengar for assessing output data
1) Group records by features
2) Penalize records with values of the class attribute which rarely occur together with the corresponding set of features
3) Penalize records which have been removed
[Iyengar: "Transforming data to satisfy privacy constraints", KDD, 2002]
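The three steps of the objective function can be sketched as below. The sketch follows the classification metric usually attributed to Iyengar (a penalty of 1 for each removed record and for each record whose class is not the majority class of its group, divided by the number of records). How the authors' extension differs is not stated on the slide, so this should not be read as their exact model; the `SUPPRESSED` marker, the tie handling and the toy rows are assumptions.

```python
from collections import Counter, defaultdict

SUPPRESSED = None  # hypothetical marker for the features of a removed record

def classification_penalty(rows):
    """Share of records that are suppressed or carry a non-majority class
    within their group of identical (generalized) feature values."""
    groups = defaultdict(Counter)
    for features, label in rows:
        if features is not SUPPRESSED:
            groups[features][label] += 1          # step 1: group by features
    penalty = 0
    for features, label in rows:
        if features is SUPPRESSED:
            penalty += 1                          # step 3: removed record
        else:
            majority_count = groups[features].most_common(1)[0][1]
            # Step 2: class value is rarer than the group's majority class.
            # Classes tied for the majority are not penalized in this sketch.
            if groups[features][label] < majority_count:
                penalty += 1
    return penalty / len(rows)

# Hypothetical output data: (generalized features, class label).
rows = [
    (("30-39", "f"), "low"), (("30-39", "f"), "low"), (("30-39", "f"), "high"),
    (("40-49", "m"), "high"), (("40-49", "m"), "high"), (SUPPRESSED, "low"),
]
score = classification_penalty(rows)  # lower is better
```

A function of this kind needs only one pass to count and one pass to score, which matches the slide's requirement that the objective be cheap compared with repeatedly training a classifier.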
ARX Data Anonymization Tool
Oriented towards guidelines for biomedical data anonymization
● Comprehensive feature set: 15 different risk and privacy models, several data transformation models, many visualizations
Graphical tool (Windows, macOS, Linux) and programming library
● Highly scalable: millions of records on commodity hardware
Internationally recognized
● About 25,000 downloads since 2012
● Covered in various official reports and guidelines
● Training workshops in Germany, France, UK
● Used for research and development by large technology companies, also in commercial products
Open Source Software
● License: Apache 2
Results: Optimizing output data
Competing general-purpose data quality models
● Data granularity: cell-based model, degree of generalization
● Non-Uniform Entropy: column-based model, differences in data distributions
● KL-Divergence: row-based model, differences in distribution of class sizes
Patient discharge dataset (40,000 records, 10 attributes)
● 5-Anonymization using attribute generalization and record suppression
● Logistic regression model to predict cost of hospital stays
● Classes: <$10,000, between $10,000 and $50,000, or >$50,000
[Figure: four panels (Granularity, NU-Entropy, KL-Divergence, Own), each plotting score against accuracy.]
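To make the experimental setup more concrete, the sketch below shows record suppression after generalization to reach 5-anonymity, and the three cost classes used as prediction target. It is illustrative only: the toy rows and function names are hypothetical, the records are assumed to be generalized already, and whether the class boundaries at exactly $10,000 and $50,000 are inclusive is an assumption.

```python
from collections import Counter

def suppress_small_groups(rows, k):
    """Keep only records whose quasi-identifier combination occurs at least k times."""
    counts = Counter(qi for qi, _ in rows)
    return [(qi, value) for qi, value in rows if counts[qi] >= k]

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(c >= k for c in Counter(qi for qi, _ in rows).values())

def cost_class(cost):
    """Three classes from the slide; inclusive bounds on the middle class are assumed."""
    if cost < 10_000:
        return "<$10,000"
    if cost > 50_000:
        return ">$50,000"
    return "$10,000-$50,000"

# Hypothetical, already generalized rows: ((age band, sex), cost of stay).
rows = [(("60-69", "M"), 12_000)] * 5 + [(("20-29", "F"), 4_000)] * 2
kept = suppress_small_groups(rows, k=5)   # the two records of the small group are removed
labels = [cost_class(cost) for _, cost in kept]
```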
Results: Evaluating anonymous classifiers
Relative and absolute prediction accuracy, Brier score
Precision, recall and f-score for various cut-off points
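For readers unfamiliar with the measures named here, the following sketch computes them for a binary outcome. It uses the textbook definitions, which may differ in detail from what the tool reports (for example for multi-class targets); the probabilities and outcomes are invented.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def precision_recall_f1(probabilities, outcomes, cutoff):
    """Precision, recall and F1 when predicting positive for p >= cutoff."""
    predicted = [p >= cutoff for p in probabilities]
    tp = sum(1 for y_hat, y in zip(predicted, outcomes) if y_hat and y == 1)
    fp = sum(1 for y_hat, y in zip(predicted, outcomes) if y_hat and y == 0)
    fn = sum(1 for y_hat, y in zip(predicted, outcomes) if not y_hat and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical predicted probabilities and true 0/1 outcomes.
probabilities = [0.9, 0.8, 0.4, 0.2]
outcomes = [1, 0, 1, 0]

brier = brier_score(probabilities, outcomes)
# One (precision, recall, F1) triple per cut-off point.
by_cutoff = {c: precision_recall_f1(probabilities, outcomes, c) for c in (0.3, 0.5)}
```

Lowering the cut-off from 0.5 to 0.3 in this toy example raises recall, since more records are predicted positive.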
AUC for different class values
ROC curves (input & output)
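A per-class AUC can be obtained by treating each class value in turn as the positive class (one-vs-rest). The sketch below uses the rank interpretation of the area under the ROC curve; it is a generic illustration with invented scores, not the tool's code.

```python
def roc_auc(scores, is_positive):
    """AUC as the probability that a random positive outranks a random negative
    (ties count one half); this equals the area under the ROC curve."""
    positives = [s for s, pos in zip(scores, is_positive) if pos]
    negatives = [s for s, pos in zip(scores, is_positive) if not pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def auc_per_class(class_scores, labels):
    """One-vs-rest AUC for every class value."""
    return {c: roc_auc(s, [y == c for y in labels]) for c, s in class_scores.items()}

# Hypothetical true labels and predicted class probabilities per record.
labels = ["low", "mid", "high", "low", "high"]
class_scores = {
    "low":  [0.7, 0.2, 0.1, 0.6, 0.3],
    "mid":  [0.2, 0.5, 0.3, 0.3, 0.6],
    "high": [0.1, 0.3, 0.6, 0.1, 0.1],
}
aucs = auc_per_class(class_scores, labels)
```

Computing these values once for a model trained on input data and once for a model trained on output data gives the kind of comparison the slide refers to.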
Thank you for your attention!
Dr. rer. nat. Fabian Prasser
University Medical Center rechts der Isar
Institute of Medical Statistics and Epidemiology
Technical University of Munich
Ismaninger Straße 22, 81675 Munich, Germany
Tel +49 89 4140-4328
Fax +49 89 4140-4850
fabian.prasser@tum.de
www.imse.med.tum.de
arx.deidentifier.org