A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification



TRANSCRIPT

Page 1: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification

Technische Universität München
IEEE CBMS 2017

Fabian Prasser, Johanna Eicher, Raffael Bild, Helmut Spengler and Klaus A. Kuhn

Chair for Medical Informatics
Institute for Medical Statistics and Epidemiology

University Medical Center rechts der Isar
Technical University of Munich (TUM)

Munich, Germany

A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification

Page 2: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Motivation and background

Modern big data initiatives in medical research
● High case numbers, detailed characterizations
● Secondary use, e.g. of routine clinical data for research
● Collaborative research, e.g. data sharing

Initiatives to improve transparency, reproducibility and re-usability of research
● NIH Statement on Sharing Research Data, Notice NOT-OD-03-032; 2003.
● NIH Genomic Data Sharing Policy, Notice NOT-OD-14-124; 2014.
● EMA Policy 0070 on Publication of Clinical Data for Medicinal Products for Human Use; 2014.

Role of data protection
● Compliance with legal requirements, e.g. anonymous data processing, risk management, data minimization
● Fostering and deepening trust, maintaining societal acceptance

Fabian Prasser et al.: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification, slide 2 / 14, 22.06.2017

Page 3: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Data anonymization (a.k.a. de-identification)

One building block of data protection. Example:


[Figure labels: Generalization, Deletion, Aggregation, Sampling, Categorization, Masking]

Reduction of uniqueness
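
A minimal Python sketch of how generalization and masking can reduce uniqueness. The toy records, the 10-year age bands and the ZIP-code masking rule are illustrative assumptions and do not come from the slides.

```python
from collections import Counter

# Hypothetical toy records (age, ZIP code); not taken from the slides.
records = [(34, "81675"), (36, "81677"), (52, "80331"), (57, "80333")]

def generalize(record):
    """Generalize age to a 10-year band and mask the last two ZIP digits."""
    age, zip_code = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:3] + "**")

def unique_fraction(rows):
    """Fraction of rows whose attribute combination occurs exactly once."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)

generalized = [generalize(r) for r in records]
print(unique_fraction(records))      # uniqueness before transformation
print(unique_fraction(generalized))  # uniqueness after transformation
```

In this toy case every original record is unique, while after the transformation each record shares its attribute combination with one other record.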

Page 4: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Challenges

[Figure: plot of data quality against privacy risk, with a region labeled "Potential solutions"]


Trade-off: privacy risks vs. usefulness of data processing

Usefulness: flexibility, scalability of data processing or quality of data

Models and methods are needed for analyzing and quantifying both aspects

Here: focus on measuring and optimizing usefulness for statistical classification

[Figure labels: "Original data: highest risk", "No data: no risk", "Transformation of data", "Example"]

Page 5: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Statistical classification

Task
● To predict the value of a class attribute from a set of values of feature attributes as accurately as possible
● Typically implemented with supervised learning where a model is created from a training set

Generic use case with quantifiable quality of outcome
● Alternative to general-purpose data quality models, e.g. entropy-based models, distance measures
● Performance of classifiers built from anonymous data is an indicator for the usefulness of data anonymization methods
● Early work suggested that anonymization destroys data mining utility

Data mining and predictive modeling are important in medicine
● Knowledge discovery: detection of unknown relationships between biomedical parameters
● Decision support: inference / prediction of parameters, e.g. diagnoses or outcomes


Page 6: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Evaluation of anonymous classifiers

Interwoven k-fold cross-validation
● Model is trained on anonymous output data but evaluated against input data
● Results for the different folds are combined to obtain the final result
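
The interwoven scheme could be sketched in Python roughly as follows. This is not the tool's implementation. The `generalize` function, the majority-vote classifier and the toy records are hypothetical stand-ins. The slide does not say how test records are mapped into the model's feature space, so the sketch assumes the same generalization is applied to the test features, with predictions compared against the class values of the input data.

```python
from collections import Counter, defaultdict

def generalize(features):
    """Hypothetical anonymization step: coarsen age to a 20-year band."""
    age, sex = features
    low = (age // 20) * 20
    return (f"{low}-{low + 19}", sex)

def train(rows):
    """Majority-vote classifier: most frequent class per feature tuple."""
    by_features = defaultdict(Counter)
    overall = Counter()
    for features, label in rows:
        by_features[features][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen features
    table = {f: c.most_common(1)[0][0] for f, c in by_features.items()}
    return lambda features: table.get(features, default)

def interwoven_cv(input_rows, k):
    """Train on anonymized training folds, score against input class values."""
    correct = 0
    for fold in range(k):
        train_in = [r for i, r in enumerate(input_rows) if i % k != fold]
        test_in = [r for i, r in enumerate(input_rows) if i % k == fold]
        train_out = [(generalize(f), y) for f, y in train_in]  # output data
        model = train(train_out)
        correct += sum(model(generalize(f)) == y for f, y in test_in)
    return correct / len(input_rows)  # results of all folds combined

# Hypothetical input data: ((age, sex), class value).
input_rows = [((25, "f"), "low"), ((31, "f"), "low"), ((45, "m"), "high"),
              ((52, "m"), "high"), ((28, "f"), "low"), ((47, "m"), "high"),
              ((33, "f"), "low"), ((58, "m"), "high")]
print(interwoven_cv(input_rows, k=2))
```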

Relative performance is reported
● Lower baseline: performance of trivial ZeroR method trained on input data
● Upper baseline: performance of non-trivial model trained on input data
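
The slide does not give a formula for relative performance. One plausible reading, shown here purely as an assumption, rescales accuracy linearly so that the ZeroR baseline maps to 0 and the model trained on input data maps to 1. All numbers below are hypothetical.

```python
from collections import Counter

def zero_r_accuracy(train_labels, test_labels):
    """Accuracy of ZeroR: always predict the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)

def relative_accuracy(accuracy, lower, upper):
    """Assumed linear rescaling: lower baseline -> 0, upper baseline -> 1."""
    if upper == lower:
        raise ValueError("baselines coincide; relative accuracy is undefined")
    return (accuracy - lower) / (upper - lower)

labels = ["low"] * 6 + ["high"] * 4      # hypothetical class values
lower = zero_r_accuracy(labels, labels)  # ZeroR baseline on these labels
upper = 0.9        # hypothetical accuracy of a model trained on input data
anonymous = 0.825  # hypothetical accuracy of a model trained on output data
print(relative_accuracy(anonymous, lower, upper))
```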


[Figure: interwoven cross-validation scheme for Fold 1 and Fold 2, showing input data, temporary data and output data]

Page 7: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Optimizing output data

Anonymization is an optimization process (cf. trade-offs)
● Objective functions must be efficient
● Repeatedly training and evaluating a classifier is too expensive

Goal: Remove noise but preserve structure

Extension of the model by Iyengar for assessing output data

1) Group records by features

2) Penalize records whose class attribute value rarely occurs together with the corresponding set of feature values

3) Penalize records which have been removed
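
The three steps above can be sketched in Python as a simple penalty score. This follows the basic idea of Iyengar's classification metric. The slides do not specify the details of the authors' extension, so the unit penalties, the normalization by the number of records and the use of `None` for removed records are assumptions.

```python
from collections import Counter, defaultdict

def classification_penalty(rows):
    """rows: list of (features, label); features is None for removed records."""
    groups = defaultdict(Counter)
    removed = 0
    for features, label in rows:
        if features is None:
            removed += 1                  # step 3: record was removed
        else:
            groups[features][label] += 1  # step 1: group records by features
    # Step 2: within each group, penalize records outside the majority class.
    minority = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return (minority + removed) / len(rows)

# Hypothetical anonymized records: (generalized features, class value).
rows = [(("30-39", "f"), "low"), (("30-39", "f"), "low"),
        (("30-39", "f"), "high"), (("50-59", "m"), "high"),
        (("50-59", "m"), "high"), (None, "low")]
print(classification_penalty(rows))
```

A score like this needs only one pass over the records, which matches the stated requirement that objective functions be efficient, unlike repeatedly training and evaluating a classifier.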


[Iyengar: "Transforming data to satisfy privacy constraints", KDD, 2002]

Page 8: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


ARX Data Anonymization Tool


Oriented towards guidelines for biomedical data anonymization
● Comprehensive feature set: 15 different risk and privacy models, several data transformation models, many visualizations

Graphical tool (Windows, macOS, Linux) and programming library
● Highly scalable: millions of records on commodity hardware

Internationally recognized
● About 25,000 downloads since 2012
● Covered in various official reports and guidelines
● Training workshops in Germany, France, UK
● Used for research and development by large technology companies, also in commercial products

Open Source Software
● License: Apache 2

Page 9: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Results: Optimizing output data


Competing general-purpose data quality models
● Data granularity: cell-based model, degree of generalization
● Non-Uniform Entropy: column-based model, differences in data distributions
● KL-Divergence: row-based model, differences in distribution of class sizes

Patient discharge dataset (40,000 records, 10 attributes)
● 5-anonymization using attribute generalization and record suppression
● Logistic regression model to predict cost of hospital stays
● Classes: <$10,000, between $10,000 and $50,000, or >$50,000
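
A small Python sketch of two ingredients of this setup: binning costs into the three classes and checking the 5-anonymity condition. The handling of the exact boundary values, the representation of suppressed records as `None` (skipped by the check) and the toy records are assumptions. This is not the ARX API.

```python
from collections import Counter

def cost_class(cost):
    """Map a cost in dollars to one of the three classes (boundaries assumed)."""
    if cost < 10_000:
        return "<$10,000"
    if cost <= 50_000:
        return "$10,000-$50,000"
    return ">$50,000"

def is_k_anonymous(rows, k):
    """True if every remaining attribute combination occurs at least k times."""
    counts = Counter(r for r in rows if r is not None)  # skip suppressed rows
    return all(n >= k for n in counts.values())

# Hypothetical generalized records; None marks a suppressed record.
rows = [("30-39", "816**")] * 5 + [("50-59", "803**")] * 5 + [None]
print(is_k_anonymous(rows, 5))
print(cost_class(7_500), cost_class(25_000), cost_class(80_000))
```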

[Figure: four panels (Granularity, NU-Entropy, KL-Divergence, Own), each plotting score against accuracy]

Page 10: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Results: Evaluating anonymous classifiers


Page 11: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Results: Evaluating anonymous classifiers


Relative and absolute prediction accuracy, Brier score

Precision, recall and f-score for various cut-off points
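
For reference, these measures can be computed as in the following Python sketch for a single class value treated as the positive class. The probabilities and outcomes are hypothetical, and the binary form of the Brier score is used. How the tool handles multiple classes is not stated on the slide.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def precision_recall_f(probs, outcomes, cutoff):
    """Precision, recall and F-score when predicting positive if p >= cutoff."""
    predicted = [p >= cutoff for p in probs]
    tp = sum(1 for pr, o in zip(predicted, outcomes) if pr and o == 1)
    fp = sum(1 for pr, o in zip(predicted, outcomes) if pr and o == 0)
    fn = sum(1 for pr, o in zip(predicted, outcomes) if not pr and o == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Hypothetical predicted probabilities and true outcomes (1 = positive).
probs = [0.9, 0.8, 0.4, 0.3]
outcomes = [1, 0, 1, 0]
print(brier_score(probs, outcomes))
for cutoff in (0.35, 0.5):
    print(cutoff, precision_recall_f(probs, outcomes, cutoff))
```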

Page 12: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Results: Evaluating anonymous classifiers


Page 13: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Results: Evaluating anonymous classifiers


AUC for different class values

ROC curves (input & output)
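
As a reminder of what these plots show, the Python sketch below builds ROC points and the AUC for one class value treated as positive against all others (a one-vs-rest reading, which is an assumption). Scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs, one per distinct cut-off, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for one class value; label 1 marks that class.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Repeating the computation once per class value, and once each for models trained on input and on output data, would yield the kind of comparison named above.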

Page 14: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification


Thank you for your attention!

Tel +49 89 4140-4328
Fax +49 89 4140-4850
fabian.prasser@tum.de
www.imse.med.tum.de
arx.deidentifier.org

Dr. rer. nat. Fabian Prasser

University Medical Center rechts der Isar
Institute of Medical Statistics and Epidemiology
Technical University of Munich

Ismaninger Straße 22, 81675 Munich, Germany
