A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification

Technische Universität München

IEEE CBMS 2017

Fabian Prasser, Johanna Eicher, Raffael Bild, Helmut Spengler and Klaus A. Kuhn

Chair for Medical Informatics

Institute for Medical Statistics and Epidemiology

University Medical Center rechts der Isar

Technical University of Munich (TUM)

Munich, Germany



Motivation and background

Modern big data initiatives in medical research
● High case numbers, detailed characterizations
● Secondary use, e.g. of routine clinical data for research
● Collaborative research, e.g. data sharing

Initiatives to improve transparency, reproducibility and re-usability of research
● NIH Statement on Sharing Research Data, Notice NOT-OD-03-032; 2003.
● NIH Genomic Data Sharing Policy, Notice NOT-OD-14-124; 2014.
● EMA Policy 0070 on Publication of Clinical Data for Medicinal Products for Human Use; 2014.

Role of data protection
● Compliance with legal requirements, e.g. anonymous data processing, risk management, data minimization
● Fostering and deepening trust, maintaining societal acceptance

Fabian Prasser et al.: A Tool for Optimizing De-Identified Health Data for Use in Statistical Classification, 22.06.2017


Data anonymization (a.k.a. de-identification)

One building block of data protection. Example:

[Figure: example data transformation using generalization, deletion, aggregation, sampling, categorization and masking, resulting in a reduction of uniqueness]
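The effect of generalization and masking on uniqueness can be illustrated with a minimal sketch. The toy records, the interval width and the helper names (`generalize_age`, `generalize_zip`, `unique_fraction`) are assumptions made for illustration; they are not taken from the talk or from ARX.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: map an exact age to an interval such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Masking: replace trailing characters, e.g. '81675' -> '816**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def unique_fraction(records):
    """Fraction of records whose attribute combination occurs exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)

# Hypothetical toy records: (age, postal code)
records = [(34, "81675"), (36, "81677"), (38, "81671"),
           (52, "80331"), (57, "80335"), (71, "80799")]
generalized = [(generalize_age(a), generalize_zip(z)) for a, z in records]

before = unique_fraction(records)      # every record is unique
after = unique_fraction(generalized)   # only one record remains unique
print(before, after)
```

In this toy data every original record is unique, while after the transformation only the single record in the oldest age group remains unique.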


Challenges

Trade-off: privacy risks vs. usefulness of data processing

Usefulness: flexibility, scalability of data processing or quality of data

Models and methods are needed for analyzing and quantifying both aspects

Here: focus on measuring and optimizing usefulness for statistical classification

[Figure: privacy risk plotted against data quality. Labels: original data (highest risk), no data (no risk), transformation of data, potential solutions, example]


Statistical classification

Task
● To predict the value of a class attribute from a set of values of feature attributes as accurately as possible
● Typically implemented with supervised learning where a model is created from a training set

Generic use case with quantifiable quality of outcome
● Alternative to general-purpose data quality models, e.g. entropy-based models, distance measures
● Performance of classifiers built from anonymous data is an indicator for the usefulness of data anonymization methods
● Early work suggested that anonymization destroys data mining utility

Data mining and predictive modeling are important in medicine
● Knowledge discovery: detection of unknown relationships between biomedical parameters
● Decision support: inference / prediction of parameters, e.g. diagnoses or outcomes



Evaluation of anonymous classifiers

Interwoven k-fold cross-validation
● Model is trained on anonymous output data but evaluated against input data
● Results for the different folds are combined to obtain the final result

Relative performance is reported
● Lower baseline: performance of trivial ZeroR method trained on input data
● Upper baseline: performance of non-trivial model trained on input data


[Figure: interwoven cross-validation for Fold 1 and Fold 2, showing input data, temporary data and output data]
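The evaluation scheme can be sketched roughly as follows. This is not the ARX implementation. It rests on several assumptions: a toy lookup classifier stands in for the non-trivial model, the feature values of test records are passed through the same hypothetical `anonymize` function before prediction while their class values come from the input data, and relative accuracy is scaled linearly between the two baselines.

```python
from collections import Counter, defaultdict

def majority(labels):
    """Most frequent class value (ZeroR prediction)."""
    return Counter(labels).most_common(1)[0][0]

def train_lookup(rows):
    """Toy classifier: majority class per feature tuple, ZeroR as fallback."""
    by_features = defaultdict(list)
    for features, label in rows:
        by_features[features].append(label)
    table = {f: majority(ls) for f, ls in by_features.items()}
    default = majority([label for _, label in rows])
    return lambda features: table.get(features, default)

def accuracy(predict, rows):
    return sum(predict(f) == y for f, y in rows) / len(rows)

def interwoven_cv(rows, anonymize, k=2):
    """Return mean accuracies (ZeroR, model on input, model on anonymized data)."""
    folds = [rows[i::k] for i in range(k)]
    zero_r, upper, anon = [], [], []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        default = majority([y for _, y in train])
        zero_r.append(accuracy(lambda f: default, test))    # lower baseline
        upper.append(accuracy(train_lookup(train), test))   # upper baseline
        # Train on anonymized training data, score against input class values.
        model = train_lookup([(anonymize(f), y) for f, y in train])
        anon.append(accuracy(lambda f: model(anonymize(f)), test))
    mean = lambda xs: sum(xs) / len(xs)
    return mean(zero_r), mean(upper), mean(anon)

def relative_accuracy(zero_r, upper, anon):
    """Assumed scaling: 0 at the ZeroR baseline, 1 at the input-data baseline."""
    if upper == zero_r:
        return 0.0
    return (anon - zero_r) / (upper - zero_r)

# Hypothetical toy data: one feature (age), class 'high' for age >= 50.
ages = [20, 20, 60, 60, 30, 30, 70, 70, 20, 20]
rows = [((a,), "high" if a >= 50 else "low") for a in ages]
suppress_all = lambda features: ("*",)   # worst case: all information removed
zero_r, upper, anon = interwoven_cv(rows, suppress_all)
print(relative_accuracy(zero_r, upper, anon))
```

Suppressing every feature value reduces the model to the ZeroR baseline, while an identity transformation reproduces the upper baseline.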


Optimizing output data

Anonymization is an optimization process (cf. trade-offs)
● Objective functions must be efficient
● Repeatedly training and evaluating a classifier is too expensive

Goal: Remove noise but preserve structure

Extension of the model by Iyengar for assessing output data

1) Group records by features

2) Penalize records whose class attribute value rarely occurs together with the corresponding set of features

3) Penalize records which have been removed


[Iyengar: "Transforming data to satisfy privacy constraints", KDD, 2002]
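The three steps above can be sketched as a simple penalty function. This follows the basic idea of Iyengar's classification metric (one penalty unit for a suppressed record or for a record whose class value is not the majority in its group). The exact form of the extension presented in the talk is not given on the slide, so the function below, its name and its normalization are assumptions.

```python
from collections import Counter, defaultdict

def classification_penalty(records):
    """records: list of (features, class_value); features is None if suppressed.

    Returns the share of penalized records (0 = best, 1 = worst).
    """
    # Step 1: group records by their (transformed) feature values.
    groups = defaultdict(Counter)
    for features, cls in records:
        if features is not None:
            groups[features][cls] += 1

    penalty = 0
    for features, cls in records:
        if features is None:
            # Step 3: penalize records which have been removed.
            penalty += 1
        else:
            # Step 2: penalize class values that are not the most frequent
            # one in the group (ties with the majority are not penalized).
            if groups[features][cls] < max(groups[features].values()):
                penalty += 1
    return penalty / len(records)

# Hypothetical output data with one minority record and one suppressed record.
output = [
    (("30-39", "816**"), "low"),
    (("30-39", "816**"), "low"),
    (("30-39", "816**"), "high"),   # minority class in its group
    (("50-59", "803**"), "high"),
    (("50-59", "803**"), "high"),
    (None, "low"),                   # suppressed record
]
score = classification_penalty(output)
print(score)
```

A function of this kind needs only a single grouping pass over the output data, which is why it is cheap enough to serve as an objective function during optimization.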


ARX Data Anonymization Tool


Oriented towards guidelines for biomedical data anonymization
● Comprehensive feature set: 15 different risk and privacy models, several data transformation models, many visualizations

Graphical tool (Windows, macOS, Linux) and programming library
● Highly scalable: millions of records on commodity hardware

Internationally recognized
● About 25,000 downloads since 2012
● Covered in various official reports and guidelines
● Training workshops in Germany, France, UK
● Used for research and development by large technology companies, also in commercial products

Open Source Software
● License: Apache 2


Results: Optimizing output data


Competing general-purpose data quality models
● Data granularity: cell-based model, degree of generalization
● Non-Uniform Entropy: column-based model, differences in data distributions
● KL-Divergence: row-based model, differences in distribution of class sizes

Patient discharge dataset (40,000 records, 10 attributes)
● 5-anonymization using attribute generalization and record suppression
● Logistic regression model to predict cost of hospital stays
● Classes: <$10,000, between $10,000 and $50,000, or >$50,000
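For readers unfamiliar with the privacy model used in this experiment: a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The sketch below checks this property and shows record suppression in its simplest form. The toy records and function names are assumptions, and the search for a suitable generalization level, which a tool such as ARX performs, is not shown.

```python
from collections import Counter

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(count >= k for count in Counter(records).values())

def suppress_small_groups(records, k):
    """Record suppression: drop records from groups with fewer than k members."""
    counts = Counter(records)
    return [r for r in records if counts[r] >= k]

# Hypothetical generalized quasi-identifiers: (age interval, sex)
records = [("30-39", "F")] * 5 + [("40-49", "M")] * 2

print(is_k_anonymous(records, 5))   # the group of size 2 violates 5-anonymity
kept = suppress_small_groups(records, 5)
print(len(kept), is_k_anonymous(kept, 5))
```

Every suppressed record is lost to the classifier, so an optimization has to weigh suppression against coarser generalization.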

[Figure: four panels (Granularity, NU-Entropy, KL-Divergence, Own), each with the axes Score and Accuracy]


Results: Evaluating anonymous classifiers





Relative and absolute prediction accuracy, Brier score

Precision, recall and F-score for various cut-off points

AUC for different class values

ROC curves (input & output)
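Two of the measures listed above can be written down in a few lines. The sketch uses the standard binary definitions for a single class value (outcome 1 if the record has that class value, 0 otherwise). How ARX computes and aggregates these measures internally is not stated on the slide, and the toy probabilities are assumptions.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def precision_recall_f(probabilities, outcomes, cutoff):
    """Precision, recall and F-score when predicting 1 for p >= cutoff."""
    predicted = [p >= cutoff for p in probabilities]
    tp = sum(1 for pred, o in zip(predicted, outcomes) if pred and o == 1)
    fp = sum(1 for pred, o in zip(predicted, outcomes) if pred and o == 0)
    fn = sum(1 for pred, o in zip(predicted, outcomes) if not pred and o == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Hypothetical predicted probabilities for one class value and true outcomes.
probabilities = [0.9, 0.8, 0.4, 0.2]
outcomes = [1, 0, 1, 0]

print(brier_score(probabilities, outcomes))
for cutoff in (0.3, 0.5):
    print(cutoff, precision_recall_f(probabilities, outcomes, cutoff))
```

Lowering the cut-off point trades precision for recall, which is why the measures are reported for several cut-off points.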


Thank you for your attention!

Tel +49 89 4140-4328
Fax +49 89 4140-4850
fabian.prasser@tum.de
www.imse.med.tum.de
arx.deidentifier.org

Dr. rer. nat. Fabian Prasser

University Medical Center rechts der Isar
Institute of Medical Statistics and Epidemiology
Technical University of Munich

Ismaninger Straße 22, 81675 Munich, Germany

