an experimental comparison of globally-optimal data de-identification algorithms
Technische Universität München
Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn
Chair for Biomedical Informatics
Institute for Medical Statistics and Epidemiology
Klinikum rechts der Isar der TU München
An Experimental Comparison of Globally-Optimal Data De-Identification Algorithms
Optimal de-identification algorithms
• Generalization hierarchies
• Generalization lattice
• Pruning: predictive tagging
• Optimization: roll-up
• Privacy models, e.g.: k-anonymity, l-diversity, t-closeness, δ-presence
F. Prasser, F. Kohlmayer et al.: A Benchmark of Globally-Optimal Anonymization Methods for Biomedical Data, CBMS 2014, 12/19/16
Input data:
Age    Gender  Zipcode
34     male    81667
45     female  81667
66     male    81925
70     female  81925
70     male    81925
Transformed data (k = 2):
Age    Gender  Zipcode
20-60  *       81667
20-60  *       81667
≥ 61   *       81925
≥ 61   *       81925
≥ 61   *       81925
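The example table on this slide can be checked with a few lines of Python. This is only a sketch: the function names `generalize` and `is_k_anonymous` are hypothetical, and the generalization rules (age split at 60, gender suppressed, zipcode kept) are read off the slide rather than taken from any hierarchy definition.

```python
from collections import Counter

# Records from the slide: (age, gender, zipcode).
records = [
    (34, "male", "81667"),
    (45, "female", "81667"),
    (66, "male", "81925"),
    (70, "female", "81925"),
    (70, "male", "81925"),
]

def generalize(record):
    """Apply the transformation shown on the slide (assumed rules)."""
    age, _gender, zipcode = record
    # Assumption: two age groups split at 60; all ages in the example are >= 20.
    age_group = "20-60" if age <= 60 else "≥ 61"
    # Gender is suppressed, the zipcode is left unchanged.
    return (age_group, "*", zipcode)

def is_k_anonymous(rows, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return all(count >= k for count in Counter(rows).values())

generalized = [generalize(r) for r in records]
print(is_k_anonymous(records, 2))      # False: every input record is unique
print(is_k_anonymous(generalized, 2))  # True: groups of size 2 and 3
```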
Algorithms – Incognito
• LeFevre et al.
– SIGMOD 2005
• Dynamic programming
– Breadth-first search on lattices for powerset of quasi-identifiers
Algorithms – OLA & Flash
• Emam et al.
– JAMIA 2009
• Divide & conquer
– Optimal Lattice Anonymization
– Binary search on sublattices
• Kohlmayer & Prasser et al.
– PASSAT 2012
• Greedy search
– Binary depth-first search
– Total order & priority queue
Algorithms – BFS, DFS & Questions
• Generic search methods
– Breadth-first search (BFS)
– Depth-first search (DFS)
→ Extended to use predictive tagging
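A breadth-first search extended with predictive tagging could look roughly like the sketch below. It assumes a monotonic privacy model: if a transformation is anonymous, so are all of its generalizations, and if it is not, neither are its specializations. The names `dominates`, `bfs_with_predictive_tagging` and `toy_check` are hypothetical, and the toy check stands in for a real anonymity check on data.

```python
from itertools import product

def dominates(a, b):
    """True if node a generalizes every attribute at least as much as node b."""
    return all(x >= y for x, y in zip(a, b))

def bfs_with_predictive_tagging(heights, is_anonymous):
    """Visit the generalization lattice level by level, skipping tagged nodes.

    heights: maximum generalization level per quasi-identifier.
    is_anonymous: monotonic check on a node (a tuple of levels).
    Returns (tags, checks): the result per node and the number of checks run.
    """
    # Level of a node = sum of its generalization levels (bottom-up order).
    nodes = sorted(product(*(range(h + 1) for h in heights)), key=sum)
    tags = {}
    checks = 0
    for node in nodes:
        if node in tags:
            continue  # result was already predicted, no check needed
        checks += 1
        result = is_anonymous(node)
        for other in nodes:
            if result and dominates(other, node):
                tags[other] = True   # all generalizations are anonymous
            elif not result and dominates(node, other):
                tags[other] = False  # all specializations are not anonymous
    return tags, checks

# Toy stand-in for an anonymity check (assumption, not a real privacy model).
toy_check = lambda node: sum(node) >= 3
tags, checks = bfs_with_predictive_tagging((2, 1, 2), toy_check)
print(checks, "checks for", len(tags), "transformations")
```

In a bottom-up breadth-first traversal only the "anonymous" tags save work, because the specializations of a node have already been visited when it is reached.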
• Research questions
– How do the algorithms compare in terms of performance?
– Are there further differences between them?
– Are the algorithms' properties influenced by the privacy models used?
– How do problem-specific methods compare to generic search algorithms?
Benchmark – Method
• Use all reasonable combinations of common privacy models with typical parameters
– (k)-anonymity, (l)-diversity, (t)-closeness, (δ)-presence
• Properties of the search space are influenced by combining privacy models:
– (k), (l), (t), (δ)
– (k, l), (k, t), (k, δ), (l, δ), (t, δ)
– (k, l, δ), (k, t, δ)
• Report three basic performance measures
– Pruning power: number of anonymity checks
– Optimizability: number of roll-ups
– Execution times in a highly efficient runtime environment (ARX)
• Five well-known benchmark datasets
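The eleven combinations listed on this slide can be reproduced by enumerating all non-empty subsets of the four privacy models and dropping those that contain both l-diversity and t-closeness. That exclusion rule is an assumption inferred from the list itself; the slide only says "reasonable combinations". The function name `reasonable_combinations` is hypothetical.

```python
from itertools import combinations

MODELS = ["k", "l", "t", "δ"]

def reasonable_combinations(models=MODELS):
    """Non-empty subsets of the privacy models without l and t together."""
    result = []
    for size in range(1, len(models) + 1):
        for combo in combinations(models, size):
            # Assumed rule: l-diversity and t-closeness are never combined.
            if "l" in combo and "t" in combo:
                continue
            result.append(combo)
    return result

for combo in reasonable_combinations():
    print("(" + ", ".join(combo) + ")")
```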
Results – Averaged over datasets
[Figure: # checks (lower is better), # roll-ups (higher is better), exec. time [s] (lower is better)]
● Allows analyzing variations in results for different sets of privacy models
Results – Averaged over datasets
● Repeating patterns
→ Consistent results for different configurations
→ Differences between algorithms not influenced by privacy models used
Results – Averaged over datasets
● Breadth-first search is a worst-case strategy
→ No pruning power, no optimizability
→ Incognito suffers from similar performance problems
Results – Averaged over datasets
● Depth-first search is pretty efficient
→ Can outperform domain-specific methods (OLA)
→ Because of its optimizability (best method in terms of #roll-ups)
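The roll-up optimization can be illustrated with a small sketch, assuming it means deriving the equivalence classes of a more generalized transformation from the classes of an already processed, less generalized one instead of rescanning every record. Depth-first search moves from a node to a direct generalization, which is where such a shortcut applies. The names `roll_up` and `to_age_group` and the sample rows are hypothetical and do not reflect how ARX implements this.

```python
from collections import Counter

def roll_up(groups, generalize):
    """Merge the equivalence classes of one transformation into those of a
    more generalized transformation, without touching the raw records."""
    merged = Counter()
    for key, size in groups.items():
        merged[generalize(key)] += size
    return merged

# Hypothetical (age, zipcode) rows and their equivalence classes.
rows = [(34, "81667"), (45, "81667"), (66, "81925"), (70, "81925"), (70, "81925")]
base_groups = Counter(rows)  # 4 classes for 5 records

def to_age_group(key):
    """Assumed next generalization level: replace the age by an age group."""
    age, zipcode = key
    return ("20-60" if age <= 60 else "≥ 61", zipcode)

rolled = roll_up(base_groups, to_age_group)
print(rolled)
```

The roll-up iterates over 4 classes instead of 5 records here; the saving grows as classes become larger further up the lattice.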
Results – Averaged over datasets
● Number of checks: OLA < Flash < DFS < Incognito < BFS
● Number of roll-ups: DFS > Flash > Incognito > OLA > BFS
● Execution times: Flash < OLA < DFS < Incognito < BFS
Results – Averaged over privacy models
[Figure: # checks (lower is better), # roll-ups (higher is better), exec. time [s] (lower is better)]
● Shows variations in results for different datasets
● Algorithms exhibit similar properties
● Flash provides the best overall performance
● Differences are mostly independent of datasets
● But
– OLA provides performance comparable to Flash for smaller datasets
– DFS provides performance comparable to Flash for larger datasets
Lessons learned
• In general, domain-specific algorithms outperform generic methods
→ Up to several orders of magnitude (BFS)
→ OLA and Flash only check between 0.2% and 1.1% of all transformations in the solution space
→ Not necessarily true for large datasets (DFS)
• Flash effectively balances optimizability with pruning power
→ Should be used if optimized runtime environments are available
• OLA provides best pruning power
→ Should be used in general-purpose environments
• DFS outperforms OLA for large datasets
→ In these cases, optimizability is more important than pruning power
→ Optimized runtime environments required
Thank you for your attention!
• ARX is free software
– Download – Use – Contribute
– Repository: https://github.com/arx-deidentifier/arx
• Further information
– Website: http://arx.deidentifier.org
– Contact
● Fabian Prasser ([email protected])
● Florian Kohlmayer ([email protected])