An Experimental Comparison of Globally-Optimal Data De-Identification Algorithms

Technische Universität München

Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn

Chair for Biomedical Informatics
Institute for Medical Statistics and Epidemiology

Klinikum rechts der Isar der TU München



Optimal de-identification algorithms

• Generalization hierarchies
• Pruning: predictive tagging
• Optimization: roll-up
• Privacy models, e.g.: k-anonymity, l-diversity, t-closeness, δ-presence

F. Prasser, F. Kohlmayer et al.: A Benchmark of Globally-Optimal Anonymization Methods for Biomedical Data, CBMS 2014

12/19/16

• Generalization lattice

Example with k = 2

Original data:

Age    Gender  Zipcode
34     male    81667
45     female  81667
66     male    81925
70     female  81925
70     male    81925

Generalized data:

Age    Gender  Zipcode
20-60  *       81667
20-60  *       81667
≥ 61   *       81925
≥ 61   *       81925
≥ 61   *       81925
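The k = 2 example can be checked mechanically. The following is a minimal sketch, not ARX code: the function name `is_k_anonymous` and the tuple encoding of the records are assumptions made for illustration. A dataset is treated as k-anonymous when every combination of quasi-identifier values occurs at least k times.

```python
from collections import Counter


def is_k_anonymous(rows, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    class_sizes = Counter(tuple(row) for row in rows)
    return all(size >= k for size in class_sizes.values())


# Records from the example (age, gender, zipcode), encoded as strings.
original = [
    ("34", "male", "81667"),
    ("45", "female", "81667"),
    ("66", "male", "81925"),
    ("70", "female", "81925"),
    ("70", "male", "81925"),
]

# The same records after generalizing age and suppressing gender.
generalized = [
    ("20-60", "*", "81667"),
    ("20-60", "*", "81667"),
    (">=61", "*", "81925"),
    (">=61", "*", "81925"),
    (">=61", "*", "81925"),
]

print(is_k_anonymous(original, 2))     # every original record is unique
print(is_k_anonymous(generalized, 2))  # classes of size 2 and 3
```

The generalized table has one class of two records and one of three, so it satisfies k = 2 but would fail k = 3.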


Algorithms – Incognito

• LeFevre et al.

– SIGMOD 2005

• Dynamic programming

– Breadth-first search on lattices for powerset of quasi-identifiers
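To make the lattice structure concrete, here is a small sketch of a generalization lattice and its level-wise (breadth-first) order. It only illustrates the search space; it does not implement Incognito's dynamic programming over subsets of quasi-identifiers. The hierarchy heights and the function names `lattice_levels` and `successors` are hypothetical.

```python
from itertools import product


def lattice_levels(max_levels):
    """Group all transformations by their total generalization level.

    max_levels[i] is the highest generalization level of attribute i;
    level 0 means the attribute is left unchanged.
    """
    levels = {}
    for node in product(*(range(m + 1) for m in max_levels)):
        levels.setdefault(sum(node), []).append(node)
    return [levels[i] for i in sorted(levels)]


def successors(node, max_levels):
    """Direct generalizations: raise exactly one attribute by one level."""
    return [
        node[:i] + (node[i] + 1,) + node[i + 1:]
        for i in range(len(node))
        if node[i] < max_levels[i]
    ]


# Hypothetical hierarchy heights for (age, gender, zipcode).
heights = (2, 1, 1)
levels = lattice_levels(heights)

# Visiting the levels bottom-up is a breadth-first traversal of the lattice.
for depth, nodes in enumerate(levels):
    print(depth, nodes)
```

With these assumed heights the lattice contains 3 · 2 · 2 = 12 transformations spread over five levels.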



Algorithms – OLA & Flash

• El Emam et al.

– JAMIA 2009

• Divide & conquer

– Optimal Lattice Anonymization
– Binary search on sublattices


• Kohlmayer & Prasser et al.

– PASSAT 2012

• Greedy search

– Binary depth-first search
– Total order & priority queue
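Both OLA and Flash rely on a binary search idea. The sketch below shows only the underlying principle on a single bottom-to-top path through the lattice; it is not the OLA or Flash algorithm. It assumes the privacy model is monotonic, meaning that every generalization of an anonymous transformation is anonymous as well. The function `first_anonymous` and the toy oracle are hypothetical.

```python
def first_anonymous(path, is_anonymous):
    """Binary search for the first anonymous node on a bottom-to-top path.

    Assumes monotonicity: once a node on the path is anonymous, all
    later (more general) nodes are anonymous too.
    Returns (node or None, number of anonymity checks performed).
    """
    lo, hi = 0, len(path)
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_anonymous(path[mid]):
            hi = mid          # answer is at mid or below
        else:
            lo = mid + 1      # answer is strictly above mid
    return (path[lo] if lo < len(path) else None), checks


# Hypothetical path through a lattice over (age, gender, zipcode) levels.
path = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 1)]


def toy_oracle(node):
    """Toy stand-in for an anonymity check: anonymous from total level 3 on."""
    return sum(node) >= 3


node, checks = first_anonymous(path, toy_oracle)
print(node, checks)
```

On this five-node path the search needs three checks instead of five, which is the effect measured as pruning power in the benchmark.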


Algorithms – BFS, DFS & Questions

• Generic search methods

– Breadth-first search (BFS)

– Depth-first search (DFS)

→ Extended to use predictive tagging

• Research questions

– How do the algorithms compare in terms of performance?

– Are there further differences between them?

– Are the algorithms' properties influenced by the privacy models used?

– How do problem-specific methods compare to generic search algorithms?
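The predictive tagging used to extend the generic search methods can be sketched as follows. This is an illustrative depth-first search, not the ARX implementation; it assumes a monotonic privacy model, and the names `dfs_with_tagging` and `toy_oracle` are hypothetical. After each check, the result is propagated: an anonymous node tags all of its generalizations as anonymous, a non-anonymous node tags all of its specializations as non-anonymous.

```python
from itertools import product


def dfs_with_tagging(max_levels, is_anonymous):
    """Depth-first search over a generalization lattice with predictive tagging.

    Returns (tags, checks): tags maps every node to True/False,
    checks counts how many nodes were actually checked.
    """
    nodes = list(product(*(range(m + 1) for m in max_levels)))
    tags = {}
    visited = set()
    checks = 0

    def dominates(a, b):
        """True if a is equal to or a generalization of b."""
        return all(x >= y for x, y in zip(a, b))

    def visit(node):
        nonlocal checks
        if node not in tags:
            checks += 1
            result = is_anonymous(node)
            for other in nodes:
                if result and dominates(other, node):
                    tags[other] = True     # generalizations stay anonymous
                elif not result and dominates(node, other):
                    tags[other] = False    # specializations stay non-anonymous
        visited.add(node)
        for i in range(len(node)):
            if node[i] < max_levels[i]:
                succ = node[:i] + (node[i] + 1,) + node[i + 1:]
                if succ not in visited:
                    visit(succ)

    visit(tuple(0 for _ in max_levels))
    return tags, checks


def toy_oracle(node):
    """Toy monotonic stand-in for an anonymity check."""
    return sum(node) >= 3


tags, checks = dfs_with_tagging((2, 1, 1), toy_oracle)
print(checks, "checks for", len(tags), "transformations")
```

Every transformation ends up classified, but only a subset is actually checked; how large that subset is depends on the traversal order, which is one reason the algorithms differ in their number of checks.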



Benchmark – Method

• Use all reasonable combinations of common privacy models with typical parameters

– (k)-anonymity, (l)-diversity, (t)-closeness, (δ)-presence

• Properties of the search space are influenced by combining privacy models:

– (k), (l), (t), (δ)
– (k, l), (k, t), (k, δ), (l, δ), (t, δ)
– (k, l, δ), (k, t, δ)

• Report three basic performance measures

– Pruning power: number of anonymity checks
– Optimizability: number of roll-ups
– Execution times in a highly efficient runtime environment (ARX)

• Five well-known benchmark datasets
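The roll-up counted as the optimizability measure can be illustrated with a short sketch. The idea is that the equivalence classes of a more general transformation can be obtained by merging the classes of a more specific one that has already been computed, instead of scanning all records again. This is a simplified illustration, not the ARX implementation; the helper names and the two-step hierarchy (age intervals first, then gender suppression) are assumptions.

```python
from collections import Counter


def classes_from_data(rows, generalize):
    """Build equivalence class sizes by scanning every record."""
    return Counter(generalize(row) for row in rows)


def roll_up(classes, generalize_further):
    """Build class sizes of a more general transformation by merging
    the classes of a more specific one (no pass over the raw records)."""
    merged = Counter()
    for key, size in classes.items():
        merged[generalize_further(key)] += size
    return merged


rows = [
    (34, "male", "81667"),
    (45, "female", "81667"),
    (66, "male", "81925"),
    (70, "female", "81925"),
    (70, "male", "81925"),
]


def age_interval(row):
    """Hypothetical first generalization step: age to an interval."""
    age, gender, zipcode = row
    return ("20-60" if age <= 60 else ">=61", gender, zipcode)


def suppress_gender(key):
    """Hypothetical second step: suppress gender."""
    return (key[0], "*", key[2])


specific = classes_from_data(rows, age_interval)   # 4 classes from 5 records
rolled = roll_up(specific, suppress_gender)        # merges 4 classes into 2
direct = classes_from_data(rows, lambda r: suppress_gender(age_interval(r)))

print(rolled == direct)
```

The roll-up touches four classes rather than five records here; on real datasets the number of classes is typically far smaller than the number of records, which is why a high number of roll-ups is counted as better.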



Results – Averaged over datasets


[Figure: # checks (lower is better), # roll-ups (higher is better), and execution time [s] (lower is better) for each algorithm]

● Allows analyzing variations in results for different sets of privacy models



● Repeating patterns

→ Consistent results for different configurations
→ Differences between algorithms not influenced by privacy models used





● Breadth-first search is a worst-case strategy

→ No pruning power, no optimizability
→ Incognito suffers from similar performance problems





● Depth-first search is pretty efficient

→ Can outperform domain-specific methods (OLA)
→ Because of its optimizability (best method in terms of #roll-ups)





● Number of checks: OLA < Flash < DFS < Incognito < BFS
● Number of roll-ups: DFS > Flash > Incognito > OLA > BFS
● Execution times: Flash < OLA < DFS < Incognito < BFS




Results – Averaged over privacy models


[Figure: # checks (lower is better), # roll-ups (higher is better), and execution time [s] (lower is better) for each dataset]

● Shows variations in results for different datasets

● Algorithms exhibit similar properties

● Flash provides the best overall performance

● Differences are mostly independent of datasets

● But:

– OLA provides performance comparable to Flash for smaller datasets

– DFS provides performance comparable to Flash for larger datasets


Lessons learned

• In general, domain-specific algorithms outperform generic methods

→ Up to several orders of magnitude (BFS)

→ OLA and Flash only check between 0.2% and 1.1% of all transformations in the solution space

→ Not necessarily true for large datasets (DFS)

• Flash effectively balances optimizability with pruning power

→ Should be used if optimized runtime environments are available

• OLA provides best pruning power

→ Should be used in general-purpose environments

• DFS outperforms OLA for large datasets

→ In these cases, optimizability is more important than pruning power

→ Optimized runtime environments required



Thank you for your attention!

• ARX is free software

– Download – Use – Contribute

– Repository: https://github.com/arx-deidentifier/arx

• Further information

– Website: http://arx.deidentifier.org

– Contact

● Fabian Prasser

● Florian Kohlmayer

