protecting privacy when disclosing information pierangela samarati latanya sweeney

Protecting Privacy when Disclosing Information

Pierangela Samarati Latanya Sweeney

INTRODUCTION

• Today’s society places demands on person-specific data.

• more and more historically public information is also electronically available

• combined, you can identify the personal information

• This paper addresses the problem of releasing person-specific data while preserving the person's anonymity

• k-anonymity: Specific information is ambiguously mapped to k-persons

EXAMPLE

RELATED WORK

• several protection techniques in statistical databases

• scrambling, adding noise, swapping values etc..

• suppression and generalization techniques but no formal foundation

• Different from traditional access control - protecting the data vs identity of the data

OUTLINE

• Formal foundation for anonymity problem and against linking

• quasi-identifiers: attribute that can be exploited for linking

• k-anonymity: degree of protection of data with respect to inference by linking

• preferred generalization: allows user to select among possible minimal generalizations - choose attributes

• Here, they protect the link between the identity and data but not the data itself

DEFINITIONS & ASSUMPTIONS

• Quasi-identifier: Let T(A1,..,An) be a table. A quasi-identifier is a set of attributes (A1,..,Aj) subset of (A1,..,An) whose release must be controlled.

• Goal: Allow release of information in the table which is related to atleast a given number k of individuals, k is set by data holder

• k-anonymity requirement: Each release of the data must be such that every combination of quasi-identifier can be indistinctly matched to atleast k individuals

• Issue: It is impossible to match the released data to externally available data!!

DEFINITIONS & ASSUMPTIONS

• Although the data holder knows the external attributes(contributes to quasi-identifiers), the specific values can not be assumed.

• Key: Translate the requirement in terms of the released data• Assumption: All attributes in table PT which are to be released and

which are externally available in combination to a data recipient are defined in a quasi-identifier

• Not a trivial assumption• Sweeney examines this risk and shows that this can not be perfectly

resolved.• k-anonymity for a table: Let T(A1,…,An) be the table and QT be

the set of quasi-identifiers of T. T is said to satisfy k-anonymity iff for each QI belongs to QT, each sequence of values in T[QI] appears at least with k occurences in T[QI].

GENERALIZING DATA

• first approach is based on the definition and use of generalization relationships between domains and between values that attributes can assume.

• Z0 is the zip code domain and Z1 is the domain where last digit is replaced by 0.

• to achieve k-anonymity, map the attributes in domain Z0 to Z1 where Z1 is more general

• This mapping between domains is stated by means of a generalization relationship which represents a partial order ≤ D on the set Dom of domains– each domain Di has at most one direct generalized domain

– all maximal elements of Dom are singleton(eventually all domains can be generalized to single value)

DOMAIN & VALUE GENERALIZATION HIERARCHIES

DOMAIN GENERALIZATION HIERARCHY

• Let Dom be the set of domains, given a tuple DT = (D1, …, Dn) such that Di belongs to Dom for i = 1,…,n, DGHDT = DGHD1x…xDGHDn, assuming the cartesian product is ordered by imposing coordinate wise order.

• Each path from DT to unique maximal element of DGHDT in the graph defines a possible alternative path

• The set of nodes in each such path together with the generalization relationship is called a generalization strategy for DGHDT

GENERALIZED TABLE

• Tj is a Generalized Table of Ti, written Ti ≤ Tj iff– Ti and Tj have same number of tuples

– Domain of each attribute of Tj (denoted by dom(Az,Tj) )is equal to or generalization of the domain of the attribute in Ti and

– Each tuple ti in Ti has a corresponding tuple tj in Tj (and vice versa) such that the value for each attribute in tj is equal to or generalization of the value of corresponding attribute in ti.

• Not all generalized tables are satisfactory• Don’t need extreme generalized table if more specific table exists

which satisfies k-anonymity• k-minimal generalization

k-minimal generalization

• Distance vector: Let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj. The distance vector of Tj from Ti is the vector DVi,j = [d1,…,dn] where dz is the length of unique path between dom(Az,Ti) and dom(Az,Tj) in DGHD

• Given two distance vectors DV = [d1,…,dn] and DV’ = [d1’,…,dn’], DV ≤ DV’ iff di ≤ di’ for all I = 1,…,n; DV < DV’ iff DV ≤ DV’ and DV ≠ DV’.

• k-minimal generalization: Let Ti(A1,…,An) and Tj(A1,…,An) be two tables such that Ti ≤ Tj. Tj is said to be a k-minimal generalization of Ti iff – Tj satisfies k-anonymity

– There is no Tz : Ti ≤ Tz, Tz satisfies k-anonymity and DVi,z < DVi,j

EXAMPLE

•For k=2, GT[1,0] and GT[0,1] are k-minimal generalizations, but not GT[0,2] and GT[1,1] For k=3, GT[1,0] and GT[0,2] are k-minimal generalizations.

SUPPRESSING DATA

• Complementary approach to generalization• Used to moderate the generalization process when there are limited

number of tuples(with less than k occurences)• Generalized Table with suppression: Ti(A1,…,An) and Tj(A1,

…,An) be two tables defined on same attributes. Tj is said to be a generalization of Ti – if sizeof(Tj) ≤ sizeof(Ti) – For all z = 1,…,n : dom(Az,Ti) ≤ dom(Az,Ti) – There is an injective mapping between Ti and Tj that associates

tuples ti (in Ti) and tj(in Tj) such that ti[Az] ≤ tj[Az]

• Minimal Required suppression: Let Tj be a generalization of Ti satisfying k-anonymity, Tj is said to enforce minimal required suppression iff there is no Tz such that Ti ≤ Tz, DVi,z = DVi,j, and sizeof(Tj) < sizeof(Tz) and Tz satisfies k-anonymity.

EXAMPLE

•The tuples written in bold face and marked with double lines in each table are the tuples that must be suppressed to achieve k-anonymity of 2. Suppression of any superset would not satisfy minimal required suppression.

k-minimal generalization with suppression

• Generalization and suppression are used in conjunction to obtain k-anonymity

• Tradeoff between generalization and suppression• Acceptable suppression threshold MaxSup • Within the threshold, suppression is considered better.• Reason: Generalization affects all the tuples whereas Suppression

affects single tuple.• k-minimal generalization with suppression: Ti(A1,…,An) and

Tj(A1,…,An) be two tables such that Ti ≤ Tj and MaxSup be the specific threshold of acceptance suppression. Tj is k-minimal generalization of Ti iff – Tj satisfies k-anonymity– Sizeof(Ti) - Sizeof(Tj) ≤ MaxSup– There is no Tz: Ti ≤ Tz, Tz satisfies conditions 1 and 2 and DVi,z < DVi,j

EXAMPLE

protecting privacy when disclosing information pierangela samarati latanya sweeney

Documents

data slide

available data

data holder

data recipient

generalizing data

released data assumption

data vs identity

anonymity problem