Privacy and k-Anonymity, Guy Sagy, November 2008, Seminar in Databases (236826)


Page 1: Privacy and k-Anonymity Guy Sagy November 2008 Seminar in Databases (236826)

Privacy and k-Anonymity

Guy Sagy, November 2008

Seminar in Databases (236826)

Page 2: Outline

- Introduction
- k-Anonymity
- Generalization & Suppression
- MinGen – Theoretical Algorithm
- Mondrian – A greedy partition algorithm

Page 3: What is Privacy?

- Society is experiencing exponential growth in the number and variety of data collections containing person-specific information.
- Sharing this collected information is valuable for both research and business, but publishing the data may put personal privacy at risk.
- Objective: maximize data utility while limiting disclosure risk to an acceptable level.

Note:
- There is no clear definition of "disclosure" or of "acceptable level".
- This is not the traditional security of data (e.g., access control, theft, hacking).

Page 4: Example

- For medical research (e.g., genes, infectious diseases), a hospital has person-specific patient data that it wants to publish.
- It wants to publish the data such that:
  - the information remains practically useful;
  - the identity of an individual cannot be determined.
- An adversary might infer the secret/sensitive data from the published database.

Page 5: Example (cont.)

The data contains:
- Identifiers: {name, ssn}
- Non-sensitive data: {zip-code, nationality, age}
- Sensitive data: {medical condition, salary, location}

| # | Name | Zip | Age | Nationality | Condition |
|---|------|-----|-----|-------------|-----------|
| 1 | Kumar | 13053 | 28 | Indian | Heart Disease |
| 2 | Bob | 13067 | 29 | American | Heart Disease |
| 3 | Ivan | 13053 | 35 | Canadian | Viral Infection |
| 4 | Umeko | 13067 | 36 | Japanese | Cancer |

Name is an identifier; Zip, Age and Nationality are non-sensitive data; Condition is sensitive data.

Page 6: Example (cont.) [SW02-A]

Published data (Zip, Age and Nationality are non-sensitive data; Condition is sensitive data):

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 13053 | 28 | Indian | Heart Disease |
| 2 | 13067 | 29 | American | Heart Disease |
| 3 | 13053 | 35 | Canadian | Viral Infection |
| 4 | 13067 | 36 | Japanese | Cancer |

Voter list:

| # | Name | Zip | Age | Nationality |
|---|------|-----|-----|-------------|
| 1 | John | 13053 | 28 | American |
| 2 | Bob | 13067 | 29 | American |
| 3 | Chris | 13053 | 23 | American |

Data leak! Do we have a privacy violation?
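The question on this slide can be made concrete by joining the two tables on the shared attributes. The following is a minimal sketch, not part of the original talk: the field names and the helper `link` are hypothetical, and the rows are the published data and voter list as read from the slide.

```python
# Illustrative linking attack: join the published table with the voter list
# on the shared attributes (Zip, Age, Nationality). Field names are hypothetical.
published = [
    {"zip": "13053", "age": 28, "nationality": "Indian", "condition": "Heart Disease"},
    {"zip": "13067", "age": 29, "nationality": "American", "condition": "Heart Disease"},
    {"zip": "13053", "age": 35, "nationality": "Canadian", "condition": "Viral Infection"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]
voters = [
    {"name": "John", "zip": "13053", "age": 28, "nationality": "American"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "American"},
    {"name": "Chris", "zip": "13053", "age": 23, "nationality": "American"},
]
QI = ("zip", "age", "nationality")


def link(published, voters, qi=QI):
    """Return (name, condition) for every voter matching exactly one published row."""
    leaks = []
    for voter in voters:
        matches = [row for row in published if all(row[a] == voter[a] for a in qi)]
        if len(matches) == 1:  # a unique match re-identifies the voter
            leaks.append((voter["name"], matches[0]["condition"]))
    return leaks


leaks = link(published, voters)
print(leaks)  # [('Bob', 'Heart Disease')]
```

Under these assumptions only Bob is re-identified, because his (Zip, Age, Nationality) combination matches exactly one published row.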

Page 7: Example (cont.) [SW02-A]

- The Group Insurance Commission (GIC) in Massachusetts sold data on state employees' health that was believed to be anonymous.
- The voter registration list for Cambridge, Massachusetts was sold for $20.
- William Weld was governor of Massachusetts and lived in Cambridge, Massachusetts. Six people had his particular birth date, three of them were men, and he was the only one in his 5-digit ZIP code.

Attributes of the two data sets:
- Medical data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
- Voter list only: Name, Address, Date registered, Party affiliation, Date last voted
- Shared by both (the quasi-identifier, QI): Zip, Birthdate, Gender

Page 8: Example 2 – AOL (2006)

| AnonID | Query | QueryTime | ItemRank | ClickURL |
|--------|-------|-----------|----------|----------|
| 1326 | konig wheels | 18/04/2006 13:29 | 1 | http://www.konigwheels.com |
| 1326 | jet blue airlines | 27/04/2006 15:29 | | |
| 1326 | coats tire equipment | 28/04/2006 15:53 | | |
| 1326 | coats tire equipment | 03/05/2006 19:15 | | |
| 1326 | verizon wireless | 09/05/2006 00:09 | | |
| 1326 | www.crazyradiodeals.com | 23/05/2006 18:00 | | |
| 1337 | uslandrecords.com | 01/03/2006 11:50 | 1 | http://www.seda-cog.org |
| 1337 | titlesourcein.com | 14/03/2006 15:45 | | |
| 1337 | titlesourceinc | 14/03/2006 15:45 | 1 | http://www.titlesourceinc.com |
| 1337 | select business services | 14/03/2006 15:51 | | |
| 1337 | select business services title | 14/03/2006 15:52 | | |
| 1337 | cbc companies | 14/03/2006 15:52 | 2 | http://www.cbc-companies.com |
| 1337 | cbc companies | 14/03/2006 15:52 | 3 | http://www.cbc-companies.com |
| 1337 | national real estate settlement services | 14/03/2006 15:59 | 1 | http://www.realtms.com |

Page 9: Example 2 (cont.)

Page 10: Example 3

Page 11: k-Anonymity [SW02-A]

Change the data in such a way that for each tuple in the resulting table there are at least (k-1) other tuples with the same value for the quasi-identifier. The result is a k-anonymized table.

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Heart Disease |
| 3 | 130** | < 40 | * | Viral Infection |
| 4 | 130** | < 40 | * | Cancer |

This is a 4-anonymized table. Why?

Page 12: K-Anonymity – Formal Definition

- $RT$: released table
- $(A_1, A_2, \dots, A_n)$: attributes
- $QI_{RT}$: quasi-identifier
- $RT[QI_{RT}]$: projection of $RT$ on $QI_{RT}$

Page 13: K-Anonymity Example [SW02-B]

|    | Country | Birth | Gender | ZIP | Problem |
|----|---------|-------|--------|-----|---------|
| t1 | USA | 1965 | m | 02141 | short breath |
| t2 | USA | 1965 | m | 02141 | chest pain |
| t3 | USA | 1964 | f | 02138 | obesity |
| t4 | USA | 1964 | f | 02138 | chest pain |
| t5 | Non-USA | 1964 | m | 02138 | chest pain |
| t6 | Non-USA | 1964 | m | 02138 | obesity |
| t7 | Non-USA | 1964 | m | 02138 | short breath |

Example of k-anonymity, where k=2 and QI={Country, Birth, Gender, ZIP}
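The k-anonymity condition can be checked mechanically by counting how often each combination of quasi-identifier values occurs. The sketch below is illustrative only: the function name `is_k_anonymous` and the column names are assumptions, and the rows are the seven tuples from the example table.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """True if every combination of QI values occurs in at least k rows."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


columns = ("country", "birth", "gender", "zip", "problem")
table = [dict(zip(columns, r)) for r in [
    ("USA", 1965, "m", "02141", "short breath"),
    ("USA", 1965, "m", "02141", "chest pain"),
    ("USA", 1964, "f", "02138", "obesity"),
    ("USA", 1964, "f", "02138", "chest pain"),
    ("Non-USA", 1964, "m", "02138", "chest pain"),
    ("Non-USA", 1964, "m", "02138", "obesity"),
    ("Non-USA", 1964, "m", "02138", "short breath"),
]]
qi = ("country", "birth", "gender", "zip")

print(is_k_anonymous(table, qi, 2))  # True: the QI groups have sizes 2, 2 and 3
print(is_k_anonymous(table, qi, 3))  # False: two groups contain only 2 tuples
```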

Page 14: K-Anonymity – The challenge

Theorem 1 in [SW02-B] claims: Let $RT(A_1, \dots, A_n)$ be a table, $QI_{RT} = (A_i, \dots, A_j)$ be the quasi-identifier associated with $RT$, $A_i, \dots, A_j \subseteq A_1, \dots, A_n$, and $RT$ satisfy k-anonymity. Then each sequence of values in $RT[A_x]$ appears with at least $k$ occurrences in $RT[QI_{RT}]$ for $x = i, \dots, j$.

Can we use this property to easily build a k-anonymous table? That is, does the converse hold: if each sequence of values in $RT[A_x]$ appears with at least $k$ occurrences, is the table k-anonymous?

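Page 15: K-Anonymity – The challenge (cont.)

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 1 | 20 | * | Heart Disease |
| 2 | 1 | 30 | * | Heart Disease |
| 3 | 2 | 20 | * | Viral Infection |
| 4 | 2 | 30 | * | Cancer |

No!!! Each Zip value and each Age value appears twice, but every (Zip, Age) combination appears only once, so the table is not 2-anonymous.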
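The counterexample can be verified directly. This is a small sketch under the assumption that the table above reads as Zip in {1, 2} and Age in {20, 30}; the variable names are illustrative.

```python
from collections import Counter

rows = [
    {"zip": "1", "age": 20, "nationality": "*"},
    {"zip": "1", "age": 30, "nationality": "*"},
    {"zip": "2", "age": 20, "nationality": "*"},
    {"zip": "2", "age": 30, "nationality": "*"},
]
qi = ("zip", "age", "nationality")
k = 2

# Per-attribute condition: every single-attribute value occurs at least k times.
per_attribute_ok = all(
    min(Counter(r[a] for r in rows).values()) >= k for a in qi
)

# Actual k-anonymity: every full QI tuple occurs at least k times.
tuple_counts = Counter(tuple(r[a] for a in qi) for r in rows)
k_anonymous = min(tuple_counts.values()) >= k

print(per_attribute_ok, k_anonymous)  # True False
```

The per-attribute condition is necessary for k-anonymity but, as the output shows, not sufficient.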

Page 16: How to create k-Anonymity?

- Generalization: replace the original value by a semantically consistent but less specific value.
- Suppression: data not released at all. Can be viewed as first level of generalization.

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Heart Disease |

Zip and Age are generalized; Nationality is suppressed.

Page 17: Generalization & Hierarchies

ZIP:
- Z0 = {13053, 13058, 13063, 13067}
- Z1 = {1305*, 1306*} (13053, 13058 → 1305*; 13063, 13067 → 1306*)
- Z2 = {130**}
- Z3 = {*****}

Age:
- 28, 29 → < 30; 35, 36 → 3*
- < 30, 3* → < 40
- < 40 → *

Nationality:
- US, Canadian → American; Japanese, Indian → Asian
- American, Asian → *
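One simple way to represent such a value generalization hierarchy is a child-to-parent map that is climbed one level at a time. The sketch below is illustrative: the dictionaries encode the ZIP and Nationality trees from this slide, while the names `ZIP_PARENT`, `NATIONALITY_PARENT` and `generalize` are hypothetical.

```python
# Child -> parent maps for two of the hierarchies on the slide.
ZIP_PARENT = {
    "13053": "1305*", "13058": "1305*",
    "13063": "1306*", "13067": "1306*",
    "1305*": "130**", "1306*": "130**",
    "130**": "*****",
}
NATIONALITY_PARENT = {
    "US": "American", "Canadian": "American",
    "Japanese": "Asian", "Indian": "Asian",
    "American": "*", "Asian": "*",
}


def generalize(value, level, parent):
    """Climb `level` steps up the hierarchy described by `parent`."""
    for _ in range(level):
        value = parent[value]  # raises KeyError above the top of the hierarchy
    return value


print(generalize("13053", 1, ZIP_PARENT))           # 1305*
print(generalize("13067", 2, ZIP_PARENT))           # 130**
print(generalize("Indian", 1, NATIONALITY_PARENT))  # Asian
```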

Page 18: Generalization & Hierarchies

The number of generalized tables is:

$$\prod_{i=1}^{n} \left(|DGH_i| + 1\right)$$

($DGH_i$ = maximum generalization level of $A_i$)

(Note: not every generalization creates a k-anonymous table.)
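As a worked instance of the count, assume attribute-level generalization and the three hierarchies shown earlier in the talk (ZIP with maximum level 3, Age with maximum level 3, Nationality with maximum level 2). The dictionary name `max_levels` is hypothetical.

```python
from math import prod

# Maximum generalization level per attribute, read off the hierarchies above.
max_levels = {"zip": 3, "age": 3, "nationality": 2}

# Each attribute can sit at any level 0..max, hence (max + 1) choices each.
num_tables = prod(h + 1 for h in max_levels.values())
print(num_tables)  # 48
```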

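Page 19

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 13053 | < 40 | * | Heart Disease |
| 2 | 13053 | < 40 | * | Viral Infection |
| 3 | 13067 | < 40 | * | Heart Disease |
| 4 | 13067 | < 40 | * | Cancer |

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 130** | < 30 | American | Heart Disease |
| 2 | 130** | < 30 | American | Viral Infection |
| 3 | 130** | 3* | Asian | Heart Disease |
| 4 | 130** | 3* | Asian | Cancer |

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Viral Infection |
| 3 | 130** | < 40 | * | Heart Disease |
| 4 | 130** | < 40 | * | Cancer |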

Page 20: K-minimal Generalizations

Intuition: the one that does not generalize the data more than needed (generalization decreases the utility of the published dataset!).

K-minimal generalization: $T_m$ is said to be a minimal generalization of $RT$ if:
- $T_m$ satisfies the k-anonymity requirement with respect to $QI_{RT}$;
- $\forall T_z$: $RT \le T_z$, $T_z \le T_m$, and $T_z$ satisfies the k-anonymity requirement with respect to $QI_{RT}$ $\Rightarrow$ $T_z = T_m$.

Page 21

2-minimal generalizations:

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 13053 | < 40 | * | Heart Disease |
| 2 | 13053 | < 40 | * | Viral Infection |
| 3 | 13067 | < 40 | * | Heart Disease |
| 4 | 13067 | < 40 | * | Cancer |

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 130** | < 30 | American | Heart Disease |
| 2 | 130** | < 30 | American | Viral Infection |
| 3 | 130** | 3* | Asian | Heart Disease |
| 4 | 130** | 3* | Asian | Cancer |

NOT a 2-minimal generalization:

| # | Zip | Age | Nationality | Condition |
|---|-----|-----|-------------|-----------|
| 1 | 130** | < 40 | * | Heart Disease |
| 2 | 130** | < 40 | * | Viral Infection |
| 3 | 130** | < 40 | * | Heart Disease |
| 4 | 130** | < 40 | * | Cancer |

There are many k-minimal anonymized tables. Which one to pick?

Page 22: K-minimal Generalizations

There are many k-minimal generalizations. Which one is preferred, then? There is no clear and "correct" answer:

- The one that creates minimum distortion to the data, where distortion is

  $$D = \frac{\sum_{i} \dfrac{\text{level of generalization of } A_i}{|DGH_{A_i}|}}{\text{number of attributes}}$$

- Normalized average equivalence class size metric:

  $$C_{AVG} = \left(\frac{\text{total\_records}}{\text{total\_equiv\_classes}}\right) / k$$

- The one with minimum suppression.
- The one that best supports the research (least damaging to the "interesting" attributes).
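Both metrics are short to compute. The sketch below is an illustration, assuming attribute-level generalization and the hierarchy heights used earlier in the talk (ZIP 3, Age 3, Nationality 2); the function names `distortion` and `c_avg` are hypothetical.

```python
from collections import Counter


def distortion(levels, max_levels):
    """Average over attributes of (generalization level / hierarchy height)."""
    return sum(l / h for l, h in zip(levels, max_levels)) / len(levels)


def c_avg(qi_rows, k):
    """Normalized average equivalence class size: (records / classes) / k."""
    classes = Counter(qi_rows)
    return (len(qi_rows) / len(classes)) / k


MAX_LEVELS = (3, 3, 2)  # ZIP, Age, Nationality

# Table with Zip 130**, Age "< 30" / "3*", Nationality American / Asian.
d_two_groups = distortion((2, 1, 1), MAX_LEVELS)
# Table with Zip 130**, Age "< 40", Nationality "*".
d_one_group = distortion((2, 2, 2), MAX_LEVELS)

two_groups = [("130**", "< 30", "American")] * 2 + [("130**", "3*", "Asian")] * 2
one_group = [("130**", "< 40", "*")] * 4

print(d_two_groups, d_one_group)
print(c_avg(two_groups, 2), c_avg(one_group, 2))  # 1.0 2.0
```

With k = 2, the table with two equivalence classes reaches the best possible value C_AVG = 1, while the fully merged table scores 2.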

Page 23: Algorithm for finding minimal generalization [SW02-B]

Theoretical Model (MinGen):
1. Store the set of all possible generalizations of RT over QI into allgens.
2. Store from allgens all the tables which satisfy k-anonymity into protected.
3. Define a comparing measure, score.
4. From protected, choose the table with the best score.
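The four steps can be sketched as a brute-force search. This is a simplified illustration, not the algorithm as specified in [SW02-B]: it assumes attribute-level generalization only, a quasi-identifier of just (Zip, Age), the distortion measure as the score, and the four-row example table. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def gen_zip(z, level):
    """Mask the last `level` digits of a 5-digit ZIP; level 3 suppresses it."""
    return "*****" if level >= 3 else z[: 5 - level] + "*" * level


def gen_age(a, level):
    """Age hierarchy from the talk: value -> '< 30' or '3*' -> '< 40' -> '*'."""
    if level == 0:
        return str(a)
    if level == 1:
        return "< 30" if a < 30 else "3*"
    return "< 40" if level == 2 else "*"


HIERARCHIES = [(gen_zip, 3), (gen_age, 3)]  # (generalizer, max level) per attribute


def is_k_anonymous(rows, k):
    return all(c >= k for c in Counter(rows).values())


def distortion(levels):
    return sum(l / h for l, (_, h) in zip(levels, HIERARCHIES)) / len(HIERARCHIES)


def mingen(rows, k):
    # Step 1: allgens = every combination of generalization levels.
    allgens = product(*(range(h + 1) for _, h in HIERARCHIES))
    protected = []
    for levels in allgens:
        table = [
            tuple(g(v, l) for v, l, (g, _) in zip(row, levels, HIERARCHIES))
            for row in rows
        ]
        # Step 2: keep only the k-anonymous tables.
        if is_k_anonymous(table, k):
            # Step 3: score each candidate by its distortion.
            protected.append((distortion(levels), levels, table))
    # Step 4: choose the table with the best (lowest) score.
    return min(protected) if protected else None


rows = [("13053", 28), ("13067", 29), ("13053", 35), ("13067", 36)]
score, levels, table = mingen(rows, k=2)
print(levels, table)
```

Under these assumptions the search keeps the ZIP codes intact and generalizes Age to "< 40". The enumeration visits every level combination, which is why the approach is only of theoretical interest.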

Page 24: Algorithm for finding minimal generalization

- The search space is exponential.
- The problem is NP-Hard!
- We present one proposed algorithm [LDR06] (K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006): the multidimensional algorithm (Mondrian).

Page 25: Single Dimensional Partitioning

A single-dimensional partitioning defines, for each attribute $A_i$, a set of non-overlapping single-dimensional intervals that cover $D_{X_i}$.

| Age (data) | Age (partitioning) |
|------------|--------------------|
| 20 | 20-24 |
| 22 | 20-24 |
| 22 | 20-24 |
| 24 | 20-24 |
| 26 | 26-31 |
| 30 | 26-31 |
| 30 | 26-31 |
| 31 | 26-31 |
| 38 | 38-44 |
| 40 | 38-44 |
| 42 | 38-44 |
| 44 | 38-44 |
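The recoding in the table amounts to mapping every value to the interval that contains it. A minimal sketch, assuming the three intervals shown on the slide; the function name `recode` is hypothetical.

```python
from collections import Counter


def recode(value, intervals):
    """Map a value to the label of the (inclusive) interval containing it."""
    for lo, hi in intervals:
        if lo <= value <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"{value} is not covered by any interval")


ages = [20, 22, 22, 24, 26, 30, 30, 31, 38, 40, 42, 44]
intervals = [(20, 24), (26, 31), (38, 44)]  # non-overlapping, as on the slide

recoded = [recode(a, intervals) for a in ages]
print(Counter(recoded))  # every interval holds 4 of the 12 records
```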

Page 26: Single Dimensional Partitioning

[Figure: the (Zip Code, Age) plane cut into a grid by single-dimensional intervals on each axis; Age axis marked at 20, 24, 26, 31, 38, 44; Zip Code axis marked from 2120 to 2149; 12 areas of partitioning.]

Page 27: Multidimensional Partitioning

- Assume all attributes are from a discrete numeric domain (every set can be mapped to one).
- The domain of $A_i$ is denoted by $D_{X_i}$.
- Each tuple can be represented as $(v_1, v_2, \dots, v_d) \in D_{X_1} \times D_{X_2} \times \dots \times D_{X_d}$.
- A multidimensional partitioning defines a set of multidimensional regions.

Page 28: Multidimensional Partitioning (cont.)

Attributes = {ZipCode, Age}

Page 29: Multidimensional Partitioning – Why is it good?

Voter registration data:

| Name | Age | Sex | Zipcode |
|------|-----|-----|---------|
| Ahmed | 25 | Male | 53710 |
| Bob | 28 | Male | 53711 |
| Claire | 31 | Female | 90210 |
| Dave | 19 | Male | 2174 |
| Evelyn | 40 | Female | 2237 |

Patient data:

| Age | Sex | Zipcode | Disease |
|-----|-----|---------|---------|
| 25 | Male | 53710 | Flu |
| 25 | Female | 53712 | Hepatitis |
| 26 | Male | 53711 | Bronchitis |
| 27 | Male | 53710 | Broken Arm |
| 27 | Female | 53712 | AIDS |
| 28 | Male | 53711 | Bronchitis |

Page 30: Multidimensional Partitioning (cont.)

Patient data:

| Age | Sex | Zipcode | Disease |
|-----|-----|---------|---------|
| 25 | Male | 53710 | Flu |
| 26 | Male | 53711 | Bronchitis |
| 27 | Male | 53710 | Broken Arm |
| 28 | Male | 53711 | Bronchitis |
| 25 | Female | 53712 | Hepatitis |
| 27 | Female | 53712 | AIDS |

Single dimensional:

| Age | Sex | Zipcode | Disease |
|-----|-----|---------|---------|
| 25-28 | Male | 53710-11 | Flu |
| 25-28 | Male | 53710-11 | Bronchitis |
| 25-28 | Male | 53710-11 | Broken Arm |
| 25-28 | Male | 53710-11 | Bronchitis |
| 25-28 | Female | 53712 | Hepatitis |
| 25-28 | Female | 53712 | AIDS |

Multi dimensional:

| Age | Sex | Zipcode | Disease |
|-----|-----|---------|---------|
| 25-26 | Male | 53710-11 | Flu |
| 25-26 | Male | 53710-11 | Bronchitis |
| 27-28 | Male | 53710-11 | Broken Arm |
| 27-28 | Male | 53710-11 | Bronchitis |
| 25-27 | Female | 53712 | Hepatitis |
| 25-27 | Female | 53712 | AIDS |

Page 31: Finding k-Anonymous Multidimensional Partitioning

Given a set $P$ of unique (point, count) pairs, with points in d-dimensional space, is there a multidimensional partitioning for $P$ such that:
- for every region $R_i$, $\sum_{p \in R_i} count(p) \ge k$ or $\sum_{p \in R_i} count(p) = 0$ (k-anonymity);
- $C_{AVG} \le c$ for a positive constant $c$ (average number of records in each partition)?

This problem is NP-Complete. Proof: reduction from Partition.

Page 32: Mondrian – A Greedy Partitioning Algorithm [LDR06]

[Figure: points in the (Age, Weight) plane partitioned for k-anonymity, k = 3.]

    Mondrian(partition)
        if (no allowable multidimensional cut for partition)
            return φ : partition → summary
        else
            dim ← choose_dimension()
            fs ← frequency_set(partition, dim)
            splitVal ← find_median(fs)
            lhs ← {t ∈ partition : t.dim ≤ splitVal}
            rhs ← {t ∈ partition : t.dim > splitVal}
            return Mondrian(rhs) ∪ Mondrian(lhs)
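The pseudocode above can be turned into a short runnable sketch. The version below is an illustration rather than the implementation from [LDR06]. It assumes numeric tuples, treats a cut as allowable when both sides keep at least k records, tries the dimensions in order of decreasing value range (one common heuristic), and summarizes a region by its bounding box. Encoding Sex as 0/1 is also an assumption.

```python
def mondrian(partition, k, dims):
    """Greedy median partitioning; returns a list of (summary, rows) regions."""

    def spread(d):
        vals = [t[d] for t in partition]
        return max(vals) - min(vals)

    # Try the widest dimension first, falling back to the others.
    for dim in sorted(dims, key=spread, reverse=True):
        values = sorted(t[dim] for t in partition)
        split_val = values[(len(values) - 1) // 2]  # median of the frequency set
        lhs = [t for t in partition if t[dim] <= split_val]
        rhs = [t for t in partition if t[dim] > split_val]
        if len(lhs) >= k and len(rhs) >= k:  # allowable cut
            return mondrian(lhs, k, dims) + mondrian(rhs, k, dims)

    # No allowable cut: summarize the region by its per-dimension (min, max).
    summary = tuple(
        (min(t[d] for t in partition), max(t[d] for t in partition)) for d in dims
    )
    return [(summary, partition)]


# Patient data from the talk as (age, sex, zipcode), with Male=0 and Female=1.
patients = [
    (25, 0, 53710), (25, 1, 53712), (26, 0, 53711),
    (27, 0, 53710), (27, 1, 53712), (28, 0, 53711),
]
regions = mondrian(patients, k=2, dims=(0, 1, 2))
for summary, rows in regions:
    print(summary, len(rows))
```

Every returned region contains at least k records, so replacing each tuple by its region summary yields a k-anonymous table. Because the cuts are chosen greedily, the result need not match a hand-built partitioning of the same data.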

Page 33: Mondrian – Example [LDR06]

Anonymizations for two attributes with a discrete normal distribution (μ = 25, σ = 2)

Page 34: Mondrian Quality

By definition of k-anonymity:

$$C_{AVG}^{*} = \left(\frac{\text{total\_records}}{\text{total\_equiv\_classes}}\right) / k \ge 1$$

From Theorem 2 in [LDR06]: the maximum number of points in any region $R_i$ is $2d(k-1) + m$, where $m$ is the maximum number of copies of any distinct point in $P$.

For constant $d$, $m$, $k$, $C_{AVG}$ is therefore within a constant factor of $C_{AVG}^{*}$:

$$\frac{C_{AVG}}{C_{AVG}^{*}} \le \frac{2d(k-1) + m}{k}$$

Page 35: Piet Mondrian (1872-1944)

(*) Wikipedia

Page 36: Privacy – Last Example

Page 37

Page 38: Bibliography

- [SW02-A] "k-Anonymity: A Model for Protecting Privacy", L. Sweeney, 2002
- [SW02-B] "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression", L. Sweeney, 2002
- [LDR06] "Mondrian Multidimensional k-Anonymity", K. LeFevre, D.J. DeWitt, R. Ramakrishnan, 2006
- http://en.wikipedia.org/wiki/Piet_Mondrian
- Presentations:
  - "Privacy In Databases", B. Aditya Prakash
  - "K-Anonymity and Other Cluster-Based Methods", Ge. Ruan