hiding emerging patterns with local recoding generalization

Post on 04-Jan-2016

41 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION. Presented by: Michael Cheng Supervisor: Dr. William Cheung Co-Supervisor: Dr. Byron Choi. Presentation Flow. Privacy-Preserving Data Publishing Introduction to Emerging Patterns (EPs) Introduction to Equivalence Class - PowerPoint PPT Presentation

TRANSCRIPT

HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATIONPresented by: Michael ChengSupervisor: Dr. William Cheung Co-Supervisor: Dr. Byron Choi

Presentation Flow

Privacy-Preserving Data Publishing Introduction to Emerging Patterns

(EPs) Introduction to Equivalence Class Introduction to Generalization Proposed Problem and Motivation Heuristic for the Problem Experimental Results Future research plan

Privacy Preserving Data Publishing- Introduction Organizations often need to publish

or share their data for legitimate reasons

Sensitive information (e.g. personal identities, restrictive patterns) maybe inferred from the published data

Privacy Preserving Data Publishing- Objective Transform the dataset before publishing,

such that:1. Sensitive information In our case: Emerging Patterns (EPs)2. Subsequence analysis In our case: Frequent Itemset (FIS)

Mining

Introduction to Emerging Patterns (EPs) Emerging Patterns (EPs) are itemsets

exist in pair of datasets whose supports are significant in one dataset but insignificant in another

Edu Occup Marital

BA Exec Married

BA Exec Married

BA Exec Married

BA Exec Married

MSE Worker Never

Edu Occup Marital

Married

Married

BA Exec Married

BA Manager Married

BA Repair Never

MSE Exec

MSE Exec

{MSE, Exec} is an Emerging Pattern

Income >= 50k Income < 50k

Introduction to Emerging Patterns (EPs) Formally, growth rate and EPs are

defined as follow:

Manager

Introduction to Equivalence Class Tuples are said to be in the same

Equivalence Class w.r.t. a set of Attribute A if they take same values of A

ID Edu Occup Marital

1 MSE

2 MSE

3 BA

4 BA Married

5 BA Repair Never

Exec Married

Exec Married

Exec Married

Tuples {1,2,3} are in the same Equivalence Class w.r.t. {Occup,

Marital}

Introduction to Generalization Extensively studied in achieving k-Anonymity

Not studied before for hiding itemsets

Modify the original values in dataset into more general values according to a user-given hierarchy such that more tuples will share the same set of attribute values

Example:In Adult, “BA” and “MSE” maybe generalized to “Degree Holder”

Types of Generalization

Single Dimensional Global Recoding Multi Dimensional Global Recoding Multi Dimensional Local Recoding

Single Dimensional Global Recoding If we decide to generalize some values

to a single value, all tuples which contains these values will be affected

Occup

Exec

Exec

Exec

Manager

Repair

Occup

Occupation

Occupation

Occupation

Occupation

Occupation

Single Dimensional

Global Recoding

Multi Dimensional Global Recoding If we decide to generalize some values

to a single value, all tuples in the same equivalence class which contains those values will be affected

Occup

Exec

Exec

Exec

Manager

Repair

Multi Dimensional

Global Recoding

Occup

Manager

Repair

Occupation

Occupation

Occupation

Multi Dimensional Local Recoding Same as the Multi Dimensional Global

Recoding except no Equivalence Class constraint

Occup

Exec

Exec

Exec

Manager

Repair

Multi Dimensional

Local Recoding

Occup

Manager

Repair

Exec

Occupation

Occupation

Proposed Problem- Why EP and FIS ? Emerging Pattern may reveal sensitive

information

E.g. In the Adult dataset from UCI Repository, we found that: {Never-Married, Own-Child} is an EP from the class

“Income < 50k” to the class “Income >=50k” Growth Rate: 35

Frequent Itemset is a popular data mining task and supported by commercial data-mining software

Proposed Problem-Why Generalization ? Other methods studied in PPDP

For example: Adding unknowns, remove tuples, adding fake tuples

randomly Either

Incomplete information Fake information

In some applications, completeness and truthfulness of data are important

By using generalization, we can preserve the completeness and truthfulness of the data

Proposed problem- Problem Illustration

D D’Transformati

on(Local

Recoding)

Emerging PatternsFrequent Itemsets

Intuition of Local Recoding

Support of FIS = 40% Growth Rate of EP = 3

Frequent Itemset = {Exec, Married} Emerging Pattern = {MSE ,Exec}

Edu Occup Marital

Married

Married

BA Exec Married

BA Manager Married

BA Repair Never

MSE Exec

MSE Exec

Income >= 50k Income < 50k

Edu Occup Marital

BA Exec Married

BA Exec Married

BA Exec Married

BA Worker Married

MSE Manager Never

Intuition of Local RecodingEdu Occup Marital

Married

Married

BA Exec Married

BA Manager Married

BA Repair Never

MSE Exec

MSE Exec

Income >= 50k Income < 50k

Edu Occup Marital

BA Exec Married

BA Exec Married

BA Exec Married

BA Worker Married

MSE Manager Never

Edu Occup Marital

Married

Married

BA Exec Married

BA Manager Married

BA Repair Never

MSE White col

MSE White col

Income >= 50k Income < 50k

Edu Occup Marital

BA Exec Married

BA Exec Married

BA Exec Married

BA Worker Married

MSE White Col Never

Heuristic for the Problem- Greedy Approach

Repeat…

Until…

All Emerging Patterns are removed

DEmerging Patterns Mining

Applying the generalization

EPs

EP 1

EP 2

EP 3

EP 4

Equivalence ClassesUtility Gain

Class1 40

Class 2 90

Class 3 60

Class 4 20

Class 5 15

Heuristic for the Problem-Greedy Approach Drawbacks:

Trapped into some local minima Solution:

Simulated Annealing Style Approach for choosing equivalence class

Heuristic for the Problem- Simulated Annealing Style Approach

Choose Equivalence Class probabilistically

Two parameters: Initial temperature ( T0 ) Cooling Rate ( α )

Acceptance Probability: exp Utility Gain / Temperature

Temperature updating: Tn = α Tn-1

Utility Gain

T=1000

T=100 T=10

90 0.209 0.302 0.945

60 0.203 0.223 0.047

40 0.199 0.183 0.006

20 0.195 0.150 0.0009

15 0.194 0.142 0.0005Acceptance probability of different utility gain and temperature

Heuristic for the Problem- Simulated Annealing Style Approach

Repeat…

Until…

All Emerging Patterns are removed

DEmerging Patterns Mining

Applying the generalizationand

Decrease the temperature

EPs

EP 1

EP 2

EP 3

EP 4

Equivalence ClassesProbability

Class1 0.2

Class 2 0.4

Class 3 0.1

Class 4 0.25

Class 5 0.05

Two questions

How to choose an EP for generalization? How to calculate the utility gain?

How to choose an EP for generalization? Choose the EP which overlaps with the

remaining EPs the most More likely to hide other EPs

simultaneouslyEmerging Patterns

MSE Never Married

BA Divorced

BA Divorced Worker

BA Divorced Repairman

BA DivorcedOwn-Child

How to calculate utility gain?

Utility gain is a function of: Recoding Distance (RD) Reduction of Growth Rate (RG)

How to calculate utility gain ?- Recoding Distance (RD) The detail derivation is stated in the paper Intuitively, it measures…

How many and how much FIS have been generalized?

How many FIS disappeared? High level definition of RD:

θq x (generalized FIS) + ( 1- θq ) x (disappeared FIS)

,where θq is user defined parameterThe larger the value of RD, the more the distortion generated on

the Frequent Itemset

How to calculate utility gain ?- Reduction of Growth Rate(RG) After taken a local recoding, RG is

defined as: The reduction of growth rate of all EPs

Emerging Patterns

Growth Rate

Executive , Married

10

BA, Divorced 20

Executive 30

Sum of Growth Rate

60

Emerging Patterns

Growth Rate

White col, Married

5

BA, Divorced 20

Sum of Growth Rate

25

Local Recoding

RG = 60 – 25 = 35

How to calculate utility gain? Putting all these together, utility gain is defined

as:θp x RG – (1- θp ) x RD

,where θp is user defined parameters

It favors: Local recoding which can reduce lots of growth rate

It penalizes: Local recoding which generate large distortion on

FIS

Experimental Setup

Dataset: Adult dataset from UCI Repository Popular benchmark dataset used for generalization

Total number of records: 30162 Income > 50k : 7508 Income <= 50k : 22654

Use only 8 categorical attributes for experiment A well accepted hierarchy is defined

Parameters: Support of FIS : 40% Growth rate of EP : 5 Initial Temperature : 10 Cooling Rate : 0.4

Performance

RD / No. of FIS disappeared of the Greedy Approach

RD / No. of FIS disappeared ofSimulated Annealing Style Approach

(Best of 5)

Maximum RD: 623.1

Runtime (in minutes)

Greedy Approach

Simulated Annealing Style Approach(Best of 5)

Future Research Plan

Hide EPs in temporal datasets Consider multi-level FIS Hiding a group of emerging patterns at a

time

Q & A

Any Questions?

top related