
DECLARATION

I hereby declare that the academic research work entitled "Privacy Preserving Data Mining using Group Based Anonymization" has not been presented anywhere for the award of the degree of M.Tech., to the best of my knowledge and belief.

Date: Rani Srivastava


Certificate

This is to certify that the dissertation (Dissertation Code: IS 602) entitled "Privacy Preserving Data Mining using Group Based Anonymization", done by Rani Srivastava, Roll No. 0021010808, is an authentic work carried out by her at Ambedkar Institute of Technology under my guidance. The matter embodied in this project work has not been submitted earlier for the award of any degree or diploma, to the best of my knowledge and belief.

Date: Signature of the Guide

Name of the Guide:

Mr. Vishal Bhatnagar,

Asst. Professor,

CSE Department,

Ambedkar Institute of Technology,

Geeta Colony, Delhi – 110031.


Abstract

Group-based anonymization methods, including k-anonymity, l-diversity, and t-closeness, were introduced for data privacy. The t-closeness model was introduced to provide a safeguard against similarity attacks on a published data set. It requires that the Earth Mover's Distance (EMD) between the distribution of a sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table should not exceed a threshold t. Our aim in this thesis is to provide logical security of data through data anonymity. Many other models can serve as substitutes for group-based anonymization, but they cannot remove all of its shortcomings and suffer from limitations of their own, which makes the group-based approach the more efficient choice.


Acknowledgement

The success of any project is the outcome of hard work, dedication, and coordination. This work would not have been in its present form without the cooperation and coordination of many people who not only helped me whenever I was hindered but also kept my morale high. I express my deep sense of gratitude and sincere thanks to my guide, Mr. Vishal Bhatnagar, not only for his guidance but also for his enthusiasm towards work, which encouraged me to bring about something new and never stopped me from experimenting. He was always there to motivate me whenever needed. I express my gratitude to the HOD, Mr. Manoj Kumar, for understanding students' needs, and to Mr. Suresh Kumar for his prompt direction. I also express my gratitude to the principal, Prof. Asok De, for providing me the necessary resources and for his worthy suggestions. I owe a deep debt of gratitude to the faculty members, librarians, lab technicians, and friends for their constructive suggestions and for giving me the benefit of their substantial experience.

Rani Srivastava


TABLE OF CONTENTS

A. SYNOPSIS

1. INTRODUCTION
1.1 Objectives
1.2 An Overview of Privacy Preserving Data Mining
1.3 Motivation
1.4 Literature Review
1.5 Problems Identified
1.6 Thesis Outline

2. DATA MINING: AN INTRODUCTION
2.1 What kind of information are we collecting?
2.2 Data Mining and Knowledge Discovery
2.3 Data Mining in Business
2.4 Privacy, Ethics and Data Mining
2.5 Issues in Data Mining

3. PRIVACY PRESERVING DATA MINING
3.1 Confidentiality Issues in Data Mining

4. ANONYMIZATION TECHNIQUES
4.1 Method Based on Data Reduction
4.2 Method Based on Data Perturbation

5. t-CLOSENESS PRIVACY PRESERVING DATA MINING
5.1 k-Anonymity
5.2 l-Diversity
5.3 t-Closeness
5.4 Proposed Work

6. CONCLUSION AND FUTURE RESEARCH
6.1 Conclusion
6.2 Future Research

B. LIST OF TABLES
Table 1, Table 2, Table 3, Table 4

C. LIST OF FIGURES
Figure 1, Figure 2, Figure 3

ABBREVIATIONS

REFERENCES

REPRINT AND PUBLICATIONS

CHAPTER 1

INTRODUCTION

Privacy is an issue that is hotly debated today, and it is likely to remain debated in the future. In the information age, it sometimes seems as if everyone wants to know everything about you. The rapid transfer of personal information has led to the rise of identity theft. Because of privacy concerns, it is likely that data mining will become a well-known topic of discussion within the next ten years. In light of developments in technology to analyze personal data, public concern over privacy is rising. While some believe that statistical and KDDM research is detached from this issue, the debate is certainly gaining momentum as KDDM and statistical tools are more widely adopted by public and private organizations hosting large databases of personal records. One of the key requirements of a data mining project is access to the relevant data. Privacy and security concerns can constrain such access, threatening to derail data mining projects. This area brings together researchers and practitioners to identify problems and solutions where data mining interferes with privacy and security. Our objectives are as follows:

1.1 Objectives

a. Techniques for protecting the confidentiality of sensitive information, including work on statistical databases, and obscuring or restricting data access to prevent violation of privacy and security policies.

b. Underlying methods and techniques to support data mining while respecting privacy and security, e.g., secure multi-party computation.

c. Meaning and measurement of "privacy" in privacy-preserving data mining.

d. Use of data mining results to reconstruct private information, and corporate security in the face of analysis of public data by competitors using KDDM and statistical tools.

e. Use of anonymity techniques to protect privacy in data mining.

1.2 An Overview of Privacy Preserving Data Mining


The significant advances in data collection and data storage technologies have provided the means for the inexpensive storage of enormous amounts of data in data warehouses that reside in companies and public organizations. A data warehouse is a repository of data collected from multiple, often heterogeneous, data sources, intended to be used as a whole under a single unified schema. A data warehouse thus gives the option to analyze data from different sources under the same roof [1]. Apart from operational benefits, such as keeping up-to-date profiles of customers and their purchases and maintaining a list of the available products, their quantities, prices, and so on, the mining of these datasets with existing data mining tools can reveal invaluable knowledge that was unknown to the data holder beforehand. The extracted knowledge patterns can provide insight to the data holders and be invaluable in tasks such as decision making and strategic business planning. Moreover, companies are often willing to collaborate with other entities that conduct similar business, towards the mutual benefit of their businesses. Significant knowledge patterns can be derived and shared among collaborating partners through the aggregate mining of their datasets. Furthermore, public sector organizations and civilian federal agencies usually have to share a portion of their collected data or knowledge with other organizations having a similar purpose, or even make this data and knowledge public. For example, the NIH endorses research that leads to significant findings which improve human health, and it provides a set of guidelines which sanction the sharing of NIH-supported research findings with research institutions. As becomes evident, there exists an extended set of application scenarios in which data, or knowledge derived from the data, has to be shared with other, possibly untrusted, entities. The sharing of data and/or knowledge may come at a cost to privacy, primarily for two reasons:

If the data refers to individuals, as in customers' market basket data (market basket analysis is a common mathematical technique used by marketing professionals to reveal affinities between individual products or product groupings), then the disclosure of this data, or of any knowledge extracted from it, can potentially violate the privacy of the individuals if their identity is revealed to untrusted third parties.

If the data concerns business information, then the disclosure of this data, or of any knowledge extracted from it, may potentially reveal sensitive trade secrets, whose knowledge can provide a significant advantage to business competitors and thus cause the data holder to lose business to his/her peers.


The aforementioned privacy issues in the course of data mining are amplified by the fact that untrusted entities (adversaries) may utilize other external and publicly available sources of information, e.g., public reports, in conjunction with the released data or knowledge in order to reveal protected sensitive information. Since its inception in 2000 with the pioneering work of [2] and [3], privacy preserving data mining has gained increasing popularity in the data mining research community. As a result, a whole new set of approaches was introduced to allow the mining of data while at the same time prohibiting the leakage of any private and sensitive information. The majority of the existing approaches can be classified into two broad categories:

Methodologies that protect the sensitive data itself in the mining process, and

Methodologies that protect the sensitive data mining results (i.e., the extracted knowledge) produced by the application of data mining.

The first category refers to methodologies that apply perturbation, sampling, generalization/suppression, transformation, and similar techniques to the original datasets in order to generate sanitized counterparts that can be safely disclosed to untrusted third parties. The goal of this category of approaches is to enable the data miner to get accurate data mining results when not provided with the real data. As part of this category we highlight Secure Multiparty Computation methodologies, which have been proposed to enable a number of data holders to collectively mine their data without having to reveal their datasets to each other. On the other hand, the second category deals with techniques that prohibit the disclosure of sensitive knowledge patterns derived through the application of data mining algorithms, as well as techniques for downgrading the effectiveness of classifiers in classification tasks so that they do not reveal any sensitive knowledge. In what follows, we further investigate each of these two categories of approaches.

1.2.1 Protecting the Sensitive Data

A wide range of methodologies have been proposed in the research literature to effectively shield the sensitive information contained in a dataset by producing a privacy-aware counterpart that can be safely released. The goal of all these privacy preserving methodologies is to ensure that the sanitized dataset:

(a) properly shields all the sensitive information that was contained in the original dataset;

(b) has properties similar to the original dataset, e.g., first/second order statistics, possibly resembling it to a high extent; and

(c) yields reasonably accurate data mining results when compared to those attained when mining the original dataset.

The protection of sensitive data from disclosure has been extensively studied in the context of micro data release, where methodologies have been proposed for the protection of sensitive information regarding the individuals recorded in a dataset. In micro data we consider each record of the dataset to represent an individual for whom the values of a number of attributes are recorded, e.g., name, date of birth, residence, occupation, salary, etc. Among the complete set of attributes, some attributes explicitly identify the individual, e.g., name, social security number, etc., while others, once combined together or with publicly available external resources, may lead to the identification of the individual, e.g., address, gender, age, etc. The first type of attributes, known as identifiers, must be removed from the data prior to its publishing. The second type of attributes, known as quasi-identifiers, have to be handled by the privacy preservation algorithm in such a way that, in the sanitized dataset, knowledge of their values regarding an individual no longer poses a threat to the identification of his/her identity. The existing methodologies for the protection of sensitive micro data can be partitioned in two directions:

(a) data modification approaches, and

(b) synthetic data generation approaches.

In [4], data modification approaches are further partitioned into perturbative and non-perturbative ones, depending on whether they introduce false information into the attribute values of the data, e.g., by the addition of noise based on a data distribution, or operate by altering the precision of the existing attribute values, e.g., by changing a value to an interval that contains it.
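To make this distinction concrete, here is a minimal Python sketch of both flavors of data modification. The attribute names, bin width, and noise scale are our own illustrative assumptions, not values prescribed by [4].

# Illustrative sketch of the two data-modification flavors; all parameters
# and attribute names are invented for this example.
import random

def generalize_age(age, width=10):
    """Non-perturbative: replace an exact age with the interval containing it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def truncate_zip(zip_code, keep=3):
    """Non-perturbative: coarsen a ZIP code by keeping only a prefix."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def perturb_salary(salary, scale=5000.0):
    """Perturbative: add zero-mean noise, introducing false attribute values."""
    return salary + random.gauss(0.0, scale)

record = {"name": "A", "age": 34, "zip": "110031", "salary": 52000}
sanitized = {
    # the identifier attribute "name" is removed entirely before publishing
    "age": generalize_age(record["age"]),
    "zip": truncate_zip(record["zip"]),
    "salary": round(perturb_salary(record["salary"])),
}
print(sanitized)  # e.g. {'age': '30-39', 'zip': '110***', 'salary': 47812}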

1.2.2 Secure Multiparty Computation

The approaches discussed so far aim at generating a sanitized dataset from the original one, which can be safely shared with untrustworthy third parties as it contains only non-sensitive data. Secure Multiparty Computation (SMC) provides an alternative family of approaches that effectively protect the sensitive data. SMC considers a set of collaborators who wish to collectively mine their data but are unwilling to disclose their own datasets to each other. As it turns out, this distributed privacy preserving data mining problem can be reduced to the secure computation of a function based on distributed inputs, and is thus solved by using cryptographic approaches. [5] elaborates on this close relation between privacy-aware data mining and cryptography. In SMC, each party contributes to the computation of the secure function by providing its private input. A secure cryptographic protocol executed among the collaborating parties ensures that the private input contributed by each party is not disclosed to the others. Most of the applied cryptographic protocols for multi-party computation reduce to a few primitive operations that have to be performed securely: secure sum, secure set union, and secure scalar product; a toy illustration of secure sum is given at the end of this subsection. As a final remark, we should point out that the operation of the secure protocols in the course of distributed privacy preserving data mining depends highly on the distribution of the data across the collaborators' sites. Two types of data distribution have been investigated so far:

1. In a horizontal data distribution, each collaborator holds a number of records, and for each record he/she has knowledge of the same set of attributes as his/her peers.

2. In a vertical partitioning of the data, each collaborator is aware of different attributes referring to the same set of records.
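The following toy Python sketch illustrates the secure sum primitive under a semi-honest model and a ring topology; the modulus and party inputs are our own assumptions, and a real protocol would use cryptographic randomness rather than Python's random module.

import random

def secure_sum(private_inputs, modulus=2**31):
    """The initiator masks the running total with a random offset, so no
    intermediate party can read the partial sums it passes along."""
    mask = random.randrange(modulus)      # known only to the initiator
    running = mask
    for x in private_inputs:              # each party adds its private input
        running = (running + x) % modulus
    return (running - mask) % modulus     # initiator removes the mask

# Three hospitals jointly compute a total patient count without revealing
# their individual counts to one another.
print(secure_sum([120, 45, 300]))         # -> 465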

1.2.3 Protecting the Sensitive Knowledge

In this section, we focus our attention on privacy preserving methodologies that protect the sensitive knowledge patterns that would otherwise be revealed by mining the data [4]. Similarly to the methodologies presented for protecting the sensitive data prior to its mining, the approaches in this category also modify the original dataset, but in such a way that certain sensitive knowledge patterns are suppressed when mining the data. In what follows, we briefly discuss some categories of methodologies that have been proposed for the hiding of sensitive knowledge in the context of association and classification rule mining.


1.2.4 Association Rules Hiding

The association rule mining framework, along with some computationally efficient heuristic methodologies for the generation of association rules, was proposed in the work of [2]. Briefly stated, the goal of association rule mining is to produce a set of interesting and potentially useful rules that hold in a dataset. Whether a rule holds in a dataset is judged based on its statistical significance, quantified with the aid of two measures: confidence and support [5]. All the association rules whose confidence and support are above some user-specified thresholds are thus mined. However, some of these rules may be sensitive from the owner's perspective. The association rule hiding methodologies aim to sanitize the original dataset in such a way that:

1. All the sensitive rules, as indicated by the data holder, that appear when mining the original dataset for association rules do not appear when mining the sanitized dataset at the same or higher levels of support and confidence.

2. All the non-sensitive rules can be successfully mined from the sanitized dataset at the same or higher levels of support and confidence.

3. No rule that was not found when mining the original dataset can be found in its sanitized counterpart, when mining the latter at the same or higher levels of support and confidence.

The first goal simply states that all the sensitive association rules are properly hidden in the sanitized dataset. The hiding of the sensitive knowledge comes at a cost to the utility of the sanitized outcome. The second and third goals aim at minimizing this cost. Specifically, the second goal requires that only the sensitive knowledge is hidden in the sanitized dataset, so that no other, non-sensitive rules are lost as side-effects of the sanitization process. The third goal requires that no artifacts, i.e., false association rules, are generated by the sanitization process. To recapitulate, in association rule hiding the sanitization process has to be accomplished in a way that minimally affects the original dataset, preserves the general patterns and trends of the dataset, and conceals all the sensitive knowledge indicated by the data holder. Association rule hiding has been studied along three principal directions:

(a) heuristic approaches,

(b) border-based approaches, and

(c) exact approaches.

The first direction collects time- and memory-efficient algorithms that heuristically select a portion of the transactions of the original dataset to sanitize, in order to facilitate sensitive knowledge hiding. Due to their efficiency and scalability, these approaches have been investigated by the majority of researchers in the knowledge hiding field of privacy preserving data mining. However, as in all heuristic methodologies, the approaches of this category take locally best decisions when performing knowledge hiding, which may not always be, and usually are not, globally best. As a result, there are several cases in which these methodologies suffer from undesirable side-effects and may fail to identify optimal hiding solutions when such solutions exist. Heuristic approaches can rely on a distortion scheme, i.e., the inclusion/exclusion of items in selected transactions, or on a blocking scheme, i.e., replacing some of the original values in a transaction with question marks; a minimal sketch of a distortion-based heuristic is given at the end of this subsection. The second class of approaches collects methodologies that hide the sensitive knowledge by modifying only a selected portion of the item sets which belong to the border in the lattice of the frequent (i.e., statistically significant) and the infrequent (i.e., statistically insignificant) patterns of the original dataset. In particular, the sensitive knowledge is hidden by enforcing revised borders which accommodate the hiding of the sensitive item sets in the sanitized database. The algorithms in this class differ in the borders they track, as well as in the methodology they apply to enforce the revised borders in the sanitized dataset. Finally, the third class of approaches involves non-heuristic algorithms which conceive the knowledge hiding process as a constraint satisfaction problem, an optimization problem that is solved through the application of integer or linear programming. This class of approaches differs from the previous two primarily in that it collects methodologies that can guarantee optimality of the computed hiding solution, provided that an optimal hiding solution exists, or a very good approximate solution in case an optimal one does not exist. On the negative side, these approaches are usually several orders of magnitude slower than the heuristic ones, especially due to the runtime required for the solution of the constraint satisfaction problem by the integer/linear programming solver.
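The following minimal Python sketch illustrates the distortion-based heuristic mentioned above: it lowers the support of a sensitive itemset below the mining threshold by excluding one of its items from supporting transactions. The transaction database, sensitive itemset, threshold, and naive victim-item choice are all our own assumptions; real algorithms choose victims so as to minimize side-effects on non-sensitive rules.

def support(db, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def hide_itemset(db, sensitive, min_sup):
    """Greedily distort transactions until the sensitive itemset is infrequent."""
    victim = next(iter(sensitive))        # naive choice of item to delete
    for t in db:
        if support(db, sensitive) < min_sup:
            break                         # hidden: stop distorting
        if sensitive <= t:
            t.discard(victim)             # distortion: exclude one item
    return db

db = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}]
print(support(db, {"bread", "milk"}))     # 0.5 before hiding
hide_itemset(db, {"bread", "milk"}, min_sup=0.3)
print(support(db, {"bread", "milk"}))     # 0.25, below the threshold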


1.2.5 Classification Rule Hiding

Privacy-aware classification has been studied to a substantially lower extent than privacy preserving association rule mining. Similarly to association rule hiding, classification rule hiding algorithms consider a set of classification rules as sensitive and proceed to protect them from disclosure by using either suppression-based or reconstruction-based techniques. In suppression-based techniques, the confidence of a classification rule, measured in terms of the owner's belief regarding the holding of the rule given the data, is reduced by distorting a set of attributes in the transactions related to its existence. On the other hand, reconstruction-based approaches aim at reconstructing the dataset by using only those transactions of the original dataset that support the non-sensitive classification rules, thus leaving the sensitive rules unsupported; a minimal sketch of this idea follows.
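A minimal Python sketch of the reconstruction-based idea; the rule representation and records are our own illustrative assumptions. The released dataset is rebuilt from only those records that support non-sensitive rules, so the sensitive rules are left unsupported.

def matches(record, rule):
    """A record supports a rule when it satisfies the condition and the class."""
    condition, label = rule
    return all(record.get(a) == v for a, v in condition.items()) \
        and record["class"] == label

def reconstruct(records, non_sensitive_rules):
    """Release only records that support at least one non-sensitive rule."""
    return [r for r in records
            if any(matches(r, rule) for rule in non_sensitive_rules)]

records = [
    {"age": "30-39", "smoker": "yes", "class": "high-risk"},
    {"age": "20-29", "smoker": "no",  "class": "low-risk"},
]
# The rule (smoker = yes -> high-risk) is sensitive, so it is omitted here.
safe_rules = [({"smoker": "no"}, "low-risk")]
print(reconstruct(records, safe_rules))   # only the second record survives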

1.3 Motivation

Huge databases exist in society today; they include census data, media data, consumer data, data gathered by government agencies, and research data that needs to be published publicly. It is now possible for an adversary to learn a great deal of information about individuals from public data, such as purchasing patterns, family history, medical data, media data, and business trends. This is an age of competitiveness, and every organization, whether governmental or non-governmental, wants to be ahead of the others. Why, then, allow an adversary to obtain useful information from which to make inferences and to share confidential information? Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques.

1.4 Literature Review

It is estimated that more than half of the data collected by government and non-government organizations is confidential, so the risks associated with data breaches are considerable. It is therefore a big question how to publish data publicly for the purpose of data mining without compromising individuals' privacy. There are many models for privacy preserving data mining, such as the Injector model, the Anatomy model, and masking. But these models do not work under every condition. Two models, [6] and [7], perform better than these, though they too have shortcomings; enhancing the first model, however, removes those shortcomings.

1.4.1 A.

In the paper titled "Providing k-anonymity in data mining" [6], extended definitions of k-anonymity were presented and used to show that a given data mining model does not violate the k-anonymity of the individuals represented in the learning examples. It shows that the model can be applied to various data mining problems, such as classification, association rule mining, and clustering. In the k-anonymity method, the values of the quasi-identifier (QI) attributes of each tuple in a table are identical to those of at least (k-1) other tuples. The larger the value of k, the greater the implied privacy, since no individual can be identified with probability exceeding 1/k through a linking attack. The process of k-anonymization involves data generalization and cell value suppression. The k-anonymity model makes two major assumptions:

1. The database owner is able to separate the columns of the table into a set of quasi-identifiers, which are attributes that may appear in external tables the database owner does not control, and a set of private columns, whose values need to be protected. We prefer to term these two sets public attributes and private attributes, respectively.

2. The attacker has full knowledge of the public attribute values of individuals, and no knowledge of their private data. The attacker only performs linking attacks. A linking attack is executed by taking external tables containing the identities of individuals and some or all of the public attributes. When the public attributes of an individual match the public attributes that appear in a row of a table released by the database owner, we say that the individual is linked to that row; specifically, the individual is linked to the private attribute values that appear in that row. A linking attack succeeds if the attacker is able to match the identity of an individual against the value of a private attribute. A minimal check of the k-anonymity condition is sketched below.
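The following minimal Python check (the table contents and quasi-identifier columns are illustrative assumptions) verifies that every combination of public attribute values occurs in at least k rows:

from collections import Counter

def is_k_anonymous(table, qi_columns, k):
    """True if each quasi-identifier value combination appears >= k times."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in table)
    return all(n >= k for n in counts.values())

table = [
    {"age": "30-39", "zip": "110***", "disease": "flu"},
    {"age": "30-39", "zip": "110***", "disease": "gastritis"},
    {"age": "30-39", "zip": "110***", "disease": "flu"},
]
print(is_k_anonymous(table, ["age", "zip"], k=3))  # True: one class of size 3
print(is_k_anonymous(table, ["age", "zip"], k=4))  # False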


1.4.2 B.

In this paper, a novel privacy preserving algorithm was developed that overcomes all the above problems [7]. The core of the solution is the concept of transforming a part of the quasi-identifiers and personalizing the sensitive information so that privacy preserving micro data can be released with less information loss. This concept introduces three different questions to be answered:

First, what is the transformation to be done on the QI, and why?

Second, how is personalized privacy to be introduced into the micro data?

And third, how do we prove that the information loss is less? By answering all these queries we formalize an algorithm which preserves privacy in real data sets.

A set of quasi-identifier attributes Aq = {Ax, ..., Ay} is a subset of {A1, A2, ..., Am} whose values are able to fetch a unique record in the table T(A1, A2, ..., Am). This property is what leads to the problem of the linking attack. In simple terms, the QI attributes form a candidate key (minimal super key) in a data table. All the attributes in a database table, e.g., a Patient table, may be classified into four categories: identifying attributes Ai, e.g., Name; sensitive attributes As, e.g., Disease; neutral attributes An, e.g., Length of stay; and quasi-identifier attributes Aq, e.g., Age, Gender, Zip. To preserve privacy, identifying attributes are not published. Here we take the assumption that one of the QI members is of numeric data type, and the value of that attribute is transformed into a fuzzy membership value. If the actual value of any of the QI member attributes is not known, a unique record cannot be identified; thereby, the linking attack problem is solved. The values of the sensitive attribute As may be confidential for an individual and are therefore to be published according to his/her preference. For each sensitive attribute the user is allowed to set the Boolean values PL (privacy level) and DL (disclosure level). If the individual is willing to disclose his actual information, he/she sets both PL and DL to True, and no transformation is done. If he does not mind being linked, with less probability, to another sibling value of the taxonomy tree, he sets PL to True and DL to False; his sensitive attribute value is then replaced with its generalized ancestor value followed by an arbitrary value, to differentiate among the sibling values. A sketch of the fuzzy transformation of a numeric QI attribute follows.
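The following Python sketch illustrates the fuzzy-based transformation of a numeric quasi-identifier described above: an exact age is replaced by a membership value in one of k triangular fuzzy sets plus that set's category number. The set boundaries are our own illustrative assumptions, not values prescribed by [7].

def triangular(x, lo, mid, hi):
    """Membership of x in a triangular fuzzy set defined by (lo, mid, hi)."""
    if x <= lo or x >= hi:
        return 0.0
    return (x - lo) / (mid - lo) if x <= mid else (hi - x) / (hi - mid)

# Three fuzzy sets for age, playing the role of the linguistic terms
# "young", "middle-aged", and "old" (boundaries are invented).
FUZZY_SETS = [(0, 15, 30), (25, 45, 65), (60, 75, 90)]

def transform(age):
    """Replace the actual age with (membership value, category number)."""
    category, (lo, mid, hi) = max(
        enumerate(FUZZY_SETS), key=lambda s: triangular(age, *s[1]))
    return round(triangular(age, lo, mid, hi), 2), category

print(transform(34))  # (0.45, 1): "middle-aged" category; exact age is hidden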

1.5 Problems Identified

Since k-anonymity does not put any restriction on sensitive attributes, it faces the homogeneity attack and the background knowledge attack. In other words, k-anonymity creates groups which leak information due to lack of diversity within the group. This limitation is overcome by the l-diversity principle. But the l-diversity principle suffers from the similarity attack and the skewness attack, so this problem can be overcome by t-closeness; a worked check of the t-closeness requirement is sketched at the end of this section. Also, in (B), the following assumptions can create problems.

a. It is assumed that one of the quasi-identifiers is numeric and is subjected to a fuzzy-based transformation. In the fuzzy-based transformation, the number of fuzzy sets is calculated to be equal to the number of linguistic terms, by deciding the size of the fuzzy sets (k) and the min, max, and mid-point of each fuzzy set m1, ..., mk. The actual value x is transformed using the functions f1(x), f2(x), ..., fk(x), and is replaced with the transformed value xn plus the category number. If the actual value of any of the quasi-identifier member attributes is not known, a unique record cannot be identified; thereby, the linking attack problem is solved. But what happens if the quasi-identifier has no numeric attribute, as in Netflix movie ratings? Then this approach cannot solve the linking problem.

b. Identifier attributes are not disclosed and, if present, are replaced by an auto-generated id number. This leads to loss of truthfulness of the information as well as information loss.

c. For a categorical sensitive attribute, the transformation is performed using a mapping table prepared with domain knowledge, considering the privacy level (PL) and disclosure level (DL) set by the user. If an individual is willing to disclose his information, his PL and DL are set to True and no transformation is done; if he does not mind being linked to another value with less probability, PL is set to True and DL to False, and his sensitive value is replaced by the generalized value plus an arbitrary value; but if the privacy level is False, the value is replaced by the overall general value plus an arbitrary value. If both the disclosure level and the privacy level are True, there is no problem for research, because the true data is available; but there is a privacy breach if the linking problem above cannot be solved, since attribute disclosure is then not prevented. And if no one wants to disclose his sensitive information, how will research and similar uses take place?
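As a worked illustration of the t-closeness requirement discussed above, the following Python sketch computes the EMD between the sensitive-attribute distribution of an equivalence class and that of the whole table. With equal ground distance between categories, the EMD reduces to half the total variation (L1) distance; the disease values and the threshold t = 0.25 are invented for this example.

from collections import Counter

def distribution(values, domain):
    """Relative frequency of each domain value in the given list."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]

def emd_equal_distance(p, q):
    """EMD under equal ground distances: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

domain = ["flu", "gastritis", "cancer"]
whole_table = ["flu"] * 5 + ["gastritis"] * 3 + ["cancer"] * 2
equivalence_class = ["flu", "flu", "cancer"]

p = distribution(equivalence_class, domain)   # class distribution
q = distribution(whole_table, domain)         # overall distribution
d = emd_equal_distance(p, q)                  # 0.3 for these invented values
print(d, d <= 0.25)  # the class satisfies t-closeness only if d <= t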

1.6 Thesis Outline

1. In chapter 1 we present an overview of privacy preserving data mining, survey the literature related to our method, and identify problems.

2. In chapter 2 we will study what data mining is, what kinds of information can be mined, the steps in the knowledge discovery process, the privacy and ethics issues in data mining, and the controversial issues in data mining.

3. In chapter 3 we will study what privacy preserving data mining is and the different dimensions of privacy preserving data mining, and study in detail the confidentiality issues in data mining.

4. In chapter 4 we will study the categories of statistical disclosure limitation methods and their divisions.

5. In chapter 5 we will study which anonymity methods combine to make the group anonymity method and how we apply it to privacy preserving data mining.

6. Chapter 6 presents the conclusion and future research.

At the end, the abbreviations and references are included.

In this chapter we presented an overview of privacy preserving data mining, surveyed the literature related to our method, and identified problems. In the next chapter we will study what data mining is, what kinds of information can be mined, the steps in the knowledge discovery process, the privacy and ethics issues in data mining, and the controversial issues in data mining.


CHAPTER 2

DATA MINING: AN INTRODUCTION

Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on different structures very rapidly became overwhelming [1]. This initial chaos led to the creation of structured databases and database management systems (DBMS). Efficient database management systems have been very important assets for the management of large corpora of data and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of database management systems has also contributed to the recent massive gathering of all sorts of information. Today, we have far more information than we can handle: from business transactions and scientific data to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making. Confronted with huge collections of data, we have now created new needs to help us make better managerial choices. These needs are automatic summarization of data, extraction of the "essence" of the information stored, and the discovery of patterns in raw data.

2.1 What kind of information are we collecting?

We have been collecting a myriad of data, from simple numerical measurements and text documents to more complex information such as spatial data, multimedia channels and hypertext documents. Here is a non-exclusive list of the variety of information collected in digital form in databases and flat files.

2.1.1 Business transactions:

Every transaction in the business industry is often "memorized" for perpetuity. Such transactions are usually time-related and can be inter-business deals such as purchases, exchanges, banking, stock, etc., or intra-business operations such as the management of in-house wares and assets. Large department stores, for example, thanks to the widespread use of bar codes, store millions of transactions daily, often representing terabytes of data. Storage space is not the major problem, as the price of hard disks is continuously dropping, but the effective use of the data in a reasonable time frame for competitive decision-making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.

2.1.2 Scientific data:


Whether in a Swiss nuclear accelerator laboratory counting particles, in the Canadian forest

studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about

oceanic activity, or in an American university investigating human psychology, our society is

amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, we can

capture and store more new data faster than we can analyze the old data already accumulated.

2.1.3 Medical and personal data:

From government censuses to personnel and customer files, very large collections of information are continuously gathered about individuals and groups. Governments, companies and organizations such as hospitals are stockpiling very important quantities of personal data to help them manage human resources, better understand a market, or simply assist clientele. Regardless of the privacy issues this type of data often raises, this information is collected, used and even shared. When correlated with other data, this information can shed light on customer behavior and the like.

2.1.4 Surveillance video and pictures:

With the amazing collapse of video camera prices, video cameras are becoming ubiquitous.

Video tapes from surveillance cameras are usually recycled and thus the content is lost.

However, there is a tendency today to store the tapes and even digitize them for future use and

analysis.

2.1.5 Satellite sensing:

There is a countless number of satellites around the globe: some are geo-stationary above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the surface. NASA, which controls a large number of satellites, receives more data every second than all NASA researchers and engineers can cope with. Many satellite pictures and data


are made public as soon as they are received in the hopes that other researchers can analyze

them.

2.1.6 Games:

Our society is collecting a tremendous amount of data and statistics about games, players and athletes. From hockey scores, basketball passes and car-racing laps to swimming times, boxers' punches and chess positions, all this data is stored. Commentators and journalists use this information for reporting, but trainers and athletes want to exploit this data to improve performance and better understand their opponents.

2.1.7 Digital media:

The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the

causes of the explosion in digital media repositories. In addition, many radio stations, television

channels and film studios are digitizing their audio and video collections to improve the

management of their multimedia assets. Associations such as the NHL and the NBA have

already started converting their huge game collection into digital forms.

2.1.8 CAD and Software engineering data:

There are a multitude of CAD systems for architects to design buildings or engineers to

conceive system components or circuits. These systems are generating a tremendous amount of

data. Moreover, software engineering is a source of considerable similar data with code, function

libraries, objects, etc., which need powerful tools for management and maintenance.

2.1.9 Virtual Worlds:

There are many applications making use of three-dimensional virtual spaces. These spaces and

the objects they contain are described with special languages such as VRML. Ideally, these

virtual spaces are described in such a way that they can share objects and places. There is a

remarkable amount of virtual reality object and space repositories available. Management of


these repositories as well as content-based search and retrieval from these repositories are still

research issues, while the size of the collections continues to grow.

2.1.10 Text reports and memos like e-mail messages:

Most of the communications within and between companies, research organizations, or even private people are based on reports and memos in textual form, often exchanged by e-mail. These messages are regularly stored in digital form for future use and reference, creating formidable digital libraries.

2.1.11 The World Wide Web repositories:

Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content and description have been collected and inter-connected with hyperlinks, making it the largest repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous characteristics, and its frequent redundancy and inconsistency, the World Wide Web is the most important data collection regularly used for reference, because of the broad variety of topics covered and the infinite contributions of resources and publishers. Many believe that the World Wide Web will become the compilation of human knowledge [1].

2.2 Data Mining and Knowledge Discovery

Data mining, also popularly known as knowledge discovery in databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and KDD are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. The following figure shows data mining as a step in an iterative knowledge discovery process.

[Figure: the stages of the knowledge discovery process: Databases, with Data Cleaning and Data Integration, feed a Data Warehouse; Selection and Transformation produce Task-relevant Data; Data Mining and Pattern Evaluation yield Knowledge.]

Figure 1. Data Mining is the core of the Knowledge Discovery process [1]

2.3 Data Mining in Business

Through the use of data mining techniques, businesses are discovering new trends and patterns of behavior that previously went unnoticed. Once they have uncovered this vital intelligence, it can be used in a predictive manner for a variety of applications. Data mining can contribute significantly to customer relationship management applications. Rather than randomly contacting customers through a call center or by mail, a company can concentrate its efforts on the customers predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns, so that one may predict which channel and which offer an individual is most likely to respond to, across all potential offers. Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as the universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favor silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. This example deals with association rules within transaction-based data.

2.4 Privacy, Ethics and Data Mining

Absolute privacy is not possible to achieve in the information age, because individuals disseminate data in their common daily activities such as web browsing, e-commerce and e-government dealings, e-mail and mobile phone communications, credit card and ATM transactions, etc. Major data sources include server and cookie logs, customer information, intelligent Internet agents, and centralized demographic and other official records. Powerful data processing, storage and communication technologies allow these data to be manipulated and used by other people and agencies freely and, in most cases, indiscreetly. The reason is that data mining does not supply information about the social and ethical consequences of its results, nor does it necessarily discover causal relationships; therefore how and where to use the results is largely a choice of the "miner". This mechanism is inevitably prone to have negative impacts on privacy and individual rights. In any data processing system, it is fair to expect that sensitive personal data will be protected from misuse and abuse by outside entities. The position of data mining is even more critical in this respect because of its power and potential. Naturally, every individual should have control over his or her personally sensitive data. But what constitutes "personal" or "sensitive"? What are the limits? Of course, these are questions whose answers may differ from individual to individual. Another important problem arises when individuals' rights and the public interest contradict each other. In most cases, public or business oriented data mining applications are claimed to produce beneficial outcomes for individuals, organizations and society. Unfortunately, it is very difficult to impose universally acceptable guidelines and rules for distinguishing right from wrong, because data mining applications are generally open ended and they can as well lead to unpredictable and possibly harmful personal results. For example, very detailed data about buying habits, times and locations are extracted from customer transactions and data mined regularly by most supermarket chains. It is impossible for an average customer to be knowledgeable about the possible uses of sensitive data and their consequences. Data mining for crime prevention is based on the records of activities of individuals, such as travel, electronic and phone communications, shopping and encounters with other people. Data mining in these applications can produce inaccurate and faulty results, which usually constitute breaches of privacy and can be harmful to the individuals. For example, a person may be classified into a "suspect" short list while he or she has nothing to do with the affair at hand. The consequences could be serious in such a case, and the whole process is certainly unethical, if not illegal. The negative impacts, such as litigation, adverse publicity, loss of reputation and discrimination, could be further aggravated if the data are unreliable, faulty or even fabricated. This is where an inherently unethical process could also turn into an "illegal" one. The legal issues related to data mining applications are complex and difficult to evaluate and, as such, cannot be put into the framework of any particular law. Besides, the challenges of the information age have not yet been resolved properly by legal doctrines and by the legal systems of countries. Most countries choose to strengthen the privacy of personal information through a specific law. But such a law cannot be expected to foresee all possible violations and types of offences that might come about. In practice, legal court cases are usually resolved by applying some other available code or legislation which aims to protect the individual's rights in general. The major problem here is the growth rate of information and information technologies, which render obsolete the rules, regulations and even laws within relatively short time spans. In many situations, it is very difficult to decide what constitutes "legal" or "private", let alone to establish consistent court rules. The main reason is that most physical limitations of hardware and software technologies lose their meaning rapidly, and newer and more versatile ones replace them. This phenomenon influences application methodologies and environments, including data mining. Various privacy preservation methods and measures aim to find solutions to these kinds of legal and ethical problems by reducing their likelihood of occurrence in data mining applications.

2.5 Issues in Data Mining

Before data mining develops into a conventional, mature and trusted discipline, many issues have to be addressed. Some of these issues are addressed below.

2.5.1 Security and social issues:

Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. When data is collected for customer profiling, understanding user behavior, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information.

2.5.2 User interface issues:

Data visualization simplifies the interpretation of data mining results and helps users better understand their needs. There are many visualization ideas and proposals for effective graphical presentation of data. However, much research is still needed to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge.

2.5.3 Mining methodology issues:

These issues concern the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples of factors that can dictate mining methodology choices.


2.5.4 Performance issues:

Many statistical methods exist for data analysis, but these methods were not designed for the very large data sets data mining deals with today. This raises the issues of scalability and efficiency of data mining methods when processing considerably large data.

2.5.5 Data source issues:

There are many issues related to data sources, such as the diversity of data. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic.

In this chapter we studied what data mining is, what kinds of information can be mined, the steps in the knowledge discovery process, the privacy and ethics issues in data mining, and the controversial issues in data mining. In the next chapter we will study what privacy preserving data mining is, the different dimensions of privacy preserving data mining, and, in detail, the confidentiality issues in data mining.


CHAPTER 3

PRIVACY PRESERVING DATA MINING

The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users and the increasing ability of data mining algorithms to expose this information. It is estimated that more than 50 percent of the data stored in databases could be classified as confidential. With IT organizations collecting and storing more data, and more sensitive data, for longer periods of time, there are growing risks associated with data breaches. Privacy-preserving data mining (PPDM), a relatively new research area, is focused on preventing privacy violations that might arise during data mining operations [8]. In pursuing this goal, PPDM algorithms modify original datasets in order to preserve privacy even after the mining process is activated. The aim is to ensure minimal data loss and to obtain qualitative data mining results. PPDM approaches can be described along five dimensions:

1. Data distribution: whether the data is centralized or distributed.

2. Data modification: the modification technique used to transform the data values.

3. Data mining algorithm: the data mining task to which the approach is applied.

4. Data or rule hiding: whether raw data or aggregated data should be hidden.

5. Privacy preservation: the type of selective modification performed on the data as part of the PPDM technique: heuristic-based, cryptography-based, or reconstruction-based.

3.1 Confidentiality issues in data mining.

A key problem that arises in any large collection of data is that of confidentiality. The need for privacy is sometimes due to law, e.g., for medical databases, or can be motivated by business interests. However, there are situations where the sharing of data can lead to mutual gain. A key utility of large databases today is research, whether scientific or economic and market oriented. Thus, for example, the medical field has much to gain by pooling data for research [8], as do even competing businesses with mutual interests. Despite the potential gain, this is often not possible due to the confidentiality issues which arise. We address this question and show that highly efficient solutions are possible. Our scenario is the following:

Let P1 and P2 be parties owning large private databases D1 and D2. The parties wish to apply a data-mining algorithm to the joint database D1 ∪ D2 without revealing any unnecessary information about their individual databases. That is, the only information learned by P1 about D2 is that which can be learned from the output of the data mining algorithm, and vice versa. We do not assume any "trusted" third party who computes the joint output.

3.1.1 Semi-honest adversaries.


In any multi-party computation setting, a malicious adversary can always alter its input. In the data-mining setting, this fact can be very damaging, since the adversary can define its input to be the empty database; then the output obtained is the result of the algorithm on the other party's database alone. Although this attack cannot be prevented, we would like to prevent a malicious party from executing any other attack. However, for this initial work we assume that the adversary is semi-honest, also known as passive: it correctly follows the protocol specification, yet attempts to learn additional information by analyzing the transcript of messages received during the execution. The key directions in the field of privacy-preserving data mining are as follows:

3.1.2 Privacy-Preserving Data Publishing:

These techniques study different transformation methods associated with privacy. A related issue is how the perturbed data can be used with classical data mining methods such as association rule mining. Other related problems include determining privacy-preserving methods that keep the underlying data useful, and studying the different definitions of privacy and how they compare in terms of effectiveness in different scenarios.

3.1.3 Changing the results of data mining applications to preserve privacy:

There are many cases in which the results of data mining applications, such as association rule or classification rule mining, can compromise the privacy of the data. This has given rise to a field of privacy in which data mining algorithms, such as association rule mining, are modified in order to preserve the privacy of the data, for example association rule hiding methods.

3.1.4 Query Auditing

Here, we are either modifying or restricting the results of queries.

3.1.5 Cryptographic Methods for Distributed Privacy:

There are many cases in which data is distributed across multiple sites, and the owners of the data may want to compute a common function. Cryptographic methods make this function computation possible without revealing sensitive information.

SMC for privacy-preserving data mining: privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. There are two classic settings for privacy-preserving data mining, although these are by no means the only ones. In the first, the data is divided amongst two or more different parties, and the aim is to run a data mining algorithm on the union of the parties' databases without allowing any party to view anyone else's private data. In the second, some statistical data that is to be released (so that it can be used for research using statistics and/or data mining) may contain confidential data, and so it is first modified so that

(a) the data does not compromise anyone's privacy, and

(b) it is still possible to obtain meaningful results by running data mining algorithms on the modified data set.

A classical example of a privacy-preserving data mining problem of the first type comes from the field of medical research. Consider the case where a number of different hospitals wish to jointly mine their patient data for the purpose of medical research. Furthermore, let us assume that privacy policy and law prevent these hospitals from ever pooling their data or revealing it to each other, due to the confidentiality of patient records. In such a case, classical data mining solutions cannot be used. Rather, it is necessary to find a solution that enables the hospitals to compute the desired data mining algorithm on the union of their databases without ever pooling or revealing their data. Privacy-preserving data mining solutions have the property that the only information provably learned by the different hospitals is the output of the data mining algorithm. This problem, whereby different organizations cannot directly share or pool their databases but must nevertheless carry out joint research via data mining, is quite common. For example, consider the interaction between different intelligence agencies. For security purposes, these agencies cannot allow each other free access to their confidential information; if they did, then a single mole in a single agency would have access to an overwhelming number of sources. Nevertheless, as we all know, homeland security also mandates the sharing of information! It is much more likely that suspicious behavior will be detected if the different agencies are able to run data mining algorithms on their combined data.

3.1.6 Secure computation and privacy-preserving data mining.


There are two distinct problems that arise in the setting of privacy-preserving data mining. The

first is to decide which functions can be safely computed, where safety means that the privacy of

individuals is preserved. For example, is it safe to compute a decision tree on confidential

medical data in a hospital, and publicize the resulting tree? For the most part, we will assume

that the result of the data mining algorithm is either safe or deemed essential. Thus, the question

becomes how to compute the results while minimizing the damage to privacy. For example, it is

always possible to pool all of the data in one place and run the data mining algorithm on the

pooled data. However, this is exactly what we don't want to do: hospitals are not allowed to hand

their raw data out, security agencies cannot afford the risk, and governments risk citizen outcry if

they do. Thus, the question we address is how to compute the results without pooling the data,

and in a way that reveals nothing but the final results of the data mining computation. This

question of privacy-preserving data mining is actually a special case of a long-studied problem in

cryptography called secure multiparty computation. This problem deals with a setting where a set

of parties with private inputs wish to jointly compute some function of their inputs. Loosely

speaking, this joint computation should have the property that the parties learn the correct output

and nothing else, even if some of the parties maliciously collude to obtain more information.

Clearly, a protocol that provides this guarantee can be used to solve privacy-preserving data

mining problems of the type discussed above.
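
To make the idea concrete, the following is a minimal sketch of the classic secure-sum protocol, a standard textbook illustration of secure multiparty computation rather than the protocol of any specific system; the party values are hypothetical. Each party splits its private value into random additive shares, so that only the total is ever revealed.

    import random

    MOD = 2**64  # all arithmetic is done modulo a large constant

    def secure_sum(private_values):
        """Each party splits its value into random additive shares, so no
        single share (or partial sum of shares) reveals the value."""
        n = len(private_values)
        shares = []
        for v in private_values:
            # n-1 random shares plus one correcting share that sum to v (mod MOD)
            r = [random.randrange(MOD) for _ in range(n - 1)]
            r.append((v - sum(r)) % MOD)
            shares.append(r)
        # party j only ever sees the j-th share from every other party
        partial = [sum(shares[i][j] for i in range(n)) % MOD for j in range(n)]
        return sum(partial) % MOD

    # three hypothetical hospitals jointly compute their total patient count
    print(secure_sum([120, 340, 95]))  # -> 555, individual counts stay hidden

The key property is that each party sees only random-looking shares, yet the shares of all parties together reconstruct exactly the desired sum and nothing else.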

3.1.7 Privacy-Preserving Data Mining Algorithms:

In recent years, data mining has been viewed as a threat to privacy because of the widespread proliferation of electronic data maintained by corporations. This has led to increased concerns about the privacy of the underlying data. A number of techniques have therefore been proposed for modifying or transforming the data in such a way as to preserve privacy; surveys of these techniques may be found in the literature. Most methods for privacy computations use some form of transformation on the data in order to perform the privacy preservation. Typically, such methods reduce the granularity of representation in order to protect privacy. This reduction in granularity results in some loss of effectiveness of data management or mining algorithms; this is the natural trade-off between information loss and privacy.

The randomization method: The randomization method is a technique for privacy-preserving data mining in which noise is added to the data in order to mask the attribute values of records. The noise added is sufficiently large that individual record values cannot be recovered.

The k-anonymity model and l-diversity: The k-anonymity model was developed because of the possibility of indirect identification of records from public databases, since combinations of record attributes can be used to exactly identify individual records. In the k-anonymity method, we reduce the granularity of data representation with the use of techniques such as generalization and suppression. The granularity is reduced sufficiently that any given record maps onto at least k other records in the data. The l-diversity model was designed to handle some weaknesses in the k-anonymity model, since protecting identities to the level of k individuals is not the same as protecting the corresponding sensitive values, especially when there is homogeneity of sensitive values within a group.
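
As an illustration of the randomization method described above, the following is a minimal sketch; the attribute, the noise scale, and the data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(42)

    def randomize(values, sigma):
        """Mask each value by adding independent zero-mean Gaussian noise."""
        return values + rng.normal(loc=0.0, scale=sigma, size=len(values))

    ages = np.array([28, 29, 21, 23, 50, 55, 47, 49], dtype=float)
    noisy = randomize(ages, sigma=10.0)
    # individual values are masked, but aggregates are roughly preserved
    print("true mean:", ages.mean(), "noisy mean:", round(noisy.mean(), 1))

Because the noise is independent per record, aggregate statistics remain approximately correct while individual values are hidden, which is exactly the property the randomization framework relies on.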

3.1.8 Distributed privacy preservation:

In many cases, individual entities may wish to derive aggregate results from data sets which are partitioned across these entities. Such partitioning may be horizontal (when the records are distributed across multiple entities) or vertical (when the attributes are distributed across multiple entities).

Downgrading application effectiveness: In many cases, even though the data may not be available, the output of applications such as association rule mining, classification or query processing may result in violations of privacy. This has led to research in downgrading the effectiveness of applications by either data or application modifications; examples of such techniques include association rule hiding.

In this chapter we studied what privacy preserving data mining is, the different dimensions of privacy preserving data mining, and the confidentiality issues in data mining in detail. In the next chapter we will study the categories of statistical disclosure limitation methods and their divisions.


CHAPTER 4

ANONYMIZATION TECHNIQUES

Anonymization is the process of making data publicly available without compromising individual privacy. One of the functions of a federal statistical agency is to collect individually sensitive data, process it and provide statistical summaries and/or public-use micro-data files to the public. Some of the data collected are considered proprietary by respondents. On the other hand, not all data collected and published by the government are subject to disclosure limitation techniques. Some data on businesses that are collected for regulatory purposes are considered public. In addition, some data are not considered sensitive and are not collected under a pledge of confidentiality. The statistical disclosure limitation techniques described here apply wherever confidentiality is required and data or estimates are made publicly available. All disclosure limitation methods result in some loss of information, and sometimes the publicly available data may not be adequate for certain statistical studies. However, the intention is to provide as much data as possible without revealing individually sensitive data. Statistical disclosure limitation methods can be classified in two categories [9]:

4.1 Methods based on data reduction. Such methods aim at increasing the number of individuals in the sample/population sharing the same or similar identifying characteristics as the investigated statistical unit. Such procedures tend to avoid the presence of unique or rare recognizable individuals.

4.2 Methods based on data perturbation. Such methods achieve data protection from a twofold perspective. First, if the data are modified, re-identification by means of record linkage or matching algorithms is harder and uncertain. Secondly, even when an intruder is able to re-identify a unit, he or she cannot be confident that the disclosed data are consistent with the original data.

4.1 Methods based on data reduction

4.1.1 Removing variables

The first obvious application of this method is the removal of direct identifiers from the data file.

A variable should be removed when it is highly identifying and no other protection methods can

be applied. A variable can also be removed when it is too sensitive for public use or irrelevant for analytical purposes.

4.1.2 Removing records

Removing records can be adopted as an extreme measure of data protection when the unit is

identifiable in spite of the application of other protection techniques. For example, in an

enterprise survey dataset, a given enterprise may be the only one belonging to a specific industry.

In this case, it may be preferable to remove this particular record rather than removing the


variable "industry" from all records. Since it largely impacts the statistical properties of the

released data, removing records has to be avoided as much as possible.

4.1.3 Global recoding

The global recoding method consists in aggregating the values observed in a variable into pre-defined classes: for example, recoding age into five-year age groups, or the number of employees into three size classes (small, medium and large). The method applies to numerical variables, continuous or discrete, and affects all records in the data file.
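
A minimal sketch of global recoding using pandas; the data and the five-year bin edges are illustrative assumptions.

    import pandas as pd

    df = pd.DataFrame({"age": [21, 23, 28, 29, 47, 49, 50, 55]})
    # recode exact ages into five-year groups, applied to every record
    df["age_group"] = pd.cut(df["age"], bins=range(20, 61, 5), right=False)
    print(df)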

4.1.4 Top and bottom coding

Top and bottom coding can be referred to as a special case of global recoding that can be applied

to numerical or ordinal categorical variables. The variables "Salary" and "Age" are two typical

examples. The highest values of these variables are usually very rare and therefore identifiable.

Top coding at certain thresholds introduces new categories such as "monthly salary higher than

6000 dollars" or "age higher than 75", leaving unchanged the other observed values. The same

reasoning applied to the smaller observed values defines bottom coding. When dealing with

ordinal categorical variables, a top or bottom category is defined by aggregating the "highest" or

"smallest" categories.

4.1.5 Local suppression

Local suppression consists in replacing the observed value of one or more variables in a certain

record with a missing value. Local suppression is particularly suitable for categorical key variables, especially when combinations of scores on such variables are at stake. In this case, local suppression consists in replacing an observed value in a rare combination with a missing value; the aim of the method is to reduce the information content of rare combinations.
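
A minimal sketch of local suppression; the frequency threshold, the data, and the choice of which variable to suppress are illustrative assumptions.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"sex":      ["F", "F", "M", "M", "F"],
                       "industry": ["retail", "retail", "mining", "retail", "mining"]})
    # count how often each (sex, industry) combination occurs
    counts = df.groupby(["sex", "industry"])["sex"].transform("size")
    # suppress 'industry' in records belonging to rare combinations (< 2 occurrences)
    df.loc[counts < 2, "industry"] = np.nan
    print(df)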

4.2 Methods based on data perturbation

4.2.1 Micro-aggregation

Micro-aggregation is a perturbation technique first proposed by Eurostat as a statistical disclosure method for numerical variables. The idea is to replace an observed value with the average computed on a small group of units (a small aggregate or micro-aggregate) that includes the investigated one. The units belonging to the same group will be represented in the released file by the same value. The groups contain a minimum predefined number k of units; the minimum accepted value of k is 3. For a given k, the issue consists in determining the partition of the whole set of units into groups of at least k units (a k-partition) that minimizes the information loss, usually expressed as a loss of variability. Therefore, the groups are constructed according to a criterion of maximum similarity between units. The micro-aggregation mechanism achieves data protection by ensuring that there are at least k units with the same value in the data file.

When micro-aggregation is independently applied to a set of variables, the method is

called individual ranking. When all the variables are averaged at the same time for each group,

the method is called multivariate micro-aggregation.
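
A minimal sketch of individual-ranking micro-aggregation for a single numerical variable, assuming k = 3 and illustrative data; the function name is our own.

    import numpy as np

    def microaggregate(values, k=3):
        """Replace each value with the mean of its group of >= k nearest-ranked units."""
        order = np.argsort(values)
        out = np.empty_like(values, dtype=float)
        for start in range(0, len(values), k):
            group = order[start:start + k]
            # merge a too-small trailing group into the previous one
            if len(group) < k and start > 0:
                group = order[start - k:]
                out[group] = values[group].mean()
                break
            out[group] = values[group].mean()
        return out

    salaries = np.array([3, 5, 9, 6, 11, 8, 4, 7, 10], dtype=float)
    print(microaggregate(salaries, k=3))  # groups {3,4,5}, {6,7,8}, {9,10,11}

With k = 3 the nine sorted salaries form three groups whose means (4, 7 and 10) replace the original values, so every released value is shared by at least three units.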

4.2.2 Data swapping

Data swapping was initially proposed as a perturbation technique for categorical micro-data, aimed at protecting tabulations stemming from the perturbed micro-data file. Data swapping consists in altering a proportion of the records in a file by swapping the values of a subset of variables between selected pairs of records (swap pairs).
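
A minimal sketch of data swapping; the swapped variable, the proportion of records, and the data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def swap_values(column, proportion=0.4):
        """Swap the values of `column` between randomly selected pairs of records."""
        col = column.copy()
        n_pairs = int(len(col) * proportion) // 2
        idx = rng.choice(len(col), size=2 * n_pairs, replace=False)
        for a, b in idx.reshape(-1, 2):
            col[a], col[b] = col[b], col[a]
        return col

    zips = np.array([13053, 13068, 14853, 14850, 13053, 13068])
    print(swap_values(zips))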

4.2.3 PRAM

As a statistical disclosure control technique, PRAM induces uncertainty in the values of some

variables by exchanging them according to a probabilistic mechanism. PRAM can therefore be

considered as a randomized version of data swapping.
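
A minimal sketch of PRAM for one categorical variable; the categories and the transition matrix are illustrative assumptions. Each observed category is kept or exchanged for another according to a row-stochastic transition matrix.

    import numpy as np

    rng = np.random.default_rng(1)

    categories = ["employed", "unemployed", "retired"]
    # row i gives the probabilities of reporting each category when the true one is i
    transition = np.array([[0.90, 0.05, 0.05],
                           [0.10, 0.85, 0.05],
                           [0.05, 0.05, 0.90]])

    def pram(values):
        idx = [categories.index(v) for v in values]
        return [categories[rng.choice(3, p=transition[i])] for i in idx]

    print(pram(["employed", "retired", "unemployed", "employed"]))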

4.2.4 Adding noise

Adding noise consists in adding a random value ε, with zero mean and predefined variance σ², to all values in the variable to be protected. Generally, methods based on adding noise are not considered very effective in terms of data protection.

4.2.5 Re-sampling

Re-sampling is a protection method for numerical micro-data that consists in drawing with

replacement samples of n values from the original data, sorting the sample and averaging the

sampled values. The level of data protection guaranteed by this procedure is generally considered quite low.

4.2.6 Synthetic micro-data


Synthetic micro-data are an alternative approach to data protection, and are produced by using

data simulation algorithms. The rationale for this approach is that synthetic data do not pose

problems with regard to statistical disclosure control because they do not contain real data but

preserve certain statistical properties.

In this chapter we studied the categories of statistical disclosure limitation methods and their divisions. In the next chapter we will study which anonymity methods are combined into a group based anonymity method and how we apply it to privacy preserving data mining.


CHAPTER 5

t-CLOSENESS PRIVACY PRESERVING DATA MINING

Here we try to achieve the privacy required during the data mining process by the use of Group based Anonymization, which includes k-anonymity, l-diversity and t-closeness.

5.1 k-Anonymity


Data holders often remove or encrypt explicit identifiers such as names and social security numbers. De-identifying data, however, provides no guarantee of anonymity: released information often contains other data, such as race, birth date, sex, and ZIP code, that can be linked to publicly available information to re-identify respondents and to infer information that was not intended for release [10].

k-anonymity is one of the micro-data protection concepts. It demands that every tuple in the released micro-data table be indistinguishably related to no fewer than k respondents. Generalization and suppression are used to achieve k-anonymity.

If there are a number of respondents whose dataset we have to publish, then k-anonymity says that each release of data must satisfy the constraint that every combination of values of the quasi identifier can be indistinctly matched to at least k respondents. If T(A1, …, An) is a table and QI is a quasi identifier associated with it, then T is said to satisfy k-anonymity with respect to QI iff each sequence of values in T[QI] appears at least k times in T[QI].

Race    DOB   Sex   ZIP     Disease
Asian   64    F     941**   hypertension
Asian   64    F     941**   obesity
Asian   64    F     941**   chest pain
Asian   63    M     941**   obesity
Asian   63    M     941**   obesity
black   64    F     941**   short breath
black   64    F     941**   short breath
white   64    F     941**   chest pain
white   64    F     941**   short breath

Table 1. Micro-data which is 2-anonymous [11]

Consider an inpatient micro-data table with quasi identifier (Race, DOB, Sex, ZIP) and sensitive attribute Disease. The table above satisfies 2-anonymity because each tuple's quasi identifier values are shared by at least one other tuple in the table [6][7][8]. To achieve anonymity, k-anonymity focuses on two techniques, generalization and suppression, which preserve truthfulness, unlike techniques such as scrambling and swapping.
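
To make the definition concrete, the following minimal sketch checks the anonymity level of a released table by grouping on the quasi identifier; the data mirror Table 1, and the function name is our own.

    import pandas as pd

    df = pd.DataFrame({"Race": ["Asian"]*5 + ["black"]*2 + ["white"]*2,
                       "DOB":  [64, 64, 64, 63, 63, 64, 64, 64, 64],
                       "Sex":  ["F", "F", "F", "M", "M", "F", "F", "F", "F"],
                       "ZIP":  ["941**"]*9,
                       "Disease": ["hypertension", "obesity", "chest pain",
                                   "obesity", "obesity", "short breath",
                                   "short breath", "chest pain", "short breath"]})

    def anonymity_level(table, quasi_identifier):
        """k is the size of the smallest equivalence class on the quasi identifier."""
        return table.groupby(quasi_identifier).size().min()

    print(anonymity_level(df, ["Race", "DOB", "Sex", "ZIP"]))  # -> 2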

5.2 l-diversity

To overcome the homogeneity attack and the background knowledge attack faced by k-anonymity, l-diversity was introduced [12]. A q*-block is l-diverse if it contains at least l "well-represented" values for the sensitive attribute S. A table is l-diverse if every q*-block is l-diverse. The original micro-data is shown in Table 2, and a 3-diverse version in Table 3.

No.   Zip Code   Age   Nationality   Condition
1     13053      28    Russian       Heart Disease
2     13068      29    American      Heart Disease
3     13068      21    Japanese      Viral Infection
4     13053      23    American      Viral Infection
5     14853      50    Indian        Cancer
6     14853      55    Russian       Heart Disease
7     14850      47    American      Viral Infection
8     14850      49    American      Viral Infection
9     13053      31    American      Cancer
10    13053      37    Indian        Cancer
11    13068      36    Japanese      Cancer
12    13068      35    American      Cancer

Table 2. Inpatient Micro-data [13] (Zip Code, Age and Nationality are non-sensitive attributes; Condition is the sensitive attribute)

Table 2 is the micro-data table as originally stored. We will now see how anonymity, diversity and closeness can, to some extent, be calculated from this table.


No.   Zip Code   Age    Nationality   Condition
1     1305*      ≤ 40   *             Heart Disease
4     1305*      ≤ 40   *             Viral Infection
9     1305*      ≤ 40   *             Cancer
10    1305*      ≤ 40   *             Cancer

5     1485*      > 40   *             Cancer
6     1485*      > 40   *             Heart Disease
7     1485*      > 40   *             Viral Infection
8     1485*      > 40   *             Viral Infection

2     1306*      ≤ 40   *             Heart Disease
3     1306*      ≤ 40   *             Viral Infection
11    1306*      ≤ 40   *             Cancer
12    1306*      ≤ 40   *             Cancer

Table 3. 3-diverse 4-anonymous Inpatient Micro-data [13] (Zip Code, Age and Nationality are non-sensitive; Condition is sensitive)

5.2.1 Homogeneity Attack:


Consider a 4-anonymous table in which the quasi identifier attributes are Zip code, Birth date and Gender, and the sensitive attribute is Disease. A set of non-sensitive attributes is called a quasi identifier if these attributes can be linked with external data to uniquely identify at least one individual in the general population. A sensitive attribute is an attribute whose value for any particular individual must be kept secret from people who have no direct access to the original data.

Suppose Alice wants to learn Bob's sensitive attribute and discovers the 4-anonymous table published by the hospital. Knowing the values of Bob's quasi identifier, Alice can determine which equivalence class Bob belongs to, and if the sensitive attribute has a common value throughout that equivalence class, she learns Bob's sensitive attribute. This is called the homogeneity attack.

5.2.2 Background Knowledge Attack:

Consider again a 4-anonymous table with quasi identifier attributes Zip code, Birth date and Gender, and sensitive attribute Disease, in which records 1–4 form one equivalence class whose sensitive values are heart disease and viral infection. Alice knows that Bob is a 21-year-old Japanese male who currently lives in zip code 13068. Based on this information, Alice learns that Bob's information is contained in record number 1, 2, 3 or 4. Without additional information, Alice is not sure whether Bob caught a virus or has heart disease. However, it is well known that Japanese have an extremely low incidence of heart disease; therefore Alice concludes with near certainty that Bob has a viral infection.

l-diversity was introduced to overcome this problem. A q*-equivalence class is l-diverse if it contains at least l "well-represented" values for the sensitive attribute S. A table is l-diverse if every q*-equivalence class is l-diverse, where a q*-equivalence class is the set of tuples in table T* whose non-sensitive attribute values generalize to q*. Consider the inpatient micro-data table shown above, with non-sensitive attributes Zip code, Age and Nationality and sensitive attribute Condition. It is 3-diverse because it has at least 3 distinct sensitive attribute values in each equivalence class.
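
Continuing in the same spirit, the following minimal sketch checks distinct l-diversity by counting distinct sensitive values per equivalence class (distinct values are used here as a simple stand-in for "well-represented"); the data mirror Table 3, and the function name is our own.

    import pandas as pd

    df = pd.DataFrame({"Zip": ["1305*"]*4 + ["1485*"]*4 + ["1306*"]*4,
                       "Age": ["<=40"]*4 + [">40"]*4 + ["<=40"]*4,
                       "Condition": ["Heart Disease", "Viral Infection", "Cancer", "Cancer",
                                     "Cancer", "Heart Disease", "Viral Infection", "Viral Infection",
                                     "Heart Disease", "Viral Infection", "Cancer", "Cancer"]})

    def diversity_level(table, quasi_identifier, sensitive):
        """l is the smallest number of distinct sensitive values in any class."""
        return table.groupby(quasi_identifier)[sensitive].nunique().min()

    print(diversity_level(df, ["Zip", "Age"], "Condition"))  # -> 3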

5.3 t-Closeness

Privacy is measured by the information gained by an adversary, who gains information from the difference between his posterior belief and his prior belief [14]. Let Q be the distribution of the sensitive attribute in the whole table and P the distribution of the sensitive attribute in an equivalence class. An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

S.No.   ZIP Code   Age    Salary   Disease
1       4767*      ≤ 40   3K       gastric ulcer
3       4767*      ≤ 40   5K       stomach cancer
8       4767*      ≤ 40   9K       pneumonia

4       4790*      ≥ 40   6K       gastritis
5       4790*      ≥ 40   11K      flu
6       4790*      ≥ 40   8K       bronchitis

2       4760*      ≤ 40   4K       gastritis
7       4760*      ≤ 40   7K       bronchitis
9       4760*      ≤ 40   10K      stomach cancer

Table 4. This table is 3-anonymous, 3-diverse and has 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease [14]

5.4 PROPOSED WORK


[Figure 2. Steps in privacy preserving data mining: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Group based Anonymization → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

These are the steps in privacy preserving data mining. First, target data are selected from the database; then the data are preprocessed; then a transformation is applied; then we apply Group based Anonymization; then data mining is performed to find patterns; and finally the patterns are evaluated to generate knowledge.

t-closeness came into focus in order to prevent the skewness attack and the similarity attack; for this it makes calculations based on the EMD between two distributions. Let us take an example in which there are 1000 students enrolled, with corresponding records. Say 1% of the students have failed and the rest have passed. Suppose one equivalence class has an equal number of pass and fail records. Anyone belonging to that equivalence class would be considered to have a 50% chance of having failed, as compared with the 1% initially; this is thus a major privacy risk. Again, consider an equivalence class with 49 fail records and 1 pass record. It would satisfy 2-diversity, but there would still be a 98% chance of having failed for someone in that equivalence class, which is much more than the initial 1%. This equivalence class has the same diversity as a class that has 1 fail and 49 pass records, but we can clearly see that the two have different levels of sensitivity. This is the skewness attack. Similarity attacks occur when the sensitive attributes within an equivalence class are distinct but semantically similar.

t-closeness requires that the earth mover's distance between the distribution of the sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table does not differ by more than a predefined parameter t. The EMD is based on the minimum amount of work needed to transform one distribution into the other by moving distribution mass between elements. EMD can be formally defined using the well-studied transportation problem. Let P = (p1, p2, …, pm), Q = (q1, q2, …, qm), and let dij be the ground distance between element i of P and element j of Q. We want to find a flow F = [fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work [14]:

WORK(P, Q, F) = ∑_{i=1}^{m} ∑_{j=1}^{m} d_ij f_ij

subject to the following constraints:

f_ij ≥ 0 for 1 ≤ i ≤ m, 1 ≤ j ≤ m (c1)

p_i − ∑_{j=1}^{m} f_ij + ∑_{j=1}^{m} f_ji = q_i for 1 ≤ i ≤ m (c2)

∑_{i=1}^{m} ∑_{j=1}^{m} f_ij = ∑_{i=1}^{m} p_i = ∑_{i=1}^{m} q_i = 1 (c3)

These three constraints guarantee that P is transformed to Q by the mass flow F. Once the

transportation problem is solved, the EMD is defined to be the total work.

D[P, Q] = WORK(P, Q, F) = ∑_{i=1}^{m} ∑_{j=1}^{m} d_ij f_ij

1. If 0 ≤ d_ij ≤ 1 for all i, j, then 0 ≤ D[P, Q] ≤ 1. This fact follows directly from constraints (c1) and (c3): if the ground distances are normalized, i.e., all distances are between 0 and 1, then the EMD between any two distributions is between 0 and 1. This gives a range from which one can choose the t value for t-closeness.

2. Given two equivalence classes E1 and E2, let P1, P2 and P be the distributions of a sensitive attribute in E1, E2 and E1 ∪ E2 respectively. Then

D[P, Q] ≤ (|E1| / (|E1| + |E2|)) D[P1, Q] + (|E2| / (|E1| + |E2|)) D[P2, Q].

Because merging equivalence classes in this way cannot increase the distance to the overall distribution, group based Anonymization is workable if we want to perform data mining in a privacy preserving way.
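
For a numerical, ordered attribute with normalized ground distances, this EMD coincides with the one-dimensional earth mover's distance, so the 0.167 figure quoted for Table 4 can be checked with scipy; this is a sketch assuming the ordered-distance metric of [14], with the nine salary values mapped onto equally spaced points in [0, 1].

    from scipy.stats import wasserstein_distance

    # the nine salary values 3K..11K mapped to equally spaced points in [0, 1]
    positions = {s: i / 8 for i, s in enumerate(range(3, 12))}

    whole_table = [positions[s] for s in range(3, 12)]   # Q: uniform over all salaries
    first_class = [positions[s] for s in (3, 5, 9)]      # P: salaries in one class

    print(round(wasserstein_distance(first_class, whole_table), 3))  # -> 0.167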


5.4.1 Group Based Anonymization

The need for Group based Anonymization arises from the following weaknesses in existing systems for privacy preserving data mining. The randomization method is a simple technique which can easily be implemented at data collection time, because the noise added to a given record is independent of the behavior of other data records. This is also a weakness, because outlier records can often be difficult to mask. Clearly, in cases in which the privacy preservation does not need to be performed at data collection time, it is desirable to have a technique in which the level of accuracy depends upon the behavior of the locality of the given record. Another key weakness of the randomization framework is that it does not consider the possibility that publicly available records can be used to identify the owners of a record. The use of publicly available records can lead to privacy being heavily compromised in high-dimensional cases. This is especially true of outlier records, which can easily be distinguished from other records in their locality. Therefore, a broad approach to many privacy transformations is to construct groups of anonymous records which are transformed in a group-specific way.

Group based Anonymization is efficient if we want to perform data mining in a privacy preserving way. We apply the integration of k-anonymity, l-diversity and t-closeness on micro-data as shown in Figure 2: we take micro-data from an organization where protection of sensitive information is needed, run Group based Anonymization, and then perform the data mining task. It preserves privacy in the following ways.

a. Prevent attribute disclosure. Attribute disclosure occurs when confidential information about

a data subject is revealed and can be attributed to the subject. Attribute disclosure may occur

when confidential information is revealed exactly or when it can be closely estimated. Thus,

attribute disclosure comprises identification of the subject and divulging confidential information

pertaining to the subject. Identity disclosure occurs if a third party can identify a subject or

respondent from the released data [6]. Revealing that an individual is a respondent or subject of a

data collection may or may not violate confidentiality requirements.

b. Prevent Inference disclosure. Inference disclosure occurs when information can be inferred

with high confidence from statistical properties of the released data.


c. It decreases the utility measure available to an adversary, which tells how useful a given respondent is. For example, suppose a person's medical information is disclosed: he suffers from diabetes and holds a policy with an insurance company. A competitor company can then approach him and offer a more beneficial scheme, gaining an opportunity to grow its own business. t-closeness privacy preserving data mining avoids this problem.

d. It preserves the truthfulness of the data.

5.4.2 Group based Anonymization applied in Secure Multiparty Computation

Suppose there are parties P1, P2, …, Pn with data sets D1, D2, …, Dn respectively, and they want to do data mining on their databases collectively to obtain a function F(x), where x uses sensitive values from each party. If we add a trusted third party to the computation, it collects all the inputs from the parties and returns the result; but what happens if there is no trusted third party? We can assume two types of abnormal behavior by the parties:

1. The parties follow the protocol, but they can collaborate and try to learn additional information.

2. Abnormal parties can collaborate in arbitrary ways to gather information and disrupt the secure computation.

If we apply Group based Anonymization on the data held by all parties where confidentiality is needed, we achieve data anonymity; we can then run the data mining task and obtain results without compromising privacy. This is shown in the following figure.


[Figure 3. Group based Anonymization (GA) applied on SMC: each party P1, …, Pn applies GA to its data set D1, …, Dn before the joint data mining (DM) step.]

In this chapter we studied which anonymity methods make up the group based anonymity method and how we apply it to privacy preserving data mining. The next chapter presents the conclusion and future research.


CHAPTER 6

CONCLUSION AND FUTURE RESEARCH


6.1 Conclusion

There are a number of privacy preserving models for micro-data release that aim to prevent sensitive information from disclosure, but they are not very promising because they do not guarantee privacy. We also read the most recent publications on privacy preserving data mining, but these too suffer from attacks such as the skewness attack or the attribute disclosure attack. So we narrowed down on the t-closeness privacy preserving model, because it provides a much stronger privacy guarantee against these attacks. But it also has limitations: we cannot set the threshold value close to zero, because data mining then becomes difficult, and we still want to compute t-closeness effectively for high-dimensional data. In high-dimensional data a high degree of generalization and suppression is needed for data Anonymization, so more information loss occurs. Slicing is an advanced approach to achieving data anonymity which splits high-dimensional data into low-dimensional data. It now depends on how we achieve t-closeness for efficient data Anonymization.

6.2 Future research

The group based Anonymization method can be an efficient method for privacy preserving data mining, but in high-dimensional data it is an open question how to apply group Anonymization, because there is no universal rule as to which method should be applied within group Anonymization to calculate the values of each individual model. Slicing is therefore a good concept for achieving data Anonymization in high-dimensional data.


ABBREVIATIONS

DM…………….Data mining

KDDM …….. Knowledge Discovery and Data Mining

DBMS …….. Database Management System

KDD ………. Knowledge Discovery in Databases

NHL ………. National Hockey League

NBA ……….. National Basketball Association

NIH …………… National Institute of Health

PL ……………. Privacy Level

DL …………… Disclosure Level

VRML……………The virtual reality modeling language

PPDM ………… Privacy preserving data mining

PRAM ……….. Post-randomization

EMD …………. Earth mover’s distance

SMC ………. Secure Multiparty Computation

NASA ………. National Aeronautics and Space Administration

CAD …………… Computer Assisted Design

GA…………………Group based Anonymization


REFERENCES

[1] O. R. Zaiane, 1999. CMPUT 690 'Principles of Knowledge Discovery in Databases'.

[2] R. Agrawal and R. Srikant, 2000. Privacy preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, 439-450.

[3] Y. Lindell and B. Pinkas, 2000. Privacy preserving data mining. J. Cryptology 15, 3, 36-54.

[4] L. Willenborg and T. de Waal, 2001. Elements of Statistical Disclosure Control. Springer-Verlag.

[5] H. Mannila and H. Toivonen, 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1, 3, 241-258.

[6] A. Friedman, R. Wolff and A. Schuster, 'Providing k-anonymity in data mining', The VLDB Journal, Vol. 17, 2008, pp. 789-804.

[7] E. Poovammal and M. Ponnavaikko, 'An Improved Method for Privacy Preserving Data Mining', International Advance Computing Conference, 2009, pp. 1453-1458.

[8] C. C. Aggarwal and P. S. Yu, 'Privacy-Preserving Data Mining: Models and Algorithms', Kluwer Academic Publishers, Boston/Dordrecht/London.

[9] 'Anonymization techniques', International Household Survey Network, http://www.surveynetwork.org/home/index.php?q=tools/anonymization/techniques.

[10] Y. Ting and S. Jajodia, 'k-Anonymity', Secure Data Management in Decentralized Systems, Springer-Verlag, 2007.

[11] V. Ciriani, S. De Capitani di Vimercati, S. Foresti and P. Samarati, 'k-Anonymity', Springer US, Advances in Information Security, 2007.

[12] D. Kifer and J. Gehrke, 'l-diversity: Privacy Beyond k-Anonymity', International Conference on Data Engineering (ICDE), 2006.

[13] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, 'l-diversity: Privacy beyond k-anonymity'. Available at http://www.cs.cornell.edu/_mvnak, 2005.

[14] N. Li, T. Li and S. Venkatasubramanian, 't-Closeness: Privacy Beyond k-Anonymity and l-Diversity', International Conference on Data Engineering (ICDE), 2007, pp. 106-115.


Publications and Reprint:

Topic name: t-Closeness Privacy Preserving Data Mining.

Conference name: to be presented at the 2010 International Conference on the Business and Digital Enterprises (ICBDE 2010).

Place: Gopalan Educational Society, Bangalore, India, in cooperation with the Digital Information Research Foundation (DIRF).


t- Closeness Privacy Preserving Data Mining

Rani Srivastava, Vishal Bhatnagar

Ambedkar Institute of Technology, Geeta Colony, Delhi

[email protected], vishalbhatnagar@yahoo.com


Abstract

The Group based Anonymization method, including k-anonymity, l-diversity and t-closeness, was introduced for data privacy. The t-closeness model was introduced in order to provide a safeguard against similarity attacks on a published data set. It requires that the earth mover's distance (EMD) between the distribution of a sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table should not differ by more than a threshold t. Our aim in this paper is to provide the logical security of data through data anonymity. There are many other models which can work as substitutes for t-closeness, but they cannot remove all its shortcomings and also suffer from limitations of their own, so t-closeness remains efficient.

Keywords: Data Mining, Anonymization, t-closeness, EMD.

1. Introduction

Data mining methodologies have been widely adopted in research areas, whether economic, medical, or various business-oriented domains such as marketing, credit scoring and fraud detection, where data mining has become an indispensable tool for business success. Increasingly, data mining methods are also being applied to industrial process optimization and control. Generally, data mining (sometimes called information or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. A data mining tool is an analytical tool for analyzing data: it allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Basically, data mining is looking at the large amount of information that has been collected on computers and using it to provide new information. Privacy in data mining is needed sometimes for business purposes and sometimes due to law; for example, in a medical database no doctor is allowed to tell sensitive information about any patient to someone else. Our focus is to do data mining in a privacy preserving way, using anonymization methods, such that no adversary is able to determine any sensitive information or data measure. When we view a database with respect to privacy, there are two types of attributes: sensitive and non-sensitive. An attribute is marked sensitive if an adversary must not be allowed to discover the value of that attribute for any individual in the database; an attribute not marked sensitive is non-sensitive. Sets of attributes that can be linked with external data to uniquely identify individuals in the dataset are called quasi identifiers. To counter linking attacks using quasi identifiers, there is the concept of k-anonymity: a table satisfies k-anonymity if every record in the table is indistinguishable from at least k − 1 other records. But k-anonymity suffers from the homogeneity attack and the background knowledge attack, so a new concept, l-diversity, was introduced. l-diversity puts constraints on the minimum number of distinct values seen within an equivalence class for any sensitive attribute: an equivalence class has l-diversity if there are l or more well-represented values for the sensitive attribute, and a table is l-diverse if each equivalence class of the table is l-diverse. But an l-diverse table has disadvantages due to the skewness attack and the similarity attack, so t-closeness was introduced. An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t; a table is said to have t-closeness if all equivalence classes have t-closeness.


2. Motivation and Related Research

Huge databases exist in society today; they include census data, media data, consumer data, data gathered by government agencies, and research data that needs to be published publicly. It is now possible for an adversary to learn a lot of information about individuals from public data: purchasing patterns, family history, medical data, media data and much more. This is an age of competitiveness, and every organization, government or non-government, wants to be ahead of the others. Why, then, allow an adversary to obtain useful information, make inferences, and share confidential information? Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques.

1. In the paper titled 'An Improved Method for Privacy Preserving Data Mining' [1] we noticed the following points.

a. It is assumed that one of the quasi identifier attributes is numeric and is subjected to a fuzzy-based transformation. In the fuzzy-based transformation, the number of fuzzy sets is calculated, equal to the number of linguistic terms, by deciding the size of the fuzzy set (k) and the min, max and mid point of each fuzzy set m1, …, mk. The actual value x is transformed using the functions f1(x), f2(x), …, fk(x), and the actual value is replaced with xn plus a category number. If the actual value of any of the quasi identifier member attributes is not known, a unique record cannot be identified; thereby the linking attack problem is solved. But what happens if the quasi identifier has no numerical attribute, as with Netflix movie ratings? Then this approach cannot solve the linking problem.

b. Identifier attributes are not disclosed and, if present, are replaced by auto-generated id numbers. This leads to loss of truthfulness of the information and also to information loss.

c. For a categorical sensitive attribute, the transformation is performed using a mapping table prepared with domain knowledge, considering the privacy level (PL) and disclosure level (DL) set by the user. If an individual is willing to disclose his information, his PL and DL are set to true and no transformation is done; if he does not mind being linked to with a lower probability, PL is true and DL is false, and his sensitive value is replaced by a generalized value plus an arbitrary value; but if the privacy level is false, the value is replaced by an overall general value plus an arbitrary value. If both the disclosure level and the privacy level are true, there is no problem for research because the true data is available; but there is a privacy breach if the linking problem cannot be solved as above, and attribute disclosure is not solved. And if no one wants to disclose his sensitive information, how will research take place?

2. The problem of privacy can be addressed using k-anonymity, and an extended definition of k-anonymity can be used to prove that a given data mining model does not violate the k-anonymity of the individuals represented in the data [2]. An extension of this provides a tool that measures the amount of anonymity retained during data mining. It is also shown that this model can be applied to various data mining problems, such as classification, association rule mining and clustering. Two data mining algorithms are explained which exploit this extension to guarantee that they will generate only k-anonymous output. Finally, it is shown that this method contributes new and efficient ways to anonymize data and preserve patterns during Anonymization. We have to prevent identity disclosure, attribute disclosure, the utility measure (which tells how useful a given respondent is) and inference disclosure when publishing data publicly, so we require t-closeness.


3. Data Mining: An Introduction

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction [3][4]. It is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. While data mining is a technology with a large number of advantages, the biggest problem that needs to be addressed is privacy. In the information age it sometimes seems as if everyone wants to know everything about you, which has led to the rise of identity theft. Whenever you go to a bank to fill out a loan application, the information you put on it will probably be placed in a database. When you conduct an interview over the phone or on the internet, the information that you submit is also placed in a database. Medical data is stored in a database for research purposes.

Many proponents of data mining assume that the information held by an organization will exist in one location. In reality, this information can fall into the hands of anyone, and once a single copy of it surfaces on the internet it can be replicated numerous times; moreover, data needs to be published for research purposes. Many of the consumers who buy products or services are not aware of data mining technology. They may not know that their shopping habits, names, addresses, and other information are being stored in a database. While data mining might be a term that is well understood in certain circles, if you are the owner of a business, do your customers know that you are putting their information in a database? Do you bother to tell them? If you are adding them to a database without their knowledge, how do you think they would feel if they found out? Customers should be assured of the privacy of information placed in a database. Large corporations that are fiercely competitive may avoid assuring their customers that their data will not be published, because they do not want to lower their chances of having an edge on their competition. Because of this, they are faced with the ethical problem of whether or not they should give customers the option of not publishing their data for data mining. One of the important issues in such processes is how to protect the trade secrecy of corporations and the privacy of customers contained in the data sets collected and used for the purpose of data mining. Detailed person-specific data, present on a centralized server or in a distributed environment, in its original form often contains sensitive information about individuals, and publishing such data immediately violates individual privacy. The main problem in this regard is to develop methods for publishing data in a more hostile environment so that the published data remains practically useful while individual privacy is preserved. Suppose n parties, each having a private database, want to jointly conduct a data mining operation on the union of their databases; how can these parties accomplish this without disclosing their databases to the other parties or to any third party? Given all these facts, it seems that privacy preserving data mining will play an increasing role in data privacy in the information age.

4. Anonymization Methods: An Introduction

Anonymization is the process of making data publicly available without compromising individual privacy. One of the functions of a federal statistical agency is to collect individually sensitive data, process it and provide statistical summaries and/or public-use micro-data files to the public. Some of the data collected are considered proprietary by respondents. On the other hand, not all data collected and published by the government are subject to disclosure limitation techniques. Some data on businesses that are collected for regulatory purposes are considered public. In addition, some data are not considered sensitive and are not collected under a pledge of confidentiality. The statistical disclosure limitation techniques described here apply wherever confidentiality is required and data or estimates are made publicly available. All disclosure limitation methods result in some loss of information, and sometimes the


publicly available data may not be adequate for certain statistical studies. However, the intention is to provide as much data as possible, without revealing individually sensitive data. Statistical disclosure limitation methods can be classified in two categories [5]:

a. Methods based on data reduction. Such methods aim at increasing the number of individuals in the sample/population sharing the same or similar identifying characteristics presented by the investigated statistical unit. Such procedures tend to avoid the presence of unique or rare recognizable individuals.

b. Methods based on data perturbation. Such methods achieve data protection from a twofold perspective. First, if the data are modified, re-identification by means of record linkage or matching algorithms is harder and uncertain. Secondly, even when an intruder is able to re-identify a unit, he or she cannot be confident that the disclosed data are consistent with the original data. Data reduction includes removing variables, removing records, global recoding, top and bottom coding and local suppression. Data perturbation includes micro-aggregation, data swapping, post-randomization, adding noise and re-sampling. An alternative solution consists in generating synthetic micro-data, produced by using data simulation algorithms. The rationale for this approach is that synthetic data do not pose problems with regard to statistical disclosure control, because they do not contain real data but preserve certain statistical properties. Synthetic data can be generated using bootstrap methods, multiple imputation and data distribution by probability.

5. t-closeness privacy preserving data mining

The Group based Anonymization method is the result of the combined use of k-anonymity, l-diversity and t-closeness: k-anonymity puts similar data into groups, and each group is then anonymized individually.

5.1 k-anonymity

If there are a number of respondents whose dataset we have to publish, then k-anonymity says that each release of data must satisfy the constraint that every combination of values of the quasi identifier can be indistinctly matched to at least k respondents. If T(A1, …, An) is a table and QI is a quasi identifier associated with it, then T is said to satisfy k-anonymity with respect to QI iff each sequence of values in T[QI] appears at least k times in T[QI]. Consider an inpatient micro-data table with quasi identifier (Zip code, Age, Nationality) and sensitive attribute Condition. This table satisfies 4-anonymity if each tuple's quasi identifier values are shared by at least three other tuples in the table [6][7][8]. To achieve anonymity, k-anonymity focuses on two techniques, generalization and suppression, which preserve truthfulness, unlike techniques such as scrambling and swapping.

5.2 l-diversity

To overcome the homogeneity attack and the background knowledge attack faced by k-anonymity, l-diversity was introduced. A q*-equivalence class is l-diverse if it contains at least l "well-represented" values for the sensitive attribute S. A table is l-diverse if every q*-equivalence class is l-diverse, where a q*-equivalence class is the set of tuples in table T* whose non-sensitive attribute values generalize to q* [9]. A 4-anonymous inpatient micro-data table is 3-diverse (l = 3) if every block has at least 3 distinct sensitive values.

5.3 t-closeness

Privacy is measured by the information gained by an adversary, who gains information from the difference between his posterior belief and his prior belief. If Q is the distribution of the sensitive attribute in the whole table and P is the distribution of the sensitive attribute in an equivalence class, an equivalence class is said to have t-closeness if the distance between the


distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness [10].

S.No.   ZIP Code   Age    Salary   Disease
1       4767*      ≤ 40   3K       gastric ulcer
3       4767*      ≤ 40   5K       stomach cancer
8       4767*      ≤ 40   9K       pneumonia

4       4790*      ≥ 40   6K       gastritis
5       4790*      ≥ 40   11K      flu
6       4790*      ≥ 40   8K       bronchitis

2       4760*      ≤ 40   4K       gastritis
7       4760*      ≤ 40   7K       bronchitis
9       4760*      ≤ 40   10K      stomach cancer

Table 1. This table is 3-anonymous, 3-diverse and has 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease [10]

[Figure 1. t-closeness applied separately to each micro-data source DB1, …, DBn before the joint data mining (DM) step.]

t-closeness came into focus in order to prevent the skewness attack and the similarity attack; for this it makes calculations based on the EMD between two distributions. Let us take an example in which there are 1000 students enrolled, with corresponding records. Say 1% of the students have failed and the rest have passed. Suppose one equivalence class has an equal number of pass and fail records. Anyone belonging to that equivalence class would be considered to have a 50% chance of having failed, as compared with the 1% initially; this is thus a major privacy risk. Again, consider an equivalence class with 49 fail records and 1 pass record. It would satisfy 2-diversity, but there would still be a 98% chance of having failed for someone in that equivalence class, which is much more than the initial 1%. This equivalence class has the same diversity as a class that has 1 fail and 49 pass records, but we can clearly see that the two have different levels of sensitivity. This is the skewness attack. Similarity attacks occur when the sensitive attributes within an equivalence class are distinct but semantically similar. t-closeness requires that the earth mover's distance between the distribution of the sensitive attribute within each equivalence class and the distribution of the sensitive attribute in the whole table does not differ by more than a predefined parameter t. The EMD is based on the minimum amount of work needed to transform one distribution into the other by moving distribution mass between elements. EMD can be formally defined using the well-studied transportation problem. Let P = (p1, p2, …, pm), Q = (q1, q2, …, qm), and let dij be the ground distance between element i of P and element j of Q. We want to find a flow F = [fij], where fij is the flow of mass from element i of P to element j of Q, that minimizes the overall work [10]:

WORK(P, Q, F) = ∑_{i=1}^{m} ∑_{j=1}^{m} d_ij f_ij

subject to the following constraints:

f_ij ≥ 0 for 1 ≤ i ≤ m, 1 ≤ j ≤ m (c1)

p_i − ∑_{j=1}^{m} f_ij + ∑_{j=1}^{m} f_ji = q_i for 1 ≤ i ≤ m (c2)

∑_{i=1}^{m} ∑_{j=1}^{m} f_ij = ∑_{i=1}^{m} p_i = ∑_{i=1}^{m} q_i = 1 (c3)

These three constraints guarantee that P is transformed to Q by the mass flow F. Once the transportation problem is solved, the EMD is defined to be the total work:

D[P, Q] = WORK(P, Q, F) = ∑_{i=1}^{m} ∑_{j=1}^{m} d_ij f_ij


1. If 0 ≤ d_ij ≤ 1 for all i, j, then 0 ≤ D[P, Q] ≤ 1. This fact follows directly from constraints (c1) and (c3): if the ground distances are normalized, i.e., all distances are between 0 and 1, then the EMD between any two distributions is between 0 and 1. This gives a range from which one can choose the t value for t-closeness.

2. Given two equivalence classes E1 and E2, let P1, P2 and P be the distributions of a sensitive attribute in E1, E2 and E1 ∪ E2 respectively. Then

D[P, Q] ≤ (|E1| / (|E1| + |E2|)) D[P1, Q] + (|E2| / (|E1| + |E2|)) D[P2, Q].

Because merging equivalence classes in this way cannot increase the distance to the overall distribution, group based Anonymization is workable if we want to perform data mining in a privacy preserving way.

Group based Anonymization is efficient if we want to perform data mining in a privacy preserving way. We apply k-anonymity, l-diversity and t-closeness on micro-data as shown in Figure 1: micro-data is taken from each department or organization where protection of sensitive information is needed, t-closeness is run separately on each micro-data set of the concerned organization, and then the data mining task is performed. Here DB1, …, DBn represent the original databases, also termed micro-data; DM stands for data mining and t for t-closeness. It preserves privacy in the following ways.

a. Prevent attribute disclosure. Attribute disclosure occurs when confidential information about a data subject is revealed and can be attributed to the subject. Attribute disclosure may occur when confidential information is revealed exactly or when it can be closely estimated. Thus, attribute disclosure comprises identification of the subject and divulging confidential information pertaining to the subject. Identity disclosure occurs if a third party can identify a subject or respondent from the released data [6].

Revealing that an individual is a respondent or subject of a data collection may or may not violate confidentiality requirements.

b. Prevent inference disclosure. Inference disclosure occurs when information can be inferred with high confidence from statistical properties of the released data.

c. It decreases the utility measure available to an adversary, which tells how useful a given respondent is. For example, suppose a person's medical information is disclosed: he suffers from diabetes and holds a policy with an insurance company. A competitor company can then approach him and offer a more beneficial scheme, gaining an opportunity to grow its own business. t-closeness privacy preserving data mining avoids this problem.

d. It preserves the truthfulness of the data.

6. Conclusion and future research

There are many privacy preserving models for micro-data release that aim to prevent sensitive information from disclosure, but they are not very promising because they do not guarantee privacy. We also read the most recent publications on privacy preserving data mining, but these too suffer from attacks such as the skewness attack or the attribute disclosure attack. So we narrowed down on the t-closeness privacy preserving model, because it provides a much stronger privacy guarantee against these attacks. But it also has limitations: we cannot set the threshold value close to zero, because data mining then becomes difficult, and we still want to compute t-closeness effectively for high-dimensional data. In high-dimensional data a high degree of generalization and suppression is needed for data Anonymization, so more information loss occurs. Slicing is an advanced approach to achieving data anonymity which splits high-dimensional data into low-dimensional data. It now depends on how we achieve t-closeness for efficient data Anonymization.


References

[1] Poovammal E. and Ponnavaikko M., 'An Improved Method for Privacy Preserving Data Mining', International Advance Computing Conference, 2009, pp. 1453-1458.

[2] Friedman A., Wolff R. and Schuster A., 'Providing k-anonymity in data mining', The VLDB Journal, Vol. 17, 2008, pp. 789-804.

[3] Jeffrey W. S., 'Data mining: An overview', CRS Report RL31798.

[4] Aggarwal C. C. and Yu P. S., 'Models and Algorithms: Privacy-Preserving Data Mining', Springer, 2008.

[5] 'Anonymization techniques', International Household Survey Network, http://www.surveynetwork.org/home/index.php?q=tools/anonymization/techniques.

[6] Samarati P., 'Protecting Respondents' Identities in Micro-data Release', IEEE Trans. Knowledge and Data Eng., Vol. 13, No. 6, 2001, pp. 1010-1027.

[7] Jajodia S., Yu T., Bayardo R. J. and Agrawal R., 'k-Anonymity. Security in Decentralized Data Management', Springer, 2006.

[8] Ting Y. and Jajodia S., 'k-Anonymity', Secure Data Management in Decentralized Systems, Springer-Verlag, 2007.

[9] Kifer D. and Gehrke J., 'l-diversity: Privacy Beyond k-Anonymity', International Conference on Data Engineering (ICDE), 2006.

[10] Li N., Li T. and Venkatasubramanian S., 't-Closeness: Privacy Beyond k-Anonymity and l-Diversity', International Conference on Data Engineering (ICDE), 2007, pp. 106-115.
