association rule visualization technique

Republic of Iraq

Ministry of Higher Education & Scientific Research

Iraqi Commission for Computers and Informatics

Informatics Institute for Postgraduate Studies

Study of Association Rules' Visualization Techniques

A Project

Submitted to the Informatics Institute

For Postgraduate Studies of the Iraqi Commission

For Computers and Informatics as a partial fulfillment of the

Requirements for the degree of Higher Diploma in Web Site Technology in Computer Science

By

Mustafa S. Shaheed

Supervised by

Dr. Hussein K. Khafaji

Baghdad, Iraq

Feb 2011 1432

In the name of Allah, the Most Gracious, the Most Merciful

"My Lord, increase me in knowledge"

Allah the Almighty has spoken the truth

Surah Ta-Ha, verse 114

Dedication

To My Family With Love

And Affection


Acknowledgments

My first and deepest gratitude goes to ALLAH the Almighty for His uncountable blessings, help, and guidance.

I would like to express my deepest appreciation to my supervisor, Dr. Hussein K. Khafaji, for his guidance, helpful comments, and suggestions.

Supervisor's Certification

I certify that the project entitled "Comparative Study of Association Rules' Visualization Techniques" was prepared under my supervision at the Informatics Institute for Postgraduate Studies of the Iraqi Commission for Computers and Informatics as a partial fulfillment of the requirements for the degree of Higher Diploma in Web Site Technology in Computer Science.

Signature:

Name: Dr. Hussein K. Khafaji

Date: /2/2011


Examining Committee Certification

We certify that we have read this project, entitled "Comparative Study of Association Rules' Visualization Techniques", and, as an examining committee, examined the student "Mustafa S. Shaheed" in its contents and in what is related to it, and that in our opinion it meets the standard of a project for the Higher Diploma in Web Site Technology in Computer Science.

Signature

Name: Dr. Hussein K. Khafaji

Title:

Date: /2/2011

Supervisor

Approved by the Informatics Institute for Postgraduate Studies of the

Iraqi Commission for Computers and Informatics.

Signature

Name: Prof. Dr. Imad Hussain Al-Hussaini

Date: /10/2010

Dean of the Institute

Signature

Name: Dr.

Title:

Date: /2/2011

Chairman

Signature

Name: Dr.

Title:

Date: /2/2011

Member

Signature

Name: Dr.

Title:

Date: /2/2011

Member


Abstract

As computers are used in more and more areas, large volumes of data are continuously collected and stored in databases. An important issue is how to find useful information in this massive amount of data.

Data mining, also known as knowledge discovery in databases, is the research area concerned with extracting implicit, understandable, previously unknown and potentially useful information from data.

Association Rules are one of the most widespread data mining tools because they provide valuable information for many application fields, in spite of their mining difficulties.

The exploration of large data sets is an important but difficult problem.

Information visualization techniques can be useful in solving this problem. Visual data exploration has a high potential, and many applications.

Association Rules Visualization is emerging as a crucial step in a data mining process in order to profitably use the extracted knowledge.

In this project, the most important association rule visualization techniques are studied. These techniques are used to present the association rules discovered from databases by the algorithms developed for this purpose. The study identifies the strengths and weaknesses of these techniques in order to reach the most appropriate technique to address the main drawback of Association Rules.

List of Contents

Chapter One: Introduction
1.1 Introduction
1.2 Introduction to Data Mining
1.3 Introduction to Association Rule
1.4 Introduction to Functional Dependencies
1.4.1 Candidate Key
1.5 Aim of the study

Chapter Two: Data Mining and Functional Dependency
2.1 Introduction
2.2 Data Mining Overview
2.2.1 Data Mining Application
2.2.2 The process before Data Mining
2.2.3 Data Mining tasks
2.2.3.1 Association Rules
2.2.3.2 Apriori algorithm
2.3 Functional depe
2.3.1 Definition (1)
2.3.2 Definition (2)
2.3.3 Multi Valued Dependencies
2.4 Candidate Keys
2.5 Primary Key
2.6 Super key
2.7 Armstrong's Axioms

Chapter Three: Proposed System to Determine the Candidate Keys
3.1 Introduction
3.2 The relation between data mining and functional dependency
3.3 An Algorithm of determining closure sets
3.4 System Architecture
3.4.1 Sets Generator
3.4.2 Candidate key tester
3.5 Set closure producer
3.6 Key filter
3.7 Candidate keys system execution

Chapter Four: Discussion and Future Works
4.1 Discussion
4.2 Future works

List of Algorithms

Algorithm (3-1) Testing the closure of sets of attributes algorithm
Algorithm (3-2) Rule testing algorithm
Algorithm (3-3) Closure generator algorithm

List of Programs

Program (3-1) Candidate key tester
Program (3-2) Candidate key function
Program (3-3) Merge program

List of Figures

Figure (3-1) The architecture of generating candidate keys
Figure (3-2) The main view of application
Figure (3-3) The interface of set generator
Figure (3-4) The interface of candidate key tester
Figure (3-5) The interface of table (sets)
Figure (3-6) The interface of table (candid)

List of Tables

Table (2.1) A database with 4 items and 5 transactions
Table (2.2) How employees get to work
Table (2.3) Functional Dependencies defined over two sets
Table (2.4) Employees information
Table (2.5) Students information
Table (2.6) Managers phone#
Table (2.7) Manager-employee
Table (2.8) Relation of Managers, phone, and employee
Table (3.1) Sets stored table
Table (3.2) Candidate keys stored table
Table (3.3) Temporary values stored table

Chapter One

Introduction

1.1 Overview

Recent years have seen an enormous increase in the amount of information stored in electronic format. It has been estimated that the amount of collected information in the world doubles every 20 months, while the size and number of databases are increasing even faster, and the ability to rapidly collect data has outpaced the ability to analyze it. Information is crucial for decision making, especially in business operations. As a response to those trends, the term 'Data Mining' (or 'Knowledge Discovery') has been coined to describe a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. Automated tools must be developed to help extract meaningful information from a flood of information. Moreover, these tools must be sophisticated enough to search for correlations among the data unspecified by the user, as the potential for unforeseen relationships to exist among the data is very high. A successful tool set to accomplish these goals will locate useful nuggets of information in the otherwise chaotic data space, and present them to the user in a contextual format.

Knowledge discovery in databases (KDD) is a new field that draws on ideas from statistics, machine learning, databases, parallel computing, computer graphics, data visualization, and other fields. KDD systems generally use methods, algorithms, and techniques from all of these fields. The field has emerged due to the extraordinary growth of data in all areas of human activity, the inability of database management systems (DBMS) to extract hidden knowledge in databases, and the economic and scientific need for such knowledge. KDD includes techniques and tools to address this need.

Reference [27] defines knowledge discovery in databases as follows:

"KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data".

Many authors use the terms data mining (DM) and KDD interchangeably and regard them as synonymous. At the first international KDD conference in Montreal in 1995, it was proposed that the term "KDD" be employed to describe the whole process of extraction of knowledge from data. It was further proposed that the term "data mining" should be used exclusively for the discovery stage of the KDD process. A more or less official definition of DM is the process of automatic extraction of novel, useful, and understandable patterns in large databases [20,21]. Hence, KDD includes many steps, such as Focussing, Preprocessing, Transformation, Data Mining, and Evaluation. Figure (1-1) abstracts the KDD process [14].

1- Focussing: define the goal of the particular KDD task.

2- Preprocessing: integrate the specified data.

3- Transformation: ensure that each data object is represented in a common form which is suitable as input for the next step.

4- Data Mining: detect the desired patterns contained within the given data.

5- Evaluation: the user evaluates the extracted patterns with respect to the task defined in the focussing step.

Data mining is the most important step within the KDD process. Reference [27] defines data mining as follows:

Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.

According to this definition, data mining is the step that is responsible for the actual knowledge discovery. Data mining has many tasks, such as Association Rules (AR), Sequential Patterns, Classification, Clustering, and Similarity search.

Association Rules is the most important task of DM. ARs represent the correlation between sets of items in a transaction database. An AR is an implication of the form X → Y with confidence c%, where X and Y are sets of items, each of which is called an itemset. X is called the antecedent, while Y is called the consequent, such that X ∩ Y = ∅, and c% is the confidence of the implication. For example, the following rule

{"The love in cholera era", "The Merchant of Venice", "Zoorba"} → {"The Trees and Marzooq's Association", "One Hundred Years of Segregation"} (60%)

means that a person who reads the novels "The love in cholera era", "The Merchant of Venice", and "Zoorba" also reads the novels "The Trees and Marzooq's Association" and "One Hundred Years of Segregation", with a certainty factor of 60%. The confidence of a rule is calculated as follows: Confidence = support(X ∪ Y) / support(X), where the support of an itemset is the number of its occurrences in the database. A confident rule is a rule whose confidence is greater than or equal to the user-defined threshold called minimum confidence, minconf.

The ARs are extracted from mined frequent itemsets. Mining of frequent itemsets is a very complex process [3]. The mining of association rules consists of two steps: the first one is mining the frequent itemsets, while the second one is extracting the rules from these frequent itemsets. The first step is computationally intensive and has attracted the interest of researchers for many years; many algorithms have been produced to accomplish this complicated mining process, such as Apriori, AprioriTID, AprioriHybrid [20], FP-growth [12], and CHARM [17]. The second step is extracting the association rules from the results of the previous step.

The main drawback of Association Rules is thus the huge number of extracted rules, which cannot be inspected manually, and the existence of trivial or meaningless associations that are usually mined due to the exhaustive nature of the extraction algorithms [24]. Graphical tools and pruning methods are the main approaches used to face these problems. For data mining to be effective and well evaluated, it is important to include the human in the data exploration process and to combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today's computers.

Visual data exploration aims at integrating the human in the data exploration process, applying human perceptual abilities to the analysis of large data sets available in today's computer systems. The basic idea of visual data exploration is to present the data in some visual form, allowing the user to gain insight into the data, draw conclusions, and directly interact with the data. Visual data mining techniques have proven to be of high value in exploratory data analysis, and have a high potential for exploring large databases. Visual data exploration is especially useful when little is known about the data and the exploration goals are vague. Since the user is directly involved in the exploration process, shifting and adjusting the exploration goals is done automatically if necessary. There are many techniques used to represent data visually; some of them are discussed in this project.

Figure (1-1) Visualization and Data Mining

1.2 Aim of the project

The aim of the project is to study the techniques used to present the association rules discovered from databases by the algorithms developed for this purpose, and to identify the strengths and weaknesses of these techniques in order to reach the most appropriate technique to address the main drawback of Association Rules.

1.3 Project Outline

Chapter two explains the stages of Knowledge Discovery in Databases (KDD) and the tasks of data mining, and concentrates on association rules (AR).

Chapter three focuses on the concept of visualization, the benefits of visualization, and the visualization techniques used to visualize association rules (AR), which are the main subject of this study.

Chapter four presents a summary of the techniques used to visualize association rules and the future work related to them.


Chapter Two

Data Mining And

Association Rules



2.1 Introduction

This chapter presents the general steps of Knowledge Discovery in Databases (KDD) and its relation to data mining. It also presents the tasks of data mining (DM) and concentrates on association rules due to their importance as an interesting field of DM.

2.2 Knowledge Discovery in Databases

In recent years the amount of data collected by advanced information systems has increased tremendously. Although very useful information of strategic importance is buried within this data, this information is not readily available to the users. To analyze these huge amounts of data, the interdisciplinary field of Knowledge Discovery in Databases (KDD) has emerged. It applies efficient algorithms to extract interesting patterns and regularities from the data.

KDD is defined as follows [27]:

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.


According to this definition, data is a set of facts that is somehow accessible in electronic form. The term patterns indicates models and regularities which can be observed within the data. Patterns have to be valid, i.e. they should hold on new data with some degree of certainty. A novel pattern is not previously known or trivially true. The potential usefulness of patterns refers to the possibility that they lead to an action providing a benefit.

A pattern is understandable if it is interpretable by a human user. Finally, KDD is a process, indicating that there are several steps that are repeated in several iterations.

Figure (2-1) displays the process of KDD in its basic form.

Figure (2-1) The KDD process


2.3 KDD Process Stages

The KDD process is an interactive and iterative multi-step process which uses five steps to extract interesting knowledge according to some specific measures and thresholds [14]:

1- Focussing

2- Preprocessing

3- Transformation

4- Data Mining

5- Evaluation

2.3.1 Focussing

The first step is to define the goal of the particular KDD task. Another important aspect of this step is to determine the data to be analyzed and how to obtain it.

2.3.2 Preprocessing

In this step the specified data has to be integrated, because it is not necessarily accessible on the same system. Furthermore, several objects may be described incompletely. Thus, the missing values need to be completed and inconsistent data should be corrected or left out.

2.3.3 Transformation

The transformation step has to ensure that each data object is represented in a common form which is suitable as input for the next step.


2.3.4 Data Mining

Data mining is the application of efficient algorithms to detect the desired patterns contained within the given data. Thus, the data mining step is responsible for finding patterns according to the predefined task. Since this step is the most important within the KDD process, it is examined more closely in Section 2.4.

2.3.5 Evaluation

Finally, the user evaluates the extracted patterns with respect to the task defined in the focussing step. An important aspect of this evaluation is the representation of the found patterns. Depending on the given task, there are several quality measures and visualizations available to describe the result. An important phase is the representation of the results of the KDD process by visualization techniques, which allow the user to assess the results in an easier and more flexible way. If the user is satisfied with the quality of the patterns, the process is terminated. However, in most cases the results might not be satisfying after only one iteration. In those cases, the user might return to any of the previous steps to achieve more useful results.

2.4 Data Mining

Since data mining is the most important step within the KDD process, we treat it more carefully in this section. In [27, 30], data mining is defined as follows:

Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.

According to this definition, data mining is the step that is responsible for the actual knowledge discovery. Because data mining algorithms need to process large amounts of data, the definition emphasizes that the desired patterns have to be found under acceptable computational efficiency limitations. Note that there are many other definitions of data mining and that the terms data mining and KDD are often used synonymously.

Data mining has many tasks, such as:

1- Association Rules (AR): Given a database of transactions, where each transaction consists of a set of items, association discovery finds all the itemsets that frequently occur together, and also the rules among them. Association rules are examined more closely in Section 2.5.

2- Sequential Patterns: Sequence discovery aims at extracting sets of events that commonly occur over a period of time.

3- Classification and Regression: Classification aims to assign a new data item to one of several predefined categorical classes. The goal of classification and regression is to build a model that minimizes the error between the predicted and true values of the target variable [15,18]. It is also known as supervised induction [14]. Supervised induction is the machine learning task of inferring a function from supervised training data [30].

4- Clustering: Clustering is the process of grouping the data records into meaningful subclasses (clusters) in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters [10]. Clustering is also called unsupervised induction [3].

5- Similarity search: Similarity search is performed on a database of objects to find the object(s) that are within a user-defined distance from the queried object, or to find all pairs within some distance of each other.

Figure (2-2) Classification separates the data space (left) and clustering groups data objects (right)

2.5 Association Rule

Association rules are one of the promising aspects of data mining as a knowledge discovery tool, and have been widely explored to date [27,14]. They make it possible to capture all rules that explain the presence of some attributes according to the presence of other attributes. An association rule, X ⇒ Y, is a statement of the form "for a specified fraction of transactions, a particular value of an attribute set X determines the value of attribute set Y as another particular value under a certain confidence". Thus, association rules aim at discovering the patterns of co-occurrence of attributes in a database. For instance, an association rule in supermarket basket data may be "In 10% of transactions, 85% of the people buying milk also buy milky-sweets in that transaction". Association rules may be useful in many applications such as supermarket transaction analysis, store layout and promotions on the items, telecommunications alarm correlation, university course enrollment analysis, customer behavior analysis in retailing, catalog design, word occurrence in text documents, stock transactions, etc. [29,21,16].

Let I = {I1, ..., Im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I, and each transaction is associated with a unique identifier called TID.

Definition 2.1 An itemset X is a set of items in I. An itemset X is called a k-itemset if it contains k items from I.

Definition 2.2 A transaction T satisfies an itemset X if X ⊆ T. The support of an itemset X in D, supportD(X), is the number of transactions in D that satisfy X.

Definition 2.3 An itemset X is called a large itemset if the support of X in D exceeds a minimum support threshold explicitly declared by the user, and a small itemset otherwise.

Definition 2.4 The negative border of a set S ⊂ P(R), closed with respect to the set inclusion relation, is the set of minimal itemsets X ⊂ R not in S. The negative border of the set of large itemsets is the set of itemsets that are generated as candidates but fail to qualify for the set of large itemsets.

Definition 2.5 An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. X is called the antecedent of the rule, and Y is called the consequent of the rule. The rule X ⇒ Y holds in D with confidence c, where c = supportD(X ∪ Y) / supportD(X). The rule X ⇒ Y has support s in D if the fraction s of the transactions in D contains X ∪ Y.
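Definitions 2.2 and 2.5 can be checked with a few lines of Python. The following is a minimal sketch, not part of the original project: the function names are hypothetical, and the transactions are copied from Table (2.2) in the example of this section.

```python
from typing import FrozenSet, Iterable, List

# Transactions copied from Table (2.2): one set of item abbreviations per TID.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset("BCE"),    # TID 1
    frozenset("BCDE"),   # TID 2
    frozenset("ABCDE"),  # TID 3
    frozenset("BCD"),    # TID 4
    frozenset("ABF"),    # TID 5
    frozenset("ABCE"),   # TID 6
]


def support_count(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Definition 2.2: number of transactions T in D with X a subset of T."""
    x = frozenset(itemset)
    return sum(1 for t in transactions if x <= t)


def rule_support(antecedent: Iterable[str], consequent: Iterable[str],
                 transactions: List[FrozenSet[str]]) -> float:
    """Definition 2.5: fraction of the transactions that contain X ∪ Y."""
    xy = frozenset(antecedent) | frozenset(consequent)
    return support_count(xy, transactions) / len(transactions)


def rule_confidence(antecedent: Iterable[str], consequent: Iterable[str],
                    transactions: List[FrozenSet[str]]) -> float:
    """Definition 2.5: c = supportD(X ∪ Y) / supportD(X)."""
    x, y = frozenset(antecedent), frozenset(consequent)
    if x & y:
        raise ValueError("antecedent and consequent must be disjoint")
    support_x = support_count(x, transactions)
    if support_x == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    return support_count(x | y, transactions) / support_x


if __name__ == "__main__":
    print(support_count("BC", TRANSACTIONS))         # 5
    print(rule_confidence("A", "B", TRANSACTIONS))   # 1.0
    print(rule_support("A", "B", TRANSACTIONS))      # 0.5
```

The rule A ⇒ B therefore holds with confidence 3/3 and support 3/6 on this database.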

Example: Suppose I = {A, B, C, D, E} is the set of abbreviations of movie titles in a movie-CD shop; these abbreviations are shown in Table (2.1). Table (2.2) represents a database of the shop's sales, where each transaction is identified by a transaction identifier, TID. Table (2.3) shows the frequent itemsets according to minsup = 33%, while Table (2.4) depicts the ARs according to minconf = 100%.

Table (2.1) The items abbreviations of Database

Abbreviation   Item
A              Golden mountain
B              Gone with the Wind
C              Zoorba
D              Rain Man
E              Sound of Music


Table (2.2) The transactions of the Database

Transaction TID (Person)   Items (Attributes)
1                          B, C, E
2                          B, C, D, E
3                          A, B, C, D, E
4                          B, C, D
5                          A, B, F
6                          A, B, C, E

Table (2.3) Large itemsets with minsup = 33% (2 transactions)

Support    Itemsets                                              No.
6 = 100%   B                                                     1
5 = 83%    C, BC                                                 2
4 = 67%    E, BE, CE, BCE                                        4
3 = 50%    A, D, AB, BD, CD, BCD                                 6
2 = 33%    AC, AE, DE, ABC, ABE, ACE, BDE, CDE, ABCE, BCDE       10
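The grouping in Table (2.3) can be reproduced by brute force. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the transactions are those of Table (2.2), and a minimum count of 2 stands for minsup = 33% of six transactions. It enumerates every subset of the items, which is feasible only for such a tiny example.

```python
from collections import defaultdict
from itertools import combinations

# Transactions of Table (2.2), one set of item abbreviations per TID.
TRANSACTIONS = [set("BCE"), set("BCDE"), set("ABCDE"),
                set("BCD"), set("ABF"), set("ABCE")]


def large_itemsets(transactions, min_count):
    """Return {itemset name: support count} for every itemset whose
    support count is at least min_count (exhaustive enumeration)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                result["".join(candidate)] = count
    return result


def group_by_support(itemsets):
    """Group itemset names by their support count, as in Table (2.3)."""
    groups = defaultdict(list)
    for name, count in itemsets.items():
        groups[count].append(name)
    return dict(groups)


if __name__ == "__main__":
    groups = group_by_support(large_itemsets(TRANSACTIONS, min_count=2))
    for count in sorted(groups, reverse=True):
        print(count, groups[count])
```

Run on the six transactions, the sketch yields 1, 2, 4, 6 and 10 itemsets for the support counts 6, 5, 4, 3 and 2, matching the last column of the table.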

Table (2.4) Association rules with minconf = 100%

A → B (3/3)    AC → B (2/2)    AC → BE (2/2)
C → B (5/5)    AE → B (2/2)    AE → BC (2/2)
D → B (3/3)    AC → E (2/2)    DE → BC (2/2)
E → B (4/4)    AE → C (2/2)    ABC → E (2/2)
D → C (3/3)    DE → B (2/2)    ABE → C (2/2)
E → C (4/4)    DE → C (2/2)    ACE → B (2/2)
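The second step, rule extraction, can be sketched in the same way. This is a hedged illustration rather than the algorithm used in the cited literature: the function names are hypothetical, the transactions come from Table (2.2), and every large itemset is split into all possible antecedent and consequent pairs. Every rule listed in Table (2.4) appears in its output. Because the enumeration is exhaustive, it also reports further 100%-confidence rules, for example D → BC, since D and BCD both occur in three transactions.

```python
from itertools import combinations

# Transactions of Table (2.2), one set of item abbreviations per TID.
TRANSACTIONS = [set("BCE"), set("BCDE"), set("ABCDE"),
                set("BCD"), set("ABF"), set("ABCE")]


def count(itemset, transactions):
    """Support count: number of transactions containing the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def confident_rules(transactions, min_count, min_conf):
    """Return (antecedent, consequent, confidence) for every rule X -> Y
    whose itemset X ∪ Y is large and whose confidence is >= min_conf."""
    items = sorted(set().union(*transactions))
    rules = []
    for k in range(2, len(items) + 1):
        for itemset in combinations(items, k):
            whole = count(itemset, transactions)
            if whole == 0 or whole < min_count:
                continue  # X ∪ Y is not a large itemset
            for r in range(1, k):
                for antecedent in combinations(itemset, r):
                    conf = whole / count(antecedent, transactions)
                    if conf >= min_conf:
                        consequent = [i for i in itemset if i not in antecedent]
                        rules.append(("".join(antecedent),
                                      "".join(consequent), conf))
    return rules


if __name__ == "__main__":
    for x, y, conf in confident_rules(TRANSACTIONS, min_count=2, min_conf=1.0):
        print(f"{x} -> {y} ({conf:.0%})")
```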


The mining of Association Rules is decomposed into two subproblems:

1- Discovering all frequent (large) patterns, represented by the large itemsets defined above; and

2- Generating the association rules from those frequent itemsets.

The first subproblem is very tedious, I/O intensive, and computationally expensive for very large databases, and this is the case for many real-life applications. In large retailing data, the number of transactions is generally in the order of millions, and the number of items (attributes) is generally in the order of thousands. When the data contains N items, the number of possible large itemsets is 2^N. There are many algorithms to mine frequent itemsets, such as Apriori, AprioriTID, and AprioriHybrid [12]. The second subproblem is straightforward and can be solved efficiently in a reasonable time, and there is a well-known algorithm to accomplish the extraction of ARs. The databases of frequent itemsets and ARs are assumed to be available in this thesis; therefore there is no focus on any frequent itemset or AR mining algorithm.
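Although mining algorithms are outside the scope of this study, a short level-wise sketch shows why they avoid the full 2^N search space. The code below is a simplified illustration in the spirit of Apriori, not the published algorithm: the function name, the counter of examined candidates, and the example transactions (taken from Table (2.2)) are assumptions made for this example. A k-itemset is counted only if all of its (k-1)-subsets were found to be large.

```python
from itertools import combinations


def apriori_sketch(transactions, min_count):
    """Level-wise search for large itemsets.

    Returns (large, examined), where `large` maps each large itemset
    (a frozenset) to its support count and `examined` is the number of
    candidate itemsets whose support was actually counted.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    # Level 1: count every single item.
    current = {}
    for item in items:
        c = sum(1 for t in transactions if item in t)
        if c >= min_count:
            current[frozenset([item])] = c
    large = dict(current)
    examined = len(items)

    k = 2
    while current:
        previous = set(current)
        # Join step: unions of two large (k-1)-itemsets that have size k.
        # Prune step: keep a candidate only if all (k-1)-subsets are large.
        candidates = set()
        for a in previous:
            for b in previous:
                union = a | b
                if len(union) == k and all(
                        frozenset(s) in previous
                        for s in combinations(union, k - 1)):
                    candidates.add(union)
        examined += len(candidates)
        current = {}
        for cand in candidates:
            c = sum(1 for t in transactions if cand <= t)
            if c >= min_count:
                current[cand] = c
        large.update(current)
        k += 1
    return large, examined


if __name__ == "__main__":
    data = ["BCE", "BCDE", "ABCDE", "BCD", "ABF", "ABCE"]  # Table (2.2)
    large, examined = apriori_sketch(data, min_count=2)
    print(len(large), "large itemsets,", examined, "candidates counted")
```

On this small database the number of counted candidates stays well below the 2^6 = 64 subsets of the six items that occur in the transactions.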


Chapter Three

Visualization Techniques of

Association Rules



3.1 Introduction

This chapter presents the concept of visualization, the benefits of visualization, and the visualization techniques used to visualize association rules (AR) in the KDD process.

3.2 Visualization

Visualization is the process of transforming data, information, and knowledge into visual form, making use of humans' natural visual capabilities [9]. A typical visualization application is the field of computer graphics. The invention of computer graphics may be the most important development in visualization since the invention of central perspective in the Renaissance period. The development of animation also helped advance visualization. In spite of the importance of visualization, there are many limitations and difficulties that must be taken into consideration [28, 4]. The main limitations are:

• Visualization techniques are always difficult to evaluate.

• The implementation may require the use of an operating system from one specific vendor.

• The visualization techniques offered are very limited.

• A limitation of many 3D visualizations is the possible waste of screen space towards the corners of the screen.

• The traditional menu bar approach would require long mouse movements from the visualization to the menu bar and vice versa.

• Object interaction complexity occurs within a 3D environment; for example, the user can transform the parallel bar chart into a matrix format and vice versa.

3.3 Benefits of Visualization

Visual data exploration can be seen as a hypothesis generation process: the visualizations of the data allow the user to gain insight into the data and come up with new hypotheses. The verification of the hypotheses can also be done via data visualization, but may also be accomplished by automatic techniques from statistics, pattern recognition, or machine learning. In addition to the direct involvement of the user, the main advantages of visual data exploration over automatic data analysis techniques are:

• Visual data exploration can easily deal with highly non-homogeneous and noisy data.

• Visual data exploration is intuitive and requires no understanding of complex mathematical or statistical algorithms or parameters.

• Visualization can provide a qualitative overview of the data, allowing data phenomena to be isolated for further quantitative analysis.

As a result, visual data exploration usually allows a faster data exploration and often provides more interesting results, especially in cases where automatic algorithms fail. In addition, visual data exploration techniques provide a much higher degree of confidence in the findings of the exploration. These facts lead to a high demand for visual exploration techniques and make them indispensable in conjunction with automatic exploration techniques [6].

3.4 Visualization of Association Rule

Visualizing association rules aims at solving some major problems that come with association rules. First of all, the rules found by automatic procedures must be filtered. Depending on what minimum confidence and what support are specified, a vast number of rules may be generated.

There are at least five parameters involved in a visualization of association rules [19]:

· Sets of antecedent items.

· Sets of consequent items.

· Associations between antecedent and consequent.

· Rules' support.

· Rules' confidence.

The goal of association rule generation is to find interesting patterns and trends in transaction databases. Association rules are statistical relations between two or more items in the data set. In a supermarket basket application, associations express the relations between items that are bought together. It is, for example, interesting if we find out that in 70% of the cases when people buy bread, they also buy milk. Association rules tell us that the presence of some items in a transaction implies the presence of other items in the same transaction with a certain probability, called confidence. A second important parameter is the support of an association rule, which is defined as the percentage of transactions in which the items co-occur.

Let I = {i1, ..., in} be a set of items and let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X, Y ≠ ∅. The confidence c is defined as the percentage of the transactions containing X that also contain Y. The support is the percentage of transactions that contain both X and Y. For a given support and confidence level, there are efficient algorithms to determine all association rules. A problem, however, is that the resulting set of association rules is usually very large, especially for low support and confidence levels [8,9]. Using higher support and confidence levels may not be effective, since useful rules may then be overlooked. Pattern visualization techniques have been used to overcome this problem and to allow an interactive selection of good support and confidence levels.

Figure (3.1) shows SGI MineSet's Rule Visualizer [14], which maps the left and right hand sides of the rules to the x- and y-axes of the plot, respectively, and shows the confidence as the height of the bars and the support as the height of the discs. The color of the bars shows the interestingness of the rule.


Figure (3.1) MineSet's Association Rule Visualizer

Using the visualization, the user is able to see groups of related rules and the impact of different confidence and support levels. The goal of association rule visualization is to visualize a large number of association rules and their metadata in a two-dimensional (2D) or three-dimensional (3D) display with minimum human interaction, minimum occlusion, and no screen swapping. Many approaches have been developed to visualize association rules:

1- Rule Table

2- Two-dimensional matrix

3- Directed graph

4- Rule-item approach

5- Mosaic Plot

6- Double Decker Plot

7- Parallel Coordinates

8- Many-to-Many AR Visualization Technique

3.4.1 Rule Table

The most straightforward method for association rule visualization is the rule table. The following rule table format has been used [26]:

Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item N | Rule N | Antecedent N | Confidence | Support



Here the columns Item 1, Item 2, ..., Item N hold the items of the rule, Rule N is the number of items in the rule, and Antecedent N is the number of items in the rule antecedent. The remaining Rule N − Antecedent N items form the consequent.

Table (3.1) Example of Association Rules in Rule Table Format

Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item N | Rule N | Antecedent N | Confidence | Support
Bread  | Milk   | Null   | Null   | Null   | Null   | 2      | 1            | 90%        | 10%
Eggs   | Bread  | Milk   | Null   | Null   | Null   | 3      | 1            | 85%        | 7%
Milk   | Bread  | Eggs   | Olive  | Null   | Null   | 4      | 2            | 60%        | 3%

In Table (3.1), rule #3 (the third row) has Rule N = 4, meaning the rule consists of 4 items, and Antecedent N = 2, meaning there are 2 items in the rule antecedent. The rule is therefore Milk, Bread → Eggs, Olive, with confidence 60% and support 3%.
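The way a row of the rule table encodes a rule can be sketched in code. The snippet below is an illustration only: the function name `decode_row` and the tuple layout are assumptions, with `None` standing in for the table's Null entries and the first Antecedent N items taken as the antecedent, as in the worked reading of rule #3.

```python
def decode_row(row):
    """Split a rule-table row into (antecedent, consequent, confidence, support).

    `row` is (item columns..., rule_n, antecedent_n, confidence, support);
    unused item columns hold None, standing in for the table's Null.
    """
    *items, rule_n, antecedent_n, conf, supp = row
    used = [i for i in items if i is not None][:rule_n]
    return used[:antecedent_n], used[antecedent_n:], conf, supp


# Rows of Table (3.1); percentages are kept as plain integers.
table = [
    ("Bread", "Milk", None, None, None, None, 2, 1, 90, 10),
    ("Eggs", "Bread", "Milk", None, None, None, 3, 1, 85, 7),
    ("Milk", "Bread", "Eggs", "Olive", None, None, 4, 2, 60, 3),
]

decoded = [decode_row(r) for r in table]
for ante, cons, conf, supp in decoded:
    print(f"{', '.join(ante)} -> {', '.join(cons)}  (confidence {conf}%, support {supp}%)")
```

The third row decodes to the antecedent Milk, Bread and the consequent Eggs, Olive, matching the reading given in the text.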

The rule table is the most straightforward way to present association rules to users. However, it is suitable only for displaying a limited number of rules. If the user needs a global view of all the rules, the rule table is not a suitable approach.

3.4.2 Two-Dimensional Matrix

The design of a two-dimensional (2D) association matrix positions the antecedent and consequent items on separate axes of a square matrix. Customized icons are drawn on the matrix tiles that connect the antecedent and the consequent items of the corresponding association rules. Different icons can be used to depict different metadata such as the support and confidence values of the rules. Figure (3.2) depicts an association rule (B→C). Both the height and the color of the column icon can be used to present metadata values. The values of support and confidence are mapped to 3D columns that are built separately on and beneath the matrix tiles. Other icons such as disks and bars are also used to visualize metadata in the rule visualizer of MineSet [4,22,28]. A 2D matrix is arguably the most effective technique to show one-to-one binary relationships.

• The strengths of a 2D matrix, however, break down when we need to visualize many-to-one relationships such as association rules with multiple antecedent items. For example, in Figure (3.3) it is almost impossible to tell whether there is only one association rule (A+B→C) or two (A→C and B→C).
• The lack of a practical way to identify the togetherness of individual antecedent items makes a 2D matrix a weaker candidate to visualize rules with multiple antecedent items. MineSet [23] addresses the problem by grouping all the antecedent items of an association rule as one unit and plotting it against its consequent, i.e., an antecedent-to-consequent plot. For example, a dedicated item group (A+B) is created in Figure (3.4) to describe the association rule (A+B→C).

Figure (3.2) The colored column indicates the association rule (B→C). Different icon colors are used to show different metadata values of the association rule.

• The strategy works fine for smaller antecedent sets (e.g., fewer than 3 items). In our text mining studies, we encounter association rules with as many as 12 items in the antecedent.
• The replication of items in the antecedent groups creates a much larger antecedent-to-consequent plot when compared with the corresponding item-to-item plot. The loss of item identity within an antecedent group also defeats the purpose of visualizing the associations with a matrix. For example, the row (or column) of the matrix connected to an item can no longer be used to search for all the rules involving that item.



Figure (3.3) It is very difficult to determine the difference between (A+B→C) and (A→C and B→C).

Figure (3.4) The identities of A and B are lost in the new item group that was created to depict the association rule (A+B→C).

• Another problem in a 2D matrix display is object occlusion, especially when multiple icons are used to depict different metadata values on the matrix tiles. The occlusion problem is obvious in Figure (3.5).
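The many-to-one ambiguity of the item-to-item matrix described in this section can be reproduced with a small sketch. It assumes, for illustration only, that a rule is drawn by placing one icon on every (antecedent item, consequent item) tile; the function name `matrix_cells` is hypothetical.

```python
def matrix_cells(rules):
    """Return the set of (antecedent item, consequent item) tiles that get an icon."""
    cells = set()
    for antecedent, consequent in rules:
        for a in antecedent:
            for c in consequent:
                cells.add((a, c))
    return cells


one_rule = [({"A", "B"}, {"C"})]               # A+B -> C
two_rules = [({"A"}, {"C"}), ({"B"}, {"C"})]   # A -> C and B -> C

# Both rule sets occupy exactly the same tiles, so the plots look identical.
same_picture = matrix_cells(one_rule) == matrix_cells(two_rules)
print(same_picture)  # True
```

Under that assumption the two rule sets are indistinguishable on the matrix, which is the situation the text attributes to Figure (3.3).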



Figure (3.5) Object occlusions are unavoidable.

Figure (3.6) Left: A →C and B →C. Right: A+B→C.

3.4.3 Directed Graph

A directed graph is another prevailing technique to depict item associations. The nodes of a directed graph represent the items, and the edges represent the associations. Figure (3.6) shows three association rules (A→C, B→C, A+B→C).

• This technique works well when only a few items (nodes) and associations (edges) are involved. An association graph can quickly turn into a tangled display with as few as a dozen rules. Hetzler et al. [19] address the problem by animating the edges to show the association of certain items with 3D rainbow arcs. The animation technique requires significant human interaction to turn the item nodes on and off. It is also not easy to show multiple metadata values, including support and confidence, alongside the association rules.



3.4.4 Rule-to-Item Visualization Technique

To visualize many-to-one association rules, instead of using the tiles of a 2D matrix to show item-to-item association rules, a matrix of the rule-to-item relationship is used to depict many-to-one rules [19]. In Figure (3.7) the rows of the matrix floor represent the items (or topics in the context of text mining), and the columns represent the item associations. The blue and red blocks of each column (rule) represent the antecedent and the consequent of the rule. The identities of the items are shown along the right side of the matrix. The confidence and support levels of the rules are given by the corresponding bar charts in different scales at the far end of the matrix. The rule-to-item visualization approach has many advantages over all the other matrix-based predecessors:

• There is virtually no upper limit on the number of items in an antecedent. We can analyze the distributions of the association rules (horizontal axis) as well as the items within (vertical axis) simultaneously.
• Unlike Figure (3.4), the identity of individual items within an antecedent group is clearly shown.
• No new antecedent groups are created because of the multiple antecedent items in association rules.
• Because all the metadata are plotted at the far end and the height of the columns is scaled so that the front columns do not block the rear ones, few occlusions occur.
• No screen swapping, animation, or human interaction (other than basic mouse zooming) is required to analyze the rules.

Although this technique is better than its predecessors, it suffers from serious drawbacks:

• It is unable to visualize many-to-many association rules.
• It suffers from antecedent-consequent interleaving, i.e., the items of the antecedent and consequent are interleaved, although they are given different colors.
• It degrades the natural sequence of the rule's parts.

Figure (3.7) A visualization of item associations with support 0.4% and confidence 50%.
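The rule-to-item layout described above can be sketched as a small text grid, with one row per item and one column per rule. This is an assumption-laden illustration rather than the implementation behind Figure (3.7): the function name, the cell codes 'A', 'C' and '.', and the three sample rules are all hypothetical.

```python
def rule_to_item_matrix(rules):
    """Build a rows=items, columns=rules grid.

    Each cell is 'A' (antecedent), 'C' (consequent) or '.' (item not in rule).
    """
    items = sorted({i for ante, cons in rules for i in ante | cons})
    grid = []
    for item in items:
        row = []
        for ante, cons in rules:
            row.append("A" if item in ante else "C" if item in cons else ".")
        grid.append(row)
    return items, grid


# Hypothetical many-to-one rules, invented for illustration only.
rules = [({"a", "c"}, {"b"}), ({"b"}, {"d"}), ({"a", "b", "d"}, {"c"})]

items, grid = rule_to_item_matrix(rules)
for item, row in zip(items, grid):
    print(item, " ".join(row))
```

Reading the first column top to bottom gives A, C, A: the consequent item b sits between the antecedent items a and c, which is the interleaving drawback noted in the text.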

3.4.5 Parallel Coordinates

In parallel coordinates [1,2,13], the basic elements of association rules are sets of items, which can be handled by listing all items along a vertical coordinate. The resulting coordinate is then repeated evenly in the horizontal direction until there are enough coordinates to host the longest association rule. An association rule can be visualized as a polygonal line connecting all items in the rule. Parameters such as support and confidence can be mapped to graphical features such as line width and color. Figure (3.8) illustrates an association rule ab → cd as one polygonal line for its LHS, followed by an arrow connecting to another polygonal line for its RHS. This visualization handles nicely the upward closure property of association rules: subsets of the RHS are absorbed and are not displayed. For example, ab → cd implies that abc → d, abd → c, ab → c, and ab → d are valid association rules. The implied association rules are not displayed. Two or more itemsets or rules may have parts in common, for example, adbe and cdb in Figure (3.8).

Figure (3.8) Association rule ab → cd in the Parallel Coordinates visualization technique
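The set of rules that a displayed rule absorbs can be enumerated mechanically. The sketch below is one possible reading of the ab → cd example, assuming an implied rule is obtained by moving some RHS items to the LHS and keeping a non-empty subset of the remaining RHS items; the function name `implied_rules` is hypothetical.

```python
from itertools import combinations


def implied_rules(lhs, rhs):
    """List the rules implied by lhs -> rhs.

    Each implied rule moves some RHS items (possibly none) to the LHS and
    keeps a non-empty subset of the remaining RHS items.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    result = set()
    for k in range(len(rhs)):  # k = number of RHS items moved to the LHS
        for moved in combinations(sorted(rhs), k):
            rest = rhs - set(moved)
            for m in range(1, len(rest) + 1):
                for kept in combinations(sorted(rest), m):
                    rule = (lhs | set(moved), frozenset(kept))
                    if rule != (lhs, rhs):  # skip the original rule itself
                        result.add(rule)
    return result


rules = implied_rules("ab", "cd")
for left, right in sorted(rules, key=lambda r: (sorted(r[0]), sorted(r[1]))):
    print("".join(sorted(left)), "->", "".join(sorted(right)))
```

For ab → cd this yields exactly the four rules listed in the text: ab → c, ab → d, abc → d, and abd → c.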

3.4.6 Mosaic Plot

The basic idea is to partition a rectangle along the y-axis according to one attribute and make the regions proportional to the sum of the corresponding data values; the height of the bars, rather than the width, shows the parameter value. Each resulting area is then split in the same way according to a second attribute [13]. The coloring reflects the percentage of data items that fulfill a third attribute. The visualization shows the support and confidence values of all rules of the form X1,X2 → Y, as in Figure (3.9). Mosaic plots are restricted to two attributes on the left side of the association rule [6].



Figure (3.9) X1,X2 → Y in Mosaic Plot

Figure (3.10) X1,X2 → Y in Double Decker Plot

3.4.7 Double Decker Plot

Double decker plots can be used to show more than two attributes on the left side. The idea is to show a hierarchy of attributes at the bottom (heineken, coke, and chicken in the example shown in Figure (3.10)), corresponding to the left-hand side of the association rules. The bars on top correspond to the number of items in the corresponding subset of the database and therefore visualize the support of the rule. The colored areas in the bars correspond to the percentage of data transactions in that subset that contain an additional item and therefore correspond to the confidence [6,11].
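The quantities that a double decker plot draws can be computed directly from the transactions. The following is a rough sketch under stated assumptions: the function name, the sample baskets, and the additional item "chips" are hypothetical; only the left-hand-side item names are taken from the example in the text.

```python
from itertools import product


def double_decker_bars(transactions, lhs_items, target):
    """For every present/absent combination of `lhs_items`, return
    (combination, bar width, highlighted share).

    Bar width is the share of all transactions matching the combination;
    the highlighted share is the fraction of those that also contain `target`.
    """
    bars = []
    total = len(transactions)
    for combo in product([True, False], repeat=len(lhs_items)):
        matching = [
            t for t in transactions
            if all((item in t) == present for item, present in zip(lhs_items, combo))
        ]
        width = len(matching) / total
        share = sum(target in t for t in matching) / len(matching) if matching else 0.0
        bars.append((combo, width, share))
    return bars


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"heineken", "coke", "chicken", "chips"},
    {"heineken", "coke", "chicken"},
    {"heineken", "chips"},
    {"coke"},
]

bars = double_decker_bars(baskets, ["heineken", "coke", "chicken"], "chips")
for combo, width, share in bars:
    print(combo, width, share)
```

Each tuple corresponds to one bar: the bar for baskets containing all three left-hand-side items covers half of the sample data, and half of that bar would be highlighted for the additional item.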



3.4.8 Many to Many AR Visualization Technique

As previously mentioned, three of the approaches developed to visualize association rules are the two-dimensional matrix, the directed graph, and the rule-to-item approach. It was also shown that the rule-to-item approach is the best of these in spite of its drawbacks, namely its inability to represent many-to-many ARs and the interleaving of consequent and antecedent items in the visualization area. This section presents a new technique that removes these drawbacks: it avoids item interleaving and efficiently represents many-to-many ARs. The technique is called the many-to-many AR visualization technique, MARVT.

In this technique the visualization area is divided into three regions: the antecedent region, the statistical region, and the consequent region. The technique can be implemented in two or three dimensions. In the 2D implementation, the x-axis of the visualization area holds the rule identifiers, while the y-axis of the antecedent region holds the items of the antecedents of the rules to be visualized. The y-axis of the statistical region is divided according to the confidence and support levels of the rules, while the y-axis of the consequent region holds the items of the consequents of the selected rules. Figure (3.11) depicts the general structure of the visualization area of the proposed technique. If an item i belongs to the antecedent of a rule R, a red ellipse is drawn at position (R, i) of the antecedent region, and if an item j is part of the consequent of R, a black ellipse is drawn at position (R, j) of the consequent region. The statistical region shows important statistical values for each rule, such as its confidence, its support, the support of its antecedent itemset, and the support of its consequent itemset, in the part of the region assigned to that rule. The y-axis of the statistical region begins at the minsup and minconf thresholds and ends at 100%. The technique is flexible enough to visualize more statistical information, such as the support of each item. It is also possible to display the order of the rule.

In the 3D implementation, the same regions are used. The x-axis is determined by the rule id. The y-axis is determined by the items of the antecedent and consequent for their respective regions. The z-axis is determined by the support and confidence, beginning at the minconf or minsup threshold. The third dimension is used to show the support of the items, the confidence and support of a rule, and the support of the antecedent and consequent itemsets.

With this technique it is possible to visualize many-to-many, one-to-many, many-to-one rules, and so on, because it provides two separate regions for the antecedent and consequent, each of which can hold an unlimited number of items. This separation also eliminates item interleaving, because the items of the consequent and antecedent are presented in different regions.

Figure (3.11) General Structure of the Visualization Area of the Proposed Many-to-Many Association Rules Visualization Technique, MARVT.



To illustrate this technique, consider the following rules:

1- a,b → c,q1, with confidence 63 and support 2.
2- a,b,c → q1,m, with confidence 100 and support 3.
3- b,c → c,m,q1, with confidence 50 and support 1.

Figure (3.12) shows a hypothetical visualization of these rules. As shown, the antecedent items of R1 are a and b; therefore, the positions (R1, a) and (R1, b) of the antecedent region are marked with red ellipses, and so on for the rest of the rules. Also, (R1, c) and (R1, q1) of the consequent region are marked with black ellipses because c and q1 are the consequent items of R1. The same process is applied to R2 and R3. The statistical region visualizes the support of the antecedent and consequent itemsets and, furthermore, the support and confidence of the rules. It is also possible to add the support of each item beside its ellipse. For example, the number 3 beside the ellipse of item a in R1 represents the support of item a, and so on for each item. Figure (3.13) depicts the general 3D structure of MARVT. This structure preserves the same regions: consequent, antecedent, and statistical.

Figure (3.12) Visualization Area of the Many-to-Many Association Rules Visualization Technique
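The placement of marks in the three MARVT regions can be sketched for the three example rules. This is a minimal illustration, not the author's implementation: the function name `marvt_layout` and the returned data structures are assumptions, and only the rule contents and their confidence and support values are taken from the text.

```python
def marvt_layout(rules):
    """Return the marks for the three MARVT regions.

    `rules` maps a rule id to (antecedent, consequent, confidence, support).
    """
    antecedent_marks, consequent_marks, statistics = [], [], {}
    for rule_id, (ante, cons, conf, supp) in rules.items():
        # One mark per (rule, item) position in the antecedent region.
        antecedent_marks += [(rule_id, item) for item in sorted(ante)]
        # One mark per (rule, item) position in the consequent region.
        consequent_marks += [(rule_id, item) for item in sorted(cons)]
        # Values plotted in the statistical region for this rule.
        statistics[rule_id] = {"confidence": conf, "support": supp}
    return antecedent_marks, consequent_marks, statistics


# The three example rules from the text.
rules = {
    "R1": ({"a", "b"}, {"c", "q1"}, 63, 2),
    "R2": ({"a", "b", "c"}, {"q1", "m"}, 100, 3),
    "R3": ({"b", "c"}, {"c", "m", "q1"}, 50, 1),
}

ante_marks, cons_marks, stats = marvt_layout(rules)
print(ante_marks)
print(cons_marks)
print(stats)
```

Because the antecedent and consequent marks are kept in separate lists, an item such as c in R3 can appear in both regions without the two roles being interleaved.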



Figure (3.13) 3D General Structure of MARVT


Chapter four

Summary and Future work

4.1 Introduction

Chapter three presented the most important techniques for visualizing association rules. This chapter summarizes these techniques by reviewing their most important advantages and disadvantages.

4.2 Summary

This section reviews the most important characteristics of the techniques presented above.

4.2.1 Rule Table

1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Able to sort the results by the column of interest.
3- Visualizes full details of the rule (antecedent, consequent, support, confidence).
4- Displays only a limited number of rules.
5- Its main limitation is its close resemblance to the original raw textual form, so the user can inspect only a few rules without having a global view of all the information.
6- Not interactive.


4.2.2 Two-Dimensional Matrix

1- Effective technique to show one-to-one binary relationships.
2- Breaks down when we need to visualize many-to-one and many-to-many relationships.
3- Visualizes full details of the rule (antecedent, consequent, support, confidence).
4- Object occlusion, especially when multiple icons are used to depict different metadata values on the matrix tiles.
5- Limited number of rules.
6- Not interactive.

4.2.3 Directed Graph

1- Visualizes one-to-one and many-to-one relationships.
2- Displays only a limited number of rules.
3- Lacks a clear representation of the support and confidence.
4- Edges of different rules overlap with each other.
5- Not interactive.

4.2.4 Rule-to-Item Visualization Technique

1- Visualizes many-to-one relationships.
2- Breaks down when we need to visualize many-to-many relationships.
3- No upper limit on the number of items in an antecedent.
4- Clearly shows the individual items within an antecedent group.
5- No new antecedent groups are created because of multiple antecedent items in association rules.
6- No object occlusion.
7- Deterioration of the naturalness of the rule's parts sequence.
8- Interleaving of the items of the antecedent and consequent, although they are given different colors.
9- Interactive.



4.2.5 Parallel Coordinates

3- Visual rules overlap with each other.
4- Object occlusion.
5- Lacks a clear representation of the support and confidence, as shown in Figure (4.1).

Figure (4.1) The rules overlap, and the support and confidence are not clearly represented.


4.2.6 Mosaic Plot

2- Restricted to two attributes on the left side of the association rule.
3- Visualizes one rule at a time.
4- Difficult to understand and implement.
5- Lacks a clear representation of the support and confidence.



4.2.7 Double Decker Plot

2- Shows more than two attributes on the left side.
3- Visualizes one rule at a time.
4- Lacks a clear representation of the support and confidence.
5- Difficult to understand and implement.


4.2.8 Many to Many AR Visualization Technique

1- Best technique to visualize many-to-many relationships.
3- No object occlusion.
4- No upper limit on the number of items in an antecedent.
5- Clear representation of the support and confidence.
6- Interactive.
7- Flexible enough to visualize more statistical information.
8- It is possible to display the order of the rule.

4.3 Future work

The exploration of large data sets is an important but difficult problem. Information visualization techniques can be useful in solving this problem. Visual data exploration has a high potential, and many applications such as fraud detection and data mining can use information visualization technology for improved data analysis.

Avenues for future work include the tight integration of visualization techniques with traditional techniques from such disciplines as statistics, machine learning, operations research, and simulation. Integration of visualization techniques and these more established methods would combine fast automatic data mining algorithms with the intuitive power of the human mind, improving the quality and speed of the data mining process. Visual data mining techniques also need to be tightly integrated with the systems used to manage the vast amounts of relational and semi-structured information, including database management and data warehouse systems. The ultimate goal is to bring the power of visualization technology to every desktop to allow a better, faster and more intuitive exploration of very large data resources. This will not only be valuable in an economic sense but will also stimulate and delight the user.


References

[1] Alfred Inselberg, "Parallel Coordinates: Visual Multidimensional Geometry and Its Application", University of San Francisco, 2009.

[2] Alfred Inselberg, "Visualizing High Dimensional Datasets and Multivariate Relations" (tutorial), in Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2000), Boston, MA, 2000.

[3] Anil K. Jain and Richard C. Dubes, "Algorithms for Clustering Data", Prentice Hall, 1988.

[4] B. Bustos, D. Keim, C. Panse, T. Schreck, "Pattern Visualization", 2003.

[5] Cheung D.W., Ng V., Fu A.W. and Fu Y., "Efficient Mining of Association Rules in Distributed Databases", Special Issue on Data Mining, IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society, 1996.

[6] Daniel Keim and Matthew Ward, "Visual Data Mining Techniques", University of Konstanz, Germany and Worcester Polytechnic Institute, USA, 2002.

[7] D. Bruzzese, C. Davino, "Visual Post-Analysis of Association Rules", Dept. of Mathematics and Statistics, University of Naples Federico, Italy, {dbruzzes, cdavino}@unina.it, 2002.

[8] D. Keim, "Designing Pixel-Oriented Visualization Techniques", University of Florida, 2000.

[9] Gershon N., Eick S. G., and Card S., "Information Visualization", ACM Interactions, vol. 5, no. 2, pp. 9-15, March/April 1998.

[10] G. Karypis and V. Kumar, "Scalable Parallel Data Mining for Association Rules", University of Arizona, 2000.

[11] H. Hofmann, A. Siebes, and A. Wilhelm, "Visualizing Association Rules with Interactive Mosaic Plots", SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD 2000), Boston, MA, 2000.

[12] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", in Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), Dallas, TX, May 2000.

[13] Martin A., Ward M.O., "High Dimensional Brushing for Interactive Exploration of Multivariate Data", in Proc. IEEE Conf. on Visualization, Atlanta, 1995.

[14] Matthias Schubert, "Advanced Data Mining Techniques for Compound Objects", Maximilians-University, 2004.

[15] M. Deshpande and G. Karypis, "Evaluation of Techniques for Classifying Biological Sequences", Taipei, Taiwan, 2002.

[16] Michael Hahsler and Sudheer Chelluboina, "Visualizing Association Rules: Introduction to the R-extension Package arulesViz", Southern Methodist University, 2004.

[17] M. J. Zaki and C. J. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining", in Proc. 2002 SIAM Int. Conf. Data Mining (SDM'02), pages 457–473, Arlington, VA, April 2002.

[18] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, "Introduction to Data Mining", University of Minnesota, 2005.

[19] P. C. Wong, P. Whitney, J. Thomas, "Visualizing Association Rules for Text Mining", Pacific Northwest National Laboratory, 2000.

[20] Rakesh Agrawal, Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules", IBM Almaden Research Center, 1994.

[21] Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami, "Mining Association Rules between Sets of Items in Large Databases", SIGMOD Conference, 1993.

[22] Redpath, B. Srinivasan, "Criteria for Comparative Study of Visualization Techniques in Data Mining", IEEE 3rd Int. Conf. on Intelligent System, Tulsa, USA, 2003.

[23] S. G. Inc., MineSet, http://www.sgi.com/software/mineset, 2001.

[24] Simeon J. Simoff, Michael H. Böhlen, "Visual Data Mining", University of Western Sydney, 1998.

[25] Stefanos Manganaris, "Supervised Classification with Temporal Data", PhD thesis, School of Engineering, Vanderbilt University, 1997.

[26] Thomas S., "Architectures and Optimizations for Integrating Data Mining Algorithms with Database Systems", Ph.D. dissertation, University of Florida, Gainesville, 1998.

[27] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Editors), "Advances in Knowledge Discovery and Data Mining", Menlo Park, 1996.

[28] U. M. Fayyad, G. Grinstein, "Information Visualization in Data Mining and Knowledge Discovery", Morgan Kaufmann, San Francisco (CA), 2004.

[29] Vincent Wing-Sing Cho, "Knowledge Discovery from Distributed and Textual Data", Hong Kong University of Science and Technology, 1999.

[30] http://en.wikipedia.org/wiki/Association_rule_learning.

Republic of Iraq

Ministry of Higher Education and Scientific Research

Iraqi Commission for Computers and Informatics

Informatics Institute for Postgraduate Studies

Study of Association Rules' Visualization Techniques

A project submitted to the Informatics Institute for Postgraduate Studies / Iraqi Commission for Computers and Informatics as partial fulfillment of the requirements for the degree of Higher Diploma in Web Site Technology

By

Mustafa Sabah Shaheed

Supervised by

Dr. Hussein Al-Khafaji

Rabi' al-Awwal 1432

February 2011
