association rule visualization technique
Republic of Iraq
Ministry of Higher Education & Scientific Research
Iraqi Commission for Computers and Informatics
Informatics Institute for Postgraduate Studies
Study of Association Rules' Visualization Techniques
A Project
Submitted to the Informatics Institute
For Postgraduate Studies of the Iraqi Commission
For Computers and Informatics as a partial fulfillment of the
Requirements for the degree of Higher Diploma in Web Site Technology in Computer Science
By
Mustafa S.Shaheed
Supervised by
Dr. Hussein K. Khafaji Baghdad, Iraq
Feb 2011 1432
In the name of Allah, the Most Gracious, the Most Merciful
"My Lord, increase me in knowledge"
Allah the Almighty has spoken the truth
Surah Taha, verse 114
Dedication
To My Family With Love
And Affection
Acknowledgments
My first and deepest gratitude goes to ALLAH the
almighty for his uncountable blessing, help, and
guidance.
I would like to express my deepest appreciation to
my supervisor Dr. Hussein K. Khafaji for his guidance,
helpful comments, and suggestions.
Supervisor's Certification
I certify that the project entitled "Comparative Study of Association Rules' Visualization Techniques" was prepared under my supervision at the Informatics Institute for Postgraduate Studies in the Iraqi Commission for Computers and Informatics as a partial fulfillment of the requirements for the degree of Higher Diploma in Web Site Technology in Computer Science.
Signature:
Name: Dr. Hussein K. Khafaji
Date: /2/2011
Examining Committee Certification
We certify that we have read this project, entitled "Comparative Study of Association Rules' Visualization Techniques", and, as an examining committee, examined the student "Mustafa S. Shaheed" in its contents and in what is related to it, and that in our opinion it meets the standard of a project for the Higher Diploma in Web Site Technology in Computer Science.
Signature
Name: Dr. Hussein K. Khafaji
Title:
Date: /2/2011
Supervisor
Approved by the Informatics Institute for Postgraduate Studies of the
Iraqi Commission for Computers and Informatics.
Signature
Name: Prof. Dr. Imad Hussain Al-Hussaini
Date: /10/2010
Dean of the Institute
Signature
Name: Dr.
Title:
Date: /2/2011
Chairman
Signature
Name: Dr.
Title:
Date: /2/2011
Member
Signature
Name: Dr.
Title:
Date: /2/2011
Member
Abstract
As computers are used in more and more areas, large volumes of data are continuously collected and stored in databases. An important issue is how to find useful information in these massive data.
Data mining, also known as knowledge discovery in databases, is a research area concerned with extracting implicit, understandable, previously unknown, and potentially useful information from data.
Association rules are one of the most widespread data mining tools because they provide valuable information for many application fields, in spite of the difficulty of mining them.
The exploration of large data sets is an important but difficult problem. Information visualization techniques can be useful in solving this problem. Visual data exploration has high potential and many applications.
Association rule visualization is emerging as a crucial step in the data mining process, needed to use the extracted knowledge profitably.
This project studies the most important techniques used to present association rules discovered from databases by algorithms developed for this purpose, and identifies the strengths and weaknesses of these techniques in order to reach the most appropriate technique for addressing the main drawback of association rules.
List of Contents
Chapter One: Introduction
1.1 Introduction
1.2 Introduction to Data Mining
1.3 Introduction to Association Rule
1.4 Introduction to Functional Dependencies
1.4.1 Candidate Key
1.5 Aim of the study
Chapter Two: Data Mining And Functional Dependency
2.1 Introduction
2.2 Data Mining Overview
2.2.1 Data Mining Application
2.2.2 The process before Data Mining
2.2.3 Data Mining tasks
2.2.3.1 Association Rules
2.2.3.2 Apriori algorithm
2.3 Functional depe
2.3.1 Definition (1)
2.3.2 Definition (2)
2.3.3 Multi Valued Dependencies
2.4 Candidate Keys
2.5 Primary Key
2.6 Super key
2.7 Armstrong's Axioms
Chapter Three: Proposed System To Determine the Candidate Keys
3.1 Introduction
3.2 The relation between data mining and functional dependency
3.3 An Algorithm of determining closure sets
3.4 System Architecture
3.4.1 Sets Generator
3.4.2 Candidate key tester
3.5 Set closure producer
3.6 Key filter
3.7 Candidate keys system execution
Chapter Four: Discussion, and Future works
4.1 Discussion
4.2 Future works
List of algorithms
Algorithm (3-1) testing the closure of sets of attributes algorithm
Algorithm (3-2) Rule testing algorithm
Algorithm (3-3) Closure generator algorithm
List of programs
Program (3-1) Candidate key tester
Program (3-2) Candidate key function
Program (3-3) merge program
List of Figures
Figure (3-1) the architecture of generating candidate keys
Figure (3-2) the main view of application
Figure (3-3) the interface of set generator
Figure (3-4) the interface of candidate key tester
Figure (3-5) the interface of table (sets)
Figure (3-6) the interface of table (candid)
List of tables
Table (2.1) A database with 4 items and 5 transactions
Table (2.2) How employees get to work
Table (2.3) Functional Dependencies defined over two sets
Table (2.4) Employees information
Table (2.5) Students information
Table (2.6) Managers phone#
Table (2.7) Manager- employee
Table (2.8) Relation of Managers, phone, and employee
Table (3.1) Sets stored table
Table (3.2) Candidate keys stored table
Table (3.3) Temporary values stored table
Chapter One
Introduction
1.1 Overview
Recent years have seen an enormous increase in the amount of information stored in electronic format. It has been estimated that the amount of collected information in the world doubles every 20 months, that the size and number of databases are increasing even faster, and that the ability to rapidly collect data has outpaced the ability to analyze it. Information is crucial for decision making, especially in business operations. As a response to those trends, the term 'Data Mining' (or 'Knowledge Discovery') has been coined to describe a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation. Automated tools must be developed to help extract meaningful information from a flood of information. Moreover, these tools must be sophisticated enough to search for correlations among the data unspecified by the user, as the potential for unforeseen relationships to exist among the data is very high. A successful tool set to accomplish these goals will locate useful nuggets of information in the otherwise chaotic data space, and present them to the user in a contextual format.
Knowledge discovery in databases (KDD) is a new field that draws on ideas from statistics, machine learning, databases, parallel computing, computer graphics, data visualization, and other fields. KDD systems generally use methods, algorithms, and techniques from all of these fields. It has emerged because of the extraordinary growth of data in all areas of human activity, the inability of database management systems (DBMS) to extract hidden knowledge from databases, and the economic and scientific need for such knowledge. KDD includes techniques and tools to address this need.
Reference [27] defines knowledge discovery in databases as follows:
"KDD is the non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in the data".
Much of the literature uses the terms data mining (DM) and KDD
interchangeably and regards them as synonymous. At the first
international KDD conference in Montreal in 1995, it was proposed that
the term "KDD" be employed to describe the whole process of
extracting knowledge from data. It was further proposed that the term
'data mining' be used exclusively for the discovery stage of the
KDD process. A more or less official definition of DM is the process of
automatic extraction of novel, useful, and understandable patterns
in large databases [20,21]. Hence, KDD
includes several steps: Focussing, Preprocessing,
Transformation, Data Mining, and Evaluation. Figure (1.1) abstracts
the KDD process [14].
1- Focussing :- define the goal of the particular KDD task.
2- Preprocessing :- specified data has to be integrated.
3- Transformation :- assure that each data object is represented in a
common form which is suitable as input in the next step.
4- Data Mining :- detect the desired patterns contained within the
given data.
5- Evaluation :- the user evaluates the extracted patterns with
respect to the task defined in the focussing step.
Data mining is the most important step within the KDD process. Reference [27] defines data mining as follows:
Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.
According to this definition, data mining is the step that is responsible for the actual knowledge discovery. Data mining has many tasks, such as Association Rules (AR), Sequential Patterns, Classification, Clustering, and Similarity Search.
Association rule mining is the most important task of DM. ARs represent the correlation between sets of items in a transaction database. An AR is an implication of the form X ⇒ Y with confidence c%, where X and Y are sets of items, each of which is called an itemset. X is called the antecedent and Y the consequent, such that X ∩ Y = ∅, and c% is the confidence of the implication. For example, the following rule
{"The love in cholera era", "The Merchant of Venice", "Zoorba"} ⇒ {"The Trees and Marzooq's Association", "One Hundred Years of Segregation"} (60%)
means that a person who reads the novels "The love in cholera era", "The Merchant of Venice", and "Zoorba" also reads the novels "The Trees and Marzooq's Association" and "One Hundred Years of Segregation", with a certainty factor of 60%. The confidence of a rule is calculated as follows: Confidence = support(X ∪ Y) / support(X), where the support of an itemset is the number of its occurrences in the database. A confident rule is one whose confidence is greater than or equal to a user-defined threshold called the minimum confidence, minconf.
ARs are extracted from mined frequent itemsets. Mining frequent itemsets is a very complex process [3]. The mining of association rules consists of two steps: the first is mining the frequent itemsets, while the second is extracting the rules from these frequent itemsets. The first step is computationally massive and has attracted researchers' interest for many years; many algorithms have been produced to accomplish this complicated mining process, such as Apriori, AprioriTID, AprioriHybrid [20], FP-growth [12], and CHARM [17]. The second step is extracting the association rules from the results of the previous step.
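The confidence formula above, Confidence = support(X ∪ Y) / support(X), can be illustrated with a minimal Python sketch. The reading lists, the item names n1 to n3, the function names, and the minconf value are all assumptions made for illustration; they are not data from this project.

```python
from typing import Iterable, List, Set


def support_count(itemset: Iterable[str], transactions: List[Set[str]]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """Confidence = support(X u Y) / support(X) for the rule X => Y."""
    x, y = set(antecedent), set(consequent)
    if x & y:
        raise ValueError("antecedent and consequent must be disjoint")
    support_x = support_count(x, transactions)
    if support_x == 0:
        raise ValueError("antecedent never occurs in the database")
    return support_count(x | y, transactions) / support_x


# Hypothetical reading lists, one set of novels per reader.
readers = [
    {"n1", "n2", "n3"},
    {"n1", "n2"},
    {"n1", "n3"},
    {"n2", "n3"},
    {"n1", "n2", "n3"},
]

MINCONF = 0.6  # assumed user-defined minimum confidence
conf = confidence({"n1"}, {"n2"}, readers)
is_confident = conf >= MINCONF
```

In this toy database n1 occurs in four reading lists and n1 together with n2 in three, so the rule n1 ⇒ n2 passes the assumed threshold.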
The main drawback of association rules is thus the huge number of extracted rules, which cannot be manually inspected by the user, and the existence of trivial or meaningless associations that are usually mined because of the exhaustive nature of the extraction algorithms [24]. Graphical tools and pruning methods are the main approaches used to face these problems. For data mining to be effective and well evaluated, it is important to include the human in the data exploration process and to combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and computational power of today's computers.
Visual data exploration aims at integrating the human in the data exploration process, applying human perceptual abilities to the analysis of large data sets available in today's computer systems. The basic idea of visual data exploration is to present the data in some visual form, allowing the user to gain insight into the data, draw conclusions, and directly interact with the data. Visual data mining techniques have proven to be of high value in exploratory data analysis, and have a high potential for exploring large databases. Visual data exploration is especially useful when little is known about the data and the exploration goals are vague. Since the user is directly involved in the exploration process, shifting and adjusting the exploration goals is done automatically if necessary. There are many techniques for representing data visually; some of them are discussed in this project.
Figure (1-1) Visualization and Data Mining
1.2 Aim of the project
The aim of the project is to study the techniques used to present association rules discovered from databases by algorithms developed for this purpose, and to identify the strengths and weaknesses of these techniques in order to reach the most appropriate technique for solving the main drawback of association rules.
1.3 Project Outline
Chapter two explains the stages of Knowledge Discovery in Databases (KDD) and the tasks of data mining, and concentrates on association rules (AR).
Chapter three focuses on the concept of visualization, its benefits, and the visualization techniques used to visualize association rules (AR), because of their importance to this study.
Chapter four presents the summary and future work on the techniques used to visualize association rules.
Chapter Two
Data Mining And
Association Rules
2.1 Introduction
This chapter presents the general steps of Knowledge discovery
in databases (KDD) and its relation with data mining. Also, it presents
the tasks of data mining (DM) and concentrates on Association rules
due to their importance as an interesting field of DM.
2.2 Knowledge Discovery in Databases
In recent years the amount of data collected by advanced
information systems has increased tremendously. Although very useful
information of strategic importance is buried within this data, this
information is not readily available to users. To analyze these huge
amounts of data, the interdisciplinary field of Knowledge Discovery in
Databases (KDD) has emerged. KDD applies efficient algorithms to extract
interesting patterns and regularities from the data.
KDD is defined as follows[27] :
Knowledge Discovery in Databases is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data.
According to this definition, data is a set of facts that is somehow
accessible in electronic form. The term patterns indicates models and
regularities which can be observed within the data. Patterns have to be
valid, i.e. they should be true on new data with some degree of certainty.
A novel pattern is not previously known or trivially true. The potential
usefulness of patterns refers to the possibility that they lead to an action
providing a benefit.
A pattern is understandable if it is interpretable by a human user.
Finally, KDD is a process, indicating that there are several steps that are
repeated in several iterations.
Figure 2.1 displays the process of KDD in its basic form.
Figure (2-1) The KDD process
2.3 KDD Process Stages
The KDD process is an interactive and iterative multi-step process
which uses five steps to extract interesting knowledge according to
some specific measures and thresholds [14]:
1- Focussing
2- Preprocessing
3- Transformation
4- Data Mining
5- Evaluation
2.3.1 Focussing
The first step is to define the goal of the particular KDD task.
Another important aspect of this step is to determine the data to be
analyzed and how to obtain it.
2.3.2 Preprocessing
In this step the specified data has to be integrated, because it is not
necessarily accessible on the same system. Furthermore, several objects
may be described incompletely. Thus, the missing values need to be
completed and inconsistent data should be corrected or left out.
2.3.3 Transformation
The transformation step has to assure that each data object is
represented in a common form which is suitable as input in the next step.
2.3.4 Data Mining
Data mining is the application of efficient algorithms to detect the
desired patterns contained within the given data. Thus, the data mining
step is responsible for finding patterns according to the predefined task.
Since this step is the most important within the KDD process, we are
going to have a closer look at it in the next section (2.4).
2.3.5 Evaluation
Finally, the user evaluates the extracted patterns with respect to the
task defined in the focussing step. An important aspect of this evaluation
is the representation of the found patterns. Depending on the given task,
there are several quality measures and visualizations available to
describe the result. Visualization techniques are an important way to
represent the result of the KDD process, because they allow the user to
assess the results more easily and flexibly. If the user is satisfied with
the quality of the patterns, the process is terminated. However, in most
cases the results might not be satisfying after only one iteration. In those
cases, the user might return to any of the previous steps to achieve more
useful results.
2.4 Data Mining
Since data mining is the most important step within the KDD
process, we will treat it more carefully in this section. In [27, 30] data
mining is defined as follows:
Data mining is a step in the KDD process consisting of applying
data analysis and discovery algorithms that, under acceptable
computational efficiency limitations, produce a particular enumeration
of patterns over the data.
According to this definition, data mining is the step that is responsible for
the actual knowledge discovery. To emphasize that data mining algorithms
need to process large amounts of data, the definition requires the desired
patterns to be found under acceptable computational efficiency
limitations. Let us note that there are many other definitions of data
mining and that the terms data mining and KDD are often used
synonymously.
Data mining has many tasks such as:
1- Association Rules (AR): Given a database of transactions, where each
transaction consists of a set of items, association discovery finds all the
item sets that frequently occur together, and also the rules among them.
We are going to have a closer look at this task in section 2.5.
2- Sequential Patterns: Sequence discovery aims at extracting sets of
events that commonly occur over a period of time.
3- Classification and Regression: Classification aims to assign a new data
item to one of several predefined categorical classes. The goal of
classification and regression is to build a model that minimizes the error
between the predicted and true values of the target variable [15,18].
It is known as supervised induction [14]. Supervised induction is the
machine learning task of inferring a function from supervised training
data [30].
4- Clustering: Clustering is the process of grouping the data records into
meaningful subclasses (clusters) in a way that maximizes the similarity
within clusters and minimizes the similarity between two different
clusters [10]. Clustering is also called unsupervised induction [3].
5- Similarity search: Similarity search is performed on a database of
objects to find the object(s) that are within a user-defined distance from
the queried object, or to find all pairs within some distance of each other.
Figure (2-2) Classification separates the data space (left) and clustering
groups data objects (right)
2.5 Association Rule
Association rules are one of the promising aspects of data mining
as a knowledge discovery tool, and have been widely explored to
date [27,14]. They capture all possible rules that explain the
presence of some attributes according to the presence of other attributes.
An association rule, X ⇒ Y, is a statement of the form "for a specified
fraction of transactions, a particular value of an attribute set X
determines the value of attribute set Y as another particular value under a
certain confidence". Thus, association rules aim at discovering the
patterns of co-occurrences of attributes in a database. For instance, an
association rule in supermarket basket data may be "In 10% of
transactions, 85% of the people buying milk also buy milky-sweets
in that transaction". Association rules may be useful in many
applications such as supermarket transactions analysis, store layout and
promotions on the items, telecommunications alarm correlation,
university course enrollment analysis, customer behavior analysis in
retailing, catalog design, word occurrence in text documents, stock
transactions, etc. [29,21,16].
Let I = {I1,..., Im} be a set of literals, called items. Let D be a set of
transactions, where each transaction T is a set of items such that T ⊆ I,
and each transaction is associated with a unique identifier called TID.
Definition 2.1 An itemset X is a set of items in I. An itemset X is called a
k-itemset if it contains k items from I.
Definition 2.2 A transaction T satisfies an itemset X if X ⊆ T. The
support of an itemset X in D, supportD(X), is the number of transactions
in D that satisfy X.
Definition 2.3 An itemset X is called a large itemset if the support of X
in D exceeds a minimum support threshold explicitly declared by the
user, and a small itemset otherwise.
Definition 2.4 The negative border of a set S ⊂ P(R), closed with
respect to the set inclusion relation, is the set of minimal itemsets X ⊂ R
not in S. The negative border of the set of large itemsets is the set of
itemsets that are generated as candidates but fail to qualify into the set
of large itemsets.
Definition 2.5 An association rule is an implication of the form X ⇒ Y,
where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. X is called the antecedent of the
rule, and Y is called the consequent of the rule. The rule X ⇒ Y holds in
D with confidence c, where c = supportD(X ∪ Y) / supportD(X). The rule
X ⇒ Y has support s in D if the fraction s of the transactions in D
contains X ∪ Y.
Example: Suppose I = {A, B, C, D, E} is the set of abbreviations of movie
titles in a movie CD shop; these abbreviations are shown in Table (2.1).
Table (2.2) represents a database of the shop's sales, where each
transaction is identified by a transaction identifier, TID. Table (2.3)
shows the frequent itemsets according to minsup = 33%, while Table (2.4)
depicts the ARs according to minconf = 100%.
Table (2.1) The items abbreviations of Database
Abbreviation   Item
A              Golden mountain
B              Gone with the Wind
C              Zoorba
D              Rain Man
E              Sound of Music
Table (2.2) The transaction database of the shop
TID (Person)   Items (Attributes)
1              B, C, E
2              B, C, D, E
3              A, B, C, D, E
4              B, C, D
5              A, B, F
6              A, B, C, E
Table (2.3) Large itemsets with minsup = 33% (2 transactions)
Support    Itemsets                                                  No.
6 = 100%   B                                                         1
5 = 83%    C, BC                                                     2
4 = 67%    E, BE, CE, BCE                                            4
3 = 50%    A, D, AB, BD, CD, BCD                                     6
2 = 33%    AC, AE, DE, ABC, ABE, ACE, BDE, CDE, ABCE, BCDE           10
Table (2.4) Association rules with minconf = 100%
A → B (3/3)    AC → B (2/2)    AC → BE (2/2)
C → B (5/5)    AE → B (2/2)    AE → BC (2/2)
D → B (3/3)    AC → E (2/2)    DE → BC (2/2)
E → B (4/4)    AE → C (2/2)    ABC → E (2/2)
D → C (3/3)    DE → B (2/2)    ABE → C (2/2)
E → C (4/4)    DE → C (2/2)    ACE → B (2/2)
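The supports and confidences in this example can be recomputed with a short brute-force Python sketch over the six transactions of Table (2.2). This is only an illustrative check, not the algorithm used in the project; the variable names and the exhaustive enumeration are assumptions, and such an enumeration may report 100%-confidence rules beyond those listed in Table (2.4) (for example E → BC, since all four transactions containing E also contain B and C).

```python
from itertools import combinations

# Transactions of Table (2.2), keyed by TID.
transactions = {
    1: {"B", "C", "E"},
    2: {"B", "C", "D", "E"},
    3: {"A", "B", "C", "D", "E"},
    4: {"B", "C", "D"},
    5: {"A", "B", "F"},
    6: {"A", "B", "C", "E"},
}
MIN_SUPPORT_COUNT = 2   # 2 of 6 transactions, i.e. 33%
MIN_CONFIDENCE = 1.0    # minconf = 100%


def support_count(itemset: frozenset) -> int:
    """Number of transactions that contain the whole itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


all_items = sorted(set().union(*transactions.values()))

# Step 1: enumerate every itemset and keep the large (frequent) ones.
frequent = {}
for size in range(1, len(all_items) + 1):
    for combo in combinations(all_items, size):
        candidate = frozenset(combo)
        count = support_count(candidate)
        if count >= MIN_SUPPORT_COUNT:
            frequent[candidate] = count

# Step 2: split each large itemset into antecedent => consequent.
rules = []
for itemset, count in frequent.items():
    for size in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), size):
            antecedent = frozenset(combo)
            # Every subset of a large itemset is itself large.
            conf = count / frequent[antecedent]
            if conf >= MIN_CONFIDENCE:
                rules.append((antecedent, itemset - antecedent, conf))
```

The exhaustive loop over all subsets is only workable because the example has six distinct items.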
The mining of association rules is decomposed into two subproblems:
1- Discovering all frequent (large) patterns, represented by the large
itemsets defined above; and
2- Generating the association rules from those frequent itemsets.
The first subproblem is very tedious, I/O intensive, and
computationally expensive for very large databases, and this is the case
for many real-life applications. In large retailing data, the number of
transactions is generally in the order of millions, and the number of items
(attributes) is generally in the order of thousands. When the data
contains N items, the number of possible large itemsets is 2^N. There
are many algorithms to mine frequent itemsets, such as Apriori,
AprioriTID, and AprioriHybrid [12]. The second problem is
straightforward and can be done efficiently in a reasonable time, and
there is a well-known algorithm to accomplish the
extraction of ARs. The databases of frequent itemsets and ARs are
assumed to be available in this thesis; therefore there is no focus on any
frequent itemset or AR mining algorithm.
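Although mining algorithms are outside the scope of this study, a minimal Python sketch of the level-wise Apriori idea shows how the 2^N search space of the first subproblem is pruned: a k-itemset is counted only if all of its (k-1)-subsets are already large. The function name, the `min_count` parameter, and the toy database are assumptions for illustration, and the sketch omits the optimizations of published implementations.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all large itemsets (level-wise)."""
    db = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 2
    while level:
        previous = set(level)
        # Join step: unions of two large (k-1)-itemsets that have k items.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be large.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Count step: one pass over the database per level.
        level = {}
        for c in candidates:
            n = sum(1 for t in db if c <= t)
            if n >= min_count:
                level[c] = n
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database.
toy = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
large = apriori(toy, min_count=2)
```

In the toy database the candidate {A, B, C} is never counted, because its subset {B, C} occurs only once and is therefore not large.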
Chapter Three
Visualization Techniques of
Association Rules
3.1 Introduction
This chapter presents the concept of visualization, its benefits, and the visualization techniques used to visualize association rules (AR) in the KDD process.
3.2 Visualization
Visualization is the process of transforming data, information, and knowledge into visual form, making use of humans' natural visual capabilities [9]. A typical application of visualization is the field of computer graphics. The invention of computer graphics may be the most important development in visualization since the invention of central perspective in the Renaissance period. The development of animation also helped advance visualization. In spite of the importance of visualization, there are many limitations and difficulties that must be taken into consideration [28, 4]. The main limitations are:
• Visualization techniques are always difficult to evaluate.
• The implementation may require the use of an operating system from one specific vendor.
• The visualization techniques offered are very limited.
• A limitation of many 3D visualizations is the possible waste of screen space towards the corners of the screen.
• The traditional menu bar approach would require long mouse movements from the visualization to the menu bar and vice versa.
• Object interaction complexity occurs within a 3D environment; for example, the user can transform the parallel bar chart into a matrix format and vice versa.
3.3 Benefits of Visualization
Visual data exploration can be seen as a hypothesis generation process: the visualizations of the data allow the user to gain insight into the data and come up with new hypotheses. The verification of the hypotheses can also be done via data visualization, but may also be accomplished by automatic techniques from statistics, pattern recognition, or machine learning. In addition to the direct involvement of the user, the main advantages of visual data exploration over automatic data analysis techniques are:
• Visual data exploration can easily deal with highly non-homogeneous and noisy data.
• Visual data exploration is intuitive and requires no understanding of complex mathematical or statistical algorithms or parameters.
• Visualization can provide a qualitative overview of the data, allowing data phenomena to be isolated for further quantitative analysis.
As a result, visual data exploration usually allows a faster data exploration and often provides more interesting results, especially in cases where automatic algorithms fail. In addition, visual data exploration techniques provide a much higher degree of confidence in the findings of the exploration. These facts lead to a high demand for visual exploration techniques and make them indispensable in conjunction with automatic exploration techniques [6].
3.4 Visualization of Association Rule
Visualizing association rules aims at solving some major problems that come with association rules. First of all, the rules found by automatic procedures must be filtered. Depending on the minimum confidence and support specified, a vast number of rules may be generated.
There are at least five parameters involved in a visualization of association rules [19]:
· Sets of antecedent items.
· Sets of consequent items.
· Associations between antecedent and consequent.
· Rules' support.
· Rules' confidence.
The goal of association rule generation is to find interesting patterns and trends in transaction databases. Association rules are statistical relations between two or more items in the data set. In a supermarket basket application, associations express the relations between items that are bought together. It is, for example, interesting if we find out that in 70% of the cases when people buy bread, they also buy milk. Association rules tell us that the presence of some items in a transaction implies the presence of other items in the same transaction with a certain probability, called confidence. A second important parameter is the support of an association rule, which is defined as the percentage of transactions in which the items co-occur.
Let I = {i1, ..., in} be a set of items and let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X, Y ≠ ∅. The confidence c is defined as the percentage of the transactions containing X that also contain Y. The support is the percentage of transactions that contain both X and Y. For a given support and confidence level, there are efficient algorithms to determine all association rules. A problem, however, is that the resulting set of association rules is usually very large, especially for low support and confidence levels [8,9]. Using higher support and confidence levels may not be effective, since useful rules may then be overlooked. Pattern visualization techniques have been used to overcome this problem and to allow an interactive selection of good support and confidence levels. Figure (3.1) shows SGI MineSet's Rule Visualizer [14], which maps the left and right hand sides of the rules to the x- and y-axes of the plot, respectively, and shows the confidence as the height of the bars and the support as the height of the discs. The color of the bars shows the interestingness of the rule.
Figure (3.1) MineSet's Association Rule Visualizer
Using the visualization, the user is able to see groups of related rules and the impact of different confidence and support levels. The goal of association rule visualization is to visualize a large number of association rules and their metadata in a two-dimensional (2D) or three-dimensional (3D) display with minimum human interaction, minimum occlusion, and no screen swapping. Many approaches have been developed to visualize association rules:
1- Rule Table
2- Two-Dimensional Matrix
3- Directed Graph
4- Rule-Item Approach
5- Mosaic Plot
6- Double Decker Plot
7- Parallel Coordinates
8- Many-to-Many AR Visualization Technique
3.4.1 Rule Table
The most straightforward method for association rule visualization is to use the rule table. The following rule table format has been used [26]:
Item1   Item2   Item3   Item4   Item5   Item N   Rule N   Antecedent N   Confidence   Support
Here Item 1, Item 2, …, and Item 5 hold the items of the rule, with the antecedent items listed first. Rule N is the number of items in the rule, and Antecedent N is the number of items in the rule antecedent, so that Rule N − Antecedent N is the number of items in the consequent.
Table (3.1) Example of Association Rules in Rule Table Format
Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item N | Rule N | Antecedent N | Confidence | Support
Bread | Milk | Null | Null | Null | Null | 2 | 1 | 90% | 10%
Eggs | Bread | Milk | Null | Null | Null | 3 | 1 | 85% | 7%
Milk | Bread | Eggs | Olive | Null | Null | 4 | 2 | 60% | 3%
In Table (3.1), for rule #3 (the third row), Rule N = 4 means that the rule consists of 4 items, and Antecedent N = 2 means that there are 2 items in the rule antecedent. The rule is therefore Milk, Bread → Eggs, Olive, with confidence 60% and support 3%.
The rule table is the most straightforward way to show association rules to users. However, it is only suitable for displaying a limited number of rules. If the user needs a global view of all the rules, the rule table is not a suitable approach.
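The rule table encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and the table is assumed to have six item columns padded with Null, as the rows of Table (3.1) suggest.

```python
from typing import List, Sequence, Tuple

ITEM_COLUMNS = 6  # assumption: six item columns, as in the rows of Table (3.1)


def to_rule_table_row(antecedent: Sequence[str], consequent: Sequence[str],
                      confidence_pct: int, support_pct: int) -> List[object]:
    """Encode a rule as: items (antecedent first), Rule N, Antecedent N, stats."""
    items = list(antecedent) + list(consequent)
    if len(items) > ITEM_COLUMNS:
        raise ValueError("rule has more items than the table has item columns")
    padded = items + ["Null"] * (ITEM_COLUMNS - len(items))
    return padded + [len(items), len(antecedent),
                     f"{confidence_pct}%", f"{support_pct}%"]


def from_rule_table_row(row: Sequence[object]) -> Tuple[List[object], List[object]]:
    """Recover (antecedent, consequent) from Rule N and Antecedent N."""
    rule_n = row[ITEM_COLUMNS]
    antecedent_n = row[ITEM_COLUMNS + 1]
    items = list(row[:rule_n])
    return items[:antecedent_n], items[antecedent_n:]


# Third row of Table (3.1): Milk, Bread -> Eggs, Olive (confidence 60%, support 3%).
row3 = to_rule_table_row(["Milk", "Bread"], ["Eggs", "Olive"], 60, 3)
antecedent3, consequent3 = from_rule_table_row(row3)
```

Decoding a row only needs the two counters, which is why the format stores Rule N and Antecedent N rather than an explicit separator between the antecedent and the consequent.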
3.4.2 Two-Dimensional Matrix
The design of a two-dimensional (2D) association matrix positions the antecedent and consequent items on separate axes of a square matrix. Customized icons are drawn on the matrix tiles that connect the antecedent and the consequent items of the corresponding association rules. Different icons can be used to depict different metadata such as the support and confidence values of the rules. Figure (3.2) depicts an association rule (B→C). Both the height and the color of the column icon can be used to present metadata values. The values of support and confidence are mapped to 3D columns that are built separately on and beneath the matrix tiles. Other icons such as disks and bars are also used to visualize metadata in the rule visualizer of MineSet [4,22,28]. A 2D matrix is arguably the most effective technique to show one-to-one binary relationships.
• The strengths of a 2D matrix, however, break down when we need to visualize many-to-one relationships such as association rules with multiple antecedent items. For example, in Figure (3.3) it is almost impossible to tell whether there is only one association rule (A+B→C) or two (A→C and B→C).
• The lack of a practical way to identify the togetherness of individual antecedent items makes a 2D matrix a weaker candidate for visualizing rules with multiple antecedent items. MineSet [23] addresses the problem by grouping all the antecedent items of an association rule as one unit and plotting it against its consequent, i.e., an antecedent-to-consequent plot. For example, a dedicated item group (A+B) is created in Figure (3.4) to describe the association rule (A+B→C).
Figure (3.2) The colored column indicates the association rule (B→C). Different icon colors are used to show different metadata values of the association rule.
• The strategy works fine for smaller antecedent sets (e.g., fewer than 3 items). In our text mining studies, we encounter association rules with as many as 12 items in the antecedent.
• The replication of items in the antecedent groups creates a much larger antecedent-to-consequent plot when compared with the corresponding item-to-item plot. The loss of item identity within an antecedent group also defeats the purpose of visualizing the associations with a matrix. For example, the row (or column) of the matrix connected to an item can no longer be used to search for all the rules involving that item.
Figure (3.3) It is very difficult to determine the difference between (A+B→C) and (A→C and B→C).
Figure (3.4) The identities of A and B are lost in the new item group that was created to depict the association rule (A+B→C).
• Another problem in a 2D matrix display is object occlusion, especially when multiple icons are used to depict different metadata values on the matrix tiles. The occlusion problem is obvious in Figure (3.5).
Figure (3.5) Object occlusions are unavoidable.
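The ambiguity of the item-to-item matrix, and the way antecedent grouping trades it for a loss of item identity, can be shown with a short Python sketch. The function names are hypothetical, and the representation of a rule as a pair of frozensets is an assumption made for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (antecedent, consequent)


def item_to_item_cells(rules: Iterable[Rule]) -> Set[Tuple[str, str]]:
    """Matrix tiles (antecedent item, consequent item) that receive an icon."""
    return {(a, c)
            for antecedent, consequent in rules
            for a in antecedent
            for c in consequent}


def grouped_cells(rules: Iterable[Rule]) -> Set[Tuple[str, str]]:
    """Grouped variant: the whole antecedent becomes a single row label."""
    return {("+".join(sorted(antecedent)), c)
            for antecedent, consequent in rules
            for c in consequent}


one_rule = [(frozenset("AB"), frozenset("C"))]            # A+B -> C
two_rules = [(frozenset("A"), frozenset("C")),            # A -> C
             (frozenset("B"), frozenset("C"))]            # B -> C

# Both rule sets light up exactly the same item-to-item tiles.
same_tiles = item_to_item_cells(one_rule) == item_to_item_cells(two_rules)

# Grouping separates them, but "A+B" is a new label: the row for item A
# no longer lists every rule that involves A.
grouped_one = grouped_cells(one_rule)
grouped_two = grouped_cells(two_rules)
```

The sketch reproduces both problems described for this technique: the plain matrix cannot distinguish the two cases, and the grouped matrix distinguishes them only by introducing a new row per distinct antecedent set.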
Figure (3.6) Left: A →C and B →C. Right: A+B→C.
3.4.3 Directed Graph
A directed graph is another prevailing technique to depict item associations. The nodes of a directed graph represent the items, and the edges represent the associations. Figure (3.6) shows three association rules (A→C, B→C, A+B→C).
• This technique works well when only a few items (nodes) and associations (edges) are involved. An association graph can quickly turn into a tangled display with as few as a dozen rules. Hetzler et al. [19] address the problem by animating the edges to show the association of certain items with 3D rainbow arcs. The animation technique requires significant human interaction to turn the item nodes on and off. It is also not easy to show multiple metadata values, including support and confidence, alongside the association rules.
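A minimal sketch of the graph representation, assuming that every rule contributes one edge from each antecedent item to each consequent item and that edges are tagged with a rule number so that rules sharing items can be told apart. The function name and the tagging scheme are assumptions for illustration, not a description of any particular tool.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def build_rule_graph(
    rules: Sequence[Tuple[Set[str], Set[str]]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Adjacency list: item -> list of (target item, rule number) edges."""
    graph: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for rule_id, (antecedent, consequent) in enumerate(rules, start=1):
        for a in sorted(antecedent):
            for c in sorted(consequent):
                graph[a].append((c, rule_id))
    return dict(graph)


# The three rules named in the text: A -> C, B -> C, and A+B -> C.
rules = [({"A"}, {"C"}), ({"B"}, {"C"}), ({"A", "B"}, {"C"})]
graph = build_rule_graph(rules)
edge_count = sum(len(edges) for edges in graph.values())
```

Three rules over three items already need four edges, two of them parallel to existing ones, which hints at how quickly the display becomes tangled as the number of rules grows.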
3.4.4 Rule-to-Item Visualization Technique
To visualize many-to-one association rules, instead of using the tiles of a 2D matrix to show item-to-item associations, a matrix of the rule-to-item relationship is used [19]. In Figure (3.7) the rows of the matrix floor represent the items (or topics in the context of text mining), and the columns represent the item associations. The blue and red blocks of each column (rule) represent the antecedent and the consequent of the rule. The identities of the items are shown along the right side of the matrix. The confidence and support levels of the rules are given by the corresponding bar charts in different scales at the far end of the matrix. The rule-to-item visualization approach has many advantages over its matrix-based predecessors:
• There is virtually no upper limit on the number of items in an antecedent. We can analyze the distribution of the association rules (horizontal axis) as well as of the items within them (vertical axis) simultaneously.
• Unlike Figure (3.4), the identity of individual items within an antecedent group is clearly shown.
• No new antecedent groups are created because of multiple antecedent items in association rules.
• Because all the metadata are plotted at the far end and the height of the columns is scaled so that the front columns do not block the rear ones, few occlusions occur.
• No screen swapping, animation, or human interaction (other than basic mouse zooming) is required to analyze the rules.
Although this technique is better than its predecessors, it suffers from serious drawbacks:
• It is unable to visualize many-to-many association rules.
• It suffers from antecedent-consequent interleaving, i.e., the items of the antecedent and the consequent are interleaved along the item axis, although they are given different colors.
• The natural sequence of the rule's parts deteriorates.
Figure (3.7) A visualization of item associations with support 0.4% and confidence 50%.
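The rule-to-item layout can be sketched as a small character matrix in Python: one row per item, one column per rule, with "A" for an antecedent block, "C" for a consequent block, and "." for an empty tile. The function name, the markers, and the two example rules are assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence, Set, Tuple


def rule_to_item_matrix(
    rules: Sequence[Tuple[Set[str], Set[str]]],
    items: Optional[List[str]] = None,
) -> Tuple[List[str], List[List[str]]]:
    """Return (row labels, matrix) with one row per item and one column per rule."""
    if items is None:
        items = sorted({i for ant, con in rules for i in ant | con})
    matrix = []
    for item in items:
        row = []
        for ant, con in rules:
            if item in ant:
                row.append("A")   # antecedent block
            elif item in con:
                row.append("C")   # consequent block
            else:
                row.append(".")   # item not in this rule
        matrix.append(row)
    return items, matrix


# Hypothetical rules: a,b -> c and a,d -> b.
rules = [({"a", "b"}, {"c"}), ({"a", "d"}, {"b"})]
items, matrix = rule_to_item_matrix(rules)
second_rule_column = [row[1] for row in matrix]
```

Reading the second column from top to bottom gives A, C, ., A: the consequent item b sits between the antecedent items a and d. This is the interleaving drawback listed above, which arises because all rules share one item axis.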
3.4.5 Parallel Coordinates
In parallel coordinates [1,2,13], the basic elements of association rules are sets of items, which can be handled by listing all items along a vertical coordinate. The resulting coordinate is then repeated evenly in the horizontal direction until there are enough coordinates to host the longest association rule. An association rule can be visualized as a polygonal line connecting all items in the rule. Parameters such as support and confidence can be mapped to graphical features such as line width and color. Figure (3.8) illustrates an association rule ab → cd as one polygonal line for its LHS, followed by an arrow connecting to another polygonal line for its RHS. This visualization handles the upward closure property of association rules nicely: subsets of the RHS are absorbed and are not displayed. For example, ab → cd implies that abc → d, abd → c, ab → c, and ab → d are valid association rules. The implied association rules are not displayed. Two or more itemsets or rules may also have parts in common, for example, adbe and cdb in Figure (3.8).
Figure (3.8) association rule ab → cd in Parallel Coordinates Visualization technique
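The following Python sketch illustrates two ideas from this section under simple assumptions: how a rule maps to polyline vertices on repeated vertical axes, and which implied rules are absorbed by ab → cd. The function names and the vertex convention (axis index, item position) are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set, Tuple

Point = Tuple[int, int]  # (index of the vertical axis, position of the item on it)


def rule_polyline(antecedent: Sequence[str], consequent: Sequence[str],
                  item_order: Sequence[str]) -> Tuple[List[Point], List[Point]]:
    """Vertices of the LHS and RHS polylines; an arrow joins lhs[-1] to rhs[0]."""
    y = {item: idx for idx, item in enumerate(item_order)}
    lhs = [(k, y[item]) for k, item in enumerate(antecedent)]
    rhs = [(len(antecedent) + k, y[item]) for k, item in enumerate(consequent)]
    return lhs, rhs


def implied_rules(antecedent: Sequence[str], consequent: Sequence[str]
                  ) -> Set[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Rules obtained by shrinking the RHS or moving part of it to the LHS."""
    ant, con = frozenset(antecedent), frozenset(consequent)
    implied = set()
    for size in range(1, len(con)):
        for subset in combinations(sorted(con), size):
            part = frozenset(subset)
            implied.add((ant, part))               # e.g. ab -> c
            implied.add((ant | part, con - part))  # e.g. abc -> d
    return implied


lhs, rhs = rule_polyline(["a", "b"], ["c", "d"], ["a", "b", "c", "d", "e"])
absorbed = implied_rules("ab", "cd")
```

For ab → cd the sketch yields four vertices on four axes and exactly the four absorbed rules named in the text (abc → d, abd → c, ab → c, ab → d), none of which needs its own polyline.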
3.4.6 Mosaic Plot
The basic idea is to partition a rectangle along the y-axis according to one attribute and to make the regions proportional to the sum of the corresponding data values, so that the height of the bars, instead of the width, shows the parameter value. Then each resulting area is split in the same way according to a second attribute [13]. The coloring reflects the percentage of data items that fulfill a third attribute. The visualization shows the support and confidence values of all rules of the form X1, X2 → Y (Figure (3.9)). Mosaic plots are restricted to two attributes on the left side of the association rule [6].
Figure (3.9) X1,X2 → Y in Mosaic Plot
Figure (3.10) X1,X2 → Y in Double Decker Plot
3.4.7 Double Decker Plot
Double decker plots can be used to show more than two attributes on the left side. The idea is to show a hierarchy of attributes on the bottom (heineken, coke, and chicken in the example shown in Figure (3.10)), corresponding to the left-hand side of the association rules. The bars on the top correspond to the number of items in the corresponding subset of the database and therefore visualize the support of the rule. The colored areas in the bars correspond to the percentage of data transactions in that subset that contain an additional item and therefore correspond to the confidence [6,11].
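The quantities that a double decker plot draws can be computed with a short Python sketch: one bar per combination of present/absent left-hand-side items, a bar width equal to the share of transactions in that subset, and a highlighted share equal to the fraction of that subset that also contains the additional item. The function name, the toy transactions, and the target item "sardines" are assumptions used only for illustration.

```python
from itertools import product
from typing import FrozenSet, List, Sequence, Tuple

Bin = Tuple[Tuple[bool, ...], float, float]  # (presence flags, width, highlighted share)


def double_decker_bins(transactions: Sequence[FrozenSet[str]],
                       lhs_items: Sequence[str], target: str) -> List[Bin]:
    """One bin per presence/absence combination of the left-hand-side items."""
    bins: List[Bin] = []
    total = len(transactions)
    for flags in product([True, False], repeat=len(lhs_items)):
        members = [t for t in transactions
                   if all((item in t) == flag
                          for item, flag in zip(lhs_items, flags))]
        with_target = sum(1 for t in members if target in t)
        width = len(members) / total if total else 0.0
        highlight = with_target / len(members) if members else 0.0
        bins.append((flags, width, highlight))
    return bins


# Hypothetical transactions (assumption, not data from the project).
transactions = [
    frozenset({"heineken", "coke", "chicken", "sardines"}),
    frozenset({"heineken", "coke", "chicken"}),
    frozenset({"heineken", "sardines"}),
    frozenset({"coke"}),
]
bins = double_decker_bins(transactions, ["heineken", "coke", "chicken"], "sardines")
all_present = bins[0]  # the bin in which all three left-hand-side items occur
```

In this toy data the bin with all three items present covers half of the transactions, and half of those also contain the target item. The widths of all eight bins add up to one, so the bars together tile the full width of the plot.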
3.4.8 Many-to-Many AR Visualization Technique
As previously mentioned, three of the approaches developed to visualize association rules are the two-dimensional matrix, the directed graph, and the rule-to-item approach. It was also shown that the rule-to-item approach is the best of these in spite of its drawbacks, namely its inability to represent many-to-many AR and the interleaving of consequent and antecedent items in the visualization area. This section presents a new technique that removes these drawbacks: it avoids item interleaving and efficiently represents many-to-many AR. The technique is called the many-to-many AR visualization technique, MARVT. In this technique the visualization area is divided into three regions: the antecedent region, the statistical region, and the consequent region. The technique can be implemented in two or three dimensions.
If the 2D implementation is chosen, the x-axis of the visualization area holds the rule identifiers, while the y-axis of the antecedent region holds the items of the antecedents of the rules to be visualized. The y-axis of the statistical region is divided according to the confidence and support levels of the rules, while the y-axis of the consequent region holds the items of the consequents of the selected rules. Figure (3.11) depicts the general structure of the visualization area of the proposed technique. If an item i belongs to the antecedent of a rule R, a red ellipse is drawn at position (R, i) of the antecedent region, and if an item j is part of the consequent of the rule R, a black ellipse is drawn at position (R, j) of the consequent region. The statistical region contains important statistical values for each rule, such as the confidence, the support, the support of the antecedent itemset, and the support of the consequent itemset, in the part of the region that belongs to that rule. The y-axis of the statistical region begins at the minsup and minconf thresholds and ends at 100%. The technique is flexible enough to visualize more statistical information, such as the support of each item. It is also possible to display the order of the rule.
If the technique is implemented in 3D, the same regions are used. The x-axis is determined by the rule identifiers. The y-axis is determined by the items of the antecedent and the consequent in their respective regions. The z-axis is determined by the support and confidence, beginning at the minconf or minsup threshold. The third dimension is used to show the support of the items, the confidence and support of a rule, and the support of the antecedent and consequent itemsets. With this technique it is possible to visualize many-to-many, one-to-many, and many-to-one rules, because it provides two separate regions for the antecedent and the consequent, each of which can hold an unlimited number of items. This separation also prevents item interleaving, because the items of the consequent and the antecedent are presented in different regions.
Figure (3.11) General Structure of the Visualization Area of the Proposed Many-to-Many Association Rules Visualization Technique, MARVT.
To illustrate this technique, consider the following rules:
1- a,b → c,q1, with confidence 63 and support 2.
2- a,b,c → q1,m, with confidence 100 and support 3.
3- b,c → c,m,q1, with confidence 50 and support 1.
Figure (3.12) shows a hypothetical visualization of these rules. As shown, the antecedent items of R1 are a and b; therefore, the positions (R1, a) and (R1, b) of the antecedent region are marked with red ellipses, and so on for the rest of the rules. Also, (R1, c) and (R1, q1) of the consequent region are marked with black ellipses because c and q1 are the consequent items of R1. The same process is done for R2 and R3. The statistical region visualizes the support of the antecedent and consequent itemsets and, furthermore, the support and confidence of the rules. It is also possible to add the support of each item beside its ellipse. For example, the number 3 beside the ellipse of item a in R1 represents the support of item a, and so on for each item. Figure (3.13) depicts the general 3D structure of MARVT. This structure preserves the same regions: the consequent, antecedent, and statistical regions.
Figure (3.12) Visualization Area of the Many-to-Many Association Rules Visualization Technique
Figure (3.13) 3D General Structure of MARVT
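The 2D layout of MARVT can be sketched as plain data in Python: a list of ellipse positions for the antecedent region, a list for the consequent region, and a dictionary of values for the statistical region. The class and function names are hypothetical, and the sketch only computes the positions of the marks; it does not draw them.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: str
    antecedent: Tuple[str, ...]
    consequent: Tuple[str, ...]
    confidence: float
    support: float


def marvt_layout(rules: Sequence[Rule]):
    """Return the marks of the three regions as (rule id, item) positions."""
    # Antecedent region: one red ellipse at (R, i) per antecedent item i.
    antecedent_marks: List[Tuple[str, str]] = [
        (r.rule_id, i) for r in rules for i in r.antecedent]
    # Consequent region: one black ellipse at (R, j) per consequent item j.
    consequent_marks: List[Tuple[str, str]] = [
        (r.rule_id, j) for r in rules for j in r.consequent]
    # Statistical region: values plotted in the part belonging to each rule.
    statistics: Dict[str, Dict[str, float]] = {
        r.rule_id: {"confidence": r.confidence, "support": r.support}
        for r in rules}
    return antecedent_marks, consequent_marks, statistics


# The three example rules given in the text.
rules = [
    Rule("R1", ("a", "b"), ("c", "q1"), 63, 2),
    Rule("R2", ("a", "b", "c"), ("q1", "m"), 100, 3),
    Rule("R3", ("b", "c"), ("c", "m", "q1"), 50, 1),
]
antecedent_marks, consequent_marks, statistics = marvt_layout(rules)
```

Because the two item regions are kept separate, an item such as c in R3 can be marked in both regions without the marks interleaving, and each region can grow to any number of items.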
Chapter Four
Summary and Future Work
4.1 Introduction
Chapter three presented the most important techniques for visualizing association rules. This chapter summarizes these techniques by reviewing their most important advantages and disadvantages.
4.2 Summary
This section reviews the most important characteristics of the techniques presented in chapter three.
4.2.1 Rule Table
1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Able to sort the results by the column of interest.
3- Visualizes full details of the rule (antecedent, consequent, support, confidence).
4- Displays only a limited number of rules.
5- Its main limitation is its close resemblance to the original raw textual form, so that the user can inspect only a few rules without having a global view of all the information.
6- Not interactive.
4.2.2 Two-Dimensional Matrix
1- Effective technique to show one-to-one binary relationships.
2- Breaks down when many-to-one or many-to-many relationships must be visualized.
3- Visualizes full details of the rule (antecedent, consequent, support, confidence).
4- Suffers from object occlusion, especially when multiple icons are used to depict different metadata values on the matrix tiles.
5- Displays only a limited number of rules.
6- Not interactive.
4.2.3 Directed Graph
1- Visualizes one-to-one and many-to-one relationships.
2- Displays only a limited number of rules.
3- Lacks a clear representation of the support and confidence.
4- Edges belonging to different rules overlap with each other.
5- Not interactive.
4.2.4 Rule-to-Item Visualization Technique
1- Visualizes many-to-one relationships.
2- Breaks down when many-to-many relationships must be visualized.
3- No upper limit on the number of items in an antecedent.
4- Clearly shows the individual items within an antecedent group.
5- No new antecedent groups are created because of multiple antecedent items in association rules.
6- No object occlusion.
7- Deterioration of the naturalness of the rule's parts sequence.
8- Interleaving of the items of the antecedent and consequent, although they are given different colors.
9- Interactive.
4.2.5 Parallel Coordinates
1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Visualizes full details of the rule (antecedent, consequent, support, confidence).
3- Visual rules overlap with each other.
4- Object occlusion.
5- Lacks a clear representation of the support and confidence (Figure (4.1)).
Figure (4.1) The rules overlap, and the support and confidence lack a clear representation.
4.2.6 Mosaic Plot
1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Restricted to two attributes on the left side of the association rule.
3- Visualizes one rule at a time.
4- Difficult to understand and implement.
5- Lacks a clear representation of the support and confidence.
4.2.7 Double Decker Plot
1- Visualizes one-to-one, many-to-one, and many-to-many relationships.
2- Shows more than two attributes on the left side.
3- Visualizes one rule at a time.
4- Lacks a clear representation of the support and confidence.
5- Difficult to understand and implement.
4.2.8 Many-to-Many AR Visualization Technique
1- Best technique to visualize many-to-many relationships.
2- Visualizes full details of the rule (antecedent, consequent, support, confidence).
3- No object occlusion.
4- No upper limit on the number of items in an antecedent.
5- Clear representation of the support and confidence.
6- Interactive.
7- Flexible enough to visualize more statistical information.
8- It is possible to display the order of the rule.
4.3 Future work
The exploration of large data sets is an important but difficult problem. Information visualization techniques can be useful in solving this problem. Visual data exploration has a high potential, and many applications such as fraud detection and data mining can use information visualization technology for improved data analysis.
Avenues for future work include the tight integration of visualization techniques with traditional techniques from such disciplines as statistics, machine learning, operations research, and simulation. Integration of visualization techniques and these more established methods would combine fast automatic data mining algorithms with the intuitive power of the human mind, improving the quality and speed of the data mining process. Visual data mining techniques also need to be tightly integrated with the systems used to manage the vast amounts of relational and semi structured information, including database management and data warehouse systems. The ultimate goal is to bring the power of visualization technology to every desktop to allow a better, faster and more intuitive exploration of very large data resources. This will not only be valuable in an economic sense but will also stimulate and delight the user.
References
[1] Alfred Inselberg, “Parallel Coordinates: Visual Multidimensional Geometry and Its Application”, University of San Francisco, 2009.
[2] Alfred Inselberg, “Visualizing high dimensional datasets and multivariate relations” (tutorial). In: Proc. 6th ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining (KDD 2000), Boston, MA, 2000.
[3] Anil K. Jain and Richard C. Dubes, “Algorithms for Clustering Data”, Prentice Hall, 1988.
[4] B. Bustos, D. Keim, C. Panse, T. Schreck, “Pattern Visualization”, 2003.
[5] Cheung D.W., Ng V., Fu A.W. and Fu Y., “Efficient Mining of Association Rules in Distributed Databases”, Special Issue in Data Mining, IEEE Transaction on Knowledge and Data Engineering, IEEE Computer Society, 1996.
[6] Daniel Keim and Matthew Ward, “Visual Data Mining Techniques”, University of Konstanz, Germany and Worcester Polytechnic Institute, USA, 2002.
[7] D. Bruzzese, C. Davino, “Visual Post-Analysis of Association Rules”, Dept. of Mathematics and Statistics, University of Naples Federico, Italy, {dbruzzes, cdavino}@unina.it, 2002.
[8] D. Keim, “Designing Pixel-Oriented Visualization Techniques”, University of Florida, 2000.
[9] Gershon N., Eick S. G., and Card S., “Information Visualization”, ACM Interactions, vol. 5, no. 2, pp. 9-15, March/April 1998.
[10] G. Karypis and V. Kumar, “Scalable Parallel Data Mining for Association Rules”, University Arizona, 2000.
[11] H. Hofmann, A. Siebes, and A. Wilhelm, “Visualizing association rules with interactive mosaic plots”, SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD 2000), Boston, MA, 2000.
[12] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), Dallas, TX, May 2000.
[13] A. Martin, M. O. Ward, “High dimensional brushing for interactive exploration of multivariate data”. In: Proc. IEEE Conf. on Visualization, Atlanta, 1995.
[14] Matthias Schubert, “Advanced Data Mining Techniques for Compound Objects”, Maximilians-University, 2004.
[15] M. Deshpande and G. Karypis, “Evaluation of Techniques for Classifying Biological Sequences”, Taipei, Taiwan, 2002.
[16] Michael Hahsler and Sudheer Chelluboina, “Visualizing Association Rules: Introduction to the R-extension Package arulesViz”, Southern Methodist University, 2004.
[17] M. J. Zaki and C. J. Hsiao, “CHARM: An efficient algorithm for closed itemset mining”. In Proc. 2002 SIAM Int. Conf. Data Mining (SDM’02), pages 457–473, Arlington, VA, April 2002.
[18] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, “Introduction to Data Mining”, University of Minnesota, 2005.
[19] P. C. Wong, P. Whitney, J. Thomas, “Visualizing Association Rules for Text Mining”, Pacific Northwest National Laboratory, 2000.
[20] Rakesh Agrawal, Ramakrishnan Srikant, “Fast Algorithms for Mining Association Rules”, IBM Almaden Research Center, 1994.
[21] Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami, “Mining Association Rules between Sets of Items in Large Databases”, SIGMOD Conference, 1993.
[22] Redpath, B. Srinivasan, “Criteria for Comparative Study of Visualization Techniques in Data Mining”, IEEE 3rd Int. Conf. on Intelligent System, Tulsa, USA, 2003.
[23] S. G. Inc. Mineset. http://www.sgi.com/software/mineset, 2001.
[24] Simeon J. Simoff, Michael H. Böhlen, “Visual Data Mining”, University of Western Sydney, 1998.
[25] Stefanos Manganaris, “Supervised Classification with Temporal Data”, PhD thesis, School of Engineering, Vanderbilt University, 1997.
[26] Thomas S., “Architectures and Optimizations for Integrating Data Mining Algorithms with Database Systems”, Ph.D. dissertation, University of Florida, Gainesville, 1998.
[27] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Editors), “Advances in Knowledge Discovery and Data Mining”, Menlo Park, 1996.
[28] U. M. Fayyad, G. Grinstein, “Information Visualization in Data Mining and Knowledge Discovery”, Morgan Kaufman, San Francisco (CA), 2004.
[29] Vincent Wing-Sing Cho, “Knowledge Discovery from Distributed and Textual Data”, Hong Kong University of Science and Technology, 1999.
[30] http://en.wikipedia.org/wiki/Association_rule_learning.
Republic of Iraq
Ministry of Higher Education and Scientific Research
Iraqi Commission for Computers and Informatics
Informatics Institute for Postgraduate Studies
Study of Association Rules' Visualization Techniques
A project submitted
to
the Informatics Institute for Postgraduate Studies / Iraqi Commission for Computers and Informatics
as part of the requirements for the degree of Higher Diploma in
Web Site Technology
By
Mustafa Sabah Shaheed
Supervised by
Dr. Hussein Al-Khafaji
Rabi' al-Awwal
1432
February
2011