comp53311 knowledge discovery in databases overview prepared by raymond wong presented by raymond...
TRANSCRIPT
COMP5331 1
COMP5331
Knowledge Discovery in Databases
Overview
Prepared by Raymond WongPresented by Raymond Wong
raywong@cse
COMP5331 2
Course Details
Reference books/materials: Papers Data Mining: Concepts and Techniques.
Jiawei Han and Micheline Kamber. Morgan Kaufmann Publishers (3rd edition)
Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar Boston : Pearson Addison Wesley (2006)
COMP5331 3
Area
DB or AI This course can count towards one
of the areas ONLY and cannot be double counted towards the required credits
COMP5331 4
Course Details
Grading Scheme: Assignment 30% Project 30% Final Exam 40%
COMP5331 5
Assignment
If the students can answer the selected questions in class correctly, for each corrected answer,
I will give him/her a coupon This coupon can be used to waive one
question in an assignment which means that s/he can get full marks
for this question without answering this question
COMP5331 6
Assignment Guideline
For each assignment, each student can waive at most one question only.
s/he can waive any question he wants and obtain full marks for this question (no matter whether s/he answer this question or not)
s/he may also answer this question. But, we will also mark it but will give full marks to this question.
When the student submits the assignment, please staple the coupon to the submitted assignment please write down the question no. s/he wants to
waive on the coupon
COMP5331 7
Project
Each project is completed by a group.
The number of students in a group depends on the class size.
The duration of each presentation depends on the class size.
It will be announced soon.
COMP5331 8
Project
Project Type (One of the following) Survey
Implementation-oriented Project
Research-oriented Project
Your group only needs to read about 2~5 papers
Your group only needs to read about 1~2 papers
You can read some papers and conduct research
COMP5331 9
Project
Project Type (One of the following) Survey
Implementation-oriented Project
Research-oriented Project
1. Proposal2. Presentation3. Final report
1. Proposal2. Presentation3. Final report4. Coding
1. Proposal2. Presentation3. Final report (containing your
proposed methodology)4. Coding (if any)
Full Score = 80%
Full Score = 90%
Full Score = 100%
COMP5331 10
Project
Project Topic Some pre-selected topics/papers Your own choice
For fairness, please do not choose the topic which is closely related to your own research
COMP5331 11
Exam
You are allowed to bring a calculator with you.
Please remember to prepare a calculator for the exam
COMP5331 12
Major Topics
1. Association2. Clustering3. Classification4. Data Warehouse5. Data Mining over Data Streams6. Web Databases7. Multi-criteria Decision Making
COMP5331 13
1. Association
Customer
Apple Orange Milk
Raymond Apple Orange
Ada Orange Milk
Grace Apple Orange
… … … …Items/Itemsets Frequency
Apple 2
Orange 3
Milk 1
{Apple, Orange} 2
{Orange, Milk} 1
We are interested in the items/itemsets with frequency >= 2
Frequent Pattern(or Frequent Item)
Frequent Pattern(or Frequent Item)
Frequent Pattern(or Frequent Itemset)
COMP5331 14
1. Association
Customer
Apple Orange Milk
Raymond Apple Orange
Ada Orange Milk
Grace Apple Orange
… … … …Items/Itemsets Frequency
Apple 2
Orange 3
Milk 1
{Apple, Orange} 2
{Orange, Milk} 1
We are interested in the items/itemsets with frequency >= 2
Association Rule:1. Apple Orange( customers who buy apple will probably buy orange.)
2. Orange Apple( customer who buy orange will probably buy apple.)
100%
2
2
67%
3
2
Problem: to find all frequent patterns and association rules
COMP5331 15
Major Topics
1. Association2. Clustering3. Classification4. Data Warehouse5. Data Mining over Data Streams6. Web Databases7. Multi-criteria Decision Making
COMP5331 16
2. Clustering
Computer
History
Raymond
100 40
Louis 90 45
Wyman 20 95
… … …Computer
History
Cluster 1(e.g. High Score in Computer and Low Score in History)
Cluster 2(e.g. High Score in Historyand Low Score in Computer)
Problem: to find all clusters
COMP5331 17
Major Topics
1. Association2. Clustering3. Classification4. Data Warehouse5. Data Mining over Data Streams6. Web Databases7. Multi-criteria Decision Making
COMP5331 18
3. Classification
root
child=yes child=no
Income=high Income=low
100% Yes0% No
100% Yes0% No
0% Yes100% No
Decision tree
Race Income
Child Insurance
white
high no ?
Suppose there is a person.
COMP5331 19
Major Topics
1. Association2. Clustering3. Classification4. Data Warehouse5. Data Mining over Data Streams6. Web Databases7. Multi-criteria Decision Making
COMP5331 20
4. Warehouse
Databases Users
Databases UsersData Warehouse
Need to wait for a long time (e.g., 1 day to 1
week)
Pre-computed results
Query
COMP5331 21
Major Topics
1. Association2. Clustering3. Classification4. Data Warehouse5. Data Mining over Data Streams6. Web Databases7. Multi-criteria Decision Making
COMP5331 22
5. Data Mining over Static Data
1. Association2. Clustering3. Classification
StaticData
Output(Data Mining Results)
COMP5331 23
5. Data Mining over Data Streams
1. Association2. Clustering3. Classification
Output(Data Mining Results)
…
Unbounded Data
Real-time Processing
COMP5331 24
Major Topics
1. Association2. Clustering3. Classification4. Data Warehouse5. Data Mining over Data Streams6. Web Databases7. Multi-criteria Decision Making
COMP5331 25
6. Web Databases
Raymond Wong
COMP5331 26
How to rank the webpages?
COMP5331 27
Major Topics
1. Association2. Clustering3. Classification4. Data Warehouse5. Data Mining over Data Streams6. Web Databases7. Multi-criteria Decision Making
COMP5331 28
7. Multi-criteria Decision Making
Hotel Price Distance to beach (km)
a 1000 4
b 2400 5
c 3000 1
3 hotels
Suppose we want to look for a hotel which is close to a beach.
We have two attributes. Which hotel should we select?
Suppose we compare hotel a and hotel b
We know that hotel a is “better”than hotel bbecause 1. Price of hotel a is smaller2. Distance of hotel a is smaller
COMP5331 29
7. Multi-criteria Decision Making
Hotel Price Distance to beach (km)
a 1000 4
b 2400 5
c 3000 1
3 hotels
Suppose we want to look for a hotel which is close to a beach.
We have two attributes. Which hotel should we select?
Suppose we compare hotel a and hotel c
We cannot determine hotel a is “better”than hotel c (wrt two attributes).We cannot determine hotel c is “better”than hotel a (wrt two attributes)..This is because 1. Price of hotel a is smaller2. Distance of hotel c is smaller