Welcome
Mining? Warehousing?
Data Rich, Information Poor
Heterogeneous Data
The Value of Data
Data Integration & Analysis
From Data To Intelligence
[Figure labels: Decision Models, Data Mining, Preprocessing, Database; Decision Support, Knowledge, Information, Data]
Business Intelligence
Related Areas
Data Mining
Is DM really important?
Q: Your job sounds extremely interesting. What jobs would you recommend to a young person with an interest, and maybe a bachelors degree, in economics?
A: If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.
An interview with Google Chief Economist Hal Varian from the New York Times
It is all about data …
Financial Institutions
Healthcare
Telecommunication
Consulting Companies
Government
Bioinformatics
WWW
Retail
Course Profile
Lecturer: Dr. Bo Yuan
Contact: Phone: 2603 6067; E-mail: [email protected]; Room: F-401A
Time: 2:00 pm – 3:35 pm, Friday
Venue: CI-105
Consultation: 2:00 pm – 3:00 pm, Wednesday (appointment via phone or e-mail preferred)
Aims & Objectives
Course Aims
• To gain a good understanding of popular data mining techniques.
• To gain experience in implementing and using data mining methods.
• To gain an appreciation for the basic principles of data warehousing.
Learning Objectives
• Able to implement and apply data mining techniques to solve problems.
• Understand the main issues and core problems in data mining.
• Understand the relationship between data mining and other fields.
• Appreciate data mining research ideas and practice.
• Get familiar with academic writing and presentation.
Graduate Attributes
• In-depth knowledge of the field of study
• Effective communication
• Independence and teamwork
• Critical judgment
Learning Activities
Week 1: Introduction
Week 2: Principles of Data Warehousing (ETL, OLAP, Metadata)
Week 3: Data Preprocessing
Week 4 – Week 7: Data Mining (Foundations): Bayesian Classifiers, Decision Trees, Neural Networks, Regression, Clustering, Support Vector Machines, Association Rules
Week 8: Field Study
Week 9 – Week 11: Data Mining (Advanced): Semi-supervised Learning, Active Learning, Ensemble Learning, Evolutionary Computation
Week 12 – Week 13: Special Topic A (Text Mining & Web Information Retrieval)
Week 14: Special Topic B (Bioinformatics, CRM, Privacy Issue)
Week 15: Project Presentation
Assessment
Assignment 1
• Type: Class Presentation
• Weight: 10%
• Task Description: Individual 25-minute talks on selected topics
Assignment 2
• Type: Algorithm Experimentation
• Weight: 10%
• Task Description: Coding and testing of selected data mining algorithms
Assignment 3
• Type: Problem Solving
• Weight: 30%
• Task Description: Group project on solving real-world data mining problems
Final Exam
• Type: Closed Book Examination
• Weight: 50%
• Duration: 120 minutes
Presentation matters!
Learning Resources
Learning Resources
International Conference on Data Mining
International Conference on Data Engineering
International Conference on Machine Learning
Pacific-Asia Conference on Knowledge Discovery and Data Mining
ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Rules & Policies
Plagiarism
Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another.
• Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.
• Presenting work done in collaboration with others as independent work.
• Copying ideas, concepts, research results, computer code, statistical tables, designs, images, sounds or text, or any combination of these.
• Paraphrasing, summarizing or simply rearranging another person's words, ideas, etc. without changing the basic structure and/or meaning of the text.
• Copying or adapting another student's original work into a submitted assessment item.
Rules & Policies
Late Submission
Late submissions will incur a penalty of 10% of the total marks for each day that the submission is late (including weekends). Submissions more than 5 days late will not be accepted.
Assumed Background
This course will deal with concepts using algorithms and data structures, mathematics, statistics and probability.
10 Minutes …
Data
Definition
“Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”
Data Types: Continuous, Binary, Discrete, String, Symbolic
Storage: Physical, Logical
Major Issues: Transformation; Errors and corruption
Database
Definition
“A database is an integrated collection of logically related records or files that is stored in a computer system which consolidates records previously stored in separate files into a common pool of data records that provides data for many applications.”
“A database is a collection of information that is organized so that it can easily be accessed, managed, and updated.”
Relational Databases
Relational Model
First Normal Form (1NF)
There's no top-to-bottom ordering to the rows.
There's no left-to-right ordering to the columns.
There are no duplicate rows.
Every cell contains exactly one value from the applicable domain.
Customer
Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659
789 | Maria | Fernandez | 555-808-9633
First Normal Form (1NF)
Customer
Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659; 555-776-4100
789 | Maria | Fernandez | 555-808-9633
Customer
Customer ID | First Name | Surname | Tel. No. 1 | Tel. No. 2 | Tel. No. 3
123 | Robert | Ingram | 555-861-2025 | |
456 | Jane | Wright | 555-403-1659 | 555-776-4100 |
789 | Maria | Fernandez | 555-808-9633 | |
First Normal Form (1NF)
Customer Name
Customer ID | First Name | Surname
123 | Robert | Ingram
456 | Jane | Wright
789 | Maria | Fernandez
Customer Telephone No.
Customer ID | Telephone No.
123 | 555-861-2025
456 | 555-403-1659
456 | 555-776-4100
789 | 555-808-9633
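The split into a name table and a telephone table can be sketched in a few lines of Python. This is only an illustration: the dictionary keys and the function name `to_1nf` are hypothetical, and the multi-valued telephone cell is modelled as a list.

```python
# Hypothetical in-memory version of the un-normalized Customer table:
# the telephone cell of customer 456 holds two values.
customers = [
    {"id": 123, "first_name": "Robert", "surname": "Ingram",
     "telephone": ["555-861-2025"]},
    {"id": 456, "first_name": "Jane", "surname": "Wright",
     "telephone": ["555-403-1659", "555-776-4100"]},
    {"id": 789, "first_name": "Maria", "surname": "Fernandez",
     "telephone": ["555-808-9633"]},
]


def to_1nf(rows):
    """Return (names, phones) so that every cell holds exactly one value."""
    names = [(r["id"], r["first_name"], r["surname"]) for r in rows]
    # One output row per (customer, telephone number) pair.
    phones = [(r["id"], number) for r in rows for number in r["telephone"]]
    return names, phones


customer_name, customer_telephone = to_1nf(customers)
for row in customer_telephone:
    print(row)
```

Customer 456 now appears twice in the telephone table, once per number, and no cell contains more than one value.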
Second Normal Form (2NF)
Definition: A 1NF table is in 2NF if and only if none of its non-prime attributes are functionally dependent on a part (proper subset) of a candidate key.
Employees' Skills
Employee | Skill | Current Work Location
Jones | Typing | 114 Main Street
Jones | Shorthand | 114 Main Street
Jones | Whittling | 114 Main Street
Bravo | Light Cleaning | 73 Industrial Way
Ellis | Alchemy | 73 Industrial Way
Ellis | Juggling | 73 Industrial Way
Harrison | Light Cleaning | 73 Industrial Way
Second Normal Form (2NF)
Employees
Employee | Current Work Location
Jones | 114 Main Street
Bravo | 73 Industrial Way
Ellis | 73 Industrial Way
Harrison | 73 Industrial Way
Employees' Skills
Employee | Skill
Jones | Typing
Jones | Shorthand
Jones | Whittling
Bravo | Light Cleaning
Ellis | Alchemy
Ellis | Juggling
Harrison | Light Cleaning
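A rough Python sketch of this decomposition follows. It assumes the candidate key of the original table is (Employee, Skill) and that Current Work Location depends on Employee alone; the function names are hypothetical.

```python
# The original Employees' Skills table as (employee, skill, location) rows.
employees_skills = [
    ("Jones", "Typing", "114 Main Street"),
    ("Jones", "Shorthand", "114 Main Street"),
    ("Jones", "Whittling", "114 Main Street"),
    ("Bravo", "Light Cleaning", "73 Industrial Way"),
    ("Ellis", "Alchemy", "73 Industrial Way"),
    ("Ellis", "Juggling", "73 Industrial Way"),
    ("Harrison", "Light Cleaning", "73 Industrial Way"),
]


def location_depends_on_employee_only(rows):
    """True if every employee maps to exactly one location (partial dependency)."""
    seen = {}
    for employee, _skill, location in rows:
        if seen.setdefault(employee, location) != location:
            return False
    return True


def to_2nf(rows):
    """Split into Employees (employee -> location) and Employees' Skills."""
    employees = {}
    skills = []
    for employee, skill, location in rows:
        employees[employee] = location
        skills.append((employee, skill))
    return employees, skills


assert location_depends_on_employee_only(employees_skills)
employees, skills = to_2nf(employees_skills)
print(employees)
print(skills)
```

After the split, each location is stored once per employee, so an address change touches a single row.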
Third Normal Form (3NF)
Definition: Every non-prime attribute of R is non-transitively dependent (directly dependent) on every key of R.
Tournament Winners
Tournament | Year | Winner | Winner Date of Birth
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open | 1999 | Bob Albertson | 28 September 1968
Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977
Third Normal Form (3NF)
Tournament Winners
Tournament | Year | Winner
Indiana Invitational | 1998 | Al Fredrickson
Cleveland Open | 1999 | Bob Albertson
Des Moines Masters | 1999 | Al Fredrickson
Indiana Invitational | 1999 | Chip Masterson
Player Dates of Birth
Player | Date of Birth
Chip Masterson | 14 March 1977
Al Fredrickson | 21 July 1975
Bob Albertson | 28 September 1968
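To check that nothing is lost by the 3NF split, the two tables can be joined back together. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table and column names are illustrative choices, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tournament_winners (
    tournament TEXT,
    year INTEGER,
    winner TEXT,
    PRIMARY KEY (tournament, year)
);
CREATE TABLE player_dates_of_birth (
    player TEXT PRIMARY KEY,
    date_of_birth TEXT
);
""")
conn.executemany(
    "INSERT INTO tournament_winners VALUES (?, ?, ?)",
    [
        ("Indiana Invitational", 1998, "Al Fredrickson"),
        ("Cleveland Open", 1999, "Bob Albertson"),
        ("Des Moines Masters", 1999, "Al Fredrickson"),
        ("Indiana Invitational", 1999, "Chip Masterson"),
    ],
)
conn.executemany(
    "INSERT INTO player_dates_of_birth VALUES (?, ?)",
    [
        ("Chip Masterson", "14 March 1977"),
        ("Al Fredrickson", "21 July 1975"),
        ("Bob Albertson", "28 September 1968"),
    ],
)

# Joining on the winner's name reconstructs the original four-column table.
rows = conn.execute("""
    SELECT t.tournament, t.year, t.winner, p.date_of_birth
    FROM tournament_winners AS t
    JOIN player_dates_of_birth AS p ON p.player = t.winner
    ORDER BY t.year, t.tournament
""").fetchall()
for row in rows:
    print(row)
```

Each date of birth is now stored once, so the two rows for Al Fredrickson can no longer disagree.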
Data Warehouse
Operational databases are optimized for the preservation of data integrity and speed of recording of business transactions.
Data warehouses are optimized for the speed of data retrieval.
A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis.
W. H. Inmon states that the data warehouse is: Subject-oriented, Time-variant, Non-volatile, Integrated
Data Warehousing
• Business Intelligence Tools
• Tools to extract, transform, and load data into the repository
• Tools to manage and retrieve metadata
Multidimensional Data
OLAP Cube
To Build a Data Warehouse
Data must be extracted from multiple, heterogeneous sources such as databases or other data feeds.
Data must be formatted for consistency within the data warehouse. Names, meanings and domains of data from unrelated sources must be reconciled.
Data must be cleaned to ensure validity. Data cleaning is an important part in building a data warehouse and it is one of the most labor-demanding tasks.
Data must be fitted into the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy databases.
Data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task.
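The steps above (extract, format, clean, fit, load) could be sketched roughly as follows. Everything here is hypothetical: the two sources, their field names, the gender coding and the age range used for validity are invented purely to illustrate the flow.

```python
# Two hypothetical sources describing the same kind of record differently.
source_a = [
    {"cust_id": 1, "gender": "M", "age": 34},
    {"cust_id": 2, "gender": "F", "age": -5},  # infeasible age
]
source_b = [
    {"CustomerID": 3, "Sex": "female", "Age": 51},
]

# Reconcile different coding schemes into one domain.
GENDER_CODES = {"M": "male", "F": "female", "male": "male", "female": "female"}


def extract():
    """Pull records from every source, remembering where each came from."""
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]


def transform(tagged):
    """Map source-specific names and codes onto the warehouse data model."""
    source, r = tagged
    if source == "a":
        return {"customer_id": r["cust_id"],
                "gender": GENDER_CODES[r["gender"]],
                "age": r["age"]}
    return {"customer_id": r["CustomerID"],
            "gender": GENDER_CODES[r["Sex"]],
            "age": r["Age"]}


def is_valid(record):
    """Cleaning rule (an assumption): ages must lie in a plausible range."""
    return 0 <= record["age"] <= 120


def load(records, warehouse):
    warehouse.extend(records)


warehouse = []
load([rec for rec in map(transform, extract()) if is_valid(rec)], warehouse)
print(warehouse)
```

In practice each step is far more involved, but the order of operations is the same.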
Data Warehouse vs. Database
Differences
Data Warehouse | Operational Database
Designed for the analysis of business measures by categories and attributes. | Designed for real time business operations.
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. | Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
Loaded with consistent, valid data; requires no real time validation. | Optimized for validation of incoming data during transactions; uses validation data tables.
Supports few concurrent users. | Supports thousands of concurrent users.
Performance Dashboard
5 Minutes …
Data Mining
People have been analysing and investigating data for centuries.
Statistics: Mean, Variance, Correlation, Distribution …
In modern days, data are often far beyond human comprehension: Diversity, Volume, Dimensionality
Definition: Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
Not a fully automatic process: human interventions are often inevitable (Domain Knowledge, Data Collection and Pre-processing).
Synonym: Knowledge Discovery
One Field, Many Techniques, Unlimited Applications
The Process of Data Mining
DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as variables, characters, etc) and based on a training set of previously labeled items”.
Given training data {(x1, y1), …, (xn, yn)}, the task is to produce a classifier that maps any unknown object xi to its true classification label yi defined by some unknown mapping.
Algorithms: Decision Trees, K-nearest neighbours, Neural Networks, Support Vector Machines
Applications: Credit Scoring, Churn Prediction, Medical Diagnosis
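As one concrete instance of learning from labeled training pairs, here is a minimal k-nearest-neighbours sketch in plain Python. The toy points, the labels "A" and "B", and the choice k = 3 are made up for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(training, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbours.

    training is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(training, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set {(x1, y1), ..., (xn, yn)} with two well-separated classes.
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

print(knn_predict(training, (1.1, 1.0)))
print(knn_predict(training, (4.0, 4.0)))
```

Unlike decision trees or neural networks, k-NN builds no explicit model: the training set itself is the classifier.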
Classification Boundaries
Confusion Matrix
Rows: prediction outcome; columns: actual value
Prediction outcome | Actual p | Actual n | Total
p' | True Positive | False Positive | P'
n' | False Negative | True Negative | N'
Total | P | N |
Accuracy = (TP + TN) / (P + N)
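The accuracy formula can be checked on a small example. The labels below are invented; "p" and "n" follow the notation of the matrix, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive="p"):
    """Count (TP, FP, FN, TN) from paired actual and predicted labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = FP + TN."""
    return (tp + tn) / (tp + fp + fn + tn)


actual = ["p", "p", "p", "n", "n", "n", "n", "p"]
predicted = ["p", "p", "n", "n", "n", "p", "n", "p"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)            # 3 1 1 3
print(accuracy(tp, fp, fn, tn))  # 0.75
```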
Receiver Operating Characteristic
Lift
DM Techniques - Clustering
Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
Distance Metrics: Euclidean distance, Manhattan distance, Mahalanobis distance
Algorithms: K-means, Leader, RPCL, Affinity Propagation
Applications: Market Research, Image Segmentation, Social Network Analysis
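A bare-bones K-means sketch, together with two of the distance metrics listed, is given below. The points and the fixed initial centroids are invented so that the run is deterministic; real implementations usually pick initial centroids at random and test for convergence.

```python
from math import dist  # Euclidean distance between two points


def manhattan(a, b):
    """Manhattan (city-block) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (Euclidean distance).
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the groups emerge from the distances alone.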
What is the difference between classification and clustering?
DM Techniques – Association Rule
Association Rule
Example database with 4 items and 5 transactions
Transaction ID | milk | bread | butter | beer
1 | 1 | 1 | 0 | 0
2 | 0 | 1 | 1 | 0
3 | 0 | 0 | 0 | 1
4 | 1 | 1 | 1 | 0
5 | 0 | 1 | 0 | 0
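The slides do not spell out the rule measures, so the sketch below assumes the standard definitions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X → Y is support(X ∪ Y) / support(X). The transactions are taken from the table above.

```python
# The five transactions, each written as the set of items it contains.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"milk", "bread"}))                 # 0.4
print(confidence({"milk", "bread"}, {"butter"}))  # 0.5
```

So {milk, bread} appears in 2 of 5 transactions, and in half of those butter is bought as well.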
DM Techniques – Regression
Regression
Overfitting – Regression
Overfitting – Classification
Cross Validation
[Figure labels: Data, Training Set, Test Set, Generated Models, Evaluation]
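One common way to produce the training and test sets is k-fold cross validation. The sketch below only shows the splitting step; the function name `k_fold_splits` and the interleaved fold assignment are illustrative choices, and model fitting and evaluation are left out.

```python
def k_fold_splits(data, k):
    """Yield (training_set, test_set) pairs.

    Every item appears in exactly one test set and in k - 1 training sets.
    """
    folds = [data[i::k] for i in range(k)]  # interleaved folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


data = list(range(10))
for train, test in k_fold_splits(data, 5):
    # A model would be generated from train and evaluated on test here.
    print("test:", test, "train:", train)
```

Averaging the evaluation score over the k test sets gives a less optimistic estimate than scoring a model on the data it was trained on.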
Seeing is Knowing
Data Preprocessing
Why data preprocessing? Real data are often surprisingly dirty.
• Incomplete Data
• Inconsistent Data
• Noisy Data
Typical Issues
• Missing Attribute Values
• Different Coding/Naming Schemes
• Infeasible Values
• Outliers
Data Quality: Accuracy, Completeness, Consistency, Interpretability, Credibility, Timeliness
Data Preprocessing
Data quality is a crucial factor in successful data mining tasks.
Data Cleaning: Fill in missing values. Correct inconsistent data. Identify outliers and noisy data.
Data Integration: Combine data from different sources.
Data Transformation: Normalization, Aggregation, Type Conversion
Data Reduction: Feature Selection, Sampling
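Two of these steps, filling in missing values and normalization, are easy to sketch. Mean imputation and min-max scaling are used here as assumptions; they are only one of several possible choices, and the sample values are invented.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace missing entries (None) with the mean of the rest."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, None, 40, 30]
filled = fill_missing(ages)
normalized = min_max_normalize(filled)
print(filled)      # [20, 30, 40, 30]
print(normalized)  # [0.0, 0.5, 1.0, 0.5]
```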
Review
What is data mining?
Why is data mining important?
What are the typical data mining applications?
What is the general procedure of data mining?
What are the major techniques in data mining?
What is the difference between data warehouses and databases?
What to expect in this course?
Where to find relevant information?
How to make the most of this course?
Just in Case Someone Asks …
Just in Case Someone Asks …