introduction to clementine1701
TRANSCRIPT
-
8/13/2019 Introduction to Clementine1701
1/31
Introduction to Clementine
Tutors: Cecia Chan & Gabriel Fung
Data Mining Tutorial
-
A Brief Review of Data Mining (I)
Data mining is
A process of extracting previously unknown, valid and
actionable knowledge from large databases
A rule of thumb:
If we know clearly the shape and likely content of what
we are looking for, we are probably not dealing with
data mining
-
A Brief Review of Data Mining (II)
Therefore, data mining is not
SQL queries against any number of disparate databases or data
warehouses
SQL queries in a parallel or massively parallel environment
Information retrieval, for example, through intelligent agents
Multidimensional database analysis (MDA)
OLAP
Exploratory data analysis (EDA)
Graphical visualization
Traditional statistical processing against a data warehouse
However, they are all related to data mining
-
Data Mining Process
1. Business objective(s) determination
What is your goal?
2. Data collection
You can learn nothing without data
3. Data preprocessing (or data preparation)
Remove outliers / filter noise / modify fields / etc.
4. Modeling
The core part of data mining
5. Evaluation
See what you have learned!
-
Data Mining Software
Existing Data mining software:
Clementine from SPSS (we have this software),
Enterprise Miner from SAS (we have this software),
Intelligent Miner from IBM (we have this software),
MineSet from Silicon Graphics,
K-wiz from Compression Sciences Ltd.,
DBMiner from DBMiner Tech. Inc.,
PolyAnalyst from Megaputer Intelligence,
StatServer from Mathsoft
:
:
-
Problem Statement
Situation:
You are a researcher compiling data for a medical
study
You have collected data about a set of patients, all of
whom suffered from the same illness
Each patient responded to one of five drug treatments
-
Step 1: Business objective
Figure out which drug might be appropriate for a
future patient with the same illness
Here are the data collected:
Age
Sex (M or F)
BP (Blood pressure: High, normal, or low)
Weight (The weight of the patient)
Cholesterol (Blood cholesterol: Normal or high)
Na (Blood sodium concentration)
K (Blood potassium concentration)
Drug (Drug to which the patient responded)
-
Using Clementine (1)
Clementine is located in
Start > All Programs > Clementine 6.0.2
[Screenshot of the main window, with labels: Models, Nodes, Work-Space]
-
Using Clementine (2)
Nodes in the workspace represent different objects
and actions. You connect the nodes to form
streams, which, when executed, let you visualize
relationships and draw conclusions.
-
Step 2: Data Collection (1)
Double Click
Nodes for inputting
the collected data
-
Data Collection (2)
Location of your file
How many columns to use from the file
Whether the first row specifies the names of the fields
Other details
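As a rough illustration of the options on this slide (file location, number of columns to use, whether the first row holds field names), here is a minimal Python sketch. It is not how Clementine reads files; the function `read_records`, its parameters, and the sample values are assumptions made up for this example.

```python
import csv
import io

# Hypothetical sample shaped like the tutorial's drug data; values are invented.
SAMPLE = """Age,Sex,BP,Weight,Cholesterol,Na,K,Drug
23,F,HIGH,58,HIGH,0.793,0.031,drugY
47,M,LOW,81,HIGH,0.739,0.056,drugC
28,F,NORMAL,62,HIGH,0.564,0.072,drugX
"""

def read_records(text, has_header=True, max_columns=None, delimiter=","):
    """Parse delimited text into a list of dicts, one per data row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: generate placeholder names field1, field2, ...
        names = [f"field{i + 1}" for i in range(len(rows[0]))]
        data = rows
    if max_columns is not None:
        names = names[:max_columns]  # keep only the first N columns
    # zip() stops at the shorter sequence, so extra columns are dropped.
    return [dict(zip(names, row)) for row in data]

records = read_records(SAMPLE, has_header=True, max_columns=3)
print(records[0])
```

To read from a real file instead of the in-memory sample, the same parsing could be applied to the text returned by `open(path).read()`.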
-
Step 3: Data Preparation - Explore the Data (1)
Nodes for exploration/visualization:
Table (in the Output panel)
Plot (in the Graphs Panel)
Histogram (in the Graphs Panel)
Distribution (in the Graphs Panel)
Web (in the Graphs Panel)
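To give a feel for what the Distribution and Histogram nodes summarize, the sketch below counts values of a categorical field and bins a numeric field in plain Python. The records, the bin width, and the helper names `distribution` and `histogram` are assumptions for illustration only, not Clementine functionality.

```python
from collections import Counter

BIN_WIDTH = 20  # hypothetical bin width for the Age histogram

# Hypothetical patient records; values are invented.
records = [
    {"Age": 23, "BP": "HIGH", "Drug": "drugY"},
    {"Age": 47, "BP": "LOW", "Drug": "drugC"},
    {"Age": 28, "BP": "NORMAL", "Drug": "drugX"},
    {"Age": 61, "BP": "LOW", "Drug": "drugY"},
    {"Age": 35, "BP": "HIGH", "Drug": "drugY"},
]

def distribution(rows, field):
    """Count how often each value of a categorical field occurs."""
    return Counter(row[field] for row in rows)

def histogram(rows, field, bin_width):
    """Count numeric values per bin; each key is the lower edge of a bin."""
    return Counter((row[field] // bin_width) * bin_width for row in rows)

drug_counts = distribution(records, "Drug")
age_bins = histogram(records, "Age", BIN_WIDTH)

# Text-mode bar charts: one '#' per record.
for value, count in drug_counts.most_common():
    print(f"{value:8s} {'#' * count} ({count})")
for lower in sorted(age_bins):
    print(f"{lower:3d}-{lower + BIN_WIDTH - 1:3d} {'#' * age_bins[lower]}")
```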
-
Step 3: Data Preparation - Explore the Data (2)
Note:
Connect the nodes by clicking and dragging with the middle mouse button
Double Click
Connect the nodes:
-
Step 3: Data Preparation - Explore the Data (3)
Execution:
Note:
Right-click the Table node
to display this menu
-
Step 3: Data Preparation - Explore the Data (4)
Other nodes (Please try the other nodes yourself):
Histogram:
-
Step 3: Data Preparation - Modify the Data (1)
Replacing values:
Use Filler node:
Suppose we want to transform all weights to their log values (Note:
we usually only transform a variable to its log when it is highly
skewed):
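The effect of a log transform on a skewed field can be checked numerically. The sketch below is a plain-Python illustration, not the Filler node itself; the weight values are invented, and `skewness` is a hypothetical helper that computes the third standardized moment.

```python
import math

def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical weights: mostly moderate, with a few very large values
# that create a long right tail.
weights = [52.0, 55.0, 58.0, 60.0, 61.0, 63.0, 65.0, 70.0, 120.0, 180.0]

# Replace every weight with its natural log (requires positive values).
log_weights = [math.log(w) for w in weights]

print(f"skewness before: {skewness(weights):.2f}")
print(f"skewness after:  {skewness(log_weights):.2f}")
```

On this sample the log transform reduces the skewness without removing it entirely, which is the usual motivation for applying it only to highly skewed fields.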
-
Step 3: Data Preparation - Modify the Data (2)
Derive a new value:
Use Derive node:
Suppose we want to combine Na and K:
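A minimal sketch of deriving a combined field in plain Python follows. It assumes the combination is the ratio Na / K, based on the field name Na_Over_K used later in this tutorial; the helper name `derive_na_over_k` and the sample values are invented, and K is assumed to be non-zero.

```python
def derive_na_over_k(rows):
    """Return new rows with an added Na_Over_K field (Na divided by K)."""
    derived = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay unchanged
        # Assumption: K is never zero in the data.
        new_row["Na_Over_K"] = row["Na"] / row["K"]
        derived.append(new_row)
    return derived

# Hypothetical patients; values are invented.
patients = [
    {"Na": 0.75, "K": 0.05, "Drug": "drugY"},
    {"Na": 0.60, "K": 0.06, "Drug": "drugX"},
]

for row in derive_na_over_k(patients):
    print(row["Drug"], round(row["Na_Over_K"], 2))
```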
-
Step 3: Data Preparation - Modify the Data (3)
Remove some fields
Use Filter node
Suppose we have derived a new field Na_Over_K; now we
need to remove the fields Na and K:
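Dropping fields can be pictured in plain Python as follows. This is only a sketch of the idea behind the Filter node; `filter_fields` is a hypothetical helper and the record is invented.

```python
def filter_fields(rows, drop):
    """Return copies of the rows without the named fields."""
    drop = set(drop)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Hypothetical record that already contains the derived ratio.
patients = [
    {"Age": 23, "Na": 0.75, "K": 0.05, "Na_Over_K": 15.0, "Drug": "drugY"},
]

print(filter_fields(patients, drop=["Na", "K"]))
```

Returning copies rather than deleting keys in place keeps the original records available, which is convenient when several branches of an analysis start from the same data.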
-
Step 4: Modeling - Define fields
Define the fields
Use Type node:
-
Step 4: Modeling - Build a Model (1)
It is the core part of data mining.
Supervised Learning:
Train Net (Neural Network)
C5.0 (C5.0 Decision Tree)
Linear Reg. (Linear regression)
C & R Tree (Classification and Regression Tree, CART)
Unsupervised Learning:
Train Kohonen (Self-Organizing Map, SOM)
Train KMeans (K-means Clustering)
TwoStep (A kind of Hierarchical Clustering)
Others:
GRI (Association Rule mining)
Apriori (Association Rule mining)
Factor / PCA (Factor analysis, attribute selection technique)
-
Step 4: Modeling - Build a Model (2)
Build what model?
Recall that our objective is to determine which drug is
suitable for a specific patient.
Thus, it is a classification problem (supervised learning)
In this tutorial, we use:
C5.0 and C & R Tree
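Decision-tree learners build a tree by repeatedly choosing the field that best separates the classes. The sketch below illustrates one such criterion, entropy-based information gain, which underlies the C4.5/C5.0 family (CART-style trees typically use other measures such as the Gini index). It is not the C5.0 algorithm itself; the rows and helper names are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, field, target):
    """Reduction in entropy of `target` after splitting rows on `field`."""
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[target])
    # Weighted average entropy of the groups produced by the split.
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

# Hypothetical training rows: here BP determines the drug, Sex does not.
rows = [
    {"BP": "HIGH", "Sex": "F", "Drug": "drugA"},
    {"BP": "HIGH", "Sex": "M", "Drug": "drugA"},
    {"BP": "LOW", "Sex": "F", "Drug": "drugC"},
    {"BP": "LOW", "Sex": "M", "Drug": "drugC"},
    {"BP": "NORMAL", "Sex": "F", "Drug": "drugX"},
    {"BP": "NORMAL", "Sex": "M", "Drug": "drugX"},
]

gains = {f: information_gain(rows, f, "Drug") for f in ("BP", "Sex")}
best_field = max(gains, key=gains.get)
print("best field to split on:", best_field)
```

A tree learner would split on the best field and then repeat the same selection within each resulting group.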
-
Step 4: Modeling - Build a Model (3)
Note:
There are many complex settings for each model
In this tutorial, we use the default settings
Fine-tuning a model requires solid experience in data mining
-
Step 5: Evaluation (1)
Even if we have learned SOMETHING, it means NOTHING
until the knowledge that we have learned is
ACTIONABLE and VALID
Remember: The data sets for training and testing are ALWAYS
different (why?)
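One way to see why testing data must differ from training data is to look at a "model" that only memorizes its training rows. The sketch below is a deliberately extreme, hypothetical example (the data, `train_memorizer`, and `accuracy` are invented): it scores perfectly on the rows it has seen and poorly on new ones.

```python
def train_memorizer(rows, features, target):
    """'Model' that simply memorizes the answer for each training row."""
    table = {tuple(r[f] for f in features): r[target] for r in rows}
    labels = [r[target] for r in rows]
    fallback = max(set(labels), key=labels.count)  # most common class
    # Unseen feature combinations fall back to the most common class.
    return lambda row: table.get(tuple(row[f] for f in features), fallback)

def accuracy(model, rows, target):
    """Fraction of rows whose predicted class equals the true class."""
    return sum(model(r) == r[target] for r in rows) / len(rows)

FEATURES = ("Age", "BP")

# Hypothetical training and testing sets; values are invented.
train = [
    {"Age": 23, "BP": "HIGH", "Drug": "drugA"},
    {"Age": 47, "BP": "LOW", "Drug": "drugC"},
    {"Age": 28, "BP": "NORMAL", "Drug": "drugX"},
    {"Age": 61, "BP": "LOW", "Drug": "drugC"},
]
test = [
    {"Age": 30, "BP": "HIGH", "Drug": "drugA"},
    {"Age": 52, "BP": "LOW", "Drug": "drugC"},
    {"Age": 40, "BP": "NORMAL", "Drug": "drugX"},
    {"Age": 35, "BP": "HIGH", "Drug": "drugA"},
]

model = train_memorizer(train, FEATURES, "Drug")
train_acc = accuracy(model, train, "Drug")
test_acc = accuracy(model, test, "Drug")
print("training accuracy:", train_acc)
print("testing accuracy: ", test_acc)
```

Evaluating on the training rows alone would make this memorizer look perfect, which is why only the score on unseen data says anything about how valid the learned knowledge is.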
-
Step 5: Evaluation (2)
Create the following flow
Note:
Must have the same flow
as the training stage
-
Step 5: Evaluation (3)
Different results:
Different models can yield completely different results
Choosing and tuning a good model is a difficult job
In this tutorial, we only introduce the process of data
mining
-
Assignment 1
-
Assignment 1 - Field definitions
VARIABLE  ROLE    DEFINITION  DESCRIPTION
CHECKING  input   Nominal     Checking account status
HISTORY   input   Nominal     Credit history
AMOUNT    input   Interval    Amount in Bank
SAVINGS   input   Nominal     No. of Savings (bonds, stocks, etc.)
EMPLOYED  input   Nominal     Employment Type (Gov., private, etc.)
INSTALLP  input   Nominal     Type of installment rate
MARITAL   input   Nominal     Marital status
PROPERTY  input   Nominal     Type of Property
AGE       input   Interval    Age in years
OTHER     input   Nominal     Type of other installment plan
HOUSING   input   Nominal     Type of House
EXISTCR   input   Interval    Number of existing credits
JOB       input   Nominal     Job Nature
FOREIGN   input   Binary      Foreign worker or Local worker
GOOD_BAD  output  Binary      Good or bad credit rating
-
Assignment 1 - Data Mining Process
Data Collection
Please download the CreditRisk data set from
http://www.se.cuhk.edu.hk/~ect7470/
Two data sets:
(i) creditRisk1.csv is for training
(ii) creditRisk2.csv is for testing
Data Preprocessing
Please explore the data and think critically about whether any
data preprocessing is necessary
Hint: Two of the interval variables are highly skewed
-
Assignment 1 - Data Mining Process
Modeling
Please build a prediction model using the default settings:
C5.0 Decision Tree
Model Assessment
Please use the testing data set to evaluate the
performance of the prediction model
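Performance on the testing set is commonly summarized as an accuracy figure and a table of actual versus predicted classes. The sketch below shows that bookkeeping in plain Python; the label values "good" and "bad" and all predictions are invented, and `confusion_matrix` is a hypothetical helper rather than anything Clementine provides.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical GOOD_BAD values for six test records and a model's predictions.
actual = ["good", "good", "bad", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "good", "good", "bad"]

matrix = confusion_matrix(actual, predicted)
correct = sum(n for (a, p), n in matrix.items() if a == p)
accuracy = correct / len(actual)

# Print one row per actual class, one column per predicted class.
labels = sorted(set(actual) | set(predicted))
print("actual \\ predicted", *labels)
for a in labels:
    print(a, *(matrix[(a, p)] for p in labels))
print("accuracy:", round(accuracy, 2))
```

The off-diagonal cells are often more informative than the accuracy alone, since rating a bad credit risk as good may be costlier than the reverse.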
-
Assignment 1 - Submission
Save the stream as id.str
E.g., 00123456.str
Upload your stream to the course account
Deadline: 4 April 2004
This is an individual assignment
Note: We strongly encourage you to submit this assignment
during the class!