introduction to clementine1701

8/13/2019 Introduction to Clementine1701

1/31

Introduction to Clementine

Tutors: Cecia Chan & Gabriel Fung

Data Mining Tutorial


2/31

A Brief Review of Data Mining (I)

Data mining is

A process of extracting previously unknown, valid, and actionable knowledge from large databases

A rule of thumb:

If we know clearly the shape and likely content of what

we are looking for, we are probably not dealing with

data mining


3/31

A Brief Review of Data Mining (II)

Therefore, data mining is not

SQL queries against any number of disparate databases or data warehouses

SQL queries in a parallel or massively parallel environment

Information retrieval, for example, through intelligent agents

Multidimensional database analysis (MDA)

OLAP

Exploratory data analysis (EDA)

Graphical visualization

Traditional statistical processing against a data warehouse

However, they are all related to data mining


4/31

Data Mining Process

1. Business objective(s) determination

What is your goal?

2. Data collection

You can learn nothing without data

3. Data preprocessing (or Data preparation)

Remove outliers / filter noise / modify fields / etc.

4. Modeling

The core part of data mining

5. Evaluation

See what you have learned!


5/31

Data Mining Software

Existing Data mining software:

Clementine from SPSS (we have this software),

Enterprise Miner from SAS (we have this software),

Intelligent Miner from IBM (we have this software),

MineSet from Silicon Graphics,

K-wiz from Compression Sciences Ltd.,

DBMiner from DBMiner Tech. Inc.,

PolyAnalyst from Megaputer Intelligence,

StatServer from Mathsoft

:

:


6/31

Problem Statement

Situation:

You are a researcher compiling data for a medical

study

You have collected data about a set of patients, all of whom suffered from the same illness

Each patient responded to one of five drug treatments


7/31

Step 1: Business objective

Figure out which drug might be appropriate for a

future patient with the same illness

Here are the data collected:

Age

Sex (M or F)

BP (Blood pressure: High, normal, or low)

Weight (The weight of the patient)

Cholesterol (Blood cholesterol: Normal or high)

Na (Blood sodium concentration)

K (Blood potassium concentration)

Drug (Drug to which the patient responded)


8/31

Using Clementine (1)

Clementine is located in

Start > All Programs > Clementine 6.0.2

Models

Nodes

Work-Space


9/31

Using Clementine (2)

Nodes in the workspace represent different objects

and actions. You connect the nodes to form

streams, which, when executed, let you visualize

relationships and draw conclusions.


10/31

Step 2: Data Collection (1)

Double Click

Nodes for inputting

the collected data


11/31

Data Collection (2)

Location of your file

How many columns to use from the file

Whether the first row specifies the names of the fields

Other details
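For readers who want to see what these settings amount to outside the GUI, here is a minimal Python sketch. It is not how Clementine reads files internally. The sample rows and the helper name load_records are made up for illustration; only the field names come from this tutorial.

```python
import csv
import io

# Hypothetical sample in the same layout as the tutorial's fields.
SAMPLE_CSV = """Age,Sex,BP,Weight,Cholesterol,Na,K,Drug
23,F,HIGH,60.5,HIGH,0.79,0.03,drugY
47,M,LOW,82.0,HIGH,0.74,0.06,drugC
"""

def load_records(text, has_header=True, max_columns=None):
    """Parse CSV text into a list of dicts, one per data row."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        # The first row supplies the field names.
        names, data = rows[0], rows[1:]
    else:
        # No header row: generate placeholder names field1, field2, ...
        names = [f"field{i + 1}" for i in range(len(rows[0]))]
        data = rows
    # Keep only the first max_columns fields (None keeps all of them).
    names = names[:max_columns]
    return [dict(zip(names, row)) for row in data]

records = load_records(SAMPLE_CSV, has_header=True, max_columns=4)
print(records[0])
```

All values are read as text here; converting Age, Weight, Na, and K to numbers would be a separate step.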


12/31

Step 3: Data Preparation - Explore the Data (1)

Nodes for exploration/visualization:

Table (in the Output panel)

Plot (in the Graphs Panel)

Histogram (in the Graphs Panel)

Distribution (in the Graphs Panel)

Web (in the Graphs Panel)
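As a rough idea of what a distribution view reports for a categorical field, the sketch below counts how often each value occurs and prints a text bar chart. The drug values are invented, and the function name distribution is an assumption for illustration, not a Clementine call.

```python
from collections import Counter

# Hypothetical Drug values for ten patients.
drugs = ["drugY", "drugX", "drugY", "drugA", "drugY",
         "drugC", "drugX", "drugY", "drugB", "drugX"]

def distribution(values):
    """Return (value, count, share) tuples, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(v, n, n / total) for v, n in counts.most_common()]

for value, count, share in distribution(drugs):
    # One '#' per occurrence gives a crude bar chart.
    print(f"{value:6s} {count:2d} {'#' * count} ({share:.0%})")
```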


13/31


Note:

Connect the nodes by clicking and dragging with the middle button of the mouse

Double Click

Connect the nodes:


14/31


Execution:

Note:

Right click on the table node

to display this menu


15/31


Other nodes (Please try the other nodes yourself):

Histogram:


16/31

Step 3: Data Preparation - Modify the Data (1)

Replacing values:

Use Filler node:

Suppose we want to transform all weights to their log values (Note: we usually only transform a variable to its log when it is highly skewed):
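The same replacement can be sketched in plain Python. The weights below are made up, with one very large value to mimic a right-skewed field, and the natural logarithm is an assumption, since the slide does not say which base is used.

```python
import math

# Hypothetical, right-skewed weights (sorted, one very large value).
weights = [48.0, 52.5, 55.0, 61.0, 63.5, 70.0, 140.0]

def log_transform(values):
    """Replace each positive value with its natural logarithm."""
    if any(v <= 0 for v in values):
        raise ValueError("log is undefined for non-positive values")
    return [math.log(v) for v in values]

log_weights = log_transform(weights)

# Compare how far the largest value sits from the median (index 3)
# before and after the transform; the ratio shrinks on the log scale.
print(max(weights) / weights[3], max(log_weights) / log_weights[3])
```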


17/31


Derive a new value:

Use Derive node:

Suppose we want to combine Na and K:
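A minimal Python sketch of deriving such a field follows. It assumes the combination is the ratio Na / K, which is what the field name Na_Over_K used later in this tutorial suggests. The patient values and the function name derive_na_over_k are hypothetical.

```python
# Hypothetical patient records; only the field names come from the tutorial.
patients = [
    {"Age": 23, "Na": 0.79, "K": 0.03, "Drug": "drugY"},
    {"Age": 47, "Na": 0.74, "K": 0.06, "Drug": "drugC"},
]

def derive_na_over_k(record):
    """Return a copy of the record with an added Na_Over_K field (Na divided by K)."""
    derived = dict(record)  # copy, so the original record is left unchanged
    derived["Na_Over_K"] = record["Na"] / record["K"]
    return derived

derived = [derive_na_over_k(p) for p in patients]
print(derived[0]["Na_Over_K"])
```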


18/31


Remove some fields

Use Filter node

Suppose we have derived a new field Na_Over_K; now we need to remove the fields Na and K:
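In code, removing fields is just a matter of copying each record without them. The record values and the helper name drop_fields below are assumptions for illustration.

```python
def drop_fields(record, unwanted):
    """Return a copy of the record without the named fields."""
    return {name: value for name, value in record.items() if name not in unwanted}

# Hypothetical record that already carries the derived field.
record = {"Age": 23, "Na": 0.79, "K": 0.03, "Na_Over_K": 26.3, "Drug": "drugY"}
print(drop_fields(record, {"Na", "K"}))
```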


19/31

Step 4: Modeling - Define fields

Define the fields

Use Type node:


20/31

Step 4: Modeling - Build a Model (1)

It is the core part of data mining.

Supervised Learning:

Train Net (Neural Network)

C5.0 (C5.0 Decision Tree)

Linear Reg. (Linear regression)

C & R Tree (Classification and Regression Tree, CART)

Unsupervised Learning:

Train Kohonen (Self-Organizing Map, SOM)

Train KMeans (K-means Clustering)

TwoStep (A kind of Hierarchical Clustering)

Others:

GRI (Association Rule mining)

Apriori (Association Rule mining)

Factor / PCA (Factor analysis, attribute selection technique)


21/31


Build what model?

Recall that our objective is to determine which drug is suitable for a specific patient.

Thus, it is a classification problem (supervised learning)

In this tutorial, we use:

C5.0 and C & R Tree
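Decision-tree learners grow a tree by repeatedly choosing the field that best separates the classes. The sketch below shows one common criterion, entropy-based information gain, on made-up rows. It is only an illustration of the idea, not Clementine's implementation: C5.0 is a commercial algorithm whose details are not given in this tutorial, and CART-style trees commonly use the Gini index instead.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, field, target):
    """Entropy reduction obtained by splitting rows on a categorical field."""
    before = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    # Weighted average entropy of the groups produced by the split.
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

# Hypothetical toy data: BP separates the drugs perfectly, Sex does not.
rows = [
    {"BP": "HIGH", "Sex": "M", "Drug": "drugA"},
    {"BP": "HIGH", "Sex": "F", "Drug": "drugA"},
    {"BP": "LOW", "Sex": "M", "Drug": "drugC"},
    {"BP": "LOW", "Sex": "F", "Drug": "drugC"},
]
best = max(["BP", "Sex"], key=lambda f: information_gain(rows, f, "Drug"))
print(best)
```

A tree builder would split on the chosen field and then repeat the same selection inside each resulting group.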


22/31


Note:

There are many complex settings for each model

In this tutorial, we use the default settings

Fine-tuning a model requires solid experience in data mining


23/31

Step 5: Evaluation (1)

It means NOTHING even if we have learned SOMETHING, until the knowledge that we have learned is ACTIONABLE and VALID

Remember: The training and testing data sets are ALWAYS different (why?)
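One way to see why: a model can score perfectly on the data it was trained on simply by memorizing it. The sketch below uses a deliberately poor "model" that stores every training row and falls back to a default answer for anything unseen. All rows, names, and the default value are hypothetical.

```python
def make_key(row, target):
    """Hashable key built from every field except the target."""
    return tuple(sorted((k, v) for k, v in row.items() if k != target))

def train_memorizer(rows, target):
    """'Learn' by memorizing each training row exactly."""
    return {make_key(r, target): r[target] for r in rows}

def accuracy(table, rows, target, default):
    """Fraction of rows whose looked-up (or default) prediction is correct."""
    hits = sum(table.get(make_key(r, target), default) == r[target] for r in rows)
    return hits / len(rows)

# Hypothetical training and testing rows.
train_rows = [
    {"BP": "HIGH", "Age": 23, "Drug": "drugA"},
    {"BP": "LOW", "Age": 47, "Drug": "drugC"},
    {"BP": "HIGH", "Age": 35, "Drug": "drugA"},
]
test_rows = [
    {"BP": "HIGH", "Age": 29, "Drug": "drugA"},
    {"BP": "LOW", "Age": 51, "Drug": "drugC"},
]

model = train_memorizer(train_rows, "Drug")
print("training accuracy:", accuracy(model, train_rows, "Drug", "drugA"))
print("testing accuracy:", accuracy(model, test_rows, "Drug", "drugA"))
```

The training score says nothing about how the model handles new patients; only the score on unseen rows does.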


24/31


Create the following flow

Note:

Must have the same flow

as the training stage


25/31


Different results:

Different models can yield completely different results

Choosing and tuning a good model is a difficult job

In this tutorial, we introduce only the process of data mining


26/31

Assignment 1


27/31


28/31

Assignment 1 - Field definitions

VARIABLE   ROLE    DEFINITION  DESCRIPTION
CHECKING   input   Nominal     Checking account status
HISTORY    input   Nominal     Credit history
AMOUNT     input   Interval    Amount in Bank
SAVINGS    input   Nominal     No. of Savings (bonds, stocks, etc)
EMPLOYED   input   Nominal     Employment Type (Gov., private, etc)
INSTALLP   input   Nominal     Type of installment rate
MARITAL    input   Nominal     Marital status
PROPERTY   input   Nominal     Type of Property
AGE        input   Interval    Age in years
OTHER      input   Nominal     Type of other installment plan
HOUSING    input   Nominal     Type of House
EXISTCR    input   Interval    Number of existing credits
JOB        input   Nominal     Job Nature
FOREIGN    input   Binary      Foreign worker or Local worker
GOOD_BAD   Output  Binary      Good or bad credit rating


29/31

Assignment 1 - Data Mining Process

Data Collection

Please download the CreditRisk data set from

http://www.se.cuhk.edu.hk/~ect7470/

Two data sets:

(i) creditRisk1.csv is for training

(ii) creditRisk2.csv is for testing

Data Preprocessing

Please explore the data and think critically about whether any data preprocessing is necessary

Hint: Two of the interval variables are highly skewed


30/31

Assignment 1 - Data Mining Process

Modeling

Please build a prediction model using the default settings:

C5.0 Decision Tree

Model Assessment

Please use the testing data set to evaluate the performance of the prediction model


31/31

Assignment 1 - Submission

Save the stream as id.str

E.g., 00123456.str

Upload your stream to the course account

Deadline: 4 April 2004

This is an individual assignment

Note: We strongly encourage you to submit this assignment during the class!!!
