introduction to clementine1701

Upload: sarbarup-banerjee

Post on 04-Jun-2018

  • 8/13/2019 Introduction to Clementine1701

    1/31

    Introduction to Clementine

    Tutors: Cecia Chan & Gabriel Fung

    Data Mining Tutorial

    A Brief Review of Data Mining (I)

    Data mining is

    A process of extracting previously unknown, valid, and

    actionable knowledge from large databases

    A rule of thumb:

    If we know clearly the shape and likely content of what

    we are looking for, we are probably not dealing with

    data mining

    A Brief Review of Data Mining (II)

    Therefore, data mining is not

    SQL queries against any number of disparate databases or data

    warehouses

    SQL queries in a parallel or massively parallel environment

    Information retrieval, for example, through intelligent agents

    Multidimensional database analysis (MDA)

    OLAP

    Exploratory data analysis (EDA)

    Graphical visualization

    Traditional statistical processing against a data warehouse

    However, they are all related to data mining

    Data Mining Process

    1. Business objective(s) determination

    What is your goal?

    2. Data collection

    You can learn nothing without data

    3. Data preprocessing (or Data preparation)

    Remove outliers / filter noise / modify fields / etc.

    4. Modeling

    The core part of data mining

    5. Evaluation

    See what you have learned!

    Data Mining Software

    Existing Data mining software:

    Clementine from SPSS (we have this software),

    Enterprise Miner from SAS (we have this software),

    Intelligent Miner from IBM (we have this software),

    MineSet from Silicon Graphics,

    K-wiz from Compression Sciences Ltd.,

    DBMiner from DBMiner Tech. Inc.,

    PolyAnalyst from Megaputer Intelligence,

    StatServer from Mathsoft

    :

    :

    Problem Statement

    Situation:

    You are a researcher compiling data for a medical

    study

    You have collected data about a set of patients, all of

    whom suffered from the same illness

    Each patient responded to one of five drug treatments

    Step 1: Business objective

    Figure out which drug might be appropriate for a

    future patient with the same illness

    Here are the data collected:

    Age

    Sex (M or F)

    BP (Blood pressure: High, normal, or low)

    Weight (The weight of the patient)

    Cholesterol (Blood cholesterol: Normal or high)

    Na (Blood sodium concentration)

    K (Blood potassium concentration)

    Drug (Drug to which the patient responded)
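
    As a minimal sketch, one patient record with the fields listed above could be represented as follows. The class name PatientRecord, the value encodings, and the sample values are illustrative assumptions; the tutorial does not state units or drug labels.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """One row of the collected data (field names follow the slide)."""
    age: int
    sex: str          # "M" or "F"
    bp: str           # blood pressure: "HIGH", "NORMAL", or "LOW" (assumed encoding)
    weight: float     # unit not stated in the tutorial
    cholesterol: str  # "NORMAL" or "HIGH" (assumed encoding)
    na: float         # blood sodium concentration
    k: float          # blood potassium concentration
    drug: str         # drug to which the patient responded (the target)


# Hypothetical sample values, for illustration only.
example = PatientRecord(age=47, sex="M", bp="LOW", weight=70.0,
                        cholesterol="HIGH", na=0.74, k=0.056, drug="drugC")
print(example)
```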

    Using Clementine (1)

    Clementine is located in

    Start > All Programs > Clementine 6.0.2

    Models

    Nodes

    Work-Space

    Using Clementine (2)

    Nodes in the workspace represent different objects

    and actions. You connect the nodes to form

    streams, which, when executed, let you visualize

    relationships and draw conclusions.

    Step 2: Data Collection (1)

    Double Click

    Nodes for inputting

    the collected data

    Data Collection (2)

    Location of your file

    How many columns to use from the file

    Whether the first row specifies the names of the fields

    Other details
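
    Outside Clementine, the same input settings (file location, number of columns, header row) can be sketched with Python's csv module. The function read_records, its parameters, and the sample rows are hypothetical; an in-memory string stands in for the file on disk.

```python
import csv
import io

# Hypothetical file contents; a real run would open a file on disk instead.
SAMPLE = """Age,Sex,BP,Weight,Cholesterol,Na,K,Drug
23,F,HIGH,58.0,HIGH,0.79,0.03,drugY
47,M,LOW,70.0,HIGH,0.74,0.06,drugC
"""


def read_records(source, has_header=True, field_names=None, use_columns=None):
    """Read comma-separated text into a list of dicts (all values stay strings)."""
    rows = list(csv.reader(source))
    if has_header:
        names, rows = rows[0], rows[1:]
    elif field_names is not None:
        names = list(field_names)
    else:
        raise ValueError("field_names is required when there is no header row")
    if use_columns is not None:
        names = names[:use_columns]  # keep only the first N columns
    # zip() stops at the shorter sequence, so extra columns are dropped.
    return [dict(zip(names, row)) for row in rows]


records = read_records(io.StringIO(SAMPLE), has_header=True, use_columns=3)
print(records[0])
```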

    Step 3: Data Preparation - Explore the Data (1)

    Nodes for exploration/visualization:

    Table (in the Output panel)

    Plot (in the Graphs Panel)

    Histogram (in the Graphs Panel)

    Distribution (in the Graphs Panel)

    Web (in the Graphs Panel)
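
    As a rough text-only analogue of what a Distribution graph shows (how often each value of a symbolic field occurs), the counts can be sketched as below. The drug labels are hypothetical sample values.

```python
from collections import Counter

# Hypothetical values of the Drug field, for illustration only.
drugs = ["drugY", "drugC", "drugY", "drugX", "drugY", "drugA"]

counts = Counter(drugs)

# Print one bar per distinct value, most frequent first.
for value, n in counts.most_common():
    print(f"{value:<6} {'#' * n} {n}")
```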

    Step 3: Data Preparation - Explore the Data (2)

    Note:

    Connect the nodes by clicking and dragging with the middle mouse button

    Double Click

    Connect the nodes:

    Step 3: Data Preparation - Explore the Data (3)

    Execution:

    Note:

    Right click on the table node

    to display this menu

    Step 3: Data Preparation - Explore the Data (4)

    Other nodes (Please try the other nodes yourself):

    Histogram:

    Step 3: Data Preparation - Modify the Data (1)

    Replacing values:

    Use Filler node:

    Suppose we want to transform all weights to their log values (Note:

    we usually only transform a variable to its log when it is highly

    skewed):
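
    The replacement itself is a one-line transformation. The sketch below assumes the natural logarithm (the slide does not state the base) and uses hypothetical weights.

```python
import math

# Hypothetical weights; all must be positive for the log to be defined.
weights = [52.0, 61.5, 70.0, 88.2, 140.0]

# Replace each value with its natural log (the base is an assumption).
log_weights = [math.log(w) for w in weights]

for original, transformed in zip(weights, log_weights):
    print(f"{original:>6.1f} -> {transformed:.3f}")
```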

    Step 3: Data Preparation - Modify the Data (2)

    Derive a new value:

    Use Derive node:

    Suppose we want to combine Na and K:
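
    A minimal sketch of deriving a field, assuming the combination is the ratio Na / K (the field name Na_Over_K used in this tutorial suggests this). The function name and sample values are hypothetical.

```python
# Hypothetical records with sodium and potassium concentrations.
records = [
    {"Na": 0.79, "K": 0.03},
    {"Na": 0.74, "K": 0.06},
]


def derive_na_over_k(record):
    """Return a copy of the record with an added Na_Over_K field."""
    derived = dict(record)  # leave the input record unchanged
    derived["Na_Over_K"] = record["Na"] / record["K"]
    return derived


derived_records = [derive_na_over_k(r) for r in records]
print(derived_records[0])
```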

    Step 3: Data Preparation - Modify the Data (3)

    Remove some fields

    Use Filter node

    Suppose we have derived a new field Na_Over_K; we now

    need to remove the fields Na and K:
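
    Removing fields amounts to keeping every field except the named ones. The helper drop_fields and the sample record below are hypothetical.

```python
def drop_fields(record, to_drop):
    """Return a copy of the record without the fields named in to_drop."""
    return {name: value for name, value in record.items() if name not in to_drop}


# Hypothetical record after the Na_Over_K field has been derived.
record = {"Age": 23, "Na": 0.79, "K": 0.03, "Na_Over_K": 26.3, "Drug": "drugY"}

filtered = drop_fields(record, {"Na", "K"})
print(filtered)
```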

    Step 4: Modeling - Define fields

    Define the fields

    Use Type node:

    Step 4: Modeling - Build a Model (1)

    It is the core part of data mining.

    Supervised Learning:

    Train Net (Neural Network)

    C5.0 (C5.0 Decision Tree)

    Linear Reg. (Linear regression)

    C & R Tree (Classification and Regression Tree, CART)

    Unsupervised Learning:

    Train Kohonen (Self-Organizing Map, SOM)

    Train KMeans (K-means Clustering)

    TwoStep (A kind of Hierarchical Clustering)

    Others:

    GRI (Association Rule mining)

    Apriori (Association Rule mining)

    Factor / PCA (Factor analysis, attribute selection technique)

    Step 4: Modeling - Build a Model (2)

    Build what model?

    Recall that our objective is to determine which drug is

    suitable for a specific patient.

    Thus, it is a classification problem (supervised learning)

    In this tutorial, we use:

    C5.0 and C & R Tree
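
    Decision tree learners repeatedly pick the field that best separates the classes. The sketch below illustrates one common criterion, entropy-based information gain, on hypothetical rows. It is a generic illustration, not the C5.0 or C & R Tree implementation used by Clementine.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, field, target):
    """Reduction in target entropy after splitting rows on a symbolic field."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical rows: BP separates the drugs perfectly, Sex not at all.
rows = [
    {"BP": "HIGH", "Sex": "M", "Drug": "drugA"},
    {"BP": "HIGH", "Sex": "F", "Drug": "drugA"},
    {"BP": "LOW", "Sex": "M", "Drug": "drugC"},
    {"BP": "LOW", "Sex": "F", "Drug": "drugC"},
]

gain_bp = information_gain(rows, "BP", "Drug")
gain_sex = information_gain(rows, "Sex", "Drug")
print(gain_bp, gain_sex)
```

    A tree builder would split on BP first here, because it has the larger gain.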

    Step 4: Modeling - Build a Model (3)

    Note:

    There are many complex settings for each model

    In this tutorial, we use the default settings

    Fine-tuning a model requires solid experience in data mining

    Step 5: Evaluation (1)

    Having learned SOMETHING means NOTHING unless

    the knowledge that we have learned is

    ACTIONABLE and VALID

    Remember: The training and testing data sets are ALWAYS

    different (why?)
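
    One way to explore this question is with a deliberately naive model that memorizes its training rows. The sketch below uses hypothetical data and helper names; it scores such a model on the rows it was trained on and on unseen rows.

```python
from collections import Counter


def fit_memorizer(rows):
    """'Train' by memorizing (BP, Cholesterol) -> Drug, plus a majority fallback."""
    table = {(r["BP"], r["Cholesterol"]): r["Drug"] for r in rows}
    fallback = Counter(r["Drug"] for r in rows).most_common(1)[0][0]
    return table, fallback


def predict(model, row):
    table, fallback = model
    return table.get((row["BP"], row["Cholesterol"]), fallback)


def accuracy(model, rows):
    """Fraction of rows whose predicted drug matches the recorded drug."""
    return sum(predict(model, r) == r["Drug"] for r in rows) / len(rows)


# Hypothetical training and testing rows.
train_rows = [
    {"BP": "HIGH", "Cholesterol": "HIGH", "Drug": "drugA"},
    {"BP": "LOW", "Cholesterol": "HIGH", "Drug": "drugC"},
    {"BP": "NORMAL", "Cholesterol": "HIGH", "Drug": "drugX"},
    {"BP": "HIGH", "Cholesterol": "NORMAL", "Drug": "drugA"},
]
test_rows = [
    {"BP": "LOW", "Cholesterol": "NORMAL", "Drug": "drugX"},  # combination not seen in training
    {"BP": "HIGH", "Cholesterol": "HIGH", "Drug": "drugA"},
]

model = fit_memorizer(train_rows)
print("training accuracy:", accuracy(model, train_rows))
print("testing accuracy: ", accuracy(model, test_rows))
```

    The score on the training rows is perfect by construction, so only the score on the separate testing rows says anything about future patients.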

    Step 5: Evaluation (2)

    Create the following flow

    Note:

    The flow must be the same

    as in the training stage
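
    In code terms, "the same flow" means the testing data passes through exactly the same preparation steps as the training data. A minimal sketch, assuming the preparation steps used earlier in this tutorial (log of Weight, Na_Over_K derived, Na and K removed); the function name and sample rows are hypothetical.

```python
import math


def prepare(record):
    """Apply the same preparation steps to any record, training or testing."""
    prepared = {k: v for k, v in record.items() if k not in {"Na", "K"}}
    prepared["Na_Over_K"] = record["Na"] / record["K"]
    prepared["Weight"] = math.log(record["Weight"])  # natural log is an assumption
    return prepared


# Hypothetical rows, for illustration only.
train_rows = [{"Weight": 58.0, "Na": 0.79, "K": 0.03, "Drug": "drugY"}]
test_rows = [{"Weight": 70.0, "Na": 0.74, "K": 0.06, "Drug": "drugC"}]

# Reusing one function guarantees both sets end up with identical fields.
train_prepared = [prepare(r) for r in train_rows]
test_prepared = [prepare(r) for r in test_rows]
print(sorted(train_prepared[0]), sorted(test_prepared[0]))
```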

    Step 5: Evaluation (3)

    Different results:

    Different models can yield completely different results

    Choosing and tuning a good model is a difficult job

    In this tutorial, we only introduce the process of data

    mining

    Assignment 1

    Assignment 1 - Field definitions

    VARIABLE  ROLE    DEFINITION  DESCRIPTION
    CHECKING  input   Nominal     Checking account status
    HISTORY   input   Nominal     Credit history
    AMOUNT    input   Interval    Amount in Bank
    SAVINGS   input   Nominal     No. of Savings (bonds, stocks, etc)
    EMPLOYED  input   Nominal     Employment Type (Gov., private, etc)
    INSTALLP  input   Nominal     Type of installment rate
    MARITAL   input   Nominal     Marital status
    PROPERTY  input   Nominal     Type of Property
    AGE       input   Interval    Age in years
    OTHER     input   Nominal     Type of other installment plan
    HOUSING   input   Nominal     Type of House
    EXISTCR   input   Interval    Number of existing credits
    JOB       input   Nominal     Job Nature
    FOREIGN   input   Binary      Foreign worker or Local worker
    GOOD_BAD  Output  Binary      Good or bad credit rating

    Assignment 1 - Data Mining Process

    Data Collection

    Please download the CreditRisk data set from

    http://www.se.cuhk.edu.hk/~ect7470/

    Two data sets:

    (i) creditRisk1.csv is for training

    (ii) creditRisk2.csv is for testing

    Data Preprocessing

    Please explore the data and think critically about whether any

    data preprocessing is necessary

    Hint: Two of the interval variables are highly skewed
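
    Besides inspecting a histogram, skewness can be checked numerically. The sketch below computes the moment-based sample skewness on hypothetical values; it is not tied to any field of the CreditRisk data.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 50]  # one large value creates a long right tail

print(skewness(symmetric))
print(skewness(right_skewed))
```

    A value near zero indicates a roughly symmetric field; a large positive value indicates a long right tail, which is the case where a log transform is usually considered.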

    Assignment 1 - Data Mining Process

    Modeling

    Please build a prediction model using the default settings:

    C5.0 Decision Tree

    Model Assessment

    Please use the testing data set to evaluate the

    performance of the prediction model
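
    A minimal sketch of what such an assessment measures: compare the predicted and actual values of the output field on the testing records. The label values "good" and "bad" and the two lists are hypothetical; in the assignment they would come from the testing data set and the trained model.

```python
from collections import Counter

# Hypothetical actual and predicted values of the output field.
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "bad", "good", "good"]

# Count each (actual, predicted) pair: a confusion matrix in dictionary form.
confusion = Counter(zip(actual, predicted))

# Accuracy is the fraction of records predicted correctly.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

for (a, p), n in sorted(confusion.items()):
    print(f"actual={a:<4} predicted={p:<4} count={n}")
print("accuracy:", accuracy)
```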

    Assignment 1 - Submission

    Save the stream as id.str

    E.g., 00123456.str

    Upload your stream to the course account

    Deadline: 4 April 2004

    This is an individual assignment

    Note: We strongly encourage you to submit this assignment

    during the class!!!