Data Mining and WEKA

Fabiano Dalpiaz

Dipartimento di Ingegneria dei Sistemi e dell’Informazione

Università di Trento - Italy

http://www.disi.unitn.it/~dalpiaz

Database e Business Intelligence

A.A. 2009-2010

© P. Giorgini, F. Dalpiaz

Acknowledgements

This presentation is partially based on the slides for the book:

Data Mining: Concepts and Techniques, 2nd ed., Jiawei Han and Micheline Kamber


Outline

1. Data Mining and KDD

2. Applied Data Mining

3. WEKA: A tool for Data Mining

4. German credit: a case study

5. Data Preprocessing

6. Data Mining techniques

7. Summary


1. Data Mining and KDD


Looking for knowledge

The Explosive Growth of Data

The World Wide Web

Business: e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation

Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co

We are drowning in data, but starving for knowledge!

Avoid data tombs

Data mining: “Automated analysis of massive data sets”.


What is Data Mining?

Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Alternative names
  Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, etc.

Questions: Are simple search engines data mining? Are queries data mining?


Knowledge Discovery (KDD) Process

Data sources → Data Cleaning, Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Data Mining and Business Intelligence

Layers, from the top (highest potential support to business decisions) to the bottom (largest quantity of data):
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA


Data Mining: confluence of multiple disciplines

Data Mining draws on: Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, Other Disciplines


Why Data Mining?

Tremendous amount of data
  Walmart – Customer buying patterns – a data warehouse of 7.5 terabytes in 1995
  VISA – Detecting credit card interoperability issues – 6800 payment transactions per second

High dimensionality of data
  Many dimensions to be combined together
  Data cube example: time, location, product sales

High complexity of data
  Time-series data, temporal data, sequence data
  Spatial, spatiotemporal, multimedia, text and Web data


2. Applied Data Mining


Market Analysis and Management

Data sources: credit card transactions, loyalty cards, smart cards, discount coupons, ...

Target marketing
  Find clusters of “model” customers who share the same characteristics:
  • Geographics (lives in Rome, lives in Trentino)
  • Demographics (married, between 21-35, at least one child, family income more than 40.000€/year)
  • Psychographics (likes new products, consistently uses the Web)
  • Behaviors (searches for information on the Internet, always defends her decisions)
  Determine customer purchasing patterns over time


Market Analysis and Management

Cross-market analysis
  Find associations between product sales, and predict based on such associations
  Compare the sales in the US and in Italy, find associations among old products, and predict whether new ones will succeed

Customer profiling
  What types of customers buy what products
  Customers aged 20-30 with income > 20K€ will buy product A

Customer requirement analysis
  Identify the best products for different groups of customers
  Predict what factors will attract new customers


Corporate Analysis

Finance Planning and Asset Evaluation
  Cash flow prediction and analysis
  Cross-sectional and time-series analysis (financial ratio, trend analysis)

Resource Planning
  Summarize and compare the resources and spending

Competition
  Monitor competitors and market directions
  Group customers into classes and define a class-based pricing procedure
  Set the pricing strategy in a highly competitive market


3. WEKA: a tool for Data Mining


What is WEKA?

WEKA = Waikato Environment for Knowledge Analysis
  University of Waikato, New Zealand
  Companion software to the book “Data Mining” by Witten & Frank

Main features:
  Complete set of tools for data pre-processing, learning, and evaluation
  Graphical user interfaces
  Environment to compare different algorithms

http://www.cs.waikato.ac.nz/ml/weka/


WEKA supported file formats
  ARFF is WEKA's native format
  CSV, C4.5, binary
  Pre-processing might be required

Data can also be read
  From URLs
  By connecting to an SQL database (via JDBC)


4. German credit: a case study


A real case study: German credit

Go to: http://disi.unitn.it/~dalpiaz

2 versions: noisy and clean

1000 instances, 20 attributes, class: approved vs. not approved

@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
...
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female
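To make the ARFF layout above concrete, here is a minimal sketch of a header reader in Python. It is not WEKA's own parser: the function name `parse_arff_header` is hypothetical, and it assumes attribute names without spaces and nominal values that are comma-separated and optionally single-quoted, as in the excerpt.

```python
import csv

def parse_arff_header(text):
    """Return (relation_name, [(attribute_name, type_or_nominal_values), ...])."""
    relation, attributes = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):      # '%' starts an ARFF comment
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip()
        elif keyword == "@attribute":
            # Simplification: attribute names are assumed to contain no spaces.
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                inner = spec.strip("{} ")
                # Nominal values: comma-separated, optionally single-quoted.
                values = next(csv.reader([inner], quotechar="'",
                                         skipinitialspace=True))
                attributes.append((name, [v.strip() for v in values]))
            else:
                attributes.append((name, spec.strip().lower()))
        elif keyword == "@data":
            break                                  # the header ends here
    return relation, attributes


header = """@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@data
"""
relation, attributes = parse_arff_header(header)
```

Each nominal attribute comes back as a list of its allowed values, each numeric one as its type name, which is enough to check a file before loading it into the Explorer.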


German credit - attributes

A1: Status of existing checking account
  < 0 DM
  Between 0 and 200 DM
  > 200 DM
  no checking account

A2: Duration of the requested credit (in months)

A3: Credit history
  no credits taken / all credits paid back duly
  all credits at this bank paid back duly
  existing credits paid back duly till now
  delay in paying off in the past
  critical account / other credits existing (not at this bank)

A4: Purpose
  car (new), car (used), furniture/equipment, radio/television, domestic appliances, repairs, education, vacation, retraining, business, other


German credit - attributes

A5: Credit amount

A6: Savings account/bonds
  X < 100 DM
  100 <= X < 500 DM
  500 <= X < 1000 DM
  X >= 1000 DM
  Unknown / no savings account

A7: Present employment since
  unemployed
  X < 1 year
  1 <= X < 4 years
  4 <= X < 7 years
  X >= 7 years

A8: Installment rate in percentage of disposable income

A9: Personal status and sex
  male: divorced/separated
  female: divorced/separated/married
  male: single
  male: married/widowed
  female: single

A10: Other debtors / guarantors
  None
  Co-applicant
  Guarantor

A11: Present residence since

A12: Property (the most relevant)
  Real estate
  Building society savings agreement / life insurance
  Car or other
  No property

A13: Age

A14: Other installment plans
  Bank, Stores, None

A15: Housing
  Rent, Own, For free

A16: Number of existing credits at this bank

A17: Job
  unemployed / unskilled - non-resident
  unskilled - resident
  skilled employee / official
  management / self-employed / highly qualified employee / officer

A18: Number of people being liable to provide maintenance for

A19: Telephone
  None, Yes

A20: Foreign worker
  Yes, No

When you find this symbol, use WEKA!

HANDS ON


So far: a simple KDD cycle

Raw data → Pre-processing → Pre-processed data → Data Mining → Information

Tool support: e.g., WEKA


5. Data Preprocessing

HANDS ON: open the German credit data set and visualize it with the Explorer


Why Data Preprocessing?

Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“ ”, birthdate=“31/12/2099”
  noisy: containing errors or outliers
  • e.g., Salary=“-10”
  inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42” Birthday=“03/07/1997” (we are in 2009!!)
  • e.g., Was rating “1,2,3”, now rating “A, B, C”
  • e.g., discrepancy between duplicate records. In one copy of the data customer A has to pay 200.000€, in the second copy of the data A does not have to pay anything.


Why is data dirty?

Incomplete data may come from
  “Not applicable” data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems

Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission

Inconsistent data may come from
  Different data sources
  Functional dependency violation (e.g., modify some linked data)


Data cleaning – missing values

“Data cleaning is one of the three biggest problems in data warehousing” - Ralph Kimball

Fill in missing values
  Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
  Ignore the record (is it always feasible?)
  Fill in the missing attributes manually
  Automatically insert a constant
  Automatically insert the mean value (relative to the record class)
  Most probable value: inference!
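The option "insert the mean value relative to the record class" can be sketched as follows. This is a minimal illustration, not WEKA code: the function name, the sample records and the choice of `occupation` as the class are all hypothetical.

```python
from collections import defaultdict

def impute_class_mean(records, attribute, class_attribute):
    """Fill missing (None) values with the mean of the records in the same class."""
    sums, counts = defaultdict(float), defaultdict(int)
    for record in records:
        if record[attribute] is not None:
            sums[record[class_attribute]] += record[attribute]
            counts[record[class_attribute]] += 1
    filled = []
    for record in records:
        record = dict(record)                      # do not modify the input
        label = record[class_attribute]
        if record[attribute] is None and counts[label]:
            record[attribute] = sums[label] / counts[label]
        filled.append(record)
    return filled


# Hypothetical records; John's salary is missing.
people = [
    {"name": "John", "occupation": "Lawyer", "salary": None},
    {"name": "Ann", "occupation": "Lawyer", "salary": 60000.0},
    {"name": "Bob", "occupation": "Lawyer", "salary": 80000.0},
    {"name": "Eve", "occupation": "Clerk", "salary": 30000.0},
]
filled = impute_class_mean(people, "salary", "occupation")
```

John receives the mean salary of the other lawyers, not the mean over all records. A class with no known value is left untouched.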


Missing values in WEKA

Manually insert missing values (“Edit” button)

HANDS ON: Attribute “Purpose”


Missing values in WEKA

ReplaceMissingValues filter: mean / mode value

HANDS ON: Attribute “Purpose”
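The mean / mode idea behind the filter can be sketched in a few lines of Python. This is an illustration of the principle only, not the WEKA implementation. The function name `replace_missing` is hypothetical, and the column is assumed to contain at least one known value.

```python
from collections import Counter
from statistics import mean

def replace_missing(column):
    """Replace None with the mean (numeric column) or the mode (nominal column)."""
    present = [value for value in column if value is not None]
    if all(isinstance(value, (int, float)) for value in present):
        fill = mean(present)                           # numeric: mean
    else:
        fill = Counter(present).most_common(1)[0][0]   # nominal: most frequent value
    return [fill if value is None else value for value in column]


purpose = replace_missing(["radio/tv", None, "new car", "radio/tv"])
duration = replace_missing([6.0, None, 48.0])
```

For a nominal attribute such as “Purpose” the missing cell receives the most frequent value. For a numeric one it receives the average of the known values.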


Data Integration

Data Integration combines data from multiple sources into a coherent store

Schema integration
  Integrate metadata from different sources
  A.cust-id ≡ B.cust-number

Entity identification problem
  Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources are different (e.g., cm vs. inch)

[Figure: sources D1, D2, D3 merged into a single store D1,2,3]


Data Integration

Data integration can lead to redundant attributes
  Same object (A.house = B.residence)
  Derived attributes (A.annualIncome = B.salary + C.rentalIncome)

Redundant attributes can be discovered via correlation analysis
  A mathematical method for detecting the correlation between two attributes
  Correlation coefficient (Pearson’s product moment coefficient): the larger its absolute value, the stronger the correlation between attributes
  χ² (chi-square) test
  No details on these methods here
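Although the slides skip the details, Pearson's coefficient is short enough to sketch. The function below is a plain-Python illustration with hypothetical sample values. It assumes two numeric columns of equal length, neither of them constant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's product moment correlation coefficient, in [-1, +1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical attributes: the second is a rescaled copy of the first,
# so one of the two is redundant.
monthly_salary = [1000.0, 2000.0, 3000.0, 4000.0]
annual_income = [12000.0, 24000.0, 36000.0, 48000.0]
r = pearson(monthly_salary, annual_income)
```

A value close to +1 or -1 suggests that one attribute can be derived from the other. A value close to 0 suggests no linear relationship.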


Data transformation

Aggregation
  Sum the sales of different branches (in different data sources) to compute the company sales

Generalization
  From the integer attribute age to classes of age (children, adult, old)

Normalization: values are scaled to fall within a small, specified range
  Change the range from (-∞, +∞) to [-1, +1]
  {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
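The normalization example can be reproduced by dividing every value by the largest absolute value, which maps the data into [-1, +1]. The sketch below shows that reading of the example, plus min-max scaling to [0, 1] for comparison. Both function names are hypothetical.

```python
def max_abs_scale(values):
    """Scale into [-1, +1] by dividing by the largest absolute value."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

def min_max_scale(values):
    """Scale into [0, 1]: the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


data = [-13, -6, -3, 10, 100]
scaled = max_abs_scale(data)       # [-0.13, -0.06, -0.03, 0.1, 1.0]
unit_range = min_max_scale(data)   # first element 0.0, last element 1.0
```

Both functions assume that the values are not all identical (and, for `max_abs_scale`, not all zero). Otherwise the divisor would be 0.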


Data transformation in WEKA

Generalization
  Discretize (filters -> unsupervised -> attribute)

Normalization
  Normalize (filters -> unsupervised -> attribute), scales values to [0,1]

HANDS ON: Attribute “Age”


Data reduction

Data reduction
  Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
  Different reduction types (dimensions, numerosity, discretization)

Dimensionality: Attribute subset selection
  Example with a decision tree (left branches True, right False)
  Initial attribute set: {A1, A2, A3, A4, A5, A6}
  [Figure: decision tree with tests A4?, A1?, A6? and leaves Class 1 / Class 2]
  Reduced attribute set: {A1, A4, A6}

Data reduction

Dimensionality: Principal Components Analysis
  Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  Works for numeric data only
  Used when the number of dimensions is large
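As a rough illustration of what "finding a principal component" means, the sketch below computes only the first component (k = 1) by power iteration on the covariance matrix. This is a simplification chosen to keep the example dependency-free, not the procedure WEKA uses. The function name and the sample points are hypothetical. The data must not be constant, and the all-ones starting vector must not be orthogonal to the component.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of largest variance (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vector = [1.0] * d                      # arbitrary starting direction
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector


# Hypothetical 2-dimensional data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component = first_principal_component(points)
```

For these points the component is proportional to (1, 2), so a single coordinate along that direction represents the two original attributes without loss.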


Attribute subset selection in WEKA

HANDS ON: Play with the tab “Select attributes”



Data reduction

Numerosity: Clustering
  Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
  [Figure: 2 clusters; sparse data leads to many clusters – not effective]


Clustering in WEKA

HANDS ON: Use the “Cluster” tab


Data reduction

Numerosity: Sampling
  Obtain a small sample s to represent the whole data set N
  Problem: how to select a representative sample
  Random sampling is not enough – representative samples should be preserved
  Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
  [Figure: random sampling (“No samples from here”) vs. stratified sampling]
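Stratified sampling can be sketched as "sample each class separately, with the same fraction". The function name, the toy records and the 70/30 class split below are hypothetical. A fixed seed is used only to make the sketch reproducible.

```python
import random
from collections import defaultdict

def stratified_sample(records, class_of, fraction, seed=0):
    """Draw the same fraction from every class, so proportions are preserved."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[class_of(record)].append(record)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))   # at least one per class
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical data set: 70 "good" and 30 "bad" records.
records = [("good", i) for i in range(70)] + [("bad", i) for i in range(30)]
sample = stratified_sample(records, class_of=lambda r: r[0], fraction=0.1)
```

Plain random sampling could by chance return very few "bad" records. Here the 10% sample contains 7 "good" and 3 "bad" records whatever the seed.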

Sampling in WEKA

Random sampling

Stratified sampling

HANDS ON: Reduce the numerosity of the data set



Discretization

Three types of attributes
  Nominal: values from an unordered set (color, profession)
  Ordinal: values from an ordered set (military or academic rank)
  Continuous: numbers (integer or real numbers)

Discretization
  Divide the range of a continuous attribute into intervals
  Reduces data size and its complexity
  Some data mining algorithms do not support continuous types, and in those cases discretization is mandatory
  WEKA: we already introduced the Discretize filter
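One simple way to divide a range into intervals is equal-width binning, sketched below. The function name, the sample ages and the choice of three bins are hypothetical. Other strategies, such as equal-frequency binning, are not shown.

```python
def equal_width_bins(values, bins):
    """Map each value to the index (0 .. bins-1) of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    indexes = []
    for value in values:
        index = int((value - low) / width) if width else 0
        indexes.append(min(index, bins - 1))    # the maximum falls in the last bin
    return indexes


ages = [19, 25, 33, 47, 58, 67, 75]
age_classes = equal_width_bins(ages, bins=3)
```

With three bins the continuous attribute becomes a three-valued ordinal one, which algorithms without numeric support can handle.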


6. Data Mining techniques

Frequent pattern analysis

What is it?
  Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  Frequent pattern analysis: searching for frequent patterns

Motivation: finding inherent regularities in data
  • Which products are bought together? Yesterday’s wine and spaghetti example
  • What are the subsequent purchases after buying a PC?
  • Can we automatically classify web documents?

Applications
  • Basket data analysis
  • Cross-marketing
  • Catalog design
  • Sale campaign analysis


Association rules (theory)

Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
4                Bread, Cheese, Sugar
5                Bread, Cocoa, Spaghetti, Cheese, Sugar

Itemsets (= transactions in this example)

Goal: find all rules of type X → Y between items in an itemset with minimum:
  Support s: probability that an itemset contains X ∪ Y
  Confidence c: conditional probability that an itemset containing X also contains Y

Association rules:
  Wine → Spaghetti (support=60%, confidence=100%)
  Spaghetti → Wine (support=60%, confidence=75%)
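The support and confidence figures can be checked directly on the five transactions. The sketch below only evaluates given rules. It does not search for them, which is the job of an algorithm such as Apriori. The helper names are hypothetical.

```python
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= transaction for transaction in transactions)

def support(x, y):
    """Fraction of transactions containing both X and Y."""
    return count(x | y) / len(transactions)

def confidence(x, y):
    """Among the transactions containing X, the fraction that also contain Y."""
    return count(x | y) / count(x)


wine, spaghetti = {"Wine"}, {"Spaghetti"}
wine_to_spaghetti = (support(wine, spaghetti), confidence(wine, spaghetti))
spaghetti_to_wine = (support(spaghetti, wine), confidence(spaghetti, wine))
```

Support is symmetric (3 of 5 transactions in both directions). Confidence is not: wine always comes with spaghetti, while only 3 of the 4 spaghetti transactions include wine.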

Association rules in WEKA

Apriori algorithm
  Does not work with numeric attributes

HANDS ON: Use Apriori in the “Associate” tab


Classification and Prediction

Classification
  Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  The characterization is a model
  The model can be applied to classify new data (predict the class they should belong to)
  Meta-algorithms are used to enhance results (e.g., cost matrix)

Prediction
  Models continuous-valued functions, i.e., predicts unknown or missing values


Classification: model construction

Training Data → Classification Algorithms → Classifier (Model)

Training Data
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)
  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’


Classification: model usage

Classifier
  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Testing Data
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data
  (Jeff, Professor, 4) → Tenured?
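Model usage can be sketched by coding the rule as a function, measuring its accuracy on the testing data, and then applying it to the unseen record. The function name `predict_tenured` is hypothetical. The data is the table from the slide.

```python
def predict_tenured(rank, years):
    """The learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy: fraction of test records whose predicted label matches the known one.
correct = sum(predict_tenured(rank, years) == tenured
              for _, rank, years, tenured in testing_data)
accuracy = correct / len(testing_data)

# Unseen data: the label is not known, the model predicts it.
jeff = predict_tenured("Professor", 4)
```

The rule misclassifies Merlisa (7 years but not tenured), so the accuracy on the testing data is 3 out of 4. For Jeff the model answers "yes".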


Decision Trees

[Figure: decision tree for the investment type choice. Tests: Income > 20K€, Age > 60, Married? (branches yes/no); leaves: Low risk, Mid risk, High risk]


Decision Trees

How are the attributes in decision trees selected?
  Two well-known indexes are used
  • Information gain selects the attribute that is most informative in distinguishing the items between the classes
  • It is biased towards attributes with a large set of values
  • Gain ratio addresses the limitations of information gain
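The bias of information gain can be seen on the tenure training data from the classification example: an attribute with a distinct value per record (NAME) obtains the maximum gain although it is useless for prediction. The sketch below uses the standard entropy-based definitions. The function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, label):
    """Entropy of the class minus the weighted entropy after the split."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[label] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[label] for row in rows]) - remainder

def gain_ratio(rows, attribute, label):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy([row[attribute] for row in rows])
    return information_gain(rows, attribute, label) / split_info if split_info else 0.0


training = [
    {"name": "Mike", "rank": "Assistant Prof", "tenured": "no"},
    {"name": "Mary", "rank": "Assistant Prof", "tenured": "yes"},
    {"name": "Bill", "rank": "Professor", "tenured": "yes"},
    {"name": "Jim", "rank": "Associate Prof", "tenured": "yes"},
    {"name": "Dave", "rank": "Assistant Prof", "tenured": "no"},
    {"name": "Anne", "rank": "Associate Prof", "tenured": "no"},
]
gain_name = information_gain(training, "name", "tenured")
gain_rank = information_gain(training, "rank", "tenured")
ratio_name = gain_ratio(training, "name", "tenured")
```

Splitting on NAME yields six pure one-record subsets, hence the maximum gain of 1 bit. The gain ratio divides this by the split information of six distinct values, which reduces the score considerably.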


Decision Trees in WEKA

HANDS ON: Use ADTree and see results

HANDS ON: Use J48 and apply a cost matrix

Cost matrix (Model vs. Truth)
        Good   Bad
Good    0      1
Bad     5      0
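How a cost matrix changes a decision can be sketched as follows. The orientation is an assumption: the value 5 is read as the cost of predicting Good for a customer who is actually Bad, and 1 as the cost of the opposite mistake. The function name and the probabilities are hypothetical.

```python
# COST[(actual, predicted)]; the orientation is an assumption (see above).
COST = {
    ("good", "good"): 0, ("good", "bad"): 1,
    ("bad", "good"): 5, ("bad", "bad"): 0,
}

def min_cost_class(probabilities):
    """Pick the prediction with the lowest expected cost.

    `probabilities` maps each actual class to the probability estimated by a model.
    """
    def expected_cost(predicted):
        return sum(p * COST[(actual, predicted)]
                   for actual, p in probabilities.items())
    return min(("good", "bad"), key=expected_cost)


likely_good = min_cost_class({"good": 0.8, "bad": 0.2})
very_likely_good = min_cost_class({"good": 0.9, "bad": 0.1})
```

Without costs, an 80% probability of "good" would be classified as good. With this matrix, approving costs 0.2 × 5 = 1.0 on average against 0.8 × 1 = 0.8 for rejecting, so the cheaper prediction is "bad". At 90% the expected costs are 0.5 against 0.9, and approval becomes the cheaper choice.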


Bayesian classifiers

Bayesian classification
  A statistical classification technique
  • Predicts class membership probabilities
  Founded on the Bayes theorem: $P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$
  • What if X = “Red and rounded” and H = “Apple”?
  Performance
  • The simplest implementation (Naïve Bayes) performs comparably to decision trees and neural networks
  Incremental
  • Each training example can increase/decrease the probability that a hypothesis is correct

HANDS ON: Use the NaiveBayes algorithm
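The apple question can be worked through numerically. All three probabilities below are invented for illustration. The function name is hypothetical, and P(X) is expanded with the law of total probability.

```python
def posterior(prior_h, p_x_given_h, p_x_given_not_h):
    """Bayes theorem: P(H|X) = P(X|H) P(H) / P(X)."""
    # P(X) = P(X|H) P(H) + P(X|not H) P(not H)
    evidence = p_x_given_h * prior_h + p_x_given_not_h * (1 - prior_h)
    return p_x_given_h * prior_h / evidence


# Hypothetical numbers: 20% of the fruits are apples, 70% of the apples are
# red and rounded, and 10% of the other fruits are red and rounded.
p_apple_given_red_round = posterior(prior_h=0.2,
                                    p_x_given_h=0.7,
                                    p_x_given_not_h=0.1)
```

With these numbers, observing "red and rounded" raises the probability of "apple" from the prior 0.2 to 0.14 / 0.22, about 0.64.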


Classification techniques: Support Vector Machines

One of the most advanced classification techniques
  Left figure: a small margin between the classes is found
  Right figure: the largest margin is found
  Support vector machines (SVMs) are able to identify the margin in the right figure


Classification techniques: SVMs + Kernel Functions

Is data always linearly separable? NO!!!
Solution: SVMs + Kernel Functions

[Figure: “How to split this?”; SVM vs. SVM + Kernel Functions]

HANDS ON: Try WEKA’s SMO algorithm!
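The intuition behind kernel functions can be shown without an SVM: data that no single threshold can split becomes separable after mapping it to a new feature. The sketch below uses an explicit mapping x → x² on hypothetical one-dimensional points. It is neither SMO nor a kernel computation, only an illustration of the feature-space idea.

```python
def separable_by_threshold(points):
    """True if a single threshold on the value splits the two labels."""
    labels_in_order = [label for _, label in sorted(points)]
    changes = sum(a != b for a, b in zip(labels_in_order, labels_in_order[1:]))
    return changes <= 1


# Hypothetical data: the "inner" class sits between two groups of "outer" points.
data = [(-3, "outer"), (-2, "outer"), (-1, "inner"), (0, "inner"),
        (1, "inner"), (2, "outer"), (3, "outer")]

before = separable_by_threshold(data)

# Map each value x to the new feature x * x.
mapped = [(x * x, label) for x, label in data]
after = separable_by_threshold(mapped)
```

In the original space the labels alternate outer, inner, outer, so no threshold works. After the mapping, every "inner" point has x² ≤ 1 and every "outer" point has x² ≥ 4. A kernel function lets an SVM work in such a mapped space without computing the mapping explicitly.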


7. Summary

Why Data Mining?

Data Mining and KDD

Data preprocessing

Classification

Clustering

Application areas

