Data Mining and WEKA
Fabiano Dalpiaz
Dipartimento di Ingegneria dei Sistemi e dell’Informazione
Università di Trento - Italy
http://www.disi.unitn.it/~dalpiaz
Databases and Business Intelligence
Academic Year 2009-2010
© P. Giorgini, F. Dalpiaz
Acknowledgements
This presentation is partially based on the slides for the book:
Data Mining: Concepts and Techniques, 2nd ed., by Jiawei Han and Micheline Kamber
Outline
1. Data Mining and KDD
2. Applied Data Mining
3. WEKA: A tool for Data Mining
4. German credit: a case study
5. Data Preprocessing
6. Data Mining techniques
7. Summary
1. Data Mining and KDD
Looking for knowledge
The explosive growth of data
  The World Wide Web
  Business: e-commerce, transactions, stocks, …
  Science: remote sensing, bioinformatics, scientific simulation
  Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co
We are drowning in data, but starving for knowledge!
  Avoid data tombs
  Data mining: “Automated analysis of massive data sets”.
What is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
  Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, etc.
Questions:
  Are simple search engines data mining?
  Are queries data mining?
Knowledge Discovery (KDD) Process
[Figure: the KDD process. Data sources → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]
Data Mining and Business Intelligence
[Figure: layered pyramid. One axis shows the potential support to business decisions, the other the quantity of data]
Layers, from top to bottom:
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: confluence of multiple disciplines
[Figure: Data Mining at the center, surrounded by the contributing disciplines: Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, Other Disciplines]
Why Data Mining?
Tremendous amount of data
  Walmart: customer buying patterns, a data warehouse 7.5 terabytes large in 1995
  VISA: detecting credit card interoperability issues, 6800 payment transactions per second
High dimensionality of data
  Many dimensions to be combined together
  Data cube example: time, location, product sales
High complexity of data
  Time-series data, temporal data, sequence data
  Spatial, spatiotemporal, multimedia, text and Web data
2. Applied Data Mining
Market Analysis and Management
Data sources:
  credit card transactions, loyalty cards, smart cards, discount coupons, ...
Target marketing
  Find clusters of “model” customers who share the same characteristics:
  • Geographics (lives in Rome, lives in Trentino)
  • Demographics (married, aged 21-35, at least one child, family income above 40,000 €/year)
  • Psychographics (likes new products, consistently uses the Web)
  • Behaviors (searches for information on the Internet, always defends her decisions)
  Determine customer purchasing patterns over time
Market Analysis and Management
Cross-market analysis
  Find associations between product sales, and predict based on such associations
  Compare the sales in the US and in Italy, find associations among old products, and predict whether new ones will succeed
Customer profiling
  What types of customers buy what products
  Customers aged 20-30 with income > 20K€ will buy product A
Customer requirement analysis
  Identify the best products for different groups of customers
  Predict what factors will attract new customers
Corporate Analysis
Finance Planning and Asset Evaluation
  Cash flow prediction and analysis
  Cross-sectional and time-series analysis (financial ratio, trend analysis)
Resource Planning
  Summarize and compare the resources and spending
Competition
  Monitor competitors and market directions
  Group customers into classes and define a class-based pricing procedure
  Set pricing strategy in a highly competitive market
3. WEKA: a tool for Data Mining
What is WEKA?
WEKA = Waikato Environment for Knowledge Analysis
  University of Waikato, New Zealand
Companion software to the book “Data Mining” by Witten & Frank
Main features:
  Complete set of tools for data pre-processing, learning, and evaluation
  Graphical user interfaces
  Environment to compare different algorithms
http://www.cs.waikato.ac.nz/ml/weka/
WEKA: supported file formats
  ARFF is WEKA's own (native) format
  CSV, C4.5, binary
  Pre-processing might be required
Data can also be read
  From URLs
  By connecting to an SQL database (via JDBC)
4. German credit: a case study
A real case study: German credit
Go to: http://disi.unitn.it/~dalpiaz
2 versions: noisy and clean
1000 instances, 20 attributes; class: approved vs. not approved

@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
...
@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female
German credit - attributes
A1: Status of existing checking account
  < 0 DM
  Between 0 and 200 DM
  >= 200 DM
  no checking account
A2: Duration of the requested credit (in months)
A3: Credit history
  no credits taken / all credits paid back duly
  all credits at this bank paid back duly
  existing credits paid back duly till now
  delay in paying off in the past
  critical account / other credits existing (not at this bank)
German credit - attributes
A4: Purpose
  car (new)
  car (used)
  furniture/equipment
  radio/television
  domestic appliances
  repairs
  education
  vacation
  retraining
  business
  other
German credit - attributes
A5: Credit amount
A6: Savings account/bonds
  X < 100 DM
  100 <= X < 500 DM
  500 <= X < 1000 DM
  X >= 1000 DM
  Unknown / no savings account
A7: Present employment since
  unemployed
  X < 1 year
  1 <= X < 4 years
  4 <= X < 7 years
  X >= 7 years
German credit - attributes
A8: Installment rate in percentage of disposable income
A9: Personal status and sex
  male: divorced/separated
  female: divorced/separated/married
  male: single
  male: married/widowed
  female: single
A10: Other debtors / guarantors
  None
  Co-applicant
  Guarantor
German credit - attributes
A11: Present residence since
A12: Property (the most relevant)
  Real estate
  Building society savings agreement / life insurance
  Car or other
  No property
A13: Age
A14: Other installment plans
  Bank
  Stores
  None
German credit - attributes
A15: Housing
  Rent
  Own
  For free
A16: Number of existing credits at this bank
A17: Job
  unemployed / unskilled, non-resident
  unskilled, resident
  skilled employee / official
  management / self-employed / highly qualified employee / officer
German credit - attributes
A18: Number of people being liable to provide maintenance for
A19: Telephone
  None
  Yes
A20: Foreign worker
  Yes
  No
When you find this symbol, use WEKA!
HANDS ON
Tool support, e.g., WEKA
So far: a simple KDD cycle
[Figure: a simple KDD cycle. Raw data → pre-processing → pre-processed data → data mining → information]
5. Data Preprocessing
HANDS ON: open the German credit data set and visualize it with the Explorer
Why Data Preprocessing?
Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“ ”, birthdate=“31/12/2099”
  noisy: containing errors or outliers
  • e.g., Salary=“-10”
  inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42” Birthday=“03/07/1997” (we are in 2009!)
  • e.g., Was rating “1,2,3”, now rating “A, B, C”
  • e.g., discrepancy between duplicate records: in one copy of the data customer A has to pay 200,000 €, in the second copy A does not have to pay anything
Why is data dirty?
Incomplete data may come from
  “Not applicable” data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violation (e.g., modify some linked data)
Data cleaning – missing values
“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
Fill in missing values
  Example: Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
  Ignore the record (is it always feasible?)
  Manually fill in the missing attributes
  Automatically insert a constant
  Automatically insert the mean value (relative to the record class)
  Most probable value: inference!
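The "mean value relative to the record class" option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names and the choice of occupation as the class are all hypothetical, and the sketch is not how WEKA implements its filters.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(records, attribute, label):
    """Return copies of the records where a missing (None) `attribute`
    is replaced by the mean of that attribute within the same class."""
    known = defaultdict(list)
    for r in records:
        if r[attribute] is not None:
            known[r[label]].append(r[attribute])
    # Assumption: every class has at least one known value.
    class_mean = {c: mean(values) for c, values in known.items()}
    return [
        {**r, attribute: class_mean[r[label]]} if r[attribute] is None else dict(r)
        for r in records
    ]


# Hypothetical records; John's salary is missing.
records = [
    {"name": "John", "occupation": "Lawyer", "salary": None},
    {"name": "Ann", "occupation": "Lawyer", "salary": 3000},
    {"name": "Bob", "occupation": "Lawyer", "salary": 5000},
    {"name": "Eve", "occupation": "Clerk", "salary": 2000},
]
filled = fill_with_class_mean(records, "salary", "occupation")
print(filled[0])  # John gets the mean salary of the other lawyers
```

The original list is left untouched, so the filled and the raw versions can be compared afterwards.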
Missing values in WEKA
Manually insert missing values (“Edit” button)
HANDS ON: Attribute “Purpose”
Missing values in WEKA
ReplaceMissingValues filter: mean / mode value
HANDS ON: Attribute “Purpose”
Data Integration
Data integration combines data from multiple sources into a coherent store
[Figure: data sources D1, D2, D3 merged into a single store D1,2,3]
Schema integration
  Integrate metadata from different sources
  A.cust-id ≡ B.cust-number
Entity identification problem
  Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
  For the same real-world entity, attribute values from different sources are different (e.g., cm vs. inch)
Data Integration
Data integration can lead to redundant attributes
  Same object (A.house = B.residence)
  Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
Redundant attributes can be discovered via correlation analysis
  A mathematical method detecting the correlation between two attributes
  Correlation coefficient (Pearson’s product moment coefficient): the larger its absolute value, the stronger the correlation between attributes
  χ² (chi-square) test
  No details on these methods here
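Although the slides skip the details, the Pearson coefficient is short enough to sketch. The example below is a minimal illustration in plain Python; the salary figures and the derived annual income are made-up values chosen so that one attribute is an exact multiple of the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's product moment correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Assumption: neither attribute is constant, otherwise a spread is zero.
    return covariance / (spread_x * spread_y)


# Hypothetical data: annual income is derived from the monthly salary,
# so the two attributes are redundant.
salary = [1500, 2200, 1800, 3100, 2700]
annual_income = [12 * s for s in salary]
print(pearson(salary, annual_income))  # very close to +1
```

A value close to +1 or -1 signals that one of the two attributes can probably be dropped; a value close to 0 signals no linear relationship.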
Data transformation
Aggregation:
  Sum the sales of different branches (in different data sources) to compute the company sales
Generalization:
  From integer attribute age to classes of age (children, adult, old)
Normalization: scaled to fall within a small, specified range
  Change the range from [-∞,+∞] to [-1,+1]
  {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
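The normalization example above divides every value by the largest absolute value (here 100). A minimal Python sketch of that scaling follows, together with min-max scaling as a second common variant; the function names are illustrative and not part of any tool.

```python
def scale_by_max_abs(values):
    """Map values into [-1, +1] by dividing by the largest absolute value."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


def min_max(values, low=0.0, high=1.0):
    """Map values linearly so that the minimum becomes `low` and the maximum `high`."""
    smallest, largest = min(values), max(values)
    # Assumption: the values are not all equal, otherwise the range is zero.
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in values]


data = [-13, -6, -3, 10, 100]
print(scale_by_max_abs(data))  # [-0.13, -0.06, -0.03, 0.1, 1.0]
print(min_max(data))           # smallest value maps to 0.0, largest to 1.0
```

Dividing by the maximum preserves the sign and the zero point; min-max scaling instead stretches the observed range onto the target interval.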
Data transformation in WEKA
Generalization
  Discretize: filter->unsupervised->attribute
Normalization
  Normalize: filter->unsupervised->attribute, scales between [0,1]
HANDS ON: Attribute “Age”
Data reduction
Data reduction
  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
  Different reduction types (dimensions, numerosity, discretization)
Dimensionality: Attribute subset selection
  Example with a decision tree (left branches True, right False)
  Initial attribute set: {A1, A2, A3, A4, A5, A6}
  [Figure: decision tree with internal nodes A4?, A1?, A6? and leaves Class 1 / Class 2]
  Reduced attribute set: {A1, A4, A6}
Data reduction
Dimensionality: Principal Components Analysis
  Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  Works for numeric data only
  Used when the number of dimensions is large
Attribute subset selection in WEKA
HANDS ON: Play with the tab “Select attributes”
Data reduction
Numerosity: Clustering
  Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
[Figure: two scatter plots. Left: 2 clusters. Right: sparse data leads to many clusters, which is not effective]
Clustering in WEKA
HANDS ON: Use the tab “clustering”
Data reduction
Numerosity: Sampling
  Obtaining a small sample s to represent the whole data set N
  Problem: how to select a representative sampling set
  Random sampling is not enough: representative samples should be preserved
  Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
[Figure: random sampling (no samples are drawn from one region of the data) vs. stratified sampling]
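The difference between the two sampling strategies can be made concrete with a small sketch. The code below is an illustration only: the data set (90 "good" and 10 "bad" records) is hypothetical, and the function is not WEKA's resampling filter.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, label, fraction, seed=0):
    """Draw roughly `fraction` of the records from each class separately,
    so that class proportions in the sample match those in the data."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r[label]].append(r)
    sample = []
    for members in strata.values():
        # Keep at least one record per class so that no class disappears.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical, unbalanced data set: 90 good and 10 bad credit records.
records = [{"id": i, "class": "good"} for i in range(90)]
records += [{"id": 90 + i, "class": "bad"} for i in range(10)]

sample = stratified_sample(records, "class", 0.1)
counts = Counter(r["class"] for r in sample)
print(counts)  # 9 good, 1 bad: same 90/10 proportion as the full data set
```

A plain random sample of 10 records could easily contain no "bad" record at all, which is the situation the figure labels "no samples from here".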
Sampling in WEKA
  Random sampling
  Stratified sampling
HANDS ON: Reduce the numerosity of the data set
Discretization
Three types of attributes
  Nominal: values from an unordered set (color, profession)
  Ordinal: values from an ordered set (military or academic rank)
  Continuous: numbers (integer or real numbers)
Discretization
  Divide the range of a continuous attribute into intervals
  Reduces data size and its complexity
  Some data mining algorithms do not support continuous types, and in those cases discretization is mandatory
  WEKA: we already introduced the Discretize filter
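One simple way to divide a continuous range into intervals is equal-width binning. The sketch below assumes that strategy and uses made-up ages; other strategies (for example equal-frequency binning) exist and are not shown.

```python
def equal_width_bins(values, bins):
    """Return, for each value, the index (0 .. bins-1) of the equal-width
    interval it falls into."""
    low, high = min(values), max(values)
    width = (high - low) / bins

    def index(value):
        if width == 0:  # all values identical: a single interval
            return 0
        # The maximum value would land in bin `bins`; clamp it into the last one.
        return min(int((value - low) / width), bins - 1)

    return [index(v) for v in values]


# Hypothetical ages, discretized into 3 intervals over the range [19, 75].
ages = [19, 25, 33, 47, 52, 67, 75]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 1, 2, 2]
```

The resulting interval indexes can then be treated as an ordinal attribute with three values.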
6. Data Mining techniques
Frequent pattern analysis
What is it?
  Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  Frequent pattern analysis: searching for frequent patterns
Motivation: finding inherent regularities in data
  • Which products are bought together? Yesterday’s wine and spaghetti example
  • What are the subsequent purchases after buying a PC?
  • Can we automatically classify web documents?
Applications
  • Basket data analysis
  • Cross-marketing
  • Catalog design
  • Sale campaign analysis
Association rules (theory)
Itemsets (= transactions in this example)
  Transaction-id   Items bought
  1                Wine, Bread, Spaghetti
  2                Wine, Cocoa, Spaghetti
  3                Wine, Spaghetti, Cheese
  4                Bread, Cheese, Sugar
  5                Bread, Cocoa, Spaghetti, Cheese, Sugar
Goal: find all rules of type X → Y between items in an itemset with minimum:
  Support s: probability that an itemset contains X ∪ Y
  Confidence c: conditional probability that an itemset containing X also contains Y
Association rules:
  Wine → Spaghetti (support=60%, confidence=100%)
  Spaghetti → Wine (support=60%, confidence=75%)
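The support and confidence of the two wine and spaghetti rules can be checked directly on the five transactions above. The following is a small counting sketch in Python, not the Apriori algorithm; the helper names are illustrative.

```python
# The five transactions from the table.
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]


def count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)


def support(itemset):
    """Fraction of transactions containing the itemset."""
    return count(itemset) / len(transactions)


def confidence(x, y):
    """Among the transactions containing X, the fraction that also contain Y."""
    return count(x | y) / count(x)


print(support({"Wine", "Spaghetti"}))       # 0.6
print(confidence({"Wine"}, {"Spaghetti"}))  # 1.0
print(confidence({"Spaghetti"}, {"Wine"}))  # 0.75
```

Both rules share the same support because support depends only on X ∪ Y; the confidence differs because spaghetti appears in one transaction without wine.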
Association rules in WEKA
Apriori algorithm
  Does not work with numeric attributes
HANDS ON: Use Apriori in the “Associate” tab
Classification and Prediction
Classification
  Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  The characterization is a model
  The model can be applied to classify new data (predict the class they should belong to)
  Meta-algorithms are used to enhance results (e.g., cost matrix)
Prediction
  Models continuous-valued functions, i.e., predicts unknown or missing values
Classification: model construction
Training Data
  NAME   RANK             YEARS   TENURED
  Mike   Assistant Prof   3       no
  Mary   Assistant Prof   7       yes
  Bill   Professor        2       yes
  Jim    Associate Prof   7       yes
  Dave   Assistant Prof   6       no
  Anne   Associate Prof   3       no
Training Data → Classification Algorithms → Classifier (Model)
Classifier (Model):
  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification: model usage
Classifier:
  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Testing Data
  NAME      RANK             YEARS   TENURED
  Tom       Assistant Prof   2       no
  Merlisa   Associate Prof   7       no
  George    Professor        5       yes
  Joseph    Assistant Prof   7       yes
Unseen Data
  (Jeff, Professor, 4)
  Tenured?
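Model usage can be sketched by encoding the rule as a function, measuring it on the testing data, and then applying it to the unseen record. The Python below is only an illustration of the slide's rule; the function name is made up.

```python
def predict_tenured(rank, years):
    """The learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, actual tenured value).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(
    predict_tenured(rank, years) == actual
    for _, rank, years, actual in testing_data
)
accuracy = correct / len(testing_data)
print(accuracy)  # 0.75: the rule misclassifies Merlisa

# Unseen data: (Jeff, Professor, 4)
print(predict_tenured("Professor", 4))  # yes
```

The testing data gives an estimate of how far the model can be trusted before it is applied to records whose class is unknown.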
Decision Trees
[Figure: decision tree for the choice of investment type. Internal nodes: Income > 20K€, Age > 60, Married? (branches labeled yes/no). Leaves: Low risk, Mid risk, High risk]
Decision Trees
How are the attributes in decision trees selected?
  Two well-known indexes are used
  • Information gain selects the most informative attribute in distinguishing the items between the classes
  • It is biased towards attributes with a large set of values
  • Gain ratio addresses the limitations of information gain
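Both indexes can be illustrated on the tenure training table from the classification slides. The sketch below uses the standard entropy-based definitions (gain ratio = information gain divided by the entropy of the attribute's own value distribution); it is a hand-rolled illustration, not the code WEKA runs.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(rows, attribute, label):
    """Entropy of the class minus the weighted entropy after splitting on `attribute`."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[label] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[label] for r in rows]) - remainder


def gain_ratio(rows, attribute, label):
    """Information gain normalized by the split information of the attribute.
    Assumption: the attribute takes at least two distinct values."""
    return information_gain(rows, attribute, label) / entropy([r[attribute] for r in rows])


rows = [
    {"name": "Mike", "rank": "Assistant Prof", "tenured": "no"},
    {"name": "Mary", "rank": "Assistant Prof", "tenured": "yes"},
    {"name": "Bill", "rank": "Professor", "tenured": "yes"},
    {"name": "Jim", "rank": "Associate Prof", "tenured": "yes"},
    {"name": "Dave", "rank": "Assistant Prof", "tenured": "no"},
    {"name": "Anne", "rank": "Associate Prof", "tenured": "no"},
]

for attribute in ("name", "rank"):
    print(attribute, information_gain(rows, attribute, "tenured"),
          gain_ratio(rows, attribute, "tenured"))
```

The attribute "name" has a different value for every row, so it obtains the maximum information gain even though it is useless for new data; this is the bias mentioned above. Gain ratio divides by the split information, which is large for many-valued attributes, and therefore shrinks that score.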
Decision Trees in WEKA
HANDS ON: Use ADTree and see results
Cost matrix
  Model / Truth   Good   Bad
  Good            0      1
  Bad             5      0
HANDS ON: Use J48 and apply a cost matrix
Bayesian classifiers
Bayesian classification
  A statistical classification technique
  • Predicts class membership probabilities
  Founded on the Bayes theorem
  $$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
  • What if X = “Red and rounded” and H = “Apple”?
  Performance
  • The simplest implementation (Naïve Bayes) can be compared to decision trees and neural networks
  Incremental
  • Each training example can increase/decrease the probability that a hypothesis is correct
HANDS ON: Use the NaiveBayes algorithm
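The apple question can be worked through numerically. The counts below are purely hypothetical (they do not come from the slides) and are chosen only to show how the three probabilities in the Bayes theorem combine.

```python
from fractions import Fraction

# Hypothetical basket of 100 fruits (made-up counts for illustration):
total_fruits = 100
apples = 30                    # fruits for which H = "Apple" holds
red_round = 40                 # fruits for which X = "Red and rounded" holds
red_round_apples = 24          # apples that are also red and rounded

p_h = Fraction(apples, total_fruits)             # P(H)
p_x = Fraction(red_round, total_fruits)          # P(X)
p_x_given_h = Fraction(red_round_apples, apples)  # P(X | H)

# Bayes theorem: P(H | X) = P(X | H) * P(H) / P(X)
posterior = p_x_given_h * p_h / p_x
print(posterior)  # 3/5
```

The result agrees with counting directly: 24 of the 40 red and rounded fruits are apples. Exact fractions are used here only to avoid floating-point rounding in the printed result.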
Classification techniques: Support Vector Machines
  One of the most advanced classification techniques
  Left figure: a small margin between the classes is found
  Right figure: the largest margin is found
  Support vector machines (SVMs) are able to identify the margin shown in the right figure
Classification techniques: SVMs + Kernel Functions
  Is data always linearly separable? No!
  Solution: SVMs + Kernel Functions
[Figure: three panels. "How to split this?", "SVM", "SVM + Kernel Functions"]
HANDS ON: Try WEKA’s SMO algorithm!
7. Summary
  Why Data Mining?
  Data Mining and KDD
  Data preprocessing
  Classification
  Clustering
  Application areas