overview of the data mining process

39
Overview of the Data Mining Process Data Mining for Business Analytics Shmueli, Patel & Bruce

Upload: bertha-butler

Post on 18-Jan-2018

218 views

Category:

Documents


0 download

DESCRIPTION

Core Ideas in Data Mining Classification Prediction Association Rules Predictive Analytics Data Reduction and Dimension Reduction Data Exploration and Visualization Supervised and Unsupervised Learning

TRANSCRIPT

Page 1: Overview of the Data Mining Process

Overview of the Data Mining Process

Data Mining for Business AnalyticsShmueli, Patel & Bruce

Page 2: Overview of the Data Mining Process

Core Ideas in Data Mining

ClassificationPredictionAssociation RulesPredictive AnalyticsData Reduction and Dimension ReductionData Exploration and VisualizationSupervised and Unsupervised Learning

Page 3: Overview of the Data Mining Process

Supervised LearningGoal: Predict a single “target” or “outcome”

variable

Training data, where target value is known

Score to data where value is not known

Methods: Classification and Prediction

Page 4: Overview of the Data Mining Process

Unsupervised Learning

Goal: Segment data into meaningful segments; detect patterns

There is no target (outcome) variable to predict or classify

Methods: Association rules, data reduction & exploration, visualization

Page 5: Overview of the Data Mining Process

Supervised: ClassificationGoal: Predict categorical target (outcome)

variable Examples: Purchase/no purchase, fraud/no

fraud, creditworthy/not creditworthy…Each row is a case (customer, tax return,

applicant)Each column is a variableTarget variable is often binary (yes/no)

Page 6: Overview of the Data Mining Process

Supervised: PredictionGoal: Predict numerical target (outcome)

variable Examples: sales, revenue, performanceAs in classification:Each row is a case (customer, tax return,

applicant)Each column is a variableTaken together, classification and

prediction constitute “predictive analytics”

Page 7: Overview of the Data Mining Process

Unsupervised: Association RulesGoal: Produce rules that define “what goes

with what”Example: “If X was purchased, Y was also

purchased”Rows are transactionsUsed in recommender systems – “Our

records show you bought X, you may also like Y”

Also called “affinity analysis”

Page 8: Overview of the Data Mining Process

Unsupervised: Data ReductionDistillation of complex/large data into

simpler/smaller dataReducing the number of variables/columns

(e.g., principal components)Reducing the number of records/rows (e.g.,

clustering)

Page 9: Overview of the Data Mining Process

Unsupervised: Data VisualizationGraphs and plots of dataHistograms, boxplots, bar charts,

scatterplotsEspecially useful to examine relationships

between pairs of variables

Page 10: Overview of the Data Mining Process

Data Exploration

Data sets are typically large, complex & messy

Need to review the data to help refine the task

Use techniques of Reduction and Visualization

Page 11: Overview of the Data Mining Process

The Process of Data Mining

Page 12: Overview of the Data Mining Process

Steps in Data Mining1. Define/understand purpose2. Obtain data (may involve random

sampling)3. Explore, clean, pre-process data4. Reduce the data, transform variables5. Determine DM task (classification,

clustering, etc.)6. Partition the data (for supervised tasks)7. Choose the techniques (regression, neural

networks, etc.)8. Perform the task9. Interpret – compare models10. Deploy the best model

Page 13: Overview of the Data Mining Process

SAS SEMMA Sample Take a sample and partitionExplore Examine statistically and graphicallyModify Transform variables and impute missing valuesModel Fit predictive modelAssess Compare model using validation dataset

Page 14: Overview of the Data Mining Process

Obtaining Data: SamplingData mining typically deals with huge

databasesAlgorithms and models are typically applied

to a sample from a database, to produce statistically-valid results

XLMiner, e.g., limits the “training” partition to 10,000 records

Once you develop and select a final model, you use it to “score” the observations in the larger database

Page 15: Overview of the Data Mining Process

Rare event oversamplingOften the event of interest is rareExamples: response to mailing, fraud in

taxes, …Sampling may yield too few “interesting”

cases to effectively train a modelA popular solution: oversample the rare

cases to obtain a more balanced training set

Later, need to adjust results for the oversampling

Page 16: Overview of the Data Mining Process

Pre-processing Data

Page 17: Overview of the Data Mining Process

Types of VariablesDetermine the types of pre-processing

needed, and algorithms usedMain distinction: Categorical vs. numericNumeric

ContinuousInteger

CategoricalOrdered (low, medium, high)Unordered (male, female)

Page 18: Overview of the Data Mining Process

Variable handlingNumeric

Most algorithms in XLMiner can handle numeric data

May occasionally need to “bin” into categories

CategoricalNaïve Bayes can use as-isIn most other algorithms, must create binary

dummies (number of dummies = number of categories – 1)

Page 19: Overview of the Data Mining Process

Variable selectionParsimony10 records/variable6 x m x p records

m = no. of outcome classesP = no. of variables

Redundancy must be avoidedDomain experts must be consulted

Page 20: Overview of the Data Mining Process

Detecting OutliersAn outlier is an observation that is

“extreme”, being distant from the rest of the data (definition of “distant” is deliberately vague)

Outliers can have disproportionate influence on models (a problem if it is spurious)

An important step in data pre-processing is detecting outliers

Once detected, domain knowledge is required to determine if it is an error, or truly extreme.

Page 21: Overview of the Data Mining Process

Detecting OutliersIn some contexts, finding outliers is the

purpose of the DM exercise (airport security screening). This is called “anomaly detection”.

Page 22: Overview of the Data Mining Process

Handling Missing DataMost algorithms will not process records

with missing values. Default is to drop those records.

Solution 1: Omission If a small number of records have missing

values, can omit them If many records are missing values on a small

set of variables, can drop those variables (or use proxies)

If many records have missing values, omission is not practical

Solution 2: Imputation Replace missing values with reasonable

substitutes Lets you keep the record and use the rest of its

(non-missing) information

Page 23: Overview of the Data Mining Process

Normalizing (Standardizing) DataUsed in some techniques when variables

with the largest scales would dominate and skew results

Puts all variables on same scaleNormalizing function: Subtract mean and

divide by standard deviation (used in XLMiner)

Alternative function: scale to 0-1 by subtracting minimum and dividing by the range

Page 24: Overview of the Data Mining Process

Partitioning the DataProblem: How well will our

model perform with new data?

Solution: Separate data into two parts Training partition to develop

the modelValidation partition to

implement the model and evaluate its performance on “new” data

Addresses the issue of overfitting

Page 25: Overview of the Data Mining Process

Test PartitionWhen a model is developed on

training data, it can overfit the training data (hence need to assess on validation)

Assessing multiple models on same validation data can overfit validation data

Some methods use the validation data to choose a parameter. This too can lead to overfitting the validation data

Solution: final selected model is applied to a test partition to give unbiased estimate of its performance on new data

Page 26: Overview of the Data Mining Process

OverfittingStatistical models can produce highly

complex explanations of relationships between variables

The “fit” may be excellentWhen used with new data, models of great

complexity do not do so well.

Page 27: Overview of the Data Mining Process

100% fit – not useful for new data

200 300 400 500 600 700 800 900 10000

200

400

600

800

1000

1200

1400

1600

Expenditure

Revenue

Page 28: Overview of the Data Mining Process

Overfitting (cont.)Causes:

Too many predictors A model with too many parameters Trying many different models

Consequence: Deployed model will not work as well as expected with completely new data.

Page 29: Overview of the Data Mining Process

Building Predictive Model with XLMINERExample – West Roxbury Home Value Dataset

TOTAL VALUE TAX

LOT SQ FT

YR BUILT

GROSS AREA

LIVING AREA FLOORS ROOMS

BEDROOMS

FULL BATH

HALF BATH

KITCHEN

FIREPLACEREMODEL

344.2 330 9965 1880 2436 1352 2 6 3 1 1 1 0None412.6 5190 6590 1945 3108 1976 2 10 4 2 1 1 0Recent330.1 4152 7500 1890 2294 1371 2 8 4 1 1 1 0None498.6 6272 13773 1957 5032 2608 1 9 5 1 1 1 1None331.5 4170 5000 1910 2370 1438 2 7 3 2 0 1 0None337.4 4244 5142 1950 2124 1060 1 6 3 1 0 1 1Old359.4 4521 5000 1954 3220 1916 2 7 3 1 1 1 0None320.4 4030 10000 1950 2208 1200 1 6 3 1 0 1 0None333.5 4195 6835 1958 2582 1092 1 5 3 1 0 1 1Recent409.4 5150 5093 1900 4818 2992 2 8 4 2 0 1 0None

Page 30: Overview of the Data Mining Process

TOTAL VALUE Total assessed value for property, in thousands of USDTAX Tax bill amount based on total assessed value multiplied by the

tax rate, in USDLOT SQ FT Total lot size of parcel in square feetYR BUILT Year property was builtGROSS AREA Gross floor areaLIVING AREA Total living area for residential properties (ftz)FLOORS Number of floorsROOMS Total number of roomsBEDROOMS Total number of bedroomsFULL BATH Total number of full bathsHALF BATH Total number of half bathsKITCHEN Total number of kitchensFIREPLACE Total number of fireplacesREMODEL When house was remodeled (Recent/Old/None)

Page 31: Overview of the Data Mining Process

Modeling Process

1. Determine the purposePredict value of homes in West Roxbury

2. Obtain the dataWest Roxbury Housing.XLSX

3. Explore, Clean and Preprocess the dataa. Tax is circular, not useful as a predictorb. Generate descriptive statistics and look for

unusual valuesc. Generate plots and examine them

Page 32: Overview of the Data Mining Process

Modeling Process

4. Reduce the dimensiona. Condense number of categoriesb. Consolidate multiple numerical variables using

Principal Component Analysis5. Determine Data Mining Task6. Partition the data (for supervised tasks)7. Choose the technique

For the example multiple linear regression

Page 33: Overview of the Data Mining Process

Modeling Process

8. Use the algorithm to perform the task9. Interpret results10. Deploy the model

Page 34: Overview of the Data Mining Process

Step 6: Partitioning the data

Page 35: Overview of the Data Mining Process

Step 8: Using XLMiner for Multiple Linear Regression

Page 36: Overview of the Data Mining Process

Summary of errors

Page 37: Overview of the Data Mining Process

RMS errorError = actual - predictedRMS = Root-mean-squared error = Square

root of average squared error

In previous example, sizes of training and validation sets differ, so only RMS Error and Average Error are comparable

Page 38: Overview of the Data Mining Process

Using Excel and XLMiner for Data MiningExcel is limited in data capacityHowever, the training and validation of DM

models can be handled within the modest limits of Excel and XLMiner

Models can then be used to score larger databases

XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model)

Page 39: Overview of the Data Mining Process

SummaryData Mining consists of supervised

methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)

Before algorithms can be applied, data must be characterized and pre-processed

To evaluate performance and to avoid overfitting, data partitioning is used

Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database