Overview of the Data Mining Process
Data Mining for Business Analytics (Shmueli, Patel & Bruce)
Core Ideas in Data Mining
Classification
Prediction
Association Rules
Predictive Analytics
Data Reduction and Dimension Reduction
Data Exploration and Visualization
Supervised and Unsupervised Learning
Supervised Learning
Goal: predict a single "target" (outcome) variable
Train on data where the target value is known
Score data where the target value is not known
Methods: Classification and Prediction
Unsupervised Learning
Goal: segment data into meaningful groups; detect patterns
There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization
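The supervised workflow above (train where the target is known, then score cases where it is not) can be sketched with a toy one-nearest-neighbor classifier. The data, variable names, and labels below are purely illustrative:

```python
import math

def nearest_neighbor_predict(train_rows, train_labels, new_row):
    """Score a new case by copying the label of its closest training case."""
    dists = [math.dist(row, new_row) for row in train_rows]
    return train_labels[dists.index(min(dists))]

# Training data: the target ("purchase"/"no purchase") is known
train_rows = [(25, 30_000), (45, 90_000), (35, 60_000)]   # (age, income)
train_labels = ["no purchase", "purchase", "purchase"]

# Scoring: the target is unknown for this new customer
print(nearest_neighbor_predict(train_rows, train_labels, (40, 80_000)))
```

In an unsupervised task there would be no `train_labels` at all; the methods below look for structure in the rows themselves.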
Supervised: Classification
Goal: predict a categorical target (outcome) variable
Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy...
Each row is a case (customer, tax return, applicant)
Each column is a variable
Target variable is often binary (yes/no)
Supervised: Prediction
Goal: predict a numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute "predictive analytics"
Unsupervised: Association Rules
Goal: produce rules that define "what goes with what"
Example: "If X was purchased, Y was also purchased"
Rows are transactions
Used in recommender systems: "Our records show you bought X, you may also like Y"
Also called "affinity analysis"
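The strength of a rule like "if X was purchased, Y was also purchased" is usually measured by its support (how often X and Y appear together) and confidence (how often Y appears given X). A minimal sketch, with made-up transactions:

```python
def rule_stats(transactions, x, y):
    """Support and confidence of the rule 'if x is purchased, y is also purchased'."""
    n = len(transactions)
    both = sum(1 for t in transactions if x in t and y in t)
    has_x = sum(1 for t in transactions if x in t)
    support = both / n                              # fraction of all baskets with x and y
    confidence = both / has_x if has_x else 0.0     # fraction of x-baskets that also have y
    return support, confidence

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
print(rule_stats(transactions, "milk", "bread"))  # support 0.5, confidence 2/3
```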
Unsupervised: Data Reduction
Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)
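Record reduction via clustering can be illustrated with a tiny one-dimensional k-means: many rows are summarized by a few cluster centers. The values and starting centers below are invented for illustration:

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: reduce many records to a few cluster centers."""
    for _ in range(iterations):
        # Assign each value to its nearest center
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        # Move each center to the mean of its assigned values
        centers = [sum(vs) / len(vs) if vs else c for c, vs in clusters.items()]
    return sorted(centers)

# Eight records reduced to two representative centers
values = [1.0, 1.2, 0.8, 1.1, 9.9, 10.1, 10.0, 9.8]
print(kmeans_1d(values, centers=[0.0, 5.0]))  # roughly [1.025, 9.95]
```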
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful for examining relationships between pairs of variables
Data Exploration
Data sets are typically large, complex & messy
Need to review the data to help refine the task
Use techniques of Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
1. Define/understand the purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, and pre-process the data
4. Reduce the data; transform variables
5. Determine the DM task (classification, clustering, etc.)
6. Partition the data (for supervised tasks)
7. Choose the techniques (regression, neural networks, etc.)
8. Perform the task
9. Interpret results; compare models
10. Deploy the best model
SAS SEMMA
Sample: take a sample and partition it
Explore: examine the data statistically and graphically
Modify: transform variables and impute missing values
Model: fit a predictive model
Assess: compare models using a validation dataset
Obtaining Data: Sampling
Data mining typically deals with huge databases
Algorithms and models are typically applied to a sample from a database, to produce statistically valid results
XLMiner, e.g., limits the "training" partition to 10,000 records
Once you develop and select a final model, you use it to "score" the observations in the larger database
Rare Event Oversampling
Often the event of interest is rare
Examples: response to a mailing, fraud in taxes, ...
Sampling may yield too few "interesting" cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, results need to be adjusted for the oversampling
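A minimal sketch of oversampling: duplicate randomly chosen rare-class records until they reach a target share of the training set. The records, labels, and ratio here are illustrative:

```python
import random

def oversample(records, labels, rare_label, target_ratio=0.5, seed=1):
    """Duplicate rare-class records until they make up target_ratio of the set."""
    rng = random.Random(seed)
    rare = [(r, l) for r, l in zip(records, labels) if l == rare_label]
    data = list(zip(records, labels))
    while sum(1 for _, l in data if l == rare_label) / len(data) < target_ratio:
        data.append(rng.choice(rare))  # add another copy of a rare case
    return data

records = list(range(10))
labels = ["no"] * 9 + ["fraud"]          # fraud is the rare event
balanced = oversample(records, labels, "fraud")
print(sum(1 for _, l in balanced if l == "fraud") / len(balanced))  # 0.5
```

Because the training set no longer reflects the true class frequencies, predicted probabilities must later be adjusted back, as the slide notes.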
Pre-processing Data
Types of Variables
Determine the types of pre-processing needed, and the algorithms used
Main distinction: categorical vs. numeric
Numeric: continuous or integer
Categorical: ordered (low, medium, high) or unordered (male, female)
Variable Handling
Numeric:
Most algorithms in XLMiner can handle numeric data
May occasionally need to "bin" into categories
Categorical:
Naive Bayes can use categorical variables as-is
In most other algorithms, must create binary dummies (number of dummies = number of categories - 1)
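The "k - 1 dummies" rule can be sketched in a few lines: the first category serves as the reference level and is coded as all zeros. The REMODEL values below come from the West Roxbury example later in the deck:

```python
def make_dummies(values, categories):
    """Code a categorical variable as (k - 1) binary dummies; the first
    listed category is the reference level and gets all zeros."""
    reference, *dummy_levels = categories
    return [[1 if v == level else 0 for level in dummy_levels] for v in values]

remodel = ["None", "Recent", "Old", "None"]
print(make_dummies(remodel, categories=["None", "Recent", "Old"]))
# Two dummies (Recent, Old) encode the three REMODEL categories
```

Dropping one category avoids redundancy: the reference level is fully determined by the other dummies all being zero.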
Variable Selection
Parsimony: prefer fewer variables
Rules of thumb: at least 10 records per variable; at least 6 x m x p records, where m = number of outcome classes and p = number of variables
Redundancy must be avoided
Domain experts must be consulted
Detecting Outliers
An outlier is an observation that is "extreme", being distant from the rest of the data (the definition of "distant" is deliberately vague)
Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine whether it is an error or truly extreme
Detecting Outliers (cont.)
In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called "anomaly detection".
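One common (though not the only) way to operationalize "distant" is a z-score rule: flag values more than some number of standard deviations from the mean. The data and threshold below are illustrative; since "distant" is deliberately vague, the threshold is a judgment call:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

values = [10, 12, 11, 9, 10, 11, 10, 12, 9, 11, 10, 95]
print(zscore_outliers(values, threshold=2.0))  # [95]
```

As the slide stresses, the flagged value still needs domain review: it may be a data-entry error, or a genuine (and interesting) extreme.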
Handling Missing Data
Most algorithms will not process records with missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, those records can be omitted
If many records are missing values on a small set of variables, those variables can be dropped (or proxies used)
If many records have missing values, omission is not practical
Solution 2: Imputation
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its (non-missing) information
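A common simple substitute is the mean of the observed values in that column. A minimal sketch, with invented data (missing values shown as None):

```python
import statistics

def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in column]

rooms = [6, 10, None, 9, 7]
print(impute_mean(rooms))  # the missing value becomes the mean, 8
```

Other substitutes (median, a model-based prediction) work the same way; the point is that the record's non-missing information is retained.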
Normalizing (Standardizing) Data
Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on the same scale
Normalizing function: subtract the mean and divide by the standard deviation (used in XLMiner)
Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range
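Both normalizing functions are one-liners. Here they are applied to the LOT SQ FT values from the West Roxbury example (a handful of rows, for illustration):

```python
import statistics

def zscore(values):
    """Subtract the mean and divide by the standard deviation."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Scale to 0-1 by subtracting the minimum and dividing by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

lot_sq_ft = [9965, 6590, 7500, 13773, 5000]
print(min_max(lot_sq_ft))   # all values now lie between 0 and 1
print(zscore(lot_sq_ft))    # mean 0, unit standard deviation
```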
Partitioning the Data
Problem: how well will our model perform with new data?
Solution: separate the data into two parts:
Training partition, used to develop the model
Validation partition, used to implement the model and evaluate its performance on "new" data
Addresses the issue of overfitting
Test Partition
When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data)
Assessing multiple models on the same validation data can overfit the validation data
Some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data
Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
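A three-way split can be sketched with a random shuffle followed by slicing. The 60/20/20 fractions below are a common choice, not a rule from the text:

```python
import random

def partition(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Randomly split records into training, validation, and test partitions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = partition(list(range(100)))
print(len(train), len(valid), len(test))  # 60 20 20
```

Shuffling before slicing matters: it keeps any ordering in the database (e.g., by date) from leaking into a single partition.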
Overfitting
Statistical models can produce highly complex explanations of relationships between variables
The "fit" may be excellent
When used with new data, models of great complexity do not do so well
A 100% fit is not useful for new data
[Figure: scatterplot of Revenue (0 to 1600) vs. Expenditure (200 to 1000) with a curve passing through every point, illustrating a 100% fit]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
Consequence: the deployed model will not work as well as expected with completely new data
Building a Predictive Model with XLMiner
Example: West Roxbury Home Value Dataset
First ten records of the dataset:

TOTAL VALUE | TAX | LOT SQ FT | YR BUILT | GROSS AREA | LIVING AREA | FLOORS | ROOMS | BEDROOMS | FULL BATH | HALF BATH | KITCHEN | FIREPLACE | REMODEL
344.2 | 330 | 9965 | 1880 | 2436 | 1352 | 2 | 6 | 3 | 1 | 1 | 1 | 0 | None
412.6 | 5190 | 6590 | 1945 | 3108 | 1976 | 2 | 10 | 4 | 2 | 1 | 1 | 0 | Recent
330.1 | 4152 | 7500 | 1890 | 2294 | 1371 | 2 | 8 | 4 | 1 | 1 | 1 | 0 | None
498.6 | 6272 | 13773 | 1957 | 5032 | 2608 | 1 | 9 | 5 | 1 | 1 | 1 | 1 | None
331.5 | 4170 | 5000 | 1910 | 2370 | 1438 | 2 | 7 | 3 | 2 | 0 | 1 | 0 | None
337.4 | 4244 | 5142 | 1950 | 2124 | 1060 | 1 | 6 | 3 | 1 | 0 | 1 | 1 | Old
359.4 | 4521 | 5000 | 1954 | 3220 | 1916 | 2 | 7 | 3 | 1 | 1 | 1 | 0 | None
320.4 | 4030 | 10000 | 1950 | 2208 | 1200 | 1 | 6 | 3 | 1 | 0 | 1 | 0 | None
333.5 | 4195 | 6835 | 1958 | 2582 | 1092 | 1 | 5 | 3 | 1 | 0 | 1 | 1 | Recent
409.4 | 5150 | 5093 | 1900 | 4818 | 2992 | 2 | 8 | 4 | 2 | 0 | 1 | 0 | None
TOTAL VALUE: total assessed value for the property, in thousands of USD
TAX: tax bill amount based on total assessed value multiplied by the tax rate, in USD
LOT SQ FT: total lot size of the parcel, in square feet
YR BUILT: year the property was built
GROSS AREA: gross floor area
LIVING AREA: total living area for residential properties (ft^2)
FLOORS: number of floors
ROOMS: total number of rooms
BEDROOMS: total number of bedrooms
FULL BATH: total number of full baths
HALF BATH: total number of half baths
KITCHEN: total number of kitchens
FIREPLACE: total number of fireplaces
REMODEL: when the house was remodeled (Recent/Old/None)
Modeling Process
1. Determine the purpose: predict the value of homes in West Roxbury
2. Obtain the data: West Roxbury Housing.XLSX
3. Explore, clean, and preprocess the data
a. TAX is circular (it is computed from TOTAL VALUE), so it is not useful as a predictor
b. Generate descriptive statistics and look for unusual values
c. Generate plots and examine them
Modeling Process (cont.)
4. Reduce the dimension
a. Condense the number of categories
b. Consolidate multiple numerical variables using Principal Component Analysis
5. Determine the data mining task
6. Partition the data (for supervised tasks)
7. Choose the technique: for this example, multiple linear regression
Modeling Process (cont.)
8. Use the algorithm to perform the task
9. Interpret the results
10. Deploy the model
Step 6: Partitioning the data
Step 8: Using XLMiner for Multiple Linear Regression
Summary of errors
RMS error
Error = actual - predicted
RMS error = root-mean-squared error = square root of the average squared error
In the previous example, the sizes of the training and validation sets differ, so only RMS error and average error are comparable
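The RMS error definition above is a direct computation. The actual/predicted values below are invented for illustration (the first column of the toy data echoes the TOTAL VALUE figures shown earlier):

```python
import math

def rms_error(actual, predicted):
    """Square root of the average squared error, where error = actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

actual = [344.2, 412.6, 330.1, 498.6]
predicted = [350.0, 400.0, 335.0, 490.0]
print(round(rms_error(actual, predicted), 2))  # about 8.52
```

Because RMS error and average error are normalized by the number of records, they can be compared across training and validation partitions of different sizes, as the slide notes.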
Using Excel and XLMiner for Data Mining
Excel is limited in data capacity
However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner
Models can then be used to score larger databases
XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model)
Summary
Data mining consists of supervised methods (classification and prediction) and unsupervised methods (association rules, data reduction, data exploration and visualization)
Before algorithms can be applied, data must be characterized and pre-processed
To evaluate performance and to avoid overfitting, data partitioning is used
Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database