Chapter 6: Regression Algorithms in Data Mining
Fit data
Time-series data: Forecast
Other data: Predict
結束
6-2
Contents
Describes OLS (ordinary least squares) regression and logistic regression
Describes linear discriminant analysis and centroid discriminant analysis
Demonstrates techniques on small data sets
Reviews the real applications of each model
Shows the application of models to larger data sets
6-3
Use in Data Mining
Telecommunication industry: turnover (churn)
One of the major analytic models for classification problems
Linear regression
  The standard: ordinary least squares (OLS) regression
  Can be used for discriminant analysis
  Can apply stepwise regression
Nonlinear regression: more complex (but less reliable) data fitting
Logistic regression: used when data are categorical (usually binary)
6-4
OLS Model
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
where
  Y is the dependent variable
  β0 is the intercept term
  β1, ..., βn are the coefficients for the n independent variables
  ε is the error term
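The OLS model can be fitted by minimizing squared error over the observations; a minimal NumPy sketch with two independent variables (the data are made up, and the error term is set to zero here so the fit recovers the coefficients exactly):

```python
import numpy as np

# Hypothetical data generated from Y = 3 + 2*X1 - 1*X2 (error term omitted)
X1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
X2 = np.array([2, 1, 4, 3, 6, 5, 8, 7, 10, 9], dtype=float)
Y = 3.0 + 2.0 * X1 - 1.0 * X2

# Design matrix: a column of ones for the intercept b0, then X1 and X2
A = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares solution of A @ [b0, b1, b2] ~ Y
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
b0, b1, b2 = coef
print(f"Y = {b0:.3f} {b1:+.3f}*X1 {b2:+.3f}*X2")
```

With real data the error term would be nonzero and the recovered coefficients would only approximate the true ones.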
6-5
OLS Regression
Uses the intercept and slope coefficients (β) to minimize squared error terms over all i observations
Fits the data with a linear model
Time-series data: observations over past periods; best-fit line (in terms of minimizing the sum of squared errors)
6-6
Regression Output
R2 : 0.987
Intercept: 0.642 t=0.286 P=0.776
Week: 5.086 t=53.27 P=0
Requests = 0.642 + 5.086*Week
6-7
Example
[Worked example computing SSE, SST, and R2]
6-8
Example
6-9
A graph of the time-series model
[Chart: (X1) Requests vs. (X2) Pred_lmreg_1; both axes run roughly 10-200]
6-10
Time-Series Forecast
[Chart: time-series prediction, values 0-250 over periods 0-50]
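A forecast from the fitted time-series model just extrapolates the regression line; a minimal sketch using the Requests = 0.642 + 5.086*Week model reported earlier (the forecast weeks are arbitrary):

```python
# Fitted time-series model from the earlier regression output
intercept, slope = 0.642, 5.086

def forecast_requests(week):
    """Extrapolate the best-fit line to a (possibly future) week."""
    return intercept + slope * week

# Forecast a few periods beyond the observed data
for week in (41, 45, 50):
    print(week, round(forecast_requests(week), 1))
```

The further the forecast runs past the observed periods, the less reliable the extrapolation becomes.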
6-11
Regression Tests
FIT:
  SSE: sum of squared errors (synonym: SSR, sum of squared residuals)
  R2: proportion of variance explained by the model
  Adjusted R2: adjusts the calculation to penalize for the number of independent variables
Significance:
  F-test: test of overall model significance
  t-test: test of significant difference between a model coefficient and zero
  P: probability that the coefficient is zero (or at least on the other side of zero from the estimate)
See page 103
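The t-test can be sketched for a simple regression: divide the slope by its standard error, then look up a two-sided P value in the t distribution (the data here are made up; SciPy supplies the distribution):

```python
import numpy as np
from scipy import stats

# Hypothetical data with a strong linear trend
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([5.2, 10.1, 14.8, 20.3, 24.9, 30.2, 34.8, 40.1])

n = len(x)
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Standard error of the slope: sqrt(MSE / Sxx), with n - 2 degrees of freedom
mse = np.sum(residuals ** 2) / (n - 2)
se_slope = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t_stat = slope / se_slope

# Two-sided P value: probability of a t this extreme if the true slope were zero
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.2f}, P = {p_value:.6f}")
```

A tiny P value, as here, means the slope is almost certainly not zero.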
6-12
Regression Model Tests
SSE (sum of squared errors)
  For each observation, subtract the model value from the observed value, square the difference, and total over all observations
  By itself means nothing
  Can compare across models (lower is better)
  Can be used to evaluate the proportion of variance in the data explained by the model
R2
  Ratio of the explained sum of squares (MSR) to the total sum of squares (SST)
  SST = MSR plus SSE
  0 ≤ R2 ≤ 1
See page 104
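The decomposition SST = MSR + SSE, and R2 as the explained share, can be checked numerically for an OLS fit; a small sketch with made-up data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 6.0, 8.2, 9.9])   # hypothetical observations

slope, intercept = np.polyfit(x, y, 1)     # OLS best-fit line
pred = intercept + slope * x

sse = np.sum((y - pred) ** 2)              # unexplained (residual) sum of squares
msr = np.sum((pred - y.mean()) ** 2)       # explained sum of squares
sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
r2 = msr / sst                             # proportion explained: 0 <= R2 <= 1

print(f"SST = {sst:.4f}, MSR = {msr:.4f}, SSE = {sse:.4f}, R2 = {r2:.4f}")
```

The identity SST = MSR + SSE holds exactly for an OLS fit that includes an intercept.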
6-13
Multiple Regression
Can include more than one independent variable
Trade-off:
  Too many variables: many spurious, with overlapping information
  Too few variables: miss important content
Adding variables will always increase R2
Adjusted R2 penalizes for additional independent variables
6-14
Example: Hiring Data
Dependent variable: Sales
Independent variables: Years of Education, College GPA, Age, Gender, College Degree
See pages 104-105
6-15
Regression Model
Sales = 269025 - 17148*YrsEd (P = 0.175) - 7172*GPA (P = 0.812) + 4331*Age (P = 0.116) - 23581*Male (P = 0.266) + 31001*Degree (P = 0.450)
R2 = 0.252, Adjusted R2 = -0.015
Weak model; no variable significant at the 0.10 level
6-16
Improved Regression Model
Sales = 173284
- 9991*YrsEd P = 0.098*
+3537*Age P = 0.141
-18730*Male P = 0.328
R2 = 0.218 Adj R2 = 0.070
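Adjusted R2 can be computed as 1 - (1 - R2)(n - 1)/(n - k - 1), where k is the number of independent variables. A quick check against the two hiring models (assuming n = 20 observations, a value not stated on the slides but consistent with both reported figures):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1): penalizes extra variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 20  # assumed number of observations in the hiring data (not stated on the slide)
print(round(adjusted_r2(0.252, n, 5), 3))  # full model with 5 variables
print(round(adjusted_r2(0.218, n, 3), 3))  # improved model with 3 variables
```

With n = 20 the formula reproduces the slides' values to within rounding: about -0.015 and 0.071 versus the reported -0.015 and 0.070.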
6-17
Logistic Regression
Data are often ordinal or nominal
  Regression based on continuous numbers is not appropriate
  Need dummy variables
Binary (either are or are not): LOGISTIC REGRESSION (probability of 1 or 0)
Two or more categories: DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)
6-18
Logistic Regression
For dependent variables that are nominal or ordinal
Probability of acceptance of case i to class j
Sigmoidal function (in English, an S curve from 0 to 1):
Pj = 1 / (1 + e^-(β0 + β1*Xi))
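The S curve maps any linear regression score onto a probability between 0 and 1; a minimal sketch (the β values here are arbitrary, not from the text):

```python
import math

def logistic_probability(x, b0=-4.0, b1=2.0):
    """Sigmoid: P = 1 / (1 + e^-(b0 + b1*x)), an S curve from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in (0.0, 2.0, 4.0):
    print(x, round(logistic_probability(x), 3))  # rises from near 0, through 0.5, toward 1
```

Where the score b0 + b1*x is exactly zero, the probability is exactly 0.5.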
6-19
Insurance Claim Model
Fraud = 81.824 - 2.778*Age (P = 0.789) - 75.893*Male (P = 0.758) + 0.017*Claim (P = 0.757) - 36.648*Tickets (P = 0.824) + 6.914*Prior (P = 0.935) - 29.362*AttySmith (P = 0.776)
Can get the probability by running the score through the logistic formula
See pages 107-109
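Running the score through the logistic formula can be sketched with the coefficients above (the claimant's values below are made up for illustration):

```python
import math

def fraud_score(age, male, claim, tickets, prior, atty_smith):
    """Linear score from the insurance claim model above."""
    return (81.824 - 2.778 * age - 75.893 * male + 0.017 * claim
            - 36.648 * tickets + 6.914 * prior - 29.362 * atty_smith)

def fraud_probability(*args):
    """Convert the linear score to a probability via the logistic formula."""
    return 1.0 / (1.0 + math.exp(-fraud_score(*args)))

# Hypothetical claimant: age 30, male, $2000 claim, 1 ticket, 0 priors, no Atty Smith
p = fraud_probability(30, 1, 2000, 1, 0, 0)
print(round(p, 3))
```

For these made-up inputs the linear score is strongly negative, so the fraud probability comes out near 0.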
6-20
Linear Discriminant Analysis
Groups objects into a predetermined set of outcome classes
Regression is one means of performing discriminant analysis
  2 groups: find a cutoff for the regression score
  More than 2 groups: multiple cutoffs
6-21
Centroid Method (NOT regression)
Binary data
Divide the training set into two groups by binary outcome
Standardize the data to remove scales
Identify the mean of each independent variable by group (the CENTROID)
Calculate a distance function
6-22
Fraud Data
Age Claim Tickets Prior Outcome
52 2000 0 1 OK
38 1800 0 0 OK
19 600 2 2 OK
21 5600 1 2 Fraud
41 4200 1 2 Fraud
6-23
Standardized & Sorted Fraud Data
Age Claim Tickets Prior Outcome
1 0.60 1 0.5 0
0.9 0.64 1 1 0
0 0.88 0 0 0
Centroid (0): 0.633 0.707 0.667 0.500
0.05 0 1 0 1
1 0.16 1 0 1
Centroid (1): 0.525 0.080 1.000 0.000
6-24
Distance Calculations
        New   To 0                     To 1
Age     0.50  (0.633-0.5)^2 = 0.018   (0.525-0.5)^2 = 0.001
Claim   0.30  (0.707-0.3)^2 = 0.166   (0.08-0.3)^2 = 0.048
Tickets 0     (0.667-0)^2 = 0.445     (1-0)^2 = 1.000
Prior   1     (0.5-1)^2 = 0.250       (0-1)^2 = 1.000
Totals        0.879                   2.049
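The table above can be reproduced directly from the two group centroids; a sketch that assigns the new case to the nearer centroid:

```python
# Group centroids from the standardized fraud data (OK = 0, Fraud = 1)
# Columns: Age, Claim, Tickets, Prior
centroid_ok = (0.633, 0.707, 0.667, 0.500)
centroid_fraud = (0.525, 0.080, 1.000, 0.000)
new_case = (0.50, 0.30, 0.0, 1.0)

def squared_distance(a, b):
    """Sum of squared differences across all standardized variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

d_ok = squared_distance(new_case, centroid_ok)
d_fraud = squared_distance(new_case, centroid_fraud)
print(round(d_ok, 3), round(d_fraud, 3))
print("OK" if d_ok < d_fraud else "Fraud")
```

The To-0 total here is 0.878 rather than the slide's 0.879 because the slide sums per-cell values that were already rounded; either way the new case is assigned to the nearer (OK) centroid.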
6-25
Discriminant Analysis with Regression
Standardized data, binary outcomes
Intercept 0.430 (P = 0.670)
Age -0.421 (P = 0.671)
Gender 0.333 (P = 0.733)
Claim -0.648 (P = 0.469)
Tickets 0.584 (P = 0.566)
Prior Claims -1.091 (P = 0.399)
Attorney 0.573 (P = 0.607)
R2 = 0.804
Cutoff: average of the two group averages: 0.429
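With regression used for discriminant analysis, classifying a new case just compares its fitted score to the cutoff; a minimal sketch using the 0.429 cutoff from this slide (the scores below are made up, and the code assumes class 1 has the higher group-average score):

```python
CUTOFF = 0.429  # average of the two group-average regression scores

def classify(score, cutoff=CUTOFF):
    """Assumes class 1 (fraud) has the higher group average: scores above the cutoff go to 1."""
    return 1 if score > cutoff else 0

# Hypothetical regression scores for three new cases
for score in (0.12, 0.43, 0.95):
    print(score, classify(score))
```

With more than two groups, the same idea extends to multiple cutoffs along the score axis.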
6-26
Case: Stepwise Regression
Stepwise regression: automatic selection of independent variables
  Look at F scores of simple regressions
  Add the variable with the greatest F statistic
  Check partial F scores for adding each variable not in the model
  Delete variables that are no longer significant
  If no external variable is significant, quit
Considered inferior to selection of variables by experts
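The loop above can be sketched with partial F tests; a simplified forward/backward version (the F-to-enter and F-to-remove thresholds and the synthetic data are assumptions, not from the text):

```python
import numpy as np

def fit_sse(X, y):
    """OLS fit with an intercept; return the sum of squared errors."""
    A = np.column_stack([np.ones(len(y)), X]) if X.shape[1] else np.ones((len(y), 1))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def stepwise(X, y, f_in=4.0, f_out=3.9):
    """Forward-add the variable with the largest partial F; drop any whose F falls below f_out."""
    n, p = X.shape
    selected = []
    for _ in range(2 * p):                      # guard against cycling
        sse_cur = fit_sse(X[:, selected], y)
        best_j, best_f = None, 0.0
        for j in range(p):                      # partial F for each variable not in the model
            if j in selected:
                continue
            sse_new = fit_sse(X[:, selected + [j]], y)
            f = (sse_cur - sse_new) / (sse_new / (n - len(selected) - 2))
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None or best_f < f_in:
            break                               # no external variable significant: quit
        selected.append(best_j)
        for j in list(selected):                # delete variables no longer significant
            rest = [k for k in selected if k != j]
            sse_full = fit_sse(X[:, selected], y)
            f = (fit_sse(X[:, rest], y) - sse_full) / (sse_full / (n - len(selected) - 1))
            if f < f_out:
                selected.remove(j)
    return selected

# Synthetic data: only variables 0 and 2 actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2 * X[:, 0] + 3 * X[:, 2] + rng.normal(scale=0.5, size=60)
print(sorted(stepwise(X, y)))
```

The two genuinely predictive variables are picked up; whether a spurious one sneaks in depends on the noise draw, which is part of why expert variable selection is usually preferred.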
6-27
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
Data on 244,000 credit card accounts
12-month period
1 percent defaulted
Cost of granting a loan that defaults: almost $5,000
Cost of denying a loan that would have paid: about $50
6-28
Data Treatment
Divided observations into 5 groups
Used one group for training (any smaller would have had problems due to insufficient default cases)
Used 80% of the data for detailed testing
Regression performed better than the C5 model, even though C5 used costs and regression didn't
6-29
Summary
Regression: a basic classical model, with many forms
Logistic regression is very useful in data mining
  Often have binary outcomes
  Can also be used on categorical data
Can be used for discriminant analysis, to classify