
Page 1

Outlier Detection in Logistic Regression: A Quest for Reliable Knowledge from

Predictive Modeling and Classification

Abdul Nurunnabi, Geoff West Department of Spatial Sciences, Curtin University, Perth, Australia

CRC for Spatial Information (CRCSI) [email protected]

[email protected]

Page 2

Objectives

Identification of multiple influential observations in logistic regression

Classification of outliers on a graphical plot

Investigating the importance of outlier treatment for reliable knowledge discovery

Page 3

Outlier (what, when and how?)

"An outlier is an observation that deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." — Hawkins (1980)

Causes of outliers: Outliers occur very frequently in real data, and often go unnoticed because much data is processed by computers without careful inspection and screening. They may appear because of human error such as keypunch errors, mechanical faults (such as transmission or recording errors), changes in system behavior, exceptional events (natural disasters such as earthquakes and floods), instrument error, or simply through natural deviations in populations.

Outliers’ effects: The presence of outliers in a dataset may cause parameter estimates to be erroneous and outcomes to be misclassified, consequently creating problems when making inferences with the wrong model and drawing unreliable conclusions and decisions.

Page 4

The typical steps constituting the KDD process

Outliers and Reliability Issues

Reliability issues and outliers interact through five questions [Fayyad et al. 1996; Dai et al. 2012]:

i. What are the major factors that can make the discovery process unreliable?
ii. How can we make sure that the discovered knowledge is reliable?
iii. Under what conditions can a reliable discovery be assured?
iv. What techniques are there that can improve the reliability of discovered knowledge?
v. When can we trust that the discovered knowledge is reliable and reflects the real data?

Page 5

Logistic Regression

Logistic regression is useful for situations in which we want to predict the response (presence or absence of a characteristic or outcome) based on the values of a set of predictor variables.

It can be classified into three types based on the categorical response variable: binary, ordinal and nominal. A binary response has two categories with no natural order (for example, success-failure or yes-no). An ordinal response has three or more categories with a natural ordering (e.g. none, mild and severe). A nominal response has three or more categories with no natural ordering (for example, blue, black, red, yellow; or sunny, rainy and cloudy).

In this presentation, we cover only binary logistic regression.

Page 6

Logistic Regression and Outliers

The customary model for LR is:

E(Y | X) = π(X)

The log of π(X)/(1 − π(X)) can be defined as a linear function of X, called the logit (log odds):

g(X) = ln[π(X)/(1 − π(X))] = log(odds) = β_0 + β_1X_1 + ⋯ + β_pX_p = Xβ

so that

π = e^(β_0 + β_1X_1 + ⋯ + β_pX_p) / (1 + e^(β_0 + β_1X_1 + ⋯ + β_pX_p)),  0 ≤ π ≤ 1.

The LR model can be re-written as:

Y = π + ε

where Y is a vector of binary (0, 1) responses and ε is the error term:

ε = 1 − π with probability π, if y = 1;  ε = −π with probability 1 − π, if y = 0.

[Figure: Outliers, linear and logistic (S-curve) models.]
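The model is usually fitted by maximum likelihood. As a hedged illustration (a minimal numpy sketch on synthetic data, not the authors' code), the following fits binary LR by Newton-Raphson (IRLS) and checks that the logit of the fitted probabilities is exactly the linear predictor:

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Fit binary logistic regression by Newton-Raphson (IRLS).
    X: (n, p) covariates without intercept; y: (n,) binary 0/1 responses."""
    Xd = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(Xd @ beta)))  # pi = e^eta / (1 + e^eta)
        W = pi * (1.0 - pi)                      # IRLS weights v_i = pi_i(1 - pi_i)
        step = np.linalg.solve(Xd.T @ (Xd * W[:, None]), Xd.T @ (y - pi))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# synthetic data with true beta = (0.5, 2.0)
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = (rng.random(300) < 1 / (1 + np.exp(-(0.5 + 2.0 * x)))).astype(float)
beta = fit_logistic(x[:, None], y)
pi_hat = 1 / (1 + np.exp(-(beta[0] + beta[1] * x)))
logit = np.log(pi_hat / (1 - pi_hat))            # recovers beta0 + beta1 * x
```

The last line confirms the defining identity of the model: the log odds of the fitted probabilities lie exactly on the fitted line.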

Page 7

Types of Outliers

Typically, outliers in regression can be categorized into three classes:
- Deviation/change in the X (explanatory) space: leverage points
- Deviation in Y (the response variable) but not in X: vertical outliers
- Deviation in both the X and Y spaces

Influential observations are defined as points which, either individually or together with several other observations, have a demonstrably larger impact on the calculated values of various estimates (coefficients, standard errors, t-values, etc.) (Belsley et al. 1980).

In logistic regression, outliers and influential observations may occur as misclassification between the binary (0, 1) responses, or through meaningful deviation (possibly with low leverage) in the explanatory variables (Nurunnabi et al. 2010).

Page 8

Outlier Detection: Single-case Deletion Approach

The ith residual in LR can be defined as:

ε̂_i = y_i − π̂_i

The projection (leverage) matrix H gives the fitted values of the response variable as the projection onto the covariate space. It is defined as:

H = V^(1/2) X (X^T V X)^(−1) X^T V^(1/2)

where V is a diagonal matrix with diagonal elements v_ii = π̂_i(1 − π̂_i). The ith diagonal element of H is:

h_ii = π̂_i(1 − π̂_i) x_i^T (X^T V X)^(−1) x_i

The standardized Pearson residual for LR is defined as:

r_si = (y_i − π̂_i) / √(v_ii (1 − h_ii))

Observations with h_ii > ck/n (c = 2 or 3, k = p + 1) are generally identified as high leverage points, and observations with |r_si| ≥ 3 as outliers.

DFFITS in LR is defined as:

DFFITS_i = (ŷ_i − ŷ_(i)) / √(v_(i) h_(i))

where the subscript (i) denotes quantities computed with the ith case deleted. Observations with |DFFITS_i| > 3√(k/n) are identified as influential cases.
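The leverage and residual diagnostics are easy to compute once the model is fitted. A minimal numpy sketch (synthetic data; illustrative only, not the authors' code) — a useful sanity check is that the leverages sum to k, since trace(H) = k:

```python
import numpy as np

def fit_logistic(Xd, y, n_iter=50):
    """Newton-Raphson MLE for LR; Xd already includes the intercept column."""
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        pi = 1 / (1 + np.exp(-(Xd @ beta)))
        W = pi * (1 - pi)
        beta += np.linalg.solve(Xd.T @ (Xd * W[:, None]), Xd.T @ (y - pi))
    return beta, 1 / (1 + np.exp(-(Xd @ beta)))

def single_case_diagnostics(Xd, y, pi_hat):
    """Leverage h_ii and standardized Pearson residual r_si."""
    v = pi_hat * (1 - pi_hat)                        # diagonal of V
    M = np.linalg.inv(Xd.T @ (Xd * v[:, None]))      # (X^T V X)^{-1}
    h = v * np.einsum('ij,jk,ik->i', Xd, M, Xd)      # h_ii = v_i x_i^T M x_i
    r_s = (y - pi_hat) / np.sqrt(v * (1 - h))        # standardized Pearson
    return h, r_s

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = (rng.random(100) < 1 / (1 + np.exp(-x))).astype(float)
Xd = np.column_stack([np.ones(100), x])
beta, pi_hat = fit_logistic(Xd, y)
h, r_s = single_case_diagnostics(Xd, y, pi_hat)

n, k = Xd.shape
high_leverage = np.where(h > 2 * k / n)[0]           # c = 2 rule of thumb
outliers = np.where(np.abs(r_s) >= 3)[0]             # |r_si| >= 3 rule
```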

Page 9

Modification for Group Deletion Approach

The single-case deletion measures are naturally affected by the masking and swamping phenomena and fail to detect outliers in the presence of multiple outliers and/or influential cases. Masking occurs when an outlying subset goes undetected because of the presence of another, usually adjacent, subset. Swamping occurs when good observations are incorrectly identified as outliers because of the presence of another, usually remote, subset of observations. The group deletion approach forms a clean subset of the data that is presumably free of outliers, and then tests the outlyingness of the remaining points relative to that clean subset.

Page 10

Outlier Detection: Group Deletion Approach

These methods find a suspect group D of d outlying/unusual cases with the help of graphical methods, robust techniques such as LMS and RLS, and/or appropriate diagnostic measures. The explanatory variables (X), the response variable (Y) and the variance-covariance matrix V are partitioned into the deletion group D and the clean set R:

X = [X_R; X_D],  Y = [Y_R; Y_D],  V = [V_R 0; 0 V_D]

Fitting the model on the clean set R gives, for every case i:

π̂_i(R) = exp(x_i^T β̂_(R)) / (1 + exp(x_i^T β̂_(R)))

ε̂_i(R) = y_i − π̂_i(R),  v_i(R) = π̂_i(R)(1 − π̂_i(R))

h_ii(R) = π̂_i(R)(1 − π̂_i(R)) x_i^T (X_R^T V_R X_R)^(−1) x_i

Generalized Standardized Pearson Residual (GSPR):

r*_si = (y_i − π̂_i(R)) / √(v_i(R)(1 − h_ii(R))),  for i ∈ R;
r*_si = (y_i − π̂_i(R)) / √(v_i(R)(1 + h_ii(R))),  for i ∈ D.

Generalized Weight (GW):

h*_ii = h_ii(R) / (1 − h_ii(R)),  for i ∈ R;
h*_ii = h_ii(R) / (1 + h_ii(R)),  for i ∈ D.

Cases with |r*_si| ≥ 3 are generally identified as outliers; cases with h*_ii > median(h*_ii) + 3 × MAD(h*_ii) are identified as high leverage points.
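A sketch of GSPR and GW under these definitions (numpy, synthetic data; illustrative only, not the authors' code). One gross misclassification far out in X is planted and treated as the suspect group D; because it is nearly impossible under the clean-set fit, its |GSPR| lands well above the cut-off of 3:

```python
import numpy as np

def fit_logistic(Xd, y, n_iter=50):
    """Newton-Raphson MLE for LR; Xd includes the intercept column."""
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        pi = 1 / (1 + np.exp(-(Xd @ beta)))
        beta += np.linalg.solve(Xd.T @ (Xd * (pi * (1 - pi))[:, None]),
                                Xd.T @ (y - pi))
    return beta

def gspr_gw(Xd, y, D):
    """GSPR r*_si and GW h*_ii from a fit on the clean subset R (all not in D)."""
    R = np.setdiff1d(np.arange(len(y)), D)
    beta_R = fit_logistic(Xd[R], y[R])
    pi_R = 1 / (1 + np.exp(-(Xd @ beta_R)))              # pi_i(R) for ALL i
    v_R = pi_R * (1 - pi_R)
    M = np.linalg.inv(Xd[R].T @ (Xd[R] * v_R[R][:, None]))  # (X_R^T V_R X_R)^-1
    h_R = v_R * np.einsum('ij,jk,ik->i', Xd, M, Xd)      # h_ii(R)
    sign = np.where(np.isin(np.arange(len(y)), D), 1.0, -1.0)  # +h for D, -h for R
    r_star = (y - pi_R) / np.sqrt(v_R * (1 + sign * h_R))       # GSPR
    h_star = h_R / (1 + sign * h_R)                              # GW
    return r_star, h_star

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2 * x)))).astype(float)
# plant one gross misclassification far in X; treat it as the suspect group D
x = np.append(x, 4.0); y = np.append(y, 0.0)
Xd = np.column_stack([np.ones(len(y)), x])
D = np.array([len(y) - 1])
r_star, h_star = gspr_gw(Xd, y, D)
```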

Page 11

Identification of Multiple Influential Observations

Mahalanobis Distance:

MD_i = √[(Z_i − Z̄)^T Σ^(−1) (Z_i − Z̄)]

where Z is an m-variate with mean Z̄ and covariance matrix Σ.

The proposed Influence Distance (ID) applies this idea to the generalized residual-leverage matrix G = [r*_si  h*_ii]:

ID_i = √[(G_i − Ḡ_(R))^T Σ_(R)^(−1) (G_i − Ḡ_(R))]

where Ḡ_(R) and Σ_(R) are the mean and covariance matrix based on the R group (excluding the observations identified as outliers by GSPR).

Page 12

Proposed Method: Algorithm 1

(i) Calculate r*_si and h*_ii using the group deletion approach.
(ii) Construct the matrix G = [r*_si  h*_ii].
(iii) Calculate Ḡ_(R) and Σ_(R) based on the R group, after the deletion of the outlying cases.
(iv) Calculate ID_i = √[(G_i − Ḡ_(R))^T Σ_(R)^(−1) (G_i − Ḡ_(R))].
(v) Find the influential observations, for which ID_i > √(χ²_2,0.975) = 2.716.
(vi) Sketch the classification plot: (a) draw a scatter plot of r*_si versus h*_ii; (b) draw cut-off lines at ±3 and at median(h*_ii) + 3 × MAD(h*_ii) through the r*_si and h*_ii axes respectively; (c) draw an influence ellipse based on the ID values and the chi-square cut-off value.

[Figure: Classification plot.]
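Steps (ii)-(v) reduce to a Mahalanobis-type distance on the two diagnostic columns. A self-contained numpy sketch with stand-in GSPR/GW values (simulated here for brevity with one planted extreme case; in practice they come from the group-deletion fit):

```python
import numpy as np

def influence_distance(r_star, h_star, R):
    """ID_i: Mahalanobis-type distance of G_i = [r*_si, h*_ii] from the
    mean and covariance of the clean group R."""
    G = np.column_stack([r_star, h_star])
    mu = G[R].mean(axis=0)
    S_inv = np.linalg.inv(np.cov(G[R].T))
    diff = G - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))

rng = np.random.default_rng(4)
n = 100
r_star = rng.normal(0.0, 1.0, n)             # stand-in GSPR values
h_star = np.abs(rng.normal(0.03, 0.01, n))   # stand-in GW values
r_star[-1] = -12.0                           # planted influential case
R = np.arange(n - 1)                         # clean group excludes it

ID = influence_distance(r_star, h_star, R)
influential = np.where(ID > 2.716)[0]        # chi-square based cut-off from step (v)
```

The planted case sits roughly twelve standard deviations out in the residual coordinate, so its ID far exceeds 2.716, while the bulk of the clean cases falls below it.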

Page 13

Modified Brown Data — diagnostic results. The table is printed in two panels (cases 1-28 and 29-55); each row lists i, |r_si|, h_ii, |DFFITS_i|, |r*_si|, h*_ii and ID_i, with the cut-off values 3.00, 0.073, 0.572, 3.00, 0.081 and 2.716 respectively.

1 -0.740 0.029 -0.124 -0.520 0.040 0.578 29 -0.737 0.027 -0.120 -0.543 0.037 0.655 2 -0.728 0.023 -0.110 -0.615 0.029 0.906 30 -0.753 0.036 -0.141 -0.441 0.052 0.459 3 -0.737 0.027 -0.120 -0.543 0.037 0.655 31 -0.729 0.024 -0.111 -0.602 0.030 0.866 4 -0.734 0.026 -0.116 -0.566 0.034 0.739 32 -0.723 0.022 -0.106 -0.655 0.026 1.017 5 -0.737 0.027 -0.120 -0.543 0.037 0.655 33 1.391 0.029 0.237 1.998 0.040 1.994 6 -0.739 0.028 -0.122 -0.531 0.038 0.615 34 1.396 0.026 0.228 1.869 0.035 1.863 7 -0.743 0.030 -0.128 -0.499 0.043 0.515 35 1.393 0.028 0.234 1.954 0.038 1.948 8 -0.719 0.020 -0.102 -0.698 0.024 1.104 36 -0.740 0.029 -0.124 -0.520 0.040 0.578 9 1.406 0.023 0.216 1.672 0.029 1.696 37 -0.718 0.020 -0.101 -0.714 0.023 1.126

10 -0.729 0.024 -0.111 -0.602 0.030 0.866 38 -0.674 0.030 -0.111 -1.740 0.131 3.173 11 -0.719 0.020 -0.102 -0.698 0.024 1.104 39 -0.701 0.018 -0.093 -0.949 0.034 1.091 12 -0.707 0.018 -0.095 -0.849 0.026 1.165 40 -0.681 0.025 -0.103 -1.476 0.101 2.215 13 -0.715 0.019 -0.099 -0.745 0.023 1.159 41 -0.714 0.019 -0.098 -0.761 0.023 1.170 14 1.431 0.019 0.198 1.315 0.023 1.428 42 1.475 0.020 0.204 0.926 0.056 1.054 15 -0.742 0.029 -0.126 -0.510 0.042 0.545 43 1.466 0.019 0.200 0.983 0.046 0.999 16 -0.739 0.028 -0.122 -0.531 0.038 0.615 44 1.453 0.018 0.196 1.089 0.034 1.113 17 -0.737 0.027 -0.120 -0.543 0.037 0.655 45 1.438 0.018 0.196 1.234 0.025 1.335 18 -0.699 0.019 -0.093 -0.993 0.038 1.063 46 1.458 0.019 0.197 1.045 0.038 1.048 19 -0.693 0.020 -0.094 -1.114 0.052 1.101 47 1.438 0.018 0.196 1.234 0.025 1.335 20 -0.678 0.027 -0.106 -1.584 0.114 2.624 48 1.431 0.019 0.198 1.315 0.023 1.428 21 -0.734 0.026 -0.116 -0.566 0.034 0.739 49 1.469 0.019 0.201 0.964 0.049 1.003 22 -0.702 0.018 -0.094 -0.928 0.032 1.108 50 1.431 0.019 0.198 1.315 0.023 1.428 23 1.520 0.028 0.244 0.689 0.118 2.706 51 1.443 0.018 0.195 1.183 0.027 1.264 24 -0.634 0.186 -0.275 -9.979 0.051 9.995 52 1.489 0.022 0.214 0.839 0.075 1.420 25 -0.650 0.074 -0.167 0.274 0.126 2.837 53 -0.656 0.058 -0.148 0.339 0.132 3.061 26 1.469 0.019 0.201 0.964 0.049 1.003 54 -0.634 0.222 -0.308 -13.311 0.037 13.426 27 -0.753 0.036 -0.141 -0.441 0.052 0.459 55 -0.637 0.281 -0.366 -20.662 0.021 20.914 28 -0.737 0.027 -0.120 -0.543 0.037 0.655

i L.N.I. A.P. i L.N.I. A.P. i L.N.I. A.P. i L.N.I. A.P. 1 0 48 15 0 47 29 0 50 43 1 81 2 0 56 16 0 49 30 0 40 44 1 76 3 0 50 17 0 50 31 0 55 45 1 70 4 0 52 18 0 78 32 0 59 46 1 78 5 0 50 19 0 83 33 1 48 47 1 70 6 0 49 20 0 98 34 1 51 48 1 67 7 0 46 21 0 52 35 1 49 49 1 82 8 0 62 22 0 75 36 0 48 50 1 67 9 1 56 23 1 99 37 0 63 51 1 72

10 0 55 24 0 187 38 0 102 52 1 89 11 0 62 25 (1)0 136 39 0 76 53 (1)0 126 12 0 71 26 1 82 40 0 95 54 0 200 13 0 65 27 0 40 41 0 66 55 0 220

14 1 67 28 0 50 42 1 84

Diagnostic Results for Modified Brown Data

Experiment 1. Modified Brown Data

Page 14

Experiment 1. Modified Brown Data

Modified Brown data: (a) scatter plot of L.N.I. versus A.P.; (b) index plot of standardized Pearson residuals; (c) index plot of leverage values; (d) index plot of DFFITS; (e) index plot of GSPR; (f) index plot of GW; (g) index plot of ID; (h) classification plot.

Page 15

Experiment 2. Modified Finney Data

Modified Finney Data — diagnostic results. The table is printed in two panels (cases 1-20 and 21-39); each row lists i, |r_si|, h_ii, |DFFITS_i|, |r*_si|, h*_ii and ID_i, with the cut-off values 3.00, 0.154, 0.832, 3.00, 0.044 and 2.716 respectively.

1 0.1491 0.072 -0.127 0.000 0.00000 0.532 21 -0.6157 0.100 -0.201 -0.001 0.00007 0.533 2 0.1543 0.068 -0.112 0.000 0.00000 0.532 22 -0.7032 0.061 -0.175 -0.005 0.00068 0.532 3 0.5799 0.061 0.148 0.035 0.01708 0.444 23 -1.0176 0.045 -0.194 -0.151 0.09264 0.331 4 1.6914 0.071 0.353 587.164 0.00013 965.085 24 -1.1836 0.048 -0.210 -0.615 0.17485 0.937 5 0.6074 0.118 0.209 0.041 0.02304 0.417 25 0.6301 0.052 0.134 0.079 0.05443 0.288 6 0.5683 0.149 0.215 0.020 0.00763 0.489 26 -0.5520 0.084 -0.166 0.000 0.00001 0.533 7 -0.3570 0.093 -0.099 0.000 0.00000 0.532 27 0.6174 0.067 0.143 0.061 0.03996 0.344 8 -0.9807 0.039 -0.190 -0.110 0.06880 0.357 28 -0.9577 0.044 -0.201 -0.086 0.05323 0.389 9 -0.4759 0.094 -0.150 0.000 0.00000 0.532 29 0.7854 0.096 0.222 0.576 0.42628 1.829

10 2.7516 0.102 0.790 44522.925 0.00000 73194.797 30 -0.7693 0.128 -0.244 -0.008 0.00156 0.531 11 2.8168 0.098 0.774 56039.735 0.00000 92128.254 31 0.4223 0.160 0.184 0.001 0.00009 0.531 12 -1.1087 0.110 -0.359 -0.276 0.27368 0.572 32 -1.3568 0.288 -0.444 -1.144 1.28008 4.314 13 -1.3554 0.124 -0.459 -1.997 0.90055 3.504 33 -1.0578 0.037 -0.200 -0.229 0.14620 0.342 14 0.5540 0.058 0.134 0.023 0.00941 0.480 34 0.7921 0.045 0.171 0.731 0.30392 1.639 15 0.4682 0.151 0.171 0.003 0.00043 0.528 35 0.8071 0.039 0.161 0.870 0.24012 1.678 16 0.3559 0.088 0.096 0.000 0.00001 0.532 36 0.5649 0.126 0.201 0.020 0.00801 0.488 17 0.1516 0.057 -0.127 0.000 0.00000 0.532 37 -0.9577 0.044 -0.201 -0.086 0.05323 0.389 18 1.6115 0.066 0.335 386.514 0.00026 635.219 38 -0.7972 0.061 -0.198 -0.015 0.00438 0.524 19 -1.2160 0.073 -0.243 -0.718 0.21561 1.072 39 0.9132 0.037 0.173 2.658 0.18395 4.477 20 0.5176 0.066 0.122 0.012 0.00351 0.511

i Y Vol. Rate i Y Vol. Rate i Y Vol. Rate 1 1 3.70 0.825 14 1 1.40 2.330 27 1 1.80 1.500 2 1 3.50 1.090 15 1 0.75 3.750 28 0 0.95 1.900 3 1 1.25 2.500 16 1 2.30 1.640 29 1 1.90 0.950 4 1 0.75 1.500 17 1 3.20 1.600 30 0 1.60 0.400 5 1 0.80 3.200 18 1 0.85 1.415 31 1 2.70 0.750 6 1 0.70 3.500 19 0 1.70 1.060 32 0 2.35 0.030 7 0 0.60 0.750 20 1 1.80 1.800 33 0 1.10 1.830 8 0 1.10 1.700 21 0 0.40 2.000 34 1 1.10 2.200 9 0 0.90 0.750 22 0 0.95 1.360 35 1 1.20 2.000

10 (0)1 0.90 0.450 23 0 1.35 1.350 36 1 0.80 3.330 11 (0)1 0.80 0.570 24 0 1.50 1.360 37 0 0.95 1.900 12 0 0.55 2.750 25 1 1.60 1.780 38 0 0.75 1.900 13 0 0.60 3.000 26 0 0.60 1.500 39 1 1.30 1.625

Modified Finney Data Diagnostic Results for Modified Finney Data

Page 16

Modified Finney data: (a) character plot of rate versus volume with the response values (1, 0); (b) index plot of standardized Pearson residuals; (c) index plot of leverage values; (d) index plot of DFFITS; (e) index plot of GSPR; (f) index plot of GW; (g) index plot of log(ID); (h) classification plot.

Experiment 2. Modified Finney Data

Page 17

Reliability Checking: Model Parameter Estimation and Testing

Modified Brown Data

LR model fit and significance test

Results for all observations versus results without outliers.

Parameter estimation (all observations):

Predictor | Coef.  | S.E.  | Z     | P     | Odds Ratio | 95% CI Lower | Upper
Constant  | -0.463 | 0.674 | -0.69 | 0.492 |            |              |
A.P.      | -0.003 | 0.008 | -0.42 | 0.677 | 1.00       | 0.98         | 1.01

Parameter estimation (without outliers):

Predictor | Coef.  | S.E.  | Z     | P     | Odds Ratio | 95% CI Lower | Upper
Constant  | -4.134 | 1.486 | -2.78 | 0.005 |            |              |
A.P.      | 0.055  | 0.022 | 2.51  | 0.012 | 1.06       | 1.01         | 1.10

Test that all slopes are zero — all observations: G = 0.183, df = 1, P-value = 0.669; without outliers: G = 7.31, df = 1, P-value = 0.007.

Goodness-of-fit tests (χ², df, p) — all observations: Pearson 42.144, 34, 0.159; Deviance 53.407, 34, 0.018; Hosmer-Lemeshow (H-L) 23.059, 8, 0.003. Without outliers: Pearson 33.295, 28, 0.225; Deviance 41.167, 28, 0.052; Hosmer-Lemeshow (H-L) 6.044, 8, 0.642.

Model summary — all observations: Log-Likelihood (LL) -34.681, -2LL 69.363, Cox & Snell R² 0.003, Nagelkerke R² 0.005. Without outliers: LL -28.562, -2LL 57.123, Cox & Snell R² 0.139, Nagelkerke R² 0.190.

Predicted probabilities versus A.P.
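The odds-ratio column follows directly from the coefficient and its standard error: OR = exp(coef), with Wald 95% CI exp(coef ± 1.96·SE). A quick numpy check against the A.P. row of the cleaned fit (inputs taken from the table; the table's Z = 2.51 comes from unrounded values, while the rounded inputs give 2.5):

```python
import numpy as np

coef, se = 0.055, 0.022               # A.P. row, fit without outliers
odds_ratio = np.exp(coef)             # ~1.06, matching the table
ci_low = np.exp(coef - 1.96 * se)     # ~1.01, lower CI bound
ci_high = np.exp(coef + 1.96 * se)    # ~1.10, upper CI bound
z = coef / se                         # Wald z statistic, ~2.5
```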

Page 18

Predicted status versus actual status, with all observations and without outliers.

All observations:

Predicted \ Actual | Absence (0)  | Presence (1) | Total
Absence (0)        | 32 (58.18%)  | 23 (41.82%)  | 55 (100%)
Presence (1)       | 0 (0%)       | 0 (0%)       | 0 (0%)
Total              | 32 (58.18%)  | 23 (41.82%)  | 55

Without outliers:

Predicted \ Actual | Absence (0)  | Presence (1) | Total
Absence (0)        | 25 (45.45%)  | 10 (18.18%)  | 35 (63.64%)
Presence (1)       | 7 (12.73%)   | 13 (23.64%)  | 20 (36.36%)
Total              | 32 (58.18%)  | 23 (41.82%)  | 55

Correct classification: 58.18% (all observations) versus 69.09% (without outliers).

Classification results with outliers and without outliers

Mosaic plot (a) classification with outliers (b) classification without outliers

Performance Evaluation: Classification
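The correct-classification rates follow from the confusion matrices (trace over total). A quick numpy check using the counts from the table:

```python
import numpy as np

# confusion matrices from the slide: rows = predicted (0, 1), cols = actual (0, 1)
with_outliers = np.array([[32, 23],
                          [ 0,  0]])     # everything predicted as absence
without_outliers = np.array([[25, 10],
                             [ 7, 13]])

def correct_rate(cm):
    """Overall correct-classification percentage: diagonal sum / total."""
    return 100 * np.trace(cm) / cm.sum()

rate_all = correct_rate(with_outliers)       # 32/55
rate_clean = correct_rate(without_outliers)  # (25+13)/55
```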

Page 19

ROC curve results:

Data             | Area  | S.E.  | Sig. (p) | Asymptotic 95% CI Lower | Upper
All observations | 0.219 | 0.064 | 0.000    | 0.094                   | 0.345
Without outliers | 0.781 | 0.064 | 0.000    | 0.655                   | 0.906

ROC curve (a) data with outliers (b) data without outliers

Performance Evaluation: Predictive Ability
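The area under the ROC curve can be computed without plotting via the Mann-Whitney identity: AUC is the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal numpy sketch (illustrative, not the authors' code); note that an AUC of 0.219, as in the all-observations row, means the ranking is worse than chance, and reversing the scores would give 1 − 0.219 = 0.781:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (Mann-Whitney identity); ties count one half."""
    y_true = np.asarray(y_true, dtype=bool)
    pos = np.asarray(scores, dtype=float)[y_true]
    neg = np.asarray(scores, dtype=float)[~y_true]
    diff = pos[:, None] - neg[None, :]           # all (positive, negative) pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])` gives a perfect 1.0, and negating the scores gives 0.0, which is the symmetry behind 0.781 = 1 − 0.219.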

Page 20

This paper proposes a diagnostic measure for identifying multiple influential observations in logistic regression.

It introduces a classification graph that displays outliers, high leverage points and influential observations together in a single plot.

Diagnostic results show that the proposed measure efficiently identifies multiple influential cases, and the graph is helpful for visualizing outlier categories.

Results show that without careful outlier investigation, it may not be possible to get reliable knowledge using logistic regression for predictive modeling and classification.

Conclusions

Page 21

The outlier investigation in logistic regression is closely related to the issues raised for reliable knowledge discovery:

(i) Outlier detection is one of the major factors that affect the reliability of the discovery process.
(ii) The conditions for reliable knowledge discovery can be improved by parameter estimation and testing the significance of the estimates.
(iii) Proper outlier diagnostics and treatment (deletion or correction of the outlying observations) can improve the reliability of discovered knowledge.
(iv) We can trust that the discovered knowledge is reliable and reflects the real data if the test results meet the required statistical significance level.

Therefore, outlier detection and proper treatment are vital for obtaining reliable knowledge, and should be considered a data preprocessing step in knowledge discovery in databases (KDD).

The proposed diagnostic method is introduced for a binomial response variable in logistic regression. Future research will investigate the diagnostic method for (i) multinomial response variables, and (ii) large and high-dimensional data, as higher-dimensional data presents extra problems that need to be addressed.

Conclusions

Page 22

Questions?