SW388R7 Data Analysis & Computers II
Detecting Outliers
Detecting univariate outliers
Detecting multivariate outliers
Outliers
• Outliers are cases that have data values that are very different from the data values for the majority of cases in the data set.
• Outliers are important because they can change the results of our data analysis.
• Whether we include or exclude outliers from a data analysis depends on the reason why the case is an outlier and the purpose of the analysis.
Univariate and Multivariate Outliers
• Univariate outliers are cases that have an unusual value for a single variable. In our analyses, we will be concerned with univariate outliers for the dependent variable in our data analysis.
• Multivariate outliers are cases that have an unusual combination of values for a number of variables. The value for any of the individual variables may not be a univariate outlier, but in combination with the other variables it describes a case that occurs very rarely. In our analyses, we will be concerned with multivariate outliers for the set of independent variables in our data analysis.
Standard Scores Detect Univariate Outliers
• One way to identify univariate outliers is to convert all of the scores for a variable to standard scores.
• If the sample size is small (80 or fewer cases), a case is an outlier if its standard score is ±2.5 or beyond.
• If the sample size is larger than 80 cases, a case is an outlier if its standard score is ±3.0 or beyond.
• This method applies to interval level variables, and to ordinal level variables that are treated as metric. It does not apply to nominal level variables.
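The rule above can be sketched outside SPSS in a few lines of Python. This is a minimal illustration, not part of the course materials: the function name `univariate_outliers` and the sample data are hypothetical, and it assumes standard scores are computed with the sample standard deviation (n - 1 denominator) over the valid cases only.

```python
from statistics import mean, stdev

def univariate_outliers(values):
    """Return (index, z) pairs for cases at or beyond the z-score cutoff.

    Missing values are passed as None and skipped. The cutoff follows the
    rule above: 2.5 for 80 or fewer valid cases, 3.0 otherwise.
    """
    valid = [(i, v) for i, v in enumerate(values) if v is not None]
    cutoff = 2.5 if len(valid) <= 80 else 3.0
    scores = [v for _, v in valid]
    m, s = mean(scores), stdev(scores)  # stdev uses the n - 1 denominator
    z_scores = [(i, (v - m) / s) for i, v in valid]
    return [(i, z) for i, z in z_scores if abs(z) >= cutoff]

# Hypothetical data: 11 valid cases and one missing value.
years = [12, 12, 12, 12, 12, None, 12, 12, 12, 12, 12, 2]
flagged = univariate_outliers(years)
print(flagged)  # one case, the last one, with a z-score near -3.02
```

Because the cutoff depends on the number of valid cases, the count is taken after missing values are dropped.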
Mahalanobis D² and Multivariate Outliers
• Mahalanobis D² is a multidimensional version of a z-score. It measures the distance of a case from the centroid (multidimensional mean) of a distribution, given the covariance (multidimensional variance) of the distribution.
• A case is a multivariate outlier if the probability associated with its D² is 0.001 or less. D² follows a chi-square distribution with degrees of freedom equal to the number of variables included in the calculation.
• Mahalanobis D² requires that the variables be metric, i.e. interval level or ordinal level variables that are treated as metric.
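To make the definition concrete, here is a dependency-free Python sketch of D² for each case. It is an illustration under stated assumptions rather than the SPSS implementation: the names `solve` and `mahalanobis_d2` and the data are hypothetical, only complete cases are passed in, and the covariance matrix uses the n - 1 denominator. In practice a linear algebra library would normally replace the hand-written solver.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (m[r][k] - tail) / m[r][r]
    return x

def mahalanobis_d2(rows):
    """Squared Mahalanobis distance of each row from the centroid."""
    n, p = len(rows), len(rows[0])
    centroid = [sum(r[j] for r in rows) / n for j in range(p)]
    devs = [[r[j] - centroid[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (n - 1 denominator).
    cov = [[sum(d[a] * d[b] for d in devs) / (n - 1) for b in range(p)]
           for a in range(p)]
    # D2 = d' * inverse(cov) * d, computed as d . solve(cov, d).
    return [sum(x * y for x, y in zip(d, solve(cov, d))) for d in devs]

# Hypothetical cases: hours worked, prestige score, years of school.
rows = [[40, 45, 12], [38, 50, 14], [45, 60, 16], [42, 40, 12],
        [50, 65, 18], [36, 35, 11], [44, 55, 15], [5, 70, 8]]
d2 = mahalanobis_d2(rows)
print([round(v, 2) for v in d2])
```

A useful sanity check on any such computation: with the n - 1 covariance, the D² values sum to (n - 1) times the number of variables.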
Problem 1
In the dataset GSS2000.sav, is the following statement true,
false, or an incorrect application of a statistic?
In the dataset, there are 2 cases that should be evaluated as
univariate outliers for highest year of school completed.
1. True
2. True with caution
3. False
4. Incorrect application of a statistic
Descriptive statistics compute standard scores
To compute standard scores in SPSS, select the Descriptive Statistics | Descriptives… command from the Analyze menu.
Select the variable(s) for the analysis
First, click on the variable to be included in the analysis to highlight it.
Second, click on the right arrow button to move the highlighted variable to the list of variables.
Mark the option for computing standard scores
First, click on the checkbox to save standard score values as a new variable in the dataset. The new variable will have the letter "z" prepended to its name, e.g. the standard score variable for “educ” will be “zeduc”.
Second, click on the OK button to complete the analysis request.
The z-score variable in the data editor
The variable containing the standard scores will be added to the list of variables in the data editor.
To identify outliers below –3.0, we sort the data set in ascending order. Right click on the variable header zeduc and select the Sort Ascending command from the popup menu.
Outliers with unusually low scores
Cases that are outliers because they have unusually low scores for the variable will appear at the top of the sorted list.
Since there are 269 cases with valid data for the variable, the criterion for identifying an outlier is ±3.0.
In this example, we have two outliers with z-scores less than –3.0.
Additional information about the outliers
To see additional information about the outliers, we highlight the rows containing the outliers and scroll horizontally to other variables in which we are interested, for example, the id numbers for the cases.
The raw data scores for the outliers
Before deciding whether we retain or omit outliers from the analysis, we should examine the raw scores that made these cases outliers.
In this example, one of our subjects had completed only 2 years of school and another had completed only 3 years.
Comparing the raw scores to the mean
When we compare the raw data values of 2 and 3 to the mean (13.12) and standard deviation (2.930) of the distribution for the variable, we see why these cases are outliers for this distribution. Completing 2 and 3 years of school is unusual in a distribution that had a mean of 13 years.
The Descriptives output helps us in evaluating the raw data scores for the outliers.
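As a quick check on the arithmetic, the standard scores for these two cases can be recomputed from the mean and standard deviation reported in the Descriptives output. The snippet below is only a worked example; the helper name `z_score` is hypothetical.

```python
mean_educ, sd_educ = 13.12, 2.930  # values from the Descriptives output

def z_score(raw, mean, sd):
    """Standard score: distance from the mean in standard deviation units."""
    return (raw - mean) / sd

z_two_years = z_score(2, mean_educ, sd_educ)
z_three_years = z_score(3, mean_educ, sd_educ)
print(round(z_two_years, 2), round(z_three_years, 2))  # -3.8 -3.45
```

Both values fall beyond –3.0, which agrees with the z-scores of -3.80 and -3.45 reported for the two outlying cases.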
Outliers with unusually high scores
To identify outliers above +3.0, we sort the data set in descending order. Right click on the variable header zeduc and select the Sort Descending command from the popup menu.
Descriptive statistics compute standard scores
Cases that are outliers because they have unusually high scores for the variable will now appear at the top of the sorted list. In this example, there are no outliers with extremely large values.
The answer to this problem is True.
Univariate outliers are detected by computing standard scores for the variable. Computing standard scores requires that the variable be metric. Highest year of school completed (educ) is an interval level or metric variable, satisfying the requirement for computing standard scores.
Since there are 269 cases with valid data for the variable, the criterion for identifying an outlier is ±3.0. In this dataset, 2 cases have a z-score value outside this range (20000391: -3.45; 20001984: -3.80).
Deleting the z-score variable
Once we are finished with the outlier analysis, we should delete the variables that were added to the data set.
First, click on the zeduc column header to select the entire column.
Second, select the Clear command from the Edit menu to delete the column from the dataset.
Other problems on univariate outliers
• A problem may ask about outliers for a nominal level variable. The answer will be “An inappropriate application of a statistic” since z-scores cannot be computed for nominal level variables.
• A problem may ask about outliers for an ordinal level variable. If the number of outliers in the problem statement is accurate, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.
• A problem may contain an inaccurate number of outliers for the variable. The answer will be “False.”
Problem 2
In the dataset GSS2000.sav, is the following
statement true, false, or an incorrect application of
a statistic? Use 0.001 as the level of significance.
In the dataset, there is 1 case that should be
evaluated as a multivariate outlier for the
combination of: number of hours worked in the past
week, occupational prestige score, and highest year
of school completed.
1. True
2. True with caution
3. False
4. Incorrect application of a statistic
Mahalanobis D² is computed by Regression
To compute Mahalanobis D² in SPSS, select the Regression | Linear… command from the Analyze menu.
Adding the independent variables
The SPSS Linear Regression procedure computes Mahalanobis D² for the set of independent variables entered into the dialog box.
Move the variables hrs1, prestg80, and educ to the list of independent variables.
Adding an arbitrary dependent variable
First, arbitrarily select a variable to use as the dependent variable. The variable should be a numeric variable that does not have any missing values. For example, click on the first numeric variable in the list of variables: wrkstat.
Second, click on the right arrow button to move wrkstat to the text box for the dependent variable.
SPSS will not compute the regression unless we specify a dependent variable, even though the dependent variable is not used in the analysis of multivariate outliers.
Adding Mahalanobis D² to the dataset
To request that SPSS add the value of Mahalanobis D² to the data set, click on the Save button to open the save dialog box.
Specify saving Mahalanobis D² distance
First, mark the checkbox for Mahalanobis in the Distances panel. All other checkboxes can be unchecked.
Second, complete the request for Mahalanobis distance by clicking on the Continue button.
Specify the statistics output needed
To understand why a particular case is an outlier, we want to examine the descriptive statistics for each variable. Click on the Statistics… button to request the statistics.
Request descriptive statistics
First, mark the checkbox for Descriptives. All other checkboxes can be unchecked.
Second, complete the request for descriptive statistics by clicking on the Continue button.
Complete the request for Mahalanobis D²
To complete the request for the regression analysis that will compute Mahalanobis D², click on the OK button.
Mahalanobis D2 scores in the data editor
If we look in the column farthest to the right in the data editor, we see that SPSS has calculated the Mahalanobis D² scores for us in a variable it has named "mah_1."
The evaluation for outliers, however, requires the probability for the Mahalanobis D² and not the scores themselves.
Computing the probability of D²
To compute the probability of D², we will use an SPSS function in a Compute command.
First, select the Compute… command from the Transform menu.
Specifying the variable name and function
First, in the target variable text box, type the name "p_mah_1" as an abbreviation for the probability of mah_1, the Mahalanobis D² score.
Second, scroll down the list of functions to find CDF.CHISQ, which calculates the cumulative probability for a variable that follows a chi-square distribution, like Mahalanobis D².
Third, click on the up arrow button to move the highlighted function to the Numeric Expression text box.
Completing the specifications for the function
First, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.
Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution. The full numeric expression is 1 - CDF.CHISQ(mah_1, 3).
Second, click on the OK button to signal completion of the Compute Variable dialog.
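The same upper-tail probability can be reproduced outside SPSS. The sketch below uses only the Python standard library and a closed-form expression that holds for whole-number degrees of freedom; the function name `chi2_upper_tail` is hypothetical, and a statistics library routine such as `scipy.stats.chi2.sf` would be the more usual choice where it is available.

```python
import math

def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1.

    Equivalent to 1 minus the cumulative probability up through x.
    """
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        total = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * total
    # Odd df: an erfc term plus half-integer gamma terms.
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

# D2 values reported for the two outlying cases in this problem, 3 variables.
p_first = chi2_upper_tail(35.58, 3)
p_second = chi2_upper_tail(17.15, 3)
print(p_first < 0.0001, round(p_second, 4))  # True 0.0007
```

With 3 variables, any D² above roughly 16.27 has an upper-tail probability below 0.001, which is why both of these cases are flagged.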
Probabilities for D² in the data editor
SPSS used the compute command to calculate the probabilities for the D² scores and list them in the data editor. To find the smallest probability value, we will sort the data set in ascending order.
To sort the data set, right click on the column header p_mah_1, and select Sort Ascending from the popup menu.
Identifying outliers
Scroll down the data editor past the probabilities with missing values, which are the result of the compute command when one or more variables have missing data.
There are two values less than 0.001, displayed as .0000 and .0007.
Two cases had an unusual combination of values on the three variables, resulting in their designation as outliers.
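The sort-and-scan step amounts to filtering on the 0.001 criterion while skipping cases with missing probabilities. A minimal Python sketch, with a hypothetical function name and made-up case ids and probabilities:

```python
def flag_outliers(case_probabilities, alpha=0.001):
    """Return (p, case_id) pairs with p at or below alpha, smallest first.

    Cases whose probability is missing (None) are skipped.
    """
    return sorted(
        (p, case_id)
        for case_id, p in case_probabilities.items()
        if p is not None and p <= alpha
    )

# Hypothetical probabilities keyed by made-up case ids.
probabilities = {"A": None, "B": 0.00002, "C": 0.0007, "D": 0.0450, "E": 0.6120}
outliers = flag_outliers(probabilities)
print(outliers)  # [(2e-05, 'B'), (0.0007, 'C')]
```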
Answering the original question
The original question asked if the number of outliers for the combination of three variables is 1.
The answer to this question is False because there are two outliers.
In this dataset, 2 cases have a Mahalanobis D² with a probability less than or equal to 0.001 (20000391: D²=35.58, p<0.0001; 20001785: D²=17.15, p=0.0007).
Evaluating Multivariate Outliers
• Before we can decide whether we should omit or retain an outlier in our data analysis, we need to understand why it is an outlier.
• To accomplish this, we will move the columns for the variables adjacent to each other in the data editor so that we can compare the values for each case.
• We will compare the values for each case to the mean and standard deviation for each variable, computed in the descriptive statistics section of the regression output.
Moving columns in the data editor – step 1
We will move the column for the variable prestg80 next to the column for hrs1.
First, click on the column header prestg80 for the variable we want to move, so that the column is selected.
Moving columns in the data editor – step 2
Next, click and hold the left mouse button down on the column header of the variable we want to move.
A box outline will appear at the bottom of the arrow cursor, indicating that SPSS is prepared to move the column.
Moving columns in the data editor – step 3
Next, while holding the mouse button down, move the arrow cursor over columns to the left or right.
A vertical red line will appear between the columns to indicate where the column will be relocated.
When the red line is located where we want to position the column we are moving, release the mouse button. The column will now be relocated.
Moving columns in the data editor – step 4
The columns for the variables are now adjacent to one another, making it easier to compare values.
Hint: when we move a column, the command “Undo Move Variables” will appear at the top of the Edit menu. I find this command the easiest way to return the columns to their original locations in the data editor. Leaving columns in different locations can make it harder to find a variable we are looking for.
Highlighting the outliers for analysis
When I finished relocating the three variables, I moved the p_mah_1 column also, so I could easily identify which cases were outliers. Then I highlighted the outlier rows and scrolled them to the top row in the data editor.
I can now compare the values for these two cases to the mean and standard deviation of the distribution for the three variables.
Evaluating the outlier cases
Descriptive Statistics
Variable                                 Mean     Std. Deviation   N
LABOR FRCE STATUS                        1.18     .384             174
NUMBER OF HOURS WORKED LAST WEEK         41.01    12.599           174
RS OCCUPATIONAL PRESTIGE SCORE (1980)    45.16    14.188           174
HIGHEST YEAR OF SCHOOL COMPLETED         13.79    2.778            174
The number of hours worked for both cases is well below the average for the sample. The first case has an above average occupational prestige score combined with below average years of education. The second case has a below average occupational prestige score combined with above average education.
Deleting variables added to dataset
Once we are finished with the outlier analysis, we should delete the variables that were added to the data set.
First, select the mah_1 and p_mah_1 columns.
Second, select the Clear command from the Edit menu to delete the columns from the dataset.
Other problems on multivariate outliers
• A problem may ask about outliers for variables that include a nominal level variable. The answer will be “An inappropriate application of a statistic” since Mahalanobis D² cannot be computed unless all variables are metric.
• A problem may ask about outliers for variables that include an ordinal level variable. If the number of outliers in the problem statement is accurate, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.
• A problem may contain an inaccurate number of outliers for the variables. The answer will be “False.”
Steps in evaluating outliers
The following is a guide to the decision process for answering problems about outliers:
1. Are all of the variables to be evaluated metric? If no, the answer is "Incorrect application of a statistic." If yes, go to step 2.
2. Is the number of outliers stated in the problem the correct number? If no, the answer is "False." If yes, go to step 3.
3. Are any of the metric variables ordinal level? If yes, the answer is "True with caution." If no, the answer is "True."
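The decision process can also be written as a short function. This is an illustrative sketch only: the function name, its parameters, and the level labels 'nominal', 'ordinal', and 'interval' are assumptions made for the example.

```python
def evaluate_outlier_claim(levels, claimed_count, actual_count):
    """Answer an outlier problem following the decision guide.

    levels: measurement level of each variable in the problem, given as
    'nominal', 'ordinal', or 'interval' (labels assumed for this sketch).
    """
    # A nominal variable is not metric, so the statistic does not apply.
    if any(level == "nominal" for level in levels):
        return "Incorrect application of a statistic"
    # The stated number of outliers must match what we actually found.
    if claimed_count != actual_count:
        return "False"
    # Ordinal variables treated as metric call for caution.
    if any(level == "ordinal" for level in levels):
        return "True with caution"
    return "True"

# Problem 1: one interval variable, 2 outliers claimed, 2 found.
print(evaluate_outlier_claim(["interval"], 2, 2))      # True
# Problem 2: three interval variables, 1 outlier claimed, 2 found.
print(evaluate_outlier_claim(["interval"] * 3, 1, 2))  # False
```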