
Outliers

Chapter 5.3 Data Screening

Outliers can Bias a Parameter Estimate

…and the Error associated with that Estimate

Outliers

• Outlier – a case with an extreme value on one variable or on multiple variables

• Why do outliers happen?
  – Data input error
  – The case is not from the population you meant to sample
  – The case is from the population, but the distribution has really long tails and very extreme values

Outliers

• Outliers – Two Types

• Univariate – for basic univariate statistics
  – Use these when you have ONE DV or Y variable.

• Multivariate – for some univariate statistics and all multivariate statistics
  – Use these when you have multiple continuous variables or lots of DVs.

Outliers

• Univariate
  – In a normal z-distribution, scores beyond +/- 3 make up less than 0.3% of the population.
  – Therefore, we want to eliminate people whose scores are SO far away from the mean that they are very strange. (A quick R sketch follows below.)
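
A minimal sketch of univariate screening in R, not from the original slides; the data frame name (dataset) and column name (score) are hypothetical stand-ins.

# z-scores for one hypothetical column; cases beyond +/- 3 are flagged
z <- as.numeric(scale(dataset$score))
summary(abs(z) > 3)                    # TRUE = potential univariate outlier
no_uni_out <- dataset[abs(z) <= 3, ]   # keep only cases within +/- 3 SD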

Outliers

• Univariate outliers are fine and dandy, but you may have lots of data and don't want to screen each column one at a time.
  – Plus, the multivariate outlier analysis works just as well whether it's one column or 500, so let's just do that.

Outliers

• Multivariate
  – Now we need some way to measure distance from the mean (because z-scores are distances from the mean), but from the mean of means (all of the means at once!).

• Mahalanobis distance
  – Creates a distance from the centroid (the mean of means).

Outliers

• Mahalanobis
  – The centroid is created by plotting the means of all the variables in multidimensional space, and each case's distance from that centroid is measured.
  – Similar to Euclidean distance, but it accounts for the spread and correlation of the variables. (A toy illustration follows below.)
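
A toy illustration of what mahalanobis() computes, not from the original slides; the numbers below are made up for the example.

# made-up data: the distance is a Euclidean-style distance from the centroid,
# rescaled by the covariance matrix
toy <- data.frame(x = c(1, 2, 3, 4, 10), y = c(2, 1, 4, 3, 12))
centroid <- colMeans(toy)                  # the "mean of means" point
covmat <- cov(toy)                         # covariance of the variables
d2 <- mahalanobis(toy, centroid, covmat)   # squared distances from the centroid
# the same value by hand for the first case:
x1 <- as.numeric(toy[1, ])
t(x1 - centroid) %*% solve(covmat) %*% (x1 - centroid)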

Outliers

• Mahalanobis
  – No set cut-off rule; use a chi-square table.
  – DF = # of variables (the DVs/variables that you used to calculate Mahalanobis)
  – Use p < .001

NOTE: The DF here has NOTHING to do with the DF for hypothesis testing.

Outliers

• So do I delete them?
  • Yes: they are far away from the middle!
  • No: they may not affect your analysis!
  • It depends: I need the sample size!
  • SO?!
    – Try it with and without them. See what happens.

FISH!

Outliers

• Important side notes:
  – For ANOVA, t-tests, and correlation: you will use a fake regression analysis. It's considered fake because it's not the real analysis, just a way to get the information you need to do data screening. (A sketch of the idea follows below.)
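
A minimal sketch of one common version of the fake-regression idea, not from the original slides: regress a randomly generated variable on your real variables, so the model exists only to produce screening output for the predictors. The names dataset, random_dv, and fake_model are hypothetical.

# assuming "dataset" holds only the numeric screening columns
random_dv  <- rchisq(nrow(dataset), df = 7)       # made-up DV; its values don't matter
fake_model <- lm(random_dv ~ ., data = dataset)   # "fake" regression used only for screening
standardized <- rstudent(fake_model)              # studentized residuals for later checks
fitted_vals  <- scale(fake_model$fitted.values)   # standardized fitted values for later checks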

Outliers

• Important side notes:
  – For regression-based tests: you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you make notes in the regression section on what's different.
  • You will also use other regression-based values for this analysis.

Outliers

• Important side note:
  – Many functions in R have their own data screening options. This guide covers global screening that is not specific to any one analysis.

Outliers

• First, figure out which columns are factors, because all columns used here need to be int or num.
  – filledin_none[ , -c(1,2)]
  – Use that dataset code in the next function. (A quick way to check column types follows below.)
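
A quick way to check column types, not from the original slides; it assumes the data frame is called filledin_none as above.

str(filledin_none)                         # shows each column's type
sapply(filledin_none, is.numeric)          # TRUE for columns mahalanobis() can use
which(!sapply(filledin_none, is.numeric))  # positions of the factor/character columns to drop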

Outliers

• Mahalanobis function
• mahalanobis(
  – dataset name,
  – colMeans(dataset name, na.rm = TRUE),
  – cov(dataset name, use = "pairwise.complete.obs")
  – )

Outliers

• mahal = mahalanobis(filledin_none[ , -c(1,2)],
                      colMeans(filledin_none[ , -c(1,2)], na.rm = TRUE),
                      cov(filledin_none[ , -c(1,2)], use = "pairwise.complete.obs"))
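
Optionally (not in the original slides), you can eyeball the distances before applying a cut-off:

summary(mahal)   # typical squared distances for your data
hist(mahal)      # a long right tail points to potential multivariate outliers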

Outliers

• Now, let's get rid of people with bad scores.
  – But what is a bad score?
  – Use a chi-square table.
  – DF = # of variables (the DVs/variables that you used to calculate Mahalanobis)
  – Use p < .001

• Oh, let's make R do it.

Outliers

• Use the qchisq function, which finds the cut-off score for you.
  – qchisq(1 - pvalue, number of columns)

• cutoff = qchisq(.999, ncol(dataset))
• cutoff = qchisq(.999, ncol(filledin_none[ , -c(1,2)]))
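
As a quick sanity check, not from the original slides: with 10 screening columns, for example, the cut-off would be qchisq(.999, 10), which is about 29.59, so any Mahalanobis value above that gets flagged.

qchisq(.999, 10)   # about 29.59 (example with 10 columns)
cutoff             # print the cut-off computed for your data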

Outliers

• So, let's see how many are bad:
  – summary(mahal < cutoff)

• Let's get rid of those peeps:
  – noout = filledin_none[ mahal < cutoff, ]
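
A quick check that the removal did what you expected, not from the original slides:

nrow(filledin_none) - nrow(noout)        # how many cases were dropped
table(mahal < cutoff, useNA = "ifany")   # FALSE = outliers, TRUE = kept, NA = missing data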
