Robust Estimators

Upload: moris-cox

Post on 13-Dec-2015


Page 1: Robust Estimators. Estimating Population Parameters The mean (X) and variance (s 2 ) are the descriptive statistics that are most commonly used to represent

Robust Estimators


Estimating Population Parameters

• The mean (X̄) and variance (s²) are the descriptive statistics most commonly used to represent the data points of a sample.

• The real reason they are the preferred measures of central tendency and variability is that they have certain properties as estimators of their corresponding population parameters, μ and σ².


Estimating Population Parameters (cont.)

• Four properties are considered desirable in an estimator of a population parameter: sufficiency, unbiasedness, efficiency, and resistance.

• Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties.

• To understand these properties, you first need to understand a concept in statistics called the sampling distribution.


Sampling Distribution Demo

Basically, the idea is this: in order to examine the properties of a statistic, we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.

• Check out this demonstration
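
The repeated-sampling idea can be sketched in a few lines of Python. This is only an illustration: the population (normal, mean 50, SD 10), the sample size of 25, and the 2,000 repetitions are assumptions chosen for the demo, not values from the slides.

```python
import random
import statistics

def sampling_distribution(statistic, population, sample_size, n_samples, seed=0):
    """Draw repeated samples and return the statistic computed on each one."""
    rng = random.Random(seed)
    return [statistic(rng.sample(population, sample_size)) for _ in range(n_samples)]

# Hypothetical population: 10,000 scores drawn from a normal distribution.
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

sample_means = sampling_distribution(
    statistics.mean, population, sample_size=25, n_samples=2_000
)

print("population mean:     ", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("SD of sample means:  ", round(statistics.stdev(sample_means), 2))
```

The list `sample_means` is an empirical sampling distribution of the mean. Its center speaks to bias and its spread speaks to efficiency, the two properties discussed next.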


Properties of a Statistic

• 1) Sufficiency

A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter. For example, this property makes the mean more attractive as a measure of central tendency compared to the mode or median.


Estimating Population Parameters

• 2) Unbiasedness

A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the statistic across a very large number of samples) is equal to the population parameter it is estimating.

• Using the resampling procedure, the mean can be shown to be an unbiased estimator.


Assessing the Bias of an Estimator

Explanation of N-1 in the s² formula.

• If the more intuitive formula for s² is used:

$$s_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$

• it turns out to underestimate σ².


Assessing the Bias of an Estimator (cont.)

• This tendency to underestimate arises because deviations are taken from the sample mean rather than the population mean. It can be shown that the bias is removed if N-1 is used in the denominator instead of N.

• Note that this applies only when calculating s². If you have measured the entire population and want to calculate σ², use N in the denominator, not N-1.

$$s_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$
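
The bias can be checked by simulation. The sketch below is illustrative only: it assumes a normal population with σ² = 4, samples of size 5, and 10,000 repetitions, none of which come from the slides. It averages both versions of the variance across samples.

```python
import random
import statistics

rng = random.Random(1)
SIGMA = 2.0      # assumed population SD, so sigma^2 = 4
N = 5            # small samples make the bias easy to see
REPS = 10_000

divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(REPS):
    sample = [rng.gauss(0, SIGMA) for _ in range(N)]
    divide_by_n.append(statistics.pvariance(sample))          # denominator N
    divide_by_n_minus_1.append(statistics.variance(sample))   # denominator N - 1

mean_biased = statistics.mean(divide_by_n)
mean_unbiased = statistics.mean(divide_by_n_minus_1)

print("true variance:             ", SIGMA ** 2)
print("average with N:            ", round(mean_biased, 2))
print("average with N - 1:        ", round(mean_unbiased, 2))
print("expected with N, (N-1)/N * sigma^2:", (N - 1) / N * SIGMA ** 2)
```

With the N denominator the average should settle near (N-1)/N × σ² = 3.2 rather than 4, while the N-1 version should settle near 4.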


Degrees of Freedom

• The mean of 6, 8, & 10 = 8.

• If I allow you to change as many of these numbers as you want BUT the mean must stay 8, how many of the numbers are you free to vary?


Degrees of Freedom

• The point of this exercise is that when the mean is fixed, it removes a degree of freedom from your sample; this is like subtracting 1 from the number of observations in your sample.

• It is for this reason that we use N-1 in the denominator when we calculate s².

• The calculation requires that the mean be fixed first, which effectively fixes one of the data points.
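
A tiny sketch of the exercise, using a hypothetical helper `last_value` that is not part of the slides: once the mean is fixed at 8, any two of the three numbers can be chosen freely, and the third is then forced.

```python
def last_value(free_values, n, mean):
    """Given n - 1 freely chosen values, return the value the n-th one must take."""
    return n * mean - sum(free_values)

# Choose any two numbers; the third is determined so that the mean stays 8.
print(last_value([6, 8], n=3, mean=8))     # 10
print(last_value([1, 100], n=3, mean=8))   # -77
```

So only N-1 = 2 of the 3 values are free to vary.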


Estimating Population Parameters

• 3) Efficiency

The efficiency of a statistic is reflected in the variance observed when the statistic is computed across many independently drawn samples. The smaller the variance, the more efficient the statistic is said to be.
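
As a rough illustration of efficiency, the sketch below compares the sample-to-sample variance of the mean and the median. It assumes normally distributed data, samples of size 25, and 5,000 repetitions; these settings are choices made for the demo, not values from the slides.

```python
import random
import statistics

rng = random.Random(7)
REPS, N = 5_000, 25

means, medians = [], []
for _ in range(REPS):
    sample = [rng.gauss(0, 1) for _ in range(N)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

var_mean = statistics.variance(means)
var_median = statistics.variance(medians)

print("variance of the mean across samples:  ", round(var_mean, 4))
print("variance of the median across samples:", round(var_median, 4))
```

For normal data the mean should show the smaller variance, which is what makes it the more efficient of the two under that assumption.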


Estimating Population Parameters

• 4) Resistance

The resistance of an estimator refers to the degree to which the estimate is affected by extreme values.

As mentioned previously, both X̄ and s² are highly sensitive to extreme values.
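
A small made-up example (the two data sets below are hypothetical, not from the slides) shows what low resistance looks like: changing a single score moves the mean and variance a great deal while the median stays put.

```python
import statistics

clean = [2, 3, 3, 4, 4, 5]
contaminated = [2, 3, 3, 4, 4, 50]   # same data with one score replaced by an extreme value

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        label,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "variance =", round(statistics.variance(data), 2),
    )
```

The mean jumps from 3.5 to 11, while the median remains 3.5 in both cases.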


Estimating Population Parameters

• 4) Resistance

Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, and efficiency. However…


J. W. Tukey

“Statistics is a science in my opinion, and it is no more a branch of mathematics than are physics, chemistry and economics; for if its methods fail the test of experience – not the test of logic – they will be discarded”


Outline

• Different measures
– Trimmed mean, winsorized mean, M-estimators
– Median absolute deviation

• Estimation
– Bootstrap (a re-sampling procedure)

• End result
– Increase your chances of getting significant results


Two (sometimes) related questions

• What is the best measure to estimate for the scientific purpose?
– Mean vs. median

• What is the best way to measure the estimate?


These questions both concern robustness, but each also concerns something else:

Which measure: measurement level and meaningfulness of a statistic.

How measured: the distribution of the data


Robust Estimation

• What is Robust?
– Influence plots, breakdown points (resistance)

• Robust Estimators
– Trimmed, winsorized, and the M-Estimators

• Computer intensive methods
– The bootstrap

• Mostly on central tendency, but briefly on others.


What to measure: Central Tendency

[Figure: Histogram of "How many hours of TV watched yesterday" (x-axis: VAR00001, 0 to 10; y-axis: Frequency, 0 to 6). Mean = 3.95, Std. Dev. = 2.11449, N = 20.]


Trimmed Mean

• Simply cut off the ends
• Example: 10% trim
• N = 20
• Cut out 2 values (10%), lowest and highest
• Mean = 3.95
• Median = 3.50
• 10% trimmed mean = 3.78

Original data: 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8 10

After trimming: 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8
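
The trimmed mean above can be reproduced with a short sketch. The function name `trimmed_mean` and its `total_proportion` parameter are illustrative choices. The code follows the slide's convention of removing 10% of the values in total (one from each end here); note that some software instead interprets a "10% trim" as 10% from each tail.

```python
import statistics

tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def trimmed_mean(data, total_proportion):
    """Mean after dropping total_proportion of the values, split evenly between both tails."""
    ordered = sorted(data)
    # Number of values cut from EACH end (small epsilon guards against float round-off).
    k = int(len(ordered) * total_proportion / 2 + 1e-9)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

print(round(statistics.mean(tv_hours), 2))      # 3.95
print(statistics.median(tv_hours))              # 3.5
print(round(trimmed_mean(tv_hours, 0.10), 2))   # 3.78
```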


Winsorized Mean

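• Set some percentage of the most extreme values equal to the nearest value that is not being replaced

• Doing this to 10% of our data…

• Winsorized mean = 3.90

Original data: 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8 10

After winsorizing: 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8 8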
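
A minimal sketch of the winsorized mean on the same TV data. As with the trimmed mean, the function name and the choice to split the 10% evenly across the two tails (one value at each end) are assumptions made to match the slide's numbers.

```python
tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def winsorized_mean(data, total_proportion):
    """Mean after replacing the k most extreme values in each tail with their nearest kept neighbour."""
    ordered = sorted(data)
    n = len(ordered)
    k = int(n * total_proportion / 2 + 1e-9)   # values replaced at EACH end
    if k > 0:
        ordered[:k] = [ordered[k]] * k               # low tail pulled up
        ordered[n - k:] = [ordered[n - k - 1]] * k   # high tail pulled down
    return sum(ordered) / n

print(round(winsorized_mean(tv_hours, 0.10), 2))   # 3.9
```

Unlike trimming, winsorizing keeps N at 20, so the extreme scores still count as observations but no longer pull as hard on the estimate.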


M-Estimators

• More mathematically complex, but that’s what computers are good for

• Calculate the center of the distribution and give less weight to values that are further away from center

• Different M-Estimators give different weights for deviating values
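
To make the weighting idea concrete, here is a sketch of one M-estimator of location, using Huber-type weights computed by iterative reweighting. The tuning constant `k = 1.345`, the use of the rescaled MAD as a fixed scale, and the function name are all assumptions for illustration; statistical packages use their own constants and scale estimates, so this need not reproduce any particular program's output.

```python
import statistics

tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def huber_location(data, k=1.345, tol=1e-8, max_iter=100):
    """Huber-type M-estimate of location via iteratively reweighted averaging."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    scale = 1.4826 * mad          # MAD rescaled to estimate sigma under normality
    if scale == 0:
        return med                # no spread around the median; nothing to reweight
    cutoff = k * scale
    mu = med                      # start from a resistant initial guess
    for _ in range(max_iter):
        # Full weight inside the cutoff, weight shrinking with distance outside it.
        weights = [1.0 if abs(x - mu) <= cutoff else cutoff / abs(x - mu) for x in data]
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

estimate = huber_location(tv_hours)
print(round(estimate, 2))
```

With these settings the estimate lands between the median (3.50) and the mean (3.95), because the scores of 8 and 10 are down-weighted rather than removed.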


[Figure: Weight (y-axis, 0.0 to 1.0) as a function of distance from the center (x-axis, -2 to 8) for different M-estimators. Panels: Hampel, Huber, Andrews & Bisquare, Cauchy, Fair.]


All together now

• Mean = 3.95
• Median = 3.50
• 10% trim mean = 3.78
• 10% Winsorized = 3.90
• Huber’s = 3.51
• Tukey’s s = 3.33
• Hampel’s = 3.37
• Andrews’ = 3.33


S vs. AD vs. Median Absolute Deviation

• MAD = the median absolute deviation from the median

• Efficiency: MAD > s > AD

• s vs. AD (the average absolute deviation from the mean)
– For a normal distribution, AD ≈ .8(s)

• It turns out that the efficiency advantage of s over AD is nullified, and even reversed, for data that deviate even slightly from normality.


• MAD
– For a normal distribution, MAD ≈ .67(s) or .85(AD)
– More efficient than either s or AD
– Also more resistant than either

• Tukey
– MAD is “the best of an inferior lot”
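
The three measures of spread can be compared on the TV data from the earlier slides. This is a sketch using Python's standard library; the variable names are illustrative, and dividing MAD by .6745 to put it on the scale of s assumes normality, as in the conversion factor above.

```python
import statistics

tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

mean = statistics.mean(tv_hours)
median = statistics.median(tv_hours)

s = statistics.stdev(tv_hours)                                   # N - 1 denominator
ad = statistics.mean([abs(x - mean) for x in tv_hours])          # average absolute deviation from the mean
mad = statistics.median([abs(x - median) for x in tv_hours])     # median absolute deviation from the median

print("s   =", round(s, 3))     # 2.114
print("AD  =", round(ad, 3))    # 1.45
print("MAD =", mad)             # 0.5
print("MAD rescaled to the s metric:", round(mad / 0.6745, 3))
```

The rescaled MAD is far smaller than s here because the two large scores (8 and 10) inflate s but have almost no effect on MAD.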


Classical Statistics

• Parameter Estimates (“Point” Estimates)
– Sample mean and sample standard deviation
– Sample correlation and covariance estimates
– Linear least squares model fits

• Statistical Inference
– t-statistic and t-interval for an unknown mean
– Standard errors and t-values for regression coefficients
– F-tests for regression model hypotheses


Outliers have “unbounded influence” on classical statistics, resulting in:

• Inaccurate parameter estimates and predictions

• Inaccurate statistical inference

– Standard errors are too large
– Confidence intervals are too wide
– t-statistics lack power

• In short, unreliable when outliers are present

Classical stats are often not robust


Resistance

• A Fundamental Continuity Concept
- Small changes in the data result in only small changes in the estimate
- “Change a few, so what” (J. W. Tukey)

• Valuable Consequence
- A good fit to the bulk of the data
- Reliable, automatic outlier detection


J. W. Tukey (1979)

“… just which robust/resistant methods you use is not important – what is important is that you use some. It is perfectly proper to use both classical and robust/resistant methods routinely, and only worry when they differ enough to matter. But when they differ, you should think hard.”


Bootstrap

• More voodoo

• The basic idea involves sampling with replacement to produce random samples of size n from the original data
– Each of these samples provides an estimate of the parameter of interest.
– Repeating the sampling a large number of times provides information on the variability of the estimator.
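
The procedure just described can be sketched as follows, using the TV data from the earlier slides and 1,000 resamples. The function name `bootstrap`, the fixed seed, and the rough percentile interval are illustrative assumptions; because resampling is random, the exact numbers will vary with the seed.

```python
import random
import statistics

tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def bootstrap(data, statistic, n_resamples=1000, seed=0):
    """Compute the statistic on n_resamples samples drawn with replacement, each of size len(data)."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

boot_means = bootstrap(tv_hours, statistics.mean)

boot_center = statistics.mean(boot_means)   # center of the bootstrap distribution
boot_se = statistics.stdev(boot_means)      # bootstrap estimate of the standard error

# Rough 95% percentile interval: the middle 95% of the sorted bootstrap means.
ordered = sorted(boot_means)
ci_low = ordered[int(0.025 * len(ordered))]
ci_high = ordered[int(0.975 * len(ordered)) - 1]

print("mean of bootstrap means:", round(boot_center, 2))
print("bootstrap standard error:", round(boot_se, 2))
print("rough 95% interval:", round(ci_low, 2), "to", round(ci_high, 2))
```

Passing a different function as `statistic` (for example `statistics.median`) gives the same kind of variability information for any of the robust estimators discussed earlier.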


Bootstrap

• From previous data example
– 1000 samples

• Distribution of means of each sample

• Mean = 3.94

[Figure: Density plot of the bootstrapped sample means (x-axis: Value, 2.5 to 5.0; y-axis: Density, 0.0 to 1.0; label: Param).]


Bootstrap

• With smaller data sets we should be concerned about bias
– Does our statistic reliably estimate the population parameter?

• Bootstrapped estimates and their subsequent inferential use may be preferable in such low-N situations.