Statistics: Unlocking the Power of Data Lock5

Exam 2 Review

STAT 101

Dr. Kari Lock Morgan

11/13/12

Review of Chapters 5-9

Exam 2

In class Thursday 11/15

Cumulative, covering chapters 1-9 (but not 8.2 or 9.2… everything we have done so far in the course)

Closed book, but allowed 2 double-sided pages of notes prepared by you

You will need a calculator, and will need to know how to compute p-values for normal, t, chi-square, and F distributions using your calculator

Practice exam and solutions to review problems available under documents on the course webpage

Office Hours This Week

• Tuesday
  • Prof Morgan, 1 – 2:30 pm, Old Chem 216

• Wednesday
  • Prof Morgan, 2 – 3 pm, Old Chem 216
  • Prof Morgan, 4:30 – 5:30 pm, Old Chem 216
  • Heather, 8 – 9 pm, Old Chem 211A

• Thursday
  • Prof Morgan, 1 – 2:30 pm, Old Chem 216

• Also, the Stat Education Center in Old Chem 211A is open Sunday – Thursday, 4 pm – 9 pm, with stat majors and stat PhD students available to answer questions

Data Collection

• Was the sample randomly selected?
  • Yes: Possible to generalize to the population
  • No: Should not generalize to the population

• Was the explanatory variable randomly assigned?
  • Yes: Possible to make conclusions about causality
  • No: Cannot make conclusions about causality

| Variable(s) | Visualization | Summary Statistics |
|---|---|---|
| Categorical | bar chart, pie chart | frequency table, relative frequency table, proportion |
| Quantitative | dotplot, histogram, boxplot | mean, median, max, min, standard deviation, z-score, range, IQR, five number summary |
| Categorical vs Categorical | side-by-side bar chart, segmented bar chart | two-way table, difference in proportions |
| Quantitative vs Categorical | side-by-side boxplots | statistics by group, difference in means |
| Quantitative vs Quantitative | scatterplot | correlation, simple linear regression |

Confidence Interval

• A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples

• A 95% confidence interval will contain the true parameter for 95% of all samples
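The "95% of all samples" statement can be checked by simulation. The sketch below is illustrative only: the true proportion `p = 0.4`, the sample size `n = 200`, and the function name `coverage_rate` are assumptions made up for this example, and the interval used is the normal-based interval for a proportion.

```python
import random
from statistics import NormalDist

def coverage_rate(num_samples=2000, n=200, p=0.4, seed=1):
    """Fraction of simulated samples whose 95% interval captures the true p."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence
    captured = 0
    for _ in range(num_samples):
        # Draw one sample of size n from a population with true proportion p.
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        se = (p_hat * (1 - p_hat) / n) ** 0.5
        if p_hat - z_star * se <= p <= p_hat + z_star * se:
            captured += 1
    return captured / num_samples

rate = coverage_rate()
print(f"Proportion of intervals capturing the true parameter: {rate:.3f}")
```

The printed proportion should land close to 0.95, though it will not match exactly because the normal approximation and the simulation are both approximate.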

Hypothesis Testing

• How unusual would it be to get results as extreme as (or more extreme than) those observed, if the null hypothesis is true?

• If it would be very unusual, then the null hypothesis is probably not true!

• If it would not be very unusual, then there is no evidence against the null hypothesis

p-value

• The p-value is the probability of getting a statistic as extreme as (or more extreme than) that observed, just by random chance, if the null hypothesis is true

• The p-value measures evidence against the null hypothesis

Hypothesis Testing

1. State hypotheses

2. Calculate a test statistic, based on your sample data

3. Create a distribution of this test statistic, as it would be observed if the null hypothesis were true

4. Use this distribution to measure how extreme your test statistic is
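The four steps above can be sketched as a randomization test for a difference in means. This is a minimal illustration, not the course's own code: the data values, the group names, and the function `randomization_p_value` are hypothetical, and the null hypothesis assumed here is "no difference between the groups".

```python
import random
import statistics

def randomization_p_value(group_a, group_b, num_shuffles=5000, seed=0):
    """Two-tailed randomization test for a difference in means."""
    rng = random.Random(seed)
    # Step 2: the test statistic computed from the sample data.
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(num_shuffles):
        # Step 3: if the null is true, group labels are interchangeable,
        # so reshuffle the labels and recompute the statistic.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        # Step 4: count shuffles at least as extreme as the observed statistic.
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / num_shuffles

# Hypothetical data, made up for illustration only.
treatment = [23, 27, 31, 25, 29, 30, 28, 26]
control = [22, 20, 25, 21, 24, 19, 23, 22]

# Step 1 (hypotheses): H0: mu_treatment = mu_control, Ha: they differ.
observed_diff, p_value = randomization_p_value(treatment, control)
print(f"observed difference = {observed_diff}, p-value = {p_value}")
```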

Distribution of the Sample Statistic

1. Sampling distribution: distribution of the statistic based on many samples from the population

2. Bootstrap distribution: distribution of the statistic based on many samples with replacement from the original sample

3. Randomization distribution: distribution of the statistic assuming the null hypothesis is true

4. Normal, t, χ², F: theoretical distributions used to approximate the distribution of the statistic
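As an illustration of the bootstrap distribution described above, the following sketch resamples with replacement from one original sample and records the mean each time. The sample values and the helper name `bootstrap_means` are assumptions invented for this example.

```python
import random
import statistics

def bootstrap_means(sample, num_resamples=5000, seed=0):
    """Means of many resamples drawn with replacement from the original sample."""
    rng = random.Random(seed)
    n = len(sample)
    # rng.choices samples with replacement; each resample has the original size n.
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(num_resamples)]

# Hypothetical original sample, made up for illustration only.
sample = [12, 15, 9, 20, 14, 11, 18, 16, 13, 17]

boot = bootstrap_means(sample)
boot_center = statistics.fmean(boot)
boot_se = statistics.stdev(boot)  # spread of the bootstrap distribution
print(f"center = {boot_center:.2f}, bootstrap SE = {boot_se:.2f}")
```

The standard deviation of the bootstrap statistics serves as an estimate of the standard error of the sample mean.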

Sample Size Conditions

• For large sample sizes, either simulation methods or theoretical methods work

• If sample sizes are too small, only simulation methods can be used

Using Distributions

• For confidence intervals, you find the desired percentage in the middle of the distribution, then find the corresponding value on the x-axis

• For p-values, you find the value of the observed statistic on the x-axis, then find the area in the tail(s) of the distribution
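Both directions of lookup can be sketched with the standard normal distribution from Python's standard library. The helper names `middle_bounds` and `two_tailed_p_value` and the input values are assumptions for illustration; on the exam the same lookups are done on a calculator.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

def middle_bounds(level):
    """Percentage in the middle -> values on the x-axis (for confidence intervals)."""
    tail = (1 - level) / 2
    return std_normal.inv_cdf(tail), std_normal.inv_cdf(1 - tail)

def two_tailed_p_value(z):
    """Observed statistic on the x-axis -> area in both tails (for p-values)."""
    return 2 * (1 - std_normal.cdf(abs(z)))

low, high = middle_bounds(0.95)   # roughly -1.96 and 1.96
p = two_tailed_p_value(2.0)       # roughly 0.0455
print(f"middle 95%: ({low:.3f}, {high:.3f}); two-tailed area beyond z = 2: {p:.4f}")
```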

Confidence Intervals

[Figure: best guess at the sampling distribution (x-axis: statistic), with the observed statistic marked and the middle P% of the distribution lying between the lower bound and the upper bound]

Confidence Intervals

[Figure: N(0,1) distribution with the middle P% marked, bounded by z*]

Return to original scale with

$$\text{sample statistic} \pm z^* \times SE$$

Hypothesis Testing

[Figure: distribution of the statistic assuming the null hypothesis (x-axis: statistic), with the observed statistic marked and the p-value shown as the tail area beyond it]

General Formulas

• When performing inference for a single parameter (or difference in two parameters), the following formulas are used:

$$z = \frac{\text{sample statistic} - \text{null value}}{SE}$$

$$\text{sample statistic} \pm z^* \times SE$$

General Formulas

• For proportions (categorical variables), the normal distribution is used

• For inference involving any quantitative variable (means, correlation, slope), the t distribution is used
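A worked instance of the general formulas for a single proportion might look like the sketch below. The counts (60 successes out of 100) and the null value 0.5 are made up, and `one_proportion_z` is a hypothetical helper. One assumption to note: the test uses the null proportion in its standard error while the interval uses the sample proportion, which is a common convention but is not stated in these slides.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, p_null, level=0.95):
    """z statistic, two-tailed p-value, and confidence interval for one proportion."""
    p_hat = successes / n
    # Test: (sample statistic - null value) / SE, with SE computed under the null.
    se_null = math.sqrt(p_null * (1 - p_null) / n)
    z = (p_hat - p_null) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Interval: sample statistic +/- z* x SE, with SE computed from the sample.
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    return z, p_value, (p_hat - z_star * se_hat, p_hat + z_star * se_hat)

# Hypothetical data: 60 successes in 100 trials, testing H0: p = 0.5.
z, p_value, ci = one_proportion_z(successes=60, n=100, p_null=0.5)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```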

Standard Error

• The standard error is the standard deviation of the sample statistic

• The formula for the standard error depends on the type of statistic (which depends on the type of variable(s) being analyzed)

| Parameter | Distribution | Conditions | Standard Error |
|---|---|---|---|
| Proportion | Normal | All counts at least 10: np ≥ 10, n(1 – p) ≥ 10 | $\sqrt{\frac{p(1-p)}{n}}$ |
| Difference in Proportions | Normal | All counts at least 10: n1p1 ≥ 10, n1(1 – p1) ≥ 10, n2p2 ≥ 10, n2(1 – p2) ≥ 10 | $\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$ |
| Mean | t, df = n – 1 | n ≥ 30 or data normal | $\sqrt{\frac{\sigma^2}{n}}$ |
| Difference in Means | t, df = smaller of n1 – 1, n2 – 1 | n1 ≥ 30 or data normal, n2 ≥ 30 or data normal | $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ |
| Paired Diff. in Means | t, df = nd – 1 | nd ≥ 30 or data normal | $\sqrt{\frac{\sigma_d^2}{n_d}}$ |
| Correlation | t, df = n – 2 | n ≥ 30 | $\sqrt{\frac{1-r^2}{n-2}}$ |

pg 470
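The standard error formulas in this table translate directly into small functions. The sketch below is illustrative: the function names are invented, and in practice the sample standard deviation is plugged in for the unknown population standard deviation.

```python
import math

def se_proportion(p, n):
    return math.sqrt(p * (1 - p) / n)

def se_diff_proportions(p1, n1, p2, n2):
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_mean(sd, n):
    # Also covers the paired case: pass the SD and count of the differences.
    return math.sqrt(sd ** 2 / n)

def se_diff_means(sd1, n1, sd2, n2):
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def se_correlation(r, n):
    return math.sqrt((1 - r ** 2) / (n - 2))

def df_diff_means(n1, n2):
    # Degrees of freedom: the smaller of n1 - 1 and n2 - 1.
    return min(n1, n2) - 1

# Hypothetical inputs, chosen so the arithmetic is easy to check by hand.
print(se_proportion(0.5, 100))   # sqrt(0.25 / 100)
print(se_mean(10, 25))           # sqrt(100 / 25)
print(se_correlation(0.6, 18))   # sqrt(0.64 / 16)
print(df_diff_means(12, 20))
```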

Multiple Categories

• These formulas do not work for categorical variables with more than two categories, because there are multiple parameters

• For one or two categorical variables with multiple categories, use χ² tests

• For testing for a difference in means across multiple groups, use ANOVA
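For a single categorical variable with several categories, the χ² statistic compares observed counts with the counts expected under the null hypothesis. The slides do not spell out the formula, so the sketch below assumes the standard goodness-of-fit form, the sum of (observed − expected)² / expected with df = number of categories − 1; the counts and null proportions are hypothetical.

```python
def chi_square_statistic(observed, null_proportions):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    # Expected count in each category if the null proportions are true.
    expected = [n * p for p in null_proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return chi2, df

# Hypothetical counts for three categories, made up for illustration only.
observed = [30, 20, 50]
null_proportions = [0.25, 0.25, 0.50]

chi2, df = chi_square_statistic(observed, null_proportions)
print(f"chi-square = {chi2}, df = {df}")
```

The p-value is then the upper-tail area of the χ² distribution with these degrees of freedom, found with a calculator as described for the exam.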

Simple Linear Regression

• Simple linear regression estimates the population model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

• with the sample model:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

Simple Linear Regression

• Inference for the slope can be done using

Inference for the Slope

• Confidence intervals and hypothesis tests for the slope can be done using the familiar formulas:

$$t = \frac{\text{sample statistic} - \text{null value}}{SE}$$

$$\text{sample statistic} \pm t^* \times SE$$

• Population parameter: $\beta_1$, sample statistic: $\hat{\beta}_1$

• Use t-distribution with n – 2 degrees of freedom
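To make the slope formulas concrete, the sketch below fits a least-squares line and computes the t statistic for the slope. The data are hypothetical, the helper name `slope_inference` is invented, and the standard error used (residual standard error divided by the square root of the sum of squared x deviations) is the standard textbook formula rather than something given in these slides.

```python
import math
import statistics

def slope_inference(x, y, null_slope=0.0):
    """Least-squares slope, its standard error, t statistic, and df = n - 2."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # sample slope
    b0 = y_bar - b1 * x_bar      # sample intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    residual_se = math.sqrt(sse / (n - 2))
    se_b1 = residual_se / math.sqrt(sxx)
    t = (b1 - null_slope) / se_b1    # (sample statistic - null value) / SE
    return b1, se_b1, t, n - 2

# Hypothetical data, made up for illustration only. With n = 5 the sample is
# far too small for the conditions to be judged; it just keeps the arithmetic short.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, se_b1, t, df = slope_inference(x, y)
print(f"slope = {b1:.3f}, SE = {se_b1:.4f}, t = {t:.3f}, df = {df}")
```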

Intervals

• A confidence interval has a given chance of capturing the mean y value at a specified x value (the point on the line)

• A prediction interval has a given chance of capturing the y value for a particular case at a specified x value (the actual point)
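The difference between the two intervals shows up in their usual textbook formulas, given here as a sketch rather than as something taken from the slides. The symbols are assumptions: $x^*$ is the specified x value, $\hat{y}^*$ the predicted value there, $s_\varepsilon$ the standard deviation of the residuals, and $t^*$ comes from the t distribution with n – 2 degrees of freedom.

```latex
\begin{align*}
\text{Confidence interval for the mean } y \text{ at } x^*: \quad
  & \hat{y}^* \pm t^* \, s_\varepsilon
    \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} \\
\text{Prediction interval for one case at } x^*: \quad
  & \hat{y}^* \pm t^* \, s_\varepsilon
    \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
\end{align*}
```

The extra 1 under the square root accounts for the variability of an individual case around the line, which is why a prediction interval is always wider than the confidence interval at the same x value.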

Conditions for SLR

Inference based on the simple linear model is only valid if the following conditions hold:

1) Linearity
2) Constant Variability of Residuals
3) Normality of Residuals

Inference Methods

http://prezi.com/c1xz1on-p4eb/stat-101/