population mean. problem. notation - michigan state … mean. problem. notation . populati ... ti...


Upload: dangcong

Post on 05-May-2018


Page 1: population mean. Problem. Notation - Michigan State … mean. Problem. Notation . Populati ... TI commands (under STAT TESTS): ... -5 0 5 10 x y With the outlier: r=0.795

RECALL: In the last class, we learned statistical inference for the population mean.

Problem. Notation

Notation    Meaning
$\mu$       The population mean
$\bar{X}$   The sample mean
$\sigma$    The population standard deviation
$s$         The sample standard deviation
$n$         The sample size

RECALL:

Point estimation: $\bar{X}$ (sample mean). Distribution of $\bar{X}$.

Confidence Interval
One-sample z-interval (population SD is known)
One-sample t-interval (only sample SD is known): $\bar{X} \pm t^*_{n-1} \frac{s}{\sqrt{n}}$

Remarks:
1. The t-interval needs the normality assumption.
2. $t^*_{n-1}$, which depends on n-1 and C%, can be obtained from the t-table.
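As a rough illustration of the t-interval recalled above, here is a minimal Python sketch. The sample values (n = 10, mean 50, sample SD 5) are hypothetical, and the helper name `t_interval` is made up for this example; the critical value t* = 2.262 is the t-table entry for df = 9 at 95% confidence.

```python
import math

def t_interval(xbar, s, n, t_star):
    """Return (lower, upper) for the one-sample t-interval xbar +/- t* * s / sqrt(n)."""
    margin = t_star * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical sample: n = 10, sample mean 50, sample SD 5.
# t* = 2.262 is looked up in the t-table (df = n - 1 = 9, C = 95%).
lower, upper = t_interval(xbar=50.0, s=5.0, n=10, t_star=2.262)
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

The critical value is passed in rather than computed, which mirrors the table lookup used in class.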

RECALL: Hypothesis Testing about $\mu$

Z-Test (population SD is known)
Test statistic: $z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$

Null Hypothesis H0 vs. Alternative Hypothesis HA
H0: $\mu = \mu_0$ vs.
HA: $\mu \neq \mu_0$ (two-sided)
HA: $\mu > \mu_0$ (one-sided)
HA: $\mu < \mu_0$ (one-sided)

P-value:
Alternative Hypothesis HA           P-value formula
HA: $\mu \neq \mu_0$ (two-sided)    P-value = 2P(Z > |z|)
HA: $\mu > \mu_0$ (one-sided)       P-value = P(Z > z)
HA: $\mu < \mu_0$ (one-sided)       P-value = P(Z < z)
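The three P-value formulas for the Z-test can be sketched in Python using only the standard library. This is an illustrative sketch, not part of the course material: the function name `z_test`, the labels "two-sided", "greater" and "less", and the sample numbers are all assumptions made for this example.

```python
import math
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a one-sample Z-test with known population SD."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()  # standard normal: mean 0, SD 1
    if alternative == "two-sided":    # HA: mu != mu0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":    # HA: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":       # HA: mu < mu0
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value

# Hypothetical data: sample mean 52 from n = 25 observations, sigma = 5, H0: mu = 50.
z, p = z_test(xbar=52.0, mu0=50.0, sigma=5.0, n=25)
print(f"z = {z:.2f}, two-sided P-value = {p:.4f}")
```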

RECALL: Hypothesis Testing about $\mu$

T-Test (only sample SD s is known)
Test statistic: $t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$

Null Hypothesis H0 vs. Alternative Hypothesis HA
H0: $\mu = \mu_0$ vs.
HA: $\mu \neq \mu_0$ (two-sided)
HA: $\mu > \mu_0$ (one-sided)
HA: $\mu < \mu_0$ (one-sided)

P-value (df = n-1):
Alternative Hypothesis HA           P-value formula
HA: $\mu \neq \mu_0$ (two-sided)    Two-tail prob. of |t|
HA: $\mu > \mu_0$ (one-sided)       One-tail prob. of |t|
HA: $\mu < \mu_0$ (one-sided)       One-tail prob. of |t|

RECALL:

TI commands (under STAT TESTS):
T-interval: use 8:TInterval
T-Test: use 2:T-Test

Decisions:
If p-value < alpha level, reject H0, and we say the test is statistically significant at this alpha level.
If p-value > alpha level, fail to reject H0, and we say the test is not statistically significant at this alpha level.

Errors:
Type I error: decide to reject H0, but actually H0 is true.
Type II error: decide to retain H0, but actually H0 is false.
P(Type I error) = alpha level.

Exploring Relationships Between Variables

Chapter 7: Scatterplots, Association, and Correlation
Chapter 8: Linear Regression

WHERE ARE WE GOING?
People might ask the following questions in real life:
1. Is the price of sneakers related to how long they last?
2. Is smoking related to lung cancer?
3. Do baseball teams that score more runs sell more tickets to their games?

Chapter 7 will look at relationships between two quantitative variables X and Y.
Scatterplot
Correlation

TERM 1: SCATTERPLOTS
Is the price of sneakers related to how long they last?

The following table shows some data collected for sneakers:

Years   Price ($)
1       20.00
2       21.99
3       23.29
4       25.99
5       29.99
6       34.99
7       39.99
8       44.99
9       49.99
10      59.99

[Figure: scatterplot of Price ($) against Years]

This is an example of a scatterplot. The x-axis represents the variable Years and the y-axis represents Price.

TERM 1: SCATTERPLOT
Scatterplots may be the most common and most effective display for paired data.

Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.

[Figure: scatterplot of Price against Years]

X-axis: Years, the explanatory variable, which explains or influences changes in the other variable.
Y-axis: Price, the response variable, which measures an outcome of a study.

TERM 1: SCATTERPLOTS

How do we describe the scatterplot? Or, what information about the relationship between the two variables can we get by looking at the scatterplot?

Please look at the scatterplot of the sneakers example, and think about what you can tell about the relationship between years and price.

[Figure: scatterplot of Price against Years]

We are going to describe the relationship from four different aspects.
1) Direction
2) Form
3) Strength
4) Unusual features

TERM 1: SCATTERPLOT
Look for direction. What's my sign: positive, negative, or neither?

Negative: A pattern that runs from the upper left to the lower right is said to be negative. The Y variable decreases as the X variable increases.

Positive: A pattern running the other way is called positive. The Y variable increases as the X variable increases.

[Figures: two scatterplots of Y against X, one for each direction]

TERM 1: SCATTERPLOT
The example in the text shows a negative association between central pressure and maximum wind speed.

As the central pressure increases, the maximum wind speed decreases.

TERM 1: SCATTERPLOTS
Look for form: straight, curved, something exotic, or no pattern?

[Figures: three scatterplots of Y against X, captioned "Straight line, linear", "Curved", and "No pattern"]

In this part, we are more interested in the linear pattern.

TERM 1: SCATTERPLOTS
Look for strength: how much scatter? Or, how strong is the relationship?

Strong: the points appear tightly clustered in a single stream.

Weak: the swarm of points seems to form a vague cloud through which we can barely discern any trend or pattern.

[Figures: four scatterplots of Y against X showing different amounts of scatter]

TERM 1: SCATTERPLOTS
Look for unusual features: are there outliers or subgroups?

[Figures: two scatterplots of Y against X. In the first, the circled point is a potential outlier. In the second, there are two clusters.]

TERM 1: SCATTERPLOT-ROLES FOR VARIABLES

It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis.

This determination is made based on the roles played by the variables.

When the roles are clear, the explanatory or predictor variable goes on the x-axis, and the response variable goes on the y-axis.

TERM 1: SCATTERPLOTS Summary

A Scatterplot shows the relationship between two quantitative variables measured on the same individual.

The variable that is designated the X variable is called the explanatory variable

The variable that is designated the Y variable is called the response variable

Always plot the explanatory variable on the horizontal (x) axis

Always plot the response variable on the vertical (y) axis

In examining scatterplots, look for an overall pattern showing the form, direction and strength of the relationship

Look also for outliers or other deviations from this pattern

TERM 1: SCATTERPLOT
Example: Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. Analyze the association between fat content and calories.

Fat (g)    20    30    35    36    40    40    44
Calories   410   580   590   570   640   680   660

[Figure: scatterplot of Calories against Fat]

Comment on the scatterplot:
1) Direction: Positive
2) Form: Roughly linear
3) Strength: Moderately strong
4) Unusual features: None

TERM 2: CORRELATION
From scatterplots, we can look for the relationship between two quantitative variables and whether the relationship is strong or weak. But how strong is it?

The correlation coefficient (or simply correlation) is a quantitative measure of the linear relationship (association) between two quantitative variables.

Finding the correlation coefficient, denoted by r, by hand:

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1) s_x s_y}$

where $s_x$ and $s_y$ are the standard deviations of X and Y, respectively.

Remarks: Before you use correlation, you must check several conditions:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition

TERM 2: CORRELATION (Revisit the calories example)
Here are the fat and calorie contents of several brands of burgers.

X: Fat (g)    20    30    35    36    40    40    44
Y: Calories   410   580   590   570   640   680   660

What is the correlation coefficient of x (fat) and y (calories)?

Solution: The means are $\bar{x} = 35$ and $\bar{y} = 590$, and the standard deviations are $s_x = 7.98$ and $s_y = 89.81$.

Deviations in x   Deviations in y   Product
20-35=-15         410-590=-180      (-15)*(-180)=2700
30-35=-5          580-590=-10       (-5)*(-10)=50
35-35=0           590-590=0         0*0=0
36-35=1           570-590=-20       1*(-20)=-20
40-35=5           640-590=50        5*50=250
40-35=5           680-590=90        5*90=450
44-35=9           660-590=70        9*70=630

Add up the products: 2700+50+0+(-20)+250+450+630=4060
Correlation: r=4060/{(7-1)*7.98*89.81}=0.9442
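The hand calculation for the burger data can be checked with a short Python sketch. It follows the by-hand recipe (sum of products of deviations, divided by (n-1) times the two sample standard deviations); the function name `correlation` is just a label chosen for this example.

```python
from statistics import mean, stdev

fat = [20, 30, 35, 36, 40, 40, 44]              # x: fat in grams
calories = [410, 580, 590, 570, 640, 680, 660]  # y: calories

def correlation(xs, ys):
    """Correlation coefficient r computed with the by-hand formula."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of the products of deviations from the means (4060 for the burgers).
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # stdev() is the sample standard deviation (divides by n - 1).
    return sum_products / ((n - 1) * stdev(xs) * stdev(ys))

r = correlation(fat, calories)
print(round(r, 4))  # 0.9442
```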

CORRELATION PROPERTIES
The sign of a correlation coefficient gives the direction of the linear association.
Positive sign: positive linear association
Negative sign: negative linear association

Correlation is always between -1 and +1.

Correlation can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line.

A correlation near zero corresponds to a weak linear association.

Example: The correlation between fat and calories, r=0.9442, indicates a strong positive linear association between them.

TERM 2: CORRELATION
Cautions about correlation:

Quantitative Variables Condition: Correlation applies only to quantitative variables.

Straight Enough Condition: Correlation measures the strength only of the linear association.

Outlier Condition: Outliers can distort the correlation dramatically.

[Figures: three scatterplots of y against x. First: r=0.92. Second: r=0.098. Third: with the outlier, r=0.795; without the outlier, r=0.938.]

TERM 2: CORRELATION
Correlation ≠ Causation

Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Based on the fat and calorie contents of several brands of burgers, the correlation between them is r=0.9442. Which conclusion is most accurate?

A. More fat in the burgers causes higher calories.
B. The burgers containing more fat tend to have higher calories.

Comment: Even though A sounds reasonable, it is not a conclusion that can be derived from the correlation. Correlation objectively describes the linear association between two variables. It cannot establish causation.

CORRELATION PROPERTIES (CONT.)
Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x.

Correlation has no units.

Correlation is not affected by shifting and rescaling of either variable. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale, i.e. corr(aX+b, cY+d) = corr(X, Y), where a, b, c, d are constants with a > 0 and c > 0. (Multiplying exactly one of the variables by a negative constant flips the sign of the correlation.)
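The shift-and-rescale property can be checked numerically. The sketch below reuses the burger data from the earlier slides; the constants a, b, c, d are arbitrary values picked for illustration, and `correlation` is a helper written for this example.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r from deviations and sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum_products / ((n - 1) * stdev(xs) * stdev(ys))

fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

# Arbitrary illustrative constants with a > 0 and c > 0.
a, b, c, d = 2.0, 3.0, 0.5, -10.0
r_original = correlation(fat, calories)
r_rescaled = correlation([a * x + b for x in fat], [c * y + d for y in calories])

# A negative multiplier on one variable flips the sign of r.
r_flipped = correlation([-a * x + b for x in fat], [c * y + d for y in calories])

print(r_original, r_rescaled, r_flipped)
```

The first two values agree up to floating-point rounding, and the third has the same size with the opposite sign.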

TERM 2: CORRELATION
Example: Here are several scatterplots. The calculated correlations are -0.923, -0.487, 0.006 and 0.777. Which is which?

[Figures: four scatterplots of Y against X, panels (a), (b), (c), and (d). Values labeled on the slide: -0.923, 0.006, 0.777, -0.487.]

QUESTION: CAN WE DO MORE?
Scatterplots and correlation are useful tools that help us learn about the (linear) association between two quantitative variables.

Can we answer the following question? Fast food is often considered unhealthy because much of it is high in fat. What is the calorie content of a kind of fast food with 28g of fat?

[Figure: scatterplot of Calories against Fat]

If we want to estimate an unknown value based on known values, this is called a prediction. One way to make a prediction is by constructing a linear model.

TERM 3: LINEAR MODEL
Let's look at the burger example again.

Fat (g)    20    30    35    36    40    40    44
Calories   410   580   590   570   640   680   660

[Figure: BURGERS scatterplot of CALORIES against FAT with a red fitted line]

The red line does not go through all the points, but it can summarize the general pattern with only a couple of parameters: Calories = a + b*fat. This model can be used to predict the calories based on the fat content.
Explanatory variable: Fat
Response variable: Calories

TERM 3: LINEAR MODEL

[Figure: BURGERS scatterplot of CALORIES against FAT with the fitted line, marking one prediction and its residual]

Predicted value: the estimate made from a model is called the predicted value, denoted $\hat{y}$.

Residual: the difference between the observed value and its associated predicted value, $y - \hat{y}$, is called the residual.

The line of best fit is the line for which the sum of the squared residuals is smallest. It is called the least squares line.

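TERM 3: LINEAR MODEL

X: Fat (g)    20    30    35    36    40    40    44
Y: Calories   410   580   590   570   640   680   660

[Figure: BURGERS scatterplot of CALORIES against FAT with the fitted line]

Q1: Please construct a linear regression model to predict the calories based on fat.
Fat: $\bar{x} = 35$, $s_x = 7.98$
Calories: $\bar{y} = 590$, $s_y = 89.81$
Correlation: r=0.9442
Slope: $b = r \frac{s_y}{s_x} = 0.9442 \times \frac{89.81}{7.98} = 10.63$
Intercept: $a = \bar{y} - b\bar{x} = 590 - 10.63 \times 35 = 217.95$
Linear model: $\hat{y} = 217.95 + 10.63x$

Q2: What is the predicted calorie content when the fat is 30g?
When x=30, $\hat{y} = 217.95 + 10.63 \times 30 \approx 536.9$

Q3: What is the residual for the burger with 30g fat?
When x=30, the residual is $y - \hat{y} = 580 - 536.9 = 43.1$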
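A least squares fit for the burger data can be sketched in Python as follows. The slope is computed as the sum of products of deviations divided by the sum of squared x-deviations, which is algebraically the same as r times s_y/s_x; the variable names are chosen for this example only. Because nothing is rounded along the way, the intercept comes out near 218.0 rather than the 217.95 obtained when the slope is first rounded to 10.63.

```python
from statistics import mean

fat = [20, 30, 35, 36, 40, 40, 44]              # x: explanatory variable
calories = [410, 580, 590, 570, 640, 680, 660]  # y: response variable

x_bar, y_bar = mean(fat), mean(calories)

# Sum of products of deviations and sum of squared x-deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fat, calories))
s_xx = sum((x - x_bar) ** 2 for x in fat)

b = s_xy / s_xx        # slope of the least squares line
a = y_bar - b * x_bar  # intercept: the line passes through (x_bar, y_bar)

predicted_30 = a + b * 30         # predicted calories for a 30g-fat burger
residual_30 = 580 - predicted_30  # observed minus predicted

print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
print(f"predicted at 30g = {predicted_30:.1f}, residual = {residual_30:.1f}")
```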

TERM 3: LINEAR MODEL
Remarks: Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition

TERM 3: LINEAR MODEL (PARAMETERS)
We write a and b for the intercept and slope of the line. They are called the coefficients of the linear model.

The coefficient b is the slope, which tells us how rapidly the predicted value ($\hat{y}$) changes with respect to x. As the value of x increases by 1 unit, the predicted value of y changes by b units.

The coefficient a is the intercept, which tells where the line hits (intercepts) the y-axis. In other words, the intercept a is the predicted value of y when x=0.

Intercept and Slope (examples)
Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. To analyze the association between fat content and calories, the equation of the regression model is:
Predicted calories = 217.95 + 10.63*fat
For this linear equation, slope=10.63, intercept=217.95

Q1: What does the slope 10.63 mean?
A1: An increase in fat of 1 gram is associated with an increase in predicted calories of 10.63.

Q2: If the fat increases by 2 grams, how many more calories are expected to be contained in the burger?
A2: 2*10.63=21.26

Q3: What does the intercept 217.95 mean here?
A3: Theoretically, it means that when the burger contains no fat at all, the predicted amount of calories is 217.95.

TERM 4: RESIDUAL PLOT
After you construct the linear model, you have to check whether the linear model makes sense or not.

A residual plot can be used to check the appropriateness of the linear model.

A residual plot is the scatterplot of the residuals versus the x-values.

If a linear model is appropriate, then the residual plot shouldn't have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers.

[Figure: residual plot of Residuals against X]

TERM 4: RESIDUAL SCATTERPLOT
Now, let's try to diagnose the model for the calorie and fat example.

Fat (g): x   Calories: y   Predicted calories   Residual
20           410           430.6                -20.6
30           580           536.9                43.1
35           590           590                  0
36           570           600.6                -30.6
40           640           643.2                -3.2
40           680           643.2                36.8
44           660           685.7                -25.7

[Figure: residual plot of residuals against fat]
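The predicted values and residuals in the burger table can be reproduced with a few lines of Python. This sketch assumes the rounded model from the slope and intercept slide, Predicted calories = 217.95 + 10.63*fat; the function name `predict` is made up for the example. Small differences in the last digit can appear because the table rounds the predicted value before subtracting.

```python
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

def predict(x):
    """Predicted calories from the rounded regression equation."""
    return 217.95 + 10.63 * x

# Residual = observed value minus predicted value.
residuals = [y - predict(x) for x, y in zip(fat, calories)]

for x, y, e in zip(fat, calories, residuals):
    print(f"fat={x:2d}  calories={y}  predicted={predict(x):6.1f}  residual={e:6.1f}")

# For a least squares line the residuals add up to (about) zero.
print(f"sum of residuals = {sum(residuals):.2f}")
```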

TERM 4: RESIDUAL PLOT
Example: Tell what each of the residual plots below indicates about the appropriateness of the linear model that was fit to the data.

[Figures: three residual plots, panels (a), (b), and (c)]

TI for correlation and regression equation

The first time you do this:
1. Press 2nd, CATALOG (above 0).
2. Scroll down to DiagnosticOn.
3. Press ENTER, ENTER.
4. Read "Done".
Your calculator will remember this setting even when turned off.

Then:
1. Enter predictor (x) values in L1.
2. Enter response (y) values in L2. Pairs must line up, and there must be the same number of predictor and response values.
3. Press STAT, > (to CALC).
4. Scroll down to 8:LinReg(a+bx), press ENTER, ENTER.
5. Read the intercept a, slope b, and correlation r on the screen.

IMPORTANT NOTES:
The take-home quiz is due on Monday. No late submissions will be accepted.

Keep the ID assignment and bring it to class on Monday.

The sample exam will be handed out on Monday. We will discuss the questions on Wednesday.

Suggested Problem Set 4 will be collected next Thursday.

The final exam will be next Thursday: 2 hours, in class. Please prepare a one-page, A4-size cheat sheet (one-sided) on your own. A formula sheet will not be provided in the final exam. The cheat sheet will be collected together with the final exam.