
*Regression Analysis


*When we studied bivariate data, we found a least-squares regression line (LSRL), $\hat{y} = a + bx$, that was the best model for the data we had

*We assumed that the data we had made up the entire population

*We calculated r to judge the strength and direction of the linear relationship


*But what if the data we used to find that particular line was really only a sample from a larger population?

*Does the pattern of our sample match the pattern of the whole population? Or did we pick a sample with 10 lucky points?


*In general, the pattern of the points in a sample does not exactly match the pattern of the population.


*r, the correlation coefficient of the sample, doesn't equal ρ, the correlation coefficient of the population


*Question: If our sample is clustered in an ellipse and looks fairly linear, does it come from a population with a similar ellipse or not?

*Question: Is our r a good estimate of ρ?


*We could do some sort of inference procedure with r and ρ, but that is complicated, so we'll do the inference procedure on the slope instead.

*Remember $b = r \cdot \frac{s_y}{s_x}$, so the slope and the correlation coefficient are tied together, and inference on the slope will work
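The tie between the slope and the correlation coefficient can be checked numerically. The sketch below uses a small made-up data set (the numbers are hypothetical, not from this lesson) and computes the least-squares slope two ways: directly, and as r times the ratio of the sample standard deviations.

```python
import math

# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares slope computed directly.
b_direct = sxy / sxx

# Slope computed from r and the sample standard deviations.
r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
b_from_r = r * s_y / s_x

print(b_direct, b_from_r)  # the two values agree up to rounding
```

Because the two slopes always agree, a slope of zero goes hand in hand with a correlation of zero, which is why testing the slope also answers the question about correlation.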

Population

Parameters

*Correlation coefficient of the population = ρ

*y-intercept of the population = α

*Slope of the population = β

Sample

Statistics

*Correlation coefficient of the sample = r

*y-intercept of the sample = a

*Slope of the sample = b



*b1, the slope of the LSRL (the sample slope b above), is an unbiased estimator of the true slope, β

*Conditions for Inference on Regression

*The true relationship between x and y is linear. Check the scatter plot and residual plot.

*For any value of x, the values of y are independent. Check that the data come from a random sample.

*For any value of x, the y-values are normally distributed. Check a histogram or boxplot of the residuals.

*The standard deviation of the y-values is constant for all values of x. Check the scatter plot and residual plot.


*$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$

*s is the standard error of the residuals, an estimate of σ, the true standard deviation of the y-values about the population regression line

*Interpretation of s: this is a measure of the typical variation of (y in context) about the regression line for a given (x in context)
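To make the residual standard error concrete, here is a minimal sketch that fits the least-squares line and then computes s as the square root of the sum of squared residuals divided by n - 2. The function name `residual_std_error` and the data are hypothetical, chosen only for illustration.

```python
import math

def residual_std_error(x, y):
    """Return (a, b, s) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                # sample slope
    a = y_bar - b * x_bar        # sample y-intercept
    # Residual = observed y minus predicted y.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # n - 2 degrees of freedom
    return a, b, s

# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
a, b, s = residual_std_error(x, y)
print(f"a = {a:.3f}, b = {b:.3f}, s = {s:.3f}")
```

Points that fall exactly on a line give s = 0; the more the points scatter about the line, the larger s becomes.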


*Return to Sampling Distributions

*How do we know, just from our sample, anything about the population?

*We have a picture of what the samples should look like from the sampling distribution

*If we take sample after sample of the same size from this population and calculate the slope each time, the sample slopes follow an approximately normal distribution with a mean equal to the true slope, β, and a standard deviation $\sigma_{b_1} = \frac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}}$


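*$SE_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$

*This is the standard error of the slope, our estimate of the standard deviation of the sample slopes

*Interpretation of $SE_{b_1}$: this is how much we would expect the sample slopes of predicting (y in context) with (x in context) to vary from sample to sample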
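The idea of taking sample after sample can be tried out with a small simulation. The sketch below assumes a hypothetical population (the values of ALPHA, BETA, and SIGMA are made up) and keeps the same ten x-values in every sample. For fixed x-values, the standard deviation of the sample slopes is σ divided by the square root of the sum of squared deviations of x, so the simulated slopes should center on BETA with roughly that spread.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: y = ALPHA + BETA*x + noise, noise ~ Normal(0, SIGMA).
ALPHA, BETA, SIGMA = 10.0, 3.0, 2.0

x = [float(i) for i in range(1, 11)]  # the same 10 x-values in every sample
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

def sample_slope():
    """Draw one sample of y-values and return its least-squares slope."""
    y = [ALPHA + BETA * xi + random.gauss(0, SIGMA) for xi in x]
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

slopes = [sample_slope() for _ in range(5000)]
mean_slope = sum(slopes) / len(slopes)
sd_slopes = math.sqrt(
    sum((b - mean_slope) ** 2 for b in slopes) / (len(slopes) - 1)
)
theory_sd = SIGMA / math.sqrt(sxx)

print(f"mean of slopes = {mean_slope:.3f} (true slope {BETA})")
print(f"sd of slopes   = {sd_slopes:.3f} (theory {theory_sd:.3f})")
```

In practice we have only one sample and do not know σ, which is why we replace it with s and report the standard error of the slope instead.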


*Confidence Interval

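*statistic ± (critical value)(standard deviation of the statistic)

*$b_1 \pm t^{*} \cdot SE_{b_1}$, where $SE_{b_1}$ is the standard error of the slope and $t^{*}$ comes from the t distribution with n − 2 degrees of freedom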

*1. In a study of the performance of a computer printer, the size (in kilobytes) and the printing time (in seconds) for each of 22 small text files were recorded. A regression line was a satisfactory description of the relationship between size and printing time. The results of the regression analysis are shown below.

Dependent variable: Printing Time

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   53.3315           1   53.3315       140
Residual     7.62381          20   0.38115

Variable   Coefficient   s.e. of coeff   t-ratio   prob
Constant   11.6559       0.3153          37        <0.0001
Size       3.47812       0.294           11.8      <0.001

Rsquared = 87.5%   Rsquared(adjusted) = 86.9%
s = 0.6174 with 22 - 2 = 20 degrees of freedom

95% Confidence Interval
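One way to work the 95% confidence interval for the slope is sketched below. The slope and its standard error are read from the regression output for the printer study; the critical value t* = 2.086 is taken from a standard t-table for 20 degrees of freedom rather than from the output itself.

```python
# Values read from the regression output for the printer study.
b1 = 3.47812    # coefficient for Size (the sample slope)
se_b1 = 0.294   # s.e. of coeff for Size
df = 22 - 2     # n - 2 degrees of freedom

# t* for 95% confidence with 20 degrees of freedom (from a t-table).
t_star = 2.086

margin = t_star * se_b1
lower, upper = b1 - margin, b1 + margin
print(f"df = {df}: 95% CI for the slope is ({lower:.3f}, {upper:.3f})")
# prints: df = 20: 95% CI for the slope is (2.865, 4.091)
```

In context: we are 95% confident that each additional kilobyte of file size is associated with an increase in printing time of between about 2.87 and 4.09 seconds.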



*Hypothesis Test

*Null hypothesis: there is no linear relationship between x and y (no correlation)

*Ho: β = 0

*Ha: β > 0 or β < 0 or β ≠ 0

*Where β is the true slope of the relationship between (x in context) and (y in context)


*Hypothesis Test

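*Test statistic = (statistic − parameter) / (standard deviation of the statistic)

*$t = \frac{b_1 - 0}{SE_{b_1}}$, with n − 2 degrees of freedom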

*1. In a study of the performance of a computer printer, the size (in kilobytes) and the printing time (in seconds) for each of 22 small text files were recorded. A regression line was a satisfactory description of the relationship between size and printing time. The results of the regression analysis are shown below.

Dependent variable: Printing Time

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   53.3315           1   53.3315       140
Residual     7.62381          20   0.38115

Variable   Coefficient   s.e. of coeff   t-ratio   prob
Constant   11.6559       0.3153          37        <0.0001
Size       3.47812       0.294           11.8      <0.001

Rsquared = 87.5%   Rsquared(adjusted) = 86.9%
s = 0.6174 with 22 - 2 = 20 degrees of freedom

Sufficient Evidence of a linear relationship?
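A sketch of the test for the printer study follows. The slope and its standard error come from the regression output; the 5% significance level and the two-sided alternative are assumptions made here for illustration, and the critical value 2.086 is from a standard t-table for 20 degrees of freedom.

```python
# Values read from the regression output for the printer study.
b1 = 3.47812    # coefficient for Size (the sample slope)
se_b1 = 0.294   # s.e. of coeff for Size
df = 22 - 2     # n - 2 degrees of freedom

# Test statistic for Ho: true slope = 0.
t_stat = (b1 - 0) / se_b1

# Assumed: two-sided test at the 5% level; critical value from a t-table, df = 20.
t_crit = 2.086
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.2f} with df = {df}; reject Ho at the 5% level: {reject_h0}")
# prints: t = 11.83 with df = 20; reject Ho at the 5% level: True
```

The computed t matches the t-ratio of 11.8 in the output, and the reported prob of <0.001 is far below 0.05, so there is sufficient evidence of a linear relationship between file size and printing time.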
