Linear regression
Brian Healy, PhD
BIO203
Previous classes
Hypothesis testing
– Parametric
– Nonparametric
Correlation
What are we doing today?
Linear regression
– Continuous outcome with continuous, dichotomous or categorical predictor
– Equation: E(Y | X = x) = β0 + β1x
Interpretation of coefficients
Connection between regression and
– correlation
– t-test
– ANOVA
Big picture
Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical and continuous predictors with a continuous outcome.
Extensions of linear regression allow
– Dichotomous outcomes: logistic regression
– Survival analysis: Cox proportional hazards regression
– Repeated measures
Amazingly, many of the analyses we have learned can be completed using linear regression.
Example
Yesterday, we investigated the association between age and BPF using a correlation coefficient.
Can we fit a line to these data?
[Scatter plot: BPF (0.75–0.95) vs. Age (20–60)]
Quick math review
As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept.
One definition of m is that for every one unit increase in x, there is an m unit increase in y.
One definition of b is the value of y when x is equal to zero.
[Plot of the line y = 1.5x + 4 for x from 0 to 12]
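As a quick illustration (our sketch, not part of the lecture), the slide's line y = 1.5x + 4 behaves exactly as these two definitions describe:

```python
# The line from the slide: y = 1.5x + 4 (slope m = 1.5, intercept b = 4)
def line(x):
    return 1.5 * x + 4

# Slope: a one-unit increase in x gives an m = 1.5 unit increase in y
print(line(5) - line(4))  # 1.5
# Intercept: the value of y when x = 0
print(line(0))  # 4.0
```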
Picture
Look at the data in this picture.
Does there seem to be a correlation (linear relationship) in the data?
Are the data perfectly linear?
Could we fit a line to these data?
[Scatter plot: y (0–25) vs. x (0–12)]
What is linear regression?
Linear regression tries to find the best line (curve) to fit the data.
The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points.
[Scatter plot with fitted line y = 1.5x + 4]
How do we find the best line?
Let's look at three candidate lines.
Which do you think is the best?
What is a way to determine the best line to use?
Residuals
The actual observations, yi, may be slightly off the population line because of variability in the population. The equation is yi = β0 + β1xi + εi, where εi is the deviation from the population line (see picture).
This is called the residual.
ε1 is the distance from the line for patient 1.
Least squares
The method employed to find the best line is called least squares. This method finds the values of β0 and β1 that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the εi²:

Σ εi² = Σ (yi − β0 − β1xi)²   (sums over i = 1, …, n)
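The least squares criterion can be sketched in code (a toy illustration of ours; the data and function name are not from the lecture):

```python
# Sum of squared residuals for a candidate line y = b0 + b1*x
def sse(b0, b1, xs, ys):
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]          # toy data, not the BPF data
ys = [2, 4, 5, 4, 5]

# The least squares line for these points is y = 2.2 + 0.6x;
# any other candidate line gives a larger sum of squares
print(sse(2.2, 0.6, xs, ys))  # ≈ 2.4
print(sse(2.0, 1.0, xs, ys))  # 9.0 (a worse line)
```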
Estimates of regression coefficients
Once we have solved the least squares equation, we obtain estimates for the β's, which we refer to as β̂0 and β̂1.
The final least squares equation is

ŷ = β̂0 + β̂1x1

where ŷ (yhat) is the mean value of y for a value of x1. The estimates are:

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²   (sums over i = 1, …, n)
β̂0 = ȳ − β̂1x̄
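A minimal sketch of these closed-form estimates (toy data of ours, not the BPF data):

```python
# Least squares estimates:
#   b1_hat = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
#   b0_hat = ybar - b1_hat * xbar
def fit(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = fit([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b0, b1)  # ≈ 2.2, 0.6
```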
Assumptions of linear regression
Linearity
– Linear relationship between outcome and predictors
– E(Y|X=x) = β0 + β1x1 + β2x2² is still a linear regression equation because each of the β's is to the first power
Normality of the residuals
– The residuals, εi, are normally distributed, N(0, σ²)
Homoscedasticity of the residuals
– The residuals, εi, have the same variance
Independence
– All of the data points are independent
– Correlated data points can be taken into account using multivariate and longitudinal data methods
Linearity assumption
One of the assumptions of linear regression is that the relationship between the predictors and the outcomes is linear.
We call this the population regression line:

E(Y | X = x) = μ_y|x = β0 + β1x

This equation says that the mean of y given a specific value of x is defined by the β coefficients.
The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before.
Normality and homoscedasticity assumptions
Two other assumptions of linear regression are related to the εi's:
– Normality: the distribution of the residuals is normal.
– Homoscedasticity: the variance of y given x is the same for all values of x.
[Figure: the distribution of y-values at each value of x is normal with the same variance]
Example
Here is a regression equation for the comparison of age and BPF:

BPFi = β0 + β1agei + εi
E(BPF | age) = β0 + β1age

[Scatter plot: BPF (0.75–0.95) vs. Age (20–60)]
Results
The estimated regression equation:

predicted BPF = 0.957 − 0.0029·age

[Scatter plot: BPF and predicted values (predval) vs. Age (20–60)]
. regress bpf age

      Source |       SS       df       MS              Number of obs =      29
-------------+------------------------------           F(  1,    27) =   13.48
       Model |  .022226034     1  .022226034           Prob > F      =  0.0010
    Residual |  .044524108    27  .001649041           R-squared     =  0.3330
-------------+------------------------------           Adj R-squared =  0.3083
       Total |  .066750142    28  .002383934           Root MSE      =  .04061

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0028799   .0007845    -3.67   0.001    -.0044895   -.0012704
       _cons |    .957443    .035037    27.33   0.000      .885553    1.029333

Estimated intercept: the _cons row. Estimated slope: the age row.
Interpretation of regression coefficients
The final regression equation is

predicted BPF = 0.957 − 0.0029·age

The coefficients mean
– the estimate of the mean BPF for a patient with an age of 0 is 0.957 (β̂0)
– an increase of one year in age leads to an estimated decrease of 0.0029 in mean BPF (β̂1)
Unanswered questions
Is the estimate of β1 (β̂1) significantly different than zero? In other words, is there a significant relationship between the predictor and the outcome?
Have the assumptions of regression been met?
Estimate of variance for β̂'s
In order to determine if there is a significant association, we need an estimate of the variance of β̂0 and β̂1.
s_y|x is the residual standard deviation of y after accounting for x (the standard deviation from regression, also called the root mean square error).

se(β̂1) = s_y|x / sqrt( Σ (xi − x̄)² )
se(β̂0) = s_y|x · sqrt( 1/n + x̄² / Σ (xi − x̄)² )

(sums over i = 1, …, n)
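These standard error formulas can be sketched on toy data (ours, not the lecture's BPF data; the least squares fit for these points is y = 2.2 + 0.6x):

```python
import math

xs = [1, 2, 3, 4, 5]   # toy data
ys = [2, 4, 5, 4, 5]
n = len(xs)
b0, b1 = 2.2, 0.6      # least squares estimates for these points
xbar = sum(xs) / n

# Residual standard deviation (root MSE): SSE divided by n - 2
sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
s_yx = math.sqrt(sse / (n - 2))

sxx = sum((x - xbar) ** 2 for x in xs)
se_b1 = s_yx / math.sqrt(sxx)                        # se of the slope
se_b0 = s_yx * math.sqrt(1 / n + xbar ** 2 / sxx)    # se of the intercept
print(round(se_b1, 4))  # 0.2828
```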
Test statistic
For both regression coefficients, we use a t-statistic to test any specific hypothesis.
– Each has n − 2 degrees of freedom (this is the sample size minus the number of parameters estimated).

t = (β̂1 − β1) / se(β̂1)
t = (β̂0 − β0) / se(β̂0)

What is the usual null hypothesis for β1?
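The slope t-statistic from the Stata output can be reproduced directly (a sketch; the two numbers are the coefficient and standard error printed in the output):

```python
# H0: beta1 = 0, so t = (b1_hat - 0) / se(b1_hat)
b1_hat, se_b1 = -0.0028799, 0.0007845   # from the Stata output
t = (b1_hat - 0) / se_b1
print(round(t, 2))  # -3.67, matching the output's t column
```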
Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = −3.67 (27 dof)
5) p-value = 0.0011
6) Since the p-value is less than 0.05, we reject the null hypothesis.
7) We conclude that there is a significant association between age and BPF.
. regress bpf age
(Stata output repeated from the earlier results slide; the annotations mark the estimated intercept, the estimated slope, and the p-value for the slope, 0.001.)
Comparison to correlation
In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation.
We get the same p-value!
Our conclusion is exactly the same!
There are other relationships we will see later.

Method              p-value
Correlation         0.0010
Linear regression   0.0010
Confidence interval for β1
As we have done previously, we can construct a confidence interval for the regression coefficients.
Since we are using a t-distribution, we do not automatically use 1.96. Rather, we use the cut-off from the t-distribution.
The interpretation of the confidence interval is the same as we have seen previously.

( β̂1 − t_{1−α/2, dof}·se(β̂1), β̂1 + t_{1−α/2, dof}·se(β̂1) )
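Using the slope and standard error from the Stata output, the 95% confidence interval can be reproduced. The critical value t_{0.975, 27 df} ≈ 2.052 is taken from a t-table (an assumption here; we do not compute it from the t-distribution):

```python
b1_hat, se_b1 = -0.0028799, 0.0007845   # from the Stata output
t_crit = 2.052                          # t cut-off, 27 df, from a t-table
lo = b1_hat - t_crit * se_b1
hi = b1_hat + t_crit * se_b1
print(round(lo, 7), round(hi, 7))  # close to Stata's (-.0044895, -.0012704)
```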
Intercept
STATA also provides a test statistic and p-value for the estimate of the intercept.
This is for H0: β0 = 0, which is often not a hypothesis of interest because this corresponds to testing whether the BPF is equal to zero at an age of 0.
Since BPF can't be 0 at age 0, this test is not really of interest.
We can center covariates to make this test important.
Prediction
Prediction
Beyond determining if there is a significant association, linear regression can also be used to make predictions.
Using the regression equation, we can predict the BPF for patients with specific age values.
– Ex. A patient with age = 40
The expected BPF for a patient of age 40 based on our experiment is 0.841:

predicted BPF = 0.957 − 0.0029·40 = 0.841
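This prediction can be checked in a line of code (the function name is ours):

```python
# Predicted mean BPF from the fitted equation
def predict_bpf(age):
    return 0.957 - 0.0029 * age

print(round(predict_bpf(40), 3))  # 0.841
```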
Extrapolation
Can we predict the BPF for a patient with age 80? What assumption would we be making?
[Scatter plot: BPF and predicted values (predval) vs. Age (20–60)]
Confidence interval for prediction
We can place a confidence interval around our predicted mean value.
This corresponds to the plausible values for the mean BPF at a specific age.
To calculate a confidence interval for the predicted mean value, we need an estimate of variability in the predicted mean:

se(ŷ) = s_y|x · sqrt( 1/n + (x − x̄)² / Σ (xi − x̄)² )   (sum over i = 1, …, n)
Confidence interval
Note that the standard error has a different magnitude depending on the x value. In particular, the magnitude is smallest when x equals the mean of x.
Since the test statistic is based on the t-distribution, our confidence interval is

( ŷ − t_{α/2, df}·se(ŷ), ŷ + t_{α/2, df}·se(ŷ) )

This confidence interval is rarely used for hypothesis testing because …
[Scatter plot: BPF (0.75–0.95) vs. Age (20–60)]
Prediction interval
A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a sample size.
Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations. This is called a prediction interval.
What would you estimate as the value for a single new observation?
Do you think a prediction interval is narrower or wider?
Prediction interval
A confidence interval is always tighter than a prediction interval.
The variability in the prediction of a single observation contains two types of variability:
– variability of the estimate of the mean (confidence interval)
– variability around the estimate of the mean (residual variability)

se~(ŷ) = sqrt( s²_y|x + se(ŷ)² )
( ŷ − t_{α/2, df}·se~(ŷ), ŷ + t_{α/2, df}·se~(ŷ) )
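A sketch (toy data of ours) showing why the prediction interval is always wider: its standard error adds the residual variance on top of the standard error of the mean:

```python
import math

xs = [1, 2, 3, 4, 5]   # toy data; least squares line is y = 2.2 + 0.6x
ys = [2, 4, 5, 4, 5]
n = len(xs)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
s_yx = math.sqrt(sum((y - 2.2 - 0.6 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2))

x0 = 3  # predict at the mean of x, where the se is smallest
se_mean = s_yx * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)  # confidence interval
se_pred = math.sqrt(s_yx ** 2 + se_mean ** 2)               # prediction interval

print(se_pred > se_mean)  # True
```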
[Scatter plot: BPF (0.7–1) vs. Age (20–60)]
Conclusions
The prediction interval is always wider than the confidence interval.
– It is common to find significant differences between groups but not be able to predict very accurately.
– To predict accurately for a single patient, we need limited overlap of the distributions. The benefit of an increased sample size (a smaller standard error) does not help here, because the residual variability remains.
Model checking
How good is our model?
Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data.
R² = coefficient of determination = proportion of variance in the outcome explained by the model.
– When we have only one predictor, it is the proportion of the variance in y explained by x.

R² = (s²_y − s²_y|x) / s²_y
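A sketch of R² on toy data (ours, not the BPF data), also checking the fact that with one predictor R² equals the squared Pearson correlation:

```python
import math

xs = [1, 2, 3, 4, 5]   # toy data; least squares line is y = 2.2 + 0.6x
ys = [2, 4, 5, 4, 5]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

fitted = [2.2 + 0.6 * x for x in xs]
ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
ss_tot = sum((y - ybar) ** 2 for y in ys)               # total variation in y
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.6

# With one predictor, R^2 = r^2 (squared Pearson correlation)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
r = sxy / math.sqrt(sxx * ss_tot)
print(round(r ** 2, 3))  # 0.6
```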
R²
What if all of the variability in y was explained by x?
– What would R² equal?
– What does this tell you about the correlation between x and y?
– What if the correlation between x and y is negative?
What if none of the variability in y is explained by x?
– What would R² equal?
– What is the correlation between x and y in this case?
r vs. R²
R² = (Pearson's correlation coefficient)² = r²
Since r is between −1 and 1, R² is always less than or equal to |r|.
– r = 0.1, R² = 0.01
– r = 0.5, R² = 0.25

Method   Estimate
r        −0.577
R²       0.333
Evaluation of model
Linear regression required several assumptions:
– Linearity
– Homoscedasticity
– Normality
– Independence (usually from study design)
We must determine if the model assumptions were reasonable or whether a different model may have been needed.
Statistical research has investigated relaxing each of these assumptions.
![Page 43: Linear regression Brian Healy, PhD BIO203. Previous classes Hypothesis testing Hypothesis testing –Parametric –Nonparametric Correlation Correlation](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649d975503460f94a81427/html5/thumbnails/43.jpg)
Scatter plot
A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see:
– Are there any outliers?
– Is the relationship between x and y approximately linear?
– Is the variance in the data approximately constant for all values of x?
Tests for the assumptions
There are several different ways to test the assumptions of linear regression:
– Graphical
– Statistical
Many of the tests use the residuals, which are the vertical distances between the outcomes and the fitted line
ε̂_i = y_i − ŷ_i = y_i − (β̂_0 + β̂_1·x_i)
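The residual definition can be sketched in a few lines of Python (hypothetical data; least-squares fit via numpy):

```python
import numpy as np

# Hypothetical data (assumed values for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
fitted = b0 + b1 * x
residuals = y - fitted         # vertical distance to the fitted line

# With an intercept in the model, least-squares residuals sum to zero
print(np.round(residuals, 3))
```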
Residual plot
[Plot: Residuals vs. Fitted values]
If the assumptions of linear regression are met, we will observe a random scatter of points
Investigating linearity
Scatter plot of predictor vs. outcome
What do you notice here?
One way to handle this is to transform the predictor to include a quadratic or other term
[Plot: Non-linear relationship between predictor and outcome]
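A minimal sketch of how a quadratic term handles curvature like this, using made-up data (not the course data):

```python
import numpy as np

# Made-up curved data: y grows like x**2, so a straight line leaves a
# systematic U-shaped pattern in the residuals
x = np.linspace(0, 10, 21)
y = 0.5 * x**2 + 1.0

lin = np.polyval(np.polyfit(x, y, 1), x)    # straight-line fit
quad = np.polyval(np.polyfit(x, y, 2), x)   # adds a quadratic term

sse_lin = np.sum((y - lin)**2)
sse_quad = np.sum((y - quad)**2)
print(sse_lin > sse_quad)   # the quadratic term absorbs the curvature
```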
Aging
Research has shown that the decrease in BPF in normal people is fairly slow up until age 65, and then there is a steeper drop
[Plot: BPF vs. Age, ages 40–80]
Fitted line
[Plot: BPF vs. Age with the fitted straight line]
Note how the majority of the values are above the fitted line in the middle and below the fitted line on the two ends
What if we fit a line for this?
The residual plot shows a non-random scatter because the relationship is not really linear
[Plot: Residuals vs. Fitted values]
What can we do?
If the relationship between x and y is not linear, we can try a transformation of the values
Possible transformations:
– Add a quadratic term
– Fit a spline. This is when there is a slope for a certain part of the curve and a different slope for the rest of the curve
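A minimal linear-spline sketch in Python, assuming a hypothetical knot at age 65 (toy BPF values, not the course data): the extra regressor (age − 65)+ lets the slope change after the knot.

```python
import numpy as np

# Deterministic toy BPF: slow decline with age, steeper after a
# hypothetical knot at age 65 (assumed values, not the course data)
age = np.linspace(40, 80, 41)
bpf = 0.85 - 0.0005 * age - 0.004 * np.clip(age - 65, 0, None)

# Linear spline: (age - 65)+ is zero before the knot
hinge = np.clip(age - 65, 0, None)
X = np.column_stack([np.ones_like(age), age, hinge])
(b0, b1, b2), *_ = np.linalg.lstsq(X, bpf, rcond=None)

print(round(b1, 4), round(b1 + b2, 4))  # slope before vs. after the knot
```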
Adding a quadratic term
[Plot: BPF vs. Age with the fitted quadratic curve]
Residual plot
[Plot: Residuals vs. Fitted values after adding the quadratic term]
Checking linearity
A plot of residuals vs. the predictor is also used to detect departures from linearity
These plots allow you to investigate each predictor separately, which becomes important in multiple regression
If linearity holds, we anticipate a random scatter of the residuals on both types of residual plot
Homoscedasticity
The second assumption is equal variance across the values of the predictor
The top plot shows the assumption is met, while the bottom plot shows that there is a greater amount of variance for larger fitted values
Example
[Plot: Expression level vs. Lipid number]
Example
In this example, we can fit a linear regression model assuming that there is a linear increase in expression with lipid number, but here is the residual plot from this analysis
What is wrong?
[Plot: Residuals vs. Fitted values]
Transform the y-value
Clearly, the residuals showed that we did not have equal variance
What if we log-transform our y-value?
[Plot: log Expression level vs. Lipid number]
New regression equation
By transforming the outcome variable we have changed our regression equation:
– Original: Expression_i = β0 + β1·lipid_i + ε_i
– New: ln(Expression_i) = β0 + β1·lipid_i + ε_i
What is the interpretation of β1 from the new regression model?
– For every one-unit increase in lipid number, there is a β1 unit increase in ln(Expression) on average
– The interpretation has changed due to the transformation
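A small sketch of the log-scale interpretation, with made-up numbers rather than the lipid data: fitting ln(y) on x recovers the slope, and exp(β1) gives the multiplicative change per unit of x on the original scale.

```python
import numpy as np

# Made-up exponential data (not the lipid data): y grows by a factor
# of e**0.9 for each unit of x
x = np.arange(1, 7, dtype=float)
y = 200.0 * np.exp(0.9 * x)

b1, b0 = np.polyfit(x, np.log(y), 1)  # fit on the log scale

# b1 recovers 0.9; exp(b1) is the multiplicative change per unit of x
print(round(b1, 3), round(np.exp(b1), 3))
```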
Residual plot
On the log scale, the assumption of equal variance appears much more reasonable
[Plot: Residuals vs. Fitted values on the log scale]
Checking homoscedasticity
If we do not appear to have equal variance, a transformation of the outcome variable can be used
– Most common are the log transformation or the square-root transformation
Other approaches involving weighted least squares can also be used if a transformation does not work
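A sketch of one weighted-least-squares approach, under the assumption that the variance grows with x (toy numbers, not the course data): scaling each observation by the square root of its weight reduces WLS to ordinary least squares.

```python
import numpy as np

# Toy data where the variance is assumed to grow with x
# (made-up numbers, not the course data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.5, 9.2, 12.8])
w = 1.0 / x**2                      # weights = 1 / assumed variance

# Multiplying each row by sqrt(w_i) turns WLS into ordinary least squares
sw = np.sqrt(w)
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print(np.round(coef, 3))            # [intercept, slope] under WLS
```

The same coefficients come out of the WLS normal equations directly; the square-root-weight trick is just a convenient way to reuse an OLS solver.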
Normality
Regression requires that the residuals are normally distributed
To test if the residuals are normal:
– Histogram of residuals
– Normal probability plot
[Plot: Histogram of residuals]
Several statistical tests for normality of residuals are also available
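A small sketch of two such checks using scipy on simulated residuals (not the course data): the Shapiro-Wilk test and the correlation from a normal probability (Q-Q) plot.

```python
import numpy as np
from scipy import stats

# Simulated residuals (assumed normal; not the course data)
rng = np.random.default_rng(1)
resid = rng.normal(0.0, 0.05, 100)

stat, p = stats.shapiro(resid)                    # Shapiro-Wilk test
(osm, osr), (slope, intercept, r) = stats.probplot(resid)

# For normal residuals, p is usually large and the Q-Q correlation r
# is close to 1
print(round(p, 3), round(r, 3))
```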
What if normality does not hold?
Transformations of the outcome can often help
We can also change to another type of regression that does not require normality of the residuals:
– Logistic regression
– Poisson regression
Outliers
Investigating the residuals also provides information regarding outliers
If a value is extreme in the vertical direction, the residual will be extreme as well
– You will see this in lab
If a value is extreme in the horizontal direction, this value can have too much influence (leverage)
– This is beyond the scope of this class
Example
Another measure of disease burden in MS is the T2 lesion volume in the brain
– Over the course of the disease, patients accumulate brain lesions that they do not recover from
This is a measure of the disease burden in the brain
Is there a significant linear relationship between T2 lesion volume and age?
[Plot: Lesion volume vs. Age, ages 20–60]
Linear model
Our initial linear model:
– LV_i = β0 + β1·age_i + ε_i
– What is the interpretation of β1?
– What is the interpretation of β0?
Using STATA, we get the following regression equation:
– Fitted: LV_i = 3.70 + 0.062·age_i
– Is there a significant relationship between age and lesion volume?
Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.99 (102 dof)
5) p-value = 0.32
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume
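The slope test in steps 4–5 can be sketched in Python on simulated data (n = 104 to mirror the 102 degrees of freedom; the data are made up, so the numbers will not match the slide):

```python
import numpy as np
from scipy import stats

# Simulated outcome with no true age effect (made-up data; the slide's
# numbers came from real data, so these will not match)
rng = np.random.default_rng(2)
age = rng.uniform(20, 60, 104)
lv = 5.0 + rng.normal(0.0, 5.8, 104)

res = stats.linregress(age, lv)
t = res.slope / res.stderr              # test statistic for H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=104 - 2)  # two-sided p, n - 2 dof

print(round(t, 2), round(p, 3))         # matches res.pvalue
```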
. regress lv_entry age

      Source |       SS       df       MS              Number of obs =     104
       Model |  33.1886601     1  33.1886601           F(  1,   102) =    0.98
    Residual |  3440.84404   102  33.7337651           Prob > F      =  0.3236
       Total |   3474.0327   103  33.7284729           R-squared     =  0.0096
                                                       Adj R-squared = -0.0002
                                                       Root MSE      =  5.8081

    lv_entry |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         age |   .0623605   .0628706     0.99   0.324    -.0623429     .187064
       _cons |   3.699857   2.742369     1.35   0.180    -1.739618    9.139333

(Slide callouts mark the estimated coefficient and the p-value)
[Plot: Residuals vs. Fitted values for the untransformed model]
Linear model
Our log-transformed linear model:
– ln(LV_i) = β0 + β1·age_i + ε_i
– What is the interpretation of β1?
– What is the interpretation of β0?
Using STATA, we get the following regression equation:
– Fitted: ln(LV_i) = 1.36 + 0.0034·age_i
– Is there a significant relationship between age and lesion volume?
Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.38 (102 dof)
5) p-value = 0.71
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume
. regress lnlv age

      Source |       SS       df       MS              Number of obs =     104
       Model |  .100352931     1  .100352931           F(  1,   102) =    0.14
    Residual |  71.4750773   102  .700736052           Prob > F      =  0.7059
       Total |  71.5754302   103   .69490709           R-squared     =  0.0014
                                                       Adj R-squared = -0.0084
                                                       Root MSE      =   .8371

        lnlv |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         age |   .0034291   .0090613     0.38   0.706     -.014544    .0214022
       _cons |   1.355875   .3952489     3.43   0.001     .5719006    2.139849

(Slide callouts mark the estimated coefficient and the p-value)
[Plot: Residuals vs. Fitted values for the log-transformed model]
Histograms of residuals
[Plots: histograms of the residuals for the untransformed and transformed values]
Conclusions for model checking
Checking model assumptions for linear regression is needed to ensure inferences are correct
– If you have the wrong model, your inference will be wrong as well
The majority of model checking is based on the residuals
If the model fit is bad, a different model should be used
Dichotomous predictors
Linear regression with dichotomous predictor
Linear regression can also be used for dichotomous predictors, like sex
To do this, we use an indicator variable, which equals 1 for male and 0 for female. The resulting regression equation for BPF is
– E(BPF|sex) = β0 + β1·sex
– BPF_i = β0 + β1·sex_i + ε_i
Graph
[Plot: BPF vs. Sex (0 = female, 1 = male)]
The regression equation can be rewritten as
– Female group: BPF_i = β0 + ε_i
– Male group: BPF_i = β0 + β1 + ε_i
The meaning of the coefficients in this case are:
– β0 is the mean BPF when sex = 0, in the female group
– β0 + β1 is the mean BPF when sex = 1, in the male group
What is the interpretation of β1?
– For a one-unit increase in sex, there is a β1 increase in the mean of the BPF
– The difference in mean BPF between the males and females
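A minimal sketch of the indicator-variable property described above, with toy BPF values: the fitted intercept is the group-0 mean and the fitted slope is exactly the difference in group means.

```python
import numpy as np

# Toy BPF values with an indicator for sex (assumed numbers,
# not the course data): 0 = female, 1 = male
sex = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
bpf = np.array([0.82, 0.84, 0.81, 0.83, 0.86, 0.88, 0.85])

b1, b0 = np.polyfit(sex, bpf, 1)
diff = bpf[sex == 1].mean() - bpf[sex == 0].mean()

# b0 is the female-group mean; b1 equals the male-female difference
print(round(b0, 4), round(b1, 4))
```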
Interpretation of results
The final regression equation is
– Fitted: BPF = 0.823 + 0.037·sex
The meaning of the coefficients in this case are:
– 0.823 is the estimate of the mean BPF in the female group
– 0.037 is the estimate of the mean difference in BPF between the males and females
– What is the estimated mean BPF in the males?
How could we test if the difference between the groups is statistically significant?
Hypothesis test
1) H0: There is no difference based on gender (β1 = 0)
2) Continuous outcome, dichotomous predictor
3) Linear regression
4) Test statistic: t = 1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females
. regress bpf sex

      Source |       SS       df       MS              Number of obs =      29
       Model |  .007323547     1  .007323547           F(  1,    27) =    3.33
    Residual |  .059426595    27  .002200985           Prob > F      =  0.0792
       Total |  .066750142    28  .002383934           R-squared     =  0.1097
                                                       Adj R-squared =  0.0767
                                                       Root MSE      =  .04691

         bpf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         sex |   .0371364   .0203586     1.82   0.079     -.004636    .0789087
       _cons |   .8228636   .0100022    82.27   0.000     .8023407    .8433865

(Slide callouts mark the estimated difference between groups and its p-value)
[Plot: Residuals vs. Fitted values for the BPF-by-sex model]
T-test
As hopefully you remember, you could have tested this same null hypothesis using a two-sample t-test
Linear regression makes an equal variance assumption, so let's use the same assumption for our t-test
Hypothesis test
1) H0: There is no difference based on gender
2) Continuous outcome, dichotomous predictor
3) t-test
4) Test statistic: t = -1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females
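The equivalence can be checked directly in Python (toy data, not the course BPF values): the equal-variance two-sample t-test and the regression slope test return the same |t| and p-value.

```python
import numpy as np
from scipy import stats

# Toy BPF-like values for two groups (assumed numbers, not the course data)
g0 = np.array([0.82, 0.84, 0.81, 0.83, 0.80])   # e.g. females (sex = 0)
g1 = np.array([0.86, 0.88, 0.85, 0.87])         # e.g. males (sex = 1)

t_test = stats.ttest_ind(g0, g1, equal_var=True)  # two-sample t-test

x = np.r_[np.zeros(g0.size), np.ones(g1.size)]    # indicator predictor
y = np.r_[g0, g1]
reg = stats.linregress(x, y)                      # slope test for beta1

print(round(abs(t_test.statistic), 3), round(abs(reg.slope / reg.stderr), 3))
```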
. ttest bpf, by(sex)

Two-sample t test with equal variances

       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
           0 |      22    .8228636    .0096717    .0453645    .8027502    .8429771
           1 |       7         .86    .0196457    .0519775    .8119288    .9080712
    combined |      29    .8318276    .0090667    .0488255    .8132553    .8503998
        diff |            -.0371364   .0203586               -.0789087     .004636

    diff = mean(0) - mean(1)                                  t = -1.8241
    Ho: diff = 0                                 degrees of freedom = 27

    Ha: diff < 0             Ha: diff != 0             Ha: diff > 0
 Pr(T < t) = 0.0396     Pr(|T| > |t|) = 0.0792     Pr(T > t) = 0.9604
Amazing!!!
We get the same result using both approaches!!
Linear regression has the advantages of:
– Allowing multiple predictors (tomorrow)
– Accommodating continuous predictors (relationship to correlation)
– Accommodating categorical predictors (tomorrow)
Very flexible approach
Conclusion
Indicator variables can be used to represent dichotomous variables in a regression equation
Interpretation of the coefficient for an indicator variable is the same as for a continuous variable
– Provides a group comparison
Tomorrow we will see how to use regression to match ANOVA results