Introduction to Biostatistical Analysis Using R. Statistics course for first-year PhD students. Session 4. Lecture: Regression Analysis. Practical: multiple regression. Lecturer: Lorenzo Marini, DAFNAE, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
![Page 1: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/1.jpg)
Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Lecturer: Lorenzo Marini
DAFNAE, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 049 8272807
Skype: lorenzo.marini
http://www.biodiversity-lorenzomarini.eu/
Session 4
Lecture: Regression Analysis
Practical: multiple regression
![Page 2: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/2.jpg)
Statistical modelling: more than one parameter
Nature of the response variable
- NORMAL: General Linear Models
- POISSON, BINOMIAL …: Generalized Linear Models (GLM)
Nature of the explanatory variables
- Categorical: ANOVA (Session 3)
- Continuous: Regression (Session 4)
- Categorical + continuous: ANCOVA (Session 3)
![Page 3: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/3.jpg)
REGRESSION
Linear methods
Simple linear
- One X
- Linear relation
Multiple linear
- Two or more Xi
- Linear relation
Polynomial
- One X but more slopes
- Non-linear relation
Non-linear methods
Non-linear
- One X
- Complex relation
![Page 4: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/4.jpg)
LINEAR REGRESSION lm()
Regression analysis is a technique used for the modeling and analysis of numerical data consisting of values of a dependent variable (response variable) and of one or more independent continuous variables (explanatory variables).
Assumptions
- Independence: The Y-values and the error terms must be independent of each other.
- Linearity between Y and X.
- Normality: The populations of Y-values and the error terms are normally distributed for each level of the predictor variable x.
- Homogeneity of variance: The populations of Y-values and the error terms have the same variance at each level of the predictor variable x.
(Don’t test for normality or heteroscedasticity; check the residuals instead!)
![Page 5: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/5.jpg)
LINEAR REGRESSION lm()
AIMS
1. To describe the linear relationships between Y and Xi (EXPLANATORY APPROACH) and to quantify how much of the total variation in Y can be explained by the linear relationship with Xi.
2. To predict new values of Y from new values of Xi (PREDICTIVE APPROACH)
$y_i = \alpha + \beta x_i + \varepsilon_i$, where y is the response and the xi are the predictors
We estimate one INTERCEPT and one or more SLOPES
![Page 6: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/6.jpg)
Simple LINEAR regression step by step:
SIMPLE LINEAR REGRESSION:
I step:
-Check linearity [visualization with plot()]
II step:
-Estimate the parameters (one slope and one intercept)
III step:
-Check residuals (check the assumptions looking at the residuals: normality and homogeneity of variance)
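The three steps above can be sketched in R as follows. This is only an illustration: the built-in `cars` data set (speed and stopping distance) is an assumption standing in for the course data, and the object names are hypothetical.

```r
# Assumption: built-in 'cars' data replace the course data set
x <- cars$speed
y <- cars$dist

# I step: check linearity visually
plot(x, y, xlab = "speed", ylab = "dist")

# II step: estimate one intercept and one slope
model <- lm(y ~ x)
abline(model)
coef(model)

# III step: check the assumptions on the residuals
res <- residuals(model)
plot(fitted(model), res, xlab = "fitted", ylab = "residuals")  # homogeneity of variance
abline(h = 0, lty = 2)
qqnorm(res); qqline(res)                                       # normality
```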
![Page 7: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/7.jpg)
Normality
SIMPLE LINEAR REGRESSION:
Do not test normality on the pooled y values: normality is assumed for y at each level of x, so check the residuals instead.
![Page 8: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/8.jpg)
MODEL
SIMPLE LINEAR REGRESSION
Fitted values: $\hat{y}_i = \alpha + \beta x_i$
SLOPE: $\beta = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
INTERCEPT: $\alpha = \bar{y} - \beta \bar{x}$
The model gives the fitted values.
RESIDUALS: residual = observed $y_i$ - fitted $\hat{y}_i$
The model does not explain everything.
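As a quick check of the slope and intercept formulas, they can be computed by hand and compared with lm(). A minimal sketch, assuming the built-in `cars` data as a stand-in example:

```r
# Assumption: built-in 'cars' data used only for illustration
x <- cars$speed
y <- cars$dist

# Slope and intercept from the least-squares formulas
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)

# Fitted values and residuals (observed - fitted)
fitted_y <- alpha + beta * x
res <- y - fitted_y

# Same estimates from lm()
model <- lm(y ~ x)
all.equal(unname(coef(model)), c(alpha, beta))
```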
![Page 9: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/9.jpg)
SIMPLE LINEAR REGRESSION
Least-squares regression explained
library(animation)
## Slope changing
# save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
  title = "Demonstration of Least Squares",
  description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares()
ani.stop()
## Intercept changing
# save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
  title = "Demonstration of Least Squares",
  description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares(ani.type = "i")
ani.stop()
![Page 10: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/10.jpg)
SIMPLE LINEAR REGRESSION
Hypothesis testing
Ho: β = 0 (There is no relation between X and Y)
H1: β ≠ 0
We must measure the unreliability associated with each of the estimated parameters (i.e. we need the standard errors)
$SE(\beta) = \sqrt{\frac{\text{Residual SS}/(n-2)}{\sum (x_i - \bar{x})^2}}$
$t = \frac{\beta - 0}{SE(\beta)}$
Parameter t testing (test the single parameter!)
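The standard error and t statistic of the slope can be reproduced by hand and compared with the summary() table. A sketch under the assumption that the built-in `cars` data serve as the example:

```r
# Assumption: built-in 'cars' data used only for illustration
x <- cars$speed
y <- cars$dist
model <- lm(y ~ x)

n   <- length(y)
rss <- sum(residuals(model)^2)                  # residual SS

# Standard error of the slope, n - 2 degrees of freedom
se_beta <- sqrt((rss / (n - 2)) / sum((x - mean(x))^2))

# t test of H0: beta = 0
beta   <- unname(coef(model)["x"])
t_beta <- (beta - 0) / se_beta
p_beta <- 2 * pt(abs(t_beta), df = n - 2, lower.tail = FALSE)

# The same numbers appear in the coefficient table
summary(model)$coefficients["x", ]
```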
![Page 11: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/11.jpg)
SIMPLE LINEAR REGRESSION
Measure of goodness-of-fit
$\text{Total SS} = \sum (y_i - \bar{y})^2$ (observed values)
$\text{Model SS} = \sum (\hat{y}_i - \bar{y})^2$ (fitted values)
$\text{Residual SS} = \text{Total SS} - \text{Model SS}$
$R^2 = \text{Model SS} / \text{Total SS}$ (explained variation)
R2 does not provide information about the significance.
If the model is significant (β ≠ 0): how much variation is explained?
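The sums of squares and R2 can be computed directly from the observed and fitted values. A minimal sketch, again assuming the built-in `cars` data as the example:

```r
# Assumption: built-in 'cars' data used only for illustration
x <- cars$speed
y <- cars$dist
model <- lm(y ~ x)

total_ss    <- sum((y - mean(y))^2)               # observed vs. mean
model_ss    <- sum((fitted(model) - mean(y))^2)   # fitted vs. mean
residual_ss <- total_ss - model_ss

r2 <- model_ss / total_ss
r2
summary(model)$r.squared   # same value reported by R
```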
![Page 12: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/12.jpg)
SIMPLE LINEAR REGRESSION: example 1
If the model is significant, then model checking
1. Linearity between X and Y?
ok
No patterns in the residuals vs. predictor plot
![Page 13: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/13.jpg)
2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals
> shapiro.test(residuals)
Shapiro-Wilk normality test
data: residuals
W = 0.9669, p-value = 0.2461
ok
ok
SIMPLE LINEAR REGRESSION: example 1
![Page 14: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/14.jpg)
3. Homoscedasticity
Call:
lm(formula = abs(residuals) ~ yfitted)
Coefficients:
             Estimate  SE       t      P
(Intercept)  2.17676   2.04315  1.065  0.293
yfitted      0.11428   0.07636  1.497  0.142
SIMPLE LINEAR REGRESSION: example 1
ok
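The three checks of example 1 can be run along these lines. This is a sketch only: the built-in `cars` data are an assumption replacing the example data of the slides, so the outcome of each check will differ from the output shown above.

```r
# Assumption: built-in 'cars' data replace the slide's example data
x <- cars$speed
y <- cars$dist
model <- lm(y ~ x)
res     <- residuals(model)
yfitted <- fitted(model)

# 1. Linearity: no pattern expected in residuals vs. predictor
plot(x, res); abline(h = 0, lty = 2)

# 2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test
qqnorm(res); qqline(res)
sw <- shapiro.test(res)
sw

# 3. Homoscedasticity: absolute residuals should not depend on fitted values
het <- lm(abs(res) ~ yfitted)
summary(het)$coefficients
```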
![Page 15: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/15.jpg)
SIMPLE LINEAR REGRESSION: example 2
1. Linearity between X and Y?
NO LINEARITY between X and Y
![Page 16: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/16.jpg)
SIMPLE LINEAR REGRESSION: example 2
2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals
> shapiro.test(residuals)
Shapiro-Wilk normality test
data: residuals
W = 0.8994, p-value = 0.001199
no
![Page 17: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/17.jpg)
SIMPLE LINEAR REGRESSION: example 2
3. Homoscedasticity
NO YES
![Page 18: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/18.jpg)
SIMPLE LINEAR REGRESSION: example 2
How to deal with non-linearity and non-normality situations?
Transformation of the data
- Box-Cox transformation (power transformation of the response)
- Square-root transformation
- Log transformation
- Arcsin transformation
Polynomial regression
Regression with multiple terms (linear, quadratic, and cubic)
$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + \text{error}$ (X is one variable!)
![Page 19: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/19.jpg)
POLYNOMIAL REGRESSION: one X, n parameters
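Hierarchy in the testing (always test the highest-order term first)!
1. Fit X + X2 + X3: if the cubic term has P<0.01, stop; if n.s., go to 2.
2. Fit X + X2: if the quadratic term has P<0.01, stop; if n.s., go to 3.
3. Fit X: if the linear term has P<0.01, stop; if n.s., no relation.
NB Do not delete lower terms even if non-significant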
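The hierarchy in the testing can be sketched as below. The data are simulated (an assumption, with a truly quadratic relation), and the object names m1, m2, m3 are hypothetical.

```r
# Assumption: simulated data with a quadratic relation between x and y
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- 2 + 3 * x - 0.5 * x^2 + rnorm(100, sd = 1)

m3 <- lm(y ~ x + I(x^2) + I(x^3))   # cubic
m2 <- lm(y ~ x + I(x^2))            # quadratic
m1 <- lm(y ~ x)                     # linear

# Always test the highest-order term first
summary(m3)$coefficients["I(x^3)", ]   # cubic term
anova(m2, m3)                          # same test as an F test on nested models

# If the cubic term is n.s., move down to the quadratic model
summary(m2)$coefficients["I(x^2)", ]
anova(m1, m2)
```

Lower-order terms stay in the model whenever a higher-order term is retained.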
![Page 20: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/20.jpg)
MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression
Regression with two or more variables
$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_i X_i$
The Multiple Regression Model
There are important issues involved in carrying out a multiple regression:
• which explanatory variables to include (VARIABLE SELECTION);
• NON-LINEARITY in the response to the explanatory variables;
• INTERACTIONS between explanatory variables;
• correlation between explanatory variables (COLLINEARITY);
• RELATIVE IMPORTANCE of variables
Assumptions
Same assumptions as in the simple linear regression!!!
![Page 21: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/21.jpg)
MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression MODEL
Regression with two or more variables
$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_i X_i$
Each slope (bi) is a partial regression coefficient:
The bi are the most important parameters of the multiple regression model. They measure the expected change in the dependent variable associated with a one-unit change in an independent variable, holding the other independent variables constant. This interpretation of partial regression coefficients is very important because independent variables are often correlated with one another.
![Page 22: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/22.jpg)
MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression MODEL EXPANDED
We can add polynomial terms and interactions
Y = a + linear terms + quadratic & cubic terms + interactions
QUADRATIC AND CUBIC TERMS account for NON-LINEARITY
INTERACTIONS account for non-independent effects of the factors
![Page 23: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/23.jpg)
Multiple regression step by step:
MULTIPLE LINEAR REGRESSION:
I step:
-Check collinearity (visualization with pairs() and correlation)
-Check linearity
II step:
-Variable selection and model building (different procedures to select the significant variables)
III step:
-Check residuals (check the assumptions looking at the residuals: normality and homogeneity of variance)
![Page 24: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/24.jpg)
MULTIPLE LINEAR REGRESSION: I STEP
Let’s begin with an example from air pollution studies. How is ozone concentration related to wind speed, air temperature and the intensity of solar radiation?
I STEP:
-Check collinearity
-Check linearity
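A possible way to carry out the I step in R is shown below. It assumes that R's built-in `airquality` data (columns Ozone, Solar.R, Wind, Temp) can stand in for the course's ozone data; the renaming to ozone, rad, wind and temp only mirrors the variable names used in these slides.

```r
# Assumption: built-in 'airquality' data replace the course data file
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(aq) <- c("ozone", "rad", "wind", "temp")

# Check collinearity and linearity: scatterplot matrix with smoothers
pairs(aq, panel = panel.smooth)

# Correlations among response and explanatory variables
cor_mat <- cor(aq)
round(cor_mat, 2)
```

Strong correlations among rad, wind and temp would signal collinearity; curved smoothers in the ozone panels would signal non-linearity.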
![Page 25: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/25.jpg)
MULTIPLE LINEAR REGRESSION: II STEP
II STEP: MODEL BUILDING
How to carry out a model simplification in multiple regression
Start with a complex model with interactions and quadratic and cubic terms
Model simplification
1. Remove non-significant interaction terms.
2. Remove non-significant quadratic or other non-linear terms.
3. Remove non-significant explanatory variables.
4. Amalgamate explanatory variables that have similar parameter values.
Minimum Adequate Model
![Page 26: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/26.jpg)
MULTIPLE LINEAR REGRESSION: II STEP
Start with the most complicated model (it is one approach)
model1 <- lm(ozone ~ temp*wind*rad + I(rad^2) + I(temp^2) + I(wind^2))
               Estimate  Std.Error  t      Pr(>|t|)
(Intercept)     5.7E+02  2.1E+02    2.74   0.01 **
temp           -1.1E+01  4.3E+00   -2.50   0.01 *
wind           -3.2E+01  1.2E+01   -2.76   0.01 **
rad            -3.1E-01  5.6E-01   -0.56   0.58
I(rad^2)       -3.6E-04  2.6E-04   -1.41   0.16
I(temp^2)       5.8E-02  2.4E-02    2.44   0.02 *
I(wind^2)       6.1E-01  1.5E-01    4.16   0.00 ***
temp:wind       2.4E-01  1.4E-01    1.74   0.09
temp:rad        8.4E-03  7.5E-03    1.12   0.27
wind:rad        2.1E-02  4.9E-02    0.42   0.68
temp:wind:rad  -4.3E-04  6.6E-04   -0.66   0.51
Delete only the highest interaction, temp:wind:rad.
We cannot delete the lower-order terms at this stage!
![Page 27: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/27.jpg)
MULTIPLE LINEAR REGRESSION: II STEP
Manual model simplification (it is one of the many philosophies)
Delete the non-significant terms one by one (from COMPLEX to SIMPLE).
At each deletion, test: is the fit of the simpler model worse?
Hierarchy in the deletion:
1. Highest interactions
2. Cubic terms
3. Quadratic terms
4. Linear terms
IMPORTANT!!!
If quadratic or cubic terms are significant, you cannot delete the linear or the quadratic term even if they are not significant.
If an interaction is significant, you cannot delete the linear terms even if they are not significant.
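One deletion step can be sketched with update() and anova(). The built-in `airquality` data are an assumption standing in for the course's ozone data (renamed to match the slides), so the estimates will not match the table shown earlier.

```r
# Assumption: built-in 'airquality' data replace the course data file
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(aq) <- c("ozone", "rad", "wind", "temp")

# Most complicated model: all interactions plus quadratic terms
model1 <- lm(ozone ~ temp * wind * rad + I(rad^2) + I(temp^2) + I(wind^2),
             data = aq)

# Delete only the highest-order interaction
model2 <- update(model1, . ~ . - temp:wind:rad)

# Is the fit of the simpler model significantly worse?
anova(model1, model2)

# Inspect the remaining terms before the next deletion
summary(model2)$coefficients
```

If the F test is non-significant, model2 is kept and the next deletion follows the same hierarchy, one term at a time.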
![Page 28: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/28.jpg)
MULTIPLE LINEAR REGRESSION: III STEP
III STEP: we must check the assumptions
NO: variance tends to increase with y
NO: non-normal errors
We can transform the data (e.g. log-transformation of y)
model <- lm(log(ozone) ~ temp + wind + rad + I(wind^2))
![Page 29: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/29.jpg)
The log-transformation has improved our model but maybe there is an outlier
MULTIPLE LINEAR REGRESSION: more than one x
![Page 30: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/30.jpg)
PARTIAL REGRESSION:
With partial regression we can remove the effect of one or more variables (covariates) and test a further factor, which becomes independent of the covariates.
WHEN?
• We would like to hold third variables constant, but cannot manipulate them.
• We can use statistical control.
HOW?
• Statistical control is based on residuals. If we regress Y on X1 and take the residuals of Y, this part of Y will be uncorrelated with X1, so anything the Y residuals correlate with will not be explained by X1.
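Statistical control through residuals can be illustrated with simulated data (an assumption; the variable names x1, x2 and y are hypothetical). The last lines add a related, well-known result that is not on the slide: if X2 is also regressed on X1, the slope between the two sets of residuals equals the partial regression coefficient of X2 in the multiple regression.

```r
# Assumption: simulated data with two correlated predictors
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)

# Remove the effect of the covariate x1 from y
y_res <- residuals(lm(y ~ x1))
cor(y_res, x1)                      # zero up to rounding error

# Test the further factor on what x1 does not explain
partial <- lm(y_res ~ x2)
summary(partial)$coefficients

# Residualising x2 as well reproduces the partial regression coefficient
x2_res <- residuals(lm(x2 ~ x1))
b_resid <- unname(coef(lm(y_res ~ x2_res))[2])
b_multi <- unname(coef(lm(y ~ x1 + x2))["x2"])
c(b_resid, b_multi)
```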
![Page 31: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/31.jpg)
PARTIAL REGRESSION: VARIATION PARTITIONING
Relative importance of groups of explanatory variables
Response variable: orthopteran species richness
Explanatory variables: SPACE (latitude + longitude) + ENVIRONMENT (temperature + land-cover heterogeneity)
Full.model <- lm(species ~ environment_i + space_i)
R2 = 76% (TOTAL EXPLAINED VARIATION)
What is space and what is environment?
[Figure: map of the sites, Longitude (km) against Latitude (km); diagram of the total variation split into unexplained variation and explained variation (Space, Envir. and their overlap Space ∩ Envir.)]
![Page 32: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/32.jpg)
VARIATION PARTITIONING: varpart(vegan)
Full.model <- lm(SPECIES ~ temp + het + lat + long)
TVE = 76%
Env.model <- lm(SPECIES ~ temp + het), which gives env.residuals
Pure.Space.model <- lm(ENV.RESIDUALS ~ lat + long)
VE = 15%
Space.model <- lm(SPECIES ~ lat + long), which gives space.residuals
Pure.env.model <- lm(SPACE.RESIDUALS ~ temp + het)
VE = 40%
[Diagrams: total variation split into Environment, Space and Unexpl. for each model]
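A rough sketch of the partitioning in base R follows. Two assumptions are made: the data are simulated (the names species, temp, het, lat and long only mirror the slide), and the fractions are obtained from differences between R2 values of nested models rather than from the residual regressions shown above, so the two routes need not give identical percentages.

```r
# Assumption: simulated data in place of the orthopteran data set
set.seed(7)
n    <- 150
lat  <- runif(n, 0, 100)
long <- runif(n, 0, 100)
temp <- 20 - 0.05 * lat + rnorm(n)     # environment partly structured in space
het  <- runif(n)
species <- 5 + 1.2 * temp + 4 * het + 0.03 * long + rnorm(n, sd = 2)

# Helper (hypothetical name): R2 of a linear model
r2 <- function(formula) summary(lm(formula))$r.squared

r2_full  <- r2(species ~ temp + het + lat + long)   # total explained variation
r2_env   <- r2(species ~ temp + het)
r2_space <- r2(species ~ lat + long)

pure_env    <- r2_full - r2_space                   # environment only
pure_space  <- r2_full - r2_env                     # space only
shared      <- r2_env + r2_space - r2_full          # overlap, can be negative
unexplained <- 1 - r2_full

round(c(pure_env = pure_env, shared = shared,
        pure_space = pure_space, unexplained = unexplained), 3)
```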
![Page 33: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/33.jpg)
NON-LINEAR REGRESSION: nls()
Sometimes we have a mechanistic model for the relationship between y and x, and we want to estimate the parameters and standard errors of the parameters of a specific non-linear equation from data.
We must SPECIFY the exact nature of the function as part of the model formula when we use non-linear modelling.
In place of lm() we write nls() (this stands for ‘non-linear least squares’). Then, instead of y ~ x + I(x^2) + I(x^3) (polynomial), we write y ~ function to spell out the precise non-linear model we want R to fit to the data.
![Page 34: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/34.jpg)
NON-LINEAR REGRESSION: step by step
1. Plot y against x
2. Get an idea of the family of functions that you can fit
3. Start fitting the different models
4. Specify initial guesses for the values of the parameters
5. [Get the MAM for each by model simplification]
6. Check the residuals
7. Compare PAIRS of models and choose the best
Alternative approach: multimodel inference (minimum deviance + minimum number of parameters)
Compare GROUPS of models at a time
AIC = scaled deviance + 2k, where k = number of parameters + 1
Model weights and model averaging
[see Burnham & Anderson, 2002]
![Page 35: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/35.jpg)
nls(): examples of function families
Asymptotic functions S-shaped functions
Humped functions Exponential functions
![Page 36: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/36.jpg)
nls(): Look at the data
Using the data plot, work out sensible starting values. It always helps in cases like this to work out the equation’s behaviour at the limits, i.e. find the values of y when x = 0 and when x → ∞.
[Figure: scatterplot of bone against age]
Which family of functions?
- Asymptotic functions
- S-shaped functions
- Humped functions
- Exponential functions
Asymptotic exponential: $y \sim a - b\,e^{-cx}$
Understand the role of the parameters a, b, and c
![Page 37: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/37.jpg)
nls(): Look at the data
Fit the model $y \sim a - b\,e^{-cx}$
1. Estimate of a, b, and c (iterative)
2. Extract the fitted values (yi)
3. Check graphically the curve fitting
Can we try another function from the same family?
Model choice is always an important issue in curve fitting (particularly for prediction)
Different behavior at the limits!
Think about your biological system, not just residual deviance!
[Figure: bone against age with the fitted curve]
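Fitting the asymptotic exponential with nls() might look like this. The age and bone values are simulated here (an assumption, since the course data are not included), and the starting values are rough guesses read off the plot: a is close to y at large x, and a - b is close to y at x = 0.

```r
# Assumption: simulated bone/age data in place of the course data
set.seed(123)
age  <- runif(60, 1, 50)
bone <- 115 - 118 * exp(-0.12 * age) + rnorm(60, sd = 10)

plot(age, bone)

# 1. Estimate a, b and c iteratively from initial guesses
fit <- nls(bone ~ a - b * exp(-c * age),
           start = list(a = 120, b = 110, c = 0.06))
summary(fit)

# 2. Extract the fitted values
yfit <- fitted(fit)

# 3. Check the curve graphically
xv <- seq(0, 50, by = 0.5)
lines(xv, predict(fit, newdata = list(age = xv)))
```

Poor starting values are a common reason for nls() failing to converge, which is why the plot is inspected first.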
![Page 38: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/38.jpg)
nls(): Look at the data
Fit a second model: Michaelis–Menten, $y \sim \frac{a x}{1 + b x}$
1. Extract the fitted values (yi)
2. Check graphically the curve fitting
[Figure: bone against age with both fitted curves]
You can see that the asymptotic exponential (solid line) tends to get to its asymptote first, and that the Michaelis–Menten (dotted line) continues to increase. Model choice, therefore, would be enormously important if you intended to use the model for prediction to ages much greater than 50 months.
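The two candidate models can be compared by AIC and by their behaviour beyond the data range. A sketch with simulated data (an assumption; object names ae and mm are hypothetical). The Michaelis–Menten starting values are guesses from the plot: a is roughly the initial slope and a/b the asymptote.

```r
# Assumption: simulated bone/age data in place of the course data
set.seed(123)
age  <- runif(60, 1, 50)
bone <- 115 - 118 * exp(-0.12 * age) + rnorm(60, sd = 10)

# Asymptotic exponential (3 parameters)
ae <- nls(bone ~ a - b * exp(-c * age),
          start = list(a = 120, b = 110, c = 0.06))

# Michaelis-Menten (2 parameters)
mm <- nls(bone ~ a * age / (1 + b * age),
          start = list(a = 15, b = 0.125))

# AIC penalises the number of parameters (df = parameters + 1)
aic <- AIC(ae, mm)
aic

# Predictions inside and far outside the observed age range
new_age <- list(age = c(50, 100, 200))
cbind(age = new_age$age,
      asym_exp    = predict(ae, newdata = new_age),
      mich_menten = predict(mm, newdata = new_age))
```

Similar fits within the data range can still give different predictions at large ages, so the extrapolated columns should be read with caution.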
![Page 39: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/39.jpg)
Application of regression: prediction
Regression models for prediction
A model can be used to predict values of y in space or in time knowing new xi values
Spatial extent + data range
[Figure: bone against age, with the age axis extended to 80; prediction within the data range: YES, beyond it: NO]
Before using a model for prediction it has to be VALIDATED!!!
2 APPROACHES
![Page 40: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/40.jpg)
VALIDATION
1. In a data-rich situation, set aside a validation set (use one part of the data set to fit the model, and the second part to assess the prediction error of the final selected model).
2. If data are scarce, we must resort to “artificially produced” validation sets:
- Cross-validation
- Bootstrap
Model fit, then predicted y against real y: residual = prediction error
![Page 41: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/41.jpg)
K-FOLD CROSS-VALIDATION
Randomly split the data into K groups of roughly the same size.
Take turns using one group as the test set and the other K-1 groups as the training set for fitting the model.
[Diagram, K = 5: in each of the 5 rounds a different group is the test set and the other four are training sets, giving prediction errors 1 to 5]
The cross-validation estimate of prediction error is the average of these.
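K-fold cross-validation can be written in a few lines of base R. A minimal sketch, assuming the built-in `cars` data, a simple linear model, and mean squared error as the measure of prediction error (the slides do not fix a particular measure):

```r
# Assumption: built-in 'cars' data and a simple lm() as the model
set.seed(2024)
dat <- cars
K   <- 5

# Random fold labels of roughly equal size
fold <- sample(rep(1:K, length.out = nrow(dat)))

cv_error <- numeric(K)
for (k in 1:K) {
  train <- dat[fold != k, ]            # K-1 groups for fitting
  test  <- dat[fold == k, ]            # one group held out
  fit   <- lm(dist ~ speed, data = train)
  pred  <- predict(fit, newdata = test)
  cv_error[k] <- mean((test$dist - pred)^2)   # prediction error of round k
}

cv_error          # one prediction error per fold
mean(cv_error)    # cross-validation estimate
```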
![Page 42: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/42.jpg)
BOOTSTRAP
1. Generate a large number (n = 10 000) of bootstrap samples
2. For each bootstrap sample, compute the prediction error (Error1, Error2, Error3, …, Errorn)
3. The mean of these estimates is the bootstrap estimate of prediction error
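One way to implement the bootstrap steps is sketched below. Several choices are assumptions not stated on the slide: the built-in `cars` data, mean squared error as the error measure, prediction on the observations left out of each bootstrap sample (out-of-bag), and 1000 samples instead of 10 000 to keep the run short.

```r
# Assumptions: 'cars' data, out-of-bag prediction, B reduced to 1000
set.seed(99)
dat <- cars
n   <- nrow(dat)
B   <- 1000

boot_error <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)        # 1. bootstrap sample (with replacement)
  oob <- setdiff(1:n, idx)                # observations not drawn
  fit <- lm(dist ~ speed, data = dat[idx, ])
  pred <- predict(fit, newdata = dat[oob, ])
  boot_error[b] <- mean((dat$dist[oob] - pred)^2)   # 2. prediction error
}

mean(boot_error)                          # 3. bootstrap estimate
```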
![Page 43: Introduction to Biostatistical Analysis Using R Statistics course for first-year PhD students](https://reader036.vdocuments.net/reader036/viewer/2022070411/5681479e550346895db4d674/html5/thumbnails/43.jpg)
Application of regression: prediction
1. Do not use your model for prediction without carrying out a validation
- If you can, use an independent data set for validating the model
- If you cannot, use at least bootstrap or cross-validation
2. Never extrapolate