this work is licensed under a creative commons attribution...
TRANSCRIPT
![Page 1: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/1.jpg)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site.
Copyright 2006, The Johns Hopkins University and Rafael A. Irizarry. All rights reserved. Use of these materials permitted only in accordance with license rights granted. Materials provided “AS IS”; no representations or warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for obtaining permissions for use from third parties as needed.
![Page 2: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/2.jpg)
Prediction: Using statistics Prediction: Using statistics to put your money where to put your money where
your mouth isyour mouth is
Rafael A. Irizarry
![Page 3: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/3.jpg)
What distinguishes science What distinguishes science from religion?from religion?
We can predict!
![Page 4: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/4.jpg)
ExamplesExamples
Religion ScienceEarth is flat Earth is round
(Eratosthenes of Cyrene220 BC)
Sun orbits earth Earth orbits sun (Copernicus circa 1500)
Wine turns into blood
Tastes like wine (me 1984)
![Page 5: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/5.jpg)
Two approachesTwo approaches• Physical models
– A physicist can predict any two objects will hit ground at same time
– A chemist predicts Na + Cl tastes salty– An astronomer can predict the next eclipse– An M.D. can predict that if you eat cyanide you die– An engineer can predict a bridge won’t fall
• Stochastic models– I can predict that if you toss 10,000 coins you will
see between 45-55% heads– I can predict Vegas will make money– I can predict the Oakland A’s will win more games
than any other team with similar payroll
![Page 6: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/6.jpg)
MLB Wins versus PayrollMLB Wins versus Payroll
Oakland A’s ignore “experts” and use statistics instead
![Page 7: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/7.jpg)
MLB residuals versus PayrollMLB residuals versus Payroll
![Page 8: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/8.jpg)
Many Problems in ScienceMany Problems in Science
NatureX Y
Sometimes we want to understand nature
Sometime we don’t really care
We are always happy if we can predict Y
![Page 9: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/9.jpg)
Most common approachMost common approach• Use parametric statistical model
• Fit the model, interpret parameters, predict Y given X
Linear RegressionGLM
Cox ModelX Y
![Page 10: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/10.jpg)
ExampleExample
![Page 11: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/11.jpg)
ExampleExample
![Page 12: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/12.jpg)
ExampleExample
![Page 13: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/13.jpg)
RegressionRegression• Model: height and weight are normal and
correlated then regression line gives best predictor– E[ Y | X ] = Avg Y + (SD of Y / SD of X) (correlation) (X - Avg X)
• But this is only the case if model is correct
• Regression is now used in applications where its hard to tell if assumptions hold
![Page 14: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/14.jpg)
When are parametric models used?When are parametric models used?
• Used lots:– Behavioral Sciences, Psychology, Epidemiology,
Economics
• Not used much or at all:– Finance, fraud detection, zip code reading,
face/voice recognition
• Lack of proper assessments causes unwarranted optimism about models
![Page 15: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/15.jpg)
ExampleExample• Create a binary outcome and 6 covariates for
25 individuals• Make everything completely uncorrelated• Fit a regression models• A likely result*
– AIC chooses a model with 4 covariates– One covariate has p < 0.05– If we predict outcomes we get 80% right!
• Is this a good model? How would we know in real life?
*For one simulation. Code to reproduce is available upon request
![Page 16: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/16.jpg)
OverOver--fittingfitting• How can can we predict 80% when there is no
information in the covariates?
• Important fact: If we assess the fit of model on the same data we fitted, it will appear better than it really is.
• A fair assessment would happen on a new data set… which we rarely can get… but we can fake it!
![Page 17: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/17.jpg)
CrossCross--validationvalidation• Leave out 10% of data at random (test set)
• Fit model on the remaining 90% (train set)
• See how well our fit predicts on the test set
• Repeat above various times
• In our example our CV error is 50% !
![Page 18: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/18.jpg)
Example of overExample of over--fittingfittingHeight Gender Swede Age IQ Hair
colorWeight Region Blood
TypeAsian
5’8’’ M Yes 35 118 Blond 150 Midwest AB No
6’1 F No 28 120 Brown 110 Midwest O+ No
6’1 M No 29 118 Black 190 NE A Yes
• We want to predict height• Easily find a model that with perfect R2
• An example says, on average:– Women are 1 inch taller than men– Swedes are 5 inch shorter than non-Swedes
![Page 19: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/19.jpg)
My Random SampleMy Random Sample
![Page 20: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/20.jpg)
Hard ProblemsHard Problems• Many covariates… easy to over-fit
• We care mostly/only about predicting outcome
• Examples abound
• Statisticians have developed some of the best methods… despite few of us working on these problems
• CV is an essential tool
![Page 21: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/21.jpg)
Algorithmic ApproachAlgorithmic Approach
UnknownX YCART
Random ForestsLogic Regression
Linear Discriminant Analysis
![Page 22: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/22.jpg)
CART: Olive ExampleCART: Olive Example
![Page 23: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/23.jpg)
CART: Olive ExampleCART: Olive Example
![Page 24: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/24.jpg)
CART: Olive ExampleCART: Olive Example
But how do we decide what branches to use/keep? Pick best predictor!
![Page 25: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/25.jpg)
ConclusionConclusion• When interpreting a p-value consider the
alternative hypothesis: “My model is wrong”
• If you can, use cross-validation to assess models
• There are many methods designed specifically for prediction
![Page 26: This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatisticsLectureSeries05/PDFs/Irizarry.pdf · Two approaches • Physical models – A physicist](https://reader034.vdocuments.net/reader034/viewer/2022042412/5f2c6b072cbe6f71ec766978/html5/thumbnails/26.jpg)
Some harsh quotesSome harsh quotesThe statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.
Leo Breiman
The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties
Mosteller and Tukey