1. common problems with regression. a. inferring causation ...frederic/13/f16/day15.pdfthat there is...
TRANSCRIPT
Stat 13, Intro. to Statistical Methods for the Life and Health Sciences.
1. Common problems with regression. a. Inferring causation. b. Extrapolation. c. Curvature.
2. Testing significance of correlation or slope.
No class Thu Nov 24, Thanksgiving. Read ch 9. Hw4 is 10.1.8, 10.3.14, 10.3.21, 10.4.11 and is due Tue Nov 29.
The final, Fri Dec 9, 8am-11am, right here, will be on ch 1-10. Bring a PENCIL and CALCULATOR and any books or notes you want. No computers. http://www.stat.ucla.edu/~frederic/13/F16.
Common problems with regression.
• a. Correlation is not causation. Especially with observational data.
Holmes and Willett (2004) reviewed all prospective studies on fat consumption and breast cancer with at least 200 cases of breast cancer. "Not one study reported a significant positive association with total fat intake.... Overall, no association was observed between intake of total, saturated, monounsaturated, or polyunsaturated fat and risk for breast cancer."
They also state "The dietary fat hypothesis is largely based on the observation that national per capita fat consumption is highly correlated with breast cancer mortality rates. However, per capita fat consumption is highly correlated with economic development. Also, low parity and late age at first birth, greater body fat, and lower levels of physical activity are more prevalent in Western countries, and would be expected to confound the association with dietary fat."
Common problems with regression.
• b. Extrapolation.
• Often researchers extrapolate from high doses to low. The relationship can be nonlinear though.
• Researchers also often extrapolate from animals to humans.
Zaichkina et al. (2004) on hamsters.
Common problems with regression.
• c. Curvature. The best fitting line might fit poorly. Port et al. (2005). Wong et al. (2011).
How well does the line fit?
• $r^2$ is a measure of fit. It indicates the amount of scatter around the best fitting line.
• Residual plots can indicate curvature, outliers, or heteroskedasticity.
• $\sqrt{1 - r^2}\, s_y$, where $s_y$ is the standard deviation of the response, is useful as a measure of how far off predictions would have been on average.
Note that regression residuals have mean zero, whether the line fits well or poorly.
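The measures of fit above can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set (the `xs` and `ys` values and all function names are hypothetical, not from the course). It fits the least-squares line, confirms that the residuals average to zero, and compares the root-mean-square residual (using an n − 1 divisor, to match the sample SD) with sqrt(1 − r²) times the SD of y.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def correlation(xs, ys):
    """Return the sample correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def sample_sd(values):
    """Return the sample standard deviation (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
r = correlation(xs, ys)

mean_residual = sum(residuals) / len(residuals)
# Root-mean-square residual, with the same n - 1 divisor as the sample SD.
rms_residual = math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 1))
typical_error = math.sqrt(1 - r ** 2) * sample_sd(ys)

print(f"r^2 = {r ** 2:.4f}")
print(f"mean residual = {mean_residual:.2e}")
print(f"RMS residual = {rms_residual:.4f}, sqrt(1 - r^2) * s_y = {typical_error:.4f}")
```

The last two printed quantities agree because the sum of squared residuals equals (1 − r²) times the sum of squared deviations of y from its mean.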
Common problems with regression.
• d. Statistical significance. Could the observed correlation just be due to chance alone?
Inference for the Regression Slope: Theory-Based Approach. Section 10.5
Do students who spend more time in non-academic activities tend to have lower GPAs? Example 10.4
• The subjects were 34 undergraduate students from the University of Minnesota.
• They were asked questions about how much time they spent in activities like work, watching TV, exercising, non-academic computer use, etc., as well as what their current GPA was.
• We are going to test to see if there is a negative association between the number of hours per week spent on nonacademic activities and GPA.
Hypotheses
• Null Hypothesis: There is no association between the number of hours students spend on nonacademic activities and student GPA in the population.
• Alternative Hypothesis: There is a negative association between the number of hours students spend on nonacademic activities and student GPA in the population.
Descriptive Statistics
• Predicted GPA = 3.60 − 0.0059 (nonacademic hours).
• What do the slope and y-intercept mean?
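One way to read the fitted equation is to plug in a few values. The sketch below only evaluates the line from the slide; the function name `predicted_gpa` and the chosen hour values are illustrative assumptions. The intercept, 3.60, is the predicted GPA for a student reporting 0 nonacademic hours per week, and the slope says each additional hour per week is associated with a predicted GPA that is 0.0059 lower.

```python
def predicted_gpa(nonacademic_hours):
    """Evaluate the fitted line from the slide: 3.60 - 0.0059 * hours."""
    return 3.60 - 0.0059 * nonacademic_hours


# Illustrative hour values, chosen arbitrarily.
for hours in (0, 10, 40):
    print(f"{hours:>2} hours/week -> predicted GPA {predicted_gpa(hours):.3f}")

# Change in predicted GPA for 10 extra nonacademic hours per week.
drop_per_10_hours = predicted_gpa(10) - predicted_gpa(0)
print(f"10 more hours/week changes predicted GPA by {drop_per_10_hours:.3f}")
```

These are predictions from an observational association, so they describe how GPA tends to differ between students, not what would happen to one student who changed their hours.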
Shuffle to Develop Null Distribution
• We are going to shuffle just as we did with correlation to develop a null distribution.
• The only difference is that we will be calculating the slope each time and using that as our statistic.
• A test of association based on slope is equivalent to a test of association based on a correlation coefficient.
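The shuffling procedure described above can be sketched as follows. This is a minimal illustration, not the book's applet: the data are randomly generated stand-ins (the real survey values are not reproduced here), and the function names, the 2000 shuffles, and the seeds are all assumptions. The response values are shuffled to break any association, the slope is recomputed each time, and the one-sided p-value is the proportion of shuffled slopes at or below the observed slope.

```python
import random


def slope(xs, ys):
    """Return the least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def shuffle_test_negative_slope(xs, ys, n_shuffles=2000, seed=0):
    """Simulate the null distribution of the slope by shuffling ys.

    Returns (observed slope, list of shuffled slopes, one-sided p-value)
    for the alternative that the slope is negative.
    """
    rng = random.Random(seed)
    observed = slope(xs, ys)
    shuffled = list(ys)
    null_slopes = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any link between x and y
        null_slopes.append(slope(xs, shuffled))
    p_value = sum(s <= observed for s in null_slopes) / n_shuffles
    return observed, null_slopes, p_value


# Made-up stand-in data for 34 students (NOT the real survey responses).
data_rng = random.Random(1)
hours = [data_rng.uniform(0, 100) for _ in range(34)]
gpa = [3.60 - 0.0059 * h + data_rng.gauss(0, 0.4) for h in hours]

observed, null_slopes, p_value = shuffle_test_negative_slope(hours, gpa)
print(f"observed slope = {observed:.4f}")
print(f"one-sided p-value = {p_value:.4f}")
```

Replacing `slope` with a correlation function would give the same p-value, because for a fixed set of x and y values the slope is just the correlation multiplied by a constant (the ratio of the two SDs), and shuffling does not change either SD.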
Beta vs Rho
• Testing the slope of the regression line is equivalent to testing the correlation (same p-value, but obviously different confidence intervals since the statistics are different).
• Hence these hypotheses are equivalent.
• Ho: β = 0, Ha: β < 0 (Slope)
• Ho: ρ = 0, Ha: ρ < 0 (Correlation)
• Sample slope (b), population slope (β: beta)
• Sample correlation (r), population correlation (ρ: rho)
• When we do the theory-based test, we will be using the t-statistic, which can be calculated from either the slope or the correlation.
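The claim that the t-statistic can come from either the slope or the correlation can be verified on any data set. The sketch below uses hypothetical values (the `xs` and `ys` lists and the function name are invented for illustration) and the standard formulas t = b / SE(b) and t = r·sqrt(n − 2) / sqrt(1 − r²), which are algebraically identical.

```python
import math


def t_from_slope_and_correlation(xs, ys):
    """Return the t-statistic computed two ways: from b/SE(b) and from r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sxy / sxx                      # sample slope
    a = mean_y - b * mean_x            # sample intercept
    r = sxy / math.sqrt(sxx * syy)     # sample correlation

    # Standard error of the slope, using n - 2 degrees of freedom.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

    t_slope = b / se_b
    t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t_slope, t_corr


# Hypothetical data, invented purely for illustration.
xs = [2, 5, 8, 12, 15, 20, 24, 30, 35, 40]
ys = [3.7, 3.5, 3.8, 3.4, 3.6, 3.2, 3.3, 3.1, 3.4, 2.9]

t_slope, t_corr = t_from_slope_and_correlation(xs, ys)
print(f"t from slope:       {t_slope:.6f}")
print(f"t from correlation: {t_corr:.6f}")
```

Because the two t-statistics are the same number, the theory-based p-values for testing β = 0 and ρ = 0 are the same as well.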
Introduction
• Our null distributions are again bell-shaped and centered at 0 (for either correlation or slope as our statistic).
The book on p549 finds a p value of 3.3% by simulation.
Validity Conditions
• Under certain conditions, theory-based inference for the correlation or the slope of the regression line uses t-distributions.
• We could use simulations or the theory-based methods for the slope of the regression line.
• We would get the same p-value if we used correlation as our statistic.
Predicting Heart Rate from Body Temperature. Example 10.5A
Heart Rate and Body Temp
• Earlier we looked at the relationship between heart rate and body temperature with 130 healthy adults.
• Predicted Heart Rate = −166.3 + 2.44 Temp
• r = 0.257
• We tested to see if we had convincing evidence that there is a positive association between heart rate and body temperature in the population using a simulation-based approach. (We will make it 2-sided this time.)
• Null Hypothesis: There is no association between heart rate and body temperature in the population. β = 0
• Alternative Hypothesis: There is an association between heart rate and body temperature in the population. β ≠ 0
We get a very small p-value (0.0036). A slope as extreme as our observed slope of 2.44 would very rarely happen by chance alone if there were no association.
• We can also approximate a 95% confidence interval: observed statistic ± 2 SD of statistic. 2.44 ± 2(0.842) = 0.76 to 4.12.
• What does this mean? We're 95% confident that, in the population of healthy adults, each 1° increase in body temp is associated with an increase in heart rate of between 0.76 and 4.12 beats per minute.
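The 2SD interval above is simple arithmetic, shown here as a small sketch. The function name `two_sd_interval` is an illustrative assumption; the inputs 2.44 and 0.842 are the slope and SD quoted on the slide.

```python
def two_sd_interval(statistic, sd_of_statistic):
    """Approximate 95% CI: statistic plus or minus 2 SDs of the statistic."""
    margin = 2 * sd_of_statistic
    return statistic - margin, statistic + margin


# Observed slope and SD of the slope, as quoted on the slide.
low, high = two_sd_interval(2.44, 0.842)
print(f"approximate 95% CI for the slope: {low:.2f} to {high:.2f}")
```

Since the whole interval lies above 0, it agrees with the small 2-sided p-value: a slope of 0 is not a plausible value for the population.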
• The theory-based approach should work well since the distribution has a nice bell shape.
• Also check the scatterplot.
• We will use the t-statistic to get our theory-based p-value.
• We will find a theory-based confidence interval for the slope.
• On p554, the book notes the formula $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$.
• Here the t-statistic is 2.97.
• The p-value is 0.36%. So the correlation is statistically significantly greater than zero.
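To see where a p-value like this comes from, the sketch below converts the reported t-statistic of 2.97 into a 2-sided p-value using a t-distribution with n − 2 = 128 degrees of freedom. To stay dependency-free it integrates the t density numerically with Simpson's rule; the function names, the integration width, and the step count are assumptions made for this illustration, and a statistics package would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t_stat, df, width=60.0, steps=20000):
    """Approximate P(T >= t_stat) by Simpson's rule on [t_stat, t_stat + width].

    steps must be even; the area beyond t_stat + width is ignored, which is
    negligible for moderate or large df.
    """
    h = width / steps
    total = t_pdf(t_stat, df) + t_pdf(t_stat + width, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(t_stat + i * h, df)
    return total * h / 3


n = 130          # sample size from the slide
t_stat = 2.97    # t-statistic reported on the slide
df = n - 2

p_two_sided = 2 * t_upper_tail(abs(t_stat), df)
print(f"2-sided p-value for t = {t_stat} with {df} df: {p_two_sided:.4f}")
```

The result should land close to the 0.36% quoted above, which also matches the simulation-based p-value of 0.0036.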
Smoking and Drinking. Example 10.5B
Validity Conditions
Remember our validity conditions for theory-based inference for slope of the regression equation.
1. The scatterplot should follow a linear trend.
2. There should be approximately the same number of points above and below the regression line (symmetry).
3. The variability of vertical slices of the points should be similar. This is called homoskedasticity.
• Let's look at some scatterplots that do not meet the requirements.
Smoking and Drinking
The relationship between number of drinks and cigarettes per week for a random sample of students at Hope College.
The dot at (0,0) represents 524 students.
Are the conditions met? Hard to say. The book says no.
• When the conditions are not met, applying simulation-based inference is preferable to theory-based t-tests and CIs.
Validity Conditions
• What do you do when validity conditions aren't met for theory-based inference?
• Use the simulation-based approach.
• Another strategy is to "transform" the data onto a different scale so the conditions are met.
• The logarithmic scale is common.
• One can also fit a different curve, not necessarily a line.
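The transformation strategy above can be illustrated with a small sketch. The data here are invented: y grows roughly exponentially in x with a small deterministic wobble, so the raw scatterplot is curved while log(y) against x is close to linear. The function name and all values are assumptions made for this example.

```python
import math


def correlation(xs, ys):
    """Return the sample correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Made-up data: roughly exponential growth, alternately 10% above and below.
xs = list(range(1, 11))
ys = [2.0 * math.exp(0.5 * x) * (1 + 0.1 * (-1) ** x) for x in xs]

# Transform the response to the logarithmic scale.
log_ys = [math.log(y) for y in ys]

r_raw = correlation(xs, ys)
r_log = correlation(xs, log_ys)
print(f"r on the original scale: {r_raw:.3f}")
print(f"r on the log scale:      {r_log:.3f}")
```

A correlation closer to 1 on the log scale suggests a line fits the transformed data better, but the residual plot on the new scale should still be checked for curvature and unequal spread.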