8. goodness of fit in regression. - ucla statisticsfrederic/10/sum19/day02.pdf · 2019. 8. 7. ·...
TRANSCRIPT
1.Two-waytables.2.Histograms.3.mean,median,IQR,zscore.4.skew.5.Boxplots.6.Scatterplotsandcorrelation.7.Regression.8.Goodnessoffitinregression.9.Commonregressionpitfalls.
Simpledatasummaries• Forcategoricaldata,two-waytablescanbeuseful.
• Forquantitativedata,histogramsareuseful.
• Forarelativefrequencyhistogram,thepercentageofpeopleinthebinisshownratherthanthewholenumber.
• Here,n=25.0.2=20%ofpeopleinthesamplehad3quarts.Thenumberofpeoplewith3quartswas0.2x25=5.• Thesizesofthebinscanbeadjustedandthelookofthehistogramcanbeinfluencedbythebinsizes.
• Withhistograms,lookforsymmetry,skew,bimodality,andoutliers.
• Therange=maximumobservedvalue– minimum.• Forroughlysymmetricdata,themeanandsd aregoodsummariesofthecenterandspread.• Whenthedataareskewedorthereareseriousoutliers,themedianandtheIQRcanbepreferable.
3.mean,median,IQR,zscore.• Themedianisthemiddleinthesortedlistofvalues.ItisavalueMwhere50%oftheobservationsare≤M.Differentsoftwareusedifferentconventions,butwewillusetheconventionthat,ifthereisarangeofpossiblemedians,youtakethemiddleofthatrange.
• Forexample,supposedataare1,3,7,7,8,9,12,14.• M=7.5.• Suppose25%oftheobservationsliebelowacertainvaluex.Thenxiscalledthelowerquartile (or25th percentile).
• Similarly,if25%oftheobservationsaregreaterthanx,thenxiscalledtheupperquartile (or75th percentile).
• ThelowerquartilecanbecalculatedbyfindingthemedianM,andthendeterminingthemedianofthevaluesbelowM.SimilarlytheupperquartileisthemedianofthevaluesgreaterthanM.
IQRandFive-NumberSummary• Thedifferencebetweenthequartilesiscalledtheinter-quartilerange (IQR),anothermeasureofvariabilityalongwithstandarddeviation.• Thefive-numbersummary forthedistributionofaquantitativevariableconsistsoftheminimum,lowerquartile,median,upperquartile,andmaximum.• TechnicallytheIQRisnottheinterval(25thpercentile,75thpercentile),butthedifference75th percentile– 25th .• Supposedataare1,3,7,7,8,9,12,14.• M=7.5,25th percentile=5,75th percentile=10.5.IQR=5.5.
zscore.• Manydatasetsareroughlysymmetricandthehistogramsomewhatresemblesthenormalcurve.
• Forsuchdata,about2/3ofobservationsarewithin1SDofthemean,andabout95%arewithin2SDsofthemean.• Itcanbeusefultoconvertvaluestozscores.Thissimplymeanstakingavaluexandstandardizingitbysubtractingthesamplemeanandthendividingbys.• IQ.mean=100,s=15.Ifx=125,thenz=(125-100)/15=1.33.• About95%ofz-scoresarebetween-2and2,fornormaldata.
• 4.Skew.• Themeanandsd areveryinfluencedbyoutliersandskew.However,themedianandIQRaremuchmoreresistanttooutliersandskew.InmanycasesthemedianandIQRwillnotchangeatallbytheadditionofoneortwohugeoutliers.
• Forrightskeweddata,mean>median.Forleftskeweddata,mean<median.
GeyserEruptionsExample6.1
OldFaithfulInter-EruptionTimes
• Howdothefive-numbersummaryandIQRdifferforinter-eruptiontimesbetween1978and2003?
OldFaithfulInter-EruptionTimes
• 1978IQR=81– 58=23• 2003IQR=98– 87=11
5.Boxplots.
Min Qlower Med Qupper Max
outliersonboxplots.• Adatavaluethatismorethan1.5× IQRabovetheupperquartileorbelowthelowerquartileisconsideredanoutlier.
• Whentheseoccur,thewhiskersonaboxplotextendouttothefarthestvaluenotconsideredanoutlierandoutliersarerepresentedbyadotoranasterisk.
CancerPamphletReadingLevels
• Shortetal.(1995)comparedreadinglevelsofcancelpatientsandreadabilitylevelsofcancerpamphlets.Whatisthe:• Medianreadinglevel?• Meanreadinglevel?
• Arethedataskewedonewayortheother?
• Skewedabittotheright• Meantotherightofmedian
6.ScatterplotsandCorrelation
Time 30 41 41 43 47 48 51 54 54 56 56 56 57 58
Score 100 84 94 90 88 99 85 84 94 100 65 64 65 89
Time 58 60 61 61 62 63 64 66 66 69 72 78 79
Score 83 85 86 92 74 73 75 53 91 85 62 68 72
Supposewecollecteddataontherelationshipbetweenthetimeittakesastudenttotakeatestandtheresultingscore.
Scatterplot
Putexplanatoryvariableonthehorizontalaxis.
Putresponsevariableontheverticalaxis.
DescribingScatterplots•Whenwedescribedatainascatterplot,wedescribethe• Direction(positiveornegative)• Form(linearornot)• Strength(strong-moderate-weak,wewillletcorrelationhelpusdecide)• UnusualObservations
Correlation• Correlationmeasuresthestrengthanddirectionofalinear associationbetweentwoquantitative variables.• Correlationisanumberbetween-1and1.• Withpositivecorrelationonevariableincreases,onaverage,astheotherincreases.• Withnegativecorrelationonevariabledecreases,onaverage,astheotherincreases.• Thecloseritistoeither-1or1thecloserthepointsfittoaline.• Thecorrelationforthetestdatais-0.56.
CorrelationGuidelinesCorrelationValue Strengthof
AssociationWhatthismeans
0.7to1.0 Strong Thepointswillappeartobenearlyastraightline
0.3to0.7 Moderate Whenlookingatthegraphtheincreasing/decreasingpatternwillbeclear,but thereisconsiderablescatter.
0.1to0.3 Weak Withsomeeffortyouwillbeabletoseeaslightlyincreasing/decreasingpattern
0to0.1 None Nodiscernibleincreasing/decreasingpattern
Same StrengthResultswithNegativeCorrelations
BacktothetestdataActuallythelastthreepeopletofinishthetesthadscoresof93,93,and97.
Whatdoesthisdotothecorrelation?
InfluentialObservations• Thecorrelationchangedfrom-0.56(afairlymoderatenegativecorrelation)to-0.12(aweaknegativecorrelation).• Pointsthatarefartotheleftorrightandnotintheoveralldirectionofthescatterplotcangreatlychangethecorrelation.(influentialobservations)
Correlation• Correlationmeasuresthestrengthanddirectionofalinear associationbetweentwoquantitativevariables.• -1< r< 1• Correlationmakesnodistinctionbetweenexplanatoryandresponsevariables.• Correlationhasnounits.Anylinearchangeinunitsofxory,meaningaddingandmultiplyingeveryobservationbysomenumber,doesnotinfluencer.• Correlationisnotresistanttooutliers.Itissensitive.
Calculatingcorrelation,r.
ρ =rho=correlationofthepopulation.SupposethereareNpeopleinthepopulation,X=temperature,Y=heartrate,themeanandsd oftempinthepop.areµ" and𝜎" ,andthepop.meanandsd ofheartrateareµ$and𝜎$.
ρ =*+∑ "-./0
10+23*
$-./414
.
Givenasampleofsizen,weestimateρ using
r= *5.*
∑"-.060
523*
$-.464
.
24
TemperatureandHeartRate
Tmp 98.3 98.2 98.7 98.5 97.0 98.8 98.5 98.7 99.3 97.8HR 72 69 72 71 80 81 68 82 68 65Tmp 98.2 99.9 98.6 98.6 97.8 98.4 98.7 97.4 96.7 98.0HR 71 79 86 82 58 84 73 57 62 89
TemperatureandHeartRate
r=0.378
HeartRateandBodyTemp• Anothergroupstudiedtherelationshipbetweenheartrateandbodytemperaturewith130healthyadults• PredictedHeartRate = −166.3 + 2.44 Temp• r=0.257
• Fittingalinetoascatterplot.• Unlessthepointsareperfectlylinearlyalligned,therewillnotbeasinglelinethatgoesthrougheverypoint.• Wewantalinethatgetsascloseaspossibletoallthepoints.
7.Regression.
Regression.• Wewantalinethatminimizestheverticaldistancesbetweenthelineandthepoints• Thesedistancesarecalledresiduals.• Thelinewewillfindactuallyminimizesthesumofthesquaresoftheresiduals.• Thisiscalledaleast-squaresregressionline.
AreDinnerPlatesGettingLarger?
GrowingPlates?• TherearemanyrecentarticlesandTVreportsabouttheobesityproblem.• Onereasonsomehavegivenisthatthesizeofdinnerplatesareincreasing.• Aretheseblackcirclesthesamesize,orisonelargerthantheother?
GrowingPlates?• Theyappeartobethesamesizeformany,buttheoneontherightisabout20%largerthantheleft.
• Thissuggeststhatpeoplewillputmorefoodonlargerdinnerplateswithoutknowingit.
• Thereisnameforthisphenomenon:Delboeufillusion
GrowingPlates?• Researchersgathereddatatoinvestigatetheclaimthatdinnerplatesaregrowing• Americandinnerplatessoldonebay onMarch30,2010(VanIttersum andWansink,2011)• Yearmanufacturedanddiameteraregiven.
GrowingPlates?• Bothyear(explanatoryvariable)anddiameterininches(responsevariable)arequantitative.• Eachdotrepresentsoneplateinthisscatterplot.
GrowingPlates?• Theassociationappearstoberoughlylinear• Theleastsquaresregressionlineisadded.
RegressionLineTheregressionequationis𝑦K = 𝑎 + 𝑏𝑥:• a isthey-intercept• b istheslope• x isavalueoftheexplanatoryvariable• ŷ isthepredictedvaluefortheresponsevariable
• Foraspecificvalueofx,thecorrespondingdistancey − 𝑦K (oractual– predicted)isaresidual
RegressionLine• Theleastsquareslineforthedinnerplatedatais𝑦K = −14.8 + 0.0128𝑥• OrdiameterR = −14.8 + 0.0128(year)• Thisallowsustopredictplatediameterforaparticularyear.
Slope𝑦K = −14.8 + 0.0128𝑥
• Whatisthepredicteddiameterforaplatemanufacturedin2000?• -14.8+0.0128(2000)=10.8in.
• Whatisthepredicteddiameterforaplatemanufacturedin2001?• -14.8+0.0128(2001)=10.8128in.
• Howdoesthiscomparetoourpredictionfortheyear2000?• 0.0128larger
• Slopeb =0.0128meansthatusingtheregressionline,diametersarepredictedtoincreaseby0.0128inchesperyearonaverage
Slope• Slopeisthepredictedchangeintheresponsevariableforone-unitchangeintheexplanatoryvariable.• Boththeslopeandthecorrelationcoefficientforthisstudywerepositive.• Theslopeis0.0128• Thecorrelationis0.604
• Theslopeandcorrelationcoefficientwillalwayshavethesamesign.
y-intercept• They-interceptiswheretheregressionlinecrossesthey-axisorthepredictedresponsewhentheexplanatoryvariableequals0.• Wehaday-interceptof-14.8inthedinnerplateequation.Whatdoesthistellusaboutourdinnerplateexample?• Dinnerplatesinyear0were-14.8inches.
• Howcanitbenegative?• Theequationworkswellwithintherangeofvaluesgivenfortheexplanatoryvariable,butfailsoutsidethatrange.
• Ourequationshouldonlybeusedtopredictthesizeofdinnerplatesfromabout1950to2010.
Extrapolation• Predictingvaluesfortheresponsevariableforvaluesoftheexplanatoryvariablethatareoutsideoftherangeoftheoriginaldataiscalledextrapolation.
r2
• Whiletheinterceptandslopehavemeaninginthecontextofyearanddiameter,rememberthatthecorrelationdoesnothaveunits.Itisjust0.604.• Thesquareofthecorrelationr2 hasmeaningintermsofvariation.• r2 =0.6042=0.365or36.5%• 36.5%ofthevariationinplatesize(theresponsevariable)canbeexplainedbyitslinearassociationwiththeyear(theexplanatoryvariable).
Slopeofregressionline.• Suppose𝑦K =a+bx istheregressionline.
• Theslopeboftheregressionlineisb=r6460 .
Thisisusuallythethingofprimaryinteresttointerpret,asthepredictedincreaseinyforeveryunitincreaseinx.• Bewareofassumingcausationthough,esp.withobservationalstudies.Bewaryofextrapolationtoo.
• Theintercepta=$ - b " .
• TheSDoftheresidualsis 1 − 𝑟V� 𝑠$.Thisisagoodestimateofhowmuchtheregression
predictionswilltypicallybeoffby.
8.Howwelldoesthelinefit?• 𝑟V isameasureoffit.Itindicatestheamountofscatteraroundthebestfittingline.• Residualplotscanindicatecurvature,outliers,orheteroskedasticity.
• 1 − 𝑟V� 𝑠$isaveryusefulsummary. Itisameasure• ofhowfaroffpredictionswouldhavebeenonaverage.
-2 -1 0 1 2 3 4 5
-4-2
02
46
8
x
y
-2 -1 0 1 2 3 4 5-10
-8-6
-4-2
02
x
residual
• Heteroskedasticity:whenthevariabilityinyisnotconstantasxvaries.
Howwelldoesthelinefit?• 𝑟V isameasureoffit.Itindicatestheamountofscatteraroundthebestfittingline.• Residualplotscanindicatecurvature,outliers,orheteroskedasticity.
• 1 − 𝑟V� 𝑠$isusefulasameasureofhowfaroffpredictionswouldhavebeenonaverage.
Notethatregressionresidualshavemeanzero,whetherthelinefitswellorpoorly.
9.commonproblemswithregression.
• a.Correlationisnotcausation.Especiallywithobservationaldata.
Commonproblemswithregression.
Commonproblemswithregression.Holmes and Willett (2004) reviewed all prospective studies on fat consumption and breast cancer with at least 200 cases of breast cancer. "Not one study reported a significant positive association with total fat intake.... Overall, no association was observed between intake of total, saturated, monounsaturated, or polyunsaturated fat and risk for breast cancer."
They also state "The dietary fat hypothesis is largely based on the observation that national per capita fat consumption is highly correlated with breast cancer mortality rates. However, per capita fat consumption is highly correlated with economic development. Also, low parity and late age at first birth, greater body fat, and lower levels of physical activity are more prevalent in Western countries, and would be expected to confound the association with dietary fat."
Commonproblemswithregression.• b.Extrapolation.
Commonproblemswithregression.• b.Extrapolation.• Oftenresearchersextrapolatefromhighdosestolow.
Commonproblemswithregression.• b.Extrapolation.Therelationshipcanbenonlinearthough.Researchersalsooftenextrapolatefromanimalstohumans.
Zaichkina etal.(2004)onhamsters
Commonproblemswithregression.• c.Curvature.Thebestfittinglinemightfitpoorly.Portetal.(2005).
Commonproblemswithregression.• c.Curvature.Thebestfittinglinemightfitpoorly.Wongetal.(2011).
Commonproblemswithregression.• d.Statisticalsignificance.Couldtheobservedcorrelationjustbeduetochancealone?