Scientific method: Statistical errors : Nature News & Comment

4/14/2015
http://www.nature.com/news/scientificmethodstatisticalerrors1.14700?WT.ec_id=NATURE20140213

NATURE | NEWS FEATURE

Scientific method: Statistical errors

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.

Regina Nuzzo

12 February 2014

Image credit: DALE EDWIN MURRAY

For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white.

The results were “plain as day”, recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. “The hypothesis was sexy,” he says, “and the data provided clear support.” The P value, a common index for the strength of evidence, was 0.01, usually interpreted as 'very significant'. Publication in a high-impact journal seemed within Motyl's grasp.

But then reality intervened. Sensitive to controversies over reproducibility, Motyl and his adviser, Brian Nosek, decided to replicate the study. With extra data, the P value came out as 0.59, not even close to the conventional level of significance, 0.05. The effect had disappeared, and with it, Motyl's dreams of youthful fame [1].

It turned out that the problem was not in the data or in Motyl's analyses. It lay in the surprisingly slippery nature of the P value, which is neither as reliable nor as objective as most scientists assume. “P values are not doing their job, because they can't,” says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.

For many scientists, this is especially worrying in light of the reproducibility concerns. In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false [2]; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.

At the same time, statisticians are looking for better ways of thinking about data, to help scientists to avoid missing important information or acting on false alarms. “Change your statistical philosophy and all of a sudden different things become important,” says Steven Goodman, a physician and statistician at Stanford. “Then 'laws' handed down from God are no longer handed down from God. They're actually handed down to us by ourselves, through the methodology we adopt.”

Out of context

P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny [3]. One researcher suggested rechristening the methodology “statistical hypothesis inference testing” [3], presumably for the acronym it would yield.

The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a 'null hypothesis' that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil's advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.
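Fisher's recipe can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the article: it assumes a coin-toss experiment whose null hypothesis is a fair coin, and it reads 'at least as extreme' as 'no more probable under the null hypothesis than the observed count', which is one common convention among several. The function name and the data are hypothetical.

```python
from math import comb


def exact_binomial_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """Two-sided exact P value for a binomial count under a null hypothesis.

    Sums the null probabilities of every outcome that is no more likely
    than the observed one (one reading of "at least as extreme").
    """
    def null_pmf(k: int) -> float:
        # Probability of exactly k successes if the null hypothesis is true.
        return comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)

    observed = null_pmf(successes)
    # Small relative tolerance so that ties are not lost to rounding error.
    threshold = observed * (1 + 1e-9)
    total = sum(null_pmf(k) for k in range(trials + 1) if null_pmf(k) <= threshold)
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical data: 60 heads in 100 tosses of a coin assumed fair under the null.
    print(exact_binomial_p_value(60, 100))
```

A small result says only that the observed count would be unusual if the null hypothesis were true; it does not by itself give the probability that the null hypothesis is false.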

For all the P value's apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded in the late 1920s by Fisher's bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts now familiar from introductory statistics classes. They pointedly left out the P value.

But while the rivals feuded (Neyman called some of Fisher's work mathematically “worse than useless”; Fisher called Neyman's approach “childish” and “horrifying [for] intellectual freedom in the west”), other researchers lost patience and began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher's easy-to-calculate P value into Neyman and Pearson's reassuringly rigorous rule-based system. This is when a P value of 0.05 became enshrined as 'statistically significant', for example. “The P value was never meant to be used the way it's used today,” says Goodman.

What does it all mean?

One result is an abundance of confusion about what the P value means [4]. Consider Motyl's study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour: possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis (telepathy, aliens, homeopathy), the greater the chance that an exciting finding is a false alarm, no matter what the P value is.

[Figure: 'Probable cause'. Credit: R. Nuzzo; source: T. Sellke et al. Am. Stat. 55, 62-71 (2001)]

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see 'Probable cause'). According to one widely used calculation [5], a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl's finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73%, or only 50% if he wanted another 'very significant' result [6,7]. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.
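The percentages quoted above can be reproduced with a short calculation, under assumptions that the article does not spell out. The sketch below assumes the bound from Sellke et al., the source credited for the 'Probable cause' graphic, in which a P value p favours the alternative over the null hypothesis by a Bayes factor of at best 1 / (-e · p · ln p). It then applies Bayes' rule with a chosen prior probability that the effect is real. The function names are hypothetical, and the article may have used a different route to the same numbers.

```python
import math


def min_bayes_factor_for_null(p: float) -> float:
    """Lower bound -e * p * ln(p) on the Bayes factor in favour of the null.

    Assumed here to be the calibration of Sellke et al.; valid for 0 < p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_false_alarm_probability(p: float, prior_prob_real: float = 0.5) -> float:
    """Smallest posterior probability that there is no real effect.

    prior_prob_real is the assumed chance, before seeing the data,
    that the effect exists (0.5 corresponds to a 'toss-up').
    """
    prior_odds_null = (1 - prior_prob_real) / prior_prob_real
    posterior_odds_null = prior_odds_null * min_bayes_factor_for_null(p)
    return posterior_odds_null / (1 + posterior_odds_null)


if __name__ == "__main__":
    for p_value in (0.05, 0.01):
        print(p_value, round(min_false_alarm_probability(p_value), 2))
```

With even prior odds, the function returns roughly 0.11 for P = 0.01 and 0.29 for P = 0.05, in line with the figures in the text. With a long-shot prior of 1 in 20 (`prior_prob_real=0.05`), a P value of 0.05 still leaves about an 89% chance of a false alarm.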

Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed [8] that those who meet their spouses online are less likely to divorce (p [...]

[...] we're now seeing. We just don't yet have all the fixes.

Statisticians have pointed to a number of measures that might help. To avoid the trap of thinking about results as significant or not significant, for example, Cumming thinks that researchers should always report effect sizes and confidence intervals. These convey what a P value does not: the magnitude and relative importance of an effect.
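As a rough illustration of what such a report might contain, the sketch below computes a raw difference in means, a standardized effect size (Cohen's d with a pooled standard deviation) and an approximate confidence interval for the difference. It is not drawn from the article: the function name and data are hypothetical, and the interval uses a normal approximation that is only reasonable for fairly large samples (a t-based interval would be wider for small ones).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def effect_summary(group_a, group_b, confidence=0.95):
    """Return (mean difference, Cohen's d, confidence interval for the difference).

    Assumes two independent samples; the interval is a normal approximation.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    diff = mean(group_a) - mean(group_b)

    # Cohen's d: the difference expressed in units of the pooled standard deviation.
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    cohens_d = diff / pooled_sd

    # Standard error of the difference and the matching normal quantile.
    std_error = sqrt(var_a / n_a + var_b / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, cohens_d, (diff - z * std_error, diff + z * std_error)


if __name__ == "__main__":
    # Hypothetical scores for two small groups.
    print(effect_summary([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

An interval that is narrow but sits close to zero signals a precisely measured yet tiny effect, which a small P value alone would not reveal.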

Many statisticians also advocate replacing the P value with methods that take advantage of Bayes' rule: an eighteenth-century theorem that describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity, something that the statistical pioneers were trying to avoid. But the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.
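A minimal sketch of that updating step is given below. It is an illustration rather than any method described in the article: the prior of 0.05 and the likelihoods of 0.8 and 0.2 are made-up numbers, and the function name is hypothetical. Each call applies Bayes' rule once, and feeding the result back in as the next prior shows how the probability shifts as evidence accumulates.

```python
def bayes_update(prior: float, p_data_if_true: float, p_data_if_false: float) -> float:
    """Posterior probability of a hypothesis after one piece of evidence.

    prior            probability of the hypothesis before seeing the data
    p_data_if_true   probability of the data if the hypothesis is true
    p_data_if_false  probability of the data if the hypothesis is false
    """
    evidence_if_true = prior * p_data_if_true
    evidence_if_false = (1 - prior) * p_data_if_false
    return evidence_if_true / (evidence_if_true + evidence_if_false)


if __name__ == "__main__":
    belief = 0.05  # hypothetical long-shot prior
    for study in range(1, 4):
        # Hypothetical: each supportive study is four times likelier if the effect is real.
        belief = bayes_update(belief, p_data_if_true=0.8, p_data_if_false=0.2)
        print(study, belief)
```

The choice of prior is where the subjectivity enters: two observers who start from different priors will reach different posteriors from the same data, although their conclusions draw closer as evidence accumulates.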

Others argue for a more ecumenical approach, encouraging researchers to try multiple methods on the same data set. Stephen Senn, a statistician at the Centre for Public Health Research in Luxembourg City, likens this to using a floor-cleaning robot that cannot find its own way out of a corner: any data-analysis method will eventually hit a wall, and some common sense will be needed to get the process moving again. If the various methods come up with different answers, he says, “that's a suggestion to be more creative and try to find out why”, which should lead to a better understanding of the underlying reality.

Simonsohn argues that one of the strongest protections for scientists is to admit everything. He encourages authors to brand their papers 'P-certified, not P-hacked' by including the words: “We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study.” This disclosure will, he hopes, discourage P-hacking, or at least alert readers to any shenanigans and allow them to judge accordingly.

A related idea that is garnering attention is two-stage analysis, or 'preregistered replication', says political scientist and statistician Andrew Gelman of Columbia University in New York City. In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings, and would publicly preregister their intentions in a database such as the Open Science Framework (https://osf.io). They would then conduct the replication studies and publish the results alongside those of the exploratory studies. This approach allows for freedom and flexibility in analyses, says Gelman, while providing enough rigour to reduce the number of false alarms being published.

More broadly, researchers need to realize the limits of conventional statistics, Goodman says. They should instead bring into their analysis elements of scientific judgement about the plausibility of a hypothesis and study limitations that are normally banished to the discussion section: results of identical or similar experiments, proposed mechanisms, clinical knowledge and so on. Statistician Richard Royall of Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, said that there are three questions a scientist might want to ask after a study: 'What is the evidence?' 'What should I believe?' and 'What should I do?' One method cannot answer all these questions, Goodman says: “The numbers are where the scientific discussion should start, not end.”

Nature 506, 150-152 (13 February 2014) doi:10.1038/506150a

See Editorial page 131

References

1. Nosek, B. A., Spies, J. R. & Motyl, M. Perspect. Psychol. Sci. 7, 615-631 (2012).

2. Ioannidis, J. P. A. PLoS Med. 2, e124 (2005).

3. Lambdin, C. Theory Psychol. 22, 67-90 (2012).

4. Goodman, S. N. Ann. Internal Med. 130, 995-1004 (1999).

5. Goodman, S. N. Epidemiology 12, 295-297 (2001).

6. Goodman, S. N. Stat. Med. 11, 875-879 (1992).

7. Gorroochurn, P., Hodge, S. E., Heiman, G. A., Durner, M. & Greenberg, D. A. Genet. Med. 9, 325-331 (2007).

8. Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L. & VanderWeele, T. J. Proc. Natl Acad. Sci. USA 110, 10135-10140 (2013).

9. Simmons, J. P., Nelson, L. D. & Simonsohn, U. Psychol. Sci. 22, 1359-1366 (2011).

10. Simonsohn, U., Nelson, L. D. & Simmons, J. P. J. Exp. Psychol. http://dx.doi.org/10.1037/a0033242 (2013).

11. Campbell, J. P. J. Appl. Psych. 67, 691-700 (1982).


Author information

Affiliations

Regina Nuzzo is a freelance writer and an associate professor of statistics at Gallaudet University in Washington DC.

37 comments

Comments for this thread are now closed.

Guest, 2014-03-06 07:31 PM
Can anyone help me understand the "probable cause" picture of the paper? I admit that I am lost. 1. What is the meaning of "odds of hypothesis"? A hypothesis can be right, or wrong. What do the odds of it mean? If we know the odds, do we still need to know the p-value? 2. How can I get the number in the picture: with 1 to 19 odds of hypothesis + p-value = 0.05 -> odds become 11% vs. 89%. Thanks.

Charles Green, 2014-02-21 04:21 PM
In statistics P values, although called confidence values, are not measures of accuracy. They are a statement of the distribution of results that can be obtained when the test is replicated. If one uses them correctly they are a way to select future projects. A test with a very high P could be replicated with a smaller sample, thus saving in the cost of replication. If the replication results fail then another test is needed with a larger sample. Only after many tests can conclusions be made. Use of different tests does not improve the situation. I tested magazine vs TV spending with three tests, all with P = 98 to 99.4. All were wrong due to an error in the underlying data. I assumed continuous data when it was discrete data. More than three quarters of my education in statistics dealt with errors in design and consideration of the underlying data.

David Clarke, 2014-02-18 05:40 PM
I'm probably stupid but how am I supposed to assess the probability of my hypothesis into a "long shot" at 19 to 1, a "toss-up" at 1 to 1, a "good bet" at 9 to 1?

Barry Cohen, 2014-02-15 10:37 PM
Regina: In your figure labeled Probable Cause, you cite Sellke et al. (2001) as the source. I read that article, and could not see how you derived your figure from that article. Specifically, Sellke et al. (2001) needed to posit a value for xi, which as they defined it corresponds to what psychologists usually call delta (the basis for determining the power of a t test), in order to find the odds for different p values (see their Figure 2). It is not clear what value for xi you are using, though from your results, I would guess it is about .75, which would lead to an average power for real effects of about .12 for a .05, two-tailed significance test. More generally, your figure relies on the chance of a "real effect" in each case, but are you defining a real effect as anything other than exactly zero? Doesn't it matter how large, on average, these real effects are? In other words: We know the probability of obtaining a p value of .05 or smaller when there is no real effect, but doesn't the probability of obtaining a p value of .05 or smaller when there IS a real effect depend on the size of the real effect (for a given sample size)? What assumption are you making there?

deborah mayo, 2014-02-15 03:53 AM
Regina: There is a citation from Neyman in this article but I don't see the reference. I'd be grateful if you provided it. I'm fairly sure it's entirely taken out of context. "worse than useless" is a technical term. Poor Motyl was on the brink of scientific glory by means of shoddy statistics! Glory, I tell you, glory. Maybe he should be given a medal for not rushing into print as imagined by those who view science as an unthinking screening effort. Of course, it can't be that he's fallen into the dumbest of dumb misuses of p-values. It cannot be that he's exploiting fraudulent uses of statistics. No, this author blames the statistical tools for his highly questionable exploitation of p-values. The truth is that the only shade of grey here is the fact that misuse of statistics differs only in degree from out and out fraud. Any inference is questionable if the researcher cannot show that flaws in his or her analysis would have been detected with high probability. Fields (like this one) that regularly spin out results without showing they have worked hard or have even tried to subject their own analysis to severe scrutiny are pseudoscientific. Pseudoscientists are frauds and should be treated as such. Science writers who exploit the fashion of dumping on p-values only give them excuses.


David Lovell, 2014-02-15 02:52 AM
Thanks for raising awareness on this Regina. I would be very interested to see a follow-up article or comments about False Discovery Rate (FDR) procedures used in situations where multiple comparisons are made. I'm not a statistician; my intuition is that FDR further masks the shortcomings of Null Hypothesis Significance Testing. As far as I understand it, FDR amounts to setting a more stringent p-value at which one regards data to be statistically significant. FDR procedures abound in bioinformatics and other areas of modern quantitative bioscience where measurements are plentiful. Is my skepticism of FDR warranted?

Ben Wise, 2014-02-14 03:11 PM
Morris DeGroot at CMU made this point way back in the 1980's. P-values and "significance" measure the probability of data given the hypothesis, not the probability of the hypothesis given the data. This is exactly backwards, as (to quote DeGroot) "I already know the probability of the data: 1, because I just observed it!". It is easy to find actual cases where P(D|H)=0.99 and P(D|H)=0.99, but P(H|D)=0.01. In English, the hypothesis has a high P-value and is extremely "significant" but is almost certainly wrong. No wonder so many medical studies are overturned by later studies: they were highly significant, but not very probable. A comment was made below that Bayesian methods must be used carefully. I just repeat DeGroot's response (which I heard when he was confronted with the same criticism): it is better to do the right calculation carefully than to do the wrong one easily.

deborah mayo, 2014-02-15 03:42 AM
p-values are NOT likelihoods, however, they permit computations that Bayesian likelihoods alone cannot. They allow evaluating the probability that the testing procedure would have resulted in a less impressive departure (from the null) under the assumption the null is true, and also under the assumption of varying discrepancies from the null. It's a small part of the panoply of methods that use error probabilities. Guess what? Bayesians are the ones who only use likelihoods conditional on the observed value! So no error probabilistic assessments are possible. Oh, but there's a prior you say? No error control there either, just what someone believes, and very few scientists want to mix their prior beliefs into the study. The point of the research is to test claims, not beg the question by imputing prior beliefs!

Huw Llewelyn, 2014-02-14 12:59 PM
The common sense question faced by those who interpret data in a public way (e.g. doctors, engineers and research scientists) is "What do I predict from this observation?" The answer can be that (1) the observation will probably not be replicated and is probably spurious, or (2) that it suggests a simple prediction or a prediction linked to possible narrative or mathematical models (e.g. a diagnosis, a working engineering model, a general scientific hypothesis, theory or law) that will in turn make many other useful predictions. The first hurdle to overcome is whether or not the observation will probably be replicated. This can be done by showing that all the possible reasons for non-replication are improbable. One of these causes of non-replication is that the number of observations is too low (if the observation is made up of a number of different observations). This is where statistical significance testing comes in (successfully repeating the entire set of observations independently makes the probability of further non-replication due to this reason very low of course). There are many other reasons for non-replication to be considered, e.g. poor documentation or vague writing, dishonesty, poor methodology, contradictory observations by someone else, etc. In order for the probability of replication to be high, the probabilities of all these causes of non-replication also have to be low. The reasoning process inevitably has to be subjective but there is a formal basis for it in probability theory (that incorporates Bayes rule) to guide us (see also Llewelyn H. Reasoning in medicine and science. OUPblog, September 2013).

Mark Brewer, 2014-02-13 04:51 PM
I'm glad this article has provoked discussion. What I find surprising is the fact that the "Probable Cause" infographic presents a beautiful argument for a Bayesian approach, without actually saying so, or even realising it is doing so.

HT, 2014-02-13 08:30 PM
I don't think that replacing frequentist with the Bayesian approach is the answer, nor is that the message of the article. Bayesian statistics can demonstrate the shortcomings of frequentist statistics very well in some situations (like in the infographic), but also requires great care to handle. It would be naive to think that researchers who abuse p-values would not do the same to priors and model specification.

deborah mayo, 2014-02-15 03:48 AM
They'd necessarily do worse and fraud-busting would be dead. Why? All criticisms turn on being able to evaluate error probabilities (even if only informally), e.g., showing the study like Motyl's has done practically nothing to prevent the worst kind of abuse and fraudulent use of statistics. I agree it's a nice picture, but the article is misleading in dozens of ways. Simonsohn is interviewed but the author doesn't bother to mention that he points out how Bayesian statistics only introduces more flexibility into the analysis. It's quite a biased article, which really defeats the purpose.

Paul Hayes, 2014-02-15 07:11 AM
Simonsohn was wrong on that point: http://doingbayesiandataanalysis.blogspot.co.uk/2011/10/falseconclusionsinfalsepositive.html

Bob O'Hara, 2014-02-14 10:10 AM
Indeed. In fact, we could just replace the abuse of p-values with the abuse of Bayes factors and Bayesian p-values.

John Vidale, 2014-02-13 04:50 PM
Very good article, but it misses the mark in two ways, IMO. This is really a primer for the public to take science headlines with a grain of salt. First, the underlying reason for misuse of statistics is the natural optimism of scientists: we think we will find what no one has previously found, and that our experiment was the one sensible way to explore the problem. That is, we overestimate the a priori likelihood that our solution was right, and we underestimate the amount of fiddling we (and others) have done leading up to our latest result. Second, scientists vary greatly in their familiarity with statistics and basic common sense; they always have and always will. Requiring tedious publication of all data, studies through sequential publications, application of multiple statistical tests in each study may ameliorate some problems, but will impede many others. As has been true forever, scientists looking at data need to understand statistical tools to use them right, as asserted. I also doubt it is a new phenomenon that scientists recognize the fallibility of the latest, hottest study. I recall several of their editors telling me that many (most) Science and Nature papers are incorrect.

Allen Bryant, 2014-02-13 04:04 PM
Having taught Statistics for a number of years, the issue of what the value of the P-value is, isn't really that important. What is important is a properly structured hypothesis test. The P-value is a measure of what degree of confidence we wish to know something might be true should the hypothesis test prove that we can reject our null hypothesis in favor of the alternative hypothesis. If results are not reproducible, perhaps your hypothesis can't be rejected and you need to completely reconsider your hypothesis.

Ben Wise, 2014-02-14 03:23 PM
Morris DeGroot had a comment on this line of reasoning back in the 1980's. The only way to judge when it is "properly structured" is by comparing it to Bayesian reasoning, that is, to make sure a high P(D|H) occurs only when P(H|D) is high. But if you have to do the exact Bayesian analysis in order to make sure the p-value heuristic (as he called it) is doing the right thing, why not just keep the Bayesian analysis and skip the heuristic? This is the approach I taught my graduate students. I'd recommend DeGroot's work as a very balanced approach that combines both practical common sense (you need p-values to get published, and they are easier to calculate) and theoretical rigor (when can they be relied upon).


Thomas Dent, 2014-02-13 01:43 PM
The author should take her own advice and show some valid statistics to back up sweeping, off-the-cuff claims about what 'most scientists' or 'most researchers' might or might not do. "Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm." Define 'most scientists'. What evidence is there for this claim? What is your sample size: how many scientists have been objectively tested on what they would say in these circumstances? What is the effect size: how many did in fact say the thing you claimed? Is it a fair sample of all scientists, or are some disciplines or some levels of seniority or some nationalities over- or under-represented? How can we be sure the author doesn't cherry-pick conversations where someone appears not to understand p-values? This is not a joke; it's a very serious point. There are scientists who do understand p-values and put considerable effort into using them correctly; the particle physics community is one example. Articles like this one which blame 'the p-value' for everything people do wrong with statistics, as if the method itself rather than the uses to which it is put was somehow the root of all evil, amount to unfairly smearing results obtained by a correct and rigorous usage of p-values, i.e. with blind data analysis and honestly accounted-for trials factors (the 'look-elsewhere effect'). To say that muddled thinking and self-deception are *caused* by use of p-values is absurd; people who are prone to muddled thinking and self-deception will carry on being so regardless of the statistical framework. You might as well claim that abuse of significant figures is caused by the use of the decimal point.

Ben Wise, 2014-02-14 03:16 PM
I think the driving factor is not "most scientists" but the reviewers of major journals. It is essentially impossible to get a paper published without a high p-value, which drives everyone else to design their work to generate high p-values, whether they agree with them or not. "Publish or perish".

Abhay Sharma, 2014-02-13 11:01 AM
Over-selection and over-reporting of false positive results are increasingly plaguing the published research with an alarming rate (Nature 485, 149; 2012). In the current practice, such reporting is considered as honest errors not amounting to misconduct (Nature 485, 137; 2012). However, since intention is the core of misconduct, one may very well argue that reporting of results with systematic positive bias should also be placed under the ambit of misconduct. Scientific community and policy makers need to consider this tough option in the overall interest of science. [This is a part of the comments made earlier (http://www.nature.com/news/policynihplanstoenhancereproducibility1.14586)].

Mark Alexander, 2014-02-13 10:34 AM
In spite of the article's comment that p-values from research into phenomena like telepathy are likely to be "false alarms", in point of fact, some of the most significant p-values in any area of research come from precisely this direction. I think of the ganzfeld studies, which have produced mind-boggling probability values on the order of 10^-18. Goodman's formula doesn't do such a value any damage. The comment in the article, especially insofar as it groups such research with "homeopathy and aliens" as a label of derision, reflects a widespread but regrettable lack of knowledge about what has been achieved in this area.

Mark Brewer, 2014-02-13 05:09 PM
I'm afraid that p-values are always going to be flawed (quoting numbers of the order of 10^-18 just smacks of desperation) when the basic underlying "science" is flawed.

Mark Alexander, 2014-02-14 07:08 AM
Nevertheless, those p-values are objectively present. On the one hand, findings in this area are routinely dismissed because 'extraordinary claims require extraordinary evidence'. But then, when p-values such as these are presented, it only 'smacks of desperation'. What kind of "science" is that?


    PaulHayes 2014021502:12AMIt'sgoodscience.Interpretingaprobabilityofonly10^18thatyourtelepathyexperimentresultswerecausedby'chance'asevidencethattheywerecausedbytelepathy,givenwhatisalreadyknownfromrelevantpreviousresults,isbadscience.Verybadscience.http://blogs.discovermagazine.com/cosmicvariance/2008/02/18/telekinesisandquantumfieldtheory/#.Uv7L8XWbBiBhttp://wwwbiba.inrialpes.fr/Jaynes/cc05e.pdf

    MarkAlexander 2014021809:43AMActuallyacquaintyourselfwiththeliteraturebeforeyoupassjudgement.Oratleastagoodchunkofit.Inspiteofmakingreferencetothe'qualityofpreviousresults',it'sclearthatyouhaven'tseenthemandthat'spreciselythepointI'mmaking.Yourfirstlinkrulesoutthesephenomenaasamatterofprinciple.Yoursecondlinkdoesn'timpactontheganzfeldresultsI'mcitinginanywaywhatsoever.It'ssimplyasecondhanddiscussionofwhyallsuchresultsmustsurelybemistaken.Inotherwords,allyou'vedoneiscitetwoclaimswhysuchresultsshouldbedismissedoutofhand.Andno,that'snotgoodscience.

    ChandrikaBRao 2014021309:15AMMisinterpretationofpvalueseldomcomesfrombiostatisticians.Forthebiologist,afterhavingdonemanymonthsofworkgeneratingthedata,thefinalstatisticalanalysisseemstobeaminormatter,notdeservingmuchattention.Manybiologistsprefertodotheirownstatisticalanalysisratherthaninvolvinganotherpersonforthis'minor'work.Thecautiousorconservativeinterpretationofdataprovidedbythestatisticiansometimesdoesn'tgodownwellwithbiologistswantingdefinitiveconclusionsfromtheirhardwork.Manynonstatisticaljournalspublishpvalueswithoutanyassociatedinformationlike:whatwasthesamplesize,whatstatisticaltestwasdone,whathypothesiswasbeingtested,whatwasthepowerofthetestetc.,resultinginsloppystatisticalanalysisandsloppierreporting,makingtruethesaying,"therearelies,damnliesandstatistics".Nonstatisticianrefereesseldomaskadequatequestionsaboutthestatisticalmethodsusedandanalysisdone.Whenanarticlewithsubstandardstatisticalworkgetsacceptedforpublicationinagoodbiologyjournal,thebiologistnolongerfeelstheneedtotalkwithastatistician.Journalswithwordlimitsalsounconsciouslyencouragecuttingcornersonstatisticalanalysisreporting.Iamverysurprisedbytheheatof"Wemustcollaboratewithstatisticians,notletthemdecidewhatisgoodforpatients."comingfromGiovanniCodacciPisanelli.Arestatisticiansirrational,unforgivingogres,notcaringforthegoodofthepatients??

    Peter Gerard Beninger 2014-02-13 09:13 AM
    I'm happy to see that this issue is beginning to emerge on the radar of more scientists, notably the readers of Nature and Science. However, the ones who persist in the worst, and most common, misuse of frequentist statistics rarely, if ever, read these journals, and seem equally oblivious to the vast number of publications, in all fields, which make the same points. Their papers constitute the majority of many mainstream specialty journals. I have taken this up directly with senior editors, who simply reply that they do their jobs by relying on reviewers for quality control. My suggestion that each reputable journal should have a full-time statistician on board to review the procedures used in all 'provisionally accepted' papers, as well as for all statistically contested papers (as is the practice in the best medical journals), has so far fallen on deaf ears for all of the journals in my own research field. The situation will only improve if we push the publishers hard enough.

    Jane Public 2014-02-13 08:06 AM
    Brian: I think I can answer at least part of this for you. I had a discussion about this with someone just the other day. Although I don't think he got the point. Anyway, let's use a purely hypothetical example to illustrate the point. Let's say someone decides to study the IQs of the students at universities, and test correlations between IQ and various other factors. IQ tests are administered to 10,000 students, and the results more or less follow the expected normal distribution, with a measurement error of +/-2 points. So now they start comparing with other measured factors. And they find something very surprising: there is a very strong inverse correlation (P = 0.001... they're VERY sure of this), between nipple size and IQ! (Hey... I've seen much sillier things in studies before.) So... they go on the evening news with their startling discovery. But what does this mean? Well, if you were to look at the effect size, it turns out that people with areolae that measured 1.5 cm across had an IQ that was 0.02 points higher than students whose areolae were 6 cm across. So the effect, 0.02 IQ points, is very, very small. Even though there is strong statistical evidence of a correlation, the actual effect is so small as not to really matter. Even worse, the effect is smaller than the measurement error for their IQ tests. (Pretty much invalidating their work, if anybody bothered to check.) So this "statistical significance", while very strong, has about zero "significance" in the real world. Although this is a somewhat exaggerated example, this kind of thing is not that unusual. As I say, I was trying to explain this to someone the other day, about exactly this kind of announcement: a reported strong correlation, but the effect size was tiny, and essentially buried in the small print.

    Brian crawford 2014-02-12 10:21 PM
    I liked the article but have a quick question. When the author says "To pounce on tiny P values and ignore the larger question is to fall prey to the seductive certainty of significance, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: We should be asking, 'How much of an effect is there?', not 'Is there an effect?'" How do you decide what level of effect is appropriate to report? Is it just a subjective decision? For example, would an enrichment of a particular set of genes of 54% in one sample compared with 46% in another be enough of an effect? Even if they are very significant?

    Bob OHara 2014-02-13 09:08 AM
    You're asking the right question, but I (as a statistician) can't answer it for you: it's biological judgement. And this is a good thing. After all, you are doing science, not statistics, so your judgement of what is 'significant' should be based on science. I think if we all used effect sizes and confidence intervals, a hidden benefit would be that it would make us think more about the actual scientific relevance of our results.
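The suggestion to report effect sizes with confidence intervals can be made concrete for the 54% versus 46% question raised in this thread. What follows is only a minimal sketch: it assumes the two percentages are independent sample proportions and uses a simple Wald (normal approximation) interval. The function name and the sample sizes of 100 and 1,000 per group are hypothetical, since the comment gives only the percentages.

```python
import math

def diff_in_proportions_ci(p1, n1, p2, n2, z=1.96):
    """Return (difference, lower, upper) for p1 - p2 using a Wald interval.

    z=1.96 gives an approximate 95% interval under the normal approximation.
    """
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Hypothetical sample sizes: the original comment gives only the percentages.
for n in (100, 1000):
    diff, lo, hi = diff_in_proportions_ci(0.54, n, 0.46, n)
    print(f"n={n} per group: difference={diff:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Under these assumptions the interval includes zero for the smaller sample and excludes it for the larger one, while the estimated effect stays at 8 percentage points in both cases. Whether a difference of that size matters is still the scientific judgement the comment describes.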

    Ben Wise 2014-02-14 03:47 PM
    There are (at least) two different senses of the word "significant" being mixed together. One is "having a high P(Data|Hypothesis)" and the other is "reasonable to act upon". The IQ correlation example above is one that has high P(D|H) but has no practical implications for anything anyone would decide to do or not. For example, it does not help one decide whether to release a drug onto the market or not. The second sense of "significance" leads one directly into the realm of decision theory and the actual cost of (e.g.) Type I and Type II errors. But decisions invariably involve the weighing of costs and benefits, which fall differently on different groups, and so involve a lot of debates that classical statisticians try to avoid. Again, a nice compromise is to report actual probability values, like P(H|D). They make a nice "decoupled interface", in that they can be taken either as a summary statement of how strongly the hypothesis is indicated, or as the start of a decision-theoretic analysis.
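The distinction between P(D|H) and P(H|D) drawn in this comment follows from Bayes' rule: the same data can leave a hypothesis likely or unlikely depending on how plausible it was beforehand. The sketch below is illustrative only. The function name, the priors, and the two likelihoods (0.8 and 0.05) are hypothetical values chosen for the example, not figures from the article or the comment.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """P(H | D) from Bayes' rule for a hypothesis H and its complement."""
    evidence = prior * p_data_given_h + (1 - prior) * p_data_given_not_h
    return prior * p_data_given_h / evidence

# Hypothetical likelihoods: the data are 16 times more probable under H
# than under not-H. Only the prior changes between runs.
for prior in (0.5, 0.1, 0.01):
    print(f"prior={prior}: P(H|D)={posterior(prior, 0.8, 0.05):.3f}")
```

With identical likelihoods, the posterior falls sharply as the prior shrinks, which is why a statement about P(D|H) alone cannot stand in for P(H|D).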

    Luar Moreno Alvarez 2014-02-12 09:47 PM
    Although this is a very good article, it is limited in approach and references to social and biological subjects. Perhaps, in order to achieve a more general and technical view of this important issue, a deeper review of works from Statistics journals would be desirable. The paper 'P-Value Precision and Reproducibility' of Boos & Stefanski in 'The American Statistician' (2011), for example, could be useful to the enrichment of this discussion.

    Giovanni Codacci-Pisanelli 2014-02-12 08:16 PM
    This article is a long-awaited reminder of what statistics can do, and of what they cannot do! The Peanuts strips about statisticians are funnier... but not as extensive. In clinical oncology p-values often are the aim of the clinical trial. Still, just using a computer spreadsheet programme it is easy to prove that if you enter enough values your p will become significant even when the difference between two means is minimal. Unfortunately "statistically significant" is considered a synonym of "true", but very often it rather seems to mean "clinically irrelevant" (a trial with a 0.7 weeks difference in progression-free survival of patients with advanced cancer had a significant p). But what is even more disturbing is the number of retracted papers (for example on gene signatures) based on the "statistical evaluation" of results... that could not be reproduced. We must collaborate with statisticians, not let them decide what is good for patients.
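The spreadsheet observation in this comment, that a minimal difference between two means becomes significant once enough values are entered, can be reproduced in a few lines. This is a rough sketch under stated assumptions: a two-sample z-test with known, equal standard deviations and equal group sizes. The function name and the effect of 0.1 standard deviations are hypothetical choices for illustration.

```python
import math

def two_sided_p_for_fixed_effect(effect_in_sd, n_per_group):
    """Two-sided p-value of a two-sample z-test with known, equal SDs.

    effect_in_sd is the difference between the group means in SD units.
    """
    z = effect_in_sd / math.sqrt(2.0 / n_per_group)
    return math.erfc(abs(z) / math.sqrt(2.0))

effect = 0.1  # hypothetical: a tenth of a standard deviation
for n in (100, 1_000, 10_000):
    print(f"n={n} per group: p={two_sided_p_for_fixed_effect(effect, n):.3g}")
```

The effect is held fixed throughout; only the sample size changes, and the p-value drops from well above 0.05 to far below it. The p-value therefore says nothing on its own about whether the difference is large enough to matter.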


    Nature ISSN 0028-0836 EISSN 1476-4687

    2015 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved. partner of AGORA, HINARI, OARE, INASP, CrossRef and COUNTER

    deborah mayo 2014-02-19 04:10 PM
    A long-awaited repeat of a cookbook article that follows the recipe of so many "front page news" stat exposes in every purported science mag for years. and just as shallow... Poor Motyl, it's so hard to do good science...

    Bob OHara 2014-02-13 09:05 AM
    Don't blame the statisticians! We've been banging on about this for years, but p-values are just so entrenched in the way a lot of scientists think science should be done. The problem is one of inertia: p-values are accepted as standard, so scientists teach their students that this is how things should be done, so that's all they learn.

    Steve Schwartz 2014-02-12 07:36 PM
    A critical aspect of p-values, and hypothesis testing under a frequentist framework more generally, that is not addressed by this column, is that these techniques were developed and originally implemented in the setting where random error is the only (or at least dominant) reason why a particular study might not yield the correct answer. In the setting of non-experimental research, where the "exposure" or study condition has not been randomly assigned, whether the study yields the true relationship has far more to do with biases in measurement of key variables, in the selection of study subjects, and in accounting for confounding or similar relationships among variables. Neither p-values nor confidence intervals measure these features of a study. But because p-values and confidence intervals are easy to produce, and measures of many non-random biases are not easy to make, the statistical indices have become the coin of the scientific realm in research designs such as observational studies where they were not originally intended to be used.

    HT 2014-02-12 06:16 PM
    This article serves as a welcome reminder of the many fallacies that still bely modern scientific research. However, I feel it could use a bit more balance in the context of reproducibility. In the paragraph where Motyl's p-value of 0.01 is revisited, the probability of replicating this result at a 'significant' (p