scientific method_ statistical errors _ nature news & comment


http://www.nature.com/news/scientificmethodstatisticalerrors1.14700?WT.ec_id=NATURE20140213


NATURE | NEWS FEATURE

Scientific method: Statistical errors

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.

Regina Nuzzo

12 February 2014

For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white.

The results were "plain as day", recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. "The hypothesis was sexy," he says, "and the data provided clear support." The P value, a common index for the strength of evidence, was 0.01, usually interpreted as 'very significant'. Publication in a high-impact journal seemed within Motyl's grasp.

But then reality intervened. Sensitive to controversies over reproducibility, Motyl and his adviser, Brian Nosek, decided to replicate the study. With extra data, the P value came out as 0.59, not even close to the conventional level of significance, 0.05. The effect had disappeared, and with it, Motyl's dreams of youthful fame [1].

It turned out that the problem was not in the data or in Motyl's analyses. It lay in the surprisingly slippery nature of the P value, which is neither as reliable nor as objective as most scientists assume. "P values are not doing their job, because they can't," says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.

For many scientists, this is especially worrying in light of the reproducibility concerns. In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false [2]; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.

At the same time, statisticians are looking for better ways of thinking about data, to help scientists to avoid missing important information or acting on false alarms. "Change your statistical philosophy and all of a sudden different things become important," says Steven Goodman, a physician and statistician at Stanford. "Then 'laws' handed down from God are no longer handed down from God. They're actually handed down to us by ourselves, through the methodology we adopt."

Out of context

P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a "sterile intellectual rake" who ravishes science but leaves it with no progeny [3]. One researcher suggested rechristening the methodology "statistical hypothesis inference testing" [3], presumably for the acronym it would yield.

The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a 'null hypothesis' that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil's advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.
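The procedure described above can be made concrete with a small sketch. The article does not name any particular test, so the permutation test below is an assumption chosen for illustration: it takes "no difference between two groups" as the null hypothesis, shuffles the group labels many times, and counts how often the shuffled difference in means is at least as extreme as the observed one. The function name, parameters and sample data are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided P value for a difference in group means.

    Null hypothesis: the group labels are irrelevant, so any reshuffling
    of the pooled data is as likely as the split actually observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1

    # Counting the observed split itself keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical measurements for two groups.
    print(permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4]))
```

The number returned is only the share of label shufflings that look at least as extreme as the data. It is computed under the assumption that the null hypothesis is true, so it says nothing directly about how likely that hypothesis is.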

For all the P value's apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded in the late 1920s by Fisher's bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts now familiar from introductory statistics classes. They pointedly left out the P value.

But while the rivals feuded (Neyman called some of Fisher's work mathematically "worse than useless"; Fisher called Neyman's approach "childish" and "horrifying [for] intellectual freedom in the west"), other researchers lost patience and began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher's easy-to-calculate P value into Neyman and Pearson's reassuringly rigorous rule-based system. This is when a P value of 0.05 became enshrined as 'statistically significant', for example. "The P value was never meant to be used the way it's used today," says Goodman.

What does it all mean?

One result is an abundance of confusion about what the P value means [4]. Consider Motyl's study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour: possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis (telepathy, aliens, homeopathy), the greater the chance that an exciting finding is a false alarm, no matter what the P value is.

[Figure: 'Probable cause'. Credit: R. Nuzzo; source: T. Sellke et al. Am. Stat. 55, 62–71 (2001)]

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see 'Probable cause'). According to one widely used calculation [5], a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl's finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73%, or only 50% if he wanted another 'very significant' result [6,7]. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.
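The article does not spell out the conversion, so the sketch below is an assumption rather than the author's stated method. It uses the lower bound on the Bayes factor from Sellke et al. (the source credited for the 'Probable cause' figure), which is -e · p · ln(p) for p below 1/e, and combines it with prior odds through Bayes' rule. With even prior odds (a 'toss-up') it reproduces the 11% and 29% figures quoted above. The function names are hypothetical.

```python
import math


def min_bayes_factor(p):
    """Lower bound on the Bayes factor of null versus alternative.

    Uses the -e * p * ln(p) calibration, which applies for 0 < p < 1/e.
    """
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_false_alarm_probability(p, prior_prob_real_effect=0.5):
    """Smallest posterior probability that the null hypothesis is true.

    prior_prob_real_effect is the assumed chance, before seeing the data,
    that a real effect exists (0.5 is a toss-up, 0.05 is 1-to-19 odds).
    """
    prior_odds_null = (1 - prior_prob_real_effect) / prior_prob_real_effect
    posterior_odds_null = prior_odds_null * min_bayes_factor(p)
    return posterior_odds_null / (1 + posterior_odds_null)


if __name__ == "__main__":
    for p in (0.05, 0.01):
        for prior in (0.05, 0.5, 0.9):
            risk = min_false_alarm_probability(p, prior)
            print(f"P = {p}, prior chance of real effect = {prior}: "
                  f"false-alarm probability at least {risk:.0%}")
```

Under the same assumption, a long shot with 1-to-19 odds and P = 0.05 comes out at roughly an 89% chance of a false alarm. This shows how strongly the answer depends on the prior plausibility of the hypothesis rather than on the P value alone.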

Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed [8] that those who meet their spouses online are less likely to divorce (p




we're now seeing. We just don't yet have all the fixes.

Statisticians have pointed to a number of measures that might help. To avoid the trap of thinking about results as significant or not significant, for example, Cumming thinks that researchers should always report effect sizes and confidence intervals. These convey what a P value does not: the magnitude and relative importance of an effect.
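As a rough illustration of what reporting an effect size and a confidence interval involves, the sketch below computes a standardized effect size (Cohen's d) and an interval for the raw difference in means. The choice of Cohen's d, the normal approximation for the interval and the function name are assumptions made here for illustration; the article does not prescribe a formula, and small samples would normally call for a t-based interval.

```python
import math
from statistics import NormalDist, mean, stdev


def effect_size_and_ci(group_a, group_b, confidence=0.95):
    """Return (Cohen's d, (low, high)) for the difference in group means.

    The interval uses a normal approximation, which is an assumption that
    is reasonable only for fairly large samples.
    """
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2

    # Pooled standard deviation puts the difference on a common scale.
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    cohens_d = diff / pooled_sd

    # Standard error of the difference in means (unequal variances allowed).
    std_error = math.sqrt(var_a / n_a + var_b / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return cohens_d, (diff - z * std_error, diff + z * std_error)


if __name__ == "__main__":
    # Hypothetical measurements for two groups.
    d, (low, high) = effect_size_and_ci([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])
    print(f"Cohen's d = {d:.2f}, 95% CI for the difference = ({low:.2f}, {high:.2f})")
```

Unlike a bare P value, the pair of numbers shows both how large the difference is and how precisely it has been estimated.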

Many statisticians also advocate replacing the P value with methods that take advantage of Bayes' rule: an eighteenth-century theorem that describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity, something that the statistical pioneers were trying to avoid. But the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.
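A minimal sketch of how Bayes' rule updates a probability as evidence arrives is given below. The function name and every number in it are hypothetical: a 10% prior that the hypothesis is true, an 80% chance of a positive result if it is true, and a 5% chance of a positive result if it is false.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after one piece of evidence.

    Bayes' rule: P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)].
    """
    numerator = prior * likelihood_if_true
    evidence = numerator + (1 - prior) * likelihood_if_false
    return numerator / evidence


if __name__ == "__main__":
    belief = 0.10  # hypothetical prior plausibility of the hypothesis
    for study in (1, 2):
        # Each hypothetical study reports a positive result.
        belief = bayes_update(belief, likelihood_if_true=0.80,
                              likelihood_if_false=0.05)
        print(f"after positive study {study}: P(hypothesis) = {belief:.3f}")
```

The posterior from one study becomes the prior for the next, which is how the framework lets existing knowledge and new results be combined. The prior itself has to be chosen by the analyst, which is the source of the subjectivity the article mentions.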

Others argue for a more ecumenical approach, encouraging researchers to try multiple methods on the same data set. Stephen Senn, a statistician at the Centre for Public Health Research in Luxembourg City, likens this to using a floor-cleaning robot that cannot find its own way out of a corner: any data-analysis method will eventually hit a wall, and some common sense will be needed to get the process moving again. "If the various methods come up with different answers," he says, "that's a suggestion to be more creative and try to find out why," which should lead to a better understanding of the underlying reality.

Simonsohn argues that one of the strongest protections for scientists is to admit everything. He encourages authors to brand their papers 'P-certified, not P-hacked' by including the words: "We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study." This disclosure will, he hopes, discourage P-hacking, or at least alert readers to any shenanigans and allow them to judge accordingly.

A related idea that is garnering attention is two-stage analysis, or 'preregistered replication', says political scientist and statistician Andrew Gelman of Columbia University in New York City. In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings, and would publicly preregister their intentions in a database such as the Open Science Framework (https://osf.io). They would then conduct the replication studies and publish the results alongside those of the exploratory studies. This approach allows for freedom and flexibility in analyses, says Gelman, while providing enough rigour to reduce the number of false alarms being published.

More broadly, researchers need to realize the limits of conventional statistics, Goodman says. They should instead bring into their analysis elements of scientific judgement about the plausibility of a hypothesis and study limitations that are normally banished to the discussion section: results of identical or similar experiments, proposed mechanisms, clinical knowledge and so on. Statistician Richard Royall of Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, said that there are three questions a scientist might want to ask after a study: 'What is the evidence?' 'What should I believe?' and 'What should I do?' One method cannot answer all these questions, Goodman says: "The numbers are where the scientific discussion should start, not end."

Nature 506, 150–152 (13 February 2014) doi:10.1038/506150a

See Editorial page 131

References

1. Nosek, B. A., Spies, J. R. & Motyl, M. Perspect. Psychol. Sci. 7, 615–631 (2012).

2. Ioannidis, J. P. A. PLoS Med. 2, e124 (2005).

3. Lambdin, C. Theory Psychol. 22, 67–90 (2012).

4. Goodman, S. N. Ann. Internal Med. 130, 995–1004 (1999).

5. Goodman, S. N. Epidemiology 12, 295–297 (2001).

6. Goodman, S. N. Stat. Med. 11, 875–879 (1992).

7. Gorroochurn, P., Hodge, S. E., Heiman, G. A., Durner, M. & Greenberg, D. A. Genet. Med. 9, 325–331 (2007).

8. Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L. & VanderWeele, T. J. Proc. Natl Acad. Sci. USA 110, 10135–10140 (2013).

9. Simmons, J. P., Nelson, L. D. & Simonsohn, U. Psychol. Sci. 22, 1359–1366 (2011).

10. Simonsohn, U., Nelson, L. D. & Simmons, J. P. J. Exp. Psychol. http://dx.doi.org/10.1037/a0033242 (2013).

11. Campbell, J. P. J. Appl. Psych. 67, 691–700 (1982).


Author information

Affiliations

Regina Nuzzo is a freelance writer and an associate professor of statistics at Gallaudet University in Washington DC.

37 comments

Guest, 2014-03-06 07:31 PM
Can anyone help me understand the "probable cause" picture of the paper? I admit that I am lost. 1. What is the meaning of "odds of hypothesis"? A hypothesis can be right or wrong. What do the odds of it mean? If we know the odds, do we still need to know the p value? 2. How can I get the numbers in the picture: with 1 to 19 odds of hypothesis + p value = 0.05 -> odds become 11% vs. 89%? Thanks.

Charles Green, 2014-02-21 04:21 PM
In statistics P values, although called confidence values, are not measures of accuracy. They are a statement of the distribution of results that can be obtained when the test is replicated. If one uses them correctly they are a way to select future projects. A test with a very high P could be replicated with a smaller sample, thus saving on the cost of replication. If the replication results fail then another test is needed with a larger sample. Only after many tests can conclusions be made. Use of different tests does not improve the situation. I tested magazine vs TV spending with three tests, all with P = 98 to 99.4. All were wrong due to an error in the underlying data. I assumed continuous data when it was discrete data. More than three quarters of my education in statistics dealt with errors in design and consideration of the underlying data.

David Clarke, 2014-02-18 05:40 PM
I'm probably stupid but how am I supposed to assess the probability of my hypothesis into a "long shot" at 19 to 1, a "toss-up" at 1 to 1, or a "good bet" at 9 to 1?

Barry Cohen, 2014-02-15 10:37 PM
Regina: In your figure labeled Probable Cause, you cite Sellke et al. (2001) as the source. I read that article, and could not see how you derived your figure from that article. Specifically, Sellke et al. (2001) needed to posit a value for xi, which as they defined it corresponds to what psychologists usually call delta (the basis for determining the power of a t test), in order to find the odds for different p values (see their Figure 2). It is not clear what value for xi you are using, though from your results, I would guess it is about .75, which would lead to an average power for real effects of about .12 for a .05, two-tailed significance test. More generally, your figure relies on the chance of a "real effect" in each case, but are you defining a real effect as anything other than exactly zero? Doesn't it matter how large, on average, these real effects are? In other words: We know the probability of obtaining a p value of .05 or smaller when there is no real effect, but doesn't the probability of obtaining a p value of .05 or smaller when there IS a real effect depend on the size of the real effect (for a given sample size)? What assumption are you making there?

deborah mayo, 2014-02-15 03:53 AM
Regina: There is a citation from Neyman in this article but I don't see the reference. I'd be grateful if you provided it. I'm fairly sure it's entirely taken out of context. "worse than useless" is a technical term. Poor Motyl was on the brink of scientific glory by means of shoddy statistics! Glory, I tell you, glory. Maybe he should be given a medal for not rushing into print as imagined by those who view science as an unthinking screening effort. Of course, it can't be that he's fallen into the dumbest of dumb misuses of p values. It cannot be that he's exploiting fraudulent uses of statistics. No, this author blames the statistical tools for his highly questionable exploitation of p values. The truth is that the only shade of grey here is the fact that misuse of statistics differs only in degree from out-and-out fraud. Any inference is questionable if the researcher cannot show that flaws in his or her analysis would have been detected with high probability. Fields (like this one) that regularly spin out results without showing they have worked hard or have even tried to subject their own analysis to severe scrutiny are pseudoscientific. Pseudoscientists are frauds and should be treated as such. Science writers who exploit the fashion of dumping on p values only give them excuses.



David Lovell, 2014-02-15 02:52 AM
Thanks for raising awareness on this Regina. I would be very interested to see a follow-up article or comments about False Discovery Rate (FDR) procedures used in situations where multiple comparisons are made. I'm not a statistician; my intuition is that FDR further masks the shortcomings of Null Hypothesis Significance Testing. As far as I understand it, FDR amounts to setting a more stringent p value at which one regards data to be statistically significant. FDR procedures abound in bioinformatics and other areas of modern quantitative bioscience where measurements are plentiful. Is my skepticism of FDR warranted?

Ben Wise, 2014-02-14 03:11 PM
Morris DeGroot at CMU made this point way back in the 1980's. P values and "significance" measure the probability of data given the hypothesis, not the probability of the hypothesis given the data. This is exactly backwards, as (to quote DeGroot) "I already know the probability of the data: 1, because I just observed it!". It is easy to find actual cases where P(D|H)=0.99 and P(D|H)=0.99, but P(H|D)=0.01. In English, the hypothesis has a high P value and is extremely "significant" but is almost certainly wrong. No wonder so many medical studies are overturned by later studies: they were highly significant, but not very probable. A comment was made below that Bayesian methods must be used carefully. I just repeat DeGroot's response (which I heard when he was confronted with the same criticism): it is better to do the right calculation carefully than to do the wrong one easily.

deborah mayo, 2014-02-15 03:42 AM
p values are NOT likelihoods; however, they permit computations that Bayesian likelihoods alone cannot. They allow evaluating the probability that the testing procedure would have resulted in a less impressive departure (from the null) under the assumption the null is true, and also under the assumption of varying discrepancies from the null. It's a small part of the panoply of methods that use error probabilities. Guess what? Bayesians are the ones who only use likelihoods conditional on the observed value! So no error probabilistic assessments are possible. Oh, but there's a prior, you say? No error control there either, just what someone believes, and very few scientists want to mix their prior beliefs into the study. The point of the research is to test claims, not beg the question by imputing prior beliefs!

Huw Llewelyn, 2014-02-14 12:59 PM
The common-sense question faced by those who interpret data in a public way (e.g. doctors, engineers and research scientists) is: What do I predict from this observation? The answer can be that (1) the observation will probably not be replicated and is probably spurious, or (2) that it suggests a simple prediction or a prediction linked to possible narrative or mathematical models (e.g. a diagnosis, a working engineering model, a general scientific hypothesis, theory or law) that will in turn make many other useful predictions. The first hurdle to overcome is whether or not the observation will probably be replicated. This can be done by showing that all the possible reasons for non-replication are improbable. One of these causes of non-replication is that the number of observations is too low (if the observation is made up of a number of different observations). This is where statistical significance testing comes in (successfully repeating the entire set of observations independently makes the probability of further non-replication due to this reason very low, of course). There are many other reasons for non-replication to be considered, e.g. poor documentation or vague writing, dishonesty, poor methodology, contradictory observations by someone else, etc. In order for the probability of replication to be high, the probabilities of all these causes of non-replication also have to be low. The reasoning process inevitably has to be subjective but there is a formal basis for it in probability theory (that incorporates Bayes' rule) to guide us (see also Llewelyn H. Reasoning in medicine and science. OUPblog, September 2013).

Mark Brewer, 2014-02-13 04:51 PM
I'm glad this article has provoked discussion. What I find surprising is the fact that the "Probable Cause" infographic presents a beautiful argument for a Bayesian approach, without actually saying so, or even realising it is doing so.

HT, 2014-02-13 08:30 PM
I don't think that replacing frequentist with the Bayesian approach is the answer, nor is that the message of the article. Bayesian statistics can demonstrate the shortcomings of frequentist statistics very well in some situations (like in the infographic), but also requires great care to handle. It would be naive to think that researchers who abuse p values would not do the same to priors and model specification.

deborah mayo, 2014-02-15 03:48 AM
They'd necessarily do worse and fraud-busting would be dead. Why? All criticisms turn on being able to evaluate error probabilities (even if only informally), e.g., showing the study like Motyl's has done practically nothing to prevent the worst kind of abuse and fraudulent use of statistics. I agree it's a nice picture, but the article is misleading in dozens of ways. Simonsohn is interviewed but the author doesn't bother to mention that he points out how Bayesian statistics only introduces more flexibility into the analysis. It's quite a biased article, which really defeats the purpose.

Paul Hayes, 2014-02-15 07:11 AM
Simonsohn was wrong on that point: http://doingbayesiandataanalysis.blogspot.co.uk/2011/10/falseconclusionsinfalsepositive.html

Bob O'Hara, 2014-02-14 10:10 AM
Indeed. In fact, we could just replace the abuse of p values with the abuse of Bayes factors and Bayesian p values.

John Vidale, 2014-02-13 04:50 PM
Very good article, but it misses the mark in two ways, IMO. This is really a primer for the public to take science headlines with a grain of salt. First, the underlying reason for misuse of statistics is the natural optimism of scientists: we think we will find what no one has previously found, and that our experiment was the one sensible way to explore the problem. That is, we overestimate the a priori likelihood that our solution was right, and we underestimate the amount of fiddling we (and others) have done leading up to our latest result. Second, scientists vary greatly in their familiarity with statistics and basic common sense; they always have and always will. Requiring tedious publication of all data, studies through sequential publications, application of multiple statistical tests in each study may ameliorate some problems, but will impede many others. As has been true forever, scientists looking at data need to understand statistical tools to use them right, as asserted. I also doubt it is a new phenomenon that scientists recognize the fallibility of the latest, hottest study. I recall several of their editors telling me that many (most) Science and Nature papers are incorrect.

Allen Bryant, 2014-02-13 04:04 PM
Having taught Statistics for a number of years, the issue of what the value of the P value is, isn't really that important. What is important is a properly structured hypothesis test. The P value is a measure of what degree of confidence we wish to know something might be true should the hypothesis test prove that we can reject our null hypothesis in favor of the alternative hypothesis. If results are not reproducible, perhaps your hypothesis can't be rejected and you need to completely reconsider your hypothesis.

Ben Wise, 2014-02-14 03:23 PM
Morris DeGroot had a comment on this line of reasoning back in the 1980's. The only way to judge when it is "properly structured" is by comparing it to Bayesian reasoning, that is, to make sure a high P(D|H) occurs only when P(H|D) is high. But if you have to do the exact Bayesian analysis in order to make sure the p-value heuristic (as he called it) is doing the right thing, why not just keep the Bayesian analysis and skip the heuristic? This is the approach I taught my graduate students. I'd recommend DeGroot's work as a very balanced approach that combines both practical common sense (you need p values to get published, and they are easier to calculate) and theoretical rigor (when can they be relied upon).



Thomas Dent, 2014-02-13 01:43 PM
The author should take her own advice and show some valid statistics to back up sweeping, off-the-cuff claims about what 'most scientists' or 'most researchers' might or might not do. "Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm." Define 'most scientists'. What evidence is there for this claim? What is your sample size: how many scientists have been objectively tested on what they would say in these circumstances? What is the effect size: how many did in fact say the thing you claimed? Is it a fair sample of all scientists, or are some disciplines or some levels of seniority or some nationalities over- or under-represented? How can we be sure the author doesn't cherry-pick conversations where someone appears not to understand p values? This is not a joke; it's a very serious point. There are scientists who do understand p values and put considerable effort into using them correctly; the particle physics community is one example. Articles like this one which blame 'the p value' for everything people do wrong with statistics, as if the method itself rather than the uses to which it is put was somehow the root of all evil, amount to unfairly smearing results obtained by a correct and rigorous usage of p values, i.e. with blind data analysis and honestly accounted-for trials factors (the 'look-elsewhere effect'). To say that muddled thinking and self-deception are *caused* by use of p values is absurd; people who are prone to muddled thinking and self-deception will carry on being so regardless of the statistical framework. You might as well claim that abuse of significant figures is caused by the use of the decimal point.

Ben Wise, 2014-02-14 03:16 PM
I think the driving factor is not "most scientists" but the reviewers of major journals. It is essentially impossible to get a paper published without a high p value, which drives everyone else to design their work to generate high p values, whether they agree with them or not. "Publish or perish".

Abhay Sharma, 2014-02-13 11:01 AM
Overselection and overreporting of false positive results are increasingly plaguing the published research with an alarming rate (Nature 485, 149 (2012)). In the current practice, such reporting is considered as honest errors not amounting to misconduct (Nature 485, 137 (2012)). However, since intention is the core of misconduct, one may very well argue that reporting of results with systematic positive bias should also be placed under the ambit of misconduct. Scientific community and policy makers need to consider this tough option in the overall interest of science. [This is a part of the comments made earlier (http://www.nature.com/news/policynihplanstoenhancereproducibility1.14586)].

Mark Alexander, 2014-02-13 10:34 AM
In spite of the article's comment that p values from research into phenomena like telepathy are likely to be "false alarms", in point of fact, some of the most significant p values in any area of research come from precisely this direction. I think of the ganzfeld studies, which have produced mind-boggling probability values on the order of 10^-18. Goodman's formula doesn't do such a value any damage. The comment in the article, especially insofar as it groups such research with "homeopathy and aliens" as a label of derision, reflects a widespread but regrettable lack of knowledge about what has been achieved in this area.

Mark Brewer, 2014-02-13 05:09 PM
I'm afraid that p values are always going to be flawed (quoting numbers of the order of 10^-18 just smacks of desperation) when the basic underlying "science" is flawed.

Mark Alexander, 2014-02-14 07:08 AM
Nevertheless, those p values are objectively present. On the one hand, findings in this area are routinely dismissed because 'extraordinary claims require extraordinary evidence'. But then, when p values such as these are presented, it only 'smacks of desperation'. What kind of "science" is that?



Paul Hayes, 2014-02-15 02:12 AM
It's good science. Interpreting a probability of only 10^-18 that your telepathy experiment results were caused by 'chance' as evidence that they were caused by telepathy, given what is already known from relevant previous results, is bad science. Very bad science. http://blogs.discovermagazine.com/cosmicvariance/2008/02/18/telekinesisandquantumfieldtheory/#.Uv7L8XWbBiB http://wwwbiba.inrialpes.fr/Jaynes/cc05e.pdf

Mark Alexander, 2014-02-18 09:43 AM
Actually acquaint yourself with the literature before you pass judgement. Or at least a good chunk of it. In spite of making reference to the 'quality of previous results', it's clear that you haven't seen them, and that's precisely the point I'm making. Your first link rules out these phenomena as a matter of principle. Your second link doesn't impact on the ganzfeld results I'm citing in any way whatsoever. It's simply a second-hand discussion of why all such results must surely be mistaken. In other words, all you've done is cite two claims why such results should be dismissed out of hand. And no, that's not good science.

Chandrika B Rao, 2014-02-13 09:15 AM
Misinterpretation of p values seldom comes from biostatisticians. For the biologist, after having done many months of work generating the data, the final statistical analysis seems to be a minor matter, not deserving much attention. Many biologists prefer to do their own statistical analysis rather than involving another person for this 'minor' work. The cautious or conservative interpretation of data provided by the statistician sometimes doesn't go down well with biologists wanting definitive conclusions from their hard work. Many non-statistical journals publish p values without any associated information like: what was the sample size, what statistical test was done, what hypothesis was being tested, what was the power of the test etc., resulting in sloppy statistical analysis and sloppier reporting, making true the saying, "there are lies, damn lies and statistics". Non-statistician referees seldom ask adequate questions about the statistical methods used and analysis done. When an article with substandard statistical work gets accepted for publication in a good biology journal, the biologist no longer feels the need to talk with a statistician. Journals with word limits also unconsciously encourage cutting corners on statistical analysis reporting. I am very surprised by the heat of "We must collaborate with statisticians, not let them decide what is good for patients." coming from Giovanni Codacci-Pisanelli. Are statisticians irrational, unforgiving ogres, not caring for the good of the patients??

Peter Gerard Beninger, 2014-02-13 09:13 AM
I'm happy to see that this issue is beginning to emerge on the radar of more scientists, notably the readers of Nature and Science. However, the ones who persist in the worst, and most common, misuse of frequentist statistics rarely, if ever, read these journals, and seem equally oblivious to the vast number of publications, in all fields, which make the same points. Their papers constitute the majority of many mainstream specialty journals. I have taken this up directly with senior editors, who simply reply that they do their jobs by relying on reviewers for quality control. My suggestion that each reputable journal should have a full-time statistician on board to review the procedures used in all 'provisionally accepted' papers, as well as for all statistically contested papers (as is the practice in the best medical journals), has so far fallen on deaf ears for all of the journals in my own research field. The situation will only improve if we push the publishers hard enough.

Jane Public, 2014-02-13 08:06 AM
Brian: I think I can answer at least part of this for you. I had a discussion about this with someone just the other day. Although I don't think he got the point. Anyway, let's use a purely hypothetical example to illustrate the point. Let's say someone decides to study the IQs of the students at universities, and test correlations between IQ and various other factors. IQ tests are administered to 10,000 students, and the results more or less follow the expected normal distribution, with a measurement error of +/-2 points. So now they start comparing with other measured factors. And they find something very surprising: there is a very strong inverse correlation (P=0.001... they're VERY sure of this), between nipple size and IQ! (Hey... I've seen much sillier things in studies before.) So... they go on the evening news with their startling discovery. But what does this mean? Well, if you were to look at the effect size, it turns out that people with areolae that measured 1.5 cm across had an IQ that was 0.02 points higher than students whose areolae were 6 cm across. So the effect (0.02 IQ points) is very, very small. Even though there is strong statistical evidence of a correlation, the actual effect is so small as not to really matter. Even worse: the effect is smaller than the measurement error for their IQ tests. (Pretty much invalidating their work, if anybody bothered to check.) So this "statistical significance", while very strong, has about zero "significance" in the real world. Although this is a somewhat exaggerated example, this kind of thing is not that unusual. As I say, I was trying to explain this to someone the other day, about exactly this kind of announcement: a reported strong correlation, but the effect size was tiny, and essentially buried in the small print.

Brian crawford, 2014-02-12 10:21 PM
I liked the article but have a quick question. When the author says "To pounce on tiny P values and ignore the larger question is to fall prey to the seductive certainty of significance, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: We should be asking, 'How much of an effect is there?', not 'Is there an effect?'" How do you decide what level of effect is appropriate to report? Is it just a subjective decision? For example, would an enrichment of a particular set of genes of 54% in one sample compared with 46% in another be enough of an effect? Even if they are very significant?

Bob O'Hara, 2014-02-13 09:08 AM
You're asking the right question, but I (as a statistician) can't answer it for you: it's biological judgement. And this is a good thing. After all, you are doing science, not statistics, so your judgement of what is 'significant' should be based on science. I think if we all used effect sizes and confidence intervals, a hidden benefit would be that it would make us think more about the actual scientific relevance of our results.

Ben Wise, 2014-02-14 03:47 PM
There are (at least) two different senses of the word "significant" being mixed together. One is "having a high P(Data|Hypothesis)" and the other is "reasonable to act upon". The IQ correlation example above is one that has high P(D|H) but has no practical implications for anything anyone would decide to do or not. For example, it does not help one decide whether to release a drug onto the market or not. The second sense of "significance" leads one directly into the realm of decision theory and the actual cost of (e.g.) Type I and Type II errors. But decisions invariably involve the weighing of costs and benefits, which fall differently on different groups, and so involve a lot of debates that classical statisticians try to avoid. Again, a nice compromise is to report actual probability values, like P(H|D). They make a nice "decoupled interface", in that they can be taken either as a summary statement of how strongly the hypothesis is indicated, or as the start of a decision-theoretic analysis.

Luar Moreno Alvarez, 2014-02-12 09:47 PM
Although this is a very good article, it is limited in approach and references to social and biological subjects. Perhaps, in order to achieve a more general and technical view of this important issue, a deeper review of works from Statistics journals would be desirable. The paper 'P-Value Precision and Reproducibility' of Boos & Stefanski in 'The American Statistician' (2011), for example, could be useful to the enrichment of this discussion.

Giovanni Codacci-Pisanelli, 2014-02-12 08:16 PM
This article is a long-awaited reminder of what statistics can do, and of what they cannot do! The Peanuts strips about statisticians are funnier... but not as extensive. In clinical oncology p values often are the aim of the clinical trial. Still, just using a computer spreadsheet programme it is easy to prove that if you enter enough values your p will become significant even when the difference between two means is minimal. Unfortunately "statistically significant" is considered a synonym of "true", but very often it rather seems to mean "clinically irrelevant" (a trial with a 0.7 weeks difference in progression-free survival of patients with advanced cancer had a significant p). But what is even more disturbing is the number of retracted papers (for example on gene signatures) based on the "statistical evaluation" of results... that could not be reproduced. We must collaborate with statisticians, not let them decide what is good for patients.




deborah mayo, 2014-02-19 04:10 PM
A long-awaited repeat of a cookbook article that follows the recipe of so many "front page news" stat exposes in every purported science mag for years, and just as shallow... Poor Motyl, it's so hard to do good science...

Bob O'Hara, 2014-02-13 09:05 AM
Don't blame the statisticians! We've been banging on about this for years, but p values are just so entrenched in the way a lot of scientists think science should be done. The problem is one of inertia: p values are accepted as standard, so scientists teach their students that this is how things should be done, so that's all they learn.

Steve Schwartz, 2014-02-12 07:36 PM
A critical aspect of p values, and hypothesis testing under a frequentist framework more generally, that is not addressed by this column, is that these techniques were developed and originally implemented in the setting where random error is the only (or at least dominant) reason why a particular study might not yield the correct answer. In the setting of non-experimental research, where the "exposure" or study condition has not been randomly assigned, whether the study yields the true relationship has far more to do with biases in measurement of key variables, in the selection of study subjects, and in accounting for confounding or similar relationships among variables. Neither p values nor confidence intervals measure these features of a study. But because p values and confidence intervals are easy to produce, and measures of many non-random biases are not easy to make, the statistical indices have become the coin of the scientific realm in research designs such as observational studies where they were not originally intended to be used.

HT, 2014-02-12 06:16 PM
This article serves as a welcome reminder of the many fallacies that still belie modern scientific research. However, I feel it could use a bit more balance in the context of reproducibility. In the paragraph where Motyl's p value of 0.01 is revisited, the probability of replicating this result at a 'significant' (p
