bad science (2015)
TRANSCRIPT
“Torture numbers and they will tell you anything”*
Peter Kamerman, Brain Function Research Group, University of the Witwatersrand, South Africa
* Gregg Easterbrook
Bad science Science under threat
Bad science Paper retractions are on the rise
[Figure: number of retracted articles by year of retraction, plotted separately for retracted biomedical research and retractions in other scientific fields]
Grieneisen & Zhang, 2012
Bad science Almost half of retractions are for scientific misconduct
Van Noorden, 2011; Wagner & Williams, 2008
Bad science Biomedical publications are more likely to be retracted
[Figure: percent of retractions (%) plotted against percent of all articles (%) by discipline; medicine accounts for a disproportionate share of retractions]
Grieneisen & Zhang, 2012
Bad science Fortunately, retractions are rare
[Figure: percentage of records retracted per year, shown for biomedical research and for other scientific fields]
Grieneisen & Zhang, 2012
“80% of non-randomized studies turn out to be wrong, as do 25% of supposedly gold-standard randomized trials, and as much as 10% of the platinum-standard large randomized trials”
John Ioannidis (Health Research and Policy, Stanford School of Medicine)
Bad science Where is it going wrong?
Two broad categories:
• Publication bias
• Poor study design, execution and analysis
Publication bias Vanishing studies
[Figure: proportion of trials published; negative trials (median: 0.4) vs positive trials (median: 0.7)]
Hopewell et al., 2009
Publication bias Inflated estimates of effect size
[Funnel plot: precision vs effect size; a trim-and-fill analysis suggests published effect sizes are inflated by roughly 10%]
Finnerup et al., 2015
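To make the funnel-plot idea concrete, here is a minimal sketch of an Egger-style regression test for funnel-plot asymmetry, a common companion to trim-and-fill when screening for publication bias. The effect sizes and standard errors below are made-up illustration values, not data from Finnerup et al., 2015.

```python
# Egger-style funnel-plot asymmetry test (illustrative sketch only).
import numpy as np
from scipy import stats

# Hypothetical per-trial effect sizes and their standard errors.
effects = np.array([-0.9, -0.7, -0.8, -0.5, -1.2, -0.4, -1.0, -0.6])
ses = np.array([0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45])

# Egger's test: regress the standardized effect (effect/SE) on precision (1/SE).
# An intercept far from zero suggests small-study (publication) bias.
res = stats.linregress(1 / ses, effects / ses)
t_int = res.intercept / res.intercept_stderr
p_int = 2 * stats.t.sf(abs(t_int), df=len(effects) - 2)
print(f"intercept = {res.intercept:.2f}, p = {p_int:.3f}")
```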
Publication bias Drugs susceptible to bias
* Number of participants in a negative trial needed to increase the NNT to 11
Finnerup et al., 2015
Poor study design, execution and analysis The experimental method
[Pipeline diagram: Experimental design → Data collection (raw data) → Data cleaning (tidy data) → Basic data analysis (summary statistics) → Hypothesis testing (P value)]
Leek & Peng, 2015
Poor study design, execution and analysis The experimental method
[Pipeline diagram as before, annotated: every step from experimental design through basic data analysis receives little scrutiny; the P value receives lots of scrutiny]
Leek & Peng, 2015
The P value: Statistical Hypothesis Inference Testing
The P value has been likened to:
• A mosquito (annoying and impossible to swat away);
• The emperor's new clothes (fraught with obvious problems that everyone ignores);
• A “sterile intellectual rake” (ravishes science, but leaves it with no progeny)
Nuzzo, 2014; Lambdin, 2012
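As a concrete reminder of what a P value does and does not tell you, here is a minimal simulation sketch (illustrative only, not from the talk): when the null hypothesis is true, P values are uniformly distributed, so "P < 0.05" still occurs in about 5% of experiments.

```python
# Under a true null hypothesis, P values are uniform on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pvals = np.array([
    # Two groups drawn from the SAME distribution: any "effect" is noise.
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue
    for _ in range(10_000)
])
print(f"false-positive rate at alpha = 0.05: {np.mean(pvals < 0.05):.3f}")  # ~0.05
```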
Poor study design, execution and analysis
“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital”
Aaron Levenstein (Baruch College, CUNY)
Poor study design, execution and analysis The experimental method
[Pipeline diagram as before, annotated to locate poor decisions in data analysis within the pipeline]
Leek & Peng, 2015
Poor study design, execution and analysis
“The vast majority of data analysis is not performed by people properly trained to perform data analysis…[there is] a fundamental shortage of data analytic skill”
Jeff Leek (Johns Hopkins Bloomberg School of Public Health)
Poor analysis Common errors in data analysis
• Reactive rather than prospective analysis plan;
• Not understanding the basic principles underlying the choice of statistical test;
• Not viewing the data;
• Not assessing, or hiding, variance and error estimates;
• Not understanding what a P value means;
• Not correcting for multiple comparisons (see the sketch after this list);
• Over-fitting models
Nuzzo, 2014; Lambdin, 2012
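A minimal simulation sketch of why uncorrected multiple comparisons mislead (hypothetical numbers, not from the talk): testing 20 null endpoints at alpha = 0.05 yields a roughly 64% chance of at least one "significant" result per study; a Bonferroni-adjusted threshold restores the intended 5%.

```python
# Family-wise error with and without Bonferroni correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
m, runs = 20, 2_000          # m null endpoints per simulated "study"
hits_raw = hits_bonf = 0
for _ in range(runs):
    p = np.array([
        stats.ttest_ind(rng.normal(0, 1, 25), rng.normal(0, 1, 25)).pvalue
        for _ in range(m)    # every endpoint is a true null
    ])
    hits_raw += (p < 0.05).any()
    hits_bonf += (p < 0.05 / m).any()   # Bonferroni-adjusted threshold

print(f"uncorrected family-wise error: {hits_raw / runs:.2f}")   # ~0.64
print(f"Bonferroni family-wise error:  {hits_bonf / runs:.2f}")  # ~0.05
```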
Poor analysis What should you look out for?
• Retrospective registration of a trial on a trials database;
• Primary end-points not clearly stated;
• Analyses do not directly address the primary end-point(s);
Nuzzo, 2014; Lambdin, 2012
"
• No CONSORT flow diagram
• Analysis of per protocol vs intention-to-treat population;
• Method of imputation not specified (e.g., LOCF, BOCF);
• No correction for multiple comparisons;
What should you look out for?
UNIVERSITY OF THE WITWATERSRAND Nuzzo, 2014; Lambdin, 2012
Poor analysis
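For readers unfamiliar with the imputation methods named above, here is a minimal sketch of last observation carried forward (LOCF). The column names and scores are hypothetical, purely for illustration.

```python
# LOCF imputation: carry each participant's last observed value forward.
import pandas as pd

scores = pd.DataFrame({
    "week0": [6.0, 7.0, 5.0],
    "week4": [4.0, None, 4.5],   # participant 2 dropped out after baseline
    "week8": [3.0, None, None],  # participant 3 dropped out after week 4
})

locf = scores.ffill(axis=1)      # fill missing values from the left (earlier visits)
print(locf)
# BOCF would instead carry week0 (baseline) forward for every missing value.
```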
Poor study design, execution and analysis The experimental method
[Pipeline diagram as before, annotated to locate poor design and execution within the pipeline: the experimental design and data collection steps]
Leek & Peng, 2015
Poor design and execution Common errors in study design
• No sample size calculation (see the sketch after this list);
• No or inappropriate randomization;
• No concealment;
• Study too short;
• Biased sampling;
• Biased/inappropriate measurements;
• Not assessing potential confounders
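As a sketch of what an a priori sample size calculation involves, here is the standard normal-approximation formula for a two-arm parallel trial comparing means. The target difference and SD are hypothetical; a real calculation must use the trial's own assumptions.

```python
# A priori sample size for a two-sample comparison of means
# (two-sided alpha = 0.05, power = 0.90; illustrative numbers).
from scipy import stats

alpha, power = 0.05, 0.90
delta, sd = 1.0, 2.0   # smallest clinically important difference; common SD

z_a = stats.norm.ppf(1 - alpha / 2)   # ~1.96
z_b = stats.norm.ppf(power)           # ~1.28
n_per_group = 2 * (sd / delta) ** 2 * (z_a + z_b) ** 2
print(f"n per group ≈ {n_per_group:.0f}")   # ≈ 84
```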
Bad science Interpreting the data
Filters to apply:
Filter I: Are the methods valid?
Filter II: Are the results clinically important?
Filter III: Are the results important for my practice?
American Society for Reproductive Medicine, 2008
Bad science Interpreting the data
Filter I: Are the methods valid?
• Was the assignment of patients randomized?
• Was the randomization concealed?
• Was follow-up sufficiently long and complete?
• Were all patients analyzed in the groups they were allocated to?
American Society for Reproductive Medicine, 2008
Bad science Interpreting the data
Filter II: Are the results clinically important?
• Was the treatment effect large enough to be clinically relevant?
• Was the treatment effect precise?
• Are the conclusions based on the question posed and the results obtained?
American Society for Reproductive Medicine, 2008
Bad science Interpreting the data
Is it clinically important?
• Effect size (minimal clinically important difference)
• Direction of change
• Precision
Bad science Typical measures of effect size in pain studies
Absolute measures
• Absolute change from baseline
• Numbers needed to treat (NNT)
Relative measures
• Percentage change from baseline
• Risk ratio/relative risk (RR)
• Odds ratio (OR)
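To show how these measures relate, here is a minimal worked sketch on a hypothetical trial (40/100 responders on drug vs 25/100 on placebo; the counts are invented for illustration).

```python
# Absolute and relative effect-size measures from a 2x2 outcome table.
responders_drug, n_drug = 40, 100
responders_placebo, n_placebo = 25, 100

risk_drug = responders_drug / n_drug             # 0.40
risk_placebo = responders_placebo / n_placebo    # 0.25

arr = risk_drug - risk_placebo                   # absolute risk reduction: 0.15
nnt = 1 / arr                                    # numbers needed to treat: ~6.7
rr = risk_drug / risk_placebo                    # risk ratio: 1.6
odds_drug = risk_drug / (1 - risk_drug)
odds_placebo = risk_placebo / (1 - risk_placebo)
odds_ratio = odds_drug / odds_placebo            # odds ratio: 2.0

print(f"NNT ≈ {nnt:.1f}, RR = {rr:.2f}, OR = {odds_ratio:.2f}")
```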
Bad science Precision of the estimate
Trial | Mean pain difference: Drug − Placebo | P value | Change from baseline: Drug | 95% CI of change from baseline: Drug
1 | −1.7 | < 0.001 | −2.1 | −2.4 to −1.8
2 | −0.5 | 0.2 | −1.5 | −1.8 to −1.2
3 | −2.3 | < 0.001 | −3.6 | −3.8 to −3.3
4 | −0.3 | 0.1 | −3.4 | −3.7 to −3.2
Modelled: delta = 1, n = 234 per group, common SD = 2.2, power = 0.9
Bad science Interpreting the data
Filter III: Are the results important for your practice?
• Is the study population similar to the patients in your practice?
• Is the intervention feasible in your own clinical setting?
• What are your patient’s personal risks and potential benefits from the therapy?
• What alternative treatments are available?
American Society for Reproductive Medicine, 2008
“The average human has one breast and one testicle”
Desmond MacHale (School of Mathematical Sciences, University College Cork, Ireland)