
Use and misuse of statistics in publications

Continuing education, Ziekenhuis Oost-Limburg

June 5, 2019

Geert Verbeke

Interuniversity Institute for Biostatistics and statistical Bioinformatics

[email protected]

http://perswww.kuleuven.be/geert verbeke

I will NOT talk about . . .

• Mathematics

• Technical details

• Software

• Algorithms

• . . .

ZOL: 5 juni, 2019 1

I will focus on . . .

• Interpretation

• Frequently observed errors

• Some misconceptions

• Applications

• Publications

• Intuition

• . . .


What is statistics about ?

• Wong et al. [1]:

The New England Journal of Medicine

Established in 1812. February 17, 2005. Vol. 352, No. 7

The Risk of Cesarean Delivery with Neuraxial Analgesia Given Early versus Late in Labor

Cynthia A. Wong, M.D., Barbara M. Scavone, M.D., Alan M. Peaceman, M.D., Robert J. McCarthy, Pharm.D., John T. Sullivan, M.D., Nathaniel T. Diaz, M.D., Edward Yaghmour, M.D., R-Jay L. Marcus, M.D.,

Saadia S. Sherwani, M.D., Michelle T. Sproviero, M.D., Meltem Yilmaz, M.D., Roshani Patel, R.N., Carmen Robles, R.N., and Sharon Grouper, B.S.


• Methods:

We conducted a randomized trial of 750 nulliparous women at term who were in spontaneous labor or had spontaneous rupture of the membranes and who had a cervical dilatation of less than 4.0 cm. Women were randomly assigned to receive intrathecal fentanyl or systemic hydromorphone at the first request for analgesia. Epidural analgesia was initiated in the intrathecal group at the second request for analgesia and in the systemic group at a cervical dilatation of 4.0 cm or greater or at the third request for analgesia. The primary outcome was the rate of cesarean delivery.

• Some results:

Table 2. Primary and Secondary Outcomes.

Outcome | Intrathecal Analgesia | Systemic Analgesia | P Value | Difference (95% CI)
Method of delivery
Cesarean, no./total no. (%) | 65/366 (17.8) | 75/362 (20.7) | 0.31 | −2.9 (−9.0 to 3.0)
Instrumental vaginal, no./total no. (%) | 59/301 (19.6) | 46/287 (16.0) | 0.13 | 3.6 (−2.9 to 10.1)

[Figure: a SAMPLE is drawn at RANDOM from the POPULATION, and STATISTICS is used to draw conclusions about the population from the sample.

Characteristic of population: difference in % cesarean deliveries in total population

Estimate in sample: observed difference in % cesarean deliveries in experiment]

Confidence intervals & p-values

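• Some results from Wong et al. [1] (Table 2, method of delivery): cesarean delivery in 65/366 women (17.8%) with intrathecal analgesia versus 75/362 (20.7%) with systemic analgesia; P = 0.31; difference −2.9 (95% CI −9.0 to 3.0)

• Confidence interval: Based on this experiment, we expect the true difference to be between −9% and 3%, with 95% confidence

• P-value: There is a 31% chance of observing a difference at least as large as the observed −2.9% if, in reality, there is no difference between both treatments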

• Level of significance: α = 0.05, 0.01, 0.10.
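As an illustration of the three bullets above, the following minimal sketch recomputes the difference, a 95% confidence interval and a two-sided p-value from the cesarean counts in Table 2 (65/366 versus 75/362). It assumes a normal approximation (a Wald interval and a pooled z-test); the article's exact method is not stated in the excerpt, and the function name compare_proportions is purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def compare_proportions(x1, n1, x2, n2, level=0.95):
    """Difference in proportions with a Wald CI and a pooled z-test p-value.

    Normal approximation only: an assumption of this sketch, not
    necessarily the method used by Wong et al.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Confidence interval: unpooled standard error of the difference.
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (diff - z_crit * se_ci, diff + z_crit * se_ci)

    # Test of "no difference": pooled standard error under the null hypothesis.
    pooled = (x1 + x2) / (n1 + n2)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_test
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, ci, p_value


# Cesarean deliveries: intrathecal 65/366, systemic 75/362 (Table 2).
diff, (ci_low, ci_high), p_value = compare_proportions(65, 366, 75, 362)
print(f"difference: {100 * diff:.1f} percentage points")
print(f"95% CI: {100 * ci_low:.1f} to {100 * ci_high:.1f}")
print(f"p-value: {p_value:.2f}")
```

Under these assumptions the p-value rounds to the published 0.31. The interval is close to, but need not coincide with, the published −9.0 to 3.0, since a different interval method may have been used in the article.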


• Conclusion:

There is no significant difference in the rate of cesarean delivery between intrathecal analgesia and systemic analgesia (p = 0.31)

Two types of errors

Test result | Reality: No effect | Reality: Effect
No effect | No error | Type II error
Effect | Type I error | No error

• Type I error: Incorrectly concluding there is an effect

• Type II error: Incorrectly concluding there is no effect


• The probability of a type I error can easily be controlled by choosing the level of significance α sufficiently small:

P(Type I error) = α = 1%, 5% or 10%

• The probability of a type II error can only be controlled by conducting sufficiently large experiments:

P(Type II error) = 1 − power ⟹ power calculation, sample size calculation
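The statement that α controls the type I error can be checked with a small simulation. The sketch below repeatedly generates two groups with the same true event rate, so that no effect exists, and counts how often a two-sided z-test at α = 0.05 nevertheless declares a difference. The true rate (19%), group size (200), number of repetitions and random seed are arbitrary assumptions chosen for illustration only.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(x1, x2, n):
    """Pooled z-test p-value for two groups of equal size n."""
    pooled = (x1 + x2) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:  # no events (or only events) in both groups: nothing to test
        return 1.0
    z = (x1 - x2) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.19, n=200, trials=1000, alpha=0.05, seed=1):
    """Fraction of simulated trials that are 'significant' although both
    groups share the same true rate (all values are illustrative)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x1 = sum(rng.random() < true_rate for _ in range(n))
        x2 = sum(rng.random() < true_rate for _ in range(n))
        if two_sided_p(x1, x2, n) < alpha:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"fraction of 'significant' trials without any true effect: {rate:.3f}")
```

The printed fraction should lie near the chosen α of 5%, up to simulation noise. Lowering alpha lowers it accordingly, whereas the type II error depends on the group size n.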


• For example (Wong et al. [1]):

Statistical analysis

The study was designed to have 80 percent power to detect a difference of 50 percent in the rate of cesarean delivery, with a two-sided alpha level of 0.05. The sample size required to detect this difference was 350 subjects per group. The baseline cesarean rate (17 percent) was estimated from institutional data. The 50 percent increase in the cesarean-delivery rate was a conservative estimate based on the increased rate associated with early neuraxial analgesia observed previously in our institution.5 To account for expected losses in participation, 750 sub- [excerpt truncated]
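The sample size quoted in this excerpt can be approximated with a standard textbook formula for comparing two proportions. The sketch below is an assumption about how such a number can be obtained, not the authors' actual calculation: it uses the normal approximation with 17% as baseline rate, a 50% relative increase (25.5%), a two-sided α of 0.05 and 80% power. The function name sample_size_per_group is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided test comparing two proportions
    (normal approximation; an assumed textbook formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


baseline = 0.17             # baseline cesarean rate from the excerpt
increased = baseline * 1.5  # the 50 percent increase the study should detect
n = sample_size_per_group(baseline, increased)
print(f"approximate sample size per group: {n}")
```

The result is of the same order as the 350 subjects per group reported in the article. Small differences are to be expected, because several slightly different approximations are in common use.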


Errors in statistics: Practical implications

. Non-representative samples

. Association versus causality

. Multiple testing

. Equivalence tests

. Significance versus relevance


Non-representative samples

• Often, the sample used does not represent the population of interest:

. No random sample taken

. Withdrawal of participants

. Missing data

. . . .

• The data observed are then no longer a representation of the population

• Statistical methodology can still be applied but it is unclear how to interpret the results


• Example (Boushey et al. [2]):

Figure 1. Enrollment and Outcome.

Reasons for exclusion during the run-in period were the need for inhaled budesonide therapy in 34 patients, excessive symptoms in 30, too few symptoms in 30, withdrawal of consent by 31, loss to follow-up of 19, failure to meet adherence criteria in 17, use of excluded medications by 6, presence of an excluded medical condition in 6, and other causes in 3. The run-in and treatment phases both ended with a 10-to-14-day period of intense combined therapy, consisting of 0.5 mg of prednisone per kilogram per day, 800 µg of budesonide twice daily, and 20 mg of zafirlukast twice daily, plus treatment with albuterol (540 to 720 µg), to eliminate any easily reversed causes of airflow obstruction affecting PEF or FEV1.

[Flow chart, with a time axis in weeks (−4, −2, 0, 13, 26, 39, 48, 52, 54): 411 enrolled, run-in period, 225 randomized, with a 10-day course of intense combined therapy at the end of the run-in period and at the end of the treatment phase.

. 73 assigned to 200 µg of budesonide twice daily + placebo tablets: 6 withdrew, 67 completed trial

. 76 assigned to 20 mg of zafirlukast twice daily + placebo inhaler: 14 withdrew, 62 completed trial

. 76 assigned to placebo tablets + placebo inhaler: 6 withdrew, 70 completed trial]

• Of the 411 subjects enrolled in the study, only 67 + 62 + 70 = 199 contribute to the final analysis

• Key question:

Are subjects included in the analysis systematically different from those not included?

• In order to judge whether the sample is representative of the population, one should collect as much information as possible

Association versus causality

• Suppose an experiment is set up to compare homeopathy (H) with placebo (P).

• Two groups of patients are selected. One receives H, the other receives P.

• (Double) blinding is necessary

• An observed difference between H and P does not necessarily imply H is (more) effective, not even under double blinding

• What if:

. H-group contains more females ?

. H-group is younger ?

. H-group contains ‘better’ patients ?


Randomization is required whenever causal relations are to be shown:

Cause ⟹ Effect

Example: Smoking and lung cancer

• Suppose one wants to study the relation between smoking and lung cancer.

• Randomization is ethically not possible.

• Therefore, the study design is often reversed:

. A group of cancer cases and a comparable group of non-cancer cases is selected

. All subjects are questioned about their smoking behavior in the past.

• Note that any associations detected in such a study should not be interpreted as causal

Example: Smart phones and study results

• Newspaper article (De Morgen, Tuesday, January 9, 2018):

www.demorgen.be

Universities investigate which factors determine study results

Student housing good for scores, divorced parents not

• Study design: Students in the exam study period, take note. The exam result you are about to obtain is not only the consequence of how well you have studied. This is the finding of a study by the universities of Ghent and Antwerp. At the end of 2016 they asked 696 first-year students about their smartphone use and about a whole range of other parameters that could influence study results. The outcomes of the survey were linked to the exam scores of January 2017. What emerged: with smartphone use there is [excerpt continues under Results]

• Results:

[...] linked to the exam scores of January 2017. What emerged: with smartphone use there is a direct causal relationship. Students who checked their smartphone more than three to five times per class, or more than twice per hour while studying, scored on average 1.1 points out of 20 lower on their exams than students with lower-than-average smartphone use. The students with above-average smart- [excerpt truncated]

These factors go together with higher study scores:

► Motivation

► Good results in secondary school

► Living in student housing

► Field of study in secondary school

Foreign origin. But the researchers also see a non-causal association with other factors. For instance, students who start university a year later score on average 0.70 points out of 20 lower. Students whose parents are divorced, who are of foreign origin, or who do not have Dutch as their home language also obtain lower exam scores. Those who fit into several of these categories must add the negative scores together. Those who have both [excerpt truncated]

These factors go together with lower study scores:

► Intensive smartphone use

► Divorced parents

► Dutch not the home language

► Having repeated a year earlier in the school career

► Foreign origin

¿?


Multiple testing

• Each time a test is performed, there is a probability α of making a type I error

• For example, if α = 0.05, we can expect to incorrectly conclude an effect in 5 out of every 100 tests for which, in reality, there is no effect.

• Implication:

“The more tests one performs, the higher the probability that something is detected by pure chance”

• This problem of multiple testing occurs very frequently in bio-medical sciences, in various settings
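The implication above can be quantified under a simplifying assumption: if k tests are independent and no true effect exists in any of them, the probability of at least one type I error is 1 − (1 − α)^k, and the expected number of false positives is k·α. The sketch below evaluates this for a few values of k; 18 and 63 are the numbers of tests in the two examples that follow, and the independence assumption is ours, not a claim about those studies.

```python
def prob_at_least_one_false_positive(k, alpha=0.05):
    """P(at least one type I error) in k independent tests when no
    true effect exists in any of them."""
    return 1 - (1 - alpha) ** k


alpha = 0.05
for k in (1, 5, 18, 63):
    fwer = prob_at_least_one_false_positive(k, alpha)
    print(
        f"{k:>2} tests: P(at least one false positive) = {fwer:.2f}, "
        f"expected number of false positives = {k * alpha:.2f}"
    )
```

Tests reported in a single table are usually not independent, so these numbers are only indicative. They do show why a handful of significant results among dozens of tests is unsurprising.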


Example: Testing many relations

• Amin et al. [3], Table 2:

Variable | Evacuation score, median (IQR) | P-value | Incontinence score, median (IQR) | P-value
Sex
Male | 6 (2–8) | 0.065 | 7 (1.8–9.3) | 0.014
Female | 7 (5–10) | | 8.5 (5–14.3) |
Age group
< 69 years | 6.5 (5–10) | NS | 7 (3–12.8) | NS
≥ 70 years | 6 (2–8) | | 8 (2.3–12) |
Length of follow-up
1–2.9 years | 7 (5–9) | NS | 8 (4–13) | NS
3 or more years | 5 (2–8) | | 7 (1.3–10.8) |
Post-op leak
None | 6 (3–9) | | 7 (2.5–12) |
Minor | 7 (5–17) | NS | 7 (4.5–15) | NS
Major | 10.5 (1.5–17.3) | NS | 10 (4.8–19.8) | NS
Radiotherapy
None | 6 (3–9) | | 7 (3–11) |
Short course | 9 (6–11) | NS | 12 (7–19) | 0.083
Long course | 5 (1–8) | NS | 8 (6.5–18.5) | NS
Anastomotic height
≤ 3 cm | 7 (4.5–9) | NS | 11 (5–15) | 0.001
≥ 3.1 cm | 6 (3–8) | | 6 (2–9) |
Part of colon used
Sigmoid | 5 (2–9) | NS | 7 (2–13) | NS
Descending | 7 (4.5–9) | | 8 (4–12) |

. 18 tests performed

. only 2 significant results

Example: Subgroup analyses

• Kaplan et al. [4], Table 5:

Occupation | All brain tumors: Cases | Controls | OR (CI) | Benign: Cases | Controls | OR (CI) | Malignant: Cases | Controls | OR (CI)
Science professionals | 3 | 6 | 1.2 (0.3–5.1) | 2 | 3 | 1.5 (0.2–9.8) | 1 | 3 | 0.9 (0.1–10.0)
Engineers, architects, and technicians | 4 | 14 | 0.6 (0.2–2.0) | 2 | 5 | 0.8 (0.1–4.5) | 2 | 9 | 0.5 (0.1–2.7)
Health | 3 | 9 | 0.7 (0.2–2.8) | 2 | 8 | 0.5 (0.1–2.5) | 1 | 1 | 2.0 (0.1–39.3)
Executives and managers | 8 | 17 | 1.1 (0.4–2.7) | 8 | 14 | 1.2 (0.5–3.3) | 0 | 3 | Inf
Teachers | 6 | 15 | 0.9 (0.3–2.6) | 4 | 10 | 0.8 (0.2–3.0) | 2 | 5 | 1.1 (0.2–6.9)
Office workers | 26 | 65 | 0.8 (0.5–1.3) | 16 | 41 | 0.7 (0.4–1.5) | 10 | 24 | 0.9 (0.4–2.1)
Telephone and radio operators, electricians | 3 | 5 | 1.2 (0.3–5.2) | 1 | 2 | 1.0 (0.1–11.4) | 2 | 3 | 1.4 (0.2–8.7)
Sales workers | 15 | 24 | 1.3 (0.6–2.5) | 8 | 14 | 1.2 (0.5–3.0) | 7 | 10 | 1.3 (0.4–3.7)
Housewife | 34 | 60 | 1.2 (0.7–2.0) | 24 | 39 | 1.4 (0.7–2.8) | 10 | 21 | 0.8 (0.3–2.3)
Domestic help | 12 | 17 | 1.4 (0.6–3.1) | 11 | 13 | 1.8 (0.7–4.4) | 1 | 4 | 0.4 (0.1–3.8)
Food | 5 | 11 | 0.8 (0.3–2.5) | 4 | 5 | 1.5 (0.4–6.1) | 1 | 6 | 0.3 (0.1–2.5)
Agricultural workers | 2 | 13 | 0.3 (0.1–1.3) | 0 | 9 | | 2 | 4 | 1.1 (0.2–6.1)
Metal workers | 4 | 11 | 0.7 (0.2–2.2) | 2 | 7 | 0.5 (0.1–2.7) | 2 | 4 | 0.9 (0.2–5.3)
Drivers, motor vehicle operators, and service occupations | 17 | 15 | 2.8 (1.3–6.2) | 8 | 10 | 2.0 (0.7–6.1) | 9 | 5 | 4.3 (1.3–14.2)
Construction workers | 5 | 7 | 1.3 (0.4–4.4) | 1 | 5 | 0.3 (0.1–3.2) | 4 | 2 | 3.9 (0.7–22.4)
Wood workers | 3 | 3 | 2.0 (0.4–10.1) | 2 | 1 | 4.1 (0.4–48.0) | 1 | 2 | 0.9 (0.1–11.4)
Weavers and tailors | 11 | 8 | 2.7 (1.1–7.1) | 7 | 4 | 3.7 (1.1–13.5) | 4 | 4 | 1.8 (0.4–8.0)
Printing workers | 2 | 5 | 0.7 (0.1–3.9) | 0 | 3 | | 2 | 2 | 2.0 (0.3–15.1)
Painters | 2 | 3 | 1.2 (0.2–7.2) | 0 | 1 | | 2 | 2 | 1.7 (0.2–12.9)
Unskilled workers | 8 | 13 | 1.2 (0.5–3.0) | 4 | 7 | 1.1 (0.3–4.1) | 4 | 6 | 1.4 (0.4–5.2)
Other | 11 | 28 | 0.8 (0.4–1.7) | 4 | 7 | 1.2 (0.3–4.0) | 7 | 19 | 0.6 (0.2–1.7)

. Tests based on C.I.’s for odds ratios

. C.I. containing 1 is equivalent to a non-significant test result

. 21 × 3 = 63 tests performed

. only 5 significant results

Example: Searching for significant results

• This ‘scientific finding’ was printed in the Belgian newspapers:

• It was even stated that those who wake up before 7.21 am have a statistically significant higher stress level during the day than those who wake up after 7.21 am.

Multiple testing: Conclusion

• Significant results obtained by multiple testing are often overinterpreted

• If the number of tests is reported, the reader knows that such results need to be interpreted with extreme care

• The problem arises when only the significant results are reported, and one does not know how many tests were performed in total

• This leads to the reporting of results that turn out not to be reproducible

Equivalence tests

• Results from Wong et al. [1] (Table 2, method of delivery): cesarean delivery in 65/366 women (17.8%) with intrathecal analgesia versus 75/362 (20.7%) with systemic analgesia; P = 0.31; difference −2.9 (95% CI −9.0 to 3.0)

• In case of a non-significant test result, one often concludes that both groups are identical or equivalent

• An alternative interpretation is that the experiment did not have sufficient power to show an effect which is present.

• If non-significance implied equivalence, studies designed to show equivalence would best be (extremely) small, such that there is no power to detect true differences.


• Conclusion:

Non-significance should not be interpreted as equivalence
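To see why a non-significant result says little in a small study, the sketch below approximates the power of a two-sided test for a genuinely present difference at several group sizes. The true rates (17% versus 25.5%) are borrowed from the design of Wong et al. [1]; the group sizes 30 and 100 are hypothetical, and the normal approximation is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions.

    Normal approximation; the negligible probability of a significant
    result in the wrong direction is ignored.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)         # under "no effect"
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)  # under the true rates
    return NormalDist().cdf((abs(p1 - p2) - z_alpha * se_null) / se_alt)


for n in (30, 100, 350):
    power = approx_power(0.17, 0.255, n)
    print(f"n per group = {n:>3}: power = {power:.2f}, P(non-significant) = {1 - power:.2f}")
```

With 30 subjects per group, the test would be non-significant in the large majority of such studies even though the true rates differ by half, which is exactly why non-significance cannot be read as equivalence.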

• Example (Shatari et al. [5]):

. Title: Long strictureplasty is as safe and effective as short strictureplasty in small-bowel Crohn’s disease

. Table 1: No significant differences!

. Results and conclusions (abstract):

Significance versus relevance

• The power to detect some effect increases with the sample size

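• This implies that any effect, no matter how small, will, sooner or later, be detected, if the sample is sufficiently large.

• Results from Wong et al. [1] (Table 2, method of delivery): cesarean delivery in 65/366 women (17.8%) with intrathecal analgesia versus 75/362 (20.7%) with systemic analgesia; P = 0.31; difference −2.9 (95% CI −9.0 to 3.0)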

• Suppose the observed rates of cesarean deliveries had been 17.8% and 17.9%

• A p-value as small as 0.001 would be likely to be obtained, provided that the sample is sufficiently large.

• Obviously, a difference of 0.1 percentage points is not relevant from a clinical point of view.

• Conclusion:

Statistical significance ≠ Clinical relevance

• It is therefore important not to blindly overinterpret significant results without knowing the size of the effect
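The hypothetical rates of 17.8% and 17.9% can be turned into a small calculation showing how the p-value depends on the sample size alone. The sketch below assumes two groups of equal size, observed rates of exactly 17.8% and 17.9%, and a pooled z-test under the normal approximation; the group sizes are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def p_value_equal_groups(p1, p2, n_per_group):
    """Two-sided pooled z-test p-value for observed rates p1 and p2
    in two groups of n_per_group subjects each."""
    pooled = (p1 + p2) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))


for n in (366, 100_000, 4_000_000):
    p = p_value_equal_groups(0.178, 0.179, n)
    print(f"n per group = {n:>9,}: p = {p:.4f}")
```

With group sizes like those in Wong et al. [1] the difference is nowhere near significant, while with a few million subjects per group the same 0.1 percentage point difference yields p < 0.001. The effect size, not the p-value, determines clinical relevance.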


General conclusion

Statistics provides researchers with a useful toolbox to draw conclusions from data collected

However, there are many pitfalls:

. Data collection

. Data analysis

. Interpretation of results

. Reporting of results


The End !


Bibliography

[1] C.A. Wong, B.M. Scavone, A.M. Peaceman, et al. The risk of cesarean delivery with neuraxial analgesia given early versus late in labor. The New England Journal of Medicine, 352:655–665, 2005.

[2] H.A. Boushey, C.A. Sorkness, T.S. King, et al. Daily versus as-needed corticosteroids for mild persistent asthma. The New England Journal of Medicine, 352:1519–1528, 2005.

[3] A.I. Amin, O. Hallbook, A.J. Lee, R. Sexton, B.J. Moran, and R.J. Heald. A 5-cm colonic J pouch colo-anal reconstruction following anterior resection for low rectal cancer results in acceptable evacuation and continence in the long term. Colorectal Disease, 5:33–37, 2003.

[4] S. Kaplan, S. Etlin, I. Novikov, and B. Modan. Occupational risks for the development of brain tumours. American Journal of Industrial Medicine, 31:15–20, 1997.

[5] T. Shatari, M.A. Clark, T. Yamamoto, A. Menon, C. Keh, J. Alexander-Williams, and M. Keighley. Long strictureplasty is as safe and effective as short strictureplasty in small-bowel Crohn’s disease. Colorectal Disease, 6:438–441, 2004.
