Tim Vines, University of British Columbia

The crumbling wall: data archiving and reproducibility in published science


Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Thor Veen, Tim Vines, and Sam Yeaman

   


Reproducibility

• Science is the search for general ‘rules’
• Replication tests different circumstances
• Reproducibility checks existing results

• We hope bad papers will be discarded
• But maybe many papers are ‘wrong’?
  – We need to quantify this problem…

• Reproducibility needs the original data
• Then we need to repeat the analyses
• Here are two iterations of this process…

How does the availability of data change with time since publication?

Vines et al. Current Biology 2014

Introduction

[Figure: information about a dataset lost over Time after ‘Study published’. Labels: minor details lost/forgotten; important metadata lost/forgotten; accidental data loss; career change, email breaks; death of researcher. After Michener et al. (1997) Nongeospatial metadata for the ecological sciences. Ecol. Appl. 7:330]

• How fast does this happen?

[Figure: timeline starting at ‘Study published’, axis 0 to 35 years]

[Figure: timeline starting at ‘Study published’, axis 0 to 7 years]

• What are the main causes of data loss?

[Figure: Time axis starting at ‘Study published’. Labels: accidental data loss; data storage defunct; career change, email breaks; death of researcher]

• Ask for datasets, see how many you get…

Methods

• Need to control for data type
  – morphological data from animals & plants
  – used in a Discriminant Function Analysis

[Figure: Time axis starting at ‘Study published’. Labels: minor details lost; important metadata lost]

• 516 studies in odd years 1991-2011
• Asked for data by email
  – searched for emails in paper and online
  – contacted first, last & corresponding authors
• “We want to try repeating your DFA”
  – part of study on reproducibility and paper age

• Author motivation:
  – we’re trapped in burning building vs
  – we want to print it out for wallpaper
• Our request is fairly common practice
  – expect 20-50% for 2011

• Motivation sets total % of data we receive
• But our focus is on how % changes with time
  – as long as we get some data we’re OK

• If data were gone, we asked for the reason

Results

[Figure: P(extant data) against age of paper (years). Probability that data still extant (i.e. received + couldn’t be shared)]

• Odds of data being extant fall by 8% per yr
• Almost all gone after 20 years
  – just 3 of 61 datasets extant for 1991 and 1993
• Why were we unable to get the data?
  – which reasons are related to paper age?
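The 8% yearly fall in odds can be turned into probabilities with a short calculation. The sketch below assumes the odds shrink by the same fraction every year, as they would under a logistic model; the starting probability of 0.5 and the function names are hypothetical and are not taken from the study.

```python
def probability_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def extant_probability(p_start, years, yearly_decline=0.08):
    """Probability that data are extant after `years`.

    Assumes a constant multiplicative decline in the odds
    (0.08 is the 8% per year quoted on the slide).
    p_start is a hypothetical starting probability.
    """
    odds = probability_to_odds(p_start) * (1.0 - yearly_decline) ** years
    return odds_to_probability(odds)


if __name__ == "__main__":
    for age in (0, 5, 10, 20):
        print(age, round(extant_probability(0.5, age), 3))
```

Working in odds rather than probabilities keeps the result between 0 and 1 however many years are applied.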

[Figure: P(email got through) against age of paper (years). Probability that at least one email for authors on the paper we contacted didn’t bounce]

[Figure: P(response | email got through) against age of paper (years). Given that at least one email didn’t bounce, probability we got a response (motivation to respond is unrelated to paper age)]

[Figure: P(useful response | response) against age of paper (years). Given that we got a response, probability we heard about the data]

[Figure: P(data extant | useful response) against age of paper (years). Given that we heard about the data, probability data is extant]
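The four conditional probabilities above form a chain, and multiplying them gives the chance that a request gets all the way to a confirmed extant dataset. A minimal sketch, with stage values that are purely hypothetical (they are not read off the figures):

```python
from math import prod

# Hypothetical stage probabilities for papers of one age class.
stages = {
    "email got through": 0.9,
    "response | email got through": 0.5,
    "useful response | response": 0.8,
    "data extant | useful response": 0.5,
}


def confirmed_extant(stage_probabilities):
    """Multiply the conditional probabilities along the chain."""
    return prod(stage_probabilities)


overall = confirmed_extant(stages.values())
print(f"P(confirmed extant) = {overall:.2f}")
```

Splitting the outcome this way shows which stage changes with paper age and which does not.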

Conclusions

• Data held by authors disappears fast
• Almost all gone after 20 years
• Archiving at publication really is crucial

Vines et al. Current Biology 2014

Reproducibility Part I: Discriminant Functions

• We received 101 files from authors
  – these are only the first step
• Are these the actual data from the paper?
• We tried to repeat their DFA

• What’s a Discriminant Function Analysis?
  – you have 2 or more groups of something
  – you want to be able to tell the groups apart
  – the groups differ in e.g. size & shape
  – you measure a few things
  – the DF says what aspect of size/shape is best for distinguishing the groups
  – the DFA produces three useful metrics:
    1. the percent variance explained by the 1st axis
    2. the loading coefficient
    3. the percentage of individuals correctly assigned
• We tried to reproduce these metrics
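The idea can be sketched for the simplest case: two groups and two measurements, using Fisher's linear discriminant. This is an illustration only, not the software used by the original authors or in the reanalysis; the measurements and function names are made up. With two groups there is a single discriminant axis, so the percent variance explained by the 1st axis is 100% by construction, and the sketch reports the other two metrics.

```python
def mean(rows):
    """Column means of a list of (x, y) measurements."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def within_scatter(groups):
    """2x2 within-group scatter matrix, summed over all groups."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def loading_coefficients(group_a, group_b):
    """Fisher's direction w = Sw^-1 (mean_a - mean_b)."""
    s = within_scatter([group_a, group_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    ma, mb = mean(group_a), mean(group_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def percent_correct(group_a, group_b, w):
    """Share of individuals on the right side of the midpoint cutoff.

    Individuals are scored on the same data used to fit the axis
    (no cross-validation), which is an assumption of this sketch.
    """
    def score(r):
        return w[0] * r[0] + w[1] * r[1]

    cutoff = (score(mean(group_a)) + score(mean(group_b))) / 2.0
    hits = sum(score(r) > cutoff for r in group_a)
    hits += sum(score(r) <= cutoff for r in group_b)
    return 100.0 * hits / (len(group_a) + len(group_b))


# Hypothetical (length, width) measurements for two groups.
group_a = [(5.1, 3.5), (4.9, 3.0), (5.0, 3.4), (5.4, 3.9)]
group_b = [(6.4, 3.2), (6.9, 3.1), (6.5, 2.8), (6.3, 3.3)]

w = loading_coefficients(group_a, group_b)
print("loading coefficients:", [round(c, 2) for c in w])
print("percent correctly assigned:", percent_correct(group_a, group_b, w))
```

A reanalysis compares numbers like these against the values printed in the paper.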

 

Reproducibility Part I

• We started with 101 studies
  – 16 didn’t contain any of our three metrics
  – these were excluded
• What happened with the rest?

Outcome                      Percent
Unclear methods                    4
Insufficient metadata              7
Incorrect/incomplete data          9
[Subtotal]                      [20]
Reanalysis attempted:
  Results don’t match             21
  Some metrics match              18
  All metrics match               31
Overall Total                    100

• Only 52 could be reproduced
  – 10% of the 516 datasets requested
• How far off were we?

[Figure: Reanalyzed % Variance Explained against Published % Variance Explained. 83% of 23 reanalyses within 5% of published value]

[Figure: Reanalyzed Coefficient against Published Coefficient. 70% of 23 reanalyses within 1 decimal place of published value]

[Figure: Reanalyzed % Correctly Assigned against Published % Correctly Assigned. 75% of 56 reanalyses within 5% of published value]

• Strong differences between metrics
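Summaries such as "within 5% of published value" can be computed by comparing each reanalysed number with its published counterpart. The sketch below assumes the tolerance is an absolute difference in percentage points, which the slides do not state; the values and the function name are hypothetical.

```python
def share_within(published, reanalysed, tolerance):
    """Percent of reanalyses whose value lies within `tolerance`
    (absolute difference) of the published value."""
    pairs = list(zip(published, reanalysed))
    close = sum(abs(p - r) <= tolerance for p, r in pairs)
    return 100.0 * close / len(pairs)


# Hypothetical published and reanalysed values for one metric.
published = [92.0, 78.5, 85.0, 99.0]
reanalysed = [91.0, 70.0, 85.5, 99.0]

print(share_within(published, reanalysed, tolerance=5.0))
```

A tighter or looser tolerance changes the headline figure, which is one reason the metrics are hard to compare directly.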

Conclusions

Reproducibility Part I

• Getting the data is the biggest obstacle
  – accounts for 80% of total
• Poor curation takes out only 4%
  – 22% of received datasets
• For DFA, reproducibility is quite good
  – but depends a lot on the metric used
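These percentages follow from the counts given earlier in the talk (516 datasets requested, 101 received, 52 reproduced). A quick check, assuming that "total" means the 516 requested datasets; the variable names are mine.

```python
REQUESTED = 516           # datasets asked for
RECEIVED = 101            # files received from authors
REPRODUCED = 52           # analyses that could be reproduced
POOR_CURATION_SHARE = 0.22  # share of received datasets, from the slide


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


never_received = percent(REQUESTED - RECEIVED, REQUESTED)
poor_curation = percent(POOR_CURATION_SHARE * RECEIVED, REQUESTED)
reproduced = percent(REPRODUCED, REQUESTED)

print(f"never received: {never_received:.1f}% of requested")
print(f"poor curation:  {poor_curation:.1f}% of requested")
print(f"reproduced:     {reproduced:.1f}% of requested")
```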

Do data archiving policies work?

• journals now have data archiving policies
• four flavours:
  1. no policy
  2. recommend
  3. require
     a. no ‘data availability’ statement
     b. ‘data availability’ statement

Vines et al. (2013) FASEB J.

• focus on single type of data
  – genetic data used in STRUCTURE
• must have established online archive
  – in this case Dryad (or supp. mat.)
• found 229 papers from 2011-12
  – what % had data available?

[Figure: % eligible papers with data available online, by journal. Journals: Cons. Gen., Crop Science, Genetica, TAG, BMC Evol. Biol., BJLS, J. Heredity, PLoS One, J. Evolutionary Biology, Evolution, Heredity, Molecular Ecology. Sample sizes as printed: n=47, n=12, n=9, n=21, n=13, n=13, n=12, n=51, n=10, n=6, n=7, n=28. Groups: no archiving policy; recommends archiving; mandates archiving (no data statement / data statement)]

Conclusions

• journals need to get tough
• give priority to papers with good archiving?
• have reviewers assess data statement

“Papers with exemplary data and code archiving are more valuable for future research, and, all else being equal, these are more likely to get accepted for publication”

How journals can boost data sharing

The journal ecosystem is a powerful filter of scientific literature, promoting the best work into the best journals. Why not use a similar mechanism to encourage more comprehensive data sharing?

Several journals have introduced policies mandating that data be shared on a public archive at publication. However, these have met with limited success, perhaps because of authors’ fear of losing control, being scooped in subsequent papers or having errors exposed. Moreover, compliance with data sharing policies is typically only checked after the paper is accepted.

To spur excellence in data sharing, journals must recognise that better sharing leads to stronger papers, and judge their submissions accordingly. Articles with feeble sharing efforts should either improve or be rejected.

A focus on publishing verifiable research correspondingly boosts journal reputation, and signals to the author community that withholding data restricts them to publication in less prestigious journals.

Timothy H. Vines
University of British Columbia

Reproducibility Part II: genetic data

Gilbert et al. (2012) Molecular Ecology

• Reproducing simple stats (a DFA) was OK
• modern stats are more sophisticated
• most involve numerical optimization
  – can get a different answer each time
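Why an optimizer can return a different answer each time is easy to show with a toy problem. The sketch below is a stand-in, not STRUCTURE: it climbs a made-up objective with two peaks from a random starting point, so the peak it reports depends on the random seed. The objective, step size and function names are all assumptions for illustration.

```python
import random


def slope(x):
    """Derivative of the toy objective f(x) = -(x**2 - 1)**2 + 0.3*x,
    which has one peak near x = -1 and a higher one near x = 1."""
    return -4.0 * x * (x * x - 1.0) + 0.3


def climb(start, step=0.01, iterations=2000):
    """Plain gradient ascent: follow the slope uphill from `start`."""
    x = start
    for _ in range(iterations):
        x += step * slope(x)
    return x


def run(seed):
    """One analysis run: random starting point, then climb."""
    rng = random.Random(seed)
    return climb(rng.uniform(-2.0, 2.0))


if __name__ == "__main__":
    for seed in range(5):
        print(seed, round(run(seed), 3))
```

Recording the seed makes a single run repeatable, but a reanalysis that uses a different seed or starting point may still land on the other peak.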

Page 91: The$crumbling$wall:$$ dataarchiving$and$reproducibility ...summit.sfu.ca/.../13928/...reproducibility_in_published_science_Vines… · Reproducibility$ • Science$is$the$search$for$general$‘rules’$$

Reproducibility  Part  II  

•  34  datasets  from  the  previous  study  

•  all  have  a  STRUCTURE  analysis    

Page 92: The$crumbling$wall:$$ dataarchiving$and$reproducibility ...summit.sfu.ca/.../13928/...reproducibility_in_published_science_Vines… · Reproducibility$ • Science$is$the$search$for$general$‘rules’$$
Page 93: The$crumbling$wall:$$ dataarchiving$and$reproducibility ...summit.sfu.ca/.../13928/...reproducibility_in_published_science_Vines… · Reproducibility$ • Science$is$the$search$for$general$‘rules’$$
Page 94: The$crumbling$wall:$$ dataarchiving$and$reproducibility ...summit.sfu.ca/.../13928/...reproducibility_in_published_science_Vines… · Reproducibility$ • Science$is$the$search$for$general$‘rules’$$

Reproducibility  Part  II  

•  34  datasets  from  the  previous  study  

•  all  have  a  STRUCTURE  analysis  

•  this  uses  extensive  numerical  op2miza2on  

Page 95: The$crumbling$wall:$$ dataarchiving$and$reproducibility ...summit.sfu.ca/.../13928/...reproducibility_in_published_science_Vines… · Reproducibility$ • Science$is$the$search$for$general$‘rules’$$
Page 96: The$crumbling$wall:$$ dataarchiving$and$reproducibility ...summit.sfu.ca/.../13928/...reproducibility_in_published_science_Vines… · Reproducibility$ • Science$is$the$search$for$general$‘rules’$$

Reproducibility Part II

• 34 datasets from the previous study

• All have a STRUCTURE analysis

• This uses extensive numerical optimization

• Output is K, the number of distinct clusters


Reproducibility Part II

• Can we reproduce their value of K?

• 4 studies were excluded
  – no data, irregular use of STRUCTURE

• Reanalyzed remaining 30 datasets


Outcome                        No. datasets   Percent

Strange use of STRUCTURE             2            6
Missing data                         2            6
Incorrect/incomplete data            3            9

Reanalysis attempted:
  K didn't match                     6           18
  K matched                         21           62

Overall total                       34          100


Reproducibility Part II

• Can we reproduce their value of K?

• 4 studies were excluded
  – no data, irregular use of STRUCTURE

• How close did we get?


[Figure: reanalysis value of K (y-axis, 0 to 6) plotted against the authors' original chosen value of K (x-axis, 0 to 6)]


Reproducibility Part II

• Most mismatches came from poor software use
  – stochastic methods need many iterations
  – too few and the answer is unreliable

• Poor curation was less of a problem
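The point about iteration counts can be sketched with a toy Monte Carlo estimate. This is not STRUCTURE's algorithm, and all names below are hypothetical. It only shows the general behaviour of stochastic methods: with few iterations, repeated runs disagree widely; with many, they agree closely.

```python
import random


def estimate_pi(n_iterations, seed):
    """Monte Carlo estimate of pi from random points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_iterations):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:       # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n_iterations


def spread(n_iterations, n_runs=20):
    """Range (max minus min) of the estimates across independent runs."""
    estimates = [estimate_pi(n_iterations, seed) for seed in range(n_runs)]
    return max(estimates) - min(estimates)


spread_short = spread(50)       # too few iterations per run
spread_long = spread(20000)     # many iterations per run

print(f"run-to-run spread with 50 iterations:    {spread_short:.3f}")
print(f"run-to-run spread with 20000 iterations: {spread_long:.3f}")
```

Short runs that disagree this much with each other cannot be expected to agree with a reanalysis either.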


Grand  Conclusions  


• STRUCTURE reproducibility > DFA
  – 65% vs 50%

• Is under 100% reproducibility unacceptable?

• Maybe replication is more important


• Data availability is the biggest problem
  – without it, reproducibility = 0

• We need stronger data archiving policies

• May mean better science as well
  – someone will check your data…


Thanks to: Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Loren Rieseberg, Thor Veen, Mike Whitlock, Sam Yeaman