
Tim Vines, University of British Columbia

The crumbling wall: data archiving and reproducibility in published science

Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Thor Veen, Tim Vines, and Sam Yeaman

Reproducibility

• Science is the search for general 'rules'
• Replication tests different circumstances
• Reproducibility checks existing results

Reproducibility

• We hope bad papers will be discarded
• But maybe many papers are 'wrong'?
  – We need to quantify this problem…

Reproducibility

• Reproducibility needs the original data
• Then we need to repeat the analyses
• Here are two iterations of this process…

How does the availability of data change with time since publication?

Vines et al. Current Biology 2014

Introduction

Michener et al. (1997) Nongeospatial metadata for the ecological sciences. Ecol. Appl. 7:330

[Figure: information about a study decays with time after publication. Minor details are lost or forgotten first, then important metadata; meanwhile careers change and email addresses break, data storage becomes defunct, data is accidentally lost, and eventually the researcher dies. Timelines shown over 0–35 and 0–7 years after publication.]

• How fast does this happen?
• What are the main causes of data loss?
• Ask for datasets, see how many you get…

Methods

• Need to control for data type
  – morphological data from animals & plants
  – used in a Discriminant Function Analysis

• 516 studies in odd years 1991–2011

• Asked for data by email
  – searched for emails in paper and online
  – contacted first, last & corresponding authors

• "We want to try repeating your DFA"
  – part of study on reproducibility and paper age

• Author motivation:
  – "we're trapped in a burning building" vs
  – "we want to print it out for wallpaper"

• Our request is fairly common practice
  – expect 20–50% for 2011

• Motivation sets total % of data we receive
• But our focus is on how % changes with time
  – as long as we get some data we're OK

• If data were gone, we asked for the reason

Results

[Figure: probability that data are still extant (i.e. received + couldn't be shared) against age of paper, 5–20 years.]

• Odds of data being extant fall by 8% per year (see the sketch below)
• Almost all gone after 20 years
  – just 3 of 61 datasets extant for 1991 and 1993
• Why were we unable to get the data?
  – which reasons are related to paper age?
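A minimal sketch (assumed for illustration; the slides don't show the authors' code) of how an "odds fall by X% per year" figure comes from a logistic regression of data-extant status on paper age, with exp(slope) giving the odds ratio per year:

```python
# Hypothetical illustration: logistic regression of whether a paper's data
# are still extant on the paper's age. Data are simulated, not the study's.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
age = rng.integers(2, 22, size=516).astype(float)   # paper age in years
slope_true = np.log(0.92)                # odds ratio 0.92/yr, ~8% decline
p_extant = 1 / (1 + np.exp(-(0.5 + slope_true * age)))
extant = (rng.random(516) < p_extant).astype(int)   # simulated outcomes

res = sm.Logit(extant, sm.add_constant(age)).fit(disp=0)
odds_change = 100 * (np.exp(res.params[1]) - 1)
print(f"odds of extant data change by {odds_change:.1f}% per year")
```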

[Figure: probability that at least one email for the authors we contacted didn't bounce, against age of paper, 5–20 years.]

[Figure: given that at least one email didn't bounce, probability we got a response, against age of paper, 5–20 years. Motivation to respond is unrelated to paper age.]

[Figure: given that we got a response, probability we heard about the data, against age of paper, 5–20 years.]

[Figure: given that we heard about the data, probability the data are extant, against age of paper, 5–20 years.]
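Read as a chain, these four conditional probabilities multiply out to the overall probability from the first results figure (a decomposition implied by the figure sequence rather than stated on the slides):

$$P(\text{extant}) = P(\text{email through}) \cdot P(\text{response} \mid \text{through}) \cdot P(\text{useful} \mid \text{response}) \cdot P(\text{extant} \mid \text{useful})$$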

Conclusions

• Data held by authors disappears fast
• Almost all gone after 20 years
• Archiving at publication really is crucial

Vines et al. Current Biology 2014

Reproducibility Part I: Discriminant Functions

Reproducibility Part I

• We received 101 files from authors
  – these are only the first step
• Are these the actual data from the paper?
• We tried to repeat their DFA

• What's a Discriminant Function Analysis?
  – you have 2 or more groups of something
  – you want to be able to tell the groups apart
  – the groups differ in e.g. size & shape
  – you measure a few things
  – the DF says what aspect of size/shape is best for distinguishing the groups
  – the DFA produces three useful numbers:
    1. the percent variance explained by the 1st axis
    2. the loading coefficient
    3. the percentage of individuals correctly assigned

• We tried to reproduce these metrics (a minimal sketch follows below)
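A minimal Python sketch (illustration only, not the authors' code; the iris measurements stand in for the morphological data) of fitting a DFA and extracting the three metrics:

```python
# Hypothetical DFA: fit a linear discriminant analysis and pull out the
# three metrics the reanalysis tried to reproduce.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 4 measurements, 3 groups

lda = LinearDiscriminantAnalysis().fit(X, y)

pve_axis1 = 100 * lda.explained_variance_ratio_[0]  # 1. % variance, 1st axis
loadings = lda.scalings_[:, 0]                      # 2. loading coefficients
pac = 100 * lda.score(X, y)                         # 3. % correctly assigned

print(f"PVE (axis 1): {pve_axis1:.1f}%")
print(f"1st-axis loadings: {np.round(loadings, 2)}")
print(f"% correctly assigned: {pac:.1f}%")
```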

 

Reproducibility Part I

• We started with 101 studies
  – 16 didn't contain any of our three metrics
  – these were excluded
• What happened with the rest?

Outcome                        Percent
Unclear methods                   4
Insufficient metadata             7
Incorrect/incomplete data         9
[Subtotal]                      [20]
Reanalysis attempted:
  Results don't match            21
  Some metrics match             18
  All metrics match              31
Overall Total                   100

Reproducibility Part I

• We started with 101 studies
  – 16 didn't contain any of our three metrics
  – these were excluded
• Only 52 could be reproduced
  – 10% of the 516 datasets requested
• How far off were we?

[Figure: reanalyzed % variance explained against published % variance explained (40–100%). 83% of 23 reanalyses within 5% of the published value.]

[Figure: reanalyzed coefficient against published coefficient (0–20). 70% of 23 reanalyses within 1 decimal place of the published value.]

[Figure: reanalyzed % correctly assigned against published % correctly assigned (60–100%). 75% of 56 reanalyses within 5% of the published value.]

Reproducibility Part I

• Strong differences between metrics

Conclusions

Reproducibility Part I

• Getting the data is the biggest obstacle
  – accounts for 80% of total
• Poor curation takes out only 4%
  – 22% of received datasets
• For DFA, reproducibility is quite good
  – but depends a lot on the metric used

Do data archiving policies work?

• journals now have data archiving policies
• four flavours:
  1. no policy
  2. recommend
  3. require
     a. no 'data availability' statement
     b. 'data availability' statement

Vines et al. (2013) FASEB J.

• focus on single type of data
  – genetic data used in STRUCTURE
• must have established online archive
  – in this case Dryad (or supp. mat.)
• found 229 papers from 2011–12
  – what % had data available?

[Figure: % of eligible papers with data available online, by journal. No archiving policy: Cons. Gen. (n=47), Crop Science (n=12), Genetica (n=9), TAG (n=21). Recommends archiving: BMC Evol. Biol. (n=13), BJLS (n=13), J. Heredity (n=12), PLoS One (n=51). Mandates archiving, no data statement: J. Evolutionary Biology (n=10), Evolution (n=6). Mandates archiving, with data statement: Heredity (n=7), Molecular Ecology (n=28).]

Conclusions

• journals need to get tough
• give priority to papers with good archiving?
• have reviewers assess data statement

"Papers with exemplary data and code archiving are more valuable for future research, and, all else being equal, these are more likely to get accepted for publication"

How journals can boost data sharing

The journal ecosystem is a powerful filter of scientific literature, promoting the best work into the best journals. Why not use a similar mechanism to encourage more comprehensive data sharing?

Several journals have introduced policies mandating that data be shared on a public archive at publication. However, these have met with limited success, perhaps because of authors' fear of losing control, being scooped in subsequent papers or having errors exposed. Moreover, compliance with data sharing policies is typically only checked after the paper is accepted.

To spur excellence in data sharing, journals must recognise that better sharing leads to stronger papers, and judge their submissions accordingly. Articles with feeble sharing efforts should either improve or be rejected.

A focus on publishing verifiable research correspondingly boosts journal reputation, and signals to the author community that withholding data restricts them to publication in less prestigious journals.

Timothy H. Vines, University of British Columbia

Reproducibility Part II: genetic data

Gilbert et al. (2012) Molecular Ecology

Reproducibility Part II

• Reproducing simple stats (a DFA) was OK
• modern stats are more sophisticated
• most involve numerical optimization
  – can get a different answer each time

• 34 datasets from the previous study
• all have a STRUCTURE analysis
• this uses extensive numerical optimization
• output is K, the number of distinct clusters (see the sketch below)
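A hedged analogy (STRUCTURE itself is a Bayesian MCMC program and is not shown here): choosing the number of clusters K with a stochastic method, where different random seeds can favour different K, which is one reason reanalyses may not match the published value.

```python
# Hypothetical stand-in for STRUCTURE's K selection: fit Gaussian mixtures
# for K = 1..5 and pick the K with the lowest BIC. With few restarts the
# chosen K can vary from run to run; more restarts (n_init) stabilise it.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: two overlapping groups of 5 "trait" measurements each
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(1, 1, (100, 5))])

for seed in range(3):  # repeat the whole analysis with different seeds
    bic = [GaussianMixture(n_components=k, n_init=1, random_state=seed)
           .fit(X).bic(X)
           for k in range(1, 6)]
    print(f"seed {seed}: chosen K = {int(np.argmin(bic)) + 1}")
```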

Reproducibility Part II

• Can we reproduce their value of K?
• 4 studies were excluded
  – no data, irregular use of STRUCTURE
• Reanalyzed remaining 30 datasets

Outcome                     No. datasets   Percent
Strange use of STRUCTURE          2           6
Missing data                      2           6
Incorrect/incomplete data         3           9
Reanalysis attempted:
  K didn't match                  6          18
  K matched                      21          62
Overall Total                    34         100

Reproducibility Part II

• How close did we get?

[Figure: chosen K value of the reanalysis against the authors' chosen K value (both 0–6).]

Reproducibility Part II

• Most mismatches from poor software use
  – stochastic methods need many iterations
  – too few and the answer is unreliable
• Poor curation was less of a problem

Grand Conclusions

• STRUCTURE reproducibility > DFA
  – 65% vs 50%
• Is under 100% reproducibility unacceptable?
• Maybe replication is more important
• Data availability is the biggest problem
  – without it, reproducibility = 0
• We need stronger data archiving policies
• May mean better science as well
  – someone will check your data…

Thanks to: Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kim Gilbert, Nolan Kane, Jean-Sébastien Moore, Brook Moyers, Sébastien Renaut, Diana Rennison, Loren Rieseberg, Thor Veen, Mike Whitlock, Sam Yeaman

   
