Peter Li at GCC2014: A journal’s experiences of reproducing published data analyses

A journal’s experiences of reproducing published data analyses. Peter Li, peter@gigasciencejournal.com

Upload: gigascience-bgi-hong-kong

Post on 28-Jan-2015


DESCRIPTION

Peter Li at the 2014 Galaxy Community Conference: A journal’s experiences of reproducing published data analyses, 1st July 2014

TRANSCRIPT

Page 1: Peter Li at GCC2014: A journal’s experiences of reproducing published data analyses

A journal’s experiences of reproducing published data analyses

Peter Li

peter@gigasciencejournal.com

Page 2:

Journal and database for large-scale data studies

Editor-in-Chief: Laurie Goodman

Executive Editor: Scott Edmunds

Commissioning Editor: Nicole Nogoy

GigaDB: Chris Hunter, Jesse Xiao

GigaGalaxy: Peter Li

in conjunction with

Page 3:

www.gigasciencejournal.com

Page 4:

reproducibility

trust

understanding

Page 5:

Reproducibility spectrum

Publication only (not reproducible)

Publication + data

Publication + code and data

Publication + linked and executable code and data

Full replication (gold standard)

Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.

Page 6:

gigadb.org

Page 7:

Paper DOI

Data set DOI

Linking of papers and data by citation of DOIs
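
As a minimal sketch of how a paper DOI and a data set DOI can be turned into citable, resolvable links: the function name `doi_url` and both example DOIs are hypothetical placeholders, not identifiers from the talk. The only fixed fact assumed is that the public resolver at doi.org redirects a DOI to its landing page.

```python
from urllib.parse import quote


def doi_url(doi):
    """Build a resolver URL for a DOI, dropping an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Keep the slash between prefix and suffix; escape anything unusual.
    return "https://doi.org/" + quote(doi, safe="/")


# Placeholder DOIs for illustration only.
paper = doi_url("doi:10.1234/example-paper")
dataset = doi_url("10.1234/example-dataset")
print(paper)
print(dataset)
```

Citing both links in the reference list is what ties the paper to its data set in both directions.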

Page 8:

Reproducibility spectrum

Publication only (not reproducible)

Publication + data

Publication + code and data

Publication + linked and executable code and data

Full replication (gold standard)

Adapted from Roger Peng (2011) Reproducible research in computational science. Science 334: 1226-1227.

Page 9:

Can the results in a GigaScience paper be replicated using Galaxy?

Page 10:

Pilot project

Page 11:

Replicate

Page 12:

Tools

http://gigadb.org/dataset/100044

Page 13:

Tools and data

http://gage.cbcb.umd.edu/data/index.html

Page 14:

Data in GigaGalaxy

Page 15:

Integration of SOAPdenovo2 into GigaGalaxy

Page 16:

Downloaded pipeline is missing two tools for reproducibility

Downloaded pipeline: Short reads -> KmerFreq_AR -> Corrector_AR -> SOAPdenovo2 -> GapCloser -> Scaffold seqs

Required pipeline: Short reads -> KmerFreq_AR -> Corrector_AR -> SOAPdenovo2 -> GapCloser -> ExtractACGT -> GAGE eval -> Table 2 N50 & corrected N50 scores
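
The gap between the downloaded and the required pipeline on this slide can be sketched as an ordered difference between two tool lists. The tool names come from the slide; the variable and function names are illustrative assumptions.

```python
# Tool order as listed on the slide; the list names are illustrative.
downloaded = ["KmerFreq_AR", "Corrector_AR", "SOAPdenovo2", "GapCloser"]
required = [
    "KmerFreq_AR",
    "Corrector_AR",
    "SOAPdenovo2",
    "GapCloser",
    "ExtractACGT",
    "GAGE eval",
]


def missing_tools(available, needed):
    """Return the tools in `needed` that are absent from `available`, in pipeline order."""
    have = set(available)
    return [tool for tool in needed if tool not in have]


print(missing_tools(downloaded, required))  # ['ExtractACGT', 'GAGE eval']
```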

Page 17:

Required pipeline: Short reads -> KmerFreq_AR -> Corrector_AR -> SOAPdenovo2 -> GapCloser -> ExtractACGT -> GAGE eval -> Table 2 N50 & corrected N50 scores

Need to add two extra tools into GigaGalaxy
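
For readers unfamiliar with the N50 score that the required pipeline has to reproduce, here is a minimal sketch of the standard definition: the length L such that sequences of length L or longer hold at least half of all assembled bases. The function name `n50` and the example lengths are made up for illustration. This is not the GAGE evaluation code, and the error-splitting step behind the corrected N50 is not modeled.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from longest to shortest until half of the total is covered.
    for length in ordered:
        running += length
        if running >= half:
            return length


# Made-up lengths in kb, for illustration only.
example = [80, 70, 50, 40, 30, 20]
print(n50(example))  # 70: the two longest sequences cover 150 of 290 kb
```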

Page 18:

SOAPdenovo2 S. aureus pipeline

Page 19:

Published and Galaxy-reproduced statistics of genome assemblies of S. aureus and R. sphaeroides

Published

                              Contigs                                       Scaffolds
Species         Tool          Number  N50 (kb)  Errors  N50 corrected (kb)  Number  N50 (kb)  Errors  N50 corrected (kb)
S. aureus       SOAPdenovo1   79      148.6     156     23                  49      342       0       342
S. aureus       SOAPdenovo2   80      98.6      25      71.5                38      1086      2       1078
S. aureus       ALL-PATHS-LG  37      149.7     13      119.0               11      1477      1       1093
R. sphaeroides  SOAPdenovo1   2241    3.5       400     2.8                 956     106       24      68
R. sphaeroides  SOAPdenovo2   721     18        106     14.1                333     2549      4       2540
R. sphaeroides  ALL-PATHS-LG  190     41.9      30      36.7                32      3191      0       0

Reproduced

                              Contigs                                       Scaffolds
Species         Tool          Number  N50 (kb)  Errors  N50 corrected (kb)  Number  N50 (kb)  Errors  N50 corrected (kb)
S. aureus       SOAPdenovo1   79      148.6     156     23                  49      342       0       342
S. aureus       SOAPdenovo2   80      98.6      25      71.5                38      1086      2       1078
S. aureus       ALL-PATHS-LG  37      149.7     13      117.6               10      1477      1       1093
R. sphaeroides  SOAPdenovo1   2242    3.5       392     2.8                 956     105       18      70
R. sphaeroides  SOAPdenovo2   721     18        106     14.1                333     2549      4       2540
R. sphaeroides  ALL-PATHS-LG  190     41.9      31      36.7                32      3191      0       3310
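
One way to check how closely the reproduced statistics match the published ones is a cell-by-cell comparison. The sketch below uses the values transcribed from this slide and assumes the first table is the published one and the second the Galaxy-reproduced one, following the order of the slide labels. The dictionary layout, column names and function name are illustrative.

```python
COLUMNS = (
    "contig number", "contig N50 (kb)", "contig errors", "contig N50 corrected (kb)",
    "scaffold number", "scaffold N50 (kb)", "scaffold errors", "scaffold N50 corrected (kb)",
)

# Values as transcribed from the slide (assumed to be the published table).
published = {
    ("S. aureus", "SOAPdenovo1"): (79, 148.6, 156, 23, 49, 342, 0, 342),
    ("S. aureus", "SOAPdenovo2"): (80, 98.6, 25, 71.5, 38, 1086, 2, 1078),
    ("S. aureus", "ALL-PATHS-LG"): (37, 149.7, 13, 119.0, 11, 1477, 1, 1093),
    ("R. sphaeroides", "SOAPdenovo1"): (2241, 3.5, 400, 2.8, 956, 106, 24, 68),
    ("R. sphaeroides", "SOAPdenovo2"): (721, 18, 106, 14.1, 333, 2549, 4, 2540),
    ("R. sphaeroides", "ALL-PATHS-LG"): (190, 41.9, 30, 36.7, 32, 3191, 0, 0),
}

# Values as transcribed from the slide (assumed to be the reproduced table).
reproduced = {
    ("S. aureus", "SOAPdenovo1"): (79, 148.6, 156, 23, 49, 342, 0, 342),
    ("S. aureus", "SOAPdenovo2"): (80, 98.6, 25, 71.5, 38, 1086, 2, 1078),
    ("S. aureus", "ALL-PATHS-LG"): (37, 149.7, 13, 117.6, 10, 1477, 1, 1093),
    ("R. sphaeroides", "SOAPdenovo1"): (2242, 3.5, 392, 2.8, 956, 105, 18, 70),
    ("R. sphaeroides", "SOAPdenovo2"): (721, 18, 106, 14.1, 333, 2549, 4, 2540),
    ("R. sphaeroides", "ALL-PATHS-LG"): (190, 41.9, 31, 36.7, 32, 3191, 0, 3310),
}


def differences(expected, observed):
    """List (row, column, expected value, observed value) for every cell that differs."""
    found = []
    for row, expected_values in expected.items():
        for column, want, got in zip(COLUMNS, expected_values, observed[row]):
            if want != got:
                found.append((row, column, want, got))
    return found


for row, column, want, got in differences(published, reproduced):
    print(f"{row[0]} / {row[1]}: {column}: published {want}, reproduced {got}")
```

On the values as transcribed here, both SOAPdenovo2 rows match exactly, and the differing cells fall in the ALL-PATHS-LG rows and the R. sphaeroides SOAPdenovo1 row.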

Page 20:
Page 21:

http://galaxy.cbiit.cuhk.edu.hk/u/gigascience/p/soapdenovo2-s-aureus

Page 22:

Observations

• Complete scientific reproduction is difficult
  – Time and effort required

• Requires help from authors

• Do we need education and training in scientific reproducibility?

Page 23:
Page 24:

http://www.cf.ac.uk/socsi/contactsandpeople/harrycollins/image-36548-web.gif

Page 25:

Thanks to:

Our team: Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)

Case study: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)

Our collaborators: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)

@gigascience

facebook.com/GigaScience

blogs.biomedcentral.com/gigablog/

www.gigadb.org

galaxy.cbiit.cuhk.edu.hk

www.gigasciencejournal.com

Funding from: