scott edmunds talk at aist: overcoming the reproducibility crisis: and why i stopped worrying a...

Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods). Scott Edmunds, 1st July 2014

Upload: gigascience-bgi-hong-kong

Post on 28-Jan-2015


DESCRIPTION

Scott Edmunds talk at the AIST Computational Biology Research Center in Tokyo: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods), July 1st 2014

TRANSCRIPT

Page 1: Scott Edmunds talk at AIST: Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods)

Overcoming the Reproducibility Crisis: and why I stopped worrying and learned to love open data (& methods)

Scott Edmunds, 1st July 2014

Page 2:

Harnessing Data-Driven Intelligence

Enables:

Using networking power of the internet to tackle problems

Can ask new questions & find hidden patterns & connections

Build on each other's efforts quicker & more efficiently

More collaborations across more disciplines

Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding

Enabled by: Removing silos, open licenses, transparency, immediacy

Page 3:

Dead trees not fit for purpose

1665, 1812, 1869

Page 4:

The problems with publishing

• Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)

• Lack of transparency, lack of credit for anything other than “regular” dead tree publication.

• If there is interest in data, only to monetise & re-silo

• Traditional publishing policies and practices are a hindrance

Page 5:

The consequences: growing replication gap

Out of 18 microarray papers, results from 10 could not be reproduced

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Page 6:

Consequences: increasing number of retractions

>15X increase in last decade

Strong correlation of “retraction index” with higher impact factor

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Page 7:

Consequences: growing replication gap

Insufficient methods

More retractions: >15X increase in last decade

At the current rate of increase, by 2045 as many papers will be retracted as published

1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

Page 8:

“Faked research is endemic in China”

Global perceptions of Asian Research

Million RMB rewards for high IF publications = ?

New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html
Nature, 29th September 2010: http://www.nature.com/news/2010/100929/full/467511a.html
Science, 29th November 2013: http://www.sciencemag.org/content/342/6162/1035.full
Nature, 20th July 2011 (475, 267): http://www.nature.com/news/2011/110720/full/475267a.html

Page 9:

“Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.”

“There have been widespread complaints from scientists inside and outside China about this lack of transparency.”

“Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”

Page 10:

STAP paper demonstrates problems:

Need:

…to publish protocols BEFORE analysis

…better access to supporting data

…more transparent & accountable review

…to publish replication studies

Page 11:

New incentives/credit

• Data
• Software
• Review
• Re-use…

= Credit

Credit where credit is overdue:

“One option would be to provide researchers who release data to public repositories with a means of accreditation.”

“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”

Nature Biotechnology 27, 579 (2009)

Page 12:

GigaSolution: deconstructing the paper

www.gigadb.org
www.gigasciencejournal.com

Utilizes big-data infrastructure and expertise from:

Combines and integrates:

Open-access journal

Data Publishing Platform

Data Analysis Platform

Page 13:

Rewarding open data

Page 14:

Submission Workflow

1. Submitter logs in to GigaDB website and uploads Excel submission (Excel submission file).
2. Validation checks. Fail: submitter is provided error report. Pass: dataset is uploaded to GigaDB.
3. Curator review.
4. DOI assigned.
5. Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues).
6. Files: submitter provides files by ftp or Aspera.
7. XML is generated and registered with DataCite (DataCite XML file).
8. Curator makes dataset public (can be set as future date if required).

Public GigaDB dataset: DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)

See: http://database.oxfordjournals.org/content/2014/bau018.abstract
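The step in which XML is generated for DataCite can be illustrated with a minimal Python sketch. This is not GigaDB's actual code. The element names (identifier, creators, titles, publisher, publicationYear) follow the mandatory properties of the DataCite Metadata Schema, but the schema namespace and version are deliberately left out, and the creator and publisher strings are placeholders rather than values from the slide.

```python
import xml.etree.ElementTree as ET


def build_datacite_xml(doi, title, creators, publisher, year):
    """Build a bare-bones DataCite-style metadata record as an XML string.

    Illustrative only: a record accepted by DataCite also needs the schema
    namespace and any further properties the current schema version requires.
    """
    resource = ET.Element("resource")

    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi

    creators_el = ET.SubElement(resource, "creators")
    for name in creators:
        creator = ET.SubElement(creators_el, "creator")
        ET.SubElement(creator, "creatorName").text = name

    titles_el = ET.SubElement(resource, "titles")
    ET.SubElement(titles_el, "title").text = title

    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)

    return ET.tostring(resource, encoding="unicode")


if __name__ == "__main__":
    record = build_datacite_xml(
        doi="10.5524/100003",  # DOI shown on the slide
        title=(
            "Genomic data from the crab-eating macaque/cynomolgus monkey "
            "(Macaca fascicularis)"
        ),
        creators=["Placeholder, Submitter"],  # hypothetical creator name
        publisher="Placeholder Publisher",    # hypothetical publisher string
        year=2011,
    )
    print(record)
```

Keeping the record generation in a small function like this means the same metadata that the curator reviewed is what gets registered, rather than a hand-edited copy.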

Page 15:

• Aspera = 10-100x faster up/download than FTP
• Multi-omics/large scale biomedical data focus
• Provide (ISA) curation & integration with other DBs (e.g. INSDC, MetaboLights, PRIDE, etc.)

For more see: http://database.oxfordjournals.org/content/2014/bau018.abstract

Page 16:

IRRI GALAXY

Beneficiaries/users of our work

Page 17:

IRRI GALAXY

Rice 3K project: 3,000 rice genomes, 13.4TB public data

Beneficiaries/users of our work

Page 18:

Diverse Data Types

Cyber-centipedes & virtual worms

Page 19:

More transparency: open peer review

BMC Series Medical Journals

Page 20:

More transparency (and speed): pre-prints

Page 21:

Real-time open-review = paper in arXiv + blogged reviews

Reward open & transparent review

http://tmblr.co/ZzXdssfOMJfy
www.gigasciencejournal.com/content/2/1/10


Page 23:

GigaScience + Publons = further credit for reviewers' efforts

Reward open & transparent review

Page 24:

Readers are interested in open review

Next step to link to ORCID

Page 25:

Cloud solutions?

Reward better handling of metadata…

Novel tools/formats for data interoperability/handling.

Page 26:

Rewarding and aiding reproducibility

Implement workflows in a community-accepted format

http://galaxyproject.org

Over 36,000 main Galaxy server users

Over 1,000 papers citing Galaxy use

Over 55 Galaxy servers deployed

Open source

Page 27:

galaxy.cbiit.cuhk.edu.hk

Page 28:

Visualizations & DOIs for workflows

Page 30:

How are we supporting data reproducibility?

Data sets: linked to DOI

Analyses: linked to DOI

Open-Paper

Open-Review

DOI:10.1186/2047-217X-1-18

>26,000 accesses

Open-Code

7 reviewers tested data in ftp server & named reports published

DOI:10.5524/100044

Open-Pipelines

Open-Workflows

DOI:10.5524/100038

Open-Data

78GB CC0 data

Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/

>20,000 downloads

Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2

Page 31:

7 referees downloaded & tested data, then signed reports

Reward open & transparent review

Page 32:

Post publication: bloggers pull apart code/reviews in blogs + wiki:

SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/

Reward open & transparent review

Page 33:

SOAPdenovo2 workflows implemented in

galaxy.cbiit.cuhk.edu.hk

Page 34:

Implemented entire workflow in our Galaxy server, inc.:

• 3 pre-processing steps

• 4 SOAPdenovo modules

• 1 post-processing step

• Evaluation and visualization tools

Also will be available to download by >36K Galaxy users in

Page 35:

Can we reproduce results? SOAPdenovo2 S. aureus pipeline

Page 36:

Taking a microscope to peer review

Page 37:

The SOAPdenovo2 Case study

Subject to and test with 3 models:

Types of resources in an RO: Data, Method/Experimental protocol, Findings

Models to describe each resource type: Wfdesc/ISA-TAB/ISA2OWL

Page 39:

1. While there are huge improvements to the quality of the resulting assemblies, other than the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
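These corrections all concern ratios of assembly metrics such as contig and scaffold N50. As a reminder of what is being compared, here is a minimal Python sketch of the standard N50 definition and of a fold change between two values. The lengths are toy numbers chosen for illustration, not figures from the SOAPdenovo2 paper, and the function names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the midpoint.
        if running * 2 >= total:
            return length


def fold_change(new_value, old_value):
    """How many times larger new_value is than old_value."""
    return new_value / old_value


if __name__ == "__main__":
    toy_assembly = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy lengths, total 54
    print(n50(toy_assembly))                      # 10 + 9 + 8 = 27 reaches half
    print(fold_change(90, 2))                     # toy N50 values
```

Computing the ratio directly from the table values, rather than retyping it into the text, is what would have caught discrepancies such as "an order of magnitude" versus 45 times.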

Page 40:

Case Study: Lessons Learned

• Most published research findings are false. Or at least have errors.

• It is possible to push button(s) & recreate a result from a paper

• On a semantic level (via nanopublications) can still have minor errors in text (interpretation not data)

• Reproducibility is COSTLY. How much are you willing to spend?

• Much easier to do this before rather than after publication

Page 41:

Aiding reproducibility of imaging studies

OMERO: providing access to imaging data

View, filter, measure raw images with direct links from journal article.

See all image data, not just cherry picked examples.

Download and reprocess.

Page 42:

OMERO: Aiding reproducibility, adding value

Page 43:

The alternative...

...look but don't touch

Page 44:

Step 1: get your code out there

• Put everything in a code repository. Even if it is ugly (see CRAPL)

• Version control. Make sure you document exact version in the paper (big problem with lots of our papers).

• If system environments are important, consider VMs

http://matt.might.net/articles/crapl/
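One lightweight way to "document exact version" is to have the analysis script record its own provenance each time it runs. The sketch below is a minimal illustration under stated assumptions: the code lives in a git repository and the `git` command is installed. The function name and the output keys are hypothetical, not part of any tool named in the talk.

```python
import json
import platform
import subprocess


def capture_environment(repo_dir="."):
    """Collect the interpreter version, platform and current git commit.

    The commit is None when repo_dir is not a git checkout or git is
    not installed, so the function never fails an analysis run.
    """
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None

    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "git_commit": commit,
    }


if __name__ == "__main__":
    # Write this next to the results so the paper can cite the exact commit.
    print(json.dumps(capture_environment(), indent=2))
```

Saving this output alongside every set of results gives the exact commit to quote in the methods section. Where the system environment itself matters, a VM image captures what a record like this cannot.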

Page 45:

Beyond Commenting Code: Step 2: Open lab books, dynamic documents

• Facilitate reuse and sharing with tools like: Knitr, Sweave, iPython Notebook

• Working towards executable papers…
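The idea behind dynamic documents can be sketched in plain Python, standing in here for Knitr, Sweave or an IPython Notebook. Every number quoted in the prose is computed from the results when the text is rendered, so text and tables cannot drift apart. The results table, assembler names and template below are made up for illustration.

```python
# Toy results table; the values are illustrative, not from any paper.
RESULTS = {
    "old_assembler": {"scaffold_n50": 2_000},
    "new_assembler": {"scaffold_n50": 90_000},
}

# The sentence contains no hard-coded numbers, only placeholders.
TEMPLATE = (
    "The scaffold N50 from {new} is {fold:.0f} times longer "
    "than from {old} ({new_n50:,} bp versus {old_n50:,} bp)."
)


def render_report(results, old, new):
    """Fill the sentence template from the results table at render time."""
    old_n50 = results[old]["scaffold_n50"]
    new_n50 = results[new]["scaffold_n50"]
    return TEMPLATE.format(
        new=new,
        old=old,
        fold=new_n50 / old_n50,
        new_n50=new_n50,
        old_n50=old_n50,
    )


if __name__ == "__main__":
    print(render_report(RESULTS, "old_assembler", "new_assembler"))
```

If the table changes, re-rendering updates the sentence automatically. Real dynamic-document tools apply the same principle to whole manuscripts, including figures.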

Page 46:

E.g.


Page 48:

Some testimonials for Knitr

Authors (Wolfgang Huber): “I do all my projects in Knitr. Having the textual explanation, the associated code and the results all in one place really increases productivity, and helps explaining my analyses to colleagues, or even just to my future self.”

Reviewers (Christophe Pouzat): “It took me a couple of hours to get the data, the few custom developed routines, the “vignette” and to REPRODUCE EXACTLY the analysis presented in the manuscript. With few more hours, I was able to modify the authors’ code to change their Fig. 4. In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewer’s job much more fun!”

Page 49:

Levels of reproducibility

Full reproducibility

Dynamic results

Usability (e.g. Galaxy Toolshed)

Rich metadata, documentation

Basic code/data availability

Page 50:

What else can you do?

Give us data, papers & pipelines*

Contact us:

[email protected] [email protected] [email protected]

www.gigasciencejournal.com

* APCs currently generously covered by BGI until 2015

Page 51:

Thanks to:

Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)

Team: Peter Li, Huayan Gao, Chris Hunter, Jesse Si Zhe, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)

Case study: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)

CBIIT

Funding from:

@gigascience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/

www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com