Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Upload: gigascience-bgi-hong-kong

Post on 14-Jul-2015

TRANSCRIPT

Page 1: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

0000-0001-6444-1436

@SCEdmunds

[email protected]

NEW MODEL

Open data

publishing

Scott Edmunds

Balti Bioinformatics

Page 2: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

The problems with publishing

• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995)

• Lack of transparency, and lack of credit for anything other than 350-year-old-style “dead tree” publication

• Traditional publishing policies and practices are a hindrance (licensing & access, embargoes, Ingelfinger, closed doors, anti-granularity & forking)

Page 3: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

The consequences: growing replication gap

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14

2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Page 4: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Consequences: increasing number of retractions

>15X increase in last decade

At the current rate of increase, by 2045 as many papers would be retracted as published

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html

2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950

Page 5: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

STAP paper demonstrates problems:

Nature Editorial, 2nd July 2014:

“We have concluded that we and the referees could not have detected the problems that fatally undermined the papers. The referees’ rigorous reports quite rightly took on trust what was presented in the papers.”

http://www.nature.com/news/stap-retracted-1.15488

Page 6: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

STAP paper demonstrates problems:

Need:

…to publish protocols BEFORE analysis

…better access to supporting data

…more transparent & accountable review

…to publish replication studies

Page 7: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

New incentives/credit

• Review

• Data

• Software

• Models

• Pipelines

• Re-use…

= Credit

Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009)

Page 8: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Not just carrots…

“The data discovery index (DDI) enabled through bioCADDIE is to do for data what PubMed (and PubMed Central) did for the literature.”

Page 9: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Things we need to reward

Page 10: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

[Diagram, titled “Rewarding the…”: Idea, Study, Methods, Data, Metadata, Software, Analysis (Pipelines), Workflows/Environments, Answer. Labels: Publication (three times), DOI, etc.]

Page 11: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

1. Transparency

Open peer review

Page 12: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

The only drawback?

End reviewer 3 Downfall parody videos, now!

1. Transparency

Open peer review

Page 13: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Publons + AcademicKarma = credit for reviewers' efforts

http://publons.com/

1. Transparency/open peer review

http://academickarma.org/

Page 14: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

1. Transparency

Reward pre-prints

Page 15: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

http://tmblr.co/ZzXdssfOMJfy

arXiv + blogged reviews = real-time open-review

1. Transparency

Page 16: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

arXiv + blogged reviews = real-time open-review

1. Transparency

Page 17: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

2. Data

Reward Open Data

Page 18: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

IRRI GALAXY

Rice 3K project: 3,000 rice genomes, 13.4TB public data

2. (Big) Data

Page 19: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

2. Data

Reward Intermediate Data

Page 20: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Nanopore MinION E. coli genome released via GigaDB 10-Sep-2014

Curated & converted to ISA-tab, & worked with EBI to get raw data there

Data Note submitted & preprint version out 26th September

Peer reviewed & published 20th October

2. Data

Reward Faster Data Release

http://www.gigasciencejournal.com/content/3/1/22
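The turnaround on this slide can be checked with simple date arithmetic. A minimal sketch, assuming the preprint and publication dates fall in 2014 like the release date (the slide gives no year for them); the dictionary keys are illustrative labels only:

```python
from datetime import date

# Milestones from the slide. Only the release date carries a year in the
# source; 2014 is ASSUMED for the other two.
MILESTONES = {
    "released via GigaDB": date(2014, 9, 10),
    "preprint out": date(2014, 9, 26),
    "peer reviewed & published": date(2014, 10, 20),
}


def days_after_release(milestones: dict) -> dict:
    """Return the days elapsed between the release and each later milestone."""
    release = milestones["released via GigaDB"]
    return {
        name: (when - release).days
        for name, when in milestones.items()
        if when > release
    }


if __name__ == "__main__":
    for name, days in days_after_release(MILESTONES).items():
        print(f"{name}: {days} days after release")
```

Under that assumption, the preprint followed the data release by 16 days and the peer-reviewed Data Note by 40 days.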

Page 21: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Real time sequencing era needs real time publication!

• Used as test data for “minoTour”: real time data analysis tools for MinION data

• Nanopore data already used in (CC0 GitHub based) teaching materials

• Next stop…Erratums, Updates & more (see later)

1. minoTour http://minotour.nottingham.ac.uk/

2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly

2. Data

Reward Faster Data Release

Page 22: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

OMERO: providing access to imaging data

Already used by JCB.

View, filter, measure raw images with direct links from journal article.

See all image data, not just cherry picked examples.

Download and reprocess.

2. Data

Reward Imaging Data

Page 23: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

The alternative...

...look but don't touch

2. Data

Reward Imaging Data

Page 24: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

3. Software

https://www.change.org/p/everyone-in-the-research-community-we-must-accept-that-software-is-fundamental-to-research-or-we-will-lose-our-ability-to-make-groundbreaking-discoveries

Page 25: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

galaxy.cbiit.cuhk.edu.hk

4. Workflows

Reward Sharing of Workflows

Page 26: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Visualisations & DOIs for workflows

http://www.gigasciencejournal.com/series/Galaxy

Page 27: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

• Can facilitate reproducibility, reuse & sharing with tools like: knitr, Sweave, IPython Notebook

5. Open Documents

Reward Open/Dynamic Workbooks
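The idea behind such dynamic workbooks is that numbers quoted in the text are computed from the data when the document is built, rather than typed in by hand. A minimal plain-Python sketch of that principle (not knitr, Sweave or IPython Notebook code), using the microarray figures quoted earlier in the talk; the function name `render_sentence` is hypothetical:

```python
def render_sentence(total_papers: int, not_reproduced: int) -> str:
    """Build a results sentence whose numbers come straight from the data."""
    share = not_reproduced / total_papers
    return (
        f"Out of {total_papers} microarray papers, results from "
        f"{not_reproduced} ({share:.0%}) could not be reproduced."
    )


if __name__ == "__main__":
    # Figures taken from the replication-gap slide: 18 papers, 10 not reproduced.
    print(render_sentence(18, 10))
```

If the underlying counts change, rebuilding the document updates the sentence and the percentage together, so text and data cannot drift apart.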

Page 28: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

E.g.

Page 29: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

E.g.

Page 30: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

5. Virtual Machines

?http://ivory.idyll.org/blog/vms-considered-harmful.html

Page 31: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

http://dx.doi.org/10.5524/100106

http://www.gigasciencejournal.com/content/3/1/23

5. Virtual Machines

Page 32: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Taking a microscope to the publication process

Page 33: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Page 34: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

How reproducible can we get?

Data sets and Analyses, each linked to a DOI:

Open-Paper: DOI:10.1186/2047-217X-1-18, >33,000 accesses & 270 citations

Open-Review: 7 reviewers tested data in ftp server & named reports published

Open-Data: DOI:10.5524/100038, 78GB CC0 data

Open-Pipelines/Open-Workflows: DOI:10.5524/100044

Open-Code: Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/ (>36,000 downloads). Enabled code to be picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2
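The DOIs on this slide are what tie the paper, data and workflows together. A minimal sketch of turning them into resolvable links through the public doi.org resolver; the mapping of each DOI to an object type follows one reading of the slide layout and is an assumption, and `doi_to_url` is a hypothetical helper:

```python
DOI_RESOLVER = "https://doi.org/"

# DOIs as printed on the slide. Which DOI belongs to which object type is
# inferred from the slide layout and should be treated as an assumption.
RESEARCH_OBJECTS = {
    "Open-Paper": "10.1186/2047-217X-1-18",
    "Open-Data": "10.5524/100038",
    "Open-Pipelines/Open-Workflows": "10.5524/100044",
}


def doi_to_url(doi: str) -> str:
    """Turn a DOI (with or without a leading 'DOI:') into a resolver URL."""
    doi = doi.strip()
    if doi.upper().startswith("DOI:"):
        doi = doi[4:]
    return DOI_RESOLVER + doi


if __name__ == "__main__":
    for label, doi in RESEARCH_OBJECTS.items():
        print(f"{label}: {doi_to_url(doi)}")
```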

Page 35: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Post publication: bloggers pull apart code/reviews in blogs + wiki:

SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2

Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/

Reward open & transparent review

Page 36: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

SOAPdenovo2 workflows implemented in

galaxy.cbiit.cuhk.edu.hk

Page 37: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

SOAPdenovo2 workflows implemented in

galaxy.cbiit.cuhk.edu.hk

Implemented entire workflow in our Galaxy server, inc.:

• 3 pre-processing steps

• 4 SOAPdenovo modules

• 1 post-processing step

• Evaluation and visualization tools

Page 38: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Can we reproduce results? SOAPdenovo2 S. aureus pipeline

Page 39: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

The SOAPdenovo2 Case study

Subject to and test with 3 models:

Types of resources in an RO:

• Data

• Method/Experimental protocol

• Findings

Models to describe each resource type: Wfdesc/ISA-TAB/ISA2OWL

See: http://biorxiv.org/content/early/2014/12/08/011973

Page 40: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing
Page 41: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

1. While there are huge improvements to the quality of the resulting assemblies, other than in the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1.

2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.

3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.

4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
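Several of these corrections concern N50 ratios, so it may help to see how such a figure is derived. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch; the contig lengths below are invented purely for illustration and are not the values from table 2:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fold_change(new_lengths, old_lengths):
    """Ratio of two N50 values, i.e. how many 'times longer' one is."""
    return n50(new_lengths) / n50(old_lengths)


if __name__ == "__main__":
    # HYPOTHETICAL assemblies, not data from the paper.
    old_assembly = [10] * 10   # ten short contigs
    new_assembly = [90, 10]    # one long contig plus one short one
    print(n50(old_assembly))                        # 10
    print(n50(new_assembly))                        # 90
    print(fold_change(new_assembly, old_assembly))  # 9.0
```

Recomputing ratios like this from the table itself, rather than retyping them into the text, is the kind of check that caught the discrepancies listed here.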

Page 42: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Lessons Learned

• Most published research findings are false. Or at least have errors

• It is possible to push button(s) & recreate a result from a paper

• Reproducibility is COSTLY. How much are you willing to spend?

• Much easier to do this before rather than after publication

Page 43: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

The cost of staying with the status quo?

• Ioannidis estimates that 85% of research resources are wasted.

• Each retraction estimated to cost $400,000.

Page 44: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

In Summary

Make your data, software & other ROs open (CC0, OSI)

Get credit for your reviewing

Publish your research objects (with us!)

[email protected]

www.gigasciencejournal.com

@gigascience

facebook.com/GigaScience

Page 45: Scott Edmunds @ Balti & Bioinformatics: New Models in Open Data Publishing

Thanks to:

team: Peter Li, Chris Hunter, Jesse Si Zhe, Rob Davidson, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)

Case study: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)

Our collaborators: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)

CBIIT

Funding from:

@gigascience

facebook.com/GigaScience

blogs.biomedcentral.com/gigablog/

www.gigadb.org

galaxy.cbiit.cuhk.edu.hk

www.gigasciencejournal.com
