Reproducibility - the myths and truths of pipeline bioinformatics
DESCRIPTION
In a talk for the Newcastle Bioinformatics Special Interest Group (http://bsu.ncl.ac.uk/fms-bioinformatics) I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses, as well as some tools for achieving this. I also considered some additional tools for enabling reproducible bioinformatics, and looked at the ‘executable paper’ and whether it represents the future for bioinformatics publishing.
TRANSCRIPT
Reproducibility: The Myths and Truths of “Push-Button” Bioinformatics
Simon Cockell
Bioinformatics Special Interest Group
19th July 2012
Repeatability and Reproducibility
• Main principle of scientific method
• Repeatability is ‘within lab’
• Reproducibility is ‘between lab’
  • Broader concept
• This should be easy in bioinformatics, right?
  • Same data + same code = same results
  • Not many analyses have stochasticity
http://xkcd.com/242/
Same data?
• Example
  • Data deposited in SRA
  • Original data deleted by researchers
  • .sra files are NOT .fastq
  • All filtering/QC steps lost
• Starting point for subsequent analysis not the same – regardless of whether same code used
Same data?
• Data files are very large
• Hardware failures are surprisingly common
• Not all hardware failures are catastrophic
  • Bit-flipping by faulty RAM
• Do you keep an md5sum of your data, to ensure it hasn’t been corrupted by the transfer process?
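A minimal sketch of this checksum habit (the filename is a hypothetical placeholder): record an md5sum when the data is first produced, then verify the copy after every transfer.

```shell
# Toy demonstration: create a small "data" file, record its checksum,
# then verify it as you would after copying to another machine
echo "ACGT" > reads.fastq
md5sum reads.fastq > checksums.md5

# Verification step: prints "reads.fastq: OK" if the file is intact,
# and exits non-zero if any recorded file has been corrupted
md5sum -c checksums.md5
```

Keeping `checksums.md5` alongside the deposited data means anyone who downloads it later can run the same verification.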
Same code?
• What version of a particular software did you use?
  • Is it still available?
• Did you write it yourself?
  • Do you use version control?
  • Did you tag a version?
• Is the software closed/proprietary?
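One low-effort answer to the “what version?” question is to capture version strings next to the results at run time. A sketch, assuming only that the queried tools support a version flag (bash and git here are stand-ins for whatever your analysis actually uses):

```shell
# Record the exact versions of the tools used, alongside the results.
# Which tools you query is analysis-specific; bash and git are examples.
{
  echo "Run date: $(date)"
  bash --version | head -1
  git --version
} > versions.txt

cat versions.txt
```

A `versions.txt` committed with the results turns “what version did we use?” from archaeology into a lookup.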
Version Control
• Good practice for software AND data
• DVCS means it doesn’t have to be in a remote repository
  • All local folders can be versioned
  • Doesn’t mean they have to be, it’s a judgment call
• Check-in regularly
• Tag important “releases”
https://twitter.com/sjcockell/status/202041359920676864
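The check-in-and-tag practice above takes a handful of commands with git (a DVCS, so no remote repository is required). A minimal sketch; the directory, script contents, and identity are all hypothetical:

```shell
# Version a local analysis folder with git - no remote needed
git init my-analysis
cd my-analysis
git config user.name "A. Researcher"          # placeholder identity
git config user.email "a.researcher@example.org"

# a placeholder pipeline step, standing in for the real analysis code
echo 'bwa mem ref.fa reads.fq > aln.sam' > run_analysis.sh

git add run_analysis.sh
git commit -m "Analysis script as run for the paper"

# tag the "release" so this exact state can always be recovered
git tag -a v1.0 -m "Version used for the submitted manuscript"
```

`git checkout v1.0` then reproduces the tagged state at any later date.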
Pipelines
• Package your analysis
  • Easily repeatable
  • Also easy to distribute
• Start-to-finish task automation
• Process captured by underlying pipeline architecture
http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
Tools for pipelining analyses
• Huge numbers
• See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
• Only a few widely used:
• Bash
  • old school
• Taverna
  • build workflows from public webservices
• Galaxy
  • sequencing focus – tools provided in ‘toolshed’
• Microbase
  • distributed computing, build workflows from ‘responders’
• e-Science Central
  • ‘Science as a Service’ – cloud focus
  • not specifically a bioinformatics tool
Bash
• Single-machine (or cluster) command-line workflows
• No fancy GUIs
• Record provenance & process
• Rudimentary parallel processing
http://www.gnu.org/software/bash/
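A sketch of what such a single-machine Bash workflow can look like, with the “rudimentary parallel processing” done via background jobs plus `wait`, and a provenance log written as the run proceeds. The sample names and per-sample steps are hypothetical placeholders for real QC and alignment commands:

```shell
#!/bin/bash
# Start-to-finish task automation in plain bash, with a provenance log
# and rudimentary parallelism. All sample names/steps are placeholders.
set -euo pipefail

log="pipeline.log"
echo "Run started: $(date)" > "$log"
bash --version | head -1 >> "$log"      # record the interpreter version

process_sample() {
    local sample="$1"
    # stand-ins for real QC and alignment commands
    echo "QC complete for $sample"    > "${sample}.qc.txt"
    echo "Alignment done for $sample" > "${sample}.aln.txt"
}

# rudimentary parallel processing: one background job per sample,
# then block until every job has finished
for sample in sampleA sampleB sampleC; do
    process_sample "$sample" &
done
wait

echo "Run finished: $(date)" >> "$log"
```

The script itself, checked into version control, is the documentation of the process.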
Taverna
• Workflows from web services
• Lack of relevant services
  • Relies on providers
• Gluing services together increasingly problematic
• Sharing workflows through myExperiment
  • http://www.myexperiment.org/
http://www.taverna.org.uk/
Galaxy
• “open, web-based platform for data intensive biomedical research”
• Install or use (limited) public server
• Can build workflows from tools in ‘toolshed’
• Command-line tools wrapped with web interface
https://main.g2.bx.psu.edu/
Galaxy Workflow
Microbase
• Task management framework
• Workflows emerge from interacting ‘responders’
• Notification system passes messages around
• ‘Cloud-ready’ system that scales easily
• Responders must be written for new tools
http://www.microbasecloud.com/
e-Science Central
• ‘Blocks’ can be combined into workflows
• Blocks need to be written by an expert
• Social networking features
• Good provenance recording
http://www.esciencecentral.co.uk/
The best approach?
• Good for individual analysis
  • Package & publish
• All datasets different
  • One size does not fit all
  • Downstream processes often depend on results of upstream ones
• Note lack of QC
  • Requires human interaction – impossible to pipeline
  • Different every time
  • Subjective – major source of variation in results
  • BUT – important and necessary (GIGO)
More tools for reproducibility
• iPython notebook
  • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
  • Build notebooks with code embedded
  • Run code arbitrarily
  • Example: https://pilgrims.ncl.ac.uk:9999/
• Runmycode.org
  • Allows researchers to create ‘companion websites’ for papers
  • This website allows readers to implement the methodology described in the paper
  • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
The executable paper
• The ultimate in repeatable research
• Data and code embedded in the publication
• Figures can be generated, in situ, from the actual data
• http://ged.msu.edu/papers/2012-diginorm/
Summary
• For work to be repeatable:
  • Data and code must be available
  • Process must be documented (and preferably shared)
  • Version information is important
• Pipelines are not the great panacea
  • Though they may help for parts of the process
• Bash is as good as many ‘fancier’ tools (for tasks on a single machine or cluster)
Inspirations for this talk
• C. Titus Brown’s blogposts on repeatability and the executable paper
  • http://ivory.idyll.org/blog
• Michael Barton’s blogposts about organising bioinformatics projects and pipelines
  • http://bioinformaticszen.com/