Reproducibility, Research Objects and Reality, Leiden 2016

Reproducibility, Research Objects and Reality. Professor Carole Goble, The University of Manchester, UK; Software Sustainability Institute, UK; ELIXIR UK, FAIRDOM Association e.V. University of Leiden, The Netherlands, 24 November 2016

TRANSCRIPT

Page 1: Reproducibility, Research Objects and Reality, Leiden 2016

Reproducibility, Research Objects and Reality

Professor Carole Goble
The University of Manchester, UK
Software Sustainability Institute, UK
ELIXIR UK, FAIRDOM Association e.V.


University of Leiden, The Netherlands, 24 November 2016

Page 2: Reproducibility, Research Objects and Reality, Leiden 2016

Acknowledgements
• Dagstuhl Seminar 16041, January 2016
  – http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=16041
• ATI Symposium Reproducibility, Sustainability and Preservation, April 2016
  – https://turing.ac.uk/events/reproducibility-sustainability-and-preservation/
  – https://osf.io/bcef5/files/
• C Titus Brown
• Juliana Freire
• David De Roure
• Stian Soiland-Reyes
• Barend Mons
• Tim Clark
• Daniel Garijo
• Norman Morrison
• Katy Wolstencroft
• Phil Bourne
• Natalie Stanford
• Jacky Snoep
• Stuart Owen
• Marco Roos
• Kristina Hettne
• Alan Williams
• Sean Bechhofer
• Ian Fore
• Rafael Jimenez
• Michael Crusoe
• Paul Groth
• Niall Beard
• … and many more

Page 3: Reproducibility, Research Objects and Reality, Leiden 2016

Context: Computational Science

http://tpeterka.github.io/maui-project/
From: The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd

1. Observational, experimental
2. Theoretical
3. Simulation
4. Data intensive

Page 4: Reproducibility, Research Objects and Reality, Leiden 2016

Motivation: Knowledge Turning

research infrastructures

• Computational tools
• Sharing platforms
• Knowledge Exchange
• Reproducible research
• Software and data practices
• Policies

[Josh Sommer, for the picture]

Page 5: Reproducibility, Research Objects and Reality, Leiden 2016
Page 6: Reproducibility, Research Objects and Reality, Leiden 2016

Reproducibility Rampancy

Page 7: Reproducibility, Research Objects and Reality, Leiden 2016

NIH Rigor and Reproducibility
https://www.nih.gov/research-training/rigor-reproducibility

Plenty of guidelines

cos.io/top

Page 8: Reproducibility, Research Objects and Reality, Leiden 2016

Plenty of principles

Page 9: Reproducibility, Research Objects and Reality, Leiden 2016

https://wellcomeopenresearch.org/ Nature Scientific Data

Data as a first class citizen + Data Citation

Scholarly Communications Providers

Page 10: Reproducibility, Research Objects and Reality, Leiden 2016

Software as a first class citizen + Software Citation

Page 11: Reproducibility, Research Objects and Reality, Leiden 2016

Funders

http://www.acmedsci.ac.uk/policy/policy-projects/reproducibility-and-reliability-of-biomedical-research/

Page 12: Reproducibility, Research Objects and Reality, Leiden 2016

republic of science*

regulation of science

institution cores / libraries / public services

*Merton’s four norms of scientific behaviour (1942)

Page 13: Reproducibility, Research Objects and Reality, Leiden 2016

FAIR
Findable
Accessible
Interoperable
Reusable

Intelligible
Reproducible
Citable
Track & Countable

http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf

Page 14: Reproducibility, Research Objects and Reality, Leiden 2016

Research Infrastructure for FAIR Management and Sharing of Data, Operating Procedures, Models
For Systems and Synthetic Biology Projects

Research Infrastructure for FAIR Data for Life Sciences in Europe

Data-Driven Science

Page 15: Reproducibility, Research Objects and Reality, Leiden 2016
Page 16: Reproducibility, Research Objects and Reality, Leiden 2016

Design: cherry picking data, random seed reporting, non-independent bias, poor positive and negative controls, dodgy normalisation, arbitrary cut-offs, premature data triage, un-validated materials, improper statistical analysis, poor statistical power, stop when “get to the right answer”, software misconfigurations, misapplied black box software

Reporting: incomplete reporting of software configurations, parameters & resource versions, missed steps, missing data, vague methods, missing software

Empirical, Statistical, Computational

V. Stodden, IMS Bulletin (2013)

Reproducibility and reliability of biomedical research: improving research practice

Page 17: Reproducibility, Research Objects and Reality, Leiden 2016

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean - neither more nor less.”

Carroll, Through the Looking Glass

re-compute

replicate

rerun

repeat

re-examine

repurpose

recreate

reuse

restore

reconstruct

review

regenerate

revise

recycle

redo

robustness tolerance

verification compliance validation assurance

remix

Page 18: Reproducibility, Research Objects and Reality, Leiden 2016

Scientific publication goals: (i) announce a result; (ii) convince readers it's correct.

Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension.

Papers in computational science should describe the results and provide the complete software development environment, data and set of instructions which generated the figures.
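One small part of "the complete software development environment" can be recorded automatically. The sketch below is only an illustration, not something shown in the talk: it assumes a Python-based analysis, and the function name `capture_environment` is hypothetical. It writes out the interpreter version, the platform and the installed package versions so they can travel with the figures.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Return a JSON-serialisable description of the running environment."""
    # One "name==version" entry per installed distribution that has a name.
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    )
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": packages,
    }


if __name__ == "__main__":
    # Print the record; in practice it would be saved next to the results.
    print(json.dumps(capture_environment(), indent=2))
```

A record like this covers only the Python layer; system libraries, hardware and remote services would need other means.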

Virtual Witnessing*

*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.

Jill Mesirov

David Donoho

Page 19: Reproducibility, Research Objects and Reality, Leiden 2016

Computational Complex Assemblies

Remote Calls

Page 20: Reproducibility, Research Objects and Reality, Leiden 2016

“Micro” Reproducibility

“Macro” Reproducibility

Fixivity

Validate

Verify

Trust

Page 21: Reproducibility, Research Objects and Reality, Leiden 2016

Repeatability: “Sameness”
Same result
1 Lab
1 experiment

Reproducibility: “Similarity”
Similar result
> 1 Lab
> 1 experiment
why the differences?

https://2016-oslo-repeatability.readthedocs.org/en/latest/repeatability-discussion.html
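The "sameness" versus "similarity" distinction can be made concrete for a stochastic computation. The following is a minimal sketch under stated assumptions: `run_experiment` is a hypothetical stand-in for an analysis, the seed plays the role of "same lab, same experiment", and the 5% tolerance is an arbitrary illustrative choice.

```python
import math
import random


def run_experiment(seed, n=10_000):
    """Hypothetical stochastic analysis: estimate the mean of uniform noise."""
    rng = random.Random(seed)  # private generator, so runs do not interfere
    return sum(rng.random() for _ in range(n)) / n


def is_repeat(a, b):
    """Repeatability: the rerun gives exactly the same result."""
    return a == b


def is_reproduction(a, b, rel_tol=0.05):
    """Reproducibility: an independent run gives a similar result."""
    return math.isclose(a, b, rel_tol=rel_tol)


if __name__ == "__main__":
    original = run_experiment(seed=1)
    rerun = run_experiment(seed=1)        # same seed: same setup
    independent = run_experiment(seed=2)  # different seed: independent run
    print(is_repeat(original, rerun), is_reproduction(original, independent))
```

Choosing the tolerance is the scientific judgement here: it is the place where "why the differences?" has to be answered.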

Validate

Verify

Page 22: Reproducibility, Research Objects and Reality, Leiden 2016

Method Reproducibility: the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated.

Result Reproducibility (aka replicability): obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible.

Goodman et al., Science Translational Medicine 8 (341) 2016

Validate

Verify

Page 23: Reproducibility, Research Objects and Reality, Leiden 2016

Productivity
Track differences

Validate

Verify

Page 24: Reproducibility, Research Objects and Reality, Leiden 2016

reviewers want additional work
statistician wants more runs
analysis needs to be repeated
post-doc leaves, student arrives
new/revised datasets
updated/new versions of algorithms/codes
sample was contaminated
better kit - longer simulations
new partners, new projects

Personal & Lab Productivity

Public Good
Reproducibility

Page 25: Reproducibility, Research Objects and Reality, Leiden 2016

Computational “Datascopes”

Experiment
Methods: techniques, algorithms, spec. of the steps, models
Materials: datasets, parameters, algorithm seeds

Setup
Instruments: codes, services, scripts, underlying libraries, workflows, ref datasets
Laboratory: sw and hw infrastructure, systems software, integrative platforms, computational environment

Page 26: Reproducibility, Research Objects and Reality, Leiden 2016

“Datascope” Practicalities

Experiment: Methods, Materials
Setup: Instruments, Laboratory

Change
Dependencies

science, methods, datasets
questions stay, answers change

breakage, labs decay, services, techniques and instruments change, updated datasets, services, codes, hardware
software entropy

one offs, streams, stochastics, sensitivities, scale, non-portable data

supercomputer access
non-portable software
licensing restrictions
unreliable resources and third party codes
complexity

Blackboxes
blackbox software
hidden manual steps

Page 28: Reproducibility, Research Objects and Reality, Leiden 2016

Active Instrument
Byte level preservation

Reproduce by Running
Reproduce by Reading

Archived Record
Prepare to repair

ELNs

Markup Languages
Reporting Guidelines

Common Formats

Community vocabularies

Page 29: Reproducibility, Research Objects and Reality, Leiden 2016

Record All
Automate All
Contain All
Expose All

Findable
Accessible
Interoperable
Reusable

Page 30: Reproducibility, Research Objects and Reality, Leiden 2016

provenance

portability preservation

robustness

versioning

access description

standards

common APIs

licensing

standards, common metadata

change variation sensitivity

discrepancy handling

packaging, containers

FAIR RACE shades of reproducibility

dependencies
steps
ids

Page 31: Reproducibility, Research Objects and Reality, Leiden 2016

A robust infrastructure for biological information.

bio.tools

Page 32: Reproducibility, Research Objects and Reality, Leiden 2016

https://usegalaxy.org/

Workflow Description
Workflows Preservation
Workflow Portability
Workflow Interoperability

Page 33: Reproducibility, Research Objects and Reality, Leiden 2016

Workflow Preservation and Exchange
Experiments
Workflows & Workflow Runs
Workflow Commons

Third Party Services
Scattered resources

Page 34: Reproducibility, Research Objects and Reality, Leiden 2016

Workflow Preservation and Exchange
Experiments
Workflows & Workflow Runs
Workflow Commons

Third Party Services
Scattered resources

Rich descriptions
Prepare to Repair

Page 35: Reproducibility, Research Objects and Reality, Leiden 2016

Standards-based metadata framework for bundling resources with context

Citable Reproducible Packaging

Metadata for bundling resources scattered and stored somewhere else

Page 36: Reproducibility, Research Objects and Reality, Leiden 2016

Container

Research Object in a nutshell

Packaging content & links: Zip files, BagIt, Docker images

Catalogues & Commons Platforms: FAIRDOM, myExperiment

Page 37: Reproducibility, Research Objects and Reality, Leiden 2016

Research Object in a nutshell

Container
Packaging content & links: Zip files, BagIt, Docker images
Catalogues & Commons Platforms: FAIRDOM, myExperiment

Manifest Construction
Aggregates: link things together
Annotations: about things & their relationships

Manifest Description
Dependencies: what else is needed
Versioning: its evolution
Checklists: what should be there
Provenance: where it came from
Identification (id): locate things regardless where
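The container-plus-manifest idea can be sketched in a few lines. This is an illustration only and does not follow the Research Object specification: the manifest file name, its layout and the function names `build_bundle` and `read_manifest` are assumptions. The keys "aggregates" and "annotations" simply echo the slide's terms.

```python
import io
import json
import zipfile


def build_bundle(files, annotations):
    """Pack files plus a manifest describing them into an in-memory zip.

    files: mapping of path inside the bundle -> bytes content
    annotations: list of {"about": path, "content": text} notes
    """
    manifest = {
        # Aggregates: link things together.
        "aggregates": [{"uri": path} for path in sorted(files)],
        # Annotations: statements about things and their relationships.
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, content in sorted(files.items()):
            bundle.writestr(path, content)
    return buffer.getvalue()


def read_manifest(bundle_bytes):
    """Read the manifest back without unpacking the payload files."""
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
        return json.loads(bundle.read("manifest.json"))
```

For example, `build_bundle({"workflow.txt": b"steps"}, [{"about": "workflow.txt", "content": "main analysis"}])` returns zip bytes whose manifest lists the one aggregated file and its note.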

Page 38: Reproducibility, Research Objects and Reality, Leiden 2016

Research Object Profile for Workflows…

Container

Manifest Construction
Aggregates: link things together
Annotations: about things & their relationships

Manifest Description
Identification: locate things regardless where

Minimum information for one content type
Common properties among content types

Page 39: Reproducibility, Research Objects and Reality, Leiden 2016

Research Object Profile for Workflows…

Manifest Description

Minimum information for one content type
Common properties among content types

Page 40: Reproducibility, Research Objects and Reality, Leiden 2016

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003
Hettne KM, et al (2014), Structuring research methods and data with the research object model: genomics workflows as a case study. J. Biomedical Semantics 5: 41

Workflow Research Object Bundles exchange, portability and maintenance

BagIt

workflows packaged into various containers for sharing

Checksum
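File integrity in a packaged bundle usually comes down to a checksum manifest. The sketch below is a simplified illustration, not the BagIt tooling itself: the function names are hypothetical, files are held in memory as bytes, and the "digest, then path" line layout is an assumption modelled on common checksum manifests.

```python
import hashlib


def checksum_manifest(files):
    """Return manifest text with one '<sha256>  <path>' line per file."""
    lines = []
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        lines.append(f"{digest}  {path}")
    return "\n".join(lines) + "\n"


def verify(files, manifest_text):
    """Return the paths that are missing or no longer match the manifest."""
    expected = {}
    for line in manifest_text.splitlines():
        digest, path = line.split(maxsplit=1)
        expected[path] = digest
    return [
        path
        for path, digest in sorted(expected.items())
        if path not in files
        or hashlib.sha256(files[path]).hexdigest() != digest
    ]
```

An empty list from `verify` means every listed file is present and byte-identical to what was packaged; it says nothing about whether the workflow still runs.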

Page 41: Reproducibility, Research Objects and Reality, Leiden 2016

Workflow and Workflow Management System Zoo

https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems

Page 42: Reproducibility, Research Objects and Reality, Leiden 2016

bio.tools

A community led standard way of expressing and running workflows and command line tools using containers

Ontologies for describing tools and their inputs and outputs

Metadata framework for the manifest versioning, file integrity, more metadata about the workflow

Workflow fragment containers

Page 43: Reproducibility, Research Objects and Reality, Leiden 2016

Findable
Accessible
Interoperable
Reusable

Data
Operations
Models

Systems and Synthetic Biology Projects

Page 44: Reproducibility, Research Objects and Reality, Leiden 2016

Funder: Legacy!

Partners

Page 45: Reproducibility, Research Objects and Reality, Leiden 2016

Project Support

Community Actions

Platforms, Tools

Web-based Portal Public Commons

50+ projects
5 programmes
400+ people

22 independent installations

Page 46: Reproducibility, Research Objects and Reality, Leiden 2016

Systems Approach…
Multiple, interrelated assets
Multiple, dispersed repositories

Literature
SOPs
Operations
Data
Models

STANDARDS
versioning, tracking: provenance, parameters, citation

Page 47: Reproducibility, Research Objects and Reality, Leiden 2016

FAIR Data and Metadata Standards that help to improve understanding and exchange….

Nicolas Le Novère, Babraham Institute, UK.

…researchers do not always use them....

Page 48: Reproducibility, Research Objects and Reality, Leiden 2016

… model reuse and reproducibility tricky…

Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053

Page 49: Reproducibility, Research Objects and Reality, Leiden 2016

Systems Approach…teams, processes, multi-partner, multi-discipline, legacy

P1. BaCell-SysMO: The transition from growing to non-growing Bacillus subtilis cells - A systems biology approach
P2. COSMIC: Systems Biology of Clostridium acetobutylicum - a possible answer to dwindling crude oil reserves
P3. SUMO: Systems Understanding of Microbial Oxygen Responses Escherichia coli
P4. KOSMOBAC: Ion and solute homeostasis in enteric bacteria Escherichia coli
P5. SysMO-LAB: Comparative Systems Biology: Lactic Acid Bacteria: Lactococcus lactis, Enterococcus faecalis, Streptococcus pyogenes
P6. PSYSMO: Systems analysis of biotech induced stresses: towards a quantum increase in process performance in the cell factory Pseudomonas putida
P7. SCaRAB: Systems Biology of a genetically engineered Pseudomonas fluorescens with inducible exo-polysaccharide production: analysis of the dynamics and robustness of metabolic networks
P8. MOSES: MicroOrganism Systems Biology: Energy and Saccharomyces cerevisiae
P9. TRANSLUCENT: Gene interaction networks and models of cation homeostasis in Saccharomyces cerevisiae
P10. STREAM: Global metabolic switching in Streptomyces coelicolor
P11. SulfoSYS: Silicon cell model for the central carbohydrate metabolism of the archaeon Sulfolobus solfataricus under temperature variation
P12. SysMO-DB: Data management group

Funders

Researchers

Publishers

Page 50: Reproducibility, Research Objects and Reality, Leiden 2016

Who is working with which organism?

What methods are being used to determine enzyme activity?

Under which experimental conditions are my partners working for the measurement of glucose concentration?

What is the provenance of the parameters for this version of the model?

What SOP was used for this sample?

Where is the validation data for this model?

Is there any group generating kinetic data?

Is this data available?

Track versions of my model

What's the relationship between the data and model?

Which data belong to which publications?

FAIR

Page 51: Reproducibility, Research Objects and Reality, Leiden 2016

A Commons

fairdomhub.org

Page 52: Reproducibility, Research Objects and Reality, Leiden 2016
Page 53: Reproducibility, Research Objects and Reality, Leiden 2016

Investigation

Study Analysis

Data

Model

SOP(Assay)

…organised in Investigation, Study, Assay/Analysis format
…registered using Just Enough Results description
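The Investigation, Study, Assay/Analysis nesting can be pictured as a small data structure. This is a sketch for illustration only, not FAIRDOM's data model: every class name, field name and the helper `assets_of_kind` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    kind: str   # "Data", "Model" or "SOP", as on the slide
    title: str


@dataclass
class Assay:
    title: str
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: List[Study] = field(default_factory=list)

    def assets_of_kind(self, kind):
        """Collect all assets of one kind across every study and assay."""
        return [
            asset
            for study in self.studies
            for assay in study.assays
            for asset in assay.assets
            if asset.kind == kind
        ]
```

With assets hung off assays in this way, a question such as "What SOP was used for this sample?" becomes a walk down the hierarchy rather than a search through folders.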

Page 54: Reproducibility, Research Objects and Reality, Leiden 2016

…organised in Investigation, Study, Assay/Analysis format
…registered using Just Enough Results description.

Just Enough Results Model
Common elements

Page 55: Reproducibility, Research Objects and Reality, Leiden 2016

…organised in Investigation, Study, Assay/Analysis format
…registered using Just Enough Results description.

Uploaded into the FAIRDOM Store

Linked to entry in Public Archive

Linked to entry in Project store

Page 56: Reproducibility, Research Objects and Reality, Leiden 2016

…aggregating catalogue
metadata across repositories, retain context -> reproduce, reuse

Local Stores

External Databases

Publishing services

Secure Stores

Model Resources

Page 57: Reproducibility, Research Objects and Reality, Leiden 2016

…in situ reproducible models
metadata annotation against standards

model validation, comparison and simulation

SBML Model simulation

Model comparison

Model versioning

Reproducing simulations

[Jacky Snoep, Dagmar Waltemath, Martin Peters, Martin Scharm]

Page 58: Reproducibility, Research Objects and Reality, Leiden 2016

…. Nested Packages

context and credit

Page 59: Reproducibility, Research Objects and Reality, Leiden 2016

Research Objects
• Link
• Nest
• Span
• Bundle
• Snapshot

Systematic, Standards-based metadata framework for logically and physically bundling resources with context
• Exchange
• Reproduce
• Release packages

Page 60: Reproducibility, Research Objects and Reality, Leiden 2016

Reproducible Exchange and Publishing
and better credit

reviewer

Author List: Joe Bloggs; Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek##

information travels with the data and models

Page 61: Reproducibility, Research Objects and Reality, Leiden 2016

How do we do? Pretty well.
Reproducibility window. But that’s ok!

• Can’t contain everything
  – Pesky Internet in a Box
• Can’t automate everything
  – Pesky people
• Can’t fix everything
  – Pesky science

Page 62: Reproducibility, Research Objects and Reality, Leiden 2016

Asthma Research e-Laboratory

Release builds of pharmacological knowledge warehouse

Exchanging large datasets

Page 63: Reproducibility, Research Objects and Reality, Leiden 2016

Samiul Hasan, GSK
Biocuration need in Pharma: Drivers from a Translational Bioinformatics Perspective, Poster S16
1st EASYM Conference, Berlin 2016

Reality

Page 64: Reproducibility, Research Objects and Reality, Leiden 2016

Preparation pain. Goldilocks paradox.

[Norman Morrison]

replication hostility
no funding, time, recognition, place to publish
resource intensive
access to the complete environment

Page 65: Reproducibility, Research Objects and Reality, Leiden 2016

“Data Parasites”
“Data Flirters”

“Share Drift”
Family
Friends
Potential Friends
Acquaintances
Strangers
Rivals

Reciprocity

Page 66: Reproducibility, Research Objects and Reality, Leiden 2016

Using FAIRDOM, my own lab colleagues saw what I was doing and called to collaborate!

Jurgen Haanstra
Vrije Universiteit Amsterdam, Netherlands

Trust …

Page 67: Reproducibility, Research Objects and Reality, Leiden 2016
Page 68: Reproducibility, Research Objects and Reality, Leiden 2016
Page 69: Reproducibility, Research Objects and Reality, Leiden 2016

Half of researchers make their research data available so that it can be used by others.

Most have not experienced any direct benefits, nor many bad effects.

Caveat: shared but usable?
fake sharing

funder requirements
fear data will be misused or misinterpreted
journal requirements
good research practice
facilitate collaborations
enable validation and replication
higher citation rates
time and effort
new collaborations
extra funding for cost of data prep
enhance their academic reputation
feedback on how other researchers were using their data
taken into account in funding
taken into account in career
jeopardise future publications
it's not ready to share
scrutiny scruples
answering questions
I won’t get credited

Page 70: Reproducibility, Research Objects and Reality, Leiden 2016

Metadata in by side effect
Tooling for annotations and checklist templates for different types of assay data.

Embed ontologies into Excel templates

Excel spreadsheets enriched with ontology annotations

Upload, extract metadata and register

http://www.rightfield.org.uk

Spreadsheet Ramps!!

Page 71: Reproducibility, Research Objects and Reality, Leiden 2016

Sharing by side effect …. libertarian paternalism

[Kristian Garza]

Page 72: Reproducibility, Research Objects and Reality, Leiden 2016

Finding and Citing by side effect

• Schema.org
• Structured markup in web pages

• Supported by Content Management Systems

• Harvested by search engines

• Builds snippets and sidebars

Bioschemas.org
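Structured markup of this kind is typically a small JSON-LD block embedded in the page. The sketch below is illustrative only: the helper name `dataset_jsonld` and all example values are hypothetical, it uses only generic schema.org `Dataset` properties, and it does not reproduce any Bioschemas profile.

```python
import json


def dataset_jsonld(name, description, url, identifier):
    """Return a <script> element carrying schema.org Dataset markup."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(dataset_jsonld(
        name="Example glucose measurements",
        description="Illustrative dataset entry.",
        url="https://example.org/datasets/1",
        identifier="example-dataset-1",
    ))
```

If a content management system emits a block like this for each record, harvesting by search engines happens as a side effect of publishing the page.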

Page 73: Reproducibility, Research Objects and Reality, Leiden 2016

Finding and Citing by side effect
Bioschemas.org

Data repository (Bioschemas), Data repository (Bioschemas), Training Resource (Bioschemas)

Search engine
Bio Registries: Biosharing, OLS, TeSS, bio.tools
UKCRC Tissue Directory
bioCADDIE DATAMED
PDBe, UniProt, Interpro, Molgenis, Pfam, Gene3D, Biosamples, Biobank websites, BRENDA, HPA, TransPlant, EGA, Beacons
EBI-Search, Google

Page 74: Reproducibility, Research Objects and Reality, Leiden 2016

Big co-operative data-driven science makes reproducibility desirable but also means dependency and change are to be expected.

Words matter.
50 Shades of Reproducibility.

form vs function
Reproducibility is not an end.

Beware zealots.

Amplify Side effects
Think Research Objects!