Research Objects for improved
sharing and reproducibility
Dagstuhl Perspective Workshop on the intersection
between Computer Sciences and Psychology
Oscar Corcho
@ocorcho, http://slideshare.net/ocorcho
Ontology Engineering Group
Universidad Politécnica de Madrid
(and the Research Object community group)
Some memos from our futuristic scenario
• Don’t publish,
release (ack: Carole
Goble), reloaded
(ack. Paul Groth)
• Don’t just read a
paper, but also view
it, play with it, and
whatever else
• Convert passive
papers into active
scientific storytellers
and alert systems
A few quotes from this week
• Data (and method) sharing
• Dietrich: The method for investigation is not clearly
described
• Eric: Provide links between articles and datasets
(interlinking of scholarly content)
• William: methods are normally reduced to a tiny
piece of text
• Reproducibility
• Working group on “the present”: Crisis of
replicability is driving increased concern and
interest
• Eric: 70% of science articles are not reproducible
One of the many origins of “Don’t Publish, Release”
• A day in Granada… (January, 2012)
• Let’s put some of the interesting discussions from the Force11
Dagstuhl meeting into practice
[Diagram: the life cycle of a Research Object (RO)]
• Scientist: works on a Live RO
• My supervisor calls me to report my work: RO snapshot <<copy>>. Identified by a URI; some metadata; some curation; mostly private (for my group)
• My supervisor calls me again and we decide to publish our RO+paper: RO snapshot <<copy>>, <<versionOf>>. Identified by a URI; some metadata; some curation; mostly private (for my group and for paper reviewers)
• Reviews received and final version published: Archived RO <<copy, filter and curate>>, <<versionOf>> (Librarian/Curator). Identified by a URI; good metadata and curation; mostly public
• A new PhD student continues my work: Live RO <<copy>>
One of the origins of “Don’t Publish, Release”
How do you usually structure your experiment?
• In a set of folders?
• These could be profiles for how you normally
structure your research
• Dropbox? Google Drive? GitHub?
• Overleaf+figshare? Whatever???
Multi-various products, platforms, resources
First class citizens - id, manage, credit, track, profile, focus
A Framework to Bundle, Port and Link (scattered) resources, related experiments. Metadata Objects that carry Research Context. Units of exchange.
Research Objects
http://www.researchobject.org
Identity
Aggregation
Interpretation:
The objects
How they are linked together
RO main principles
manifest
Refer to aggregations and their contents
Describe group & constituents
External ids
Local files
Attribution: who, when, where, why?
Metadata
Description
Aggregations
Resource maps
Proxies
Annotation first class and stand-off
Identity persistence and resolution, names
Citation
Identity: DOIs, URIs, Handles, ORCID
Annotation: W3C OADM
Aggregation: OAI-ORE
manifest: point of extendability
RO main principles: technologies
RO Model Ontology
• Defines the core concepts of research objects: identity, aggregation, annotation. Used in the manifest
• http://w3id.org/ro/
Export, archive, publish and transfer ROs.
File format for storage and distribution of ROs as a ZIP archive
Includes an RO’s manifest, annotations and some or all of its aggregated resources
Basis for more specific file formats
Backwards compatible: it's a ZIP file
Programmatic access: JSON and JSON-LD manifest, API
https://researchobject.github.io/specifications/bundle/
https://w3id.org/bundle/ doi:10.5281/zenodo.10440
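To make the bundle idea concrete, here is a minimal sketch of writing a ZIP archive that carries a JSON-LD manifest next to the aggregated resources. The media type string, the `.ro/manifest.json` location and the `aggregates`/`uri` field names follow my reading of the RO Bundle specification linked above and should be checked against it; the function names and the example file names are hypothetical.

```python
import io
import json
import zipfile

# Media type as I understand the RO Bundle specification (assumption: verify).
BUNDLE_MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_ro_bundle(target, resources):
    """Write a minimal RO-Bundle-like ZIP archive.

    target: a file path or a writable binary file object.
    resources: dict mapping a path inside the bundle to its bytes.
    Returns the manifest dict that was written.
    """
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(resources)],
    }
    with zipfile.ZipFile(target, "w") as bundle:
        # Media type first and uncompressed, so tools can sniff the format.
        bundle.writestr("mimetype", BUNDLE_MIMETYPE,
                        compress_type=zipfile.ZIP_STORED)
        for name in sorted(resources):
            bundle.writestr(name, resources[name],
                            compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return manifest


def read_manifest(source):
    """Read the manifest back from a bundle (path or binary file object)."""
    with zipfile.ZipFile(source) as bundle:
        return json.loads(bundle.read(".ro/manifest.json"))


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_ro_bundle(buffer, {"data/results.csv": b"a,b\n1,2\n"})
    buffer.seek(0)
    print(read_manifest(buffer)["aggregates"])
```

Because the result is an ordinary ZIP file, any archive tool can still open it; only RO-aware tools need to read the manifest.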
Research Objects: Scopes and Tooling
• http://www.researchobject.org/scopes/
• Farr Commons: http://www.farrcommons.org/
• ISA and FAIR-DOM http://fair-dom.org/
• SEEK http://seek4science.org/
• COMBINE
• BagIt (soon)
• White-labelled sci-domain-independent software
• http://rohub.linkeddata.es/
• http://www.rohub.org/
• http://www.researchobject.org/specifications/
• Core Ontologies and extensions
• RO managers/APIs/bundling (Ruby, Java, Python)
• Latex2RO
• LDP4RO
Publishing may be as easy as…
• Providing the URL
of the Research
Object to the
publisher, with a
release tag, to start
the review process
(if extra review
needed)
The Research Method in different disciplines
INPUT DATA | SCIENTIFIC PROCEDURE | EQUIPMENT
IN VIVO / IN VITRO
IN SILICO
The Research Method in different disciplines
Lab book
Digital Log
Laboratory Protocol (recipe)
Workflow
Experiment
The Research Method in different disciplines
INPUT DATA | SCIENTIFIC PROCEDURE | EQUIPMENT
IN VIVO / IN VITRO
IN SILICO
Some problems in lab protocols
• Some of them present insufficient granularity
• The instructions can be imprecise or ambiguous due to the use of natural language
• Incubate the centrifuge tubes in a water bath.
• Incubate the samples for 5 min with gentle shaking.
• Rinse DNA briefly in 1-2 ml of wash.
• Incubate at -20C overnight.
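The kind of imprecision in the instructions above can be illustrated with a small checker. This is only a sketch: the regular expressions, the list of vague terms and the function name are my own assumptions, not part of SMART Protocols, and requiring a temperature and a duration for every step is stricter than real protocols need.

```python
import re

# Hypothetical checks: details that an instruction such as "incubate" usually needs.
CHECKS = {
    "temperature": re.compile(r"-?\d+(\.\d+)?\s*°?\s*C\b"),
    "duration": re.compile(r"\b\d+(\.\d+)?\s*(s|sec|min|h|hours?)\b"),
}
# Hypothetical list of words that leave the reader guessing.
VAGUE_TERMS = ("gentle", "briefly", "overnight")


def review_instruction(text):
    """Report which details are missing and which vague terms are used."""
    missing = [name for name, pattern in CHECKS.items()
               if not pattern.search(text)]
    vague = [term for term in VAGUE_TERMS if term in text.lower()]
    return {"missing": missing, "vague": vague}


if __name__ == "__main__":
    for line in (
        "Incubate the centrifuge tubes in a water bath.",
        "Incubate the samples for 5 min with gentle shaking.",
        "Incubate at -20C overnight.",
    ):
        print(line, review_instruction(line))
```

A pattern-based check like this only spots surface problems; deciding what a step really requires is where the ontologies and NLP tools come in.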
Currently…
Semi-structured information
Unstructured information
How to formalize the information from laboratory protocols as a knowledge base?
Ontologies + NLP tools
SMART Protocols - document
The Protocol as a document
[Diagram: sp:experimental protocol as an iao:document made of iao:document part sections, linked by ro:hasPart / ro:partOf and owl:subClassOf]
• sp:introduction section: sp:application of the protocol, sp:advantage of the protocol, sp:limitation of the protocol, sp:provenance of the protocol, sp:purpose of the protocol
• sp:materials section: sp:buffer list, sp:equipment and supplies list, sp:kit list, sp:primer list, sp:reagent list, sp:software list, sp:solution list
• sp:methods section: exact:caution, sp:critical step, sp:hint, sp:pause point, sp:storage condition, sp:timing, sp:troubleshooting
• Other classes shown: iao:textual entity, iao:data set, exact:alert message
Rhetorical and structural components (e.g. introduction, materials, and methods);
Information like application of the protocol, advantages and limitations, list of reagents, critical steps.
SMART Protocols - wf
[Diagram: Basic Steps of DNA Extraction]
• Steps (p-plan:Step, sp:basic step of DNA extraction): sp:cell disruption, sp:digestion reaction, sp:DNA purification
• Variables (p-plan:Variable): sp:plant tissue, sp:powdered tissue, sp:digested contaminant, obi:DNA extract
• Relations shown: owl:subClassOf, p-plan:hasInputVariable, p-plan:hasOutputVariable, bfo:isPrecededBy
Representation of the workflow aspects in protocols
implicit order in the instructions, following the input output structure.
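The idea that step order follows from the input/output structure can be sketched in a few lines. The wiring of variables to steps below is my reading of the diagram (an assumption), plain dictionaries stand in for the p-plan properties, and the function names are hypothetical. `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed wiring: each step lists its input and output variables
# (standing in for p-plan:hasInputVariable / p-plan:hasOutputVariable).
STEPS = {
    "sp:DNA purification": {
        "inputs": {"sp:digested contaminant"},
        "outputs": {"obi:DNA extract"},
    },
    "sp:cell disruption": {
        "inputs": {"sp:plant tissue"},
        "outputs": {"sp:powdered tissue"},
    },
    "sp:digestion reaction": {
        "inputs": {"sp:powdered tissue"},
        "outputs": {"sp:digested contaminant"},
    },
}


def preceded_by(steps):
    """Derive 'is preceded by' links: a step follows whoever produces its inputs."""
    producers = {var: name
                 for name, step in steps.items()
                 for var in step["outputs"]}
    return {
        name: {producers[var] for var in step["inputs"] if var in producers}
        for name, step in steps.items()
    }


def execution_order(steps):
    """Return the steps in an order that respects the derived links."""
    return list(TopologicalSorter(preceded_by(steps)).static_order())


if __name__ == "__main__":
    print(execution_order(STEPS))
```

The steps are deliberately declared out of order to show that the sequence is recovered from the variables alone.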
SMART Protocols documentation
• SMART Protocols ontology is available here:
• http://vocab.linkeddata.es/SMARTProtocols/
• Giraldo O, García-Castro A, Corcho O. SMART
Protocols: SeMAntic RepresenTation for
Experimental Protocols. LISC2014
SMART Protocols in action
sp = SMART Protocols, ro = Relation Ontology
[Diagram: an instance-level example. Classes shown: sp:experimental protocol, sp:DNA extraction protocol (owl:subClassOf), sp:sample, sp:advantages, sp:title of the protocol, sp:author entry, sp:application of the protocol. Relations shown: rdf:type, sp:hasAuthor, sp:hasTitle, ro:partOf]
The Research Method in different disciplines
INPUT DATA | SCIENTIFIC PROCEDURE | EQUIPMENT
IN VIVO / IN VITRO
IN SILICO
Vocabularies and methodologies for representing and publishing workflows
Methodology for workflow publishing
• Publication, Share, Reuse
• WINGS (Core, Portal) on a local laptop, on a shared host or on a web server: Workflow Template, Workflow Instance, PROV export
• OPM/PROV conversion
• Linked Data publication: RDF triple store (workflow provenance, workflow plan)
• Access for users and other workflow environments: interactive browsing (Pubby frontend), programmatic access (external apps), Wings workflow generation
Repository of linked workflows: http://www.opmw.org/sparql
http://purl.org/net/p-plan
http://www.opmw.org/ontology/
Daniel Garijo and Yolanda Gil. 2011. A new approach for publishing workflows: abstractions, standards, and linked data. (WORKS '11). ACM, New York, NY, USA, 47-56.
Daniel Garijo and Yolanda Gil. Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data. In Proceedings of the 2nd International Workshop on Linked Science 2012, Boston, 2012.
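Programmatic access to a repository of linked workflows could look roughly like the sketch below, which only builds a SPARQL request and does not send it. The endpoint URL comes from the slide; the query, the use of the class `opmw:WorkflowTemplate` and the function name are my assumptions, and the endpoint may no longer be online.

```python
import urllib.request
from urllib.parse import urlencode

ENDPOINT = "http://www.opmw.org/sparql"  # from the slide

# Assumed query: list a few workflow templates with their labels, if any.
QUERY = """
PREFIX opmw: <http://www.opmw.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?template ?label WHERE {
  ?template a opmw:WorkflowTemplate .
  OPTIONAL { ?template rdfs:label ?label }
} LIMIT 10
"""


def build_request(endpoint, query):
    """Build (but do not send) a SPARQL GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query.strip()})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


if __name__ == "__main__":
    request = build_request(ENDPOINT, QUERY)
    print(request.full_url)
    # To run the query, pass the request to urllib.request.urlopen(request).
```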
Definition of workflow abstractions
Catalog of common independent workflow abstractions (motifs)
Data-oriented motifs: What kind of manipulations does the workflow have?
Workflow-oriented motifs: How does the workflow perform its operations?
Analysis of 260 workflows from 10 domains, belonging to 5 different workflow systems
http://purl.org/net/wf-motifs#
Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble, Common motifs in scientific workflows: An empirical analysis, Future Generation Computer Systems, Volume 36, July 2014, Pages 338-351
Finding and evaluating common abstractions
https://github.com/dgarijo/FragFlow
http://purl.org/net/wf-fd
Graph mining techniques
Workflow fragment representation and linkage
Workflow fragment filtering techniques
Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris A. Gutman, Ivo D. Dinov, Paul Thompson, and Arthur W. Toga. FragFlow: Automated Fragment Detection in Scientific Workflows. In The 10th IEEE International Conference on e-Science, Guaruja, 2014
How to preserve Workflows/Research Objects?
Three main ways/levels:
• Descriptive reproducibility
• Documentation
• Workflow execution reproducibility
• Can we run the workflow?
• Workflow results reproducibility
• Can we get the same results?
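For the third level, one simple way to ask "can we get the same results?" is to compare checksums of the outputs of the original run and of a re-run. This is a sketch under my own assumptions (function names, byte-level comparison); byte identity is stricter than many domains need, for example when outputs contain timestamps or floating-point noise.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 checksum of one output file's content."""
    return hashlib.sha256(data).hexdigest()


def compare_results(original, rerun):
    """Compare two runs given as dicts mapping output name to bytes."""
    report = {}
    for name in sorted(set(original) | set(rerun)):
        if name not in rerun:
            report[name] = "missing in rerun"
        elif name not in original:
            report[name] = "new in rerun"
        elif digest(original[name]) == digest(rerun[name]):
            report[name] = "identical"
        else:
            report[name] = "different"
    return report


if __name__ == "__main__":
    first = {"mosaic.fits": b"pixels", "log.txt": b"run 1"}
    second = {"mosaic.fits": b"pixels", "log.txt": b"run 2"}
    print(compare_results(first, second))
```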
Checklists!
• Corcho et al: Checklist for workflow conservation.
• http://dx.doi.org/10.6084/m9.figshare.1285011
• 40 different aspects
• Documentation
• Goals
• Results
• Metadata
• Corcho et al: Checklist for a workflow conservation plan
• http://dx.doi.org/10.6084/m9.figshare.1285012
• Based on the DCC’s data management plan
The Research Method in different disciplines
INPUT DATA | SCIENTIFIC PROCEDURE | EQUIPMENT
IN VIVO / IN VITRO
IN SILICO
Pegasus: Montage, SoyKB, Epigenomics
CLOUD
Reproducibility of Computational Scientific Experiments
[Diagram: FORMER EQUIPMENT, ANNOTATE, SEMANTIC ANNOTATIONS, REPRODUCE, EQUIVALENT EXECUTION ENVIRONMENT]
Dispel4Py: Internal Extinction, Seismic Cross Correlation
Makeflow: BLAST
Some results
• Pegasus Montage Workflow
• Astronomy workflow
• Construct large image mosaics of the sky
• Montage Software distribution
• 59 binaries
• Target IaaS Cloud Providers
• Amazon EC2 & FutureGrid
• Vagrant
RO available at http://pegasus.isi.edu/publications/reppar
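The slides say only that Vagrant was used to recreate an equivalent execution environment; how the semantic annotations were turned into a configuration is not described here. As a loose illustration under that caveat, the sketch below renders a Vagrantfile from a list of annotated package dependencies. The box name, the package names and the use of `apt-get` are hypothetical choices of mine.

```python
def render_vagrantfile(box, packages):
    """Render a minimal Vagrantfile that installs the given packages.

    box: name of a Vagrant box (hypothetical example below).
    packages: iterable of package names taken from the annotations.
    """
    install = "apt-get update && apt-get install -y " + " ".join(sorted(packages))
    return "\n".join([
        'Vagrant.configure("2") do |config|',
        f'  config.vm.box = "{box}"',
        f'  config.vm.provision "shell", inline: "{install}"',
        "end",
        "",
    ])


if __name__ == "__main__":
    # Hypothetical annotations for a workflow's software dependencies.
    print(render_vagrantfile("ubuntu/focal64", ["make", "gcc"]))
```

Keeping the environment description as data, and generating the configuration from it, is what lets the same annotations target more than one provider.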
Lessons learned for Anna
• Research Objects as a
concept
• Identity, annotation,
aggregation
• Adapted to the
tools/infrastructure for each
domain
• With some tooling available
already
• It’s not just data preservation
but also methods
• Lab protocols
• Computational workflows
• Understand what
reproducibility means for you
Acknowledgements
• The Semantic e-Science team at UPM
• Carlos Badenes
• Daniel Garijo
• Olga Giraldo
• Rafael González-Cabero
• Idafen Santana
• The Wf4Ever team
• Carole Goble, José Manuel Gómez Pérez, Raúl Palma, Jun Zhao, Stian Soiland-Reyes, Khalid Belhajjame, José Enrique Ruíz, Marco Roos, Lourdes Verdes-Montenegro, Norman Morrison, Sean Bechhofer, Graham Klyne, Matt Gamble, and many others
• The Research Object community group
• http://www.researchobject.org/