Research Objects for improved sharing and reproducibility Dagstuhl Perspective Workshop on the intersection between Computer Sciences and Psychology Oscar Corcho @ocorcho, http://slideshare.net/ocorcho Ontology Engineering Group Universidad Politécnica de Madrid (and the Research Object community group)

Upload: oscar-corcho

Post on 28-Jan-2018



My motivation


Some memos from our futuristic scenario

• Don’t publish, release (ack. Carole Goble), reloaded (ack. Paul Groth)
• Don’t just read a paper, but also view it, play with it, and whatever else
• Convert passive papers into active scientific storytellers and alert systems

A few quotes from this week

• Data (and method) sharing
  • Dietrich: The method for investigation is not clearly described
  • Eric: Provide links between articles and datasets (interlinking of scholarly content)
  • William: Methods are normally reduced to a tiny piece of text
• Reproducibility
  • Working group on “the present”: Crisis of replicability is driving increased concern and interest
  • Eric: 70% of science articles are not reproducible

Act 1

Data and method sharing

One of the many origins of “Don’t Publish, Release”

• A day in Granada… (January, 2012)

• Let’s put some of the interesting discussions from the Force11 Dagstuhl meeting into practice

One of the origins of “Don’t Publish, Release”

[Diagram: lifecycle of a Research Object (RO). Actors: Scientist, Librarian/Curator]

• Live RO (Scientist)
• “My supervisor calls me to report my work”: RO snapshot (<<copy>>). Identified by a URI; some metadata; some curation; mostly private (for my group)
• “My supervisor calls me again and we decide to publish our RO+paper”: RO snapshot (<<copy>>). Identified by a URI; some metadata; some curation; mostly private (for my group and for paper reviewers)
• “Reviews received and final version published”: Archived RO (<<copy, filter and curate>>). Identified by a URI; good metadata and curation; mostly public
• “A new PhD student continues my work”: Live RO (<<copy>>)
• Relations shown between ROs: <<copy>>, <<copy, filter and curate>>, <<versionOf>>

How do you usually structure your experiment?

• In a set of folders?

• These could be profiles for how you normally structure your research
• Dropbox? Google Drive? GitHub?
• Overleaf+figshare? Whatever???

Scattered Assets

Multi-various products, platforms, resources

First class citizens - id, manage, credit, track, profile, focus

A Framework to Bundle, Port and Link (scattered) resources, related experiments. Metadata Objects that carry Research Context. Units of exchange.

Research Objects

http://www.researchobject.org

RO main principles

• Identity: identity persistence and resolution, names, citation
• Aggregation (manifest): refer to aggregations and their contents; describe group & constituents; external ids and local files; aggregations, resource maps, proxies
• Annotation (first class and stand-off): interpretation (the objects, how they are linked together); attribution (who, when, where, why?); metadata; description

RO main principles: technologies

• Identity: DOIs, URIs, Handles, ORCID
• Annotation: W3C OADM
• Aggregation: OAI-ORE
• manifest: point of extendability

RO Model Ontology

• Defines the core concepts of research objects: identity, aggregation, annotation. Used in the manifest

• http://w3id.org/ro/
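To make the three core concepts concrete, here is a minimal Python sketch that builds a JSON-LD description of a research object. It is an illustration only: the `example.org` URIs and the function name are hypothetical, and the namespace IRIs are written from memory of the Wf4Ever RO model, OAI-ORE and the Open Annotation vocabulary, so they should be checked against http://w3id.org/ro/ before use.

```python
import json

# Namespace IRIs are assumptions written from memory; verify before relying on them.
CONTEXT = {
    "ro": "http://purl.org/wf4ever/ro#",
    "ore": "http://www.openarchives.org/ore/terms/",
    "oa": "http://www.w3.org/ns/oa#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_research_object(ro_uri, creator, resources, annotations):
    """Return a JSON-LD dict covering identity, aggregation and annotation.

    annotations maps a target resource URI to the URI of a stand-off body.
    """
    graph = [{
        "@id": ro_uri,  # identity: the RO itself has a URI
        "@type": ["ro:ResearchObject", "ore:Aggregation"],
        "dct:creator": {"@id": creator},
        # aggregation: the resources that make up the RO
        "ore:aggregates": [{"@id": uri} for uri in resources],
    }]
    # annotation: each annotation links a body document to the thing it describes
    for index, (target, body) in enumerate(sorted(annotations.items())):
        graph.append({
            "@id": f"{ro_uri}annotations/{index}",
            "@type": "oa:Annotation",
            "oa:hasTarget": {"@id": target},
            "oa:hasBody": {"@id": body},
        })
    return {"@context": CONTEXT, "@graph": graph}


if __name__ == "__main__":
    doc = describe_research_object(
        "http://example.org/ro/1/",
        "http://example.org/people/alice",
        ["http://example.org/ro/1/data.csv"],
        {"http://example.org/ro/1/data.csv": "http://example.org/ro/1/notes/data.ttl"},
    )
    print(json.dumps(doc, indent=2))
```

Keeping the annotation body as a separate resource mirrors the "stand-off" idea: the description lives next to the data rather than inside it.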


Manifest – remote and local

on my machine

Export, archive, publish and transfer ROs.

File format for storage and distribution of ROs as a ZIP archive

Includes an RO’s manifest, annotations and some or all of its aggregated resources

Basis for more specific file formats

Backwards compatible: it’s ZIP

Programmatic access: JSON and JSON-LD manifest, API

https://researchobject.github.io/specifications/bundle/

https://w3id.org/bundle/ doi:10.5281/zenodo.10440
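As a rough illustration of the bundle idea, the sketch below writes a ZIP archive with a `mimetype` entry, a `.ro/manifest.json` and the aggregated files, using only the Python standard library. The media type, manifest path and manifest keys follow my reading of the bundle specification linked above and should be verified there; the function name, file names and the `example.org` URL are hypothetical.

```python
import json
import zipfile

# Media type as I recall it from the RO Bundle specification (verify before use).
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_bundle(path, local_files, external_uris=()):
    """Write a minimal bundle and return the manifest that was stored.

    local_files maps archive paths (e.g. "data/results.csv") to bytes.
    external_uris lists remote resources that are aggregated but not embedded.
    """
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": (
            [{"uri": "/" + name} for name in sorted(local_files)]
            + [{"uri": uri} for uri in external_uris]
        ),
    }
    with zipfile.ZipFile(path, "w") as bundle:
        # The mimetype entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(
            ".ro/manifest.json",
            json.dumps(manifest, indent=2),
            compress_type=zipfile.ZIP_DEFLATED,
        )
        for name in sorted(local_files):
            bundle.writestr(name, local_files[name], compress_type=zipfile.ZIP_DEFLATED)
    return manifest


if __name__ == "__main__":
    write_bundle(
        "example.bundle.zip",
        {"data/results.csv": b"a,b\n1,2\n"},
        ["http://example.org/protocol"],
    )
```

Because the result is an ordinary ZIP file, any ZIP tool can open it, while the manifest lists both local files and remote resources.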


Containers


Research Objects: Scopes and Tooling

• http://www.researchobject.org/scopes/
  • Farr Commons: http://www.farrcommons.org/
  • ISA and FAIR-DOM: http://fair-dom.org/
  • SEEK: http://seek4science.org/
  • COMBINE
  • BagIt (soon)
• White-labelled sci-domain-independent software
  • http://rohub.linkeddata.es/
  • http://www.rohub.org/
• http://www.researchobject.org/specifications/
  • Core Ontologies and extensions
  • RO managers/APIs/bundling (Ruby, Java, Python)
  • Latex2RO
  • LDP4RO

Publishing may be as easy as…

• Providing the URL of the Research Object to the publisher, with a release tag, to start the review process (if extra review needed)

Act 2

Reproducibility


Terminology

Inspired by [Goble, 2012]

The Research Method in different disciplines

[Table: rows IN VIVO/VITRO and IN SILICO; columns INPUT DATA, SCIENTIFIC PROCEDURE, EQUIPMENT. Labels shown: Lab book, Digital Log, Laboratory Protocol (recipe), Workflow, Experiment]

The Research Method in different disciplines


Some problems in lab protocols

Some of them present insufficient granularity, and the instructions can be imprecise or ambiguous due to the use of natural language:

• Incubate the centrifuge tubes in a water bath.
• Incubate the samples for 5 min with gentle shaking.
• Rinse DNA briefly in 1-2 ml of wash.
• Incubate at -20 °C overnight.
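The kind of imprecision in these instructions can be made visible with a very simple pattern-based check. The sketch below is only an illustration with hypothetical names and a hand-picked list of vague terms; it is plain regular-expression matching, not the ontology and NLP approach discussed in the talk.

```python
import re

# Hand-picked vague terms; an assumption for illustration, not a curated list.
VAGUE_TERMS = ("briefly", "overnight", "gentle", "gently")
DURATION = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:s|sec|min|h|hr|hours?|minutes?|seconds?)\b", re.I
)
TEMPERATURE = re.compile(r"-?\d+(?:\.\d+)?\s*°?\s*C\b")


def find_issues(instruction):
    """Return a list of reasons why a protocol instruction may be ambiguous."""
    issues = []
    lowered = instruction.lower()
    for term in VAGUE_TERMS:
        if re.search(rf"\b{term}\b", lowered):
            issues.append(f"vague term: {term}")
    # Assumption: an incubation step should state both a duration and a temperature.
    if lowered.startswith("incubate"):
        if not DURATION.search(instruction):
            issues.append("no explicit duration")
        if not TEMPERATURE.search(instruction):
            issues.append("no explicit temperature")
    return issues


if __name__ == "__main__":
    for step in (
        "Incubate the centrifuge tubes in a water bath.",
        "Incubate the samples for 5 min with gentle shaking.",
        "Rinse DNA briefly in 1-2 ml of wash.",
        "Incubate at -20C overnight.",
    ):
        print(step, "->", find_issues(step))
```

Every one of the four example instructions is flagged for at least one reason, which is the point the slide makes about natural-language protocols.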

Currently…

Semi-structured information

Unstructured information

How to formalize the information from laboratory protocols as a knowledge base?

Ontologies + NLP tools

SMART Protocols - document

The Protocol as a document

[Diagram: sp:experimental protocol and its parts. Relations shown: owl:subClassOf, ro:hasPart, ro:partOf. Upper-level classes shown: iao:document, iao:document part, iao:textual entity, iao:data set, exact:alert message]

• sp:introduction section: sp:application of the protocol, sp:advantage of the protocol, sp:limitation of the protocol, sp:provenance of the protocol, sp:purpose of the protocol
• sp:materials section: sp:buffer list, sp:equipment and supplies list, sp:kit list, sp:primer list, sp:reagent list, sp:software list, sp:solution list
• sp:methods section: exact:caution, sp:critical step, sp:hint, sp:pause point, sp:storage condition, sp:timing, sp:troubleshooting

Rhetorical and structural components (e.g. introduction, materials, and methods);

Information like application of the protocol, advantages and limitations, list of reagents, critical steps.
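The document view of a protocol can be sketched as a small part-whole structure. The Python below is a minimal illustration, not the SMART Protocols implementation: the grouping of items under each section is my reading of the slide's diagram, and the strings are the class labels from the slide, not resolvable IRIs.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Section membership as read from the slide's diagram (an interpretation).
PROTOCOL_PARTS: Dict[str, List[str]] = {
    "sp:introduction section": [
        "sp:application of the protocol",
        "sp:advantage of the protocol",
        "sp:limitation of the protocol",
        "sp:provenance of the protocol",
        "sp:purpose of the protocol",
    ],
    "sp:materials section": [
        "sp:buffer list",
        "sp:equipment and supplies list",
        "sp:kit list",
        "sp:primer list",
        "sp:reagent list",
        "sp:software list",
        "sp:solution list",
    ],
    "sp:methods section": [
        "exact:caution",
        "sp:critical step",
        "sp:hint",
        "sp:pause point",
        "sp:storage condition",
        "sp:timing",
        "sp:troubleshooting",
    ],
}


def document_triples(protocol: str = "sp:experimental protocol") -> List[Triple]:
    """Emit ro:hasPart and the inverse ro:partOf for every part of the protocol."""
    triples: List[Triple] = []
    for section, parts in PROTOCOL_PARTS.items():
        triples.append((protocol, "ro:hasPart", section))
        triples.append((section, "ro:partOf", protocol))
        for part in parts:
            triples.append((section, "ro:hasPart", part))
            triples.append((part, "ro:partOf", section))
    return triples


def parts_of(whole: str, triples: List[Triple]) -> List[str]:
    """List the direct parts of a document element."""
    return [o for s, p, o in triples if s == whole and p == "ro:hasPart"]


if __name__ == "__main__":
    for triple in document_triples():
        print(" ".join(triple))
```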

SMART Protocols - wf

Basic Steps of DNA Extraction

[Diagram. Classes shown: sp:basic step of DNA extraction, p-plan:Step, p-plan:Variable, sp:cell disruption, sp:digestion reaction, sp:DNA purification, sp:plant tissue, sp:powdered tissue, sp:digested contaminant, obi:DNA extract. Relations shown: owl:subClassOf, p-plan:hasInputVariable, p-plan:hasOutputVariable, bfo:isPrecededBy]

Representation of the workflow aspects in protocols

implicit order in the instructions, following the input output structure.
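The idea that step order follows from the input-output structure can be sketched in a few lines of Python. This is an illustration under assumptions: which variable feeds which step is my reading of the DNA extraction diagram, and the function names are hypothetical. A step is preceded by whichever step produces one of its inputs.

```python
from typing import Dict, List, NamedTuple, Set


class Step(NamedTuple):
    name: str
    inputs: Set[str]   # variables consumed (hasInputVariable on the slide)
    outputs: Set[str]  # variables produced (hasOutputVariable on the slide)


def preceded_by(steps: List[Step]) -> Dict[str, List[str]]:
    """Derive isPrecededBy links: a step follows the producers of its inputs."""
    producer = {var: step.name for step in steps for var in step.outputs}
    return {
        step.name: sorted({producer[v] for v in step.inputs if v in producer})
        for step in steps
    }


def execution_order(steps: List[Step]) -> List[str]:
    """Order steps so that every step comes after the steps it is preceded by."""
    pending = preceded_by(steps)
    order: List[str] = []
    while pending:
        ready = sorted(n for n, deps in pending.items() if all(d in order for d in deps))
        if not ready:
            raise ValueError("cyclic dependency between steps")
        order.extend(ready)
        for name in ready:
            del pending[name]
    return order


# Assumed wiring of the DNA extraction example; deliberately listed out of order.
DNA_EXTRACTION = [
    Step("sp:DNA purification", {"sp:digested contaminant"}, {"obi:DNA extract"}),
    Step("sp:cell disruption", {"sp:plant tissue"}, {"sp:powdered tissue"}),
    Step("sp:digestion reaction", {"sp:powdered tissue"}, {"sp:digested contaminant"}),
]

if __name__ == "__main__":
    print(execution_order(DNA_EXTRACTION))
```

The steps are given shuffled on purpose: the order is recovered purely from which outputs feed which inputs.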

SMART Protocols documentation

• SMART Protocols ontology is available here:

• http://vocab.linkeddata.es/SMARTProtocols/

• Giraldo O, García-Castro A, Corcho O. SMART Protocols: SeMAntic RepresenTation for Experimental Protocols. LISC2014

SMART Protocols in action

sp = SMART Protocols, ro = Relation Ontology

[Diagram: an example protocol described with the ontology. Classes shown: sp:experimental protocol, sp:DNA extraction protocol, sp:advantages, sp:sample, sp:title of the protocol, sp:author entry, sp:application of the protocol. Relations shown: owl:subClassOf, rdf:type, sp:hasAuthor, sp:hasTitle, ro:partOf]

The Research Method in different disciplines


Vocabularies and methodologies for representing and publishing workflows

[Diagram: methodology for workflow publishing]

• WINGS (on local laptop, on shared host, on web server): Core, Portal; Workflow Template, Workflow Instance, PROV export
• OPM/PROV conversion and Linked Data publication into an RDF triple store (workflow provenance, workflow plan)
• Access for users and other workflow environments: interactive browsing (Pubby frontend), programmatic access (external apps), Wings workflow generation
• Publication, share, reuse

Repository of linked workflows: http://www.opmw.org/sparql

http://purl.org/net/p-plan

http://www.opmw.org/ontology/

Daniel Garijo and Yolanda Gil. 2011. A new approach for publishing workflows: abstractions, standards, and linked data. (WORKS '11). ACM, New York, NY, USA, 47-56.

Daniel Garijo and Yolanda Gil. Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data. In Proceedings of the 2nd International Workshop on Linked Science 2012, Boston, 2012.
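Programmatic access to a repository of linked workflows could look roughly like the sketch below, which builds a SPARQL query and the corresponding request URL using the standard `query` parameter of the SPARQL protocol. The class name `opmw:WorkflowTemplate` is an assumption based on the OPMW ontology namespace given above, the function names are hypothetical, and the endpoint may no longer be online, so the network call is kept in a separate function.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://www.opmw.org/sparql"  # endpoint URL as given on the slide


def template_query(limit=10):
    """SPARQL query listing workflow templates (class name is an assumption)."""
    return (
        "PREFIX opmw: <http://www.opmw.org/ontology/>\n"
        "SELECT ?template WHERE { ?template a opmw:WorkflowTemplate } "
        f"LIMIT {int(limit)}"
    )


def request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def fetch_templates(limit=10, timeout=10):
    """Run the query and return template URIs; requires the endpoint to be reachable."""
    request = urllib.request.Request(
        request_url(template_query(limit)),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        results = json.load(response)
    return [row["template"]["value"] for row in results["results"]["bindings"]]


if __name__ == "__main__":
    print(request_url(template_query(5)))
```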

Definition of workflow abstractions

Catalog of common independent workflow abstractions (motifs)

Data-oriented motifs: What kind of manipulations does the workflow have?

Workflow-oriented motifs: How does the workflow perform its operations?

Analysis of 260 workflows from 10 domains, belonging to 5 different workflow systems

http://purl.org/net/wf-motifs#

Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble, Common motifs in scientific workflows: An empirical analysis, Future Generation Computer Systems, Volume 36, July 2014, Pages 338-351

Finding and evaluating common abstractions

https://github.com/dgarijo/FragFlow

http://purl.org/net/wf-fd

Graph mining techniques

Workflow fragment representation and linkage

Workflow fragment filtering techniques

Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris A. Gutman, Ivo D. Dinov, Paul Thompson, and Arthur W. Toga. FragFlow: Automated Fragment Detection in Scientific Workflows. In The 10th IEEE International Conference on e-Science, Guaruja, 2014

How to preserve Workflows/Research Objects?

Three main ways/levels:

• Descriptive reproducibility
  • Documentation
• Workflow execution reproducibility
  • Can we run the workflow?
• Workflow results reproducibility
  • Can we get the same results?

Checklists!

• Corcho et al: Checklist for workflow conservation.
  • http://dx.doi.org/10.6084/m9.figshare.1285011
  • 40 different aspects
    • Documentation
    • Goals
    • Results
    • Metadata
• Corcho et al: Checklist for a workflow conservation plan
  • http://dx.doi.org/10.6084/m9.figshare.1285012
  • Based on the DCC’s data management plan
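The three levels above can be read as an ordered scale. The following Python sketch is one possible reading, not part of the checklists: it assumes the levels are cumulative and that "same results" means byte-identical output, which is a strong simplification. All names are hypothetical.

```python
import hashlib
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    NONE = 0
    DESCRIPTIVE = 1  # the workflow is documented
    EXECUTION = 2    # the workflow can be run again
    RESULTS = 3      # the rerun gives the same results


def digest(data: bytes) -> str:
    """Fingerprint of an output, used to compare original and rerun results."""
    return hashlib.sha256(data).hexdigest()


def assess(documented: bool, rerun_output: Optional[bytes], original_digest: str) -> Level:
    """Return the highest level reached; rerun_output is None if the rerun failed."""
    if not documented:
        return Level.NONE
    if rerun_output is None:
        return Level.DESCRIPTIVE
    if digest(rerun_output) != original_digest:
        return Level.EXECUTION
    return Level.RESULTS


if __name__ == "__main__":
    original = digest(b"mosaic-v1")
    print(assess(True, b"mosaic-v1", original).name)
```

In practice, results are often compared with a tolerance or a domain-specific check rather than a hash.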

Some examples


Levels of reproducibility

Workflow conservation Plan

The Research Method in different disciplines


Reproducibility of Computational Scientific Experiments

[Diagram: FORMER EQUIPMENT, ANNOTATE, SEMANTIC ANNOTATIONS, REPRODUCE, EQUIVALENT EXECUTION ENVIRONMENT (CLOUD)]

• Pegasus: Montage, SoyKB, Epigenomics
• Dispel4Py: Internal Extinction, Seismic Cross Correlation
• Makeflow: Blast

Some results

• Pegasus Montage Workflow
  • Astronomy workflow
  • Construct large image mosaics of the sky
• Montage Software distribution
  • 59 binaries
• Target IaaS Cloud Providers
  • Amazon EC2 & Futuregrid
  • Vagrant

RO available at http://pegasus.isi.edu/publications/reppar

Lessons learned for Anna

• Research Objects as a concept
  • Identity, annotation, aggregation
  • Adapted to the tools/infrastructure for each domain
  • With some tooling available already
• It’s not just data preservation but also methods
  • Lab protocols
  • Computational workflows
• Understand what reproducibility means for you


Acknowledgements

• The Semantic e-Science team at UPM
  • Carlos Badenes
  • Daniel Garijo
  • Olga Giraldo
  • Rafael González-Cabero
  • Idafen Santana
• The Wf4Ever team
  • Carole Goble, José Manuel Gómez Pérez, Raúl Palma, Jun Zhao, Stian Soiland-Reyes, Khalid Belhajjame, José Enrique Ruíz, Marco Roos, Lourdes Verdes-Montenegro, Norman Morrison, Sean Bechhofer, Graham Klyne, Matt Gamble, and a large etcetera
• The Research Object community group
  • http://www.researchobject.org/