research data catalogues and data interoperability in life sciences


www.elixir-europe.org

Rafael C Jimenez, ELIXIR CTO

April 3, 2017

BlueBRIDGE Workshop

FAIR friendly research data catalogues: How far are we?

Research data catalogues and data interoperability in life sciences

@rafajido


Table of contents

• Registries and repositories

• Data repositories

• Datasets metadata and registries

• Data as a Service (DaaS) for EOSC

• Challenges in metadata indexing

• Schema.org and Bioschemas

• Extra slides

Registries and repositories

Registry vs Repository

• “… A registry is a list of items with pointers for where to find the items, like the index on a database table or the card catalog for a library. A repository stores the actual items, like a database table itself or a library’s shelves of books …” [1]

• “… registries hold references to things and repositories hold the things …” [2]

[1] http://stackoverflow.com/questions/2276124/what-is-the-difference-between-registry-and-repository-from-soa-point-of-view

[2] http://best-practice-software-engineering.blogspot.com/2008/04/misc-registry-vs-repository.html

Data index

Data registry

Data catalogue

Data repository

Database

Data resource

Examples

• Registries: OmicsDI

• Repositories: ArrayExpress

The same pair (OmicsDI, ArrayExpress) is shown in three further slides:

• Examples - Publications

• Examples - Datasets

• Examples - Data repositories

Registry specificity

• e.g. databases

Scope, from generic to specific: Generic, Research, Life Sciences, Molecular interactions

Registries shown: Google, re3data, Identifiers.org, Bio.tools, PSICQUIC

Examples of registries in life sciences

Resource types: Databases, Datasets, Publications, Tools, Training, Events, Ontologies, Standards, Samples

Registries shown: OmicsDI, Identifiers.org, Bio.tools, BioJS, BioContainers, PSICQUIC, STM


Data repositories

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702933/

Data resources in life sciences

1685 data repositories

Archives

• Experimental results

• Deposition

• Data, information

• Big (raw data)

• Data do not change
  • Unless reprocessed

Knowledgebases (KBs)

• Accumulative evidence

• Curation

• Information, knowledge

• Small

• Entries evolve
  • Change, (de)merge, deprecate

Data repositories

Structured

• Archives, knowledgebases

• Well defined metadata

• Short head

• Difficult deposition

• Highly cited

• Highly reused

• …

Unstructured

• Archives

• Generic metadata

• Long tail

• Easy deposition

• …

Data repositories

Datasets metadata and registries

http://www.ebi.ac.uk/Tools/omicsdi/

11 repositories, 82196 datasets

Dataset metadata

Scientific

• Author

• Description

• Experiment

• Sample

• Protocol

• Description

• Publication

• …

File

• Format

• Location

• Replicas

• Version

• Identifier

• Dependencies

• Authentication

• …
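The two groups of dataset metadata on this slide could be held together in a single record. The following is a minimal Python sketch of that idea. The class names, field names and example values are all hypothetical and chosen only for illustration; this is not the OmicsDI or DATS data model.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ScientificMetadata:
    # Fields follow the "Scientific" group on the slide; names are illustrative.
    authors: List[str]
    description: str
    experiment: str = ""
    sample: str = ""
    protocol: str = ""
    publications: List[str] = field(default_factory=list)


@dataclass
class FileMetadata:
    # Fields follow the "File" group on the slide; names are illustrative.
    identifier: str  # e.g. a persistent identifier (PID)
    file_format: str
    location: str
    replicas: List[str] = field(default_factory=list)
    version: str = "1"
    dependencies: List[str] = field(default_factory=list)
    authentication_required: bool = False


@dataclass
class DatasetRecord:
    # One dataset combines scientific metadata with one or more files.
    scientific: ScientificMetadata
    files: List[FileMetadata]


# Made-up values, for illustration only.
record = DatasetRecord(
    scientific=ScientificMetadata(
        authors=["A. Researcher"],
        description="Example proteomics dataset",
    ),
    files=[
        FileMetadata(
            identifier="pid:example/0001",
            file_format="mzML",
            location="https://repository.example.org/files/0001",
            replicas=["https://mirror.example.org/files/0001"],
        )
    ],
)

print(asdict(record)["files"][0]["replicas"])
```

Keeping both groups in one record is one possible way to address the missing links between scientific and file metadata raised later in the deck.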

DATS: descriptive metadata for datasets

Example of a file replication strategy

[Figure: file replication between PRIDE (EMBL-EBI), CSC, BILS, Site B and Site C, spanning ELIXIR and the EUDAT CDI, with B2SAFE / iRODS at each site. PIDs (Persistent IDentifiers) with associated metadata. Further labels: ELIXIR community center, ELIXIR data center 1, EUDAT data center 1.]

Data as a Service (DaaS)

[Figure: one dataset index per domain (Life, Earth, …), each bringing together scientific metadata, file metadata and PIDs, and each labelled ScienceSchemas.]

Data as a Service (DaaS)

[Figure: services (compute, storage, transfer, …) alongside the dataset index.]

Service Oriented Architecture

• Software design methodology

• Defines a federated environment

• Contributes to standardization and modularization

[Figure: SOA diagram with the operations Find, Publish and Access and the elements Contract, Client and Registry, applied to data repositories, a dataset index, and compute, storage, transfer, … services.]
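The find / publish / access pattern named on this slide can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the `Registry` class, the `example_repository` function and the dataset names are hypothetical, and a real deployment would use network services rather than Python functions.

```python
from typing import Callable, Dict, List

# A "service" here is just a function from a query string to a list of results.
Service = Callable[[str], List[str]]


class Registry:
    """Holds references to services, not the data itself."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def publish(self, name: str, service: Service) -> None:
        # A service provider registers itself under a name.
        self._services[name] = service

    def find(self, name: str) -> Service:
        # A client looks up a service by name.
        return self._services[name]


def example_repository(query: str) -> List[str]:
    # Stand-in for a data repository; the datasets are made up.
    datasets = ["proteomics study 1", "genomics study 2", "proteomics study 3"]
    return [d for d in datasets if query in d]


registry = Registry()
registry.publish("example-repository", example_repository)  # Publish

service = registry.find("example-repository")  # Find
results = service("proteomics")  # Access
print(results)
```

The registry never stores datasets, only a pointer to the service that holds them, which matches the registry versus repository distinction made at the start of the deck.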

How?

Issues & opportunities

• Data repositories
  • Many, diverse, dispersed
  • Limited capacity
  • Fragile sustainability
  • Reluctant to change unless there is a direct benefit

• Data registries
  • Data from repositories not always programmatically accessible
  • Low coverage when the wider context is considered
  • Struggle with updates
  • Lack of links between scientific data and file metadata

Recommendations

• Prioritise high impact, low cost
  • Findability
  • Minimum metadata to find data
  • Common types (datasets, samples)
  • Minimal service changes
  • Clear benefits

• High coverage strategy

• Combination of scientific and file metadata

• Leverage existing resources
  • Metadata registries
  • Data repositories

Thanks for your attention

Challenges in metadata indexing

Many, dispersed, diverse

• Several repositories

• Different interfaces

• Variable results

Indexing, integration, …?

Availability (+ / -)

• Web pages

• API

• SPARQL

• File
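When repositories expose the same kind of information through different interfaces, an index usually needs one small adapter per source. The sketch below assumes two made-up sources, a JSON API and a tab-separated file, with hypothetical field names; it only illustrates the normalisation step and does not reflect any real repository.

```python
import csv
import io
import json
from typing import Dict, List

# Two made-up payloads describing datasets, as two repositories might expose them.
API_RESPONSE = '{"datasets": [{"id": "A1", "title": "Liver proteome"}]}'
FILE_DUMP = "accession\tname\nB7\tYeast transcriptome\n"


def from_api(payload: str) -> List[Dict[str, str]]:
    # JSON API: the source calls its fields "id" and "title".
    return [
        {"identifier": d["id"], "name": d["title"]}
        for d in json.loads(payload)["datasets"]
    ]


def from_file(payload: str) -> List[Dict[str, str]]:
    # Tab-separated file: the source calls its fields "accession" and "name".
    reader = csv.DictReader(io.StringIO(payload), delimiter="\t")
    return [{"identifier": r["accession"], "name": r["name"]} for r in reader]


# Both sources end up in one index with a common set of fields.
index = from_api(API_RESPONSE) + from_file(FILE_DUMP)
for entry in index:
    print(entry["identifier"], entry["name"])
```

Each new interface means another adapter to write and maintain, which is part of why registries struggle with coverage and updates.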

Different formats for the same data

Molecular interactions data

Formats

• PSI-XML

• PSI-MITAB

• BioPAX

• RDF

• Cytoscape

• DAS

Characteristics

• Comprehensive

• Simple

• Generic

• Domain specific

• Structured

Why so many resources?

• Diverse data types

• Many communities

• Different ways to structure data

• Control, reputation, publication

• New funding normally only encourages doing something new

[Figure: funding and development over time (years 1 to 4).]

Are they sustainable?

• Just 20% have a sustained future*

* Merali Z. et al. Databases in peril. Nature 2005.

Schema.org and Bioschemas

Structured data markup for web pages

Without markup

<div>
  <h1>Classic potato salad</h1>
  <div>
    Nutrition facts:
    <span>144 kcal</span>,
  </div>
  Ingredients:
  - <span>800g small new potato</span>
  - <span>3 shallot</span>
  . . .
</div>

Concepts to mark up: Recipe, Title, Nutrition, Calories, Ingredients

With markup (RDFa, JSON-LD, Microdata)

<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Classic potato salad</h1>
  <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
    Nutrition facts:
    <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
</div>
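The recipe example uses Microdata attributes (itemscope, itemtype, itemprop). The same statements can also be written as JSON-LD, one of the three syntaxes named on the slide. The sketch below builds that JSON-LD in Python. It reuses only the values visible in the slide's example; the remaining ingredients are elided there and are left out here as well.

```python
import json

# Same schema.org types and properties as in the Microdata example.
recipe = {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "name": "Classic potato salad",
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "144 kcal",
    },
    # Only the two ingredients visible on the slide; the rest is elided there.
    "recipeIngredient": ["800g small new potato", "3 shallot"],
}

json_ld = json.dumps(recipe, indent=2)

# JSON-LD is typically embedded in a web page inside a script element.
print('<script type="application/ld+json">')
print(json_ld)
print("</script>")
```

With JSON-LD the structured data sits in one block rather than being interleaved with the visible HTML, which can make it easier to generate from an existing database.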

Bioschemas

• Schema.org for biology-related information

• Minimum properties for finding data

Bioschemas

• Specification on top of schema.org

• Layer of constraints + documentation + extensions

Specification

• Data model

• Minimum information

• Controlled vocabularies

• Cardinality

• Documentation

• Examples

• New (properties | types)
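The three kinds of constraint listed for the specification (minimum information, cardinality, controlled vocabularies) can be expressed as a simple check over a record. The sketch below is only an illustration: the `PROFILE` table, the `sampleType` property and its vocabulary are hypothetical and do not correspond to an official Bioschemas profile.

```python
from typing import Any, Dict, List

# Hypothetical profile: property -> constraints. Not an official Bioschemas profile.
PROFILE: Dict[str, Dict[str, Any]] = {
    "name": {"min": 1, "max": 1},
    "identifier": {"min": 1, "max": 1},
    "keywords": {"min": 0, "max": None},
    "sampleType": {"min": 1, "max": 1, "vocabulary": {"blood", "tissue", "cell line"}},
}


def validate(record: Dict[str, Any], profile: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    for prop, rule in profile.items():
        value = record.get(prop, [])
        values = value if isinstance(value, list) else [value]
        # Minimum information and cardinality.
        if len(values) < rule["min"]:
            problems.append(f"{prop}: missing (minimum {rule['min']})")
        if rule["max"] is not None and len(values) > rule["max"]:
            problems.append(f"{prop}: too many values (maximum {rule['max']})")
        # Controlled vocabulary, where the profile defines one.
        allowed = rule.get("vocabulary")
        if allowed is not None:
            for v in values:
                if v not in allowed:
                    problems.append(f"{prop}: '{v}' not in controlled vocabulary")
    return problems


record = {"name": "Sample 42", "sampleType": "plasma"}
for problem in validate(record, PROFILE):
    print(problem)
```

The example record fails on two counts: it has no identifier, and its sample type is outside the assumed vocabulary.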

Access to structured data

[Figure: data repositories, each carrying Bioschemas markup, read by search engines (Google, Yahoo, Bing, Yandex, …) and by registries.]
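A registry or search engine that reads structured data from repository pages needs some way to pull the markup out of the HTML. The sketch below shows one possible approach using only the Python standard library, and it assumes the page embeds its markup as JSON-LD in a script element. The `JsonLdExtractor` class and the sample page are hypothetical, and a real harvester would also need to fetch pages and handle Microdata and RDFa.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: List[str] = []
        self.items: List[Dict[str, Any]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


# Made-up page from a hypothetical data repository.
PAGE = """
<html><body>
<h1>Example dataset</h1>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(PAGE)

# A registry could index each harvested item by type and name.
index = [(item["@type"], item["name"]) for item in extractor.items]
print(index)
```

Because the markup lives in the repository's own web pages, the repository does not need to build or maintain a separate API for the registry.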

Types: Samples, Phenotypes, Datasets, Beacons, Protein annotations, Data repositories, …

Resources shown: Pfam, HPA, EGA, Gene3D, BBMRI-NL Biobank, PDBe, BRENDA, UniProt, Beacons, Brassica IP, COPaKB, Biocatalogue, Biosharing, Biosamples, Identifiers.org, Beacon network, UKCRC TC, Bio.tools, DataMed

Extra slides

Unique monthly submissions: Sequence Read Archive (SRA)

Unique monthly usage: Sequence Read Archive (SRA)

SOA for Molecular interactions via PSICQUIC

[Figure: SOA diagram with a client, a registry and servers (PSICQUIC Service A, B and C). Operations: Find, Publish, Query. Labels: Standard, MIQL, Service metadata, Biological information.]

SOA: Service Oriented Architecture
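When several services implement the same standard query interface, a client can send one query to every service listed in the registry and combine the answers. The sketch below illustrates that fan-out with in-memory stand-ins. The service names, the interaction data and the matching rule are made up; this is not the PSICQUIC API and the query is not MIQL syntax.

```python
from typing import Callable, Dict, List


def make_service(interactions: List[str]) -> Callable[[str], List[str]]:
    # Build a stand-in service holding made-up interactions such as "P1-P2".
    def query(term: str) -> List[str]:
        # Return interactions in which the term is one of the two partners.
        return [i for i in interactions if term in i.split("-")]

    return query


# The registry maps service names to services sharing one query interface.
REGISTRY: Dict[str, Callable[[str], List[str]]] = {
    "Service A": make_service(["P1-P2", "P2-P3"]),
    "Service B": make_service(["P2-P3", "P4-P5"]),
    "Service C": make_service(["P1-P6"]),
}


def federated_query(term: str) -> Dict[str, List[str]]:
    """Send the same query to every registered service, keeping per-service results."""
    return {name: service(term) for name, service in REGISTRY.items()}


results = federated_query("P1")

# Merge and de-duplicate hits across services.
merged = sorted({hit for hits in results.values() for hit in hits})
print(results)
print(merged)
```

The client needs to know only the registry and the shared query interface, not the details of each individual service.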

SOA for beacons

[Figure: SOA diagram with a client, a registry and servers (BEACON Service A, B and C). Operations: Find, Publish, Query. Labels: Standard specification, Service metadata, Biological information.]

SOA: Service Oriented Architecture

Data Citation in Europe PMC full text articles

[Figure: chart with the labels European Nucleotide Archive, Protein Data Bank, DNA Variations (SNPs), Gene Expression Studies, DOIs (long tail).]

Source: Kafkas S, Kim JH, and McEntyre JR. Database Citation in Full Text Articles. PLoS One (May 2013). 10.1371/journal.pone.0063184