www.elixir-europe.org
Rafael C Jimenez, ELIXIR CTO
April 3, 2017
BlueBRIDGE Workshop
FAIR friendly research data catalogues: How far are we?
Research data catalogues and data interoperability in life sciences
@rafajido
Table of contents
• Registries and repositories
• Data repositories
• Datasets metadata and registries
• Data as a Service (DaaS) for EOSC
• Challenges in metadata indexing
• Schema.org and Bioschemas
• Extra slides
Registry vs Repository
• “… A registry is a list of items with pointers for where to find the items, like the index on a database table or the card catalog for a library. A repository stores the actual items, like a database table itself or a library’s shelves of books …” [1]
• “… registries hold references to things and repositories hold the things …” [2]
[1] http://stackoverflow.com/questions/2276124/what-is-the-difference-between-registry-and-repository-from-soa-point-of-view
[2] http://best-practice-software-engineering.blogspot.com/2008/04/misc-registry-vs-repository.html
Data index
Data registry
Data catalogue
Data repository
Database
Data resource
Registry specificity
• e.g. databases
Generic: Google
Research: re3data
Life Sciences: Identifiers.org, Bio.tools
Molecular interactions: PSICQUIC
Examples of registries in life sciences
Databases
Datasets
Publications
Tools
Training
Events
Ontologies
Standards
Samples
OmicsDI
Identifiers.org
Bio.tools
BioJS
BioContainers
PSICQUIC
Bio.tools
STM
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702933/
Data resources in life sciences
1685 data repositories
Archives
• Experimental results
• Deposition
• Data, information
• Big (raw data)
• Data do not change
  • Unless reprocessed
Knowledgebases (KBs)
• Accumulative evidence
• Curation
• Information, knowledge
• Small
• Entries evolve
  • Change, (de)merge, deprecate
Data repositories
Structured
• Archives, knowledgebases
• Well defined metadata
• Short head
• Difficult deposition
• Highly cited
• Highly reused
• …
Unstructured
• Archives
• Generic metadata
• Long tail
• Easy deposition
• …
Data repositories
http://www.ebi.ac.uk/Tools/omicsdi/
11 repositories, 82196 datasets
Dataset metadata
Scientific
• Author
• Description
• Experiment
• Sample
• Protocol
• Description
• Publication
• …
File
• Format
• Location
• Replicas
• Version
• Identifier
• Dependencies
• Authentication
• …
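A minimal sketch of how the two metadata groups on this slide could sit side by side in one dataset record. The class names, field types, defaults and example values are assumptions for illustration; the slide only lists the field names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScientificMetadata:
    # Fields taken from the "Scientific" list; types are assumed.
    author: str
    description: str
    experiment: Optional[str] = None
    sample: Optional[str] = None
    protocol: Optional[str] = None
    publication: Optional[str] = None


@dataclass
class FileMetadata:
    # Fields taken from the "File" list; types and defaults are assumed.
    identifier: str
    format: str
    location: str
    version: str = "1"
    replicas: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    requires_authentication: bool = False


@dataclass
class DatasetRecord:
    # One record links the scientific description to its files.
    scientific: ScientificMetadata
    files: List[FileMetadata]


# Hypothetical example values, not taken from any real repository.
record = DatasetRecord(
    scientific=ScientificMetadata(
        author="A. Researcher",
        description="Hypothetical proteomics experiment",
    ),
    files=[
        FileMetadata(
            identifier="example:0001",
            format="mzML",
            location="https://repository.example.org/files/0001.mzML",
            replicas=["https://mirror.example.org/files/0001.mzML"],
        )
    ],
)
```

Keeping both groups in one record is one way to address the missing link between scientific data and file metadata raised later in the deck.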
Example of a file replication strategy
Sites: PRIDE (EMBL-EBI), CSC, BILS, Site B, Site C
Infrastructures: ELIXIR, EUDAT CDI
Replication: B2SAFE / iRODS (×4)
PIDs (Persistent IDentifiers) with associated metadata
ELIXIR community center
ELIXIR Data center 1
EUDAT Data center 1
CSC, PRIDE, BILS
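A rough sketch of the idea behind PIDs with associated metadata: one persistent identifier points at several replicas, and a client picks a site it prefers. The PID string, checksum, locations and the resolve function are hypothetical; this is not how B2SAFE or iRODS expose replicas.

```python
# Hypothetical in-memory PID index: one PID, its metadata and its replicas.
replica_index = {
    "pid:example/0001": {
        "checksum": "0123abcd",  # placeholder value
        "replicas": {
            "PRIDE": "irods://pride.example.org/archive/0001",
            "CSC": "irods://csc.example.org/archive/0001",
            "BILS": "irods://bils.example.org/archive/0001",
        },
    }
}


def resolve(pid, preferred_sites):
    """Return (site, location) for the first preferred site holding a replica."""
    record = replica_index[pid]
    for site in preferred_sites:
        if site in record["replicas"]:
            return site, record["replicas"][site]
    raise LookupError(f"no replica of {pid} at any of {preferred_sites}")


# "Site B" holds no replica in this example, so the client falls back to CSC.
site, location = resolve("pid:example/0001", ["Site B", "CSC"])
```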
Domains: Earth, Life, …
Each domain: Science schemas, Dataset index (Scientific, File, PID)
Data as a Service (DaaS)
Services: Compute, Storage, Transfer, …
Dataset index
Service Oriented Architecture
• Software design methodology
• Defines a federated environment
• Contributes to standardization and modularization
Find
Publish
Access
Contract
Client
Registry
Data repositories
Dataset index
Compute, storage, transfer, …
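The find, publish and access roles in the diagram can be sketched as below. This is an in-memory illustration only: the Registry class, the contract string and the example dataset index are assumptions, not an ELIXIR or EOSC interface.

```python
class Registry:
    """Holds service descriptions (contracts), not the data itself."""

    def __init__(self):
        self._entries = []

    def publish(self, name, kind, contract, handler):
        # A provider publishes what it offers and how to call it.
        self._entries.append(
            {"name": name, "kind": kind, "contract": contract, "handler": handler}
        )

    def find(self, kind):
        # A client finds all services of a given kind.
        return [entry for entry in self._entries if entry["kind"] == kind]


def search_example_index(keyword):
    """Hypothetical dataset index: returns ids whose description matches."""
    catalogue = {"DS-1": "proteomics of yeast", "DS-2": "marine metagenomics"}
    return sorted(ds for ds, text in catalogue.items() if keyword in text)


registry = Registry()
registry.publish(
    name="example-dataset-index",
    kind="dataset-index",
    contract="search(keyword) -> list of dataset ids",
    handler=search_example_index,
)

# Client side: find a service through the registry, then access it directly.
matches = registry.find("dataset-index")
results = matches[0]["handler"]("proteomics")
```

The registry never sees the query or the results; it only brokers the contract, which is what keeps the environment federated.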
Issues & opportunities
• Data repositories
  • Many, diverse, dispersed
  • Limited capacity
  • Fragile sustainability
  • Reluctant to change unless there is a direct benefit
• Data registries
  • Data from repositories not always programmatically accessible
  • Low coverage when the wider context is considered
  • Struggle with updates
  • Lack of links between scientific data and file metadata
Recommendations
• Prioritise high impact, low cost
  • Findability
  • Minimum metadata to find data
  • Common types (datasets, samples)
  • Minimal service changes
  • Clear benefits
  • High coverage strategy
  • Combination of scientific and file metadata
• Leverage existing resources
  • Metadata registries
  • Data repositories
Many, dispersed, diverse
Several repositories
Different interfaces
Variable results
Indexing, integration, …?
Different formats for the same data
Molecular interactions data
Formats: PSI-XML, PSI-MITAB, BioPAX, RDF, Cytoscape, DAS
• Comprehensive
• Simple
• Generic
• Domain specific
• Structured
Why so many resources?
• Diverse data types
• Many communities
• Different ways to structure data
• Control, reputation, publication
• New funding normally only encourages doing something new
Are they sustainable?
• Just 20% have a sustained future*
Figure: funding and development against time (years 1 to 4)
* Merali Z. et al. Databases in peril. Nature 2005.
<div>
  <h1>Classic potato salad</h1>
  <div>
    Nutrition facts: <span>144 kcal</span>,
  </div>
  Ingredients:
  - <span>800g small new potato</span>
  - <span>3 shallot</span>
  . . .
Recipe
Nutrition
Calories
Ingredients
Title
<div itemscope itemtype="http://schema.org/Recipe">
  <h1 itemprop="name">Classic potato salad</h1>
  <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
    Nutrition facts: <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
Structured data markup for web pages
Formats: RDFa, JSON-LD, Microdata
With markup
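The slide shows the Microdata form. As a sketch, the same recipe could be expressed in JSON-LD, another of the three formats named here. Only the two ingredients visible on the slide are included; the variable names are assumptions for illustration.

```python
import json

# The same schema.org types and properties as the Microdata example,
# written as a JSON-LD document.
recipe = {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "name": "Classic potato salad",
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "144 kcal",
    },
    # The slide elides the remaining ingredients, so only these two appear.
    "recipeIngredient": ["800g small new potato", "3 shallot"],
}

json_ld = json.dumps(recipe, indent=2)

# JSON-LD is usually embedded in a page inside a script element.
script_tag = '<script type="application/ld+json">\n' + json_ld + "\n</script>"
```

Unlike Microdata, the JSON-LD block sits apart from the visible HTML, so a page template does not need to be restructured to carry it.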
BioSchemas
Specification on top of schema.org
Layer of constraints + documentation + extensions
Specification
Data model
Minimum information
Controlled vocabularies
Cardinality
Documentation
Examples
New (properties | types)
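A minimal sketch of what a layer of constraints on top of schema.org could look like in practice: minimum information, cardinality and a controlled vocabulary checked against a record. The PROFILE below is hypothetical and is not an actual Bioschemas profile.

```python
# Hypothetical profile: property -> cardinality and optional vocabulary.
PROFILE = {
    "name": {"min": 1, "max": 1},            # minimum information, exactly one
    "identifier": {"min": 1, "max": None},   # minimum information, one or more
    "keywords": {
        "min": 0,
        "max": None,
        "vocabulary": {"proteomics", "genomics", "metabolomics"},
    },
}


def check(record, profile):
    """Return a list of human-readable problems; empty means the record conforms."""
    problems = []
    for prop, rule in profile.items():
        values = record.get(prop, [])
        if not isinstance(values, list):
            values = [values]
        if len(values) < rule["min"]:
            problems.append(f"{prop}: expected at least {rule['min']}, found {len(values)}")
        if rule["max"] is not None and len(values) > rule["max"]:
            problems.append(f"{prop}: expected at most {rule['max']}, found {len(values)}")
        vocabulary = rule.get("vocabulary")
        if vocabulary:
            for value in values:
                if value not in vocabulary:
                    problems.append(f"{prop}: '{value}' is not in the controlled vocabulary")
    return problems


good = {"name": "Hypothetical dataset", "identifier": ["example:0001"], "keywords": ["proteomics"]}
bad = {"name": ["a", "b"], "keywords": ["astrology"]}

good_problems = check(good, PROFILE)
bad_problems = check(bad, PROFILE)
```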
Access to structured data
Google, Yahoo, Bing, Yandex, …
Data repository (×3), each with Bioschemas markup
Search engine
Registry
Samples, Phenotypes, Datasets, Beacons, Protein annotations, Data repositories, …
Pfam, HPA, EGA, Gene3D, BBMRI-NL Biobank, PDBe, BRENDA, UniProt, Beacons, Brassica IP, COPaKB, …
Biocatalogue, Biosharing, Biosamples, Identifiers.org, Beacon network, UKCRC TC, Bio.tools, DataMed, …
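A sketch of how a registry or search engine might pick up structured data from a repository page, assuming the page embeds JSON-LD in a script element. The page content and the JsonLdExtractor class are hypothetical; a real harvester would also need to handle Microdata and RDFa.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects every JSON-LD block found in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.records.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


# Hypothetical repository page; no network access is needed for the sketch.
PAGE = (
    "<html><head>"
    '<script type="application/ld+json">'
    '{"@context": "http://schema.org", "@type": "Dataset", "name": "Hypothetical dataset"}'
    "</script>"
    "</head><body><h1>Hypothetical dataset</h1></body></html>"
)

extractor = JsonLdExtractor()
extractor.feed(PAGE)
records = extractor.records
```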
SOA for Molecular interactions via PSICQUIC
Find Publish
Query
Client
Registry
Servers
Standard
PSICQUIC Service A
PSICQUIC Service B
PSICQUIC Service C
MIQL
Service metadata
Biological information
SOA: Service Oriented Architecture
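A sketch of the query step: the same MIQL query is sent to every service found through the registry. The service URLs are hypothetical, and the path layout and parameter names (format, firstResult, maxResults) reflect an assumption about the PSICQUIC REST interface that should be checked against its specification. The code only builds URLs and makes no network calls.

```python
from urllib.parse import quote, urlencode


def build_query_url(base_url, miql, result_format="tab25", first=0, max_results=50):
    """Build a REST query URL for one service (assumed path and parameters)."""
    path = "/search/query/" + quote(miql, safe="")
    params = urlencode(
        {"format": result_format, "firstResult": first, "maxResults": max_results}
    )
    return base_url.rstrip("/") + path + "?" + params


# Hypothetical services, standing in for what a registry lookup would return.
services = {
    "Service A": "https://service-a.example.org/psicquic/webservices/current",
    "Service B": "https://service-b.example.org/psicquic/webservices/current",
}

# Illustrative MIQL query; the same string is reused for every service.
miql = "identifier:P04637 AND species:9606"
urls = {name: build_query_url(base, miql) for name, base in services.items()}
```

Because every service implements the same standard, the client code does not change when a new service is published to the registry.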
SOA for beacons
Find Publish
Client
Registry
Servers
Beacon Service A
Beacon Service B
Beacon Service C
Service metadata
Query
Standard specification
Biological information
SOA: Service Oriented Architecture
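A sketch of the beacon pattern: each service answers only yes or no to whether it has seen a given allele, and a client asks every service listed in the registry. The Beacon class, the variants and the ask_all function are in-memory stand-ins, not the Beacon API.

```python
class Beacon:
    """Hypothetical beacon: answers presence queries without exposing records."""

    def __init__(self, name, variants):
        self.name = name
        self._variants = set(variants)

    def query(self, chromosome, position, allele):
        # Only a boolean leaves the service.
        return (chromosome, position, allele) in self._variants


# Stand-in for the registry: the list of services a client can discover.
beacon_registry = [
    Beacon("Service A", {("1", 1000, "A")}),
    Beacon("Service B", {("1", 1000, "A"), ("2", 2000, "T")}),
    Beacon("Service C", set()),
]


def ask_all(registry, chromosome, position, allele):
    """Send the same query to every beacon and collect the answers by name."""
    return {b.name: b.query(chromosome, position, allele) for b in registry}


answers = ask_all(beacon_registry, "2", 2000, "T")
```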