Global RDF Descriptors for Germplasm Data

Global RDF Descriptors for Germplasm Data. Vassilis Protonotarios, Agricultural Biotechnologist, PhD, Agro-Know, Greece. RDA 3rd Plenary Meeting, Dublin, Ireland. Agricultural Data Interoperability Group Meeting.

Upload: vassilis-protonotarios

Post on 27-Jan-2015

DESCRIPTION

Presentation delivered at the Agricultural Data Interoperability WG meeting during the RDA 3rd Plenary Meeting in Dublin, Ireland, on 26/3/2014. The presentation focuses mainly on the work done by the agINFRA project towards proposing a methodology for defining germplasm descriptors as RDF, building on the existing work of experts in the field and on existing efforts in this direction.

TRANSCRIPT

Page 1: Global RDF Descriptors for Germplasm Data

Global RDF Descriptors for Germplasm Data

Vassilis Protonotarios

Agricultural Biotechnologist, PhD

Agro-Know, Greece

RDA 3rd Plenary Meeting, Dublin, Ireland

Agricultural Data Interoperability Group Meeting

Page 2: Global RDF Descriptors for Germplasm Data

Background

Page 3: Global RDF Descriptors for Germplasm Data

Connecting the pieces

agINFRA Germplasm Working Group

Agricultural Data Interoperability IG

Germplasm Data Analysis

Agricultural linked data layer

Page 4: Global RDF Descriptors for Germplasm Data

The agINFRA project

• A project funded under the FP7 programme of the EC
• Consortium with expertise on
  – Technology / infrastructures
  – Data / data management

Combined to facilitate agricultural data sharing

More info at:

www.aginfra.eu

Page 5: Global RDF Descriptors for Germplasm Data

The agINFRA project

• Aims to enhance interoperability between agricultural data sources
  – Data sharing by
    • Metadata aggregation & linking data
    • Design and deploy the linked ag-data framework
  – Methodology for linking data
  – Provide the infrastructure needed
    • Both cloud- and grid-based services
    • Tools, APIs etc.

Page 6: Global RDF Descriptors for Germplasm Data

agINFRA major data types

agINFRA

Bibliographic

Agri Statistics & Economics

Educational

Germplasm

Soil data

Profiles

Raw data

Other?

Page 7: Global RDF Descriptors for Germplasm Data

Focusing on germplasm

Local Databases

National Databases

Aggregators

GENESYS

EURISCO

GBIF

Italian

Italian University

Italian research center

Chinese

Chinese research center

Data flow

Page 8: Global RDF Descriptors for Germplasm Data

Focusing on germplasm

Local Databases

National Databases

Aggregators

GENESYS

EURISCO

Italian

Italian University

Italian research center

Chinese

Chinese research center

Page 9: Global RDF Descriptors for Germplasm Data

The issue?

• Heterogeneity!
  – Data types
  – Data formats
  – Data management workflows
  – Standards used
  – Metadata exposure options
  – …
• Lack of connectivity with other data sources

Page 10: Global RDF Descriptors for Germplasm Data

The agINFRA Germplasm Working Group

Page 11: Global RDF Descriptors for Germplasm Data

The Germplasm Working Group

• Created in the context of the agINFRA project
• Initially included agINFRA stakeholders
  – now expanded to host all stakeholders

• The group is NOT a group of experts on germplasm data!

Page 12: Global RDF Descriptors for Germplasm Data

The scope of the agINFRA Germplasm WG

• Enable/enhance interoperability between germplasm databases
  – By developing the services for
    • exchanging their data and
    • delivering their data to other partners
• Focusing on three actions:
  1. Identify
  2. Organize & Reuse
  3. Propose

Page 13: Global RDF Descriptors for Germplasm Data

agINFRA Germplasm WG objectives

• IDENTIFY: collect all information related to germplasm data
  – People/groups
  – Namespaces (metadata, KOS)
  – Standards
  – Workflows
  – Events
• ORGANIZE & REUSE: engage all stakeholders & available resources, analyze existing standards, facilitate collaboration
• PROPOSE: linked data framework to connect data sources
  • facilitate data sharing between germplasm data sources

Page 14: Global RDF Descriptors for Germplasm Data

Germplasm related information

Data management workflows

Metadata schemas

Working groups in germplasm

Events (for connecting stakeholders)

KOS (ontologies, thesauri, vocabularies etc.)

Data exposure capabilities

Page 16: Global RDF Descriptors for Germplasm Data

The Germplasm WG wiki

• Central point of reference

• Freely accessible (no login required)

http://wiki.aginfra.eu/index.php/Germplasm_Working_Group

Page 17: Global RDF Descriptors for Germplasm Data

Key outcomes of the group (1)

Dossier on Germplasm Information:
– Major programs
– Major information systems and services
– agINFRA germplasm data sources (CGRIS & CRA)
– Core standards for germplasm information
– Plant nomenclature, taxonomies and ontologies
– Plant genomic resources
– Related references and links

• Freely available from the Germplasm Group wiki

Page 18: Global RDF Descriptors for Germplasm Data

Key outcomes of the group (2)

Page 19: Global RDF Descriptors for Germplasm Data

Key outcomes of the group (3)

• Speakers from key players in the biodiversity data field
  – GBIF, EURISCO, GENESYS, CGIAR, EGFAR, CRA etc.

• Aimed to provide the basis for the linked germplasm data framework

Page 20: Global RDF Descriptors for Germplasm Data

Existing work

Page 21: Global RDF Descriptors for Germplasm Data

DwC-G KOSs

• Germplasm Term Vocabulary
  • A vocabulary of terms for describing and annotating germplasm information resources
  – http://purl.org/germplasm/germplasmTerm#TERM
• Germplasm Type Vocabulary
  • List of controlled values for some of the germplasm terms
  – http://purl.org/germplasm/germplasmType#TYPE
• Germplasm ontology
  • to digitize and provide persistent identifiers for the terms contained within the PGR Descriptors publications
  – http://purl.org/germplasm/ontology
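As a rough illustration of how the two hash namespaces above are meant to be used, the sketch below builds a full term URI by appending a local name to the namespace base. The local name `exampleTerm` is a placeholder chosen for illustration, not a term confirmed by the slides.

```python
# Namespace bases taken from the slide, with the TERM / TYPE placeholders removed.
GERMPLASM_TERM_NS = "http://purl.org/germplasm/germplasmTerm#"
GERMPLASM_TYPE_NS = "http://purl.org/germplasm/germplasmType#"


def term_uri(namespace: str, local_name: str) -> str:
    """Return the full URI for a local name inside a hash namespace."""
    if not local_name or any(ch.isspace() for ch in local_name):
        raise ValueError("local name must be non-empty and contain no whitespace")
    return namespace + local_name


# 'exampleTerm' is a hypothetical local name, used only to show the pattern.
print(term_uri(GERMPLASM_TERM_NS, "exampleTerm"))
```

The same helper works for the type vocabulary by passing `GERMPLASM_TYPE_NS` instead.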

Page 22: Global RDF Descriptors for Germplasm Data

DwC-G linked data

Page 23: Global RDF Descriptors for Germplasm Data

DwC-SW

• An ontology using Darwin Core terms to make it possible to describe biodiversity resources in the Semantic Web.

https://code.google.com/p/darwin-sw

Page 24: Global RDF Descriptors for Germplasm Data

Bioversity Crop Descriptors

• Crop Descriptors
  – Provide an international format and a universally understood language for plant genetic resources data.
  – They are targeted at farmers, curators, breeders, scientists and users and facilitate the exchange and use of resources.
  – Information includes such details as the plant's height, flowering patterns and ancestral history.
• FAO/Bioversity Multi-crop Passport Descriptors (MCPD)
  – Originally published in 2001
  – Widely used as the international standard to facilitate germplasm passport information exchange.
  – Now expanded to include emerging documentation needs, this new version resulted from consultation with more than 300 scientists from 187 institutions in 87 countries.

Page 26: Global RDF Descriptors for Germplasm Data

Methodology: towards the RDF germplasm descriptors

Page 27: Global RDF Descriptors for Germplasm Data

Linked Data vocabularies

• Metadata vocabularies: Metadata sets, metadata element sets
  – they provide metadata elements to describe individual pieces of information in the data sets.
  – Example: Dublin Core is a vocabulary that prescribes the property dc:date for the publishing date of a document.
• Value vocabularies (KOS): Controlled vocabularies, authority data
  – they provide sets of values for (some of) the metadata elements.
  – Example: AGROVOC provides a set of values for agricultural topics that can be used as values for the dc:subject property.
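The difference between the two kinds of vocabulary can be seen in a pair of triples: the property comes from the metadata vocabulary, while the object of `dc:subject` is a concept URI from the value vocabulary. This is a minimal sketch that writes N-Triples by hand; the document URI and the concept ID `c_0000` are placeholders, and the AGROVOC base URI is stated here as an assumption rather than taken from the slides.

```python
DC = "http://purl.org/dc/elements/1.1/"        # Dublin Core element set
AGROVOC = "http://aims.fao.org/aos/agrovoc/"   # assumed AGROVOC concept namespace


def describe(document_uri: str, date: str, concept_id: str) -> list[str]:
    """Return N-Triples lines describing one document."""
    return [
        # Metadata vocabulary: the dc:date property with a plain literal value.
        f'<{document_uri}> <{DC}date> "{date}" .',
        # Value vocabulary: dc:subject points at a concept URI, not a string.
        f"<{document_uri}> <{DC}subject> <{AGROVOC}{concept_id}> .",
    ]


# Hypothetical document URI and placeholder concept ID.
for line in describe("http://example.org/doc/1", "2014-03-26", "c_0000"):
    print(line)
```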

Page 28: Global RDF Descriptors for Germplasm Data

LOD guidelines (Berners-Lee, 2006)

1. “Use URIs as names for things”
  – concepts / values in value vocabularies and classes and properties in description vocabularies, as well as the vocabularies themselves, have to be identified by URIs.

2. “Use HTTP URIs so that people can look up those names”
  – the URIs for concepts / values, classes and properties, as well as vocabularies, have to be resolvable as HTTP URLs.

3. “When someone looks up a URI, provide useful information”
  – the URLs for concepts, classes and properties, as well as vocabularies, have to return an HTML page with useful information when requested by browsers, or RDF when requested by RDF software; in addition, vocabularies should be available for querying behind a SPARQL endpoint.

4. “Include links to other URIs, so that more things can be discovered”
  – the URIs of concepts, classes and properties should whenever possible be linked to URIs in other vocabularies, for instance as a close match of another concept or a sub-class of another class.
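Guideline 3 is usually implemented with HTTP content negotiation: the server inspects the `Accept` header and returns HTML to browsers and an RDF serialization to RDF software. The sketch below shows only the selection step; the list of supported media types and the HTML fallback are assumptions for illustration (a real server might answer 406 instead).

```python
# Representations this hypothetical vocabulary server can produce.
SUPPORTED = ("text/html", "text/turtle", "application/rdf+xml")


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        items.append((media_type, q))
    # sorted() is stable, so entries with equal q keep the client's order.
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose_representation(accept_header: str) -> str:
    """Pick the media type to serve for a term URI."""
    for media_type, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return "text/html"
    return "text/html"  # assumed fallback when nothing matches
```

A browser-style header such as `text/html,application/xhtml+xml,*/*;q=0.8` selects the HTML page, while an RDF client sending `text/turtle` gets Turtle.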

Page 29: Global RDF Descriptors for Germplasm Data

Proposed methodology

1. Analyze metadata schemas & KOSs used to describe germplasm resources

2. Define attributes & vocabularies that can be used to expose germplasm resources in linked data format.

3. Provide a set of recommendations for the exposure of germplasm resources as linked data

4. Embed the recommendations in the data infrastructure of agINFRA – to allow the exposure of germplasm resources as LOD.

Page 30: Global RDF Descriptors for Germplasm Data

The next steps

Page 31: Global RDF Descriptors for Germplasm Data

Application of the linked agricultural data framework in germplasm

1. Definition of base schema
  – Darwin Core for Germplasm to be used as base schema
    • Already available in SKOS
    • Vocabularies published as linked data
  – Germplasm Vocabularies
    • Germplasm Term Vocabulary
    • Germplasm Type Vocabulary
  – Germplasm ontology

2. Publication of local classifications / lists for germplasm as LOD KOSs
  – if possible use DwC Types directly

Page 32: Global RDF Descriptors for Germplasm Data

Application of the linked agricultural data framework in germplasm

3. Linking of terms in new KOSs to terms in existing KOSs
  – e.g. DwC Types, AGROVOC

4. Link CAAS and CRA germplasm records using scientific name > AGROVOC

5. Collaboration with technical partners
  – technical specifications on how to write procedures that extract the relevant data from the database and "triplify" them (i.e. both serialize them as RDF and use URIs instead of just strings whenever possible, also linking to AGROVOC URIs when possible).
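The "triplify" step in item 5 can be sketched as a function that turns one database row into N-Triples, emitting a URI wherever a lookup succeeds and a literal otherwise. Everything specific here is an assumption for illustration: the row field names, the `http://example.org/accession/` base, the lookup table, and the placeholder concept ID `c_0000`. Only the Darwin Core and Dublin Core namespaces are standard.

```python
from urllib.parse import quote

DWC = "http://rs.tdwg.org/dwc/terms/"          # Darwin Core terms namespace
DC = "http://purl.org/dc/elements/1.1/"        # Dublin Core element set
BASE = "http://example.org/accession/"         # hypothetical base for record URIs

# Hypothetical lookup from scientific name to an AGROVOC concept URI.
# 'c_0000' is a placeholder, not a real concept ID.
NAME_TO_AGROVOC = {"Zea mays": "http://aims.fao.org/aos/agrovoc/c_0000"}


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def triplify(row: dict) -> list[str]:
    """Serialize one database row as N-Triples lines."""
    subject = f"<{BASE}{quote(str(row['id']), safe='')}>"
    name = row["scientific_name"]
    triples = [f'{subject} <{DWC}scientificName> "{escape_literal(name)}" .']
    concept = NAME_TO_AGROVOC.get(name)
    if concept:
        # Use a URI instead of a string whenever a match is available.
        triples.append(f"{subject} <{DC}subject> <{concept}> .")
    return triples


for line in triplify({"id": "IT-001", "scientific_name": "Zea mays"}):
    print(line)
```

Rows whose name has no match still yield the literal `dwc:scientificName` triple, so no record is dropped from the output.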

Page 33: Global RDF Descriptors for Germplasm Data

…and more next steps (optional)

• Update the existing analysis with new data
• Collect new user requirements
• (Re)define the mappings between metadata schemas and KOSs (if needed)
• Fine-tune the linked data approach

Page 34: Global RDF Descriptors for Germplasm Data

Time plan

Page 35: Global RDF Descriptors for Germplasm Data

Time plan

• June 2014: Germplasm vocabularies
  – Metadata model: Darwin Core SW + DwC-G as the reference
    • Publish local classifications / lists for germplasm as LOD KOSs (if possible use DwC Types directly)
    • Link terms in new KOSs to terms in existing KOSs (e.g. DwC Types, AGROVOC)
    • Germplasm phenotypic values / classifications linked to Phenotypic Ontology terms?

Page 36: Global RDF Descriptors for Germplasm Data

Time plan

• August 2014: Germplasm RDF
  – Expose some RDF output and API access for germplasm datasets (basic DwC RDF, essentially basic passport descriptors).
  – Mandatory data for interlinking: scientific name OR AGROVOC term
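The "scientific name OR AGROVOC term" requirement amounts to a simple check that could run before a record is exposed. This is a minimal sketch; the field names `scientific_name` and `agrovoc_term` are hypothetical and would depend on each provider's schema.

```python
def can_interlink(record: dict) -> bool:
    """True when the record carries a scientific name or an AGROVOC term."""
    name = (record.get("scientific_name") or "").strip()
    term = (record.get("agrovoc_term") or "").strip()
    return bool(name or term)


# Hypothetical records: the second one has neither field filled in.
records = [
    {"id": "IT-001", "scientific_name": "Zea mays"},
    {"id": "IT-002", "scientific_name": "  ", "agrovoc_term": None},
]
print([r["id"] for r in records if can_interlink(r)])
```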

Page 37: Global RDF Descriptors for Germplasm Data

Time plan

• October 2014: Consuming data from agINFRA services and components
  – Link CGRIS and CRA germplasm records using scientific name > AGROVOC
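One way to read "scientific name > AGROVOC" is as a join: each record's scientific name is resolved to an AGROVOC concept, and records from the two sources that resolve to the same concept are linked. The sketch below assumes that reading; the record layout, the name-to-concept table, and the placeholder concept URI are all hypothetical.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Collapse whitespace and lowercase a scientific name for matching."""
    return " ".join(name.split()).lower()


def link_records(source_a, source_b, name_to_concept):
    """Pair record IDs from two sources that resolve to the same concept URI."""
    by_concept = defaultdict(list)
    for rec in source_b:
        concept = name_to_concept.get(normalize(rec["scientific_name"]))
        if concept:
            by_concept[concept].append(rec["id"])
    links = []
    for rec in source_a:
        concept = name_to_concept.get(normalize(rec["scientific_name"]))
        # .get() avoids creating empty entries for unmatched concepts.
        for other_id in by_concept.get(concept, []):
            links.append((rec["id"], concept, other_id))
    return links


# Hypothetical data; 'c_0000' is a placeholder concept ID.
concepts = {"zea mays": "http://aims.fao.org/aos/agrovoc/c_0000"}
cgris = [{"id": "CN-1", "scientific_name": "Zea  mays"}]
cra = [{"id": "IT-1", "scientific_name": "zea mays"},
       {"id": "IT-2", "scientific_name": "Triticum aestivum"}]
print(link_records(cgris, cra, concepts))
```

Matching on the concept URI rather than the raw string means spelling and spacing variants of a name still link, as long as the lookup table covers them.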

Page 38: Global RDF Descriptors for Germplasm Data

Source: http://verastic.com/social/why-do-people-not-say-thank-you.html
