

Deliverable D4.4

Project Title: Building data bridges between biological and medical infrastructures in Europe

Project Acronym: BioMedBridges

Grant agreement no.: 284209

Research Infrastructures, FP7 Capacities Specific Programme; [INFRA-2011-2.3.2.] “Implementation of common solutions for a cluster of ESFRI infrastructures in the field of "Life sciences"”

Deliverable title: Identification of feasible BioMedBridges pilots for semantic web integration

WP No.: 4

Lead Beneficiary: 1: EMBL

WP Title: Technical integration

Contractual delivery date: 30 June 2013

Actual delivery date: 17 July 2013

WP leader: Ewan Birney, 1: EMBL


BioMedBridges Deliverable D4.4

Contributing partner(s): 1: EMBL; 4: STFC; 5: UDUS; 7: TUM-MED; 11: HMGU; 13: VUMC

Contents

1 Executive summary
2 Project objectives
3 Detailed report on the deliverable
  3.1 Background
  3.2 Description of Work
    3.2.1 Collaborative Activities
    3.2.2 General Technical Strategy
    3.2.3 Development of Partner Specific Roadmaps
  3.3 Knowledge Exchange Workshop
  3.4 Next Steps
Background
List of Appendices


Executive summary

The aim of this deliverable is to identify the semantic web services to be implemented in D4.6 of BioMedBridges, and to plan how they will be developed.

Project objectives

With this deliverable, the project has reached, or the deliverable has contributed to, the following objectives:

No. | Objective | Yes No
1 | Implement shared standards from work package 3 to allow for integration across the BioMedBridges project | X
2 | Expose the integration via use of REST based WebServices interfaces optimised for browsing information | X
3 | Expose the integration via use of REST based WebServices interfaces optimised for programmatic access | X
4 | Expose appropriate meta-data information via use of Semantic Web Technologies | X
5 | Pilot the use of semantic web technologies in high-data scale biological environments | X


Detailed report on the deliverable

1.1 Background

The ESFRI life sciences infrastructures are gathering more and more data. Within these data lie answers to questions such as: “Is the gene/region I have identified in my animal model relevant to human health?” Integrating the data to answer such questions takes many hours of scientists’ time. An even bigger challenge is to answer questions that scientists have not asked: for example, a clinical scientist who wants to read more about the contribution of atorvastatin in treating diabetes might not search for mouse models, but results for the knockout strain KK/Ay are relevant. These use cases are encapsulated in the BioMedBridges use case WPs, and are the scientific driver for the work described here. D4.4 provides the infrastructural roadmap by which disparate data sources can be annotated, missing data identified, and scientific discovery advanced.

Research data starts with the output from some instrument. It gains meaning when it is annotated, with both scientific interpretation and provenance information. Each discipline within the life sciences has its own standard methods, and has defined standard schemata of the annotations that are relevant in the discipline. However, there always turn out to be extra annotations that are needed, because techniques evolve, research questions change, and separate sub-disciplines begin to interact. Consequently, the schemata used across the data landscape are constantly evolving. Strong schema languages that coerce the data to match the schema, whether DDL, XSD, or UML, have proved not to work well in this environment. European e-Infrastructure projects are increasingly turning to the “linked data” approach to address such challenges with semantic web technologies.

The semantic web uses RDFS and OWL, schema languages that are designed from the ground up to connect disparate data sources. This approach may prove to be a good fit to emerging challenges in the life sciences. If so, then future data repositories can be planned from the start with integration as an aim. However, there are some challenges: this approach is unproven, it is not clear whether the available tools scale to very large datasets, and the developer community is not experienced in using them.

BioMedBridges will therefore test the suitability of a semantic web approach to the task of integrating research data from the different ESFRI biomedical infrastructures. The purpose of the work presented here is to plan this activity, to generate actionable roadmaps with a common goal, and to ensure there is sufficient expertise to deliver the infrastructural solutions needed.

The scope of the current plan is the integration of public services, and integration of a single service needing authentication with public services. Later in the project, we will combine this semantic web approach with Single Sign On facilities to be provided by WP5. The roadmap implementation (D4.6) will be informed by the work of WP3 on choice and use of ontologies, and on use and re-use of identifiers.

1.2 Description of Work

1.2.1 Collaborative Activities

The collaborative activities identified:

● Candidate datasets that may be linked together in order to support the scientific use cases

● Specific plans for service implementations reflecting dependencies between services and data resources

● A continuum of experience levels among the partners with respect to semantic web technologies

● Risks surrounding availability and suitability of data/metadata

● Common understanding of the technology and a broad technical strategy.

1.2.2 General Technical Strategy

The infrastructures represented in BioMedBridges are at different stages of technical maturity and experience in their use of semantic web technology. For the purposes of this project we classified these into Mature/Intermediate/New groups (Table 1) to manage the challenges in working with semantic web technologies and to inform the process of roadmap development in very different domains and scenarios.

Many issues were considered in the work towards this deliverable, including:

● Expertise of personnel

● Access to data

● Institutional infrastructure: access to servers, firewall set up, etc.

● Data volume

● Data privacy

● Available ontologies for semantic description.

Each partner completed a roadmap for their domain and considered the issues listed above in determining what data could be exposed, a suitable technical strategy for exposure, and the timeline for delivery of the implementation for D4.6.

Table 1. Classification of infrastructures based on semantic web experience for D4.4

Level of technical maturity on semantic web technology | Description
Mature | Semantic web expertise from multiple projects, databases and technology evaluation experience, e.g. ELIXIR has provided an RDF service for the Gene Expression Atlas
Intermediate | An existing resource and semantic web experience from a few projects, e.g. Instruct has added an RDF interface to PiMS
New | Minimal prior semantic web expertise, and a new resource and/or one still under development


In order to leverage experience where it exists and minimise the risks inherent in novel technology projects, a two-stage delivery is planned:

1 ELIXIR will establish mature services, basic best practices and technical guidelines based on prior experience with modelling semantic web data and benchmarking technology

2 Infrastructures will implement specific pilot projects in parallel, according to their individual roadmaps, using the technical guidelines and outcomes from the Knowledge Exchange workshop (D12.1). Note that UMCG representing BBMRI will not receive resources for D4.6, but has offered a contribution in kind and their roadmap is included in this deliverable.

ELIXIR’s role in D4.6 will therefore include forming a blueprint for other infrastructures to build upon. This first stage is intended to leverage the experience gained from recent projects at EMBL-EBI, SIB, STFC and OpenPhacts (http://www.openphacts.org) that have delivered semantic web components at various scales and in different institutional environments. The roadmap for this stage (Appendix 1) will include:

● Examples of technical architectures suitable for operating semantic web services. This is intended to illustrate the various software components and usage considerations.

● Case studies of RDF transformation projects from Relational Databases (RDBMS) such as UniProt, Expression Atlas and ChEMBL. Several such services have already been made available in RDF form or are currently undergoing this activity. The lessons learned from these experiences will be presented in case studies.

● Comparison of triple store solutions from major vendors. This includes performance benchmarking, along with ancillary technical considerations such as data replication and data loading mechanisms.

● Technical guidance documents for best practice in areas such as URI selection.

● Guidance for integrating with external data. This might include different strategies for the use of or mapping to external ontologies.

● Development of a generic model for expressing dataset provenance. Such metadata are not specific to any particular domain, and developing a common model is therefore feasible and will minimise duplication of effort.
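As an illustration of what a domain-independent provenance record might look like, the following Python sketch emits a few provenance statements about a dataset as N-Triples. The use of Dublin Core Terms properties, the function name `provenance_triples` and all `http://example.org/` URIs are assumptions made for this example only; the deliverable does not specify the vocabulary of the generic model.

```python
from datetime import date

DCT = "http://purl.org/dc/terms/"                   # Dublin Core Terms namespace
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"  # XML Schema date datatype


def provenance_triples(dataset_uri, title, source_uri, created):
    """Return N-Triples lines stating a dataset's title, source and creation date."""
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'<{dataset_uri}> <{DCT}title> "{escaped}" .',
        f"<{dataset_uri}> <{DCT}source> <{source_uri}> .",
        f'<{dataset_uri}> <{DCT}created> "{created.isoformat()}"^^<{XSD_DATE}> .',
    ]


lines = provenance_triples(
    "http://example.org/dataset/expression-rdf",  # hypothetical dataset URI
    "Example expression data in RDF",             # illustrative title
    "http://example.org/db/expression",           # hypothetical source database
    date(2013, 10, 1),                            # illustrative date
)
```

Because none of these properties refer to biological content, the same function could in principle describe a dataset from any of the infrastructures, which is the point made in the last bullet.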

By following this schedule and aligning partner specific roadmaps to this, the pilot projects will be developed synchronously and knowledge will be shared efficiently. This will enable more effective collaboration between infrastructures in addressing common issues. To support this, a technical workshop will be co-located with the BioMedBridges Annual General Meeting in March 2014. By this time, partners will be working towards D4.6 and are therefore likely to benefit from practical discussion of any technical or semantic issues related to their implementation(s).

1.2.3 Development of Partner Specific Roadmaps

Partners identified discipline-specific pilots that are dependent on the integration of data or metadata between domains in order to support relevant scientific use cases. These address a variety of integration tasks, ranging from discoverability to bridging biological semantics. The pilots were discussed amongst partners in scientific and technical contexts within WP4 teleconferences and in collaboration with the use case work packages.

There exist several developmental paths towards the provision of semantic web components, and the suitability of each will vary depending on the nature of the datasets, the background technical infrastructure they depend on, and how they will be integrated with other data. For example, small datasets that have a well-defined relational database model might be best served through an on-demand conversion to RDF. Others might require an independent storage mechanism in order to handle greater complexity or achieve acceptable performance. Despite these differences, significant commonalities between strategies exist. However, some partners may lack the experience in semantic web technologies to develop a credible detailed roadmap for implementation.
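The on-demand path for small relational datasets can be illustrated with a minimal Python sketch that turns the rows of one table into N-Triples at request time. The table `sample`, its columns, the function names and the base URI `http://example.org/` are hypothetical; table and column names are assumed to be trusted identifiers, and all values are written as plain string literals without datatypes.

```python
import sqlite3
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical base URI, not a real service


def literal(value):
    """Serialise a value as an N-Triples plain string literal."""
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    return f'"{text}"'


def rows_to_ntriples(conn, table, key):
    """Yield one triple per non-NULL, non-key column value of every row.

    The subject URI is built from the table name and the key column value;
    each remaining column name becomes a predicate URI.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"')
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{quote(str(record[key]), safe='')}>"
        for column, value in record.items():
            if column == key or value is None:
                continue
            predicate = f"<{BASE}schema/{table}#{column}>"
            yield f"{subject} {predicate} {literal(value)} ."


def demo():
    """Build a two-row in-memory table and convert it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample (id TEXT PRIMARY KEY, tissue TEXT, donor_age INTEGER)"
    )
    conn.execute("INSERT INTO sample VALUES ('S1', 'liver', 54)")
    conn.execute("INSERT INTO sample VALUES ('S2', NULL, 61)")
    triples = list(rows_to_ntriples(conn, "sample", "id"))
    conn.close()
    return triples
```

Nothing is stored twice in this arrangement: the relational database remains the single source and the RDF view is regenerated on each call, which is why the approach suits small datasets better than large ones.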


Two partners (INSTRUCT and ELIXIR) had a special role in the planning of roadmaps. Having the most expertise, they were able to both inform and guide the other partners by providing a high level ‘Common Semantic Roadmap’ (Appendix 2) which is visualised as a workflow in Figure 1. This roadmap illustrates the tasks likely to be common between individual partners’ efforts along with guideline timescales, and draws on recent experiences of conducting similar RDF conversion activities within ELIXIR and INSTRUCT. Each step in the workflow shown in Figure 1 represents discussion and evaluation performed by all partners in the application of this technology to their domain.

Figure 1. A process diagram of the steps involved in developing and deploying semantic web technologies as an infrastructural solution used to inform roadmap development for D4.4.


1.3 Knowledge Exchange Workshop

In concert with WP12, we held the “BioMedBridges Knowledge Exchange Workshop: Practical Solutions with Ontologies” at EMBL-EBI on 4-5 March 2013, with presenters from EMBL-EBI, STFC and external experts. This workshop introduced some of the most widely used ontologies for the life sciences. It familiarized attendees with various approaches to using these ontologies for data curation and integration. The content covered the role of ontologies in the wider framework of interoperability, such as the semantic web framework and the use of RDF to describe data. BioMedBridges partners presented use cases to illustrate both proven and experimental approaches to annotating biomedical data by combining existing ontologies on an 'as needs' basis. Details of the workshop are available at http://www.ebi.ac.uk/training/course/BMBontology and the outcomes will be reported in detail under D12.1.

1.4 Next Steps

The next stage of the work is to implement the plan contained in this deliverable and the attached Appendix roadmaps, thus delivering the semantic web components of D4.6. This is in progress and currently on schedule. The roadmaps are hosted on the project document management system and are updated as work progresses.


Background

This deliverable relates to WP 4; background information on this WP as originally indicated in the description of work (DoW) is included below.

WP 4 Title: Technical Integration
Lead: Ewan Birney (EMBL)
Participants: EMBL

In work package 4 we will implement a federated access system to the diverse data sources in BioMedBridges. This will focus on providing access to data or metadata items which utilise the standards outlined in WP 3. Experience across the BioMedBridges partners is that executing a federated access system, in particular a federated query system, is complex for both technological and social reasons. Therefore we will be using an escalating alignment/engagement strategy where we focus on technically easier and semantically poorer integration at first and then progressively increase the sophistication of the services. In each iteration, we will be using biological use cases which are aligned to the capabilities of the proposed service, thus providing progressive sophistication to the suite of federated services.

Our first iteration involves using established REST based technology to provide user-browsable visual integration of information. This will be useful for both summaries of data rich resources (such as Elixir) and summaries of ethically restricted datasets where only certain meta-data items are public (such as BBMRI, ECRIN and EATRIS). We will then progress towards lightweight distributed document and query lookups, where the access for ethically restricted data will incorporate the results of WP 5. Finally, at the outset of the project, we will explore exposure of, in particular, meta-data sets via RDF compatible technology, such as SPARQL, and the presence of the technology watch WP 11 will provide recommendations for other emerging technologies to use, aiming for the semantically richest integration.

Work package number: WP4
Start date or starting event: month 1
Work package title: Technical Integration
Activity Type: RTD

Participant number | Person-months per participant
1: EMBL | 69
4: STFC | 40
5: UDUS | 38
6: FVB | 0
7: TUM-MED | 37
9: ErasmusMC | 15
11: HMGU | 32
13: VUMC | 37

Objectives

1. Implement shared standards from WP 3 to allow for integration across the BioMedBridges project
2. Expose the integration via use of REST based WebServices interfaces optimised for browsing information
3. Expose the integration via use of REST based WebServices interfaces optimised for programmatic access
4. Expose appropriate meta-data information via use of Semantic Web Technologies
5. Pilot the use of semantic web technologies in high-data scale biological environments.

Description of work and role of participants

We will provide a layered, distributed integration of BioMedBridges data using latest technologies. A key aspect to this integration will be the internal use of standards, developed in WP 3 which will provide the points of integration between the different data sources. The use of common sample ontologies (WP 3) will provide integration between biological sample properties, such as cell types, tissues and disease status, in particular bridging the Euro-BioImaging, BBMRI, Elixir and Infrafrontier projects. The use of Phenotype based ontologies will provide individual and animal level characterisation which, when these can be associated with genetic variation, will provide common genotype to phenotypic links, and this will be used to bridge the ECRIN, EATRIS, INSTRUCT, BBMRI, Infrafrontier and Elixir Projects. The use of environmental sample descriptions and geolocation tags will bridge between EMBRC, ECRIN, ERINHA, EATRIS and Elixir. The use of chemical ontologies will help bridge between EU-OPENSCREEN, ECRIN, Euro-BioImaging, INSTRUCT and Elixir. By applying these standards in the member databases (themselves often internally federated) we will create a data landscape that theoretically can be traversed, data-mined and exploited.

To expose this data landscape for easy use, we will deploy a variety of different distributed integration technologies; these technologies are organised in a hierarchy where the lowest levels are the semantically poorest, but easiest to implement, whereas the highest levels potentially expose all information in databases which are both permitted for integration (some are restricted for ethical reasons, see WP 5) and can be described using common standards. We will develop software with aspects appropriate for the distributed nature of this project taken from agile engineering practices, such as rapid iterations between use cases and partial implementation.
In particular we will be using the enablement/alignment strategy (Krcmar H., Informationsmanagement, Springer) to ensure that the use cases that drive the project are aligned to feasible capabilities that can be delivered. The work package will be implemented in a collaborative manner across the BMSs, with frequent physical movement of individuals. The proposed technologies are:

1. REST-based “vignette” integration, allowing presentation of information from specific databases in a human readable form. An example is shown in Figure 1. These resources allow other websites to “embed” live data links with key information. This infrastructure would then be used to provide browsers that, on demand, bridge between the different BioMedBridges groups. For example, information which can be organised around a gene or a chemical compound would be presented across the BioMedBridges project.

2. Web service based “query” integration, where simple object queries across distributed information resources can be used to explore a set of linked objects using the dictionaries and ontologies present. Each request will return a structured XML document.

3. Scalable semantic web based technology. We are confident that semantic based technology can work for the rich but low data volume meta data (e.g. sample information) which we will expose using semantic web technologies such as RDF and SPARQL. However, it is unclear whether this scales to the very large number of data items or numerical terms in the BioMedBridges databases (such as SNP sets or numerical results from clinical trials). We will pilot a number of semantic web based integrations of datasets, using RDF based structuring of datasets.

In the latter phases of the project we will look to align these solutions to other broader standards in the eScience community, taking input from the Technology Watch (WP11) group. We hope that in many cases our technology choice will already have been informed by alignment to future eScience technology (e.g. RDF/SPARQL), so this may only require appropriate registration/publication of our resources. Where unforeseen but useful technologies are developed we will build systematic connections from these BioMedBridges federation technologies to other federation technologies.

List of Appendices

Appendix 1: Roadmap for a Supporting Blueprint
Appendix 2: Common Road Map For Semantic Web Work
Appendix 3: Road Map for Semantic Web Work by D4.6 Partners
Appendix 4: Road Map for In Kind Contribution


Appendix 1: Roadmap for a Supporting Blueprint

This roadmap details the milestones for stage 1 of the delivery plan, as executed by ELIXIR.

Background: Central to ELIXIR will be a solid platform for the integration and exchange of data between the ELIXIR hub, ELIXIR nodes, BioMedBridges partners and wider European stakeholders. RDF technology has the potential to provide a method of data integration that is open, inclusive and flexible. As part of ELIXIR, EMBL-EBI is therefore already engaged in the implementation of RDF transformation pilots for some datasets.

Objective: The focus of ELIXIR’s plan is to develop a blueprint to aid less experienced partners in the provision of RDF infrastructure endpoints, lowering the barrier to entry and accelerating adoption. This will include guidelines for boilerplate components and the various factors impacting on interoperability, such as an example architectural model and recommendations for software and/or hardware.

Additional integration paths: via UniProt to INSTRUCT; Infrafrontier

Task | Timeframe
RDF transformation of UniProt, Expression Atlas, ChEMBL databases | September 2013
SPARQL endpoint(s) for Expression Atlas, ChEMBL datasets | September 2013
Develop guidelines for constructing URIs | October 2013
Develop guidelines for selection of external cross reference URIs | October 2013
Develop a generic data model for representing dataset provenance | October 2013
Develop case study that supports the guidelines (e.g. Gene Expression Atlas, Reactome, ChEMBL) | November 2013
Comparison report for triple store software | November 2013
Exemplar architecture overview of a software stack for semantic data provision | December 2013
Guidelines for which kinds of data to represent semantically | March 2014
Develop guidelines for selecting an ontology | June 2014
Develop guidelines for ontology mapping | June 2014


Appendix 2: Common Road Map For Semantic Web Work

Background: This describes a generic series of steps that may be involved in making an existing dataset available via Semantic Web technologies.

Objective: Converting data into RDF and making it available to be integrated with other datasets via SPARQL.

Roadmap milestone | Task | Timeframe
1 | Collect and document specific example queries from users | November - December 2013
2 | Write an RDF schema by hand, in N3 (or Turtle), for a data model with fewer than 100 items (classes+attributes+associations) | Jan 2014
3 | Write a program that converts a large formal schema in DDL, annotated Java classes, or XSD to N3 | Feb-March 2014
4 | SemWeb pilot workshop at AGM | March 2014
5 | Optimize RDF schema for the above example queries (may involve application of OWL rules) | April-June 2014
6 | Create translation rules from one RDF schema to SIO (subClassOf, subPropertyOf, seeAlso) | July 2014
7 | Install, test and benchmark a SPARQL endpoint | Aug-Sep 2014
8 | Write a program to dump the contents of a particular database into a triple store | Oct-Nov 2014
9 | Test program | Dec 2014
10 | Write up work for D4.6 deliverable report | Jan-March 2015
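Milestone 3 (converting a formal DDL schema into an RDF schema) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the partners' converter: it lets SQLite parse the DDL, so only statements SQLite accepts will work; it maps each table to an rdfs:Class and each column to an rdf:Property and ignores foreign keys and datatypes; it assumes table and column names are simple identifiers; and the namespace `http://example.org/schema/`, the function name `ddl_to_turtle` and the example tables are hypothetical.

```python
import sqlite3

BASE = "http://example.org/schema/"  # hypothetical namespace for the generated terms

PREFIXES = (
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    f"@prefix ex: <{BASE}> .\n"
)


def ddl_to_turtle(ddl):
    """Translate CREATE TABLE statements into a simple RDFS vocabulary in Turtle.

    SQLite parses the DDL; every table becomes an rdfs:Class and every
    column an rdf:Property whose rdfs:domain is that class.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    lines = [PREFIXES]
    for table in tables:
        lines.append(f"ex:{table} a rdfs:Class .")
        # table_info rows are (cid, name, type, notnull, dflt_value, pk)
        for column in conn.execute(f'PRAGMA table_info("{table}")'):
            name = column[1]
            lines.append(
                f"ex:{table}_{name} a rdf:Property ; rdfs:domain ex:{table} ."
            )
    conn.close()
    return "\n".join(lines) + "\n"


# Invented two-table schema used only to exercise the function.
EXAMPLE_DDL = """
CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE construct (id INTEGER PRIMARY KEY, target_id INTEGER REFERENCES target(id));
"""

turtle = ddl_to_turtle(EXAMPLE_DDL)
```

A generated schema of this kind is only a starting point for milestones 5 and 6, where it would be optimised against the example queries and mapped to shared vocabularies.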


Appendix 3: Road Map for Semantic Web Work by D4.6 Partners

1: BBMRI (TUM-MED)

Background: The BBMRI.eu catalogue provides a comprehensive overview of the European Biobanking landscape. It is based on questionnaires that cover many aspects of biobanking (Topics of interest, Disease groups, Origin and use of samples, Study design and recruitment, Data sources, Confidentiality, Consent, Access, Principal variables of, sample storage conditions, etc.). In particular, it also contains data about the number of collected samples and their material types. The BBMRI.eu catalogue was originally developed during the BBMRI preparatory phase.

Objective: To develop a semantic service endpoint for the BBMRI.eu catalogue that allows queries for disease groups to get biobanks with sample types and number of samples.

Additional integration paths: Via BBMRI.eu, BioSD

Task | Owner | Timeframe
Semantic model describing BBMRI.eu catalogue | TUM-MED | Jan 2014
RDF transformation | TUM-MED | March 2014
Building a SPARQL endpoint for BBMRI.eu catalogue | TUM-MED | Sep 2014
Query Interface | TUM-MED | June 2015
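The kind of query named in the objective (disease group in, biobanks with sample types and sample counts out) might look like the one built by the Python sketch below. The endpoint URL, the `ex:` vocabulary and every property name in it are hypothetical, since the semantic model for the catalogue had not yet been defined; only the `query` request parameter is taken from the SPARQL Protocol.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/bbmri/sparql"  # hypothetical endpoint URL
EX = "http://example.org/bbmri/schema#"       # hypothetical vocabulary namespace


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def biobank_query(disease_group):
    """Build a SELECT query listing biobanks, material types and sample counts."""
    return (
        f"PREFIX ex: <{EX}>\n"
        "SELECT ?biobank ?materialType ?sampleCount WHERE {\n"
        "  ?collection ex:heldBy ?biobank ;\n"
        f'              ex:diseaseGroup "{escape_literal(disease_group)}" ;\n'
        "              ex:materialType ?materialType ;\n"
        "              ex:sampleCount ?sampleCount .\n"
        "}\n"
        "ORDER BY ?biobank"
    )


def request_url(disease_group):
    """Return a GET URL that passes the query in the protocol's `query` parameter."""
    return ENDPOINT + "?" + urlencode({"query": biobank_query(disease_group)})


url = request_url("Diabetes mellitus")  # illustrative disease group label
```

Matching on a plain label keeps the example short; a catalogue that encodes disease groups with ontology terms would match on the term URI instead.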


2: Instruct (STFC)

Background: PiMS is a laboratory information management system for use in protein production laboratories, managing the stages from the selected target protein to the production of soluble protein. PiMS development is part of a larger vision to provide a unified and extensible set of software tools for structural biology, offering seamless data transfer and a consistent user experience, from target selection to the interpretation of the structure.

Objective: To provide a seamless link between the experimental data in PiMS and other research infrastructures.

Additional integration paths: UniProt

Task | Owner | Timeframe
Generate RDF Schema representing the relational database model of PiMS | STFC (NK) | Completed
Mapping of entities in PiMS Schema to publicly available, well-used ontologies | STFC (NK) | April 2014
Report on the feasibility of semantic web integration | STFC (NK) | April/May 2014
Integrate with UniProt | STFC (NK) | July 2015
SPARQL end-point | STFC (NK) | Dec 2014


3:Infrafrontier (HMGU) Background: Systemic phenotyped mice data is a valuable data source and especially large scale projects like IMPC (International Mouse Phenotyping Consortium) can foster research into human diseases. Nevertheless, the current technical implementation does not allow integration with other resources. Moreover, systemic phenotyped mice data in a RDF format would allow straightforward technical application of the outcomes from the PhenoBridge workpackage (WP7). Objective: Development of an RDF scheme for systemic phenotyped mice data. Integrate mice data with the resources Gene Expression Atlas, Metabolights and Reactome supported by outcomes of work package 7 (PhenoBridge). Integration paths: WP7

Task | Owner | Timeframe
Investigate existing approaches present in IMPC | HMGU | Q3 2013
Map the annotated systemic phenotyped mice data to an ontology |  | Completed
Investigate D2RQ to create a pilot RDF model of systemic phenotyped mice data | HMGU | Q3 2013
Pilot RDF model | HMGU | Q4 2013
Pilot RDF endpoint implementation | HMGU | Q1 2014
Semantic web pilot at AGM | HMGU | March 2014
Rich RDF model which ideally allows representation of parameter sets of systemic phenotyped mice data | HMGU | Q4 2014
SPARQL endpoint for full systemic phenotyped mice data | HMGU | Q4 2014
Integration of semantic systemic phenotyped mice data with ELIXIR RDF based databases and development of a query interface | HMGU | Q1 2015
Contribute to delivery report | HMGU | Q2 2015
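The integration step rests on a general property of RDF: two sources can be joined whenever they use the same IRI for the same entity. The sketch below illustrates this with in-memory triples rather than a real endpoint. Every IRI and predicate name is an invented placeholder, and the sketch assumes one mutated gene per mouse line.

```python
# Triples as (subject, predicate, object) tuples. All identifiers are invented.
PHENOTYPE_TRIPLES = [
    ("ex:mouseline/1", "ex:hasMutatedGene", "ex:gene/G1"),
    ("ex:mouseline/1", "ex:hasPhenotype", "ex:phenotype/P1"),
    ("ex:mouseline/2", "ex:hasMutatedGene", "ex:gene/G2"),
    ("ex:mouseline/2", "ex:hasPhenotype", "ex:phenotype/P2"),
]
PATHWAY_TRIPLES = [
    ("ex:gene/G1", "ex:participatesIn", "ex:pathway/W1"),
    ("ex:gene/G3", "ex:participatesIn", "ex:pathway/W2"),
]


def match(triples, predicate):
    """Return (subject, object) pairs for one predicate, like a single triple pattern."""
    return [(s, o) for s, p, o in triples if p == predicate]


def phenotypes_by_pathway(phenotype_triples, pathway_triples):
    """Join the two sources on the shared gene IRI."""
    # Assumes one mutated gene per mouse line (a simplification of this sketch).
    gene_of_line = dict(match(phenotype_triples, "ex:hasMutatedGene"))
    pathways_of_gene = {}
    for gene, pathway in match(pathway_triples, "ex:participatesIn"):
        pathways_of_gene.setdefault(gene, []).append(pathway)
    results = []
    for line, phenotype in match(phenotype_triples, "ex:hasPhenotype"):
        gene = gene_of_line.get(line)
        for pathway in pathways_of_gene.get(gene, []):
            results.append((pathway, gene, phenotype))
    return results


joined = phenotypes_by_pathway(PHENOTYPE_TRIPLES, PATHWAY_TRIPLES)
print(joined)
```

Only the mouse line whose gene IRI also appears in the second source produces a result. This is why mapping the data to shared ontologies and identifiers comes before endpoint integration in the road map.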


4: ECRIN (UDUS)

Background: Researchers in the biomedical sciences need access to publications linked to their specific research topic. In many cases the difficulty is not accessing the right publications but finding them, which is time-consuming and hard to solve when the link to the research area is not completely clear.

Objective: To develop a pilot that enables searching for disease-related information, in the context of acute myeloid leukemia, from clinical trial registers (e.g. ClinicalTrials.gov) linked to scientific publications (e.g. PubMed). This includes describing suitable text mining algorithms and suitable software for the implementation and integration. Further possible bridges, e.g. to databases storing information about chemical compounds or genes and genomes, will be investigated.

Integration paths: WP7, WP8

Task | Owner | Timeframe
Brief analysis of recent projects dealing with text mining algorithms and their conversion into a database | UDUS | July–September 2013
Develop guidelines for revealing connections between publications data from PubMed and clinical trials data from ClinicalTrials.gov | UDUS | October–December 2013
Requirements Elicitation Workshop with WP7 and WP8 | UDUS | January 2014
Evaluation of Requirements and Use Cases | UDUS | February 2014
Requirements Validation Workshop with WP4 | UDUS | March 2014
Development of a software solution and architecture for the pilot integration of semantic web technologies (RDF) | UDUS | April–December 2014
Development of a (graphical) query interface for the pilot | UDUS | July 2014
Report on the software solution, architecture and the integration scenarios for the pilot | UDUS | Complete March 2015
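One simple way to reveal connections between publications and trial records is to look for registry identifiers in the publication text. The sketch below uses that identifier matching as a stand-in and is not the text mining approach the pilot will select. ClinicalTrials.gov identifiers have the form "NCT" followed by eight digits; the publication texts, publication ids and trial ids are invented placeholders.

```python
import re

# ClinicalTrials.gov registry identifiers: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def link_publications_to_trials(publications, known_trial_ids):
    """Map each publication id to the registered trial ids mentioned in its text.

    publications: dict of publication id -> title/abstract text.
    known_trial_ids: set of identifiers present in the trial register extract.
    Publications without a matching identifier are left out of the result.
    """
    links = {}
    for pub_id, text in publications.items():
        mentioned = set(NCT_PATTERN.findall(text))
        matched = sorted(mentioned & known_trial_ids)
        if matched:
            links[pub_id] = matched
    return links


# Invented example data.
publications = {
    "pub-1": "Randomised trial in acute myeloid leukemia (NCT00000001).",
    "pub-2": "A review with no registry identifier.",
    "pub-3": "Follow-up of NCT00000001 and NCT00000002; see also NCT99999999.",
}
known_trials = {"NCT00000001", "NCT00000002"}
links = link_publications_to_trials(publications, known_trials)
print(links)
```

Identifier matching only finds publications that cite their trial explicitly. Publications like `pub-2` in the example are the cases where text mining on disease, intervention or compound names would be needed.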


Appendix 4: Road Map for In Kind Contribution

BBMRI (UMCG)

Background: To have sufficient power in biobank studies, researchers need to find suitable sample collections (i.e. biobanks with appropriate inclusion and exclusion criteria). Subsequently, researchers need to evaluate whether the data available on these collections (a) are suitable for their research questions and (b) can be pooled successfully. All this requires rich semantic annotations.

Objective: To develop a semantic service endpoint for biobanks to expose their data dictionaries (i.e. lists of variables available) and sample descriptions (i.e. characteristics of their collections, preferably including counts). In addition, to develop a client that can consume these endpoints to rapidly explore suitable mappings across biobanks. Finally, to learn to what extent the sources need to be semantically annotated and whether it is sufficient to work with query expansion.

Integration paths: from biobank to biobank via ICD-10, NCI thesaurus, etc. or a custom application ontology.

Task | Owner | Timeframe
Minimal semantic model for describing biobank data dictionaries (OWL model) | UMCG | Q4 2013
Pilot implementation of SPARQL endpoint for biobank data dictionaries using the OWL model for federated query across 6 countries | UMCG | Q4 2013
Semantic search tool (user interface) of these data dictionaries based on ontology annotation of the search question (query expansion) | UMCG | Q1 2014
Experimentally evaluate the added value of semantic annotation of the data dictionaries to aid the mapping task | UMCG | Q3 2014
Produce a ‘semantic biobank in a box’ software for local biobanks to install and expose their local data dictionaries semantically. | UMCG | Q2 2015
Contribute to delivery report | UMCG | Q2 2015
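Query expansion, as named in the semantic search task, can be sketched as follows: the search term is expanded with related terms from an ontology before it is matched against the variable labels in each data dictionary. The toy ontology, the biobank names and the variable labels are all invented for illustration. A real implementation would draw the related terms from ICD-10, the NCI thesaurus or an application ontology.

```python
import re

# Toy ontology: term -> narrower terms and synonyms. All entries are invented.
ONTOLOGY = {
    "blood pressure": ["systolic blood pressure", "diastolic blood pressure", "bp"],
    "smoking": ["tobacco use", "cigarettes per day"],
}

# Hypothetical data dictionaries: biobank -> list of variable labels.
DATA_DICTIONARIES = {
    "biobank_a": ["Systolic blood pressure (mmHg)", "Age at inclusion"],
    "biobank_b": ["BP measurement", "Tobacco use (yes/no)"],
    "biobank_c": ["Body mass index"],
}


def expand(query, ontology):
    """Return the lower-cased query followed by its related terms from the ontology."""
    q = query.lower()
    return [q] + ontology.get(q, [])


def search(query, dictionaries, ontology):
    """Return biobank -> variable labels that contain any expanded term as whole words."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b") for term in expand(query, ontology)
    ]
    hits = {}
    for biobank, variables in dictionaries.items():
        matched = [v for v in variables if any(p.search(v.lower()) for p in patterns)]
        if matched:
            hits[biobank] = matched
    return hits


with_expansion = search("Blood pressure", DATA_DICTIONARIES, ONTOLOGY)
without_expansion = search("Blood pressure", DATA_DICTIONARIES, {})
print(with_expansion)
print(without_expansion)
```

Comparing the two results on real data dictionaries is one way to approach the evaluation task in the table: if expansion alone recovers most relevant variables, less manual annotation of the sources is needed.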