sti summit 2011 - ls4 ls khaos

Linked Data and Life Sciences. Riga STI Summit, 6-8 July 2011. José F. Aldana Montes

Page 1: STI Summit 2011 - LS4 LS Khaos

Linked Data and Life Sciences

Riga STI Summit, 6-8 July 2011

José F. Aldana Montes

Page 2: STI Summit 2011 - LS4 LS Khaos

Life Sciences Linked Data

Producing Consuming

Page 3: STI Summit 2011 - LS4 LS Khaos

Producing Life Sciences Linked Data(Problems)

Almost all Linked Open Data in Life Sciences is provided by Bio2RDF

Most Linked Open Data is created and provided without the help of the original data provider who

Page 4: STI Summit 2011 - LS4 LS Khaos

Producing Life Sciences Linked Data(Problems)

Almost all Linked Open Data in Life Sciences is provided by Bio2RDF

• A database is a life's work for a biologist, and he/she wants to publish it

– but not to lose control of it

• An RDF dump of the DB is cheap

– but supporting Queries and Data Analysis is expensive

– where is the money coming from?

• They are very motivated to add value to the data

– but they still lack up-to-date ICT skills

• Help is wanted to kill Bio2RDF
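
The "cheap RDF dump" mentioned above can be illustrated with a minimal sketch that serializes database rows as N-Triples. The base URI, the table name, and the column names are hypothetical placeholders, and every value is emitted as a plain string literal; a real export would need proper URIs for cross-references and typed literals.

```python
# Minimal sketch: serialize database rows as N-Triples (an "RDF dump").
# BASE, the table name, and the column names are assumptions for illustration.

BASE = "http://example.org/"  # hypothetical namespace, not from the slides


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires escaping in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(table: str, key: str, rows: list[dict]) -> list[str]:
    """Emit one triple per non-key, non-null column of each row."""
    lines = []
    for row in rows:
        subject = f"<{BASE}{table}/{row[key]}>"
        for column, value in row.items():
            if column == key or value is None:
                continue
            predicate = f"<{BASE}{table}#{column}>"
            literal = escape_literal(str(value))
            lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


if __name__ == "__main__":
    # Hypothetical row; not data from any real database.
    proteins = [{"id": "P1", "name": "example protein", "organism": "Homo sapiens"}]
    for line in rows_to_ntriples("protein", "id", proteins):
        print(line)
```

The dump itself is a few lines of code; the expensive part named on the slide (query and analysis services on top of it) is not covered by this sketch.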

Page 5: STI Summit 2011 - LS4 LS Khaos

Consuming Linked Data

• The number of Linked Data repositories will keep growing

• Use of Linked Data in Life Sciences means linking data with existing tools which are de facto standards in certain subdomains:

• Pathways

• Proteins

http://sbmm.uma.es

Page 6: STI Summit 2011 - LS4 LS Khaos

Consuming Linked Data

• Data Analysis Services (not only queries but also Data Mining, Crawling, and Reasoning) are needed to engage the community

– Biomedical uses (pharmaceuticals testing, drug screening)

Page 7: STI Summit 2011 - LS4 LS Khaos

Consuming Linked Data

• Reasoning, removed to make data reuse possible, should be re-introduced in some cases over real complex ontologies with large sets of data

– BioPax Level 3 (Level 4 under development)

• OWL Species: DL

• DL Expressivity: SHIF(D)

• Consistent: Yes

– BioPax Level 3 (4 officially identified databases; more DBs publish data as BioPax Level 3 instances)

• Reactome Database

– 1.54 GB

– 2 980 230 triples

– BioPax Level 2 (9 officially identified databases)

• Beforehand, data and ontologies should be cleaned up

Page 8: STI Summit 2011 - LS4 LS Khaos

Consuming Linked Data

• Reasoning Services over real complex ontologies with large sets of data

– Cost reduction in experiment design

– Hypothesis demonstration/refutation

– Privacy in reasoning with public + private data

Page 9: STI Summit 2011 - LS4 LS Khaos

Consuming Linked Data

• Reasoning for classification problems

– Disease classification / diagnosis

– Protein identification

– Pathway alignment
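
As a rough illustration of reasoning used for classification, the sketch below infers every class an individual belongs to by following subclass links upward. It covers only simple subclass closure, which is far weaker than the description-logic reasoning the slides have in mind, and the class names are illustrative rather than taken from any real ontology.

```python
# Sketch: classify an individual by computing the transitive closure of
# subclass links. Class names used with it are hypothetical examples.


def superclasses(cls: str, subclass_of: dict[str, set[str]]) -> set[str]:
    """Return every ancestor of `cls` reachable through subclass links."""
    seen: set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, set()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def classify(asserted: set[str], subclass_of: dict[str, set[str]]) -> set[str]:
    """Return the asserted classes plus all classes inferred from them."""
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(cls, subclass_of)
    return inferred
```

For example, with a hierarchy in which `Kinase` is a subclass of `Enzyme` and `Enzyme` a subclass of `Protein`, an individual asserted only as a `Kinase` is also classified as an `Enzyme` and a `Protein`.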

Page 10: STI Summit 2011 - LS4 LS Khaos

Consuming Linked Data

• Digital Data Curation / cross-validation
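
One simple form of cross-validation is to compare what two sources say about the same record and flag disagreements for a curator. The sketch below assumes each source has already been reduced to a mapping from a shared identifier to a single value; the identifiers and values used with it are hypothetical.

```python
# Sketch: flag records on which two data sources disagree.
# Assumes both sources use the same identifiers (a strong simplification).


def cross_validate(
    source_a: dict[str, str], source_b: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return {identifier: (value_in_a, value_in_b)} for conflicting records."""
    shared = source_a.keys() & source_b.keys()
    return {
        key: (source_a[key], source_b[key])
        for key in shared
        if source_a[key] != source_b[key]
    }
```

Records present in only one source are ignored here; a curation workflow would probably report those separately.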

Page 11: STI Summit 2011 - LS4 LS Khaos

Consuming Linked Data

• Domain oriented (customizable) user interfaces

Page 12: STI Summit 2011 - LS4 LS Khaos

Scalability Issues in Life Sciences

• Real scenarios with rich ontologies are starting to appear:

– BioPax Level 3→4: complex OWL ontology (transitive, reflexive, inverse and functional properties, restrictions in most of the classes, 70 classes)

– Big data sets in OWL format (from 20 MB to 45 GB of data)

– Problems with the data:

• Undetected ABox (and even TBox) inconsistencies because of the lack of scalable reasoners

• Lack of SPARQL endpoints to query these data
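
To make the kind of ABox problem concrete, the sketch below scans a set of triples for subjects that have more than one value for a property declared functional. It assumes that different names denote different individuals; OWL itself does not make that assumption, so a real reasoner would first infer that the two values are the same individual and report an inconsistency only if they are known to be different. The property and individual names are hypothetical.

```python
from collections import defaultdict

# Sketch: find candidate violations of functional properties in an ABox.
# Assumes the unique name assumption, which OWL does not make by default.

Triple = tuple[str, str, str]


def functional_violations(
    triples: list[Triple], functional_properties: set[str]
) -> dict[tuple[str, str], set[str]]:
    """Return {(subject, property): values} where a functional property has several values."""
    values: dict[tuple[str, str], set[str]] = defaultdict(set)
    for subject, prop, obj in triples:
        if prop in functional_properties:
            values[(subject, prop)].add(obj)
    return {key: objs for key, objs in values.items() if len(objs) > 1}
```

This is a single linear pass over the triples, which is why checks of this kind can scale to large data sets, whereas full consistency checking over an expressive ontology is much more costly.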

Page 13: STI Summit 2011 - LS4 LS Khaos

Summary: Are we losing the war?

• Producing Linked Data in Life Sciences: some risks and some needs detected:

– A motivating reward scheme for the data owner

– Some specific infrastructure (action, facility, institute, foundation, private…) support could be useful

• to engage data owners,

• to contribute technical capability, and

• to share costs

Page 14: STI Summit 2011 - LS4 LS Khaos

Summary: Are we losing the war?

• Consuming Linked Data in Life Sciences: opportunities

– Connecting (linking) data with existing tools which are de facto standards in certain LS subdomains

• to multiply impact

– Not only Query Services but also Data Analysis Services (Crawling, Mining, Reasoning, etc.) should be provided to the community

• but this is expensive for the average DB owner

– Data must be cleaned up, curated and cross-validated

• main thread

– The domain is lacking specific user interfaces

• this is related to the connection of LD to (de facto) standard tools

– In this domain it makes sense to reason

• but scalability is still an issue

Page 15: STI Summit 2011 - LS4 LS Khaos

Linked Data and Life Sciences

José F. Aldana [email protected]