

Labeling and Enhancing Life Science Links

S. Heymann*, F. Naumann*, L. Raschid+, P. Rieger*

* Humboldt Universität zu Berlin + University of Maryland

Sections: Existing Life Science Links; Why Enrich Links?; How to Enrich Links

We propose to enrich the current link implementation so as to support more meaningful queries over enh-links. Enrichment should include semantic label descriptors (matching an appropriate ontology) and a more precise identification of the link's source and target elements within a data entry. One can then traverse paths and compare them in a way that is meaningful to the biologist.

An abundance of Web-accessible life science data sources contain data about scientific entities such as genes, sequences, proteins and citations. The sources are diverse in content and computational capability, are richly interconnected, and overlap to varying degrees.

The scientific exploration of relationships between objects involves the traversal of links and paths (concatenations of links). Existing links are poor with respect to both syntax and semantics.

Links are syntactically poor since the origin and the target of the link are specified only at the level of the database entry or object.

Links are semantically poor since they carry no explicit meaning.

Acknowledgements: This research is partially supported by NSF Grants IIS0219909 and EIA0130422 (LR), and by DFG Grant FR1142/1-3 (SH).

References:
(1) T. Etzold, A. Ulyanov, P. Argos: SRS: Information Retrieval System for Molecular Biology Data Banks. Methods in Enzymology 266: 114-128, 1996.
(2) S. Heymann, K. Tham, A. Kilian, G. Wegner, P. Rieger, D. Merkel, J.C. Freytag: Viator - A Tool Family for Graphical Networking and Data View Creation. 28th International Conference on Very Large Data Bases, Hong Kong, 2002, Proceedings pp. 1067-1070.

[Figure: Existing Links: No Link Labels. Nodes: Locuslink, UniProt, OMIM, BRENDA, LCOMPOUND.]

[Figure: Links Enhanced with Link Labels. Nodes: Locuslink, UniProt, OMIM, BRENDA, LCOMPOUND. Link labels: Is translated to; Has disease-related mutations; Is an enzyme; Is a substrate.]

enh-links: Links enhanced with Label, Origin of Link and Target of Link.
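As a minimal sketch of what an enh-link might carry, the following Python record contrasts an existing entry-level link with one enriched by a label, an origin and a target. The class and field names, the entry identifiers, and the sub-entry element names (`cds`, `sequence`) are illustrative assumptions and do not come from the poster. Pairing the label "Is translated to" with Locuslink and UniProt is also an assumption about the figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locator:
    """One end of a link: a source, an entry, and optionally an element inside it."""
    source: str                    # data source name, e.g. "Locuslink"
    entry_id: str                  # hypothetical identifier of the database entry
    element: Optional[str] = None  # hypothetical sub-entry element; None = whole entry


@dataclass(frozen=True)
class EnhLink:
    """A link enhanced with label, origin of link and target of link."""
    origin: Locator
    target: Locator
    label: Optional[str] = None    # None models an existing, unlabeled link

    def is_enriched(self) -> bool:
        # Enriched here means: labeled, and both ends point inside their entries.
        return (
            self.label is not None
            and self.origin.element is not None
            and self.target.element is not None
        )


# An existing link: entry to entry, no label.
plain = EnhLink(Locator("Locuslink", "entry-A"), Locator("UniProt", "entry-B"))

# The same link after enrichment (assumed placement of the label).
enriched = EnhLink(
    Locator("Locuslink", "entry-A", element="cds"),
    Locator("UniProt", "entry-B", element="sequence"),
    label="Is translated to",
)
```

Making both records immutable (`frozen=True`) keeps them hashable, so they can be stored in sets when links are collected per source.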

Model and query language for Labeled Life Science Links

A data model to capture enriched link semantics will include:

LT: A set of link types
LS: A set of published links implemented in the sources
LL: Pairs of link types that represent a meaningful link concatenation
LE: Link and path equivalencies
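The four components above could be represented roughly as follows. This is a sketch under stated assumptions: the `LinkModel` class, its method names, the representation of a link as an (origin source, link type, target source) triple, and the placement of each label between sources are all hypothetical. The link type "Encodes an enzyme" is invented purely to illustrate LE.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple

Link = Tuple[str, str, str]   # (origin source, link type, target source)
TypePath = Tuple[str, ...]    # a sequence of link types


@dataclass
class LinkModel:
    LT: Set[str] = field(default_factory=set)               # link types
    LS: Set[Link] = field(default_factory=set)              # published links in the sources
    LL: Set[Tuple[str, str]] = field(default_factory=set)   # meaningful concatenations
    LE: Set[FrozenSet[TypePath]] = field(default_factory=set)  # equivalence classes

    def validate(self) -> List[str]:
        """Return consistency problems; an empty list means LS and LL only use types in LT."""
        problems = []
        for _, link_type, _ in sorted(self.LS):
            if link_type not in self.LT:
                problems.append(f"LS uses unknown link type: {link_type}")
        for pair in sorted(self.LL):
            for link_type in pair:
                if link_type not in self.LT:
                    problems.append(f"LL uses unknown link type: {link_type}")
        return problems

    def may_concatenate(self, first: str, second: str) -> bool:
        """True if following a `first` link with a `second` link is declared meaningful."""
        return (first, second) in self.LL

    def equivalent(self, a, b) -> bool:
        """True if two link-type sequences are identical or share an equivalence class."""
        a, b = tuple(a), tuple(b)
        return a == b or any(a in cls and b in cls for cls in self.LE)


model = LinkModel(
    LT={"Is translated to", "Has disease-related mutations",
        "Is an enzyme", "Is a substrate", "Encodes an enzyme"},
    LS={("Locuslink", "Is translated to", "UniProt"),   # assumed placement
        ("UniProt", "Is an enzyme", "BRENDA")},         # assumed placement
    LL={("Is translated to", "Is an enzyme")},
    LE={frozenset({("Is translated to", "Is an enzyme"),
                   ("Encodes an enzyme",)})},           # hypothetical equivalence
)
```

LL is kept as ordered pairs because a concatenation that is meaningful in one direction need not be meaningful in the other.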

Tools that support semi-automatic annotation and enrichment of existing links.

A query language for a scientist to exploit LT, LS and LL in expressing navigational queries.

A scientist-friendly interface:
To specify properties LS, LL and LE.
To rank the paths that satisfy some query.
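A navigational query over LS that respects LL, followed by a ranking step, might look like the sketch below. The poster does not define the query language or the ranking criterion, so the function names, the placement of the four labels between sources, the contents of LL, and the "shortest path first" ordering are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Link = Tuple[str, str, str]  # (origin source, link type, target source)

# Assumed placement of the labeled links between sources.
LS: Set[Link] = {
    ("Locuslink", "Is translated to", "UniProt"),
    ("Locuslink", "Has disease-related mutations", "OMIM"),
    ("UniProt", "Is an enzyme", "BRENDA"),
    ("BRENDA", "Is a substrate", "LCOMPOUND"),
}

# Assumed pairs of link types whose concatenation is meaningful.
LL: Set[Tuple[str, str]] = {
    ("Is translated to", "Is an enzyme"),
    ("Is an enzyme", "Is a substrate"),
}


def find_paths(start: str, goal: str, links: Set[Link],
               allowed_pairs: Set[Tuple[str, str]],
               max_len: int = 4) -> List[List[Link]]:
    """Enumerate cycle-free paths from start to goal whose consecutive
    link types all appear in allowed_pairs."""
    results: List[List[Link]] = []

    def extend(path: List[Link], node: str, visited: Set[str]) -> None:
        if node == goal and path:
            results.append(list(path))
            return
        if len(path) == max_len:
            return
        for link in sorted(links):
            origin, link_type, target = link
            if origin != node or target in visited:
                continue
            # Reject a step whose type may not follow the previous step's type.
            if path and (path[-1][1], link_type) not in allowed_pairs:
                continue
            path.append(link)
            extend(path, target, visited | {target})
            path.pop()

    extend([], start, {start})
    return results


def rank(paths: List[List[Link]]) -> List[List[Link]]:
    """Order paths shortest first (an assumed criterion), ties broken by content."""
    return sorted(paths, key=lambda p: (len(p), p))


paths = rank(find_paths("Locuslink", "LCOMPOUND", LS, LL))
for p in paths:
    print(" -> ".join([p[0][0]] + [f"[{t}] {dst}" for _, t, dst in p]))
```

Checking LL at each extension step prunes meaningless concatenations early, rather than filtering complete paths afterwards.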

Links are added for various reasons:

A link may represent the result of an experimental protocol to test a hypothesis.

Data curators may add links following domain-specific conventions.

A link may have been predicted by some software.

Biologists can usually infer the meaning of a link, but search engines and mediators cannot.

Current links cannot capture or differentiate these desirable properties.

Mapping from logical classes/categories to physical Web accessible collections

Primary repository of sequences >> GenBank, EMBL, DDBJ
Annotated genome data >> ENSEMBL
Hand curated protein sequences >> UniProt (=SwissProt PIR)
Hand curated hereditary diseases >> OMIM
Frames of reference >> GO, Taxonomy, HSAGENES
... >> ...

Four NCBI data sources (red arrows) are nodes in a steadily growing collection of coarsely cross-referenced primary and secondary life science data compilations (1).

Interactive Navigation Aid: GeneViator Tool - available upon request (2).