Link Analysis of Life Sciences Linked Data

Link Analysis of Life Science Linked Data. Wei Hu¹, Honglei Qiu¹, and Michel Dumontier². ¹State Key Laboratory for Novel Software Technology, Nanjing University, China; ²Center for Biomedical Informatics Research, Stanford University. @micheldumontier::ISWC 2015

Upload: michel-dumontier

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Link Analysis of Life Sciences Linked Data

Link Analysis of Life Science Linked Data

Wei Hu¹, Honglei Qiu¹, and Michel Dumontier²

¹State Key Laboratory for Novel Software Technology, Nanjing University, China
²Center for Biomedical Informatics Research, Stanford University

@micheldumontier::ISWC 2015

Page 2: Link Analysis of Life Sciences Linked Data

Linked Data offers links between datasets, but they are often incomplete and may contain errors.


Page 3: Link Analysis of Life Sciences Linked Data

Network Analysis
• Network analysis has long been used to study link structures
  – The structure of the Web
  – Network medicine: cellular networks and implications


A power-law degree distribution is scale-free

A graph demonstrates the small-world phenomenon if its clustering coefficient is significantly higher than that of a random graph on the same node set, and if the graph has a shorter average distance.

BTC2010

The clustering coefficient of a node quantifies how close its neighbors are to being a clique. The average distance is the average shortest path length between all pairs of nodes in the graph.
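The two measures defined above can be sketched in a few lines of Python. This is only an illustration on a hypothetical four-node undirected graph (`toy`), not the code used for the analysis; it assumes the graph is stored as an adjacency dictionary and that unreachable pairs are simply left out of the average distance.

```python
from collections import deque
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are linked to each other."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * linked / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def average_distance(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(average_clustering(toy), average_distance(toy))
```

To test for the small-world property one would compute the same two numbers on a random graph over the same node set and compare them.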

Page 4: Link Analysis of Life Sciences Linked Data

Dataset link analysis (using RDF data model)

Entity link analysis (using cross-references)

Term link analysis (using ontology matching)


Page 5: Link Analysis of Life Sciences Linked Data


Linked Data for the Life Sciences


Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.

• Chemicals/drugs/formulations, genomes/genes/proteins, domains
• Interactions, complexes & pathways
• Animal models and phenotypes
• Disease, genetic markers, treatments
• Terminologies & publications

• Release 3 (June 2014)
• 35 datasets
• 11B RDF triples
• 1B entities
• 2K classes
• 4K properties

Page 6: Link Analysis of Life Sciences Linked Data

Dataset Links


Network Properties
1. Well linked
2. Hubs and authorities
3. Small-world phenomenon
   Average distance = 2.77 vs 6
   Clustering coefficient = 0.22 vs 0.13
4. Robust on systematic removal of nodes
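One way to probe the robustness property is to delete nodes in a fixed order and watch how much of the graph stays connected. The sketch below assumes "systematic removal" means removing the highest-degree nodes first, which the slide does not spell out; the function names and the `toy` graph are hypothetical.

```python
def largest_component_fraction(adj, removed):
    """Share of the original nodes in the largest connected component
    that remains after deleting the nodes in `removed`."""
    alive = set(adj) - set(removed)
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            cur = stack.pop()
            size += 1
            for nxt in adj[cur]:
                if nxt in alive and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best / len(adj)


def targeted_removal_curve(adj, steps):
    """Remove the highest-degree nodes one at a time (degrees are taken
    from the original graph) and record the surviving fraction."""
    order = sorted(adj, key=lambda n: (-len(adj[n]), n))
    return [largest_component_fraction(adj, order[:i]) for i in range(steps + 1)]


# Hypothetical undirected graph with one obvious hub.
toy = {
    "hub": {"a", "b", "c"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
}
print(targeted_removal_curve(toy, 2))
```

A robust graph keeps this curve high for many steps; a fragile one collapses as soon as its hubs are gone.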

Page 7: Link Analysis of Life Sciences Linked Data

Entity Link Analysis

How well do entities link to each other?
• 76% of entity links involve a special kind of RDF triple
  – e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
  – x-relations have under-specified semantics
    • May be truly identical, may refer to another related entity …
• Degree distribution
  – Some do not follow power law
    • Exponent is too large (close to 5)
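To make the exponent remark concrete, here is a minimal sketch of how a power-law exponent can be estimated from a list of node degrees. It uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / Σ ln(x_i / x_min); the slides do not say which fitting method was used, so treat this as an assumption, and the degree list is made up.

```python
import math


def power_law_exponent(degrees, x_min=1):
    """Continuous maximum-likelihood estimate of the power-law exponent
    for the degrees that are at least `x_min`."""
    tail = [d for d in degrees if d >= x_min]
    log_sum = sum(math.log(d / x_min) for d in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree above x_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sequence, for illustration only.
sample_degrees = [1, 1, 1, 1, 2, 2, 4, 8]
print(power_law_exponent(sample_degrees))
```

An estimate close to 5, as reported above, is well outside the range of roughly 2 to 3 usually seen in scale-free networks, which is why those distributions are judged not to follow a power law.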


BTC2010


Page 8: Link Analysis of Life Sciences Linked Data

Symmetry of entity links varies between different pairs of datasets

• Over 99% of links are reciprocated in DrugBank-PharmGKB and OMIM-HGNC
  – Suggests link sharing and synchronization
• Only 58% of DrugBank-KEGG links and 51% of OMIM-Orphanet links are reciprocal
  – Suggests incomplete mapping
• 28% of OMIM-Orphanet links are malposed
  – Suggests variation in model (omim:Phenotype to orphanet:Disorder)
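The reciprocity figures above can be reproduced in spirit with a small function. The sketch assumes links are (subject, object) pairs of prefixed identifiers and measures, for one direction only, the share of links that also exist in the opposite direction; the exact ratio used in the study is not given on the slide. Apart from the kegg:D03455 / drugbank:DB00002 pair, the identifiers are hypothetical.

```python
def reciprocity(links, source_prefix, target_prefix):
    """Fraction of source->target links that also appear as target->source."""
    forward = {
        (s, o) for s, o in links
        if s.startswith(source_prefix) and o.startswith(target_prefix)
    }
    backward = {
        (s, o) for s, o in links
        if s.startswith(target_prefix) and o.startswith(source_prefix)
    }
    if not forward:
        return 0.0
    return sum(1 for s, o in forward if (o, s) in backward) / len(forward)


links = {
    ("kegg:D03455", "drugbank:DB00002"),
    ("drugbank:DB00002", "kegg:D03455"),    # reciprocated
    ("kegg:D99999", "drugbank:DB99999"),    # hypothetical, one-way only
}
print(reciprocity(links, "kegg:", "drugbank:"))
```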


Page 9: Link Analysis of Life Sciences Linked Data

Transitivity Analysis: Find mismatches and discover new links
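A rough sketch of the idea, under assumptions the slide does not state: links are directed (subject, object) pairs, the dataset is the prefix before the colon, and a chain a→b→c across three datasets either proposes a new link a→c or, if a already links to a different entity in c's dataset, flags a mismatch. All identifiers and function names are hypothetical.

```python
def dataset_of(entity):
    """Dataset prefix of a prefixed identifier such as 'omim:123'."""
    return entity.split(":", 1)[0]


def analyse_transitivity(links):
    """Return (new_links, mismatches) found by following two-step chains."""
    out = {}
    for s, o in links:
        out.setdefault(s, set()).add(o)

    new_links, mismatches = set(), set()
    for a, mids in out.items():
        for b in mids:
            for c in out.get(b, ()):
                # Only consider chains that span three distinct datasets.
                if dataset_of(c) in (dataset_of(a), dataset_of(b)):
                    continue
                if c in out[a]:
                    continue  # a->c is already stated; the chain is consistent
                same_target = {t for t in out[a] if dataset_of(t) == dataset_of(c)}
                if same_target:
                    mismatches.add((a, b, c))  # a points elsewhere in c's dataset
                else:
                    new_links.add((a, c))      # candidate link to review
    return new_links, mismatches


links = {
    ("a:1", "b:1"), ("b:1", "c:1"),                  # yields candidate a:1 -> c:1
    ("a:2", "b:2"), ("b:2", "c:2"), ("a:2", "c:9"),  # conflicts with a:2 -> c:9
}
print(analyse_transitivity(links))
```

Candidates found this way still need review, because the meaning of a link can drift along the chain.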


Page 10: Link Analysis of Life Sciences Linked Data

Evaluation of Entity Matching

How accurate are current entity matching approaches?
• Built a benchmark from the reciprocal links between similarly-typed entities
• Evaluated several entity matching approaches
  – Label similarity: Levenshtein, Jaro-Winkler, N-gram, Jaccard
  – Machine learning: linear regression, logistic regression with 5 properties
• Many-to-one links are difficult to discover
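Two of the label-similarity measures named above are easy to sketch. The version below normalizes Levenshtein distance by the longer label and computes Jaccard over character trigrams; both choices are assumptions, since the slide does not say how the measures were configured.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical labels."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngrams(text, n=3):
    """Set of character n-grams; short strings fall back to the whole string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)} or {text}


def jaccard(a, b, n=3):
    """Jaccard overlap of the two labels' character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


print(levenshtein_similarity("kitten", "sitting"), jaccard("abcd", "abce"))
```

Scores like these could also serve as input features for the regression models mentioned above, alongside other properties.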


Page 11: Link Analysis of Life Sciences Linked Data

Term Link Analysis

How similar are the topics in the data network?
• Use ontology matching to generate term link graph
  – Falcon-AO (linguistic matchers + structural matcher + synonyms)
• Created 83K class mappings, 1.5K object property mappings, and 858 data property mappings
  – Similarity threshold = 0.9
  – Top-5 popular labels for classes and properties
• Significant overlap in topics; does not follow power law as in the broader Semantic Web


Page 12: Link Analysis of Life Sciences Linked Data

Correlation of Link Graphs

To what degree are the three link graphs correlated?
• Spearman's rank correlation coefficient:
  – Entity link graph dataset pairs: entity links / entities
  – Term link graph dataset pairs: term mappings / terms
  – Dataset link graph dataset pairs: shortest path length
• All positively correlated
  – Closer datasets in distance have more linked entities and terms
  – Number of linked entities contributes little to overlap of topics
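Spearman's coefficient is the Pearson correlation of the ranks, which can be sketched without any library. The example values are invented per-dataset-pair scores, not figures from the study, and tied values receive their average rank.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical scores for four dataset pairs.
entity_link_ratio = [0.10, 0.40, 0.25, 0.80]
term_mapping_ratio = [0.05, 0.30, 0.20, 0.60]
print(spearman(entity_link_ratio, term_mapping_ratio))
```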


Page 13: Link Analysis of Life Sciences Linked Data

Summary of Findings

• Dataset, entity and term link graphs do not necessarily share the same characteristics with the Hypertext / Semantic Web
  – Degree distribution of entity links does not follow power law
  – Data hubs
• A significant number of entities have been linked using x-relations, but their intended semantics differs
  – Classes are identical or equivalent entity links represent logical equivalence
• Symmetric and transitive entity links do exist, but their utility is weakened due to their small number
  – Meanings of entity links may shift during transitive closure
• Matching only the labels of entities may fail, while combining different properties and using simple learning algorithms achieves good accuracy


Page 14: Link Analysis of Life Sciences Linked Data

[email protected]

Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier
