Investigating Term Reuse and Overlap in Biomedical Ontologies

International Conference on Biomedical Ontology, Lisbon, 27th-30th July 2015

Maulik R. Kamdar, Tania Tudorache and Mark A. Musen

Are we there yet?



Pages 2-3: Investigating Term Reuse and Overlap in Biomedical Ontologies

Unified Medical Language System (UMLS)

• CUI C0011849: Diabetes Mellitus
• Diabetes Mellitus (SNOMEDCT)
• Diabetes Mellitus (ICD9CM)

Open Biomedical Ontologies (OBO) Foundry

• Gene Ontology (GO): RNA Binding (GO:0003723)
• Gene Expression Ontology (GEXO): GO:0003723 (IRI)
• Gene Regulation Ontology (GRO): Binding to RNA (GRO#BindingToRNA) (xref)

Pages 4-9: Investigating Term Reuse and Overlap in Biomedical Ontologies

OBO Reuse vs Overlap in 2010

Ghazvinian, Amir, et al. "How orthogonal are the OBO Foundry ontologies?" J. Biomedical Semantics 2.S-2 (2011): S2.

• Same IRI
• Intent for Reuse
• Xref mapping
• September 2009
• September 2010

Pages 10-12: Investigating Term Reuse and Overlap in Biomedical Ontologies

Key Findings

• ~3% Term Reuse
• Only popular or upper-level ontologies reused
• 14.4% Term Overlap
• Semantically-similar terms reused together
• Similarity metric for a Recommender system

Page 13: Investigating Term Reuse and Overlap in Biomedical Ontologies

BioPortal Import Plugin

Page 14: Investigating Term Reuse and Overlap in Biomedical Ontologies

DOG4DAG

Page 15: Investigating Term Reuse and Overlap in Biomedical Ontologies

Ontofox Web tool

Pages 16-19: Investigating Term Reuse and Overlap in Biomedical Ontologies

Neurological Disease Ontology

• Reuse of an Ontology: OBI
• Reuse of Terms: OGMS
• NDO

Page 20: Investigating Term Reuse and Overlap in Biomedical Ontologies

Key Findings

• ~3% Term Reuse
• Only popular or upper-level ontologies reused
• 14.4% Term Overlap
• Semantically-similar terms reused together
• Similarity metric for a Recommender system

Page 21: Investigating Term Reuse and Overlap in Biomedical Ontologies

• BioPortal N-triples dump: 509 Biomedical Ontologies
• Remove ontology views: 377 ontologies
• Terms, Labels, xrefs, CUIs: 5,718,276 class terms
• Determine Source Ontology: Source-Target Ontology pairs
• IRI Reuse, Xref Reuse, CUI Reuse
• >35% reuse for ontology reuse
• Clustering
• Term Overlap Analysis: Label normalization

Page 22: Investigating Term Reuse and Overlap in Biomedical Ontologies

14.4% Naïve Term Overlap!

• Normalized String Matching on Term Labels: 14.4% (823,621)
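
The naïve overlap count can be illustrated with a minimal sketch. The slides do not say which normalization steps were applied, so the ones below (lowercasing, replacing non-alphanumeric ASCII characters with spaces, collapsing whitespace) are assumptions, and the toy terms and ontology names are hypothetical rather than taken from BioPortal.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Lowercase, replace non-alphanumeric runs with a space, trim the ends."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", label.lower())
    return cleaned.strip()


def overlapping_terms(terms):
    """Return terms whose normalized label also occurs in another ontology.

    `terms` is an iterable of (ontology, term_id, label) tuples.
    """
    by_label = defaultdict(list)
    for ontology, term_id, label in terms:
        by_label[normalize_label(label)].append((ontology, term_id))

    overlapping = []
    for entries in by_label.values():
        ontologies = {ontology for ontology, _ in entries}
        if len(ontologies) > 1:  # label shared by at least two ontologies
            overlapping.extend(entries)
    return overlapping


# Hypothetical toy data, for illustration only.
toy_terms = [
    ("ONTO_A", "A:1", "RNA binding"),
    ("ONTO_B", "B:7", "RNA-Binding"),
    ("ONTO_B", "B:8", "cardiac muscle"),
]
shared = overlapping_terms(toy_terms)
overlap_fraction = len(shared) / len(toy_terms)
```

Here two of the three toy terms share a normalized label across ontologies, so the overlap fraction is 2/3. Matching on labels alone cannot tell whether the two terms were reused or independently created, which is why the count is called naïve.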

Pages 23-25: Investigating Term Reuse and Overlap in Biomedical Ontologies

156/377 ontologies reuse no terms from other ontologies!

<5% of Terms reused from other Ontologies!

IRI Reuse

Page 26: Investigating Term Reuse and Overlap in Biomedical Ontologies

315/377 ontologies xref link to no terms from other ontologies!

<5% of Terms reused from other Ontologies!

Xref Reuse

Page 27: Investigating Term Reuse and Overlap in Biomedical Ontologies

263/377 ontologies have no terms reused by other ontologies!

Reuse from a small set of ontologies only!

IRI Reuse

Page 28: Investigating Term Reuse and Overlap in Biomedical Ontologies

286/377 ontologies have no terms xref linked by other ontologies!

Reuse from a small set of ontologies only!

Xref Reuse

Page 29: Investigating Term Reuse and Overlap in Biomedical Ontologies

0-5% of total terms reused explicitly or using xref, with >150 ontologies showing 0% reuse. Average Term Reuse ~ 3%

Reuse from a small set of ontologies only with terms from >250 ontologies never reused

>100% term reuse from some ontologies! Why?

Page 30: Investigating Term Reuse and Overlap in Biomedical Ontologies

>100% terms reused from some ontologies!

[Figure: bar chart. X-axis: Ontologies (BFO, GO, IAO, OBI, PATO, CHEBI, CL, NCBITAXON, UO, SO, UBERON, CARO, NCIT, FMA, MP, SNOMEDCT). Y-axis: Number of Ontologies Reusing Terms (#), 0 to 100. Series: xref Reuse (No. of Ontologies), IRI Reuse (No. of Ontologies).]

Page 31: Investigating Term Reuse and Overlap in Biomedical Ontologies

>100% terms reused from some ontologies!

[Figure: bar chart. X-axis: Ontologies, numbered 1 to 16. Y-axis: Number of Ontologies Reusing Terms (#), 0 to 100. Series: % of Terms reused (IRIs), % of Terms reused (xref). Annotation: BFO:101/39.]

Page 32: Investigating Term Reuse and Overlap in Biomedical Ontologies

… Reuse from a small set of popular or upper-level ontologies only with terms from >250 ontologies never reused

>100% terms reused w.r.t. the current versions of the BFO, PATO, CARO, UO and SO ontologies! Needs rigorous analysis through term overlap …

Pages 33-39: Investigating Term Reuse and Overlap in Biomedical Ontologies

CUI Reuse

Procedural Terminologies do not share CUIs!

[Figure: CUI Reuse chart. X-axis: terminologies (ICD10PCS, HCPCS, NCBITAXON, LOINC, MESH, HL7, ICD10CM, OMIM, RXNORM, CPT, PDQ, MEDDRA, ICD9CM, NDDF, ICPC, ICPC2P, MDDB, NDFRT, SNOMEDCT, VANDF, CRISP, RCD, MEDLINE..., SNMI, COSTART, WHO-ART). Y-axis: Number of Terms (Log Scale). Series, added one per page: CUIs shared with 0, 1-5, 6-10, 11-15 and 16-20 Terminologies.]

Page 40: Investigating Term Reuse and Overlap in Biomedical Ontologies

Minimal sharing of CUIs, especially across the UMLS procedural terminologies: ICD10PCS, HCPCS and CPT

Several unique terms are introduced as we migrate from ICD9CM -> ICD10CM, leading to a decrease in term reuse.

Should there actually be Term Reuse?

Page 41: Investigating Term Reuse and Overlap in Biomedical Ontologies

Overlap decreases using correct representations!

• 14.4% (823,621): Normalized String Matching on Term Labels
• 13.2% (752,176): Removing Explicitly Reused Terms
• 10.8% (617,509): Removing Terms Mapped to the same UMLS CUI
• 1.6% (93,650): Removing almost-similar terms (same identifier and source ontology but different representation)
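
As a quick arithmetic check, the four percentages can be recomputed from the reported counts. The slides do not state the denominator; the sketch below assumes it is the 5,718,276 class terms reported for the 377 ontologies, which reproduces the figures on the slide.

```python
# Assumed denominator: class terms across the 377 ontologies (from the slides).
TOTAL_TERMS = 5_718_276

# Overlapping-term counts after each filtering step, as reported on the slide.
overlap_counts = {
    "Normalized string matching on term labels": 823_621,
    "Removing explicitly reused terms": 752_176,
    "Removing terms mapped to the same UMLS CUI": 617_509,
    "Removing almost-similar terms": 93_650,
}

# Percentage of all class terms that still overlap after each step.
percentages = {
    step: round(100 * count / TOTAL_TERMS, 1)
    for step, count in overlap_counts.items()
}

for step, pct in percentages.items():
    print(f"{pct:5.1f}%  {step}")
```

Under that assumption the steps come out at 14.4%, 13.2%, 10.8% and 1.6%, so the largest single drop (from 10.8% to 1.6%) comes from removing almost-similar terms.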

Page 42: Investigating Term Reuse and Overlap in Biomedical Ontologies

Average 3% Term reuse across ontologies using any method, yet a 14.4% naïve Term overlap!

Term overlap decreases substantially on removing almost similar terms …

Examples for almost similar terms?

Pages 43-47: Investigating Term Reuse and Overlap in Biomedical Ontologies

Ontology Engineers show an intent for reuse!

Intent

• Different Versions: BFO, NCIT
• Different Notations: FMA
• Different Namespaces: MESH, SNOMEDCT

Examples

• BFO: Version 1.0 / Version 1.1, in Subcellular Anatomy Ontology (SAO) and Suggested Ontology for Pharmacogenomics (SOPHARM)
• NCIT: NCIT:C53037 / NCIT:Cerebral_Vein, in Cigarette Smoke Exposure (CSEO) and Sage Bionetworks Synapse (SYN)
• FMA: OBO:FMA_31396, OBO:owlapi/fma#FMA_31396, OBO:owl/FMA#FMA_31396, OBO:fma#Cartilage_of_inferior_surface …
• MESH: http://purl.bioontology.org/ontology/MESH and http://phenomebrowser.net/ontologies/mesh/mesh.owl
• SNOMEDCT: http://ihtsdo.org/snomedct/ and http://purl.bioontology.org/ontology/SNOMEDCT

Page 48: Investigating Term Reuse and Overlap in Biomedical Ontologies

Different versions, notations, namespaces

• >100% Reuse of few source ontologies
• Increase in Term Overlap

Incorrect representations without mappings do not provide advantages of Term Reuse!
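
One way to see why differing notations hide reuse is to try collapsing the FMA identifier variants shown in the talk to a single form. The following is only a sketch: the canonical `PREFIX:NUMBER` form and the `canonical_id` helper are hypothetical, not something the slides describe.

```python
import re

# Matches a letter prefix followed by a numeric local identifier, e.g. FMA_31396.
LOCAL_ID = re.compile(r"([A-Za-z]+)[_:](\d+)$")


def canonical_id(iri: str):
    """Return 'PREFIX:NUMBER' if the IRI ends in a numeric local ID, else None."""
    match = LOCAL_ID.search(iri)
    if match is None:
        return None  # label-based IRIs need an explicit mapping instead
    prefix, number = match.groups()
    return f"{prefix.upper()}:{number}"


# FMA variants quoted in the slides.
variants = [
    "OBO:FMA_31396",
    "OBO:owlapi/fma#FMA_31396",
    "OBO:owl/FMA#FMA_31396",
    "OBO:fma#Cartilage_of_inferior_surface",
]
resolved = {iri: canonical_id(iri) for iri in variants}
```

The first three variants collapse to the same identifier, while the label-based one does not resolve at all, which matches the point that such representations need explicit mappings before they deliver the benefits of reuse.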

Page 49: Investigating Term Reuse and Overlap in Biomedical Ontologies

Key Findings

• ~3% Term Reuse
• Only popular or upper-level ontologies reused
• 14.4% Term Overlap
• Semantically-similar terms reused together
• Similarity metric for a Recommender system

Pages 50-52: Investigating Term Reuse and Overlap in Biomedical Ontologies

Understanding how Term Reuse Occurs

         Onto 1  Onto 2  Onto 3  Onto 4  Onto 5  Onto 6  Onto 7
Term 1     1       1       1       0       0       0       0
Term 2     0       0       0       1       1       0       0
Term 3     0       0       0       0       0       1       1
Term 4     1       1       0       0       1       0       0
Term 5     1       1       1       0       0       0       1
Term 6     0       0       0       1       1       1       0
Term 7     0       0       1       0       1       0       0

• Term-Ontology Matrix -> K-modes Clustering
• Term-Term Affinity Matrix -> Spectral Clustering
• Weighted Similarity Score between Term pairs
  – Shared Ontologies
  – Jaccard Semantic Similarity Score
  – CUI Hierarchy from UMLS Metathesaurus
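
A term-term affinity matrix can be derived from the term-ontology matrix on the slide, as sketched below. The talk names a weighted score with three components but gives neither the weights nor the exact definitions, so this sketch covers only the shared-ontologies component and assumes a plain Jaccard index over the sets of ontologies using each term.

```python
from itertools import combinations

# Term-Ontology matrix from the slide: rows are terms, columns are Onto 1..7.
term_ontology = {
    "Term 1": [1, 1, 1, 0, 0, 0, 0],
    "Term 2": [0, 0, 0, 1, 1, 0, 0],
    "Term 3": [0, 0, 0, 0, 0, 1, 1],
    "Term 4": [1, 1, 0, 0, 1, 0, 0],
    "Term 5": [1, 1, 1, 0, 0, 0, 1],
    "Term 6": [0, 0, 0, 1, 1, 1, 0],
    "Term 7": [0, 0, 1, 0, 1, 0, 0],
}


def jaccard(row_a, row_b):
    """Jaccard index of the sets of ontologies that use each term."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return both / either if either else 0.0


# Term-Term affinity matrix, stored as a dict keyed by unordered term pairs.
affinity = {
    (t1, t2): jaccard(term_ontology[t1], term_ontology[t2])
    for t1, t2 in combinations(term_ontology, 2)
}
```

For example, Term 1 and Term 5 share three of the four ontologies that use either of them, giving an affinity of 0.75, while Term 1 and Term 2 share none. A matrix of this kind is the sort of input a spectral clustering step would take.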

Pages 53-55: Investigating Term Reuse and Overlap in Biomedical Ontologies

Semantically-similar terms are reused together!

[Figure labels: Cluster Size; Semantic Similarity < 0.9; Semantic Similarity > 0.9. Pages 54-55 show only Semantic Similarity > 0.9.]

Page 56: Investigating Term Reuse and Overlap in Biomedical Ontologies

Semantically-similar terms (parent-child or siblings) are reused together …

The similarity metric and BioPortal can be used to provide recommendations to ontology developers through a WebProtégé plugin!

Page 57: Investigating Term Reuse and Overlap in Biomedical Ontologies

Challenges to Term Reuse

• Substantial term overlap but less than 5% reuse.

• Lexically-similar terms may represent different concepts (e.g., anatomical concepts between ZFA and XAO).

• Lexically-different terms may represent the same concept (e.g., myocardium and cardiac muscle).

• The same terms use different IRI representations, without explicit CUI or xref mappings.

• Lack of guidelines and semi-automated tools.

Page 58: Investigating Term Reuse and Overlap in Biomedical Ontologies

Future Work: WebProtégé Plugin

• Term reuse recommendations using Item-based Collaborative Filtering method.
• Two-fold (A Posteriori and User-Centered) Evaluation

[Figure: example GO terms: GO:0033036, GO:0008104, GO:1902432, GO:1903260, GO:0061472, GO:0090174, GO:0071850, GO:0044770, GO:0044839, GO:0045786, GO:0007050, GO:0044843, GO:1902969, GO:0036226]
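
The slide only names item-based collaborative filtering, so the following is a generic sketch of that technique rather than the planned plugin. It treats ontologies as "users" and terms as "items". The usage data, the cosine similarity choice and the `recommend` helper are all assumptions made for illustration.

```python
import math

# Hypothetical data: which ontology (user) already reuses which term (item).
usage = {
    "ONTO_A": {"T1", "T2", "T3"},
    "ONTO_B": {"T1", "T2"},
    "ONTO_C": {"T2", "T3", "T4"},
    "ONTO_D": {"T1"},
}


def item_similarity(term_a, term_b):
    """Cosine similarity between two terms over the ontologies that reuse them."""
    users_a = {o for o, terms in usage.items() if term_a in terms}
    users_b = {o for o, terms in usage.items() if term_b in terms}
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def recommend(ontology, top_n=3):
    """Rank terms the ontology does not use yet by similarity to its own terms."""
    owned = usage[ontology]
    candidates = set().union(*usage.values()) - owned
    scores = {
        cand: sum(item_similarity(cand, term) for term in owned)
        for cand in candidates
    }
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


recommendations = recommend("ONTO_D")
```

With this toy data, ONTO_D (which uses only T1) is offered T2 first, because T2 is most often reused alongside T1, then T3, then T4. A real recommender would presumably replace the co-usage similarity with the weighted similarity score from the clustering analysis.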

Page 59: Investigating Term Reuse and Overlap in Biomedical Ontologies

The Road Ahead …

- Still far from achieving ideal term reuse, beyond upper-level and popular ontologies

- Newer ontologies added in BioPortal

- Without strict guidelines and semi-automated tools, we will deviate further away …

Page 60: Investigating Term Reuse and Overlap in Biomedical Ontologies

Acknowledgments

Musen Lab, Stanford

BMI PhD Program, Stanford

US NIH Grants GM086587, GM103316

[email protected]

http://stanford.edu/~maulikrk/data/OntologyReuse