Post on 17-Jan-2017


Investigating Term Reuse and Overlap in Biomedical Ontologies

International Conference on Biomedical Ontology, Lisbon, 27th-30th July 2015

Maulik R. Kamdar, Tania Tudorache and Mark A. Musen

Are we there yet?

C0011849: Diabetes Mellitus

Diabetes Mellitus

Unified Medical Language System (UMLS)

Open Biomedical Ontologies (OBO) Foundry

SNOMEDCT ICD9CM

Binding to RNA (GRO#BindingToRNA)

GO:0003723

IRI / xref

RNA Binding (GO:0003723)

Gene Expression Ontology (GEXO)

Gene Regulation Ontology (GRO)

Gene Ontology (GO)

Ghazvinian, Amir, et al. "How orthogonal are the OBO Foundry ontologies?." J. Biomedical Semantics 2.S-2 (2011): S2.

OBO Reuse vs Overlap in 2010

Same IRI

Intent for Reuse

Xref mapping

September 2009

September 2010

Key Findings

• ~3% Term Reuse

• Only popular or upper-level ontologies reused

• 14.4% Term Overlap

• Semantically-similar terms reused together

• Similarity metric for a Recommender system

BioPortal Import Plugin

DOG4DAG

Ontofox Web tool

Neurological Disease Ontology (NDO)

Reuse of an Ontology: OBI

Reuse of Terms: OGMS


BioPortal N-triples dump

Biomedical Ontologies

Terms, Labels, xrefs, CUIs

Xref Reuse, IRI Reuse, CUI Reuse

Clustering

Determine Source Ontology

Term Overlap Analysis

509 ontologies

377 ontologies

Remove ontology views

5,718,276 class terms

Label normalization

Source-Target Ontology pairs

>35% reuse for ontology reuse

14.4% Naïve Term Overlap!

• Normalized String Matching on Term Labels

14.4% (823,621)
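A minimal sketch of how a naive overlap count like this could be computed. The slides do not spell out the normalization rules, so lowercasing, replacing punctuation with spaces and collapsing whitespace are assumptions, as is the rule that a term "overlaps" when its normalized label also occurs in at least one other ontology. The function names and the toy data are hypothetical.

```python
import re
from collections import defaultdict

def normalize_label(label):
    """Lowercase, turn punctuation/underscores into spaces, collapse whitespace."""
    cleaned = re.sub(r"[\W_]+", " ", label.lower())
    return " ".join(cleaned.split())

def naive_overlap(terms):
    """terms: iterable of (ontology, label) pairs.

    Returns the number of terms whose normalized label also occurs
    in at least one other ontology.
    """
    terms = list(terms)
    ontologies_by_label = defaultdict(set)
    for ontology, label in terms:
        ontologies_by_label[normalize_label(label)].add(ontology)
    return sum(
        1 for _, label in terms
        if len(ontologies_by_label[normalize_label(label)]) > 1
    )

# Toy data (hypothetical), not taken from the BioPortal dump.
TOY_TERMS = [
    ("GO", "RNA binding"),
    ("GEXO", "RNA Binding"),
    ("GRO", "Binding to RNA"),
    ("FMA", "Myocardium"),
    ("NCIT", "myocardium"),
]
OVERLAPPING = naive_overlap(TOY_TERMS)
```

Note that "Binding to RNA" is not matched to "RNA binding" here: plain string matching misses reordered labels, which is one reason the figure is called naive.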

IRI Reuse

156/377 ontologies reuse no terms from other ontologies!

Less than 5% of terms reused from other ontologies!

Xref Reuse

315/377 ontologies xref link to no terms from other ontologies!

Less than 5% of terms reused from other ontologies!

IRI Reuse

263/377 ontologies have no terms reused by other ontologies!

Reuse from a small set of ontologies only!

Xref Reuse

286/377 ontologies have no terms xref linked by other ontologies!

Reuse from a small set of ontologies only!

0-5% of total terms reused explicitly or using xref, with >150 ontologies showing 0% reuse. Average Term Reuse ~ 3%

Reuse from a small set of ontologies only with terms from >250 ontologies never reused
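A rough sketch of how per-ontology IRI reuse could be measured. It assumes the source ontology of a term can be read off its IRI namespace; the slides only say "Determine Source Ontology", so the prefix table, the function names and the toy ontology below are all hypothetical.

```python
# Hypothetical namespace table: IRI prefix -> source ontology.
NAMESPACES = {
    "http://purl.obolibrary.org/obo/GO_": "GO",
    "http://purl.obolibrary.org/obo/BFO_": "BFO",
    "http://example.org/toy#": "TOY",
}

def source_of(iri):
    """Return the ontology whose namespace prefixes this IRI, or None."""
    for prefix, ontology in NAMESPACES.items():
        if iri.startswith(prefix):
            return ontology
    return None

def iri_reuse(ontology_terms):
    """ontology_terms: dict mapping ontology -> set of class IRIs it contains.

    Returns dict ontology -> fraction of its terms whose source
    ontology is some other ontology.
    """
    reuse = {}
    for ontology, iris in ontology_terms.items():
        reused = [i for i in iris if source_of(i) not in (None, ontology)]
        reuse[ontology] = len(reused) / len(iris) if iris else 0.0
    return reuse

# Toy data (hypothetical): TOY imports one GO term and one BFO term.
ONTOLOGY_TERMS = {
    "TOY": {
        "http://example.org/toy#Sample",
        "http://example.org/toy#Assay",
        "http://purl.obolibrary.org/obo/GO_0003723",
        "http://purl.obolibrary.org/obo/BFO_0000001",
    },
    "GO": {
        "http://purl.obolibrary.org/obo/GO_0003723",
        "http://purl.obolibrary.org/obo/GO_0008104",
    },
}
REUSE = iri_reuse(ONTOLOGY_TERMS)
```

In this toy case TOY reuses half of its terms and GO reuses none; averaging such fractions over all ontologies is one way to arrive at a summary figure like the ~3% quoted above.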

>100% term reuse from some ontologies! Why?

[Figure: Number of Ontologies Reusing Terms (#), scale 0-100, for 16 source ontologies: BFO, GO, IAO, OBI, PATO, CHEBI, CL, NCBITAXON, UO, SO, UBERON, CARO, NCIT, FMA, MP, SNOMEDCT. Series: xref Reuse (No. of Ontologies), IRI Reuse (No. of Ontologies), % of Terms reused (IRIs), % of Terms reused (xref). Callout: >100% terms reused from some ontologies!]

BFO:101/39
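One way to read the "BFO:101/39" callout, as a worked calculation. This assumes the label means 101 distinct reused term identifiers attributed to BFO against 39 classes in the current BFO version; the slides do not define the label, so this reading and the function name are assumptions.

```python
def reuse_percentage(reused_identifiers, current_version_classes):
    """Reused identifiers as a percentage of the classes in the current version."""
    return 100.0 * reused_identifiers / current_version_classes

# Assumed reading of the "BFO:101/39" callout.
BFO_REUSE = reuse_percentage(101, 39)
```

Under that reading the ratio comes out at roughly 259%, which can only happen if some of the reused identifiers do not exist in the current version of the source ontology.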

… Reuse from a small set of popular or upper-level ontologies only with terms from >250 ontologies never reused

>100% terms reused w.r.t current version of the BFO, PATO, CARO, UO, SO ontologies! Needs rigorous analysis through term overlap …

CUI Reuse

[Figure: Number of Terms (Log Scale) per UMLS terminology, in successive panels by the number of terminologies a CUI is shared with: 0, 1-5, 6-10, 11-15, 16-20. Terminologies: ICD10PCS, HCPCS, NCBITAXON, LOINC, MESH, HL7, ICD10CM, OMIM, RXNORM, CPT, PDQ, MEDDRA, ICD9CM, NDDF, ICPC, ICPC2P, MDDB, NDFRT, SNOMEDCT, VANDF, CRISP, RCD, MEDLINE..., SNMI, COSTART, WHO-ART]

Procedural Terminologies do not share CUIs!

Minimal sharing of CUIs, especially across the UMLS procedural terminologies (ICD10PCS, HCPCS and CPT).

Several unique terms are introduced in the migration from ICD9CM to ICD10CM, leading to a decrease in term reuse.

Should there actually be Term Reuse?
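A small sketch of the kind of tally behind the CUI Reuse panels, assuming each panel counts, per terminology, the CUIs that are shared with a given number of other terminologies (0, 1-5, 6-10, ...). That reading of the panel labels is an assumption; the function name and the toy CUIs other than C0011849 are hypothetical.

```python
from collections import Counter, defaultdict

def cui_sharing_histogram(terminology_cuis, bin_width=5):
    """terminology_cuis: dict terminology -> set of CUIs.

    For each terminology, count its CUIs by how many OTHER terminologies
    also contain that CUI, bucketed as "0", "1-5", "6-10", ...
    """
    holders = defaultdict(set)  # CUI -> terminologies containing it
    for terminology, cuis in terminology_cuis.items():
        for cui in cuis:
            holders[cui].add(terminology)

    def bucket(n_others):
        if n_others == 0:
            return "0"
        low = ((n_others - 1) // bin_width) * bin_width + 1
        return f"{low}-{low + bin_width - 1}"

    return {
        terminology: Counter(bucket(len(holders[cui]) - 1) for cui in cuis)
        for terminology, cuis in terminology_cuis.items()
    }

# Toy data: C0011849 (Diabetes Mellitus) appears in the slides;
# the CTOY identifiers are made up for illustration.
TOY_CUIS = {
    "SNOMEDCT": {"C0011849", "CTOY0001"},
    "ICD9CM": {"C0011849"},
    "ICD10PCS": {"CTOY0002"},
}
HISTOGRAM = cui_sharing_histogram(TOY_CUIS)
```

In the toy data ICD10PCS ends up entirely in the "0" bucket, mirroring the observation that procedural terminologies share few or no CUIs.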

Overlap decreases using correct representations!

• 14.4% (823,621): Normalized String Matching on Term Labels

• 13.2% (752,176): Removing Explicitly Reused Terms

• 10.8% (617,509): Removing Terms Mapped to the same UMLS CUI

• 1.6% (93,650): Removing almost-similar terms (same identifier and source ontology but different representation)
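As a quick arithmetic check, the four percentages are consistent with the counts being taken against the 5,718,276 class terms reported earlier in the slides. That choice of denominator is an assumption; the slides do not state it explicitly.

```python
TOTAL_CLASS_TERMS = 5_718_276  # class terms reported earlier in the slides

# Overlapping-term counts after each successive filtering step.
OVERLAP_COUNTS = {
    "normalized string matching on labels": 823_621,
    "after removing explicitly reused terms": 752_176,
    "after removing terms mapped to the same UMLS CUI": 617_509,
    "after removing almost-similar terms": 93_650,
}

# Percentage of all class terms, rounded to one decimal place.
OVERLAP_PERCENT = {
    step: round(100 * count / TOTAL_CLASS_TERMS, 1)
    for step, count in OVERLAP_COUNTS.items()
}
```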

Average 3% Term reuse across ontologies using any method, yet a 14.4% naïve Term overlap!

Term overlap decreases substantially on removing almost similar terms …

Examples for almost similar terms?

Ontology Engineers show an intent for reuse!

Intent

Different Versions: BFO, NCIT

• BFO: Version 1.0 / Version 1.1; Subcellular Anatomy Ontology (SAO), Suggested Ontology for Pharmacogenomics (SOPHARM)

• NCIT: NCIT:C53037 / NCIT:Cerebral_Vein; Cigarette Smoke Exposure (CSEO), Sage Bionetworks Synapse (SYN)

Different Notations: FMA

• OBO:FMA_31396

• OBO:owlapi/fma#FMA_31396

• OBO:owl/FMA#FMA_31396

• OBO:fma#Cartilage_of_inferior_surface …

Different Namespaces: MESH, SNOMEDCT

• MESH: http://purl.bioontology.org/ontology/MESH and http://phenomebrowser.net/ontologies/mesh/mesh.owl

• SNOMEDCT: http://ihtsdo.org/snomedct/ and http://purl.bioontology.org/ontology/SNOMEDCT

Different versions, notations, namespaces

• >100% Reuse of few source ontologies

• Increase in Term Overlap

Incorrect representations without mappings do not provide advantages of Term Reuse!
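A hedged sketch of how some of these "almost similar" representations could be reconciled, using the FMA notations quoted in the slides. It assumes the trailing local identifier has the form PREFIX_number or PREFIX:number; the regular expression and function names are illustrative only, not the authors' method.

```python
import re
from collections import defaultdict

# Trailing local identifier such as FMA_31396 or GO:0003723.
ID_PATTERN = re.compile(r"([A-Za-z]+)[_:](\d+)$")

def canonical_id(iri):
    """Return 'PREFIX:number' from the trailing local identifier, or None."""
    match = ID_PATTERN.search(iri)
    if match is None:
        return None
    prefix, number = match.groups()
    return f"{prefix.upper()}:{number}"

def group_by_identifier(iris):
    """Group IRIs that resolve to the same canonical identifier."""
    groups = defaultdict(list)
    for iri in iris:
        groups[canonical_id(iri)].append(iri)
    return dict(groups)

# Notations of the same FMA term, as written on the slides.
FMA_NOTATIONS = [
    "OBO:FMA_31396",
    "OBO:owlapi/fma#FMA_31396",
    "OBO:owl/FMA#FMA_31396",
]
GROUPS = group_by_identifier(FMA_NOTATIONS)
```

Label-based notations such as NCIT:Cerebral_Vein carry no numeric identifier, so this pattern returns None for them; reconciling those would need an explicit mapping.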


          Onto 1  Onto 2  Onto 3  Onto 4  Onto 5  Onto 6  Onto 7
Term 1      1       1       1       0       0       0       0
Term 2      0       0       0       1       1       0       0
Term 3      0       0       0       0       0       1       1
Term 4      1       1       0       0       1       0       0
Term 5      1       1       1       0       0       0       1
Term 6      0       0       0       1       1       1       0
Term 7      0       0       1       0       1       0       0

Understanding how Term Reuse Occurs

Term-Ontology Matrix -> K-modes Clustering

Term-Term Affinity Matrix -> Spectral Clustering

• Weighted Similarity Score between Term pairs
  - Shared Ontologies
  - Jaccard Semantic Similarity Score
  - CUI Hierarchy from UMLS Metathesaurus
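To make the shared-ontologies component concrete, here is a minimal sketch that builds a term-term affinity from the example term-ontology matrix, using Jaccard similarity over the sets of ontologies containing each term. The slides do not give the weighting scheme, and the semantic and CUI-hierarchy components are left out, so this is only a stand-in for one part of the score; all function names are hypothetical.

```python
from itertools import combinations

# Term-ontology matrix from the slides: rows are terms, columns are Onto 1..7.
MATRIX = {
    "Term 1": [1, 1, 1, 0, 0, 0, 0],
    "Term 2": [0, 0, 0, 1, 1, 0, 0],
    "Term 3": [0, 0, 0, 0, 0, 1, 1],
    "Term 4": [1, 1, 0, 0, 1, 0, 0],
    "Term 5": [1, 1, 1, 0, 0, 0, 1],
    "Term 6": [0, 0, 0, 1, 1, 1, 0],
    "Term 7": [0, 0, 1, 0, 1, 0, 0],
}

def ontology_set(row):
    """Indices (1-based) of the ontologies that contain the term."""
    return {i for i, flag in enumerate(row, start=1) if flag}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def affinity(matrix):
    """Jaccard affinity for every unordered pair of terms."""
    sets = {term: ontology_set(row) for term, row in matrix.items()}
    return {
        (s, t): jaccard(sets[s], sets[t])
        for s, t in combinations(sets, 2)
    }

AFFINITY = affinity(MATRIX)
```

Term 1 and Term 5 share three of four ontologies (affinity 0.75), while Term 1 and Term 3 share none; an affinity matrix of this kind is the usual input to spectral clustering.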

Semantically-similar terms are reused together!

[Figure: clusters by Cluster Size; legend: Semantic Similarity < 0.9, Semantic Similarity > 0.9]

Semantically-similar terms (parent-child or siblings) are reused together …

The similarity metric and BioPortal can be used to provide recommendations to ontology developers through a WebProtégé plugin!

Challenges to Term Reuse

• Substantial term overlap but less than 5% reuse.

• Lexically-similar terms may represent different concepts (e.g., anatomical concepts between ZFA and XAO).

• Lexically-different terms may represent the same concepts (e.g., myocardium and cardiac muscle).

• The same terms use different IRI representations, without explicit CUI or xref mappings.

• Lack of guidelines and semi-automated tools.

Future Work: WebProtégé Plugin

Term reuse recommendations using an Item-based Collaborative Filtering method.

Two-fold (A Posteriori and User-Centered) Evaluation
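A rough sketch of what item-based collaborative filtering could look like in this setting, treating ontologies as "users" and terms as "items". The planned plugin's actual scoring is not described in the slides; cosine similarity over co-occurrence, the function names and the toy data are all assumptions.

```python
from collections import defaultdict
from math import sqrt

def item_similarities(ontology_terms):
    """Cosine similarity between terms, based on the ontologies that use them."""
    users = defaultdict(set)  # term -> ontologies containing it
    for ontology, terms in ontology_terms.items():
        for term in terms:
            users[term].add(ontology)
    sims = {}
    for a in users:
        for b in users:
            if a != b:
                shared = len(users[a] & users[b])
                if shared:
                    sims[(a, b)] = shared / sqrt(len(users[a]) * len(users[b]))
    return sims

def recommend(target_terms, ontology_terms, top_n=3):
    """Rank terms not yet in target_terms by summed similarity to those that are."""
    sims = item_similarities(ontology_terms)
    scores = defaultdict(float)
    for (a, b), value in sims.items():
        if a in target_terms and b not in target_terms:
            scores[b] += value
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

# Toy data (hypothetical): which terms each existing ontology contains.
EXISTING = {
    "O1": {"t1", "t2", "t3"},
    "O2": {"t1", "t2"},
    "O3": {"t2", "t3", "t4"},
    "O4": {"t5"},
}
SUGGESTED = recommend({"t1"}, EXISTING)
```

For a new ontology that already contains t1, the sketch suggests t2 first and t3 second, because those terms co-occur with t1 most often in the existing ontologies.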

[Figure: example GO terms: GO:0033036, GO:0008104, GO:1902432, GO:1903260, GO:0061472, GO:0090174, GO:0071850, GO:0044770, GO:0044839, GO:0045786, GO:0007050, GO:0044843, GO:1902969, GO:0036226]

The Road Ahead …

- Still far from achieving ideal term reuse, beyond upper-level and popular ontologies

- Newer ontologies added in BioPortal

- Without strict guidelines and semi-automated tools, we will deviate further away …

Acknowledgments

Musen Lab, Stanford

BMI PhD Program, Stanford

US NIH Grants GM086587, GM103316

maulikrk@stanford.edu

http://stanford.edu/~maulikrk/data/OntologyReuse
