Instance-Based Ontology Matching by Instance Enrichment



Page 1: Instance-based Ontology Matching by Instance Enrichment

Instance-Based Ontology Matching by Instance Enrichment

Balthasar A.C. Schopman

Supervisors: Antoine Isaac, Shenghui Wang, Stefan Schlobach

Vrije Universiteit Amsterdam

June 29, 2009

Page 2: Instance-based Ontology Matching by Instance Enrichment

Outline

1 Ontology matching

2 Instance-based OM

3 IBOMbIE

4 Experiments

5 Comparison with other OM

6 Conclusions

Page 3: Instance-based Ontology Matching by Instance Enrichment

Research questions

General research questions:

How do different algorithm design options of IBOMbIE influence the final result?

How does the performance of IBOMbIE relate to other OM algorithms?

Page 4: Instance-based Ontology Matching by Instance Enrichment

Questions from the audience

Crucial questions: please interrupt me. Other questions: after the presentation, please.

Page 5: Instance-based Ontology Matching by Instance Enrichment

Introduction

Ontology

Definition of an ontology¹:

An ontology typically (1) defines a vocabulary relevant in a certain domain of interest, (2) specifies the meaning of terms and (3) specifies relations between terms.

Ontologies:

controlled vocabulary

thesaurus

database schema

canonical semantic web ontology: a set of typed, interrelated concepts defined in a formal language

¹ by Euzenat and Shvaiko

Page 7: Instance-based Ontology Matching by Instance Enrichment

Introduction

Ontology Matching (OM)

Ontologies ...

facilitate interoperability between parties

do not solve the heterogeneity problem, but raise it to a higher level: the OM level

Elementary OM techniques:

terminological

structure-based

semantic-based

instance-based

Page 9: Instance-based Ontology Matching by Instance Enrichment

Introduction

Instance-based OM (IBOM)

Variants of IBOM:

1 use dually annotated instances (DAI)

2 create DAI

3 use extension of concepts (DAI not required)

General pros and cons:

Con: does not deduce specific relations

Con: suitable instances rarely available

Pro: focus on active part of ontology

Pro: able to deal with ambiguous linguistic phenomena: synonyms, homonyms

Page 11: Instance-based Ontology Matching by Instance Enrichment

Intro

Definitions of ‘instance of’-relation

Example definitions:

Canonical semantic web definition

Library definition

Figure (canonical semantic web example): the resource someone:Peter has rdf:type foaf:Person, foaf:name "Peter" and foaf:knows someone:Nate, so someone:Peter is an instance of the concept foaf:Person.

Page 12: Instance-based Ontology Matching by Instance Enrichment

Intro

Definitions of ‘instance of’-relation

Example definitions:

Canonical semantic web definition

Library definition

Figure (library example): a vocabulary contains concepts c1, c2, c3, ...; object o1 is annotated with c1, and object o2 is annotated with c1, c2 and c3. An annotated object counts as an instance of the concepts it is annotated with.

Page 13: Instance-based Ontology Matching by Instance Enrichment

Intro

Application

Two library scenarios: KB and TEL

match controlled vocabularies

data-sets: book catalogs

multi-lingual

Page 14: Instance-based Ontology Matching by Instance Enrichment

IBOM

IBOM: measuring similarity

Figure (built up over four slides): two concepts c1 and c2, each with a set of instances; the instances shared by c1 and c2 indicate how related the two concepts are.

Page 18: Instance-based Ontology Matching by Instance Enrichment

IBOM

Jaccard coefficient

Jaccard coefficient:

J(c1, c2) = |i1 ∩ i2| / |i1 ∪ i2|

quantifies the overlap of the extensions of concepts → relatedness between concepts

Con: no multi-sets
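As an illustration, a minimal sketch of the Jaccard coefficient over two concept extensions, assuming each extension is available as a Python set of instance identifiers (an illustrative representation, not the thesis code):

```python
def jaccard(extension_c1, extension_c2):
    """Jaccard coefficient between two concept extensions (sets of instance IDs)."""
    union = extension_c1 | extension_c2
    if not union:
        return 0.0
    return len(extension_c1 & extension_c2) / len(union)

# Two concepts sharing 2 of 4 distinct instances -> 0.5
print(jaccard({"i1", "i2", "i3"}, {"i2", "i3", "i4"}))
```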

Page 20: Instance-based Ontology Matching by Instance Enrichment

IBOM

Creating dually annotated instances (DAI)

Jaccard needs DAI. If DAI are unavailable:

exact instance matching → merge annotations

approximate instance matching → enrich instances
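For the first option, a minimal sketch of exact instance matching followed by annotation merging; the record layout and the "isbn" key used for matching are illustrative assumptions, not taken from the thesis:

```python
def merge_by_exact_match(dataset_a, dataset_b, key="isbn"):
    """Create dually annotated instances by matching records on a shared key
    and merging the annotations of each matched pair."""
    index_b = {record[key]: record for record in dataset_b}
    dual = []
    for record in dataset_a:
        match = index_b.get(record[key])
        if match is not None:
            dual.append({key: record[key],
                         "annotations": set(record["annotations"]) | set(match["annotations"])})
    return dual
```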

Page 22: Instance-based Ontology Matching by Instance Enrichment

Instance matching

Approximate instance matching

Instance similarity measures:

Lucene

vector space model (VSM)

Page 23: Instance-based Ontology Matching by Instance Enrichment

Enriching instances

Basic instance enrichment (IE)

Figure (two build slides): instance i1 in data-set D1 carries annotations {a, b} and instance i2 in data-set D2 carries annotations {A, B}; i2 is found as the best match for i1 and its annotations A and B are copied onto i1, turning i1 into a dually annotated instance.

Page 25: Instance-based Ontology Matching by Instance Enrichment

Enriching instances

IE parameter: topN

Figure (four build slides): instance i1 in D1 (annotations {a, b}) is matched against instances of D2: i2 with {A, B} (1st match), i3 with {D} (2nd match) and i4 with {A, C} (3rd match). With a larger topN, i1 is enriched with the annotations of more matches: first {A, B}, then {D}, then {A, C}.

Page 29: Instance-based Ontology Matching by Instance Enrichment

Enriching instances

IE parameter: similarity threshold (ST)

Figure (four build slides): the same setting, now with similarity scores sim(i1, i2) = 0.8, sim(i1, i3) = 0.4 and sim(i1, i4) = 0.2; the annotations of a matching instance are only copied onto i1 when its similarity exceeds the threshold ST, so lowering ST lets more (and less similar) matches contribute.
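A minimal sketch of how the two enrichment parameters could be combined, assuming some similarity function sim(a, b) between instances (e.g. a Lucene score or a VSM cosine); the Instance class and parameter names are illustrative assumptions, not the thesis implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    text: str                                      # e.g. title + description of a book record
    annotations: set = field(default_factory=set)  # concept IDs from the instance's own vocabulary

def enrich(instance, candidates, sim, top_n=1, threshold=None):
    """Copy annotations of the best-matching candidate instances onto `instance`.

    top_n     -- keep at most the N best matches (the topN parameter)
    threshold -- if set, discard matches whose similarity is below it (the ST parameter)
    """
    scored = sorted(((sim(instance, c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    for score, match in scored[:top_n]:
        if threshold is not None and score < threshold:
            break  # candidates are sorted, so all remaining scores are below the threshold too
        instance.annotations |= match.annotations
    return instance
```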

Page 33: Instance-based Ontology Matching by Instance Enrichment

Experimental questions

Experimental questions

Instance similarity measure

topN parameter

ST parameter

combining topN + ST parameters

performance as compared to other OM algorithms

Page 34: Instance-based Ontology Matching by Instance Enrichment

Evaluation

Alignment evaluation

Methods:

Gold standard := good alignment

Reindexing

Measures:

Precision

Recall

f-measure

Page 35: Instance-based Ontology Matching by Instance Enrichment

Results of experiments

Results: instance similarity measure - quality

Figure: performance (precision P, recall R, f-measure F) against mapping rank for VSM and Lucene; (a) gold standard evaluation, (b) reindex evaluation.

Virtually equal

Page 36: Instance-based Ontology Matching by Instance Enrichment

Results of experiments

Results: instance similarity measure - quality

Figure: (c) overlap between the VSM and Lucene mappings against mapping rank; (d) manual evaluation: precision of VSM vs. Lucene over the highest-ranked mappings.

Edge to VSM

Page 37: Instance-based Ontology Matching by Instance Enrichment

Results of experiments

Results: instance similarity measure - run-time

Time to enrich 100K instances (hrs:min):

indexed instances   Lucene    VSM
524K                  1:04   0:17
1,457K                7:20   0:22
2,506K               26:15   0:32

Figure: increase in run-time against the number of indexed documents (× 100K), for VSM and Lucene.

Optimizations of VSM:

pre-calculate the weights of indexed documents

purge insignificant weights (35% + 50%)

word-centered indexing approach

Page 39: Instance-based Ontology Matching by Instance Enrichment

Results of experiments

Results: topN parameter (TEL)

As N increases, the quality of the mappings decreases.

Figure: f-measure against mapping rank for topN = 1 (baseline) up to topN = 6; (i) gold standard evaluation, (j) reindex evaluation.

Page 40: Instance-based Ontology Matching by Instance Enrichment

Results of experiments

Results: similarity threshold parameter (KB)

Best performance with ST: ST = µ

Best performance overall: baseline (topN = 1, ST = ∞)

Figure: f-measure against mapping rank for the baseline and for thresholds T = mean and T = mean ± {0.5, 1, 1.5}·s; (k) gold standard evaluation, (l) reindex evaluation.

Page 41: Instance-based Ontology Matching by Instance Enrichment

Results of experiments

Results: combining parameters

Using both parameters performs well in TEL but not in KB, possibly because a more selective IBOMbIE pays off in TEL: there the vocabularies and instance annotations differ more than in the KB scenario.

Figure: f-measure against mapping rank for the baseline and for combinations of topN ∈ {1, 2, 3} with ST ∈ {µ - 0.5s, µ, µ + 0.5s}; (m) KB, (n) TEL.

(evaluation method: reindexing)

Page 42: Instance-based Ontology Matching by Instance Enrichment

OAEI

Ontology alignment evaluation initiative (OAEI)

            terminological   structure-based   semantic-based   instance-based
DSSim             ✓                ✓                  ✓                ✗
Lily              ✓                ✓                  ✓                ✗
TaxoMap           ✓                ✓                  ✓                ✗
IBOMbIE           ✗                ✗                  ✗                ✓

DSSim, Lily and TaxoMap:

consider KB ontologies “huge”

feature functionality to deal with large ontologies

Page 43: Instance-based Ontology Matching by Instance Enrichment

OAEI

Performance comparison: quality

Figure: precision (P) and recall (R) against mapping rank for IBOMbIE (topN = 1), DSSim, Lily and TaxoMap.

Page 44: Instance-based Ontology Matching by Instance Enrichment

OAEI

Performance comparison: resources + coverage

matcher    run-time   number of mappings
DSSim         12:00                2,930
Lily              ?                2,797
TaxoMap        2:40                1,851
IBOMbIE        1:54               7,000+

(Number of lexically equal concepts in the KB vocabularies: 2,895)

Page 45: Instance-based Ontology Matching by Instance Enrichment

Conclusions + discussion

The IBOMbIE algorithm is quite promising:

Relatively low run-time

Able to deal with large ontologies

Number + quality of mappings

Pros of IBOM

Able to align ontologies using disjoint data-sets

Basic instance enrichment appears to be the best-performing method. A possible cause: the Jaccard coefficient does not support multi-sets.

Page 46: Instance-based Ontology Matching by Instance Enrichment

Fin

Thank you... any questions?

Page 47: Instance-based Ontology Matching by Instance Enrichment

Vocabularies

scenario   vocabulary   size
KB         GTT           35K
           Brinkman       5K
TEL        LCSH         340K
           Rameau       155K
           SWD          805K

Page 48: Instance-based Ontology Matching by Instance Enrichment

IE parameter: similarity threshold (ST)

          D1 annotated with   D2 annotated with      µ       σ
KB               O1                  O2            0.297   0.106
                 O2                  O1            0.279   0.101
TEL              O1                  O2            0.260   0.097
                 O2                  O1            0.232   0.084

standard ST: µ

step-size: ½σ
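A minimal sketch of how such thresholds could be derived from the distribution of similarity scores, with the standard ST at µ and candidates spaced ½σ apart; the function name and the example scores are illustrative:

```python
import statistics

def threshold_grid(similarity_scores, steps=(-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)):
    """Candidate similarity thresholds around the mean: ST = mu + k * sigma for each step k."""
    mu = statistics.mean(similarity_scores)
    sigma = statistics.stdev(similarity_scores)
    return {("mu" if k == 0 else f"mu{k:+.1f}s"): mu + k * sigma for k in steps}

# Example with made-up scores
print(threshold_grid([0.8, 0.4, 0.2, 0.35, 0.3, 0.25]))
```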

Page 49: Instance-based Ontology Matching by Instance Enrichment

VSM

Weights are components of vectors:

term frequency - inverse document frequency: TF-IDF

e.g. audiovisual features

tfidf(w, d) = tf(w, d) · idf(w)

tf(w, d) = √n(w, d) / |d|

idf(w) = log( |D| / |{d ∈ D : w ∈ d}| )

VSM cosine similarity:

cosine_sim(d1, d2) = (d1 · d2) / (|d1| · |d2|) = ( Σᵢ w(i, d1) · w(i, d2) ) / ( √(Σᵢ w(i, d1)²) · √(Σᵢ w(i, d2)²) )
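A minimal Python sketch of these two formulas, assuming instances are tokenized into lists of words; this is a simplified illustration, not the optimized implementation mentioned earlier:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for tokenized documents, using tf(w, d) = sqrt(n(w, d)) / |d|
    and idf(w) = log(|D| / |{d in D : w in d}|)."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({w: (math.sqrt(n) / len(doc)) * math.log(n_docs / doc_freq[w])
                        for w, n in counts.items()})
    return vectors

def cosine_sim(v1, v2):
    """Cosine similarity between two sparse vectors (dicts: word -> weight)."""
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

docs = ["medieval dutch history".split(),
        "history of medieval art".split(),
        "modern art".split()]
v1, v2, _ = tfidf_vectors(docs)
print(cosine_sim(v1, v2))
```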

Page 50: Instance-based Ontology Matching by Instance Enrichment

Evaluation method: gold standard

Gold standard := good alignment

P = precision = |{reference} ∩ {retrieved}| / |{retrieved}|

R = recall = |{reference} ∩ {retrieved}| / |{reference}|

F = f-measure = 2 · P · R / (P + R)
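A minimal sketch of this evaluation, assuming mappings are represented as (source concept, target concept) pairs; the pair representation and the tiny example are illustrative:

```python
def evaluate(retrieved, reference):
    """Precision, recall and f-measure of retrieved mappings against a gold standard."""
    correct = len(retrieved & reference)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(reference) if reference else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

gold = {("c1", "x1"), ("c2", "x2")}
found = {("c1", "x1"), ("c2", "x3")}
print(evaluate(found, gold))  # (0.5, 0.5, 0.5)
```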

Page 51: Instance-based Ontology Matching by Instance Enrichment

Evaluation method: reindexing

Figure: a dually annotated instance i_dual carries annotations {a, b} from ontology o_1 and {x} from ontology o_2; using the alignment, the o_1 annotations {a, b} are reindexed into o_2 concepts, yielding {x, z}, which is compared against the original o_2 annotations.

P = ( Σ over dually annotated instances of |{reference} ∩ {retrieved}| / |{retrieved}| ) / |{reindexed instances}|

R = ( Σ over dually annotated instances of |{reference} ∩ {retrieved}| / |{reference}| ) / |{dually annotated instances}|
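A minimal sketch of this per-instance evaluation, assuming the alignment is given as a dict from an o_1 concept to the set of o_2 concepts it is mapped to (an illustrative representation, not the thesis code):

```python
def reindex_evaluation(dual_instances, alignment):
    """Average per-instance precision and recall of reindexed annotations.

    dual_instances -- list of (annotations_o1, annotations_o2) pairs, each a set of concept IDs
    alignment      -- dict: o_1 concept -> set of aligned o_2 concepts
    """
    precisions, recalls = [], []
    for ann_o1, ann_o2 in dual_instances:
        retrieved = set()
        for concept in ann_o1:                       # reindex the o_1 annotations via the alignment
            retrieved |= alignment.get(concept, set())
        correct = len(retrieved & ann_o2)
        if retrieved:                                # only instances that were actually reindexed
            precisions.append(correct / len(retrieved))
        recalls.append(correct / len(ann_o2) if ann_o2 else 0.0)
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(dual_instances) if dual_instances else 0.0
    return precision, recall

# Example matching the figure: {a, b} reindexes to {x, z}, real o_2 annotation is {x}
print(reindex_evaluation([({"a", "b"}, {"x"})], {"a": {"x"}, "b": {"z"}}))  # (0.5, 1.0)
```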

Page 52: Instance-based Ontology Matching by Instance Enrichment

IbOM by IM algorithm overview

Whole algorithm

Start: two data-sets Dx and Dy

1 Enrich the instances of Dx with annotations of instances of Dy. For every instance a:

1 Find the N best matching instances {b} in Dy

2 Add the annotations of {b} to a

2 Enrich vice versa

3 Merge the data-sets into one dually annotated data-set

4 Apply the Jaccard measure

A compact end-to-end sketch of these steps follows below.
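The sketch below illustrates the four steps under simplifying assumptions: each data-set is a dict mapping an instance id to a (text, annotations) pair, sim is any instance-similarity function (e.g. the cosine similarity sketched earlier), and only the topN parameter is shown. It is an illustration of the pipeline, not the thesis implementation:

```python
from collections import defaultdict

def jaccard(ext1, ext2):
    union = ext1 | ext2
    return len(ext1 & ext2) / len(union) if union else 0.0

def enrich_all(dataset_a, dataset_b, sim, top_n=1):
    """Steps 1 and 2: copy annotations of the top-N most similar D_b instances onto each D_a instance."""
    enriched = []
    for text, annotations in dataset_a.values():
        ranked = sorted(dataset_b.values(), key=lambda other: sim(text, other[0]), reverse=True)
        merged = set(annotations)
        for other_text, other_annotations in ranked[:top_n]:
            merged |= other_annotations
        enriched.append(merged)
    return enriched

def ibombie(dataset_x, dataset_y, sim, top_n=1):
    """Steps 3 and 4: merge the enriched data-sets and score concept pairs with Jaccard."""
    dual = enrich_all(dataset_x, dataset_y, sim, top_n) + enrich_all(dataset_y, dataset_x, sim, top_n)
    extensions = defaultdict(set)        # concept -> indices of instances annotated with it
    for idx, annotations in enumerate(dual):
        for concept in annotations:
            extensions[concept].add(idx)
    concepts_x = {c for _, ann in dataset_x.values() for c in ann}
    concepts_y = {c for _, ann in dataset_y.values() for c in ann}
    mappings = [(c1, c2, jaccard(extensions[c1], extensions[c2]))
                for c1 in concepts_x for c2 in concepts_y]
    return sorted(mappings, key=lambda m: m[2], reverse=True)
```

In practice one would also apply the ST threshold during enrichment and avoid scoring concept pairs that share no instances, but the steps mirror the outline above.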