Framework for Matching and Linking Large Ontologies

Kow Weng Onn, Michelle Lim Sien Niu, Dickson Lukose (MIMOS)
Gudrun Johannsen, Johannes Keizer (UN FAO)


Post on 04-Aug-2015



Outline

• Introduction
• Objective
• Previous Work
• Proposed Framework
• Initial Experimental Results
• Future Work
• Q&A

Linked Open Data (2011)


AGROVOC Thesaurus

• Multilingual agricultural thesaurus
• More than 40,000 concepts in up to 22 languages
• Standard for document indexing
• Information exchange and retrieval

Agriculture Linked Open Data (as of June 2012)

| Vocabulary | Domain | Language | Out-links (from AGROVOC) |
|---|---|---|---|
| EuroVoc | General EU | EN, ES, DE, FR, etc. (24 languages) | 1,297 |
| GEMET | Environment | EN, ES, DE, FR, etc. (29 languages) | 1,191 |
| LCSH | General | EN | 1,093 |
| NALT | Agriculture | EN, ES | 13,390 |
| STW | Economy | EN, DE | 1,136 |
| TheSoz | Social Science | EN, DE | 846 |
| RAMEAU | General | FR | 686 |
| DBpedia | General | EN, ES, DE, FR, etc. (97 languages) | 993 |
| DDC | General | EN, ES, DE, FR, etc. (12 languages) | 409 |
| Geopolitical Ontology | Geopolitical | AR, ZH, FR, EN, ES, RU, IT | 253 |
| SWD | General | DE | 5,965 |
| GeoNames | Geographical Database | 67 languages | 212 |
| ASFA Thesaurus | Aquatic Sciences | EN, FR, ES | 1,812 |
| FAO Biotechnology Glossary | Biotechnology | AR, ZH, EN, FR, RU, ES, PL, SR, VI | 791 |
| Total | | | 30,074 |

Why link AGROVOC Concepts?

• Allows access to document repositories and other agricultural data
• Achieves interoperability of data in the agricultural domain
• Allows linkage between the same concepts in different languages and different data sets
• Supports knowledge harvesting tools

Current Approach

Morshed, A., Caracciolo, C., Johannsen, G., Keizer, J. (2011): Thesaurus alignment for Linked Data publishing. International Conference on Dublin Core and Metadata Applications 2011.

Limitations of current approach

• The target ontology needs to be downloaded into a triple store, so it may not be the latest version
• Full comparison is time-consuming and not scalable
• Manual evaluation by domain experts is required, without tool support
• Multilingual terms are not exploited

Semantic Mediation Tool


Semantic Mediation Tool GUI


Proposed Framework


Proposed Framework

• Index and match strategy (80-20 hypothesis)
• Source and target accessed through endpoints
• Automatic discovery of alignments
• Visualization and navigation tools to aid decision-making
• Use multiple languages if available
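The index-and-match idea can be illustrated with a minimal sketch. It assumes a toy inverted index over lowercase label tokens and a Dice overlap score; this is a stand-in for illustration only, not the Lucene-based implementation used in the experiments, and all function names and sample labels are hypothetical.

```python
from collections import defaultdict

def tokenize(label):
    """Lowercase a label and split it on whitespace."""
    return label.lower().split()

def build_index(target_labels):
    """Map each token to the set of target concept ids whose label contains it."""
    index = defaultdict(set)
    for concept_id, label in target_labels.items():
        for token in tokenize(label):
            index[token].add(concept_id)
    return index

def match(source_label, index, target_labels):
    """Return (concept_id, score) pairs for targets sharing a token with the source.

    The score is the Dice coefficient over token sets (an assumption made for
    this sketch; Lucene scoring is different).
    """
    source_tokens = set(tokenize(source_label))
    candidates = set()
    for token in source_tokens:
        candidates |= index.get(token, set())
    scored = []
    for concept_id in candidates:
        target_tokens = set(tokenize(target_labels[concept_id]))
        overlap = len(source_tokens & target_tokens)
        score = 2 * overlap / (len(source_tokens) + len(target_tokens))
        scored.append((concept_id, score))
    # Best score first; ties broken by concept id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical target labels, for illustration only.
targets = {"t1": "Agricultural policy", "t2": "Trade policy", "t3": "Inflation"}
index = build_index(targets)
print(match("agricultural policy", index, targets))
```

Only targets that share at least one token with the source label are ever scored, which is what avoids a full pairwise comparison of the two vocabularies.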

Experimental Setup

• Source and target thesauri: AGROVOC and the STW Thesaurus of Economics
• English preferred labels used
• Lucene version 3.5 used for indexing
• Thresholds used to limit results returned
• Precision and recall calculated based on the existing 1,136 links

Use Jaro-Winkler Algorithm

Initial Experiments

• Three separate experiments
1. Only index and match
2. Stemming before matching
3. String distance (Jaro-Winkler algorithm) added to reduce misalignments
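For reference, the Jaro-Winkler similarity named in the third experiment can be sketched as follows. This is a textbook formulation with the commonly used prefix scale of 0.1 and a maximum prefix length of 4; the slides do not say which implementation or parameters were actually used, so treat those values as assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because the measure rewards shared prefixes and tolerates small transpositions, it suits short thesaurus labels that differ only in spelling variants.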

Proposed Framework

[Figure: framework diagram; labels: Rejected Matches, Accepted Matches, Threshold 1, Threshold 2]
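One possible reading of the two-threshold diagram is sketched below. It assumes that scores below the first threshold are rejected, scores at or above the second are accepted, and candidates in between are decided by a string-similarity check such as Jaro-Winkler. That interpretation, the function name, and the `min_similarity` cutoff are assumptions for illustration; the slides do not spell out the exact rule.

```python
def classify(index_score, label_similarity,
             primary=4.0, secondary=6.0, min_similarity=0.9):
    """Classify a candidate match as 'accepted' or 'rejected'.

    index_score      -- score returned by the index lookup
    label_similarity -- string similarity of the two labels, in [0, 1]
    primary          -- assumed to correspond to Threshold 1
    secondary        -- assumed to correspond to Threshold 2
    min_similarity   -- hypothetical cutoff for the in-between zone
    """
    if index_score < primary:
        return "rejected"
    if index_score >= secondary:
        return "accepted"
    # In-between zone: fall back on the string-similarity check.
    return "accepted" if label_similarity >= min_similarity else "rejected"

print(classify(5.0, 0.95))
print(classify(5.0, 0.50))
```

The default values 4.0 and 6.0 mirror one of the threshold pairs reported in the results; the similarity cutoff is purely illustrative.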

Example Experimental Output

16

Results

| Experiment | Mappings found | Correct mappings | Precision | Recall |
|---|---|---|---|---|
| Plain labels | 1587 | 1005 | 0.633 | 0.885 |
| With stemming included | 1624 | 1025 | 0.631 | 0.902 |
| With string distance | 1389 | 1062 | 0.765 | 0.935 |

Results with primary threshold = 4.0, secondary threshold = 6.0
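The precision and recall figures for this threshold setting are consistent with precision = correct mappings / mappings found and recall = correct mappings / 1,136 existing links. A short sketch of that arithmetic, with hypothetical names for the function and the data structure:

```python
EXISTING_LINKS = 1136  # reference AGROVOC-STW links used for evaluation

def precision_recall(found, correct, reference=EXISTING_LINKS):
    """Return (precision, recall) for one experiment."""
    return correct / found, correct / reference

# (mappings found, correct mappings) at primary = 4.0, secondary = 6.0
runs = {
    "Plain labels": (1587, 1005),
    "With stemming included": (1624, 1025),
    "With string distance": (1389, 1062),
}

for name, (found, correct) in runs.items():
    p, r = precision_recall(found, correct)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```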

Results

| Experiment | Mappings found | Correct mappings | Precision | Recall |
|---|---|---|---|---|
| Plain labels | 1420 | 1006 | 0.708 | 0.886 |
| With stemming included | 1445 | 1021 | 0.707 | 0.899 |
| With string distance | 1293 | 1037 | 0.802 | 0.913 |

Results with primary threshold = 5.0, secondary threshold = 6.0

Results

| Experiment | Mappings found | Correct mappings | Precision | Recall |
|---|---|---|---|---|
| Plain labels | 986 | 847 | 0.859 | 0.746 |
| With stemming included | 1005 | 857 | 0.852 | 0.754 |
| With string distance | 996 | 855 | 0.858 | 0.753 |

Results with primary threshold = 6.0, secondary threshold = 7.0

Discussion

• The framework runs quickly
– Indexing and matching take less than 200 seconds in our experiments
• Lower thresholds give high recall but precision suffers, and vice versa
• As we are proposing a semi-automatic system, higher recall is preferred
• It is hard to set a threshold on the Lucene score because it gives higher weight to less frequent words
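The last point can be made concrete with the inverse document frequency term documented for Lucene 3.x's default similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The sketch below shows only that one factor; the full Lucene score also involves term frequency, length normalization, and other factors, and the index size and document frequencies used here are hypothetical.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF factor as documented for Lucene's default TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_LABELS = 6000  # hypothetical number of indexed labels

# A rare word contributes far more to the score than a common one, so two
# equally good label matches can receive very different absolute scores.
rare = classic_idf(doc_freq=2, num_docs=NUM_LABELS)
common = classic_idf(doc_freq=900, num_docs=NUM_LABELS)
print(f"rare word idf={rare:.2f}, common word idf={common:.2f}")
```

Since the absolute score depends on how rare the matched words are, a single fixed cutoff treats labels built from common words more harshly than labels built from rare ones.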

Future Work

• Make use of the multilingual aspect of AGROVOC
• One-to-many and many-to-many matching and linking by having a large index
• Modify the Semantic Mediation Tool to serve as a front-end interface to the framework
• Experiment with unlinked ontologies (open question: how to evaluate?)