Exploiting Multilinguality For Creating Mappings Between Thesauri

Mauro Dragoni

Fondazione Bruno Kessler (FBK), Shape and Evolve Living Knowledge Unit (SHELL)

https://shell.fbk.eu/index.php/Mauro_Dragoni

SAC 2015, Salamanca, Spain

April 14th, 2015

Outline

1. Background on Ontology Matching

2. Motivations

3. The Approach

4. Evaluation of the System

Ontology Matching - 1

Given two thesauri/ontologies/vocabularies, find alignments between their entities

Formally, a “match” may be represented as the following 5-tuple:

‹ id, e1, e2, R, c ›
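
A minimal sketch of the 5-tuple as a data structure. The slides do not spell out the components, so the reading used here (id as an identifier, e1 and e2 as the matched entities, R as the relation between them, c as a confidence value) follows the usual convention in the ontology matching literature and is an assumption. The concept identifiers and the [0, 1] confidence range are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One correspondence between two entities, mirroring <id, e1, e2, R, c>."""
    id: str            # identifier of the correspondence
    e1: str            # entity from the first thesaurus
    e2: str            # entity from the second thesaurus
    relation: str      # R: relation assumed to hold between e1 and e2
    confidence: float  # c: confidence, assumed here to lie in [0, 1]

    def __post_init__(self):
        # Reject values outside the assumed confidence range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


# Hypothetical identifiers, for illustration only.
example = Match("m1", "agrovoc:food_chains", "eurovoc:food_chain", "=", 0.92)
```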

Extensive literature on matching approaches (since the early ’80s)

Ontology Matching - 2

Multilinguality started to be considered around 15 years ago

EuroWordNet

MultiWordNet

Domain-specific applications

English-Asian alignment

Multi-lingual vs. Cross-lingual

Motivations

Need: a system that suggests possible matches between concepts to domain experts

Exploit multilinguality… why?

it reduces ambiguity: the probability that two different concepts share the same label across several languages is very low.

term translations have been adapted to the domain: the experts in charge of the translations draw on their cultural heritage when choosing the right term for each concept.

First step of an ontology matching platform

The Proposed Approach - 1

Inspired by information retrieval techniques

Built on top of the Lucene search engine

For each element of the thesaurus, a structured multilingual representation is built:

[prefLabel] "Food chains"@en

[prefLabel] "Catene alimentari"@it

[altLabel] "Food distributions"@en

[altLabel] "Reti alimentari"@it

label-en: “food chain”

label-en: “food distribution”

label-it: “catena alimentare”

label-it: “rete alimentare”

An index is built for each thesaurus
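
The field layout in the example above can be sketched as follows. This is a toy illustration in plain Python, not the Lucene indexing code of the system: the function names `parse_label` and `build_document` are hypothetical, and only lowercasing is applied. The slides show singular forms ("food chain"), which suggests an additional stemming or lemmatization step that this sketch leaves out.

```python
import re
from collections import defaultdict


def parse_label(line):
    """Parse a line such as '[prefLabel] "Food chains"@en' into (kind, text, lang)."""
    m = re.match(r'\[(\w+)\]\s+"([^"]*)"@(\w+)', line)
    if m is None:
        raise ValueError(f"unrecognised label line: {line!r}")
    return m.group(1), m.group(2), m.group(3)


def build_document(label_lines):
    """Group the labels of one concept into per-language fields (label-en, label-it, ...)."""
    doc = defaultdict(list)
    for line in label_lines:
        _kind, text, lang = parse_label(line)
        # Preferred and alternative labels end up in the same language field.
        doc[f"label-{lang}"].append(text.lower())
    return dict(doc)


concept = build_document([
    '[prefLabel] "Food chains"@en',
    '[prefLabel] "Catene alimentari"@it',
    '[altLabel] "Food distributions"@en',
    '[altLabel] "Reti alimentari"@it',
])
```

Keeping one field per language means a query label is only compared with target labels in the same language.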

The Proposed Approach - 2

How are matches suggested?

source and target thesauri are chosen

for each concept of the source thesaurus, a query is run against the index of the target thesaurus

the standard Lucene scoring formula is used to compute the ranking

for each query, a ranked list of 5 suggestions is returned to the user
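
The suggestion steps listed above can be sketched in a few lines. The score used here is a simple count of shared tokens per language field; it stands in for the Lucene scoring formula and is not equivalent to it. The names `score` and `suggest`, the concept identifiers, and the sample index are all hypothetical.

```python
def tokens(text):
    """Split a label into a set of lowercase tokens."""
    return set(text.lower().split())


def score(source_doc, target_doc):
    """Count shared tokens field by field (toy score, NOT Lucene's formula)."""
    total = 0
    for field, labels in source_doc.items():
        src = set().union(*(tokens(label) for label in labels))
        tgt = set().union(*(tokens(label) for label in target_doc.get(field, [])))
        total += len(src & tgt)
    return total


def suggest(source_doc, target_index, k=5):
    """Return up to k target concept ids, best score first, dropping zero scores."""
    scored = [(score(source_doc, doc), cid) for cid, doc in target_index.items()]
    scored = [(s, cid) for s, cid in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by id
    return [cid for _s, cid in scored[:k]]


# Hypothetical target index with three concepts.
target_index = {
    "t1": {"label-en": ["food chain"], "label-it": ["catena alimentare"]},
    "t2": {"label-en": ["food safety"], "label-it": ["sicurezza alimentare"]},
    "t3": {"label-en": ["water"], "label-it": ["acqua"]},
}
source_concept = {"label-en": ["food chain"], "label-it": ["catena alimentare"]}
suggestions = suggest(source_concept, target_index)
```

Because the score sums over language fields, a candidate that agrees with the source concept in several languages ranks above one that agrees in a single language, which is the intuition behind exploiting multilinguality.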

Evaluation Set-Up

2 contexts:

six multilingual thesauri (3 medical domain, 3 agricultural domain)

adapted MultiFarm benchmark

2 tasks:

matching system (only the first suggestion is considered)

suggestion system

Results - 1

Mapping Set        # of Mappings   Prec@1   Prec@3   Prec@5   Recall
Eurovoc Agrovoc    1297            0.816    0.931    0.967    0.874
Agrovoc Eurovoc    1297            0.906    0.969    0.988    0.695
Avg.                               0.861    0.950    0.978    0.785
Gemet Agrovoc      1181            0.909    0.964    0.983    0.546
Agrovoc Gemet      1181            0.943    0.981    0.994    0.740
Avg.                               0.926    0.973    0.989    0.643
MDR MeSH           6061            0.776    0.914    0.956    0.807
MeSH MDR           6061            0.716    0.888    0.939    0.789
Avg.                               0.746    0.901    0.948    0.798
MDR SNOMED         19971           0.621    0.826    0.908    0.559
SNOMED MDR         19971           0.556    0.760    0.855    0.519
Avg.                               0.589    0.793    0.882    0.539
MeSH SNOMED        26634           0.690    0.871    0.931    0.660
SNOMED MeSH        26634           0.657    0.835    0.908    0.564
Avg.                               0.674    0.853    0.920    0.612

Results obtained by the proposed system on the domain-specific thesauri
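
The slides do not define Prec@k. A minimal sketch under one common reading, namely the fraction of source concepts whose correct target appears among the first k suggestions, is given below. This reading is an assumption, as are the function name `precision_at_k` and the sample data.

```python
def precision_at_k(rankings, gold, k):
    """Fraction of source concepts whose gold target is in the top-k suggestions.

    rankings: dict mapping a source concept id to its ranked list of target ids
    gold:     dict mapping a source concept id to its correct target id
    """
    if not rankings:
        return 0.0
    hits = sum(1 for src, ranked in rankings.items() if gold.get(src) in ranked[:k])
    return hits / len(rankings)


# Hypothetical rankings for three source concepts.
rankings = {
    "a": ["x", "y", "z"],
    "b": ["q", "p", "r"],
    "c": ["m", "n", "o"],
}
gold = {"a": "x", "b": "p", "c": "zz"}

p_at_1 = precision_at_k(rankings, gold, 1)
p_at_3 = precision_at_k(rankings, gold, 3)
```

Under this reading the value cannot decrease as k grows, which agrees with the Prec@1, Prec@3 and Prec@5 columns in the table.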

Results - 2

Mapping Set       IRBOM   WeSeE    RiMOM    YAM++    YAM++    AUTOMSv2
                          (2012)   (2013)   (2013)   (2012)   (2012)
Agrovoc Eurovoc   0.821   0.785    0.628    0.615    0.615    0.599
Gemet Agrovoc     0.759   0.726    0.548    0.579    0.579    0.485
MDR MeSH          0.771   0.749    0.611    0.613    0.613    0.536
MDR SNOMED        0.563   0.624    0.495    0.473    0.473    0.405
MeSH SNOMED       0.642   0.631    0.457    0.458    0.458    0.497

Results obtained by all systems on the domain-specific thesauri

Results - 3

System Name        Precision   Recall   F-Measure
IRBOM              0.68        0.43     0.53
WeSeE (2012)       0.61        0.32     0.41
RiMOM (2013)       0.52        0.13     0.21
YAM++ (2013)       0.51        0.36     0.40
YAM++ (2012)       0.50        0.36     0.40
AUTOMSv2 (2012)    0.49        0.10     0.36

Results obtained by all systems on the adapted MultiFarm benchmark

Future Work

Analyzing all kinds of relationships between concepts

Using weights associated with relationships

Improving the search mechanism (faceting, fuzzy, …)

Practical aspects: Web-application

Mauro Dragoni

https://shell.fbk.eu/index.php/Mauro_Dragoni
