Exploiting Multilinguality For Creating Mappings Between Thesauri Mauro Dragoni Fondazione Bruno Kessler (FBK), Shape and Evolve Living Knowledge Unit (SHELL) https://shell.fbk.eu/index.php/Mauro_Dragoni - [email protected] SAC 2015, Salamanca, Spain April 14th, 2015

Page 1: Exploiting Multilinguality For Creating Mappings Between Thesauri

Exploiting Multilinguality For Creating Mappings Between Thesauri

Mauro Dragoni

Fondazione Bruno Kessler (FBK), Shape and Evolve Living Knowledge Unit (SHELL)

https://shell.fbk.eu/index.php/Mauro_Dragoni - [email protected]

SAC 2015, Salamanca, Spain

April 14th, 2015

Outline

1. Background on Ontology Matching

2. Motivations

3. The Approach

4. Evaluation of the System

Ontology Matching - 1

Given two thesauri, ontologies, or vocabularies, find alignments between their entities

Formally, a “match” may be represented as the following 5-tuple:

‹ id, e1, e2, R, c ›
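A minimal sketch of how such a 5-tuple could be held in code. The slide does not spell out the components, so the reading used here is an assumption: `id` identifies the match, `e1` and `e2` are the two matched entities, `R` is the relation between them (for example equivalence), and `c` is a confidence value in [0, 1]. The class name `Match` and the example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One alignment between two entities (field meanings are assumed)."""
    id: str            # identifier of the match
    e1: str            # entity from the first thesaurus
    e2: str            # entity from the second thesaurus
    relation: str      # R: relation holding between e1 and e2, e.g. "="
    confidence: float  # c: confidence in the match, assumed to lie in [0, 1]

    def __post_init__(self):
        # Reject confidence values outside the assumed range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


# Hypothetical identifiers, used only for illustration.
m = Match("m1", "thesaurusA:food_chains", "thesaurusB:food_chains", "=", 0.93)
```

Making the record immutable keeps a suggested match from being altered after it has been shown to an expert; that is a design choice of this sketch, not something the slides prescribe.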

Extensive literature on matching approaches (since the early ‘80s)

Ontology Matching - 2

Multilinguality started to be considered around 15 years ago

EuroWordNet

MultiWordNet

Domain-specific applications

English-Asian alignment

Multi-lingual vs. Cross-lingual

Motivations

Need: a system that suggests possible matches between concepts to domain experts

Exploit multilinguality… why?

it reduces ambiguity: the probability that two different concepts share the same label across several languages is very low.

term translations have been adapted to the domain: the experts in charge of the translations draw heavily on their cultural heritage when choosing the right term for each concept.

First step of an ontology matching platform

The Proposed Approach - 1

Inspired by information retrieval techniques

Built on top of the Lucene search engine

An index is built for each thesaurus

For each element of the thesaurus, a structured multilingual representation is built:

[prefLabel] "Food chains"@en

[prefLabel] "Catene alimentari"@it

[altLabel] "Food distributions"@en

[altLabel] "Reti alimentari"@it

label-en: “food chain”

label-en: “food distribution”

label-it: “catena alimentare”

label-it: “rete alimentare”
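The mapping from SKOS-style labels to per-language fields can be sketched as follows. This is an illustration under assumptions, not the system's code: the function name `build_document`, the tuple layout of the input, and the default normalization (lowercasing only) are hypothetical. The slide's example also shows singular forms ("food chain" for "Food chains"), which suggests some stemming or lemmatization step that is not reproduced here.

```python
from collections import defaultdict


def build_document(concept_id, labels, normalize=str.lower):
    """Group a concept's labels into one 'label-<lang>' field per language.

    labels: iterable of (kind, text, lang) tuples,
            e.g. ("prefLabel", "Food chains", "en").
    normalize: text normalization applied to each label (assumed: lowercase).
    """
    fields = defaultdict(list)
    for _kind, text, lang in labels:
        # Preferred and alternative labels end up in the same language field.
        fields[f"label-{lang}"].append(normalize(text))
    return {"id": concept_id, **fields}


labels = [
    ("prefLabel", "Food chains", "en"),
    ("prefLabel", "Catene alimentari", "it"),
    ("altLabel", "Food distributions", "en"),
    ("altLabel", "Reti alimentari", "it"),
]
doc = build_document("c1", labels)
```

A different `normalize` callable (for instance one wrapping a stemmer) can be passed in without changing the rest of the function.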

The Proposed Approach - 2

How are matches suggested?

source and target thesauri are chosen

for each concept of the source thesaurus, a query is run against the index of the target thesaurus

the standard Lucene scoring formula is used for computing the ranking

for each query, a ranking of 5 suggestions is provided to the user
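The suggestion loop above can be sketched in a few lines. Important assumption: the scoring below is a simple token-overlap (Jaccard) stand-in summed over languages, not the Lucene scoring formula the system actually uses. Concepts are represented as dictionaries with `label-<lang>` fields, mirroring the slide's example; all names and the toy data are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens of a label, as a set."""
    return set(text.lower().split())


def score(source_doc, target_doc):
    """Sum, over language fields, of the best label overlap in that language."""
    total = 0.0
    for field, source_labels in source_doc.items():
        if not field.startswith("label-"):
            continue  # skip non-label fields such as "id"
        best = 0.0
        for s in source_labels:
            for t in target_doc.get(field, []):
                a, b = tokens(s), tokens(t)
                if a | b:
                    best = max(best, len(a & b) / len(a | b))
        total += best
    return total


def suggest(source_doc, target_index, k=5):
    """Return the ids of the k best-scoring target concepts (score > 0)."""
    scored = [(score(source_doc, d), d["id"]) for d in target_index]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [i for _, i in scored[:k]]


source = {"id": "s1", "label-en": ["food chain"], "label-it": ["catena alimentare"]}
target_index = [
    {"id": "t1", "label-en": ["food chain"], "label-it": ["catena alimentare"]},
    {"id": "t2", "label-en": ["food safety"], "label-it": ["sicurezza alimentare"]},
    {"id": "t3", "label-en": ["supply chain"]},
    {"id": "t4", "label-en": ["forestry"]},
]
suggestions = suggest(source, target_index)
```

Because scores accumulate across languages, a target concept that agrees with the source in both English and Italian ranks above one that only partially overlaps in a single language, which is the disambiguation effect the approach relies on.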

Evaluation Set-Up

2 contexts:

six multilingual thesauri (3 medical domain, 3 agricultural domain)

adapted MultiFarm benchmark

2 tasks:

matching system (only the first suggestion is considered)

suggestion system

Results - 1

Mapping Set      | # of Mappings | Prec@1 | Prec@3 | Prec@5 | Recall
Eurovoc Agrovoc  | 1297          | 0.816  | 0.931  | 0.967  | 0.874
Agrovoc Eurovoc  | 1297          | 0.906  | 0.969  | 0.988  | 0.695
Avg.             |               | 0.861  | 0.950  | 0.978  | 0.785
Gemet Agrovoc    | 1181          | 0.909  | 0.964  | 0.983  | 0.546
Agrovoc Gemet    | 1181          | 0.943  | 0.981  | 0.994  | 0.740
Avg.             |               | 0.926  | 0.973  | 0.989  | 0.643
MDR MeSH         | 6061          | 0.776  | 0.914  | 0.956  | 0.807
MeSH MDR         | 6061          | 0.716  | 0.888  | 0.939  | 0.789
Avg.             |               | 0.746  | 0.901  | 0.948  | 0.798
MDR SNOMED       | 19971         | 0.621  | 0.826  | 0.908  | 0.559
SNOMED MDR       | 19971         | 0.556  | 0.760  | 0.855  | 0.519
Avg.             |               | 0.589  | 0.793  | 0.882  | 0.539
MeSH SNOMED      | 26634         | 0.690  | 0.871  | 0.931  | 0.660
SNOMED MeSH      | 26634         | 0.657  | 0.835  | 0.908  | 0.564
Avg.             |               | 0.674  | 0.853  | 0.920  | 0.612

Results obtained by the proposed system on the domain-specific thesauri
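The slides do not define Prec@k. A common reading, assumed here, is the fraction of source concepts whose correct target appears among the first k suggestions. The sketch below computes that quantity on toy data; the function name, the denominator (all queried concepts), and the data are assumptions made for illustration.

```python
def precision_at_k(rankings, gold, k):
    """Fraction of source concepts whose gold target is in the top-k suggestions.

    rankings: dict mapping a source concept to its ranked list of suggestions.
    gold:     dict mapping a source concept to its correct target concept.
    """
    hits = sum(
        1 for source, ranked in rankings.items()
        if gold.get(source) in ranked[:k]
    )
    return hits / len(rankings)


# Toy data: the correct target sits at rank 1, 2, 3 and nowhere, respectively.
rankings = {
    "a": ["x", "y", "z"],
    "b": ["q", "w", "e"],
    "c": ["m", "n", "o"],
    "d": ["r", "s", "t"],
}
gold = {"a": "x", "b": "w", "c": "o", "d": "u"}

p1 = precision_at_k(rankings, gold, 1)
p3 = precision_at_k(rankings, gold, 3)
```

Under this reading, Prec@1 corresponds to the "matching system" task, where only the first suggestion counts, while Prec@3 and Prec@5 describe the "suggestion system" task.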

Results - 2

Mapping Set      | IRBOM | WeSeE (2012) | RiMOM (2013) | YAM++ (2013) | YAM++ (2012) | AUTOMSv2 (2012)
Agrovoc Eurovoc  | 0.821 | 0.785        | 0.628        | 0.615        | 0.615        | 0.599
Gemet Agrovoc    | 0.759 | 0.726        | 0.548        | 0.579        | 0.579        | 0.485
MDR MeSH         | 0.771 | 0.749        | 0.611        | 0.613        | 0.613        | 0.536
MDR SNOMED       | 0.563 | 0.624        | 0.495        | 0.473        | 0.473        | 0.405
MeSH SNOMED      | 0.642 | 0.631        | 0.457        | 0.458        | 0.458        | 0.497

Results obtained by all the systems on the domain-specific thesauri
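The slide does not name the metric reported in this comparison. One reading, which is an assumption rather than something the slides state, is that each value is an F-measure; for IRBOM, the harmonic mean of the averaged Prec@1 and Recall values from Results - 1 can be checked against the IRBOM column with a short calculation:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Avg. Prec@1, Avg. Recall) per thesaurus pair, copied from Results - 1.
averages = {
    "Agrovoc Eurovoc": (0.861, 0.785),
    "Gemet Agrovoc": (0.926, 0.643),
    "MDR MeSH": (0.746, 0.798),
    "MDR SNOMED": (0.589, 0.539),
    "MeSH SNOMED": (0.674, 0.612),
}

# IRBOM column copied from Results - 2.
reported = {
    "Agrovoc Eurovoc": 0.821,
    "Gemet Agrovoc": 0.759,
    "MDR MeSH": 0.771,
    "MDR SNOMED": 0.563,
    "MeSH SNOMED": 0.642,
}

for name, (p, r) in averages.items():
    print(name, round(f_measure(p, r), 3), reported[name])
```

The computed and reported values agree to three decimals for all five pairs, which supports the F-measure reading for the IRBOM column; whether the other systems' columns were computed the same way cannot be verified from the slides.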

Results - 3

System Name      | Precision | Recall | F-Measure
IRBOM            | 0.68      | 0.43   | 0.53
WeSeE (2012)     | 0.61      | 0.32   | 0.41
RiMOM (2013)     | 0.52      | 0.13   | 0.21
YAM++ (2013)     | 0.51      | 0.36   | 0.40
YAM++ (2012)     | 0.50      | 0.36   | 0.40
AUTOMSv2 (2012)  | 0.49      | 0.10   | 0.36

Results obtained by all systems on the adapted MultiFarm benchmark

Future Work

Analyzing all kinds of relationships between concepts

Using weights associated with relationships

Improving the search mechanism (faceting, fuzzy, …)

Practical aspects: Web-application

Mauro Dragoni

https://shell.fbk.eu/index.php/Mauro_Dragoni [email protected]