Framework for Matching and Linking Large Ontologies
Kow Weng Onn, Michelle Lim Sien Niu, Dickson Lukose (MIMOS)
Gudrun Johannsen, Johannes Keizer (UN FAO)
Outline
• Introduction
• Objective
• Previous Work
• Proposed Framework
• Initial Experimental Results
• Future Work
• Q&A
AGROVOC Thesaurus
• Multilingual agricultural thesaurus
• More than 40,000 concepts in up to 22 languages
• Standard for document indexing
• Information exchange and retrieval
Agriculture Linked Open Data (as of June 2012)
| Vocabulary | Domain | Language | Out-links (from AGROVOC) |
|---|---|---|---|
| EuroVoc | General EU | EN, ES, DE, FR, etc. (24 languages) | 1,297 |
| GEMET | Environment | EN, ES, DE, FR, etc. (29 languages) | 1,191 |
| LCSH | General | EN | 1,093 |
| NALT | Agriculture | EN, ES | 13,390 |
| STW | Economy | EN, DE | 1,136 |
| TheSoz | Social Science | EN, DE | 846 |
| RAMEAU | General | FR | 686 |
| DBpedia | General | EN, ES, DE, FR, etc. (97 languages) | 993 |
| DDC | General | EN, ES, DE, FR, etc. (12 languages) | 409 |
| Geopolitical Ontology | Geopolitical | AR, ZH, FR, EN, ES, RU, IT | 253 |
| SWD | General | DE | 5,965 |
| GeoNames | Geographical Database | 67 languages | 212 |
| ASFA Thesaurus | Aquatic Sciences | EN, FR, ES | 1,812 |
| FAO Biotechnology Glossary | Biotechnology | AR, ZH, EN, FR, RU, ES, PL, SR, VI | 791 |
| Total | | | 30,074 |
Why link AGROVOC Concepts?
• Allows access to document repositories and other agricultural data
• Achieves interoperability of data in the agricultural domain
• Allows linkage between the same concepts in different languages and different data sets
• Supports knowledge harvesting tools
Current Approach
Morshed, A., Caracciolo, C., Johannsen, G., Keizer, J. (2011): Thesaurus alignment for Linked Data publishing. International Conference on Dublin Core and Metadata Applications 2011
Limitations of current approach
• Target ontology needs to be downloaded into a triple store; may not be the latest version
• Full comparison is time-consuming and not scalable
• Manual evaluation by domain experts required, without tool support
• Multilingual terms not exploited
Proposed Framework
• Index and match strategy (80-20 hypothesis)
• Source and target accessed through endpoints
• Automatic discovery of alignments
• Visualization and navigation tools to aid decision-making
• Use multiple languages if available
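The index and match idea can be sketched as follows. This is only a minimal illustration, not the framework's implementation: the framework uses Lucene, whereas this sketch uses a hand-rolled inverted index and a simple token-overlap score. The function names, the sample labels, and the scoring are all assumptions made for the example. The point it shows is that a source label is compared only with target labels that share at least one token with it, instead of with every target concept.

```python
from collections import defaultdict


def tokenize(label):
    """Lower-case a label and split it into a set of tokens."""
    return set(label.lower().split())


def build_index(target_labels):
    """Map each token to the ids of the target concepts whose label contains it."""
    index = defaultdict(set)
    for concept_id, label in target_labels.items():
        for token in tokenize(label):
            index[token].add(concept_id)
    return index


def match(source_label, index, target_labels):
    """Return (concept_id, score) pairs for targets sharing a token with the source label."""
    source_tokens = tokenize(source_label)
    candidates = set()
    for token in source_tokens:
        candidates |= index.get(token, set())
    scored = []
    for concept_id in candidates:
        target_tokens = tokenize(target_labels[concept_id])
        # Token overlap (Jaccard); Lucene's own scoring is different.
        overlap = len(source_tokens & target_tokens) / len(source_tokens | target_tokens)
        scored.append((concept_id, overlap))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical target labels, for illustration only.
targets = {"t1": "agricultural policy", "t2": "trade policy", "t3": "water supply"}
index = build_index(targets)
candidates = match("Agricultural policy", index, targets)
```

Here "water supply" is never scored, because it shares no token with the source label.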
Experimental Setup
• Source and target thesauri: AGROVOC and the STW Thesaurus for Economics
• English preferred labels used
• Lucene version 3.5 used for indexing
• Thresholds used to limit results returned
• Precision and recall calculated based on the existing 1,136 links
Initial Experiments
• Three separate experiments
  1. Only index and match
  2. Stemming before matching
  3. String distance (Jaro-Winkler algorithm) added to reduce misalignments
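For readers unfamiliar with the string distance used in the third experiment, here is a minimal sketch of the standard Jaro-Winkler similarity. It is a textbook version written for illustration, with the usual defaults (prefix scale 0.1, prefix length capped at 4) assumed; the slides do not say which implementation or parameters were used.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


similarity = jaro_winkler("MARTHA", "MARHTA")  # about 0.961
```

A label pair with a low similarity can then be treated as a likely misalignment even when the index returned it as a candidate.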
[Figure: rejected and accepted matches in relation to Threshold 1 and Threshold 2]
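One possible reading of the two thresholds is sketched below. The slides do not spell out the exact decision rule, so this is an assumption: candidates scoring below the primary threshold are rejected, candidates at or above the secondary threshold are accepted, and candidates in between are accepted only if the labels are similar enough under a string distance. The function name and the `min_similarity` cut-off of 0.9 are hypothetical.

```python
def classify(lucene_score, label_similarity,
             primary=4.0, secondary=6.0, min_similarity=0.9):
    """Classify a candidate match under an assumed two-threshold rule."""
    if lucene_score < primary:
        return "rejected"
    if lucene_score >= secondary:
        return "accepted"
    # Between the two thresholds: let the string similarity decide (assumed rule).
    return "accepted" if label_similarity >= min_similarity else "rejected"


decisions = [
    classify(3.5, 1.00),  # below the primary threshold
    classify(6.2, 0.10),  # at or above the secondary threshold
    classify(5.0, 0.95),  # in between, labels similar
    classify(5.0, 0.50),  # in between, labels dissimilar
]
```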
Results
| Experiment | Mappings found | Correct mappings | Precision | Recall |
|---|---|---|---|---|
| Plain labels | 1587 | 1005 | 0.633 | 0.885 |
| With stemming included | 1624 | 1025 | 0.631 | 0.902 |
| With string distance | 1389 | 1062 | 0.765 | 0.935 |

Results with primary threshold = 4.0, secondary threshold = 6.0
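The precision and recall figures are consistent with dividing the correct mappings by the mappings found and by the 1,136 existing AGROVOC-STW links respectively. A short check, using the figures for primary threshold 4.0 and secondary threshold 6.0 (the function name is just for illustration):

```python
def precision_recall(found, correct, reference_links=1136):
    """Precision = correct / found; recall = correct / existing reference links."""
    return round(correct / found, 3), round(correct / reference_links, 3)


plain_labels = precision_recall(found=1587, correct=1005)     # (0.633, 0.885)
string_distance = precision_recall(found=1389, correct=1062)  # (0.765, 0.935)
```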
Results
| Experiment | Mappings found | Correct mappings | Precision | Recall |
|---|---|---|---|---|
| Plain labels | 1420 | 1006 | 0.708 | 0.886 |
| With stemming included | 1445 | 1021 | 0.707 | 0.899 |
| With string distance | 1293 | 1037 | 0.802 | 0.913 |

Results with primary threshold = 5.0, secondary threshold = 6.0
Results
| Experiment | Mappings found | Correct mappings | Precision | Recall |
|---|---|---|---|---|
| Plain labels | 986 | 847 | 0.859 | 0.746 |
| With stemming included | 1005 | 857 | 0.852 | 0.754 |
| With string distance | 996 | 855 | 0.858 | 0.753 |

Results with primary threshold = 6.0, secondary threshold = 7.0
Discussion
• The framework runs quickly
– Indexing and matching take less than 200 seconds in our experiments
• A low threshold gives high recall but lower precision, and vice versa
• As we’re proposing a semi-automatic system, higher recall is preferred
• It is hard to set a threshold on the Lucene score, as it gives higher weight to less frequent words
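The last point can be illustrated with the inverse document frequency factor of Lucene's classic TF-IDF scoring, which is commonly documented as 1 + ln(numDocs / (docFreq + 1)). The sketch below shows only that factor; the full Lucene score has further components, and the index size and document frequencies used here are hypothetical.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF factor as commonly documented for Lucene's classic similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_LABELS = 40000  # hypothetical index size

common_word = classic_idf(doc_freq=2000, num_docs=NUM_LABELS)  # frequent token
rare_word = classic_idf(doc_freq=5, num_docs=NUM_LABELS)       # infrequent token
```

Because a rare token contributes much more to the score than a common one, two label pairs of equal quality can receive very different scores, which makes a single fixed cut-off hard to choose.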
Future Work
• Make use of the multilingual aspect of AGROVOC
• One-to-many and many-to-many matching and linking by having a large index
• Modify the Semantic Mediation Tool to serve as a front end to the framework
• Experiment with unlinked ontologies; but how to evaluate?