ph d thesis-ahsan_slidesv3

Upload: ahsan-morshed

Post on 12-May-2015

DESCRIPTION

My PhD thesis slides which I presented on 12th of April at University of Trento, Trento, Italy

TRANSCRIPT

Page 1: Ph d thesis-ahsan_slidesv3

Ahsan Morshed, FAO 1 / 54

http://www.fao.org/aims/

Aligning Controlled Vocabularies for Enabling Semantic Matching in a Distributed Knowledge Management System

Ahsan Morshed, Doctoral Candidate, University of Trento

PhD Supervisor: Professor Fausto Giunchiglia

Page 2: Ph d thesis-ahsan_slidesv3

Publications (1-3)

A. Morshed. Controlled Vocabulary Matching in Distributed Systems. BNCOD 2009 Conference, UK.

A. Morshed and M. Sini. Aligning Controlled Vocabularies: Algorithm and Architecture. Workshop on Advanced Technologies for Digital Libraries 2009 (AT4DL), Trento, Italy.

M. Sini, J. Keizer, G. Johannsen, A. Morshed, S. Rajbhandari and M. Amirhosseini. The AGROVOC Concept Server Workbench System: Empowering management of agricultural vocabularies with semantics. International Association of Agricultural Information Specialists (IAALD), France, 2010.

Page 3: Ph d thesis-ahsan_slidesv3

Publications (4-6)

A. Morshed, G. Johannsen, J. Keizer and M. Zeng. Bridging End Users' Terms and AGROVOC Concept Server Vocabularies. International Conference on Dublin Core and Metadata Applications (DC-2010), Pittsburgh, USA, 2010 (submitted).

A. Morshed, M. Sini and J. Keizer. Aligning Controlled Vocabularies using a facet based approach. (Technical Paper at FAO).

A. Morshed and R. Singh. Evaluation and Ranking of Ontology Construction Tools (Technical Paper).

Page 4: Ph d thesis-ahsan_slidesv3

Agenda

Background: the role of controlled vocabulary in semantic matching

The overall goal: Aligning Controlled Vocabularies in a distributed system

Facet-based matching

An architecture for the matching system

A running prototype of the matching system

Evaluation Methodology and Results

Limitations and Related Works

Conclusions and Future work

Page 5: Ph d thesis-ahsan_slidesv3

Some matching techniques

Element Matching techniques

ex: edit distance

Corpus-based techniques

ex: token or extension of classes

Structure-based techniques

ex: graph matching

Knowledge-based techniques

ex: external resources
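
As an illustration of the element-level example named above (edit distance), the following is a minimal Python sketch of the Levenshtein distance with a normalized similarity score. The function names and the normalization by the longer string are assumptions made for this sketch; the slides do not say which variant the thesis uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

A matcher built on this would typically accept a pair of labels when the similarity exceeds a chosen threshold; the threshold itself is a tuning decision not covered here.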

Page 6: Ph d thesis-ahsan_slidesv3

Some matching systems

Cupid

- element level and structure level matching

RiMOM

- based on edit distance and Vector distance

FALCON-AO

- based on Linguistic and structure matching

CTXMatch, S-match

- based on knowledge-based matching

Page 7: Ph d thesis-ahsan_slidesv3

Some matching projects

HILT (High Level Thesaurus Project)

- JISC-funded project, UK

- aims to facilitate the cross-searching of distributed information services by subject in a multi-schema environment

- datasets used: e.g., DDC, LCSH, IPSV, AAT

CAT to AGROVOC (Dr. Chan chung)

64,638 Chinese terms: 51,614 descriptors and 13,024 non-descriptors

13,105 exact matches, 11,408 BT matches, 173 NT matches, and 17,47 other matches

Page 8: Ph d thesis-ahsan_slidesv3

Some matching projects

OAEI 2007 (Ontology Alignment Evaluation Initiative) -Food Track- AGROVOC-NALT thesauri

System      Alignments   Alignment type
Falcon-AO   15,300       exactMatch
RiMOM       18,420       exactMatch
X-SOM        6,583       exactMatch
DSSim       14,962       exactMatch

[Willem, 2008]

Page 9: Ph d thesis-ahsan_slidesv3

Matching in Distributed System

Edutella

Edutella is an open source project that creates an infrastructure for sharing metadata in RDF format.

It applies the peer-to-peer model using the JXTA protocol.

Swap

aims at overcoming the lack of semantics in current peer-to-peer systems.

Page 10: Ph d thesis-ahsan_slidesv3

Semantic Matching in Lightweight Ontologies

To use lightweight ontologies for matching purposes, all entities need to agree on the exact meaning of the concepts.

Descriptive lightweight ontologies

- used for defining the meaning of terms as well as the nature and structure of a domain.

Classification lightweight ontologies

- used for describing, classifying, and accessing collections of documents.

[Fausto et al., 2007]

Page 11: Ph d thesis-ahsan_slidesv3

Controlled Vocabulary (CV)

A vocabulary stores words, synonyms, word sense definitions (i.e., glosses), and relations between word senses and concepts; such a vocabulary is generally referred to as a Controlled Vocabulary (CV) if the choice or selection of terms is done by domain specialists [ahsan et al., 2009]

Page 12: Ph d thesis-ahsan_slidesv3

Controlled Vocabulary

General controlled vocabulary:

Example: Thesaurus, WordNet, Classification, Directories, Lightweight Ontologies

Subject specific controlled vocabulary (SSCV)

Library of Congress and Authors List

Uniform List

Series List

Page 13: Ph d thesis-ahsan_slidesv3

Applications for managing controlled vocabularies

Traditional Controlled Vocabulary tools

Ex: Old Agrovoc Thesaurus

Modern Controlled Vocabulary

Ex: AGROVOC Concept Server

Page 14: Ph d thesis-ahsan_slidesv3

AGROVOC Concept Server

- store concepts

- edit concepts

- visualize the concepts

modern controlled vocabulary

Ref: http://nais.cpe.ku.ac.th/agrovoc/

Page 15: Ph d thesis-ahsan_slidesv3

Applications for exploiting controlled vocabularies

Background Knowledge

Document annotation

Information retrieval and extraction

Audio and Video retrieval

Page 16: Ph d thesis-ahsan_slidesv3

Challenges of Matching

Factors of heterogeneity problem

Time

Place

Structure

Cultural diversity

Different vocabulary specialists

Page 17: Ph d thesis-ahsan_slidesv3

Challenges of Matching

Types of heterogeneity

Syntactic heterogeneity

Lexical heterogeneity

Semantic heterogeneity

Pragmatic heterogeneity

Metadata heterogeneity

[Pavel, 2006 ]

Page 18: Ph d thesis-ahsan_slidesv3

Problem of CV

Page 19: Ph d thesis-ahsan_slidesv3

FACET

A facet is like a diamond that consists of different faces.

Its distinct features allow thesauri, classifications or taxonomies to be organized in different ways.

Facets are composed of collectively exhaustive aspects, properties or characteristics of a domain.

For example, a collection of rice might be classified using cultural and seasonal facets.

[Fausto et al., 2009] [ahsan et al., 2009]
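
The rice example can be sketched in Python as items described by one value per facet, then grouped along whichever facet is of interest. The variety names and facet values below are purely hypothetical placeholders; the slides name only the two facets (seasonal and cultural), not their values.

```python
from collections import defaultdict

# Hypothetical data: each item carries one value for each facet.
RICE = {
    "variety A": {"seasonal": "wet season", "cultural": "irrigated"},
    "variety B": {"seasonal": "dry season", "cultural": "irrigated"},
    "variety C": {"seasonal": "wet season", "cultural": "upland"},
}


def group_by_facet(items, facet):
    """Organize the same collection along one chosen facet."""
    groups = defaultdict(list)
    for name, facets in items.items():
        groups[facets[facet]].append(name)
    return dict(groups)
```

Calling `group_by_facet(RICE, "seasonal")` and `group_by_facet(RICE, "cultural")` yields two different organizations of the same collection, which is the point of a faceted vocabulary.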

Page 20: Ph d thesis-ahsan_slidesv3

Faceted Controlled vocabulary

Page 21: Ph d thesis-ahsan_slidesv3

Faceted Controlled vocabulary

Seasonal rice type Cultural rice type

Page 22: Ph d thesis-ahsan_slidesv3

Creation of a Facet

Domain Analysis

the analysis of terms is done by consulting domain experts

simple concepts are identified

Term collection and organization

terms are ordered according to their characteristics and in a meaningful sequence

ex: cow and milk form a facet called Dairy system (part-of relationship)

[Fausto et al., 2009]

Page 23: Ph d thesis-ahsan_slidesv3

Existing Methodologies

PMEST: Personality (P), Matter (M), Energy (E), Space (S), and Time (T)

[Ranganathan]

DEPA: Discipline (D), Entity (E), Property (P), Action (A)

[Bhattachary and Fausto et al., 2009]

Page 24: Ph d thesis-ahsan_slidesv3

Properties of facets

Hospitality

Compactness

Flexibility

Reusability

The Methodology

Homogeneity

[Bhattachary and Fausto et al., 2009 ]

Page 25: Ph d thesis-ahsan_slidesv3

Concept Facet Matcher

Based on the DEPA model

CF = {mg, lg, R}, where mg is the set of more general concepts, lg is the set of less general concepts, and R is the set of related concepts.

Based on element-level matchers

[Ahsan, 2009 and Ahsan et al., 2009]

Page 26: Ph d thesis-ahsan_slidesv3

Concept Facet Matcher

Algorithm 1 buildCFacet(CV)
  for i = 0 to CV.size do
    store cF(Mg, Lg, R) for concept i
  end for
  return cF

[Ahsan et al., 2009 ]
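
Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the vocabulary is given as `(source, relation, target)` triples, and the thesaurus relations BT, NT and RT are taken to populate mg, lg and R respectively. That mapping, the triple format, and the names `ConceptFacet` and `build_cfacet` are assumptions for illustration, not the thesis implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptFacet:
    """CF = {mg, lg, R} for one concept."""
    concept: str
    mg: set = field(default_factory=set)  # more general concepts
    lg: set = field(default_factory=set)  # less general concepts
    r: set = field(default_factory=set)   # related concepts


def build_cfacet(relations):
    """Build one concept facet per term from (source, relation, target) triples.

    Assumed mapping: BT -> mg, NT -> lg, RT -> R; each relation is also
    recorded on the target term in its inverse direction.
    """
    facets = {}

    def facet(term):
        if term not in facets:
            facets[term] = ConceptFacet(term)
        return facets[term]

    for source, rel, target in relations:
        if rel == "BT":
            facet(source).mg.add(target)
            facet(target).lg.add(source)
        elif rel == "NT":
            facet(source).lg.add(target)
            facet(target).mg.add(source)
        elif rel == "RT":
            facet(source).r.add(target)
            facet(target).r.add(source)
    return facets
```

For instance, the triple `("frog farms", "BT", "aquaculture")` puts "aquaculture" into the mg set of "frog farms" and "frog farms" into the lg set of "aquaculture".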

Page 27: Ph d thesis-ahsan_slidesv3

Concept Facet Matcher

Algorithm 2 MatchingFacet(CV1, CV2)
  cF1 = BuildCFacet(CV1)
  cF2 = BuildCFacet(CV2)
  for i = 0 to cF1.size do
    for j = 0 to cF2.size do
      cfmatcher = elementLevelMatcher(cF1[i], cF2[j])
    end for
  end for

[Ahsan et al., 2009 ]
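
The pairwise loop of Algorithm 2 can be sketched in Python as below. Facets are represented here as plain dictionaries of sets so the block stands alone. The element-level matcher is a stand-in: the rule that a shared label token or a shared neighbouring concept counts as a partial match is an assumption made for this sketch, since the slides name only the three outcomes (exact match, partial match, no match).

```python
def normalize(label):
    """Lower-case a label and collapse whitespace before comparison."""
    return " ".join(label.lower().split())


def element_level_matcher(label1, facet1, label2, facet2):
    """Hypothetical matcher returning 'exact', 'partial' or 'no match'."""
    if normalize(label1) == normalize(label2):
        return "exact"
    tokens1 = set(normalize(label1).split())
    tokens2 = set(normalize(label2).split())
    # All concepts in the mg, lg and r sets of each facet.
    neighbours1 = {normalize(t) for terms in facet1.values() for t in terms}
    neighbours2 = {normalize(t) for terms in facet2.values() for t in terms}
    if tokens1 & tokens2 or neighbours1 & neighbours2:
        return "partial"
    return "no match"


def matching_facet(cf1, cf2):
    """Compare every concept facet of one vocabulary with every one of the other."""
    results = {}
    for label1, facet1 in cf1.items():
        for label2, facet2 in cf2.items():
            results[(label1, label2)] = element_level_matcher(
                label1, facet1, label2, facet2)
    return results
```

Because every pair is compared, the cost grows with the product of the two vocabulary sizes, which is why most pairs end up in the no-match bucket.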

Page 28: Ph d thesis-ahsan_slidesv3

System Architecture

Page 29: Ph d thesis-ahsan_slidesv3

Data Model

AGROVOC database

Ref: http://aims.fao.org/website/Download/sub

Page 30: Ph d thesis-ahsan_slidesv3

DATA Model

CABI database

Ref: http://cabi.org

Page 31: Ph d thesis-ahsan_slidesv3

A Running Prototype

Search String

Validators / domain specialists

Page 32: Ph d thesis-ahsan_slidesv3

An architecture for a semantic search

Page 33: Ph d thesis-ahsan_slidesv3

Running Prototype for search

user’s choice

Page 34: Ph d thesis-ahsan_slidesv3

Evaluation and Results

A domain Expert Exact Match Partial Match No Match

Page 35: Ph d thesis-ahsan_slidesv3

Datasets

Comparison

Characteristics    AGROVOC    CAB
Tree leaves        29172      47805
Term counts        18200      32884
Single words       6842       11720
Multiwords         11358      21161
Hierarchy depth    7          14
Multiple BT        2546       1207
Redundant BT       57         76

Page 36: Ph d thesis-ahsan_slidesv3

Datasets

AGROVOC version        2007-08-10     2007-08-10
CABI version           2009-11-01     2009-11-01
AGROVOC term-leaves    35036          35036
CABI term-leaves       29172          29172
Conversion             hierarchy      hierarchy
Knowledge base         WordNet 2.1    SWN 400.000

Page 37: Ph d thesis-ahsan_slidesv3

Datasets

Relationship    BT        NT        RT        UF
AGROVOC         228466    228424    326389    54370
CABI            15154     15841     41239     7094

Page 38: Ph d thesis-ahsan_slidesv3

Input files

Agrovoc input file CAB input file

Page 39: Ph d thesis-ahsan_slidesv3

Results

                 Experiment 1    Experiment 2
Exact Match      5976            6021
Partial Match    164255          164278
No Match         69800745        69800745

Facet-based approach

Page 40: Ph d thesis-ahsan_slidesv3

Results

                 Experiment 1    Experiment 2
Exact Match      8795            8795
Partial Match    334255          334258
No Match         N/A             N/A

Standard Tool

Page 41: Ph d thesis-ahsan_slidesv3

Results

            Min        Max        Min        Max
Overall     25.8065    31.4496    21.7391    21.7391
Positive    18.6047    14.0814    10.4895    14.6154
Negative    97.1831    52.1495    94.7368    99.1304

Page 42: Ph d thesis-ahsan_slidesv3

Advantage of Facet based System

No knowledge base required

Based on hidden semantics: the semantic meaning is retrieved during processing

Page 43: Ph d thesis-ahsan_slidesv3

Limitations

Structure Problems

AGROVOC is in SQL format and CABI is in text format.

The provided CABI file does not contain chemical and scientific concepts.

Term Variants

In AGROVOC, we found "frog farms", which should have been "frog farming", because "frog farms" is used for "frog culture" and its BT is "aquaculture". Also, we found the abbreviated term "UHT milk" (one kind of milk product), which should have been "UHT milk".

There were some ambiguous terms with different meanings, for example "cutting" (i.e., slicing of bread or meat) or "cuttings" (i.e., propagation material).

There were some term spellings whose meaning is too difficult to capture, for example "2.4.4-T", "2.4.5-TP 2.4-D", "2.4 DES", "2.4 dinitrohenol". Similarly, CABI contained the term "4-H Clubs". These terms did not make sense during any mapping experiments.

Page 44: Ph d thesis-ahsan_slidesv3

Limitations

Domain expert

To evaluate our results, we were able to find one domain expert from FAO, but we did not get any domain expert from CABI. The results may have been different if we had another domain expert.

Lack of consistency

Since the relationships in thesauri lack precise semantics, they are applied inconsistently, both creating ambiguity in the interpretation of the relationships and resulting in an overall internal structure that is irregular and unpredictable.

Page 45: Ph d thesis-ahsan_slidesv3

Limitations

Limited automated processing

Traditional thesauri are designed for indexing and query formulation by people and not for automated processing. The ambiguous semantics that characterizes many thesauri makes them unsuitable for automated processing.

Page 46: Ph d thesis-ahsan_slidesv3

Related Works

[Fausto et al., 2004] apply element-level matching techniques for semantic matching

[Stamou et al.] apply string matching techniques for ontology matching

[Karin Koogan Breitman et al., 2005] apply string matching techniques for lightweight ontology matching

[Paul Buitelaar et al., 2009] apply string matching for a linguistic matching system

[Maria Teresa Pazienza et al., 2007] apply string matching for a semi-automatic matching system

Page 47: Ph d thesis-ahsan_slidesv3

Conclusion and Future work

To build the extended knowledge base

Page 48: Ph d thesis-ahsan_slidesv3

Conclusion and Future work

Integrating Mapping into AGROVOC Concept Server

Page 49: Ph d thesis-ahsan_slidesv3

Conclusion and Future work

We have described the facet-based matching system for a large dataset.

We have shown a running prototype for this system.

The majority of this work was done under the supervision of the FAO and the CABI. At the moment, a prototype is running at the FAO.

We will integrate this mapping file into the AGROVOC Concept Server for search purposes.

Page 50: Ph d thesis-ahsan_slidesv3

Questions

Page 51: Ph d thesis-ahsan_slidesv3

References

[Fausto et al., 2003]: F. Giunchiglia and P. Shvaiko. Semantic matching. In Ontologies and Distributed Systems workshop, IJCAI, 2003.

[Fausto et al., 2004]: F. Giunchiglia, P. Shvaiko, and M. Yatskevich. S-Match: An algorithm and an implementation of semantic matching. In Proceedings of ESWS'04, 2004.

[Fausto et al., 2004]: F. Giunchiglia and M. Yatskevich. Element level semantic matching. In Meaning Coordination and Negotiation workshop, ISWC, 2004.

[Pavel et al., 2006]: P. Shvaiko, F. Giunchiglia and M. Yatskevich. Discovering missing background knowledge in ontology matching. In 17th European Conference on Artificial Intelligence (ECAI 2006), volume 141, pages 382-386, 2006.

Page 52: Ph d thesis-ahsan_slidesv3

References (cont)

[Fausto et al., 2007]: F. Giunchiglia and I. Zaihrayeu. Lightweight ontologies. Technical report, DIT, University of Trento, Italy, October 2007.

[Pavel et al., 2007]: P. Shvaiko and J. Euzenat. Ontology Matching. Springer, 1st edition, 2007.

[S.R. Ranganathan]: S.R. Ranganathan. Elements of Library Classification. Asia Publishing House.

Page 53: Ph d thesis-ahsan_slidesv3

References (cont)

[Fausto et al., 2009]: F. Giunchiglia, B. Dutta, and V. Maltese. Faceted lightweight ontologies. In LNCS, 2009.

[Bhattachary 1979]: G. Bhattachary. POPSI: its fundamentals and procedure based on a general theory of subject indexing language. In Library Science with a Slant to Documentation, volume 16, pages.

[Pavel]: P. Shvaiko. Iterative schema-based semantic matching (PhD thesis). Technical report DIT-06-102, December 2006.

[morshed 2009]: A. Morshed and M. Sini. Aligning Controlled Vocabularies: Algorithm and Architecture. Workshop on Advanced Technologies for Digital Libraries 2009 (AT4DL), Trento, Italy.

[Morshed 2009]: A. Morshed, M. Sini and J. Keizer. Aligning Controlled Vocabularies using a facet based approach. (Technical Paper at FAO).

Page 54: Ph d thesis-ahsan_slidesv3

Thank You