ph d thesis-ahsan_slidesv3
DESCRIPTION
My PhD thesis slides, which I presented on 12th of April at the University of Trento, Trento, Italy
TRANSCRIPT
Ahsan Morshed, FAO 1 / 54
http://www.fao.org/aims/
Aligning controlled vocabularies for enabling semantic matching in a distributed knowledge management system
Ahsan Morshed, Doctoral Candidate, University of Trento
PhD Supervisor: Professor Fausto Giunchiglia
Publications (1-3)
A. Morshed. Controlled Vocabulary Matching in Distributed Systems, at BNCOD 2009 Conference, UK.
A. Morshed and M. Sini. Aligning Controlled vocabularies: Algorithm and Architecture, at Workshop on Advanced Technologies for Digital Libraries 2009, AT4DL, Trento, Italy.
M. Sini, J. Keizer, G. Johannsen, A. Morshed, S. Rajbhandari and M. Amirhosseini. The AGROVOC Concept Server Workbench System: Empowering management of agricultural vocabularies with semantics, at International Association of Agricultural Information Specialists (IAALD), France, 2010.
Publications (4-6)
A. Morshed, G. Johannsen, J. Keizer and M. Zeng. Bridging End Users’ Terms and AGROVOC Concept Server Vocabularies. International Conference on Dublin Core and Metadata Applications (DC-2010), Pittsburgh, USA, 2010 (submitted).
A. Morshed, M. Sini and J. Keizer. Aligning Controlled Vocabularies using a facet based approach. (Technical Paper at FAO).
A. Morshed and R. Singh. Evaluation and Ranking of Ontology Construction Tools (Technical Paper).
Agenda
Background: the role of controlled vocabulary in semantic matching
The overall goal: Aligning Controlled Vocabularies in a distributed system
A facet-based matching approach
An architecture for the matching system
A running prototype of the matching system
Evaluation Methodology and Results
Limitations and Related Works
Conclusions and Future work
Some matching techniques
Element Matching techniques
ex: edit distance
Corpus-based techniques
ex: token or extension of classes
Structure-based techniques
ex: graph matching
Knowledge-based techniques
ex: external resources
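As an illustration of the element-level example above (edit distance), here is a minimal sketch in Python. The function names `edit_distance` and `similarity`, and the choice to normalise by the longer string, are assumptions made for this example; the slides do not say which variant the thesis used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to a score in [0, 1]; 1.0 means identical (case-insensitive)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / longest
```

A matcher built on this would typically accept a pair of labels when the score exceeds a chosen threshold; the threshold itself is a tuning decision not stated in the slides.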
Some matching systems
Cupid
- element-level and structure-level matching
RiMOM
- based on edit distance and vector distance
FALCON-AO
- based on linguistic and structure matching
CTXMatch, S-Match
- based on knowledge-based matching
Some matching projects
HILT (High Level Thesaurus Project)
- JISC-funded project, UK
- aims to facilitate the cross-searching of distributed information services by subject in a multi-schema environment
- datasets used (e.g., DDC, LCSH, IPSV, AAT)
CAT to AGROVOC, Dr. Chan chung
- 64,638 Chinese terms: 51,614 descriptors and 13,024 non-descriptors
- 13,105 exact matches, 11,408 BT matches, 173 NT matches, and 17,47 other matches
Some matching projects
OAEI 2007 (Ontology Alignment Evaluation Initiative), Food Track: AGROVOC-NALT thesauri
System      Alignments   Alignment type
Falcon-AO   15,300       exactMatch
RiMOM       18,420       exactMatch
X-SOM       6,583        exactMatch
DSSim       14,962       exactMatch
[Willem, 2008]
Matching in Distributed System
Edutella
Edutella is an open source project that creates an infrastructure for sharing metadata in RDF format
It applies the peer-to-peer model using the JXTA protocol
Swap
aims at overcoming the lack of semantics in current peer-to-peer systems
Semantic Matching in Lightweight ontologies
To use lightweight ontologies for matching purposes, all entities need to agree on the exact meaning of the concepts.
Descriptive lightweight ontologies
- used for defining the meaning of terms as well as the nature and structure of a domain.
Classification lightweight ontologies
- used for describing, classifying, and accessing collections of documents.
[Fausto et al., 2007]
Controlled Vocabulary (CV)
A vocabulary stores words, synonyms, word sense definitions (i.e., glosses), and relations between word senses and concepts; such a vocabulary is generally referred to as a Controlled Vocabulary (CV) if the choice or selection of terms is done by domain specialists [Ahsan et al., 2009]
Controlled Vocabulary
General controlled vocabulary:
Example: Thesaurus, WordNet, Classification, Directories, Lightweight Ontologies
Subject specific controlled vocabulary (SSCV)
Library of Congress and Authors List
Uniform List
Series List
Applications for managing controlled vocabularies
Traditional Controlled Vocabulary tools
Ex: Old Agrovoc Thesaurus
Modern Controlled Vocabulary
Ex: AGROVOC Concept Server
AGROVOC Concept Server
- store concepts
- edit concepts
- visualize the concepts
Modern controlled vocabulary
Ref: http://nais.cpe.ku.ac.th/agrovoc/
Applications for exploiting controlled vocabularies
Background Knowledge
Document annotation
Information retrieval and extraction
Audio and Video retrieval
Challenges of Matching
Factors behind the heterogeneity problem
Time
Place
Structure
Cultural diversity
Different vocabulary specialists
Challenges of Matching
Different kinds of heterogeneity
Syntactic heterogeneity
Lexical heterogeneity
Semantic heterogeneity
Pragmatic heterogeneity
Metadata heterogeneity
[Pavel, 2006]
Problem of CV
FACET
A facet is like a diamond that consists of different faces.
Its distinct features allow thesauri, classifications or taxonomies to be organized in different ways.
Facets are composed of collectively exhaustive aspects, properties or characteristics of a domain.
For example, a collection of rice might be classified using cultural and seasonal facets.
[Fausto et al., 2009] [Ahsan et al., 2009]
Faceted Controlled vocabulary
Faceted Controlled vocabulary
Seasonal rice type Cultural rice type
Creation of a Facet
Domain Analysis
analysis of terms is done by consulting domain experts
simple concepts are identified.
Term collection and organization
terms are ordered according to their characteristics and in a meaningful sequence
ex: cow and milk form a facet called Dairy system (part-of relationship)
[Fausto et al., 2009]
Existing Methodologies
PMEST: Personality (P), Matter (M), Energy (E), Space (S), and Time (T)
[Ranganathan]
DEPA: Discipline (D), Entity (E), Property (P), Action (A)
[Bhattachary and Fausto et al., 2009]
Properties of facets
Hospitality
Compactness
Flexibility
Reusability
The Methodology
Homogeneity
[Bhattachary and Fausto et al., 2009 ]
Concept Facet Matcher
Based on the DEPA model
CF = {mg, lg, R}, where mg is the set of more general concepts, lg is the set of less general concepts, and R is the set of related concepts.
Based on element-level matchers
[Ahsan, 2009 and Ahsan et al., 2009]
Concept Facet Matcher
Algorithm 1 buildCFacet(CV)
for i = 0 to CV.size do
    store cF[i] = (Mg, Lg, R)
end for
return cF
[Ahsan et al., 2009 ]
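Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the controlled vocabulary is assumed to arrive as (term, relation, term) triples, and mg, lg and R are assumed to correspond to the thesaurus relations BT, NT and RT. The class name `ConceptFacet` and the toy data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptFacet:
    concept: str
    mg: set = field(default_factory=set)  # more general concepts (assumed: BT)
    lg: set = field(default_factory=set)  # less general concepts (assumed: NT)
    r: set = field(default_factory=set)   # related concepts (assumed: RT)


def build_c_facet(cv):
    """Build one concept facet per source term from (term, relation, term) triples."""
    facets = {}
    for source, relation, target in cv:
        facet = facets.setdefault(source, ConceptFacet(source))
        if relation == "BT":
            facet.mg.add(target)
        elif relation == "NT":
            facet.lg.add(target)
        elif relation == "RT":
            facet.r.add(target)
        # any other relation (e.g. UF) is ignored in this sketch
    return facets


# Toy input, invented for illustration only
toy_cv = [
    ("frog culture", "BT", "aquaculture"),
    ("aquaculture", "NT", "frog culture"),
    ("milk", "RT", "cows"),
]
facets = build_c_facet(toy_cv)
```

Keying the result by concept label keeps lookups cheap when the facets of two vocabularies are later compared.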
Concept Facet Matcher
Algorithm 2 MatchingFacet(CV1, CV2)
cF1 = buildCFacet(CV1)
cF2 = buildCFacet(CV2)
for i = 0 to cF1.size do
    for j = 0 to cF2.size do
        cfmatcher = elementLevelMatcher(cF1[i], cF2[j])
    end for
end for
[Ahsan et al., 2009 ]
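The nested loop of Algorithm 2 can be sketched in Python as below. The slides do not define the element-level matcher, so the rule used here (identical normalised labels count as exact, a shared word counts as partial) is an assumption made for illustration, as are the function names. For brevity, only the concept labels are compared; the mg, lg and R components of each facet could be compared with the same pattern.

```python
from itertools import product


def normalise(label):
    """Lower-case the label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def element_level_matcher(label1, label2):
    """Return 'exact', 'partial' or 'none' for a pair of labels (assumed rule)."""
    a, b = normalise(label1), normalise(label2)
    if a == b:
        return "exact"
    if set(a.split()) & set(b.split()):
        return "partial"
    return "none"


def matching_facet(c_f1, c_f2):
    """Compare every concept of c_f1 with every concept of c_f2.

    c_f1 and c_f2 map concept labels to their facets; pairs with no match are skipped.
    """
    results = []
    for label1, label2 in product(c_f1, c_f2):
        outcome = element_level_matcher(label1, label2)
        if outcome != "none":
            results.append((label1, label2, outcome))
    return results


# Toy facets, invented for illustration only (facet contents left empty)
c_f1 = {"UHT milk": {}, "Rice": {}}
c_f2 = {"uht  milk": {}, "rice flour": {}, "cattle": {}}
results = matching_facet(c_f1, c_f2)
```

Because every pair is compared, the number of comparisons is the product of the two vocabulary sizes, which is worth keeping in mind for large thesauri.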
System Architecture
Data Model
AGROVOC database
Ref: http://aims.fao.org/website/Download/sub
Data Model
CABI database
Ref: http://cabi.org
A Running Prototype
Search string
Validators / domain specialists
An architecture for a semantic search
Running Prototype for search
user’s choice
Evaluation and Results
A domain Expert Exact Match Partial Match No Match
Datasets
Comparison
Characteristics   AGROVOC   CAB
Tree leaves       29172     47805
Term counts       18200     32884
Single words      6842      11720
Multi-words       11358     21161
Hierarchy depth   7         14
Multiple BT       2546      1207
Redundant BT      57        76
Datasets
AGROVOC version       2007-08-10    2007-08-10
CABI version          2009-11-01    2009-11-01
AGROVOC term-leaves   35036         35036
CABI term-leaves      29172         29172
Conversion            hierarchy     hierarchy
Knowledge base        WordNet 2.1   SWN 400.000
Datasets
Relationship
          BT       NT       RT       UF
AGROVOC   228466   228424   326389   54370
CABI      15154    15841    41239    7094
Input files
Agrovoc input file CAB input file
Results: facet-based approach
                Experiment 1   Experiment 2
Exact Match     5976           6021
Partial Match   164255         164278
No Match        69800745       69800745
Results: standard tool
                Experiment 1   Experiment 2
Exact Match     8795           8795
Partial Match   334255         334258
No Match        N/A            N/A
Results
           Min       Max       Min       Max
Overall    25.8065   31.4496   21.7391   21.7391
Positive   18.6047   14.0814   10.4895   14.6154
Negative   97.1831   52.1495   94.7368   99.1304
Advantages of the facet-based system
No knowledge base required
Based on hidden semantics: semantic meaning is retrieved during processing
Limitations
Structure problems
AGROVOC was provided in SQL format and CABI in text format.
The provided CABI file does not contain chemical and scientific concepts.
Term variants
In AGROVOC, we found "frog farms", which should have been "frog farming", because "frog farms" is used for "frog culture" and its BT is "aquaculture". Also, we found the abbreviated term "UHT milk" (one kind of milk product) which should have been "UHT milk".
There were some ambiguous terms that had different meanings, for example "cutting" (i.e., slicing of bread or meat) or "cuttings" (i.e., propagation material).
There were some terms whose meaning is too difficult to capture, for example “2.4.4-T”, “2.4.5-TP 2.4-D”, “2.4 DES”, “2.4 dinitrohenol”. Similarly, CABI contained the term “4-H Clubs”. These terms did not make sense during any mapping experiments.
Limitations
Domain expert
To evaluate our results, we were able to find one domain expert from FAO, but we did not get any domain expert from CABI. The results may have been different if we had had another domain expert.
Lack of consistency
Since the relationships in thesauri lack precise semantics, they are applied inconsistently, both creating ambiguity in the interpretation of the relationships and resulting in an overall internal structure that is irregular and unpredictable.
Limitations
Limited automated processing
Traditional thesauri are designed for indexing and query formulation by people and not for automated processing. The ambiguous semantics that characterizes many thesauri makes them unsuitable for automated processing.
Related Works
[Fausto et al., 2004] apply element-level matching techniques for semantic matching
[Stamou et al.] apply string matching techniques for ontology matching
[Karin Koogan Breitman et al., 2005] apply string matching techniques for lightweight ontology matching
[Paul Buitelaar et al., 2009] apply string matching for a linguistic matching system
[Maria Teresa Pazienza et al., 2007] apply string matching for a semi-automatic matching system
Conclusion and Future work
To build the extended knowledge base
Conclusion and Future work
Integrating the mapping into the AGROVOC Concept Server
Conclusion and Future work
We have described a facet-based matching system for large datasets.
We have shown a running prototype of this system.
The majority of this work was done under the supervision of the FAO and the CABI. At the moment, a prototype is running at the FAO.
We will integrate this mapping file into the AGROVOC Concept Server for search purposes.
Questions
References
[Fausto et al., 2003]: F. Giunchiglia and P. Shvaiko. Semantic matching. In Ontologies and Distributed Systems workshop, IJCAI, 2003.
[Fausto et al., 2004]: F. Giunchiglia, P. Shvaiko, and M. Yatskevich. S-Match: An algorithm and an implementation of semantic matching. In Proceedings of ESWS’04, 2004.
[Fausto et al., 2004]: F. Giunchiglia and M. Yatskevich. Element level semantic matching. In Meaning Coordination and Negotiation workshop, ISWC, 2004.
[Pavel et al., 2006]: P. Shvaiko, F. Giunchiglia and M. Yatskevich. Discovering missing background knowledge in ontology matching. In 17th European Conference on Artificial Intelligence (ECAI 2006), volume 141, pages 382-386, 2006.
References (cont)
[Fausto et al., 2007]: F. Giunchiglia and I. Zaihrayeu. Lightweight ontologies. Technical report at DIT, University of Trento, Italy, October 2007.
[Pavel et al., 2007]: P. Shvaiko and J. Euzenat. Ontology matching. Springer, 1st edition, 2007.
[S.R. Ranganathan]: S.R. Ranganathan. Elements of library classification. Asia Publishing House.
References (cont)
[Fausto et al., 2009]: F. Giunchiglia, B. Dutta, and V. Maltese. Faceted lightweight ontologies. In LNCS, 2009.
[Bhattachary 1979]: G. Bhattacharyya. POPSI: its fundamentals and procedure based on a general theory of subject indexing language. In Library Science with a Slant to Documentation, volume 16, pages.
[Pavel]: P. Shvaiko. Iterative schema-based semantic matching (PhD thesis), Technical report DIT-06-102, December 2006.
[Morshed 2009]: A. Morshed and M. Sini. Aligning Controlled vocabularies: Algorithm and Architecture, at Workshop on Advanced Technologies for Digital Libraries 2009, AT4DL, Trento, Italy.
[Morshed 2009]: A. Morshed, M. Sini and J. Keizer. Aligning Controlled Vocabularies using a facet based approach. (Technical Paper at FAO).
Thank You