

Noname manuscript No. (will be inserted by the editor)

Query-driven approach of contextual ontology module learning using web snippets

Nesrine Ben Mustapha · Marie-Aude Aufaure · Hajer Baazaoui Zghal · Henda Ben Ghzala

Received: date / Accepted: date

Abstract The main objective of this work is to automatically learn ontology modules that cover users' search terms for query reformulation in ontology-based question answering on the Web. Several emerging approaches of ontology module extraction aim at identifying ontology fragment candidates that are relevant for a given application. The main problem is that current approaches of ontology modularization consider only predefined input ontologies, rather than the underlying semantics found in texts. This work proposes an approach of contextual ontology module learning that covers particular search terms by analyzing past user queries and snippets provided by search engines. The obtained contextual modules are then used for query reformulation. The proposal has been evaluated with respect to the semantic cotopy measure of the discovered ontology modules and the relevance of search results.

Keywords Ontology, modular ontology, knowledge, ontology learning

Nesrine Ben Mustapha
Ecole Centrale Paris, MAS Laboratory
Grande voie des vignes, 92290 Chatenay Malabry, France
E-mail: [email protected]

Marie-Aude Aufaure
Ecole Centrale Paris, MAS Laboratory
Grande voie des vignes, 92290 Chatenay Malabry, France
E-mail: [email protected]

Hajer Baazaoui Zghal
National School of Computer Sciences, RIADI Laboratory
Campus of La Manouba, Tunisia
E-mail: [email protected]

Henda Ben Ghzala
National School of Computer Sciences, RIADI Laboratory
Campus of La Manouba, Tunisia
E-mail: [email protected]


2 Ben-Mustapha N. et al.

1 Introduction

With the increasing availability of ontologies on the Web, the modularity principle has become an important means to overcome scalability problems in ontology-based systems. Ontology module extraction (OME) approaches consist in reducing an ontology to an ontology fragment covering a particular vocabulary. The extracted ontology modules are used for knowledge selection or reuse. Generally, the input of these approaches [2] consists of large ontologies and the output is a set of independent modules.

Current modularization approaches basically rely on static existing ontologies, which can be built manually by an expert, inferred from a general ontology (Mesh, Sensus, DOLCE, etc.) or extracted automatically from external sources. In the specific applications of ontology-based information retrieval and question answering, these ontologies are used to reformulate the query and to improve the relevance of the output results. Therefore, they should cover the vocabulary of the search terms and of the searched text documents. Identifying a relevant fragment of these input ontologies can then be insufficient to cover users' needs. Indeed, users' interests as well as domain knowledge evolve with new discoveries and usages. As a consequence of this continuing evolution, automatic methods for ontology construction are required. To the best of our knowledge, machine learning strategies have not yet been explored for ontology modularization, as mentioned in [2].

The main objective of this work is to propose an automatic solution for building contextual ontology modules suited to information retrieval applications that cannot rely on predefined ontologies. Based on the assumption that contextual ontology modules are the basic commitment between users' queries and searched documents, we propose a novel approach of contextual ontology module extraction from web snippets and user feedback (past user queries and selected documents). Unlike many previous modularization approaches, the proposed method is fully automatic and based on unsupervised pattern-based ontology learning techniques. Web-based information distributions are employed to assess the reliability of the extracted knowledge.

This paper is organized as follows. In Section 2, an overview of related works on ontology module extraction is presented. Section 3 describes the proposed approach of contextual OM extraction. In Section 4, we describe an evaluation based on two criteria: the comparison of the discovered ontology modules with the upper-level ontology MESH, and the impact of OM extraction on the relevance and ranking quality of search results. Finally, we conclude and discuss directions for future research.

2 Related Works

2.1 Ontology Modularization

Ontology module extraction consists in reducing an ontology to a single ontology fragment that covers a particular vocabulary. In [5], the proposed approach, called "ontology segmentation", takes one or several classes of the ontology as an input. It applies a generic algorithm to include all related classes that participate in the definition of the input classes, on the basis of class subsumption and OWL restrictions.

Noy and Musen [3] define a novel traversal view extraction technique for module extraction. Starting from one class of the considered ontology, the relations of this class are recursively traversed to include related entities, as in [5]. However, this technique is not automatic: the user is involved in selecting the relations to be traversed and in associating to each of them a level of recursion, at which the algorithm should stop "traversing" relations. Besides, the approach proposed in [4] is composed of: (1) the selection of the relevant ontologies, (2) the modularization of the selected ontologies, and (3) the merging of the relevant ontology modules when the query terms are covered by several different ontologies. The input of the ontology modularization approach is made up of an ontology and a set of terms that should be covered by the smallest part of the ontology. Unlike the algorithm proposed in [5], not all the super-concepts of a selected concept are necessarily included (only the ones that are directly related to concepts of the module, i.e. the most specific common concepts).
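As an illustration, such a traversal-view strategy can be sketched as a recursive walk over a class graph, with a per-relation recursion level chosen by the user. The toy graph, function names and depth limits below are our own illustration, not the implementation of [3]:

```python
# Illustrative sketch of traversal-view module extraction: starting from a
# seed class, follow selected relations up to a per-relation recursion depth.
# The toy ontology graph and the depth limits are invented for illustration.

def extract_module(graph, seed, depth_limits):
    """graph: {class: {relation: [related classes]}},
    depth_limits: {relation: maximum recursion level for that relation}."""
    module = {seed}

    def traverse(cls, remaining):
        for relation, targets in graph.get(cls, {}).items():
            depth = remaining.get(relation, 0)
            if depth <= 0:
                continue
            for target in targets:
                if target not in module:
                    module.add(target)
                    # Decrease only this relation's budget: the user-chosen
                    # recursion level applies per traversed relation.
                    traverse(target, {**remaining, relation: depth - 1})

    traverse(seed, dict(depth_limits))
    return module

toy_graph = {
    "Disease": {"subClassOf": ["Condition"], "hasSymptom": ["Symptom"]},
    "Condition": {"subClassOf": ["Entity"]},
    "Symptom": {"subClassOf": ["Observation"]},
}
module = extract_module(toy_graph, "Disease",
                        {"subClassOf": 1, "hasSymptom": 1})
```

With a recursion level of 1 per relation, the module gathers the classes one step away along each relation chain; raising a limit widens the fragment along that relation only.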

Considering that the main objective of these works is to apply ontology modularization to enhance information retrieval, two main limits of the approaches mentioned above can be noticed. First, the existing approaches of ontology modularization rely on static ontologies that can be insufficient to cover users' needs for information search on the Web. These ontologies are usually built manually by domain experts or inferred and learned from external sources.

Second, the proposed modularization algorithms mainly consider the structure of the input ontology, rather than semantics or context. Consequently, semantics-based criteria are needed to determine the border of ontology modules. Moreover, the contextuality of the module considerably depends on the semantic coverage of the original input ontologies. However, ontologies on the web are not sufficiently consistent and contextualized to cover specific domain knowledge.

Therefore, our proposed strategy should differ from the previously discussed strategies. The obtained modules have to be based on observing the relevant interactions for knowledge selection, not on a human or human-driven dedicated specification, nor on the structural properties of the ontology (as in traversal view strategies). Ontology learning (OL) techniques could be a way to overcome these limits. In fact, to the best of our knowledge, unsupervised machine learning strategies have not yet been explored for ontology modularization, as mentioned in [2].

2.2 Unsupervised Ontology Learning Techniques

Ontology learning (OL) aims at building ontologies from knowledge sources using a set of machine learning techniques and knowledge acquisition methods. OL from texts has been widely used in the knowledge engineering community. By applying a set of text mining techniques, a granular ontology is enriched with concepts and relationships. In the last decade, with the enormous growth of Web information, the Web has become an important source of information for knowledge acquisition. Its main advantages are its huge size and its large degree of heterogeneity. OL from Web documents requires the same techniques as those used for ontology extraction from texts. A study of several types of available Web search engines and how they can be used to assist the learning process (searching Web resources and computing IR measures) was presented in [6].

In this paper, we focus mainly on two categories of unsupervised techniques that do not require any background knowledge: pattern-based techniques and unsupervised web-based distributional measures.

2.2.1 Pattern-based Techniques

Pattern-based ontology learning techniques are considered as linguistic OL techniques. Linguistic analysis of texts requires the use of a grammar representing the sentence structure. We distinguish two types of grammars that allow representing the structure of a sentence in a given natural language:

– Constituency Grammar: this grammar is the basis of the formal theory of language used in computational linguistics. The analysis using this type of grammar is based on the position of words in the sentence and how they can be grouped.

– Dependency Grammar: the analysis using this grammar provides binary grammatical links between words in a sentence. When two words are connected by a dependency relationship, we say that one is the ruler (or head) and the other is a dependent. In general, the extracted relations are schematically represented by an arc between the ruler and the dependent.

Linguistic patterns have been used in many fields, namely in unsupervised information extraction, knowledge acquisition and ontology learning approaches. In linguistic approaches, lexico-syntactic patterns are manually defined by linguists, as proposed for taxonomic relation discovery in [8]. Some research works have been proposed to automatically extract lexico-syntactic patterns. These methods start from an existing ontology and extract a set of pairs of concepts linked by relationships, in order to learn hyponymy (taxonomic) relationships and produce lexico-syntactic patterns. The learned patterns are then used to discover other relationships between the concepts of the existing ontology. These techniques, based on lexico-syntactic pattern learning, lead to good results for learning hyponymy relationships.
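For illustration, the classical Hearst pattern "X such as Y" [8] can be approximated with a simple regular expression over raw text. This is a minimal sketch; real systems typically match over part-of-speech-tagged text rather than plain strings:

```python
import re

# Minimal sketch of lexico-syntactic (Hearst) pattern matching for hyponymy:
# "X such as Y" suggests that Y is a hyponym (subtype/instance) of X.
PATTERN = re.compile(r"(\w[\w ]*?)\s*,?\s+such as\s+(\w[\w, ]*)")

def extract_hyponyms(sentence):
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1).strip()
        # Split coordinated hyponyms on commas and "and"/"or".
        hyponyms = re.split(r",\s*|\s+and\s+|\s+or\s+", match.group(2))
        pairs.extend((h.strip(), hypernym) for h in hyponyms if h.strip())
    return pairs

pairs = extract_hyponyms("diseases such as diabetes, obesity and hypertension")
# Each pair reads (hyponym, hypernym).
```

On the example sentence, the matcher yields (diabetes, diseases), (obesity, diseases) and (hypertension, diseases); the candidates would then be filtered statistically, as discussed below.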

Lexico-syntactic techniques were also adapted to the web. Gijs Geleijnse and Jan Korst [11] propose the identification of concept instances using the search engine Google. Queries are constructed based on lexico-syntactic patterns defined by Hearst [8]. A term is accepted when the number of hits (number of results returned by Google) exceeds a given threshold. The same idea was also explored in [13] in order to extract taxonomic relations and attributes of concepts. Finally, interesting work on learning non-taxonomic relations from the web is detailed in [13] and led to the development of the Pankow system. Pankow also relies on the idea that the lexico-syntactic patterns described above can be applied not only to a text corpus, but also to the World Wide Web, as in [14].

In the meantime, some problems appear, such as term polysemy or corpus-dependent errors. Therefore, a statistical assessment is required to refine the set of discovered elements.


2.3 Distributional Measures of Concept Relatedness

Semantic similarity is defined by the set of lexical relations including synonymy, antonymy and hyponymy. Semantic relatedness is more general than the semantic similarity paradigm: it is defined by any type of association relation that can exist between two words. In general, the assessment of concept relatedness is based on the estimation of semantic evidence observed in a knowledge source.

In the literature, there exist several similarity estimation paradigms according to the techniques and the knowledge employed to perform the assessment. On the one hand, there are supervised approaches that exploit knowledge structures such as taxonomies and ontologies (subsumption hierarchies, domain ontologies, and lexical networks such as Wordnet [27]) to compute similarities. However, the main problem of these path-based approaches is that they heavily depend on the completeness, homogeneity and coverage of the semantic links represented in the ontology. Besides, although much research has made enormous progress in adopting Wordnet to compute lexical semantic similarity, the following disadvantages still exist:

– the establishment of a semantic network requires experts and takes a lot of manpower and time;

– the update of these resources with new vocabulary is difficult.

Considering the massive scale of the web environment, authors have exploited the hit count (page count) returned by a Web search engine to estimate term collocation probabilities, by querying each term or both terms with respect to the total amount of web resources. For this reason, many corpus-based similarity measures [28][29] can be applied to such counts.

From an unsupervised point of view, the statistical assessment of the semantic link between concepts, as proposed in [15][16], typically uses a measure derived from the following co-occurrence function between two terms:

C_k(concept, candidate) = prob(concept AND candidate)^k / (prob(concept) × prob(candidate))    (1)

The Symmetric Conditional Probability (SCP-IR) [15] can be defined as C_2, and the Pointwise Mutual Information measure (PMI-IR) [16] as log2(C_1).

Considering the web as a massive source of knowledge, several statistical approaches have exploited the number of pages returned by a web search engine to estimate the probabilities of co-occurrence of terms. We use the following notations:

– hit(a) denotes the number of web pages returned by a search engine for the query a;

– totalWebs denotes the total number of pages indexed by the search engine.

The probability prob(a AND b) is computed using the hit count provided by search engines, as stated by the following formula:

prob(a AND b) = hit(a AND b) / totalWebs    (2)


The score derived from this function was defined by Turney as follows:

score(concept, candidate) = prob(concept AND candidate)^k / (prob(concept) × prob(candidate))    (3)
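Equations (1)-(3) can be sketched directly from hit counts. The total page count and the hit numbers below are invented for illustration; a real system would obtain them from a search-engine API:

```python
import math

TOTAL_WEBS = 1e10  # assumed total number of indexed pages (illustrative)

def prob(hits):
    """Equation (2): probability estimated from a hit count."""
    return hits / TOTAL_WEBS

def cooccurrence_score(hits_a, hits_b, hits_ab, k):
    """C_k from Equation (1): prob(a AND b)^k / (prob(a) * prob(b))."""
    return prob(hits_ab) ** k / (prob(hits_a) * prob(hits_b))

def scp_ir(hits_a, hits_b, hits_ab):
    """Symmetric Conditional Probability: C_2."""
    return cooccurrence_score(hits_a, hits_b, hits_ab, k=2)

def pmi_ir(hits_a, hits_b, hits_ab):
    """Pointwise Mutual Information: log2 of C_1."""
    return math.log2(cooccurrence_score(hits_a, hits_b, hits_ab, k=1))

# Invented hit counts for an illustrative pair ("BMI", "weight"):
score = pmi_ir(hits_a=5_000_000, hits_b=80_000_000, hits_ab=2_000_000)
```

A strongly positive PMI-IR value indicates that the two terms co-occur far more often than chance would predict, which is the signal used to accept or reject a candidate term.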

The measures proposed by Turney were applied and evaluated in [17]. However, since the semantic content and the context of words are not taken into account by these measures, limited performance was observed in [18]. Those measures have been applied to the task of evaluating concept relatedness in [16] and to taxonomic relation learning. In [19], the authors propose to construct a taxonomy by submitting the initial keywords to Google in order to retrieve web pages containing these terms. A study of several types of search engines available on the web has been carried out in order to use them in the learning process (searching web resources and calculating a score based on the number of hits).

However, due to their lack of semantic content and word context, limited performance is observed in [10]. Since absolute co-occurrence cannot reflect semantic relatedness, contextualizing these measures is important to obtain a more accurate relatedness value between two given concepts.

3 Synthesis and Main Objectives

Based on the survey of ontology learning approaches discussed in [1], the ontology modularity paradigm has not been considered at the beginning of the ontology learning process. In addition, as mentioned in the previous sections, most of the current approaches of ontology modularization consist essentially in extracting modules from existing ontologies by applying predefined logical rules. They also rely on the structure of the ontology to perform the segmentation into small ontology modules. As emphasized in [2], machine learning methods have not yet been explored for automatic ontology module extraction from unstructured data.

In addition, the study of the different ontology learning techniques applied to text and the web allowed us to draw the following findings:

– Despite the use of lexico-syntactic patterns combined with web-based similarity measures [6], erroneous results are noticed, especially in the case of term polysemy and ambiguity. For this reason, it is required to take into account the context of terms in the definition of similarity measures.

– Applying lexico-syntactic patterns (using constituency grammar) on a corpus without analyzing the syntactic dependencies of whole sentences may lead to erroneously discovered relationships (especially taxonomic ones). Indeed, the recognition of a pattern in a part of a sentence does not necessarily reflect the relevance of the relationship associated with this pattern. Thus, the use of deep syntactic dependency analysis could improve the results of these techniques. However, dependency analysis has not been applied to the web environment.

– The use of dependency grammar in text linguistic analysis has more advantages than constituency grammar. First, dependency analysis approaches human analysis, since words are connected by grammatical relations without any prior hypothesis. Second, dependency links are intuitively semantic relations useful for a subsequent semantic analysis task. Third, it does not depend on the domain of knowledge.

– The results presented in [20] show that the Stanford parser [21] is able to handle most sentence types, since it is a probabilistic parser (which uses language models derived from statistical methods, dependency grammar and machine learning).

– Statistical techniques (measures applied in the information retrieval area) have limitations due to the lack of contextual information when defining selection measures.

– The technique of topic signatures has improved the relevance of the results obtained by applying lexico-syntactic patterns (for taxonomic relationship discovery).

– Generating dependency graphs from sentences is an unsupervised, domain-independent technique allowing the recognition of concepts and relationships.

The main challenge of the present work is to adapt unsupervised ontology learning techniques to on-line ontology module construction and its integration in the search process. This work is part of a generic approach that aims to develop a modular semantic layer from the associations between queries and document results, in order to improve the contextualization of the user's search goal and, consequently, the relevance of the answers of semantic search [7].

Based on these findings, the main assumptions adopted in the proposal are defined in the following section.

4 Basic Assumptions and Originality

The assumptions adopted in our proposal are described by the following statements:

– A module is defined as an ontological fragment that describes a reduced concept of a "target subject", called the pivotal concept. Thus, the terms referring to the pivotal concept, its synonyms, its attributes, and the terms strongly linked to its definition are grouped into a single ontological module;

– The objective of this approach consists in building light-weight ontology modules that do not include axioms or rules. Indeed, the modules are automatically extracted and enriched in an incremental manner;

– The proposed approach should be domain-independent and is applied to a given target language (English in this work);

– Ontology module construction is invoked by a given user query (a set of keywords or a question).

The originality of the proposal resides in the adaptation of the following techniques using web search engine capabilities:

– Input sources are composed of disambiguated queries (using user feedback) and web snippets provided by available search engines;

– Combining dependency analysis and lexico-syntactic patterns in order to build a weighted concept map from web snippets;

– Proposition of selection scores based on the web contextual distribution of terms and on the applied patterns.


5 Contextual Ontology Module Learning Approach

In this section, we describe an approach for building Ontology Modules (OM) in an automatic and domain-independent way, using past user queries and the resulting snippets (returned by web search engines).

Our underlying hypothesis is that an OM is an ontology fragment that represents a question on specific domain knowledge. This OM can be used to annotate documents related to the corresponding knowledge component.

The input of the proposed process is composed of couples of questions and answers (URLs) related to a specific theme. The main steps of the proposal are the following (Fig. 1):

[Fig. 1 flowchart: past queries (query + URLs) → query analysis and contextual information extraction → topic signature (Term_1, Term_2, ..., Term_n) → extraction of web snippets (definitions) → generation of grammatical dependency lists (syntactic parser: Stanford Parser) → weighted concept map construction → concept map enrichment with hyponyms and hyperonyms (patterns based on dependency links) → ontology module formalisation with typed attributed graph → ontology module creation related to a hyponym]

Fig. 1 Ontology module learning approach

– Query analysis and topic signature construction;
– Web snippets extraction;
– Generation of the dependency list;
– Weighted concept map construction;
– Concept map enrichment with taxonomic and non-taxonomic relationships;
– Ontology module formalisation using a Typed Attributed Graph (TAG).

First, each question is analyzed by identifying the answer patterns to be used in the next step. Second, these patterns are employed to reformulate queries in order to collect relevant snippets provided by a web search engine. Next, a concept network called a "concept map" is extracted from the obtained textual snippets by applying ontology learning techniques (pattern-based rules and Web Co-occurrence Scores (WCS)).


A top-level ontology (such as Mesh or Sensus) describing very general concepts that are the same across all knowledge domains can be used to identify question concepts and to import related concepts and relations. The obtained concept map acts as a skeleton on which the OM is built. It is represented using a typed attributed graph.

5.1 Definitions

In this section, we present the definitions of the main conceptual units explored in our proposal, including the following items:

– Lexico-Syntactic Pattern (LSP);
– Grammatical Concept Map (GCM);
– Weighted Semantic Concept Map (WSCM);
– web snippet;
– Typed Attributed Graph (TAG);
– Ontology Module (OM).

Definition 1 A Lexico-Syntactic Pattern is defined as a subgraph LSP(r, inl, outl) made up of a root node r, input dependency links inl and output dependency links outl.

The main role of these patterns consists in defining dependency-analysis rules to extract a grammatical concept map from a set of snippets.

Definition 2 A Grammatical Concept Map (GCM) is defined as a graph GCM(t, ed) where each node t is a term extracted from a dependency link and each edge ed is a grammatical dependency link provided by the parser.

The sentence analysis based on a dependency grammar generates a list of dependencies from which the grammatical concept map is extracted. By identifying a subgraph of the grammatical map that corresponds to a lexico-syntactic pattern, the action associated with the corresponding rule transforms this map into a semantic concept map (defined below).

Definition 3 A Weighted Semantic Concept Map (WSCM) is built on the basis of the grammatical concept map of a pivotal concept. It is made up of complex nodes and edges. Nodes in this map are usually expressed in terms of domain concepts (each considered as a graph of equivalent terms) and the arcs are usually labeled by verb-based phrases. The nodes and the arcs are weighted by the contextual selection scores (as defined in Section 5.5.2).
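A minimal data structure for such a weighted map could look as follows. The field and class names are our own illustration of Definition 3, not the authors' formalism:

```python
from dataclasses import dataclass, field

@dataclass
class ConceptNode:
    """A domain concept: a label plus equivalent terms, with a selection score."""
    label: str
    equivalent_terms: set = field(default_factory=set)
    score: float = 0.0  # contextual selection score (cf. Section 5.5.2)

@dataclass
class RelationEdge:
    """A verb-labeled arc between two concepts, with a selection score."""
    source: str
    target: str
    verb_phrase: str
    score: float = 0.0

@dataclass
class WeightedSemanticConceptMap:
    pivot: str
    nodes: dict = field(default_factory=dict)   # label -> ConceptNode
    edges: list = field(default_factory=list)   # RelationEdge instances

    def add_concept(self, label, terms=(), score=0.0):
        self.nodes[label] = ConceptNode(label, set(terms), score)

    def relate(self, source, target, verb_phrase, score=0.0):
        self.edges.append(RelationEdge(source, target, verb_phrase, score))

wscm = WeightedSemanticConceptMap(pivot="BMI")
wscm.add_concept("BMI", {"body mass index"}, score=0.9)
wscm.add_concept("weight", score=0.7)
wscm.relate("BMI", "weight", "is calculated from", score=0.8)
```

Keeping equivalent terms inside each node mirrors the definition of a concept as a graph of equivalent terms, while the scores let low-confidence nodes and arcs be pruned later.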

Definition 4 The term snippet is used here to denote a fragment of a Web page returned by remote search engines (such as Google or Yahoo) and summarizing the context of the searched keywords.

In the following subsections, the main steps of this approach are detailed with an example of an ambiguous query containing the acronym BMI, which refers to Body Mass Index.


5.2 Question Analysis and Topic Signature Construction

This step aims to identify the pivotal terms (concepts) that characterize the ontological module from the user query, and to determine the context of these pivotal terms in order to better understand their meaning (Figure 2).

[Fig. 2: from the question "What is BMI?", topic signature construction yields the set {measure, body, weight, height, fat, body fat, ...}; context selection over the pages selected by the user retains Ctx2 = {height, weight}, ranked by PMI-IR, and the query is reformulated as "BMI + height + weight"]

Fig. 2 Question analysis and extraction of contextual information

Inspired by the assumptions adopted by corpus-based disambiguation approaches, we consider that the context of a concept may be defined by "a set of terms that frequently occur with the query terms in most of the sentences extracted from the results selected by the user (user feedback)". The context of a term is also called its topic signature or contextual signature [22].

According to corpus analysis approaches, the context of a term is generally defined by the set of words co-occurring with it in windows of a size generally set to two words. Our main contribution here lies in using the user feedback (selected results) to determine the context of the query. Indeed, the terms that co-occur with the query terms are extracted from the results selected by users and are sorted by the PMI-IR measure [16].

The semantic signature defines the context of the ontology module. It is used in the definition of the contextual scores for selecting concepts and relations, and for the query reformulation aimed at retrieving relevant snippets from the web.
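The construction of the topic signature can be sketched as follows: terms co-occurring with the query term in user-selected results are ranked by a PMI-style score. For self-containment, the probabilities below are estimated over the small feedback corpus itself, whereas the paper estimates PMI-IR from web hit counts; the sentences are invented for illustration:

```python
import math
from collections import Counter

def topic_signature(query_term, selected_sentences, top_k=5):
    """Rank terms co-occurring with query_term in user-selected sentences
    by a PMI-style score (sentence-level probabilities)."""
    tokenized = [s.lower().split() for s in selected_sentences]
    n = len(tokenized)
    term_counts = Counter()   # in how many sentences each term occurs
    cooc_counts = Counter()   # in how many sentences a term co-occurs with the query
    q = query_term.lower()
    for tokens in tokenized:
        unique = set(tokens)
        term_counts.update(unique)
        if q in unique:
            cooc_counts.update(unique - {q})
    scores = {}
    for term, cooc in cooc_counts.items():
        # PMI = log2( p(term, q) / (p(term) * p(q)) )
        scores[term] = math.log2(
            (cooc / n) / ((term_counts[term] / n) * (term_counts[q] / n)))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

signature = topic_signature("BMI", [
    "BMI is a measure of body weight relative to height",
    "BMI uses weight and height",
    "the weather today is sunny",
])
```

Terms such as "weight" and "height", which co-occur with "BMI" in every selected sentence, receive higher scores than corpus-wide terms like "is", while terms never co-occurring with the query (e.g. "sunny") are excluded altogether.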

For instance, in the case of the query that contains the pivotal term "BMI", we suppose that the results selected by the user are related to the sense of "body mass index". The obtained semantic signature contains the set of terms sorted by the value of the relatedness measure PMI-IR (as shown in Figure 2). We note that the terms "measure of body mass", "weight" and "height" have the highest PMI-IR values. Thus, the latter two terms can be used to eliminate the ambiguity of the query term and to better define its context.

5.3 Web Snippets Extraction

As emphasised in the previous sections, an ontology module is based on a set of concepts involved in the definition of the corresponding meaning of a pivotal concept. These concepts aim to define:

– its attributes;
– its composition;
– its roles.

To do this, we use a set of patterns aiming to identify these concepts, such as the following:

– <pivotal-term> is <noun-phrase>;
– <pivotal-term> refers to <noun-phrase>.

The instantiation of these patterns is used to reformulate queries in order to retrieve snippets from the Web.

This step is performed by a remote search engine after each analysis of the input queries or each use of the extraction patterns. For each reformulation, we automatically obtain a set of 30 to 50 snippets. The call to the web search engine is generally configured by several attributes, including:

– the number of results;
– the preferred language of the returned results;
– the requested type of results (URL, snippet, title, etc.).

The step of query reformulation relies on the following items:

– the "pivotal" term of the query (PT);
– the context of the pivotal term (derived from its topic signature);
– the patterns for identification of the definitions.

Then, only the well-formed sentences of the returned snippets are selected to be processed in the next step by the Stanford Parser to build the grammatical concept map.
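The paper does not detail how well-formedness is checked; a minimal heuristic sketch (ours, not the authors') could keep only sentences that start with a capitalized token, end with a final period, and contain no truncation ellipsis left by the search engine:

```python
import re

# Heuristic sketch (an assumption, not the paper's method): filter snippet
# sentences that look well formed before sending them to the parser.
def well_formed_sentences(snippet):
    sentences = re.split(r"(?<=[.!?])\s+", snippet.strip())
    keep = []
    for s in sentences:
        if "..." in s or "\u2026" in s:
            continue  # truncated by the search engine
        if s and s[0].isupper() and s.endswith("."):
            keep.append(s)
    return keep

snippet = ("Body Mass Index (BMI) is a number calculated from a person's "
           "weight and height. more at ...")
print(well_formed_sentences(snippet))
```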

5.4 Generation of Grammatical Dependency List

Sentences collected from the previous step are considered as key data that contain the concepts related to the pivotal concept in the context of a sentence or a paragraph. The use of a syntactic parser (e.g., the Stanford Parser) is crucial in this step: it generates a syntactic dependency list. The structure of an extracted dependency is as follows:


dependencyName(token, token)   (4)

In this step, we adopt the principle of compositionality [23]. This principle is based on the idea that the order in which the words of a sentence are organized, in addition to the relations between these words, defines the semantics of the sentence.

Illustration: We illustrate here the processing of a returned snippet related to the same example query ("BMI"):

Body Mass Index (BMI) is a number calculated from a person's weight and height.

The list of dependency links resulting from the parsing of this sentence is given in Table 1:

nn(Index-3, Body-1)
nn(Index-3, Mass-2)
nsubj(number-9, Index-3)
abbrev(Index-3, BMI-5)
cop(number-9, is-7)
det(number-9, a-8)
partmod(number-9, calculated-10)
det(person-13, a-12)
poss(weight-15, person-13)
prep_from(calculated-10, weight-15)
prep_from(calculated-10, height-17)
conj_and(weight-15, height-17)

Table 1 Grammatical dependency list generated for one snippet
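The triples in Table 1 follow the textual form dependencyName(token-position, token-position); a small parser for this form (a sketch of ours, not part of the authors' system) can turn each line into a structured tuple:

```python
import re

# Sketch: parse dependency strings such as "nn(Index-3, Body-1)" into
# (name, (word, position), (word, position)) triples.
DEP_RE = re.compile(r"(\w+)\((.+?)-(\d+),\s*(.+?)-(\d+)\)")

def parse_dependency(line):
    m = DEP_RE.match(line.strip())
    if not m:
        return None
    name, w1, p1, w2, p2 = m.groups()
    return name, (w1, int(p1)), (w2, int(p2))

print(parse_dependency("nn(Index-3, Body-1)"))
# ('nn', ('Index', 3), ('Body', 1))
```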

A set of lexico-syntactic patterns based on dependency relationships has been defined to identify the concepts and relationships. These patterns are generic and domain independent and are detailed in the following sub-section.

5.5 Weighted Concept Map Construction

The construction phase of the conceptual map uses as input the list of dependencies generated for each snippet (Fig 3). We defined dependency-based patterns in order to identify the terms related to the pivotal concepts, the candidate concepts and the relationships that will make up the ontology module.

We distinguish mainly two types of operators that use these patterns:

– the operators responsible for concept and relation extraction;
– the operators responsible for concept and relation selection.

Operators are designed to apply the functions associated with each identified pattern. We have defined a set of rules based on dependency patterns in order to derive new concepts or new relationships (Fig 3). These patterns are classified according to the grammatical type of the dependency link and are detailed in the next subsections.


[Figure 3: for each snippet, the list of dependencies feeds an operator of concept identification and an operator of concept selection, together with an operator for relation learning and an operator for relation selection; patterns for concept discovery and for relationship discovery, combined with the web search engine, produce the semantic concept map.]

Fig. 3 Weighted concept map construction

Once the list of candidate concepts and relationships is obtained, a preliminary semantic concept map is built around the pivotal concept. Then, operators of candidate selection use the proposed contextual scores (defined in Section 5.5.2) to prune the semantic concept map.

5.5.1 Operators of concept and relation learning.

Pattern based on noun phrases for concept identification. The rules defined for concept node construction are based on a subgraph of dependency relationships and are explained in Fig 4.

Indeed, we distinguish several types of patterns, namely:

– dependency links that cause the aggregation of simple terms in order to obtain a compound one, as in the case of compound noun links (nn), adjective links (amod), appositional modifiers (appos) and abbreviation modifiers (abbrev). The corresponding rules (R1, R2, R3, R4 and R5) are formalised in Fig 4 and Fig 5;

– grammatical links pointing to a subject role (such as the link "subj") and their sub-links, which identify a concept (Fig 5);

– all links related to additional objects (dobj and iobj) and their sub-links (Fig 6).

In the following sub-section, we describe patterns based on verb phrases.

Patterns based on verb phrases for relation identification. Verb-based phrases can be identified in a simple way from the pattern "subject verb object", but in other cases of long sentences, the identification of verbal links may appear


Pattern | Actions

R1 Compound noun (nn)
Dep: nn(Term1-pos1, Term2-pos2)
Actions: If pos1 − pos2 == 1 then
  Term t = Creation_term("Term2 Term1");
  Concept C1 = Creation_NodeConcept(t); C1.name = t;
  Delete(Dep);
  Substitute_list(Term1-pos1, t); Substitute_list(Term2-pos2, t);
else Apply(R2);
Example: Dep: nn(diseases-2, animal-1) → concept "animal diseases"

R2 Compound noun (nn)
Dep1: nn(Term1-pos1, Term2-pos2)
Dep2: nn(Term1-pos1, Term3-pos3)
Actions: If the positions pos2 and pos3 are adjacent then
  Term t = Creation_term("Term3 Term2 Term1") (terms ordered by position);
  Concept C1 = Creation_NodeConcept(t);
  Delete(Dep1); Delete(Dep2);
  Substitute_list(Term1-pos1, t); Substitute_list(Term2-pos2, t); Substitute_list(Term3-pos3, t);
Example: Dep1: nn(index-3, mass-2), Dep2: nn(index-3, body-1) → concept "body mass index"

Fig. 4 Rules R1 and R2 (concept construction)
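Rule R1 can be sketched in a few lines (function and tuple shapes are ours): given an nn dependency whose tokens occupy adjacent positions, it emits the compound concept label, and otherwise defers to R2:

```python
# Sketch of rule R1 from Fig. 4 (names ours): two nouns linked by an "nn"
# dependency at adjacent positions are merged into one compound concept
# "Term2 Term1"; non-adjacent positions fall through to R2.
def apply_r1(dep):
    """dep = ("nn", (term1, pos1), (term2, pos2)); returns the compound label or None."""
    name, (term1, pos1), (term2, pos2) = dep
    if name == "nn" and pos1 - pos2 == 1:
        return f"{term2} {term1}"   # e.g. concept node "animal diseases"
    return None                     # adjacency fails: handled by R2 instead

print(apply_r1(("nn", ("diseases", 2), ("animal", 1))))  # animal diseases
```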

complex. To do this, we defined a set of patterns that identify relationships by aggregating verbs, their auxiliary verbs, their negation and the prepositions related to these verbs.

For each candidate concept, the operator performs pattern matching on the dependency list in order to determine whether the incoming and outgoing links of this concept are similar to the pattern representation. If the pattern is recognized, its associated function is triggered.

Figure 7 details the aggregation of dependency links in order to identify a label for a given relation. For example, the list of dependency links generated by the sentence "the concepts can be extracted from abstracts" includes the following dependency triples:

– aux(extracted, can)
– auxpass(extracted, be)
– prep_from(extracted, abstracts)

According to rule R9, a relation labeled with the expression "can be extracted from" is identified, linking the previously discovered concepts "concepts" and "abstracts".

Pattern associated with complex sentences. Patterns related to complex sentence structures involve the presence of dependency links related to:

– relative pronouns (rcmod): "that", "which", "whom", "whose", etc. (Fig. 8);
– coordination conjunctions (Fig. 8);


Pattern | Actions

R3 Adjective links (amod)
Dep1: amod(Term1-pos1, Adj2-pos2)
Actions:
  Term t1 = Creation_term("Term1"); t1.type = adj;
  Term t2 = Creation_term("Adj2 Term1");
  Concept C1 = Creation_NodeConcept(t1);
  Concept C2 = Creation_NodeConcept(t2);
  Relation R = Creation_Relation(type=taxonomy, C1, C2);
  Delete(Dep); Substitute_list(Term1-pos1, t2);
Example: Dep: amod(measure-6, inaccurate-5) → "inaccurate measure" linked to "measure" (name = amod, type = hyponymy)

R4 Appositional modifiers (appos)
Dep1: appos(Term1-pos1, Term2-pos2)
Actions:
  Term t1 = Creation_term("Term1");
  Term t2 = Creation_term("Term2");
  Concept C1 = Creation_NodeConcept(t1);
  Concept C2 = Creation_NodeConcept(t2);
  Search_equivalence_Relation(t1, t2);
  Relation R = Creation_Relation(type=equivalence, C1, C2);
  Delete(Dep); Substitute_list(Term1-pos1, t2);
Examples: "Sam, my brother" → appos(Sam, brother); "Bill (John's cousin)" → appos(Bill, cousin); "Sam" linked to "my brother" (name = is-a, type = equivalence)

R5 Abbreviation modifiers (abbrev)
Dep: abbrev(Term1-pos1, Term2-pos2)
Actions:
  Term t1 = Creation_term("Term1");
  Term t2 = Creation_term("Term2");
  TRelation R = Creation_Relation(type=abbrev, t1, t2);
  Concept C1 = Creation_NodeConcept(t1);
  C1.add(t2); C1.add(R);
  Delete(Dep);
Example: Dep: abbrev("Body Mass Index", BMI) → "Body Mass Index" linked to "BMI" (type = abbrev)

Fig. 5 Rules R3, R4, R5 (concept construction)

Rules R12, R13 and R14 exploit patterns relying on the presence of relative pronouns to determine the terms referenced by these pronouns and to deduce the conceptual structure in terms of relations and concepts. Prepositions in complex sentences may be part of an expression indicating time, place or grouping. Prepositions that are used after a verb are normally integrated into the label of the relationship (R14).

According to the conjunction-based rules used in the case of noun phrases or verbal phrases, the concepts that represent the predicates of such conjunctions inherit the subgraph attached to the first predicate of the conjunction (Fig. 8).


Pattern | Actions

R6 Subject of past participle (nsubjpass)
Dep1: nsubjpass(pastVerb-pos1, Term-pos2)
Actions: Term t1 = Creation_term("Term"); Concept C1 = Creation_NodeConcept(t1);

R7 Direct object (dobj)
Dep1: dobj(Verb-pos1, Term-pos2)

R8 Indirect object (iobj)
Dep1: iobj(Verb-pos1, Term2-pos2)

Fig. 6 Links related to additional objects

Pattern | Actions

R9
Dep0: nsubjpass(Verb-posi+2, Term1-posi)
Dep1: aux(Verb-posi+2, auxiliary-posi)
Dep2: auxpass(Verb-posi+2, be-posi+1)
Dep3: prep_X(Verb-posi+2, Term2-posi+4)
Actions:
  TRelation TR1 = Creation_Relation("auxiliary be verb X"); TR1.type = R9;
  Delete(Dep1); Delete(Dep2);
  Substitute_list(Verb-posi+2, TR1.name);
  TR1.link(Term1-posi, Term2-posi+4);
Example: "Body Mass Index can be calculated from weight"
Dependency list:
  Dep0: nsubjpass(calculated-5, Body Mass Index-1)
  Dep1: aux(calculated-5, can-3)
  Dep2: auxpass(calculated-5, be-4)
  Dep3: prep_from(calculated-5, weight-6)
Result: "Body Mass Index" --can be calculated from--> "weight"

R10
Dep1: nsubj(GVerb-posi+2, Term1-posi)
Dep2: dobj(GVerb-posi+2, Term2-posi+1)
Actions:
  Term t1 = Creation_term("Term1"); Term t2 = Creation_term("Term2");
  Concept C1 = Creation_NodeConcept(t1); Concept C2 = Creation_NodeConcept(t2);
  Delete(Dep1); Delete(Dep2);
  CRelation CR1 = Creation_Relation(C1, C2); CR1.name = GVerb;

R11
Dep0: nsubj(GVerb-posi+2, Term1-posi)
Dep1: dobj(GVerb-posi+2, Term2-posi+3)
Dep2: prep_X(GVerb-posi+2, Term3-posi+4)
Actions:
  Term t1 = Creation_term("Term1"); Term t2 = Creation_term("Term2");
  Concept C1 = Creation_NodeConcept(t1); Concept C2 = Creation_NodeConcept(t2);
  Delete(Dep0); Delete(Dep1); Delete(Dep2);
  CRelation CR1 = Creation_Relation(C1, C2); CR1.name = "GVerb X";

Fig. 7 Patterns based on verb-based phrases
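The aggregation performed by R9 can be sketched as follows (a simplified illustration of ours, not the authors' code): the aux, auxpass and prepositional dependencies around a passive verb are folded into one relation label.

```python
# Sketch of rule R9 (Fig. 7): aggregate aux/auxpass/prep_X dependencies
# around a passive verb into a single relation label, here for
# "Body Mass Index can be calculated from weight".
def build_relation_label(deps):
    """deps: list of (name, head, dependent) triples for one verb."""
    aux = auxpass = prep = verb = None
    for name, head, dep in deps:
        if name == "aux":
            verb, aux = head, dep
        elif name == "auxpass":
            auxpass = dep
        elif name.startswith("prep_"):
            prep = name.split("_", 1)[1]
    return " ".join(p for p in (aux, auxpass, verb, prep) if p)

deps = [("aux", "calculated", "can"),
        ("auxpass", "calculated", "be"),
        ("prep_from", "calculated", "weight")]
print(build_relation_label(deps))  # can be calculated from
```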

Other relationships are assigned to specific categories in the literature. We include the relationships of composition, hyponymy, attribute and causality. The most commonly used patterns (Fig 9) are those of subsumption relations.


Pattern | Actions

R12
Dep1: rcmod(Term1-posi, verb-posi+2)
Dep2: nsubj(verb-posi+2, Term2-posi+3)
Dep3: dobj(verb-posi+2, Term2-posi+3)
Actions:
  Concept C1 = Creation_Concept(Term1);
  Concept C2 = Creation_Concept(Term2);
  TRelation TR1 = Creation_Relation(C1, C2);
  TR1.name = Verb; TR1.type = R12;
  Delete(Dep1); Delete(Dep2); Delete(Dep3);
Example: "... calculation that scores your body fat."
Dependency list:
  Dep1: rcmod(calculation-4, score-6)
  Dep2: nsubj(calculation-4, that-5)
  Dep3: dobj(score-6, body fat)
Result: "calculation" --score--> "body fat"

R13
Dep1: nsubj(GVerb-posi+2, Term1-posi)
Dep2: dobj(GVerb-posi+2, Term2-posi+1)
Actions:
  Term t1 = Creation_term("Term1"); Term t2 = Creation_term("Term2");
  Concept C1 = Creation_NodeConcept(t1); Concept C2 = Creation_NodeConcept(t2);
  Delete(Dep1); Delete(Dep2);
  CRelation CR1 = Creation_Relation(C1, C2); CR1.name = GVerb;

R14
Dep0: nsubj(GVerb-posi+2, Term1-posi)
Dep1: dobj(GVerb-posi+2, Term2-posi+3)
Dep2: prep_X(GVerb-posi+2, Term3-posi+4)
Actions:
  Term t1 = Creation_term("Term1"); Term t2 = Creation_term("Term2");
  Concept C1 = Creation_NodeConcept(t1); Concept C2 = Creation_NodeConcept(t2);
  Delete(Dep0); Delete(Dep1); Delete(Dep2);
  CRelation CR1 = Creation_Relation(C1, C2); CR1.name = "GVerb X";

R15
Dep1: conj_and(Term1-posi, Term2-posi+1)
Dep2: nsubj(verb-posi+2, Term2-posi+1)
Dep3: dobj(verb-posi+2, Term3-posi+2)
Actions:
  Concept C1 = Creation_Concept("Term1");
  Concept C2 = Creation_Concept("Term2");
  Concept C3 = Creation_Concept("Term3");
  TRelation TR1 = Creation_Relation(C1, C3);
  TRelation TR2 = Creation_Relation(C2, C3);
  TR1.name = Verb; TR2.name = Verb; TR1.type = R15;
  copy_dependency("Term1", "Term2");
Example: "Body Mass Index and Other Measures of adiposity estimate critical fat values."
Dependency list:
  nsubj(estimate-9, Body Mass Index-3)
  conj_and(Body Mass Index-3, Measures-6)
  nsubj(estimate-9, Measures-6)
  dobj(estimate-9, critical values-12)

Fig. 8 Patterns based on relative pronouns and on conjunctions

Illustrative example of semantic concept map (SCM) building. We illustrate the application of the rules described above on the query "BMI?"; the list of dependencies generated for one snippet is shown in Table 1.

The different stages of the construction of the semantic concept map are illustrated by Fig 10 and Fig 11.


Pattern | Actions

R16
prep_of(Term1-posi, Term2-posi+1) ("Term1 of Term2") → relation name = subsume, type = taxonomic;
poss(Term1-posi, Term2-posi+1) ("Term1's Term2") → relation name = has, type = attribute (has-a).
Example: "Body Mass Index and Other Measures of Adiposity estimate critical fat values."
Dependency list:
  nsubj(estimate-8, Body Mass Index-3)
  prep_of(Measures-7, Adiposity-5)
  conj_and(Body Mass Index-3, Measures-7)
  nsubj(estimate-8, Measures-7)
  dobj(estimate-8, critical fat values-11)

Fig. 9 Patterns related to possessive and taxonomic relationships

The result of the application of the last rule, R16, gives the semantic concept map shown in Fig 11.

Relations and concepts arising from the identification of the previously described patterns are not validated immediately. It is necessary to use two operators to select the relevant concepts and relationships. These operators are described in Section 5.5.2.

5.5.2 Proposition of Contextual Scores Associated with Concepts and Relations

In this section, we describe the contextual scores used by the operators responsible for concept and relation selection. In fact, three categories of scores are distinguished in this work:

– a contextual score of concept selection, which takes into account the context of the pivotal concept (semantic signature);
– a contextual score related to the selection of taxonomic relationships linking a pivotal concept and a candidate one;
– a contextual score related to the selection of non-taxonomic relationships.

Only the first two scores are used in the construction of the minimal semantic map. The other scores are involved in the enrichment step and in the weighting of the resulting semantic concept map.

Originality. As stated in the related work section, unsupervised web measures based on the distribution of terms, which are heavily used by recent approaches of ontology learning [19], [13], suffer from linguistic ambiguity and polysemy problems (i.e., ambiguous words). In fact, a misinterpretation of the co-occurrence


Steps | Actions

Step 1 nn(Index-3, Body-1)

nn(Index-3, Mass-2)

nsubj(number-9, Index-3)

abbrev(Index-3, BMI-5)

cop(number-9, is-7)

det(number-9, a-8)

partmod(number-9, calculated-10)

det(person-13, a-12)

poss(weight-15, person-13)

prep_from(calculated-10, weight-15)

prep_from(calculated-10, height-17)

conj_and(weight-15, height-17)

Deleting the determinants

Step 2 Application of R2 and R5

nn(Index-3, Body-1)

nn(Index-3, Mass-2)

nsubj(number-9, Index-3)

abbrev(Index-3, BMI-5)

cop(number-9, is-7)

partmod(number-9, calculated-10)

poss(weight-15, person-13)

prep_from(calculated-10, weight-15)

prep_from(calculated-10, height-17)

conj_and(weight-15, height-17)

Step 3 Application of R15 and R16

nsubj(number-9, Body Mass Index-3)

cop(number-9, is-7)

partmod(number-9, calculated-10)

poss(weight-15, person-13)

prep_from(calculated-10, weight-15)

prep_from(calculated-10, height-17)

conj_and(weight-15, height-17)

Step 4 Application of R12 and R15

nsubj(number-9, Body Mass Index-3)

cop(number-9, is-7)

partmod(number-9, calculated-10)

poss(weight-15, person-13)

prep_from(calculated-10, weight-15)

prep_from(calculated-10, height-17)

conj_and(weight-15, height-17)

Step 5 Application of rule 16

nsubj(number-9, Body Mass Index-3)

cop(number-9, is-7)

Final ontology module

Fig. 10 Example of concept map extraction (1)

of terms could distort the selection of candidate concepts, since the score defined by Turney relies only on the number of web pages containing the pivotal concept and the candidate one (Formula 3).

Therefore, the need to contextualize these scores is motivated by the fact that the context can provide enough information on the meaning of the two words.

It is necessary to take into account the context ctx_α (Definition 5), which can be obtained from the user search history (semantic signature of pivotal concepts extracted from user feedback).

Definition 5 The context of a word, ctx_α, is defined by the set of α words belonging to the semantic signature and having the highest PMI-IR values.



Fig. 11 Final concept map

Score of candidate concept selection. By redefining the Pointwise Mutual Information measure PMI-IR and the candidate selection score Score_Web^α_ctx, we obtain formulas (5) and (6), respectively:

PMI-IR^α_ctx(concept, candidate) = ln( hit(concept AND candidate AND ctx_α)² / ( hit(concept AND ctx_α) × hit(candidate AND ctx_α) ) )   (5)

Score_Web^α_ctx(concept, candidate) = hit(concept AND candidate AND ctx_α)² / hit(candidate AND ctx_α)   (6)

We illustrate these proposed scores in the next subsection.
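Formulas (5) and (6) can be sketched directly from hit counts; in the sketch below (ours), hit() is a stub standing in for the number of web pages returned for a conjunctive query, with toy counts for illustration only:

```python
import math

# Sketch of formulas (5) and (6): contextual PMI-IR and the candidate
# selection score, parameterized by a hit() function returning web hit
# counts for a conjunctive query (stubbed here with a dictionary).
def pmi_ir_ctx(hit, concept, candidate, ctx):
    both = hit([concept, candidate] + ctx)
    return math.log(both ** 2 / (hit([concept] + ctx) * hit([candidate] + ctx)))

def score_web_ctx(hit, concept, candidate, ctx):
    return hit([concept, candidate] + ctx) ** 2 / hit([candidate] + ctx)

# Toy hit counts, not real search-engine figures.
counts = {("BMI", "weight", "neurology", "device"): 40,
          ("BMI", "neurology", "device"): 200,
          ("weight", "neurology", "device"): 100}
hit = lambda terms: counts[tuple(terms)]
print(score_web_ctx(hit, "BMI", "weight", ["neurology", "device"]))  # 16.0
```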

Illustrative example of concept selection. Given the context ctx obtained from the user history related to the ambiguous query "BMI", we suppose that the hidden meaning of this acronym is "brain-machine interface".

In this example, ctx includes the terms "neurology" and "device". Thus, the score associated with each candidate concept is computed as follows:

Score_Web²_ctx("BMI", candidate) = hit("BMI" AND candidate AND "neurology" AND "device")² / hit(candidate AND "neurology" AND "device")   (7)

We note that the terms with the highest scores are characterized by an important semantic relatedness with the concept brain-machine interface (i.e., "disease", "medicine" and "paralysis").

The proposed score is used by the selection operators in order to validate the concepts and relationships that should be added to the ontology module. It is


Word      | Score^α_ctx | PMI-IR^α_ctx
disease   | 1.5581E-10  | 0.11586207
medicine  | 2.448E-10   | 0.10566038
paralysis | 1.5152E-11  | 0.10515152
institute | 4.3082E-11  | 0.06389262
index     | 5.8368E-10  | 0.04907749
company   | 1.496E-10   | 0.04479167
weight    | 3.6453E-10  | 0.03652695

Table 2 Computation of the concept selection scores

important that the module contains only ontological concepts sharing a strong semantic correlation between them.

A third score is defined in the next sub-section for the selection of candidate concepts linked to the pivotal concept by a taxonomic relationship.

Contextual score of hyponym selection based on hyponymy pattern distribution on the web. This score uses lexico-syntactic patterns of the taxonomic relationship as part of the formula. Given that there are several patterns, the occurrence score for each pattern is calculated and only the maximum value of the obtained results is considered.

The proposed score Score_PatternHypo^α_ctx(concept, candidate) is defined by formula (8):

Score_PatternHypo^α_ctx(concept, candidate) = max_{i=1..l} ( hit(Pattern_i(concept, candidate) AND ctx_α)² / hit(candidate AND ctx_α) )   (8)

The candidate hyponyms are extracted from snippets by reformulating queries with the hyponymy patterns of [8], as described in Table 3.

Lexico-syntactic pattern                | Query                       | Example
NP {,} including {NP,}* {or|and} NP     | "Pivotal Term including" *  | "body mass index including" *
such NP as {NP,}* {or|and} NP           | "such Pivotal Term as" *    | "such body mass index as" *
NP {,} such as {NP,}* {or|and} NP       | "Pivotal Term such as" *    | "body mass index such as" *
NP {,} especially {NP,}* {or|and} NP    | "Pivotal Term especially" * | "body mass index especially" *

Table 3 Hyponymy pattern queries

Figure 12 shows an example of reformulated query.
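Formula (8) and the retention step can be sketched together as follows; this is an illustration of ours in which hit() is a toy stub for web hit counts and the pattern strings approximate Table 3:

```python
# Sketch of formula (8): the hyponymy score is the maximum, over the
# hyponymy patterns of Table 3, of hit(pattern & ctx)^2 / hit(candidate & ctx).
HYPO_PATTERNS = ['"{c} including {h}"', '"such {c} as {h}"',
                 '"{c} such as {h}"', '"{c} especially {h}"']

def score_pattern_hypo(hit, concept, candidate, ctx):
    return max(hit([p.format(c=concept, h=candidate)] + ctx) ** 2
               / hit([candidate] + ctx)
               for p in HYPO_PATTERNS)

def select_hyponyms(hit, concept, candidates, ctx, threshold=1e-5):
    """Retain candidates whose maximal pattern score exceeds the threshold."""
    return [c for c in candidates
            if score_pattern_hypo(hit, concept, c, ctx) > threshold]

def hit(terms):  # toy counts for illustration only
    table = {('"cancer such as leukemia"', "oncology"): 30,
             ("leukemia", "oncology"): 1000,
             ("national cancer", "oncology"): 100000}
    return table.get(tuple(terms), 0)

print(select_hyponyms(hit, "cancer", ["leukemia", "national cancer"], ["oncology"]))
# ['leukemia']
```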

Illustrative example of hyponym selection. We explain here how to use the proposed score (8) for hyponym selection for the concept "cancer" (Table 4).



Fig. 12 Candidate hyponyms

Candidate concept       | "cancer such as" | "cancer including" | "such cancer as" | "cancer especially" | Maximum (empirical threshold = 1.E−5)
Breast cancer           | 8.33E−4          | 0.01               | 1.49E−4          | 0.0093              | 0.01
Leukemia                | 0.0022           | 0.038              | 7.09E−9          | 0.0127              | 0.038
Solid tumors            | 0.04             | 0.19               | 3.04E−3          | 0.047               | 0.047
Squamous cell carcinoma | 0.0044           | 0.0949             | 3.67E−7          | 0.00336             | 0.0949
national Cancer         | 8.771E−9         | 1.19E−8            | 0                | 2.92E−9             | 2.33E−8

Table 4 Computation of the hyponym selection score

With an empirical threshold of 1.E−5, the hyponyms of Table 4 whose score is higher than this threshold are retained. Therefore, the candidate concept "national cancer" is excluded from the list of hyponyms.

5.6 Enriching and Weighting the Semantic Concept Map

The Relation Label Learning Operator (RLLO) aims to discover other possible labels of discovered relations using snippets and lexico-syntactic patterns. For example, to discover new labels of the relation "IS-A" that relates the concepts "BMI" and "measure", the query "BMI * measure" is submitted to a search engine in order to import snippets that contain sentences regrouping these two concepts, so as to extract the possible verbs relating them. Consider the following snippets:


Abstract 1: ".. adipose tissue located throughout the body, of which BMI provides a measure"
Abstract 2: "so BMI reverts to a measure of gross body weight in the individual."
Abstract 3: "BMI Index gives a measure of what is underweight,"
Abstract 4: "BMI offers a reasonable measure of fatness ....."

Possible relation labels include the verbs "provide", "revert to", "give" and "offer" (Fig. 14).

Fig. 13 Example of a context Map and an extracted typed attributed graph related to BMItopic

Each label (verb) is weighted by the proposed score (formula 9), which takes into account the occurrence of the expression "concept-1 verb concept-2" relative to the occurrence of the linked concepts, in order to deduce the most specific relation that links these concepts.

Score_Web(verb(Concept1, Concept2)) = hit("Concept1 verb Concept2" AND ctx_2) / hit("Concept1 * Concept2" AND ctx_2)   (9)

As shown by the following score values, the verbs "provide" and "give" are the most frequent labels for this kind of relationship. Only verbs whose score is greater than the empirical threshold (defined as 1e−8) are retained:

– Score_Web(give) = 7.6458e−7
– Score_Web(offers) = 0
– Score_Web(provides) = 2.716297e−5
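The retention step above reduces to a threshold filter; a minimal sketch (ours), reusing the score values reported in the text:

```python
# Sketch of the label-weighting step (formula 9): each candidate verb is
# scored by hit counts and kept only above the empirical 1e-8 threshold.
def retain_labels(scores, threshold=1e-8):
    return [verb for verb, s in scores.items() if s > threshold]

scores = {"give": 7.6458e-7, "offer": 0.0, "provide": 2.716297e-5}
print(sorted(retain_labels(scores)))  # ['give', 'provide']
```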


On the other hand, the newly discovered labels are considered as new patterns for the discovery of candidate concepts that can be related to domain concepts by means of these labels. Therefore, new queries are submitted to the search engine (such as "BMI provides *"), which provides relevant sentences containing these patterns. Then, new domain candidates are discovered.

5.7 Weighting Concepts

Discovered candidate concepts are represented by nodes and are weighted using the score presented by the following formula, taking into account the appropriate pattern:

Score_Web(PatternRelation(Pivotal_Concept, Candidate_Concept)) = max_{i,j}( hit(PatternRelation(Pivotal_Term_i, Candidate_Term_j) AND ctx_2) ) / max_j( hit(Candidate_Term_j AND ctx_α) )   (10)

This formula computes the maximum probability of finding any non-taxonomic relation involving the candidate concept and the domain concept within the scope of a web document. If this score is remarkably lower than a threshold, the discovered concept or relation is rejected. Since each concept can be referred to by multiple terms, the score Score_Web²_ctx(PatternRelation(pivotal concept, candidate concept)) is based on the distribution of these terms, as shown by formula (10).

The discovered concepts are also refined using the threshold ∂, initialized to the average of the similarity values between the topic signature terms and the expression PatternRelation(Pivotal_Concept), as follows:

∂ = ( Σ_{i=1..n} Score_Web(PatternRelation(Pivotal_Concept), term_sign_i) ) / n   (11)

where n is the number of terms belonging to the semantic signature associated with the pivotal concept.
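Formula (11) is a simple arithmetic mean over the signature terms; a one-function sketch (function name and toy values are ours):

```python
# Sketch of formula (11): the pruning threshold is the mean of the
# Score_Web values computed between the pivotal concept's relation
# pattern and the n terms of its semantic signature.
def pruning_threshold(signature_scores):
    return sum(signature_scores) / len(signature_scores)

print(pruning_threshold([2.0, 4.0, 6.0]))  # 4.0
```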

6 Ontology Module Representation

At an abstract level, since there exists a bijection between graphs and ontology languages, we choose a graph-based formalism as an abstract level of modular ontology definition. Since the OMs are supposed to be extracted from unstructured text, as explained in [24], the discovered concepts and relations are not validated in one step. The ontological representation should take into account change management and automatic updates.

In our work, we propose to define a general structure of an ontological knowledge base (networked OMs classified by domain) independent of any concrete language [25].

The proposed ontological knowledge warehouse includes a graph-based modeling and functional operators for inter-level navigation. For this reason, we choose


to rely on typed attributed graphs, as they are powerful enough to represent an ontology written in RDF, OWL or DAML+OIL. Besides, the typed attributed graph is the model implemented in the ACG library for graph transformation. Details about attributed graphs are described in [26].

A typed attributed graph representation of an OM, AG_M, is a pair (N_G, E_G), where N_G is a set of attributed nodes and E_G is a set of attributed edges.

A typed attributed node N_G = (T_N, AV_N) has a type T_N and a set of attribute values AV_N, where T_N is the set C_N of terms referring to a concept C and AV_N is the set of score values assigned to each of the terms belonging to C_N.

A typed attributed edge E_G = (T_E, R_N, AV_E, O_E, D_E) has a type T_E, a set of attribute values AV_E, an origin node O_E and a destination node D_E, where T_E denotes the type of the relation (hyponymy, meronymy, possession, etc.) and R_N is a set of terms referring to the relation defined in sig(R).

A typed attribute value AV_E is a pair (R_N, score) associating a score value with a term of R_N.

We propose the following statements:

– Topic ontology is specified by a classic typed attributed graph of thematic concepts;
– Modular domain ontology is represented by a typed attributed graph of OMs;
– An OM is defined by a typed attributed graph of concepts (ontology definition);
– Each concept is a typed attributed graph of terms.

Furthermore, we propose the model of a typed attributed graph of ontologies where the nodes and edges are of several types, as follows.

The Term node (T): is the smallest conceptual unit, extracted in the form of a nominal expression.

The Concept node (C): is a typed attributed graph of Term nodes (T) connected by arcs typed as terminological relations (TR); it is a conceptual unit of the Module (M) graph.

The Module node (M): is a graph of Concept nodes (C) connected by conceptual relationships (CR).

The Ontology node (GO): is the largest unit, defined as a typed attributed graph of Module nodes (M) connected by inter-module interfaces (I).
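The node and edge structures described above can be sketched as plain data classes; this is an illustration of ours (class and field names are assumptions, not the authors' implementation), with weights echoing those of Fig. 14:

```python
from dataclasses import dataclass, field

# Sketch of the typed attributed graph model: nodes carry a type and
# weighted terms; edges carry a relation type, weighted labels, and
# their origin/destination nodes.
@dataclass
class AttributedNode:
    node_type: str                              # "T", "C", "M" or "GO"
    terms: dict = field(default_factory=dict)   # term -> score

@dataclass
class AttributedEdge:
    edge_type: str        # hyponymy, meronymy, possession, ...
    labels: dict          # relation label -> score
    origin: AttributedNode
    destination: AttributedNode

bmi = AttributedNode("C", {"Body Mass Index": 0.080, "Body fat Index": 0.002})
measure = AttributedNode("C", {"Calculation": 0.046, "Formula": 0.075})
is_a = AttributedEdge("hyponymy", {"is-a": 0.093, "provide": 0.075}, bmi, measure)
print(is_a.edge_type, max(is_a.labels, key=is_a.labels.get))  # hyponymy is-a
```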

Figure 14 presents a reused module related to the disease subtopic.

7 Experiments

In order to study the feasibility of the proposed approach of on-line construction of ontology modules, two case studies were considered, namely:

– the context of a domain-independent web search: here, we proposed to extend the semantic layer of a general-purpose web search engine such as Yahoo!, Google, Bing or Live Search;

– the context of a search related to a specific domain: here, we chose to implement the integration of ontology module building for the medical digital library PubMed. The extension of this system is entitled "SemPubMed".


[Figure 14 depicts Root_Module_BMI as an attributed graph: the concept node BMI_014 (terms "Body Mass Index": 0.080, "Body Mass Index (BMI)": 0.075, "Body fat Index": 0.002) is linked by a weighted hyponymy relation (is-a: 0.093, provide: 0.075, offer: 0.005, revert-to: 0.002) to Measure_018 (Calculation: 0.046, Assessment: 0.0012, Formula: 0.075, Number: 0.002), and by weighted role relations (calculate, use, apply for, estimate) to the concept nodes Ratio_019, Adult_019, obesity_025 and Height_and_weight_025.]

Fig. 14 ontology module represented by an attributed graph

The corresponding component supporting the proposed approach, entitled "ModLearn", has been developed in order to interface two external components (the search engine Yahoo!Boss for the first case study and the digital library PubMed for the second).

The main submodules that make up the ModLearn component are:

– A submodule for the analysis and reformulation of the user query;
– A submodule for extracting candidate answers: it retrieves abstracts related to the concept of the pivot query using the search engine Yahoo!Boss. The sentences extracted from the abstracts are then analyzed with the Stanford parser to generate the list of dependency links. This list is handled by a class "ExtractConceptCandidate" that applies rules for the identification and selection of candidate concepts;
– A submodule for building ontology modules: it uses the list of dependency links and the set of rules for identifying non-taxonomic relationships to build the ontology modules;
– A submodule for displaying results: used to serialize an ontology module in RDFS or OWL format.
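To illustrate the candidate-answer submodule, here is a minimal, hypothetical sketch of rule-based candidate-concept selection over dependency triples (the rule set and relation names are illustrative; the actual ModLearn rules and the Stanford-parser integration are not reproduced here):

```python
# Hypothetical rules over (relation, governor, dependent) triples,
# as a dependency parser would produce for one snippet sentence.
def extract_candidates(dependencies, pivot):
    """Return a set of candidate concepts around the pivot term."""
    candidates = set()
    for rel, gov, dep in dependencies:
        # Rule 1: a noun compound-modifying the pivot forms a
        # candidate multi-word concept (e.g. "mass" + "index").
        if rel == "compound" and gov == pivot:
            candidates.add(f"{dep} {gov}")
        # Rule 2: the subject/object linked to the pivot is kept
        # as a related candidate concept.
        elif rel in ("nsubj", "dobj") and pivot in (gov, dep):
            candidates.add(dep if gov == pivot else gov)
    return candidates

deps = [("compound", "index", "mass"),
        ("nsubj", "measure", "index"),
        ("dobj", "calculate", "ratio")]
print(extract_candidates(deps, "index"))  # {'mass index', 'measure'}
```

In the real component, such candidates would then be filtered by the web-based relatedness scores before entering a module.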

7.1 Case Study 1: Modular Ontology Learning For Web Search

In this case study, we have tried to compare our approach with different related works on modularization. However, it has been difficult to compare the proposal with OM extraction approaches, since the features and the usage of


Fig. 15 ModLearn Architecture.

each input of those approaches are different. Therefore, it seems more logical to evaluate the present work according to OL criteria.

We add that the evaluation of the proposed approach is based on two main criteria:

– The comparison of the OM learning process (figure 16);
– The impact of OM learning on the relevance of search results (figure 17).

7.1.1 Evaluation of Ontology Module Learning

In figure 16, two ontologies are compared. On the one hand, we use a Taxonomic Precision (TP), a similarity measure based on the notion of semantic cotopy (sc), recently presented and analysed in [12]. This measure was chosen to take advantage of its ability to compare ontologies as whole structures. The values of TP lie in the range [0, 1].

We use the MeSH ontology (MSO) as a reference ontology. Eighty queries on the topic of animal diseases were collected manually by using 80 concepts of MSO. Eighty ontology modules, which make up a large ontology (RO), were constructed according to the proposed approach, to be compared with MSO.

The semantic cotopy sc1(c, O) of a concept c from an ontology O is the set containing c and all super- and sub-concepts of c in O, excluding the root concept of O. Then, the taxonomic precision TP(c, RO, MSO) of a concept c and two ontologies


Fig. 16 Ontology module learning evaluation

RO and MSO, where c ∈ RO and c ∈ MSO, is defined as follows:

TP(c, RO, MSO) = \frac{|sc(c, MSO) \cap sc(c, RO)|}{|sc(c, RO)|} \qquad (12)

A Taxonomic Recall (TR) can be assessed as follows:

TR(c, RO, MSO) = \frac{|sc(c, MSO) \cap sc(c, RO)|}{|sc(c, MSO)|} \qquad (13)

Therefore, the global TP and TR are computed respectively by the following formulas:

GTP(RO, MSO) = \frac{1}{|RO|} \sum_{c \in RO} TP(c, RO, MSO) \qquad (14)

GTR(RO, MSO) = \frac{1}{|MSO|} \sum_{c \in MSO} TR(c, RO, MSO) \qquad (15)

Non-taxonomic precision and recall are calculated according to the same formulas, by using the second definition of semantic cotopy, sc2(c, O), which is the set containing a concept c and all concepts related to c by a non-taxonomic relationship.
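For concreteness, here is a minimal sketch of sc1 and the per-concept taxonomic precision of equation (12), over a toy taxonomy encoded as child-to-parent edges (the encoding and the concept names are illustrative assumptions):

```python
def ancestors(c, parent):
    """All super-concepts of c, following child -> parent edges."""
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def descendants(c, parent):
    """All sub-concepts of c in the taxonomy."""
    return {x for x in parent if c in ancestors(x, parent)}

def sc(c, parent, root):
    """Semantic cotopy sc1: c plus its super- and sub-concepts,
    excluding the root concept of the ontology."""
    return ({c} | ancestors(c, parent) | descendants(c, parent)) - {root}

def tp(c, ro, mso):
    """Taxonomic precision of concept c (Eq. 12); each ontology is
    a (child_to_parent_dict, root) pair."""
    ro_parent, ro_root = ro
    mso_parent, mso_root = mso
    s_ro = sc(c, ro_parent, ro_root)
    s_mso = sc(c, mso_parent, mso_root)
    return len(s_ro & s_mso) / len(s_ro)

# Toy learned ontology RO and reference MSO sharing the root "disease".
ro = ({"flu": "infection", "infection": "disease", "cough": "flu"}, "disease")
mso = ({"flu": "infection", "infection": "disease"}, "disease")
print(tp("flu", ro, mso))  # 2/3: "cough" is in RO's cotopy but not MSO's
```

Swapping the denominator for |sc(c, MSO)| gives the recall of equation (13), and averaging over concepts gives the global measures.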

Figure 16 shows the evolution of the precision of the taxonomic and non-taxonomic structures according to the number of snippets used in ontology module learning.

On the other hand, in order to evaluate the approach presented in this paper, the impact of using OMs during query reformulation is also assessed.

7.1.2 Impact of Ontology Module learning on the relevance of search results

First, we have computed the precision of results retrieved by means of query reformulation using the discovered modules. The evaluation results in Figure 17 represent the obtained precision according to the number of retrieved documents (from 4 to 100).


Fig. 17 Evaluation of result precision (precision vs. number of retrieved documents, for Yahoo (baseline), WordNet, and ontology module learning (ModLearn)).

The first scenario represents the initial search, which is a keyword search on Yahoo.

The second scenario represents the situation where similar cases exist in the database; the search relies on WordNet to add synonyms.

The third scenario represents the search for information based on the learned OMs, using 100 snippets for ontology module learning.

The query reformulation is based on answer patterns extracted from the constructed OMs. We have observed a significant improvement in the relevance of the retrieved information according to the amount of knowledge considered during query reformulation and OM creation. We have also noticed that this improvement is maintained as the number of documents increases, even though the quality of the retrieved document set decreases due to the higher number of noisy and unrelated documents retrieved.
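As a rough illustration of this reformulation step, a module-based expansion could look like the following sketch (the module encoding, the term weights and the `reformulate` helper are hypothetical; the actual answer-pattern reformulation in ModLearn is richer):

```python
# Hypothetical module-based query expansion: enrich the pivot query
# with the highest-weighted terms of the matching ontology module.
def reformulate(query, module, k=3):
    """module: {concept_name: [(term, weight), ...]}"""
    expansions = sorted(
        (term for terms in module.values() for term in terms),
        key=lambda tw: tw[1], reverse=True)[:k]
    return query + " " + " ".join(t for t, _ in expansions)

# Toy module loosely inspired by the BMI module of Fig. 14.
bmi_module = {"BMI": [("body mass index", 0.080), ("body fat index", 0.002)],
              "Measure": [("formula", 0.075), ("calculation", 0.046)]}
print(reformulate("BMI", bmi_module, k=2))  # "BMI body mass index formula"
```

Expanding with module terms rather than generic synonyms is what ties the reformulated query to the search context captured by the module.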

The results have revealed that: (1) accuracy was significantly improved by using modular ontologies; (2) the discovered ontology modules are important to better contextualize user searches; and (3) the relevance of documents is not based on term frequency but on the semantic relatedness between terms.

Second, in order to evaluate the ranking quality of results according to the reformulated query, we used the well-known Normalized Discounted Cumulative Gain (NDCG) measure. When evaluating a ranking list, NDCG is computed according to the original paper [9], as follows:


NDCG(n) = Z_n \sum_{j=1}^{n} \frac{2^{r(j)} - 1}{\log(j+1)},

where r(j) is the rating of the j-th document in the list, and the normalization constant Z_n is chosen so that the perfect list gets an NDCG score of 1.
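A minimal sketch of this measure (assuming, as is common, a base-2 logarithm; the rating values are illustrative):

```python
import math

def dcg(ratings):
    """Discounted cumulative gain; j is 0-based here, so the
    discount log2(j + 2) matches log(j + 1) for 1-based ranks."""
    return sum((2 ** r - 1) / math.log2(j + 2)
               for j, r in enumerate(ratings))

def ndcg(ratings):
    """Normalize by the DCG of the ideal (sorted) ranking, so a
    perfect list scores exactly 1."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0

# A ranking whose relevant documents are not all at the top
# scores below 1; an ideally sorted one scores exactly 1.
print(ndcg([3, 2, 3, 0, 1]))
print(ndcg([3, 3, 2, 1, 0]))  # 1.0
```

The 2^r - 1 gain strongly rewards placing highly rated documents early, which is why NDCG is sensitive to the top of the list.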

Fig. 18 Evaluation of results ranking

Figure 18 shows the evaluation results measured by NDCG for the scenarios previously described. The X-axis refers to the Web page rank. Again, it is shown that queries reformulated using answer patterns (extracted from the obtained ontology modules) have contributed to significantly improving the document ranking.

7.2 Case Study 2: Modular Ontology Learning for Digital Libraries

Several tests have been performed to evaluate the Sem-PubMed system. We submitted several queries to the system and evaluated the relevance of two outputs:

– the discovered concepts;
– the returned results.

This evaluation thus focuses on two main criteria: the relevance of the newly discovered ontological concepts and the relevance of the search results.

7.2.1 Evaluation of Discovered Concepts

In this section, we compare the ontology built by Sem-PubMed with the MeSH ontology, to assess the results of the ontology building module of the system.

As the goal is to compare ontologies as whole structures, we use Taxonomic Precision (TP), a similarity measure based on the notion of semantic cotopy. Its values are in the [0, 1] range. The tests consisted of many queries related to different medical diseases. RO represents the ontology built by Sem-PubMed and MSO is the MeSH ontology.


In these expressions, SC(C, O) is the number of elements of the set containing the concept C and all the concepts related non-taxonomically to it in the ontology O. As shown in Figure 19, the average precision is below 50%, meaning that less than half of the new concepts discovered by the system appear in MeSH. Thus, it can be argued that Sem-PubMed is able to automatically detect new domain concepts from the Web, in order to assist user searches in digital libraries. On the other hand, recall is higher than precision, as the terms that appear both in RO and MeSH are between 0.5 and 0.7. This shows that the number of concepts related to the query in RO is higher than in MeSH. Finally, the erratic behavior of the precision is due to the specific queries that were performed: some concepts are well described in MeSH while others are less considered, or absent.

Fig. 19 Comparison of average precision and recall according to the number of concepts.

7.2.2 Evaluation of Returned Results

Figure 20 shows the average precision values for this set of queries, depending on the number of results considered. For instance, if we consider only the first 10 results given by PubMed, the average precision is 0.7 (7 of the 10 results are relevant for the query); in the same case for Sem-PubMed, the average precision is 1 (all 10 results are relevant for the query). It can be seen that, in all cases, the precision of the results of Sem-PubMed is clearly higher than that offered by PubMed (especially when the number of results is low).

This shows that the candidate concepts found by Sem-PubMed contribute significantly to improving the relevance of semantic search in digital libraries. Thus, the two main consequences of these results are the following:

– The accuracy of the results was significantly improved by using candidate concepts to refine the initial query of the user;


Fig. 20 Comparison of precision in search results (precision vs. number of documents, for PubMed and our system).

– The improvement of the relevance of documents is not a result of considering the frequency of syntactic terms, but rather of the semantic relatedness between the query and the documents.

8 Computational Complexity Analysis

The computational complexity of the construction of a module from an ontological query depends on the number of abstracts taken as input. For a given query, n abstracts are analyzed by the Stanford parser, with complexity c_sp.

For each abstract, at most sixteen rules can be applied; their complexity is denoted by c_rules. The application of a rule creates no more than four conceptual units, each with complexity c_map. In sum, the overall complexity is bounded by n · c_sp + 16 · n · c_rules + 64 · n · c_map. Thus, the complexity is of order O(n).

The processing time for a request varies from 1.05 to 1.15 seconds on average for the analysis of a single abstract.

9 Discussions

From the experimental results presented in the previous sections, we have identified the following findings:

– A significant improvement in the relevance of results (at the top 20 documents): this is due to the integration of on-line ontology module learning with the information retrieval process. Indeed, we observed that the modular structure of the ontology has contributed to improving the retrieval of similar documents. This is due to the fact that the ontology module contains, on the one hand, the semantics of a concept built around the pivot terms, and also includes terms selected according to the contextual relatedness score computed on the basis of term distribution on the web. These measures are adapted to the context of the search engine used for ontology learning.

– The variance of the precision and recall of learned ontological elements tends to 0 when more than 75 abstracts from the web are used.

– The recall of non-taxonomic relationships is higher than that of taxonomic relationships.

10 Conclusion and Future Work

In this paper, we have proposed a new approach for ontology module extraction from web snippets and user feedback (past user queries and selected documents). Unlike many previous modularization approaches, the originality of this work is that it has been designed in an automatic and domain-independent way, exploiting unsupervised techniques and the web as a large-scale learning source. The contribution resides in the following techniques: web-based co-occurrence measures for the assessment of extracted knowledge (concepts and relationships); an unsupervised method for context map construction; and an attributed graph representation for a multi-label representation of ontology modules. The evaluation of the proposal is based on two criteria: the comparison of the OM extraction process, and the impact of module-based query reformulation on the relevance of search results. The evaluation of the question answering system has revealed that the accuracy of the results was significantly improved by using modular ontologies. Our ongoing work aims at exploring ontology module construction for social search systems.

References

1. Ben-Mustapha, N., Baazaoui-Zghal, H., Aufaure, M.-A., Ben-Ghzala, H.: Survey on ontology learning from web and open issues. In: Third International Symposium on Innovation in Information and Communication Technology (ISIICT 2009), Amman, Jordan, (2009)

2. Stuckenschmidt, H., Parent, C., Spaccapietra, S.: Modular Ontologies: Concepts, Theories and Techniques for Knowledge Modularization. Springer-Verlag, Berlin, Heidelberg, (2009)

3. Noy, N., Musen, M.: Specifying Ontology Views by Traversal. In: Proc. of the International Semantic Web Conference (ISWC), (2004)

4. D'Aquin, M., Sabou, M., Motta, E.: Modularization, a Key for the Dynamic Selection of Relevant Knowledge Components. In: Proc. of the ISWC 2006 Workshop on Modular Ontologies, (2006)

5. Seidenberg, J., Rector, A.: Web Ontology Segmentation: Analysis, Classification and Use. In: Proc. of the World Wide Web Conference (WWW), (2006)

6. Sanchez, D., Moreno, A.: Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering, 64(3):600-623, (2008)

7. Elloumi, M., Ben-Mustapha, N., Baazaoui, H., Moreno, A., Sanchez, D.: Evolutive Content-Based Search System. In: KDIR, (2010)

8. Hearst, M.A.: Automated Discovery of WordNet Relations. In: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, pp. 132-152, (1998)

9. Jarvelin, K., Kekalainen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422-446, (2002)

10. Ehrig, H., Ehrig, K., Prange, U., Taentzer, G.: Fundamental theory for typed attributed graphs and graph transformation based on adhesive HLR categories. Fundamenta Informaticae, 74(1):31-61, (2006)

11. Geleijnse, G. and Korst, J. H. M.: Automatic ontology population by googling. In Pro-ceedings of the Seventeenth Belgium-Netherlands Conference on Artificial Intelligence, pp.120-126, (2005)


12. Maedche, A., Staab, S.: Measuring similarity between ontologies. In: Proc. CIKM 2002, LNAI vol. 2473, (2002)

13. Cimiano, P.: Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, (2006)

14. Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., Yates, A.: Web-Scale Information Extraction in KnowItAll (Preliminary Results). In: Proceedings of the 13th International WWW Conference, New York, USA, pp. 100-111, (2004)

15. Ferreira, J.: A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora. World Trade, 369-381, (1999)

16. Turney, P.D.: Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In: Proceedings of the Twelfth European Conference on Machine Learning, Freiburg, Germany, pp. 491-499, (2001)

17. Downey, D., Broadhead, M., Etzioni, O.: Locating complex named entities in Web text. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 2733-2739, (2007)

18. Lemaire, B., Denhière, G.: Effects of High-Order Co-occurrences on Word Semantic Similarities. Current Psychology Letters - Behaviour, Brain and Cognition, 18(1), (2006)

19. Sanchez, D.: Domain ontology learning from the web. Knowledge Engineering Review, 24(4), 413, (2009)

20. Stevenson, M., Greenwood, M. A.: Comparing information extraction pattern models. In: Proceedings of the Information Extraction Beyond The Document Workshop, COLING/ACL, (2006)

21. Klein, D., Manning, C. D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 423-430, (2003)

22. Alfonseca, E., Manandhar, S.: An unsupervised method for general named entity recognition and automated concept discovery. In: Proc. of the First International Conference on General WordNet, India, (2002)

23. Godard, D.: Compositionnalité : questions linguistiques. In: Godard, D., Roussarie, L., Corblin, F. (eds.), Sémanticlopédie : dictionnaire de sémantique. GDR Sémantique et Modélisation, CNRS, (2006)

24. Ben-Mustapha, N., Aufaure, M.-A., Baazaoui-Zghal, H., Ben-Ghzala, H.: Contextual ontology module learning from web snippets and past user queries. In: Proceedings of the 15th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES'11), Part II, pp. 538-547, (2011)

25. Ben-Mustapha, N., Aufaure, M.-A., Baazaoui-Zghal, H., Ben-Ghzala, H.: Modular Ontological Warehouse for Adaptative Information Search. In: Proceedings of the 2nd International Conference on Model and Data Engineering (MEDI'2012), (2012)

26. Ehrig, H., Ehrig, K., Prange, U., Taentzer, G.: Fundamental Theory for Typed Attributed Graphs and Graph Transformation based on Adhesive HLR Categories. Fundamenta Informaticae, 74(1):31-61, IOS Press, (2006)

27. Miller, G. A., Beckwith, R., Fellbaum, C. D., Gross, D., Miller, K.: WordNet: An online lexical database. International Journal of Lexicography, pp. 235-244, (1990)

28. Landauer, T., Dumais, S.: A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240, (1997)

29. Meadow, C., Boyce, B., Kraft, D.: Text Information Retrieval Systems, Second ed. Academic Press, (2000)