Faculty of Letters and Philosophy
Master's Degree Course in Corporate and Public Communication
Degree Thesis in
Computer Science for Electronic Commerce
Manually vs semiautomatic domain specific ontology building
Supervisor: Prof. Ernesto D'Avanzo
Candidate: Antonio Lieto (matr. 0320400079)
Co-supervisor: Prof. Tsvi Kuflik
Academic year 2007-2008
Acknowledgements
There are many people I have to thank for the realization of this work. First of all I want to thank my advisor, Prof. Ernesto D'Avanzo. He was always available to me during this year of thesis research, giving me crucial advice for the development of this work (regarding both the theoretical and the experimental part). Another special person I wish to thank is Prof. Tsvi Kuflik. His comments and considerations about my work were very important for its improvement. If any errors or mistakes remain, I am the only one to blame: their guidance was excellent.
I also want to thank Dr. Brenda Schaffer for her suggestions on the manual ontology building, Professors Roberto Cordeschi and Annibale Elia for their constant interest in the whole project, which involved other Master's Degree students of the University of Salerno, and Prof. Marcello Frixione for his interesting seminar on the evolution of semantic networks.
Finally (last but not least) I have to thank my family, who have always been with me in happy and difficult moments.
This work is dedicated to my grandmother Maria who, unfortunately, is no longer among us.
Contents

Chapter 1. The problem and the research question
Chapter 2. Methods and Tools for the semi-automatic or automatic ontology generation
  2.1 Different approaches to the ontology generation
    2.1.1 Methods and techniques for the ontology generation from text
    2.1.2 Methods and techniques for the ontology generation from dictionaries
    2.1.3 Methods and techniques for the ontology generation from a knowledge base
    2.1.4 Methods and techniques for the ontology generation from semi-structured schemata
    2.1.5 Methods and techniques for the ontology generation from relational schemata
  2.2 Research projects and tools for the ontology generation
Chapter 3. The Energy Domain Case Study
  3.1 Energy Domain Modelling Process
    3.1.1 The modelling approach
    3.1.2 Information Sources and Tools
    3.1.3 Definition of Domain Concepts
    3.1.4 Horizontal Links
    3.1.5 Ontology Population
    3.1.6 Logical Expressions
    3.1.7 Racer Pro
  3.2 Manual approach: Energy Ontology Description
    3.2.1 Energy Domain
    3.2.2 Energy Sources
    3.2.3 The case of Hydrogen: Renewable or not Renewable?
    3.2.4 Country
    3.2.5 Energy Security Class
    3.2.6 Infrastructures
    3.2.7 Environmental Consequences
    3.2.8 Energy Use
  3.3 A Semi-automatic approach for ontology building
    3.3.1 Semi-Automatic Ontology Generation
Chapter 4. Evaluation and Experiments
  4.1 Pilot Study Experiments
  4.2 Precision and Recall for a quantitative evaluation
  4.3 New Experiments
Chapter 5. Discussions and Conclusions
  5.1 About the methodology
  5.2 Proposal: a linguistically motivated keyphrase extraction system for Ontogen
  5.3 Conclusions
Chapter 1. The problem and the research question
Ontologies have gained a lot of attention in recent years as tools for knowledge representation. However, in information and computer science there are many definitions of what an ontology is. Gruber (1993), for example, defines an ontology as "an explicit specification of a conceptualization" (where a conceptualization "is an abstract, simplified view of the world that we wish to represent for some purpose"), pointing out the relative simplicity of the represented knowledge in comparison with the complexity of the knowledge itself. Tim Berners-Lee (2001) gives a more concrete definition, describing an ontology as "a document or file that formally defines the relations among terms", underlining, in this way, the importance of the (formally defined) relational aspect between the elements composing the ontology. In the simplest terms, an ontology can be defined as a formal knowledge representation system (KRS) composed of three main elements: classes (also called concepts or topics), instances (individuals that belong to a class) and properties (which link classes and instances, allowing information about the represented world to be inserted into the ontology).
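These three building blocks can be illustrated with a small sketch in Python; all class, instance and property names below are invented for illustration and are not taken from the energy ontology discussed later:

```python
# A toy knowledge representation system with the three elements named above.

# Classes (concepts), each mapped to its superclass (None = top level)
classes = {
    "EnergySource": None,
    "RenewableSource": "EnergySource",
    "Country": None,
}

# Instances: individuals belonging to a class
instances = {
    "solar_power": "RenewableSource",
    "italy": "Country",
}

# Properties: (subject, property, object) triples describing the world
properties = [
    ("italy", "uses", "solar_power"),
]

def is_a(individual, cls):
    """True if the individual belongs to cls directly or via a superclass."""
    current = instances.get(individual)
    while current is not None:
        if current == cls:
            return True
        current = classes.get(current)
    return False
```

Real ontology languages such as OWL add much richer constructs (restrictions, axioms, disjointness), but the class/instance/property skeleton is the same.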
Obtaining a structured representation of information through ontologies is one of the main objectives in realizing the so-called Semantic Web1 (T.B. Lee et al., 2001). In the context of the Semantic Web, in fact, ontologies are expected to play an important role in helping automated processes to access information. In
1 According to Tim Berners-Lee (1999) the Semantic Web is an extension of the current web in which information is given a well-defined meaning. This, in his vision, should enable machines to "understand" the semantics of web resources and, therefore, to behave more "intelligently" in their search activities.
particular, ontologies are expected to provide structured vocabularies that explicate the relationships between different terms, allowing intelligent agents (and humans) to interpret their meaning. Another important aspect of the role of ontologies is linked to the issue of information overload. One of the problems of the current World Wide Web, in fact, is that a large part of the information returned to users in response to an explicit query is irrelevant. Through the implementation of ontologies within dedicated information systems (e.g. search engines) this problem can be reduced or, in the future, solved completely (at least in theory), because the ontology architecture (which is usually hierarchical) should be able to define the unique path that a query must follow to arrive at the web resources containing the desired information.
Since 2006, RDF2, RDF Schema3 and OWL4 have generally been considered the standard Semantic Web languages. In particular OWL, the language used for the concrete manual ontology building case study (which will be introduced in the following pages), is the most expressive of the three. This means that it increases the number (and the quality) of the inferences that software agents are able to make.
Ontology building is a very complicated activity for several reasons. First, it requires time-consuming work by experts. Second, the classification task is not as simple as it seems. Finally, the incredible speed at which knowledge develops in the real world constrains ontology engineers to continuously
2 See http://www.w3.org/RDF/
3 See http://www.w3.org/TR/rdf-schema/
4 See http://www.w3.org/OWL
update and enrich the generated ontologies with new concepts, terms and lexicon. In this way an ontology often becomes a "never ending work" which constantly requires manual effort and resources to be built and maintained. In recent years, tools and methods have been developed to try to solve, automatically or semi-automatically, the problems related to manual ontology building (an overview of such approaches, tools and techniques will be presented in the second chapter). The research question of this work is the following: is it possible, with the current tools and methods, to substitute (fully or even partially) the human activity in a complex task such as ontology building? We will try to answer this question through experimental results from a concrete case study in which a manually built domain-specific ontology has been compared with a semi-automatically built one.
The objective of this work is to present a concrete case study evaluating the manual approach to ontology building compared with the semi-automatic one. This thesis work was developed in three phases: in the first, a manual energy domain ontology was created. Then a part of this ontology was semi-automatically generated using the software Ontogen. Finally, the two approaches to ontology building were compared through a quantitative evaluation based on precision and recall measures. This work is structured as follows: chapter 2 examines some methods and tools used for semi-automatic or automatic ontology generation, chapter 3 is dedicated to the description of the energy domain case study and chapter 4 to the experiments and the evaluation phase. The last chapter is dedicated to the discussion of the results and to the conclusions.
Chapter 2: Methods and Tools for the semi-automatic or
automatic ontology generation
Manual ontology building is a time-consuming activity that requires a lot of effort for domain knowledge acquisition and domain knowledge modelling. In order to overcome these problems many methods have been developed, including systems and tools that, using text mining and machine learning techniques, allow ontologies to be generated automatically or semi-automatically. The research field that studies these issues is usually called "ontology generation", "ontology extraction" or "ontology learning" (Maedche et al., 2001). It studies the methods and techniques used to:
• construct an ontology ex novo, automatically or semi-automatically;
• enrich or adapt an existing ontology using different sources.
The ontology learning process is useful for different reasons: first of all to accelerate the process of knowledge acquisition, second to reduce the time needed to update an existing ontology, and finally to accelerate the whole process of ontology building. Buitelaar (2005) proposes an incremental stratification of this process (see figure 2.1):
Figure 2.1. Stratification of the ontology learning process in (Buitelaar 2005)
Starting from the lowest level, each phase can be considered as input to the next one. The phases identified are the following:
1. Extraction of relevant terms and their synonyms from a textual corpus for a target domain;
2. Identification of concepts (the third step of the scale proposed by Buitelaar);
3. Derivation of a hierarchy of the previously identified concepts;
4. Identification of non-taxonomic relations between the concepts;
5. Adjustment of the ontology with new instances, concepts and properties (ontology population);
6. Discovery of new rules and axiomatic relations between concepts and properties.
Most of the approaches iterate these steps, both to integrate user feedback into the process of ontology generation and to re-use the newly acquired knowledge as a knowledge base for future iterations.
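As an illustration, the lowest layer of this process (term extraction) can be sketched with a deliberately naive frequency count; the toy corpus and stopword list below are invented:

```python
import re
from collections import Counter

def extract_terms(corpus, stopwords, top_n=5):
    """Naive layer-1 term extraction: the most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", corpus.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [term for term, _ in counts.most_common(top_n)]

corpus = ("Renewable energy sources such as solar energy and wind energy "
          "reduce dependence on fossil fuel imports.")
stopwords = {"such", "as", "and", "on", "the"}
terms = extract_terms(corpus, stopwords, top_n=3)  # "energy" ranks first
```

Real term-extraction systems use part-of-speech filters and statistical termhood measures rather than raw frequency, but the input/output shape of the layer is the same.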
2.1 Different approaches to the ontology generation
Maedche and Staab (2001) proposed a classification of the systems used for automatic or semi-automatic ontology building. It is based on the type of input that the systems use to initiate the process of ontology generation. The authors distinguish between ontology generation from text, from dictionaries, from a knowledge base, from semi-structured schemata and from relational schemata. The most widely used approaches to ontology extraction from text, as reported in Perez (2004), are the following:
Pattern-based extraction (Hearst 1992): this approach usually uses heuristic methods that examine the text for distinctive lexico-syntactic patterns. A relation is recognized and extracted if a sequence of words within the text matches a pattern. The basic idea of this approach is very simple: to define a regular pattern that captures the expressions present in the text and maps the results of the matching into a semantic structure, such as a taxonomy of relations among concepts.
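The idea can be sketched with a single regular expression for the classic "X such as Y" construction (the example sentences are invented):

```python
import re

# One lexico-syntactic pattern: "<hypernym> such as <hyponym>"
PATTERN = re.compile(r"(\w+) such as (\w+)")

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    return [(m.group(2), m.group(1)) for m in PATTERN.finditer(text)]

pairs = extract_hyponyms(
    "Fuels such as coal are burnt in power plants. "
    "Poets such as Shakespeare wrote sonnets."
)
```

Hearst's original patterns are more elaborate (they cover noun-phrase lists and several constructions such as "including" and "especially"), but this captures the matching mechanism.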
Association rules: initially defined to extract information from databases in the data mining field (Agrawal et al., 1993), association rules have been used in (Maedche, 2001) to discover non-taxonomic relations between concepts, using a concept hierarchy as knowledge base.
Conceptual clustering: in this approach concepts are grouped according to their semantic similarity in order to build hierarchies. The semantic similarity can be calculated with different methods; for example, it may be calculated according to the distributional approach: the smaller the distance between the linguistic distributions of two words, the more similar the concepts (see Faure et al. 2000).
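A minimal sketch of the distributional approach, assuming simple fixed-window co-occurrence counts and cosine similarity (the toy corpus is invented):

```python
import math
from collections import Counter

def context_vector(tokens, target, window=2):
    """Count the words co-occurring with `target` within +/- `window` tokens."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tokens = ("coal is burnt in plants and oil is burnt in engines "
          "while solar is collected by panels").split()
sim_coal_oil = cosine(context_vector(tokens, "coal"), context_vector(tokens, "oil"))
sim_coal_solar = cosine(context_vector(tokens, "coal"), context_vector(tokens, "solar"))
# "coal" and "oil" share more context words than "coal" and "solar"
```

The resulting similarities would then feed an agglomerative clustering step to build the hierarchy.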
Ontology pruning: the aim of this approach is to build a domain ontology based on various sources (Kietz et al., 2000). It includes the following steps: first, a generic core ontology is used as the basic structure for a domain-specific ontology. Then, a dictionary containing important domain terms is used for domain concept acquisition, and these concepts are classified into the generic core ontology. Finally, domain-specific and general text corpora are used to remove non-domain-specific concepts, following the heuristic that domain-specific concepts should be more frequent in a domain-specific corpus than in a generic one.
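The pruning heuristic can be sketched as a relative-frequency ratio test; the corpora, the smoothing and the threshold value are invented for illustration:

```python
from collections import Counter

def prune_generic(candidates, domain_tokens, generic_tokens, threshold=2.0):
    """Keep a candidate only if its relative frequency in the domain corpus
    is at least `threshold` times its (smoothed) relative frequency in the
    generic corpus."""
    dom, gen = Counter(domain_tokens), Counter(generic_tokens)
    kept = []
    for term in candidates:
        dom_rf = dom[term] / len(domain_tokens)
        gen_rf = (gen[term] + 1) / (len(generic_tokens) + 1)  # add-one smoothing
        if dom_rf / gen_rf >= threshold:
            kept.append(term)
    return kept

domain = "pipeline gas pipeline oil gas supply pipeline".split()
generic = "the weather today is good and the news is on tv".split()
kept = prune_generic(["pipeline", "gas", "the"], domain, generic)  # drops "the"
```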
Concept learning: with this approach a given taxonomy is incrementally enriched by acquiring new concepts from textual documents (see Hahn et al. 2000).
The above are the main approaches for ontology generation from text. However, as mentioned earlier, the Maedche and Staab classification identifies four other classes of sources used as input for ontology extraction. Among them, ontology generation from dictionaries is based on the use of machine-readable dictionaries to extract relevant concepts and the relations among them; the methods and tools used for this task are based on linguistic and semantic analysis, and WordNet is usually the dictionary employed. Ontology learning from a knowledge base aims at generating an ontology using an existing knowledge base as source. Ontology extraction from semi-structured data, instead, has as its objective the extraction of an ontology from sources with a pre-defined structure, such as XML Schemas. Finally, ontology learning from relational schemata aims to learn an ontology by extracting relevant concepts and relations from knowledge stored in databases. The following pages examine the different methods, tools and techniques available in the literature for automatic or semi-automatic ontology generation starting from these different data sources.
2.1.1 Methods and techniques for the ontology generation from text
Since the late fifties many approaches have been proposed to extract terms, concepts and relationships from text, but since the mid-nineties these efforts have gained new momentum thanks to the availability of more sophisticated statistical and NLP techniques. These techniques are now used in many approaches to ontology generation (Soergel et al., 2005).
Reinberger's method (Reinberger et al., 2004) supports ontology extraction from text. Its aim is to create an initial skeleton of an ontology, to be refined later by analysts. It is based on a three-step process. First, a textual domain corpus is parsed with a shallow parser; then noun phrases and their relations with other verb phrases or noun phrases are listed. In the third step, clustering techniques group terms that share similar relations into the same classes and the ontology is created. Khan and Luo's method (Khan and Luo, 2002) aims to build a domain ontology starting from text documents, using clustering techniques and the WordNet ontology. The hierarchy is created by grouping documents with similar content (the documents are provided by the user) within the same cluster and then arranging the clusters into a hierarchy using an algorithm called SOTA. After building the hierarchy of clusters, a concept (or topic) is assigned to each cluster with a bottom-up concept assignment mechanism (the assignment starts from the leaf nodes of the hierarchy). Then the assigned topic is associated with the appropriate concept in WordNet. Finally, the concepts of the internal nodes are assigned using the descendant nodes and their hypernyms in WordNet.
Another method for building ontologies, which takes into account linguistic techniques coming from differential semantics, was proposed by Bachimont (2002). In this method the construction of the ontology follows a three-step process. First, there is a semantic normalization, where the user chooses the relevant terms of a domain and normalizes their meaning, expressing the similarities and differences of each notion with respect to its neighbours. These terms are then placed into a hierarchy (and the user has to justify her/his decisions). In the second step there is a knowledge formalization phase in which, using the taxonomy obtained in the first step, the various terms are disambiguated so that a domain expert can carry out a formalization of the knowledge. The third step is the operationalization: the created taxonomy is transcribed into a specific knowledge representation language.
Nobecourt (Nobecourt, 2000) presents an approach to build domain ontologies from text using text mining techniques and a corpus. This method is based on two activities: modelling and representation. The modelling activity is based on the extraction of relevant domain terms (the "conceptual primitives") from a corpus. After this operation, domain experts look for relevant domain terms in the list and identify the main sub-domains of the ontology. These terms are modelled as concepts and constitute the first skeleton of the ontology. Later, these concepts are described in natural language, constituting, in this way, a new source of documents (a new corpus) from which a new list of primitives can be extracted, in an iterative process used to gradually refine the skeleton of the ontology. The representation activity consists of the translation of the modelling schemata into an implementation language. This method is technologically supported by the platform TERMINAE (Biebow et al., 1999), which will be discussed further later.
Kietz et al. (2000) proposed a generic method to discover a domain ontology from given heterogeneous sources using natural language analysis techniques. It is a semi-automatic approach to ontology building, in the sense that the user takes an active part in the process. The authors propose to extract ontologies starting from a core ontology (e.g. SENSUS, WordNet, etc.) and enriching it with new domain-specific concepts. The user has to specify which documents should be used to refine the core ontology. New concepts are identified by applying NL techniques to the suggested documents. The resulting enriched core ontology is then pruned and focused on a specific domain through the removal of general concepts, using several statistical approaches. Finally, relations between concepts are learnt by applying learning methods and are added to the resulting ontology. This process is cyclic, because the resulting ontology can be refined by applying the method iteratively.
Aussenac-Gilles et al. (2000) suggest a method that allows the creation of a domain model using NLP tools and linguistic techniques for the analysis of corpora. This method uses text as a starting point, but may also use other existing ontologies or terminological resources to build the ontology. It performs ontology learning at three levels: the linguistic level, the normalization level and the formal level. The first consists of the extraction of terms and lexical resources from text. These elements are then clustered and converted into concepts and semantic relations at the normalization level. Finally, concepts and relations are formalized by means of a formal language. The process is composed of four phases. The first two are referred to as the linguistic level and include the corpus constitution (the authors point out the importance of a domain expert's aid in the selection of the domain-specific corpus) and the linguistic study, which focuses on the selection of adequate linguistic tools for the analysis of the text. This phase yields domain terms, lexical relations and a set of synonyms. The third phase is the normalization. The result of this phase is a conceptual model expressed in the form of a semantic network. It is divided into two sub-phases: a linguistic phase and a conceptual one. During the linguistic phase the ontology engineer chooses the terms and the lexical relations (e.g. hyponyms) that have to be modelled; then s/he adds a natural language definition for these terms, considering the senses that they have in the text and defining, for each sense, some identifying labels. If there are several meanings, the most relevant for the domain are kept. During the conceptual phase, concepts and semantic relations are defined in a normalized form using the labels of concepts and relations. The last phase is the formalization, which includes ontology validation and implementation. The evaluation of the knowledge learnt is made by the user and by a domain expert. Once the ontology has been evaluated it can be implemented (following this approach, for example, an ontology about fiberglass manufacturing was built and implemented for a private company; see Aussenac-Gilles et al. 2003).
Hearst's (1992) method aims to automatically acquire lexical hyponymy relations from corpora in order to build a general-domain thesaurus, using WordNet to verify its performance. The process exploits a set of predefined, easily recognizable lexico-syntactic patterns. The method aims at discovering such patterns, and the author suggests that other lexical relations will be acquirable in the same way. All of them will be used to build the thesaurus. The method proposes the following five-step procedure to automatically discover new patterns:
• Decide on a lexical relation of interest.
• Gather a list of terms for which this relation is known to hold (this step can be done automatically using this method).
• Find documents in the corpus where these expressions occur syntactically near each other, and record the whole environment (the environment is defined by the linguistic space where the defined expressions appear: e.g. in the simplest case "Poets such as Shakespeare" is the linguistic environment of the pattern "such as", used to indicate a hyponymy relation between "poet" and "Shakespeare"; in more complicated cases the environment can consist of sentences or full periods).
• Find commonalities among these environments and hypothesize that they yield patterns indicating the relation of interest.
• Once a new pattern has been identified, use it to gather more instances of the target relation and restart the process from step 2.
To validate this acquisition method, the author proposes a comparison with the information found in WordNet. For example, if two terms presented in the thesaurus as being in a hyponymy relation are also linked hierarchically in WordNet, then the thesaurus entry is verified.
Alfonseca and Manandhar (2002) proposed a method based on the distributional semantics hypothesis (Harris, 1971), which states that "the meaning of a word is highly correlated to the linguistic context in which it appears". From this point of view, the context of a certain concept can be encoded as a vector of context words containing the words that co-occur with that concept and their frequencies (a topic signature). These topic signatures are then clustered, and a distance measure such as TFIDF or chi-square is used to separate the different word senses. The obtained signatures can be compared with the topic signatures of an existing ontology (e.g. WordNet), identifying, in this manner, the hypernym candidates. To do this, a top-down classification algorithm is used.
2.1.2 Methods and techniques for the ontology generation from dictionaries
2.1.2.1 Jannink and Wiederhold’s approach
This approach (Jannink and Wiederhold, 1999) aims to convert dictionary data into a graph structure to support the generation of a domain or task ontology. It uses an algebraic extraction technique to generate the graph structure and to create thesaurus entries for all the words defined in the graph. According to its purpose, only headwords and definitions having many-to-many relations are considered. This results in a directed graph with two properties: each headword and its definition are grouped in a node, and each word in a definition node is an arc to the node having that word as headword. The basic hypothesis of this approach is that the structural relationships between terms are relevant to their meaning. They are extracted in a three-step process in which a statistical approach and the PageRank algorithm are used to produce, as output, a set of terms related by the strength of the association of the arcs between them.
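A plain power-iteration PageRank over such a headword graph can be sketched as follows; the toy dictionary graph is invented, and this is not the authors' actual extraction algebra:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {node: [outgoing arcs]}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Toy headword graph: an arc means the headword's definition uses that word
graph = {
    "energy": ["fuel", "power"],
    "fuel": ["energy"],
    "power": ["energy"],
    "turbine": ["power", "energy"],
}
ranks = pagerank(graph)  # "energy" ends up with the highest rank
```

Words whose definitions are heavily referenced by other definitions (like "energy" here) accumulate rank, signalling centrally important terms.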
2.1.2.2 Rigau’s method
This method (Rigau et al., 1998) consists in the extraction of lexical ontologies from dictionaries. Its main goal is to semi-automatically develop versions of WordNet for the Spanish language. It is based on two procedures: the analysis of dictionary definitions and the word sense disambiguation of genus words. In a first phase, each definition in a monolingual dictionary is analyzed to find a hypernym of the word being defined (also called the genus word). Then a word sense disambiguation procedure is applied to the genus word to discover which of its meanings is used. This method was developed as part of the EuroWordNet project, which aimed at developing lexical ontologies for several European languages.
2.1.3 Methods and techniques for the ontology generation from a knowledge base
2.1.3.1 Suryanto and Compton's approach
This approach (Suryanto and Compton, 2001) aims at generating an ontology from a knowledge base of rules. The authors propose an algorithm to extract a taxonomy of classes, where a class is a set of different rule paths that arrive at the same conclusion, and a rule path for a node n consists of all the conditions from all predecessor rules plus the new conditions of the particular rule of node n. The approach takes the initial trees and creates a set of classes, trying to discover relations among them. Three types of relations are considered: subsumption, mutual exclusivity and similarity. The central idea of this approach is to group all the rules in each class and calculate a quantitative measure for each relation between each couple of classes. This quantitative measure provides the confidence with which the relation is believed to exist (for example, class A subsumes class B, with a certain confidence measure, if class A only exists when class B exists but not the other way around). With the set of classes and relations created, the class taxonomy is built up. The whole process is evaluated by an expert.
2.1.4 Methods and techniques for the ontology generation from semi-structured schemata
2.1.4.1 Papatheodoru and colleagues' method
This method (Papatheodoru, 2002) aims to build taxonomies from domain repositories written in XML or RDF, using a data mining approach called cluster mining. The cluster mining approach first tries to group similar metadata into clusters and then, by processing these clusters, extracts a controlled vocabulary used to build the taxonomy (Perez 2004). The steps proposed by this method are:
1. Data collection and pre-processing, where the main objective is to select the appropriate keywords from the metadata files. This makes it possible to discover similarities between the documents; for that purpose, words such as articles or prepositions are dropped.
2. Pattern discovery: in this step the cluster mining approach is used to discover and build clusters of similar documents and to extract representative keywords from the documents' content.
3. Pattern post-processing and evaluation: the keywords extracted in the previous step are examined and measured with a statistical approach, and the best keywords (the most representative of the content of the clusters) are selected. These keywords provide the vocabulary necessary to form the concepts of the taxonomy.
2.1.4.2 Deitel and colleagues' approach
Deitel et al. (2001) present an approach for learning ontologies from RDF annotations of web resources. It focuses on learning new domain concepts from the whole RDF graph, enriching, in this way, the ontology to which the RDF annotations belong. To extract the description of a resource from the graph, this approach follows a criterion called the description of length n of a resource, which is "the largest connected subgraph in the whole RDF graph containing all possible paths of length smaller than or equal to n starting from or ending at the considered resource". The proposed steps for building a hierarchy based on resource descriptions are the following:
1. Extract the resource descriptions of length one and repeat the process, incrementing the length until the maximum path in the graph is covered.
2. Extract the resource descriptions of length one from the whole RDF graph (these descriptions form a set of RDF triples: resources, properties and values).
3. Iteratively generalize all possible pairs of triples: the generalization of two triples is the most specific triple subsuming them.
4. Construct the intensions5 of length one: the triples sharing a same extension are grouped together.
5. Build the generalization hierarchy based on the inclusion relations between the node extensions.
6. Repeat the process, incrementing the length of the resource descriptions to be extracted.
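The "description of length n" criterion can be sketched as an expanding collection of triples around a resource; the triples below are invented, and this simplified version ignores the generalization steps:

```python
def description(triples, resource, n):
    """Collect the triples lying on paths of length <= n that start from
    or end at `resource` (generalization steps are ignored here)."""
    frontier, reached, collected = {resource}, {resource}, set()
    for _ in range(n):
        new_frontier = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                collected.add((s, p, o))
                for node in (s, o):
                    if node not in reached:
                        new_frontier.add(node)
                        reached.add(node)
        frontier = new_frontier
    return collected

triples = [
    ("doc1", "about", "solar"),
    ("solar", "type", "RenewableSource"),
    ("RenewableSource", "subClassOf", "EnergySource"),
]
d1 = description(triples, "doc1", 1)  # only the triple touching doc1
d2 = description(triples, "doc1", 2)  # one more step into the graph
```

Increasing n widens the neighbourhood of the resource that is considered part of its description, which is what the iterative steps above exploit.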
2.1.5 Methods and techniques for the ontology generation from relational schemata
2.1.5.1 Kashyap's method
Kashyap (1999) uses database schemas to build an initial ontology, which is then refined through a collection of queries made by users. The process is interactive, because an expert is involved in deciding which classes and properties are important for the domain ontology, and iterative, because it is repeated as many times as necessary. The process has two phases. In the first, the database schemas are analyzed in detail; at the end of this phase a new database schema is created and, through reverse engineering techniques, its content is mapped into an ontology. In the second phase, the ontology built from the database schemas is refined by means of user queries, which allow attributes to be added or deleted, new entities to be created, and so on.
5 An intension may include redundant triples, one being more general than another. It is cleaned up by deleting triples subsuming another one (Perez, 2004).
2.1.5.2 Rubin and colleagues' approach
The method of Rubin et al. (2002) has as its objective the automation of the process of creating instances and their values using data extracted from external relational sources. This method uses an XML Schema as the interface between the ontology and the data sources. The process allows the links between the ontology and the data acquisition to be updated automatically when the ontology changes. This approach needs the following components: an ontology (with domain classes and relations among them), an XML Schema (the interface between the ontology and the data acquisition), and an XML translator (to convert external incoming data into XML). The method is based on a four-step process:
1. An ontology model of a domain must be created (e.g. with Protegé).
2. An XML Schema must be generated from the ontology (once the ontology is built and the constraints on the properties are made explicit, the XML Schema is sufficiently determined and can be written directly from the ontology).
3. The data acquired from the external resources must be put into an XML document using the syntax specified in the XML Schema.
4. The ontology is updated and the changes are propagated.
2.1.5.3 Stojanovic and colleagues' approach
Stojanovic et al. (2002) try to build light ontologies from conceptual database
schemas using a mapping process. The ontology generation follows a five-step
process:
1. The information from a relational schema is captured through reverse
engineering. This process tries to preserve as much information as possible
from the database schema.
2. The information obtained is analyzed by applying a set of mapping rules in
order to build ontological entities. These rules specify the way in which
elements migrate from the database into the ontology. The rules are
applied in the following order: concept creation, then inheritance and relation
creation, so as to build the ontology incrementally.
3. The ontology is created through the application of the rules mentioned in the
previous step.
4. The ontology is evaluated and refined.
5. The ontological instances are created on the basis of the tuples of the
relational database.
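The rule ordering of step 2 (concepts first, then inheritance, then relations) can be sketched on a toy relational schema. The table names and the specific rule heuristics below are illustrative assumptions, not the paper's exact rule set.

```python
# A toy relational schema: table name -> columns and foreign keys.
# Names are invented for illustration.
schema = {
    "source":     {"columns": ["id", "name"], "fks": {}},
    "renewable":  {"columns": ["id"], "fks": {"id": "source"}},  # PK that is also an FK
    "production": {"columns": ["id", "source_id", "country"], "fks": {"source_id": "source"}},
}

def map_schema(schema):
    ontology = {"concepts": [], "is_a": [], "relations": []}
    # Rule 1 (concept creation): every table becomes a concept.
    for table in schema:
        ontology["concepts"].append(table)
    # Rule 2 (inheritance): a table whose primary key is also a
    # foreign key is read as a subconcept of the referenced table.
    for table, info in schema.items():
        if "id" in info["fks"]:
            ontology["is_a"].append((table, info["fks"]["id"]))
    # Rule 3 (relation creation): remaining foreign keys become
    # relations between concepts.
    for table, info in schema.items():
        for col, target in info["fks"].items():
            if col != "id":
                ontology["relations"].append((table, col, target))
    return ontology

print(map_schema(schema))
```

Step 5 would then read the tables' tuples and instantiate the concepts produced here.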
2.5 Research Projects and tools for the ontology generation
Automatic and semi-automatic generation of ontologies from document corpora or
from other types of data collections is nowadays one of the research challenges of
the Semantic Web. Many research projects have been developed and many
prototypes and tools have been created for this task. The following section
overviews the major tools.
Mo’K Workbench
Mo’K Workbench (Bisson, 2000) is a tool that semi-automatically creates ontologies
from a textual corpus using different conceptual clustering techniques. It does not
need previous semantic knowledge (e.g. an existing ontology) and, by applying
NLP techniques, it extracts sets of triples from the document corpus. Each triple
is formed by a verb, a word and the syntactic role of that word within the
sentence. Mo’K then calculates the number of occurrences of each triple,
removing from the list the triples with too many or too few occurrences.
Finally it calculates the semantic distances between the triples to form
conceptual clusters.
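The frequency-based pruning step can be sketched as follows; the triples and the thresholds are invented for illustration and do not reproduce Mo'K's actual parameters.

```python
from collections import Counter

# Hypothetical (verb, word, syntactic_role) triples, as Mo'K would
# extract them with NLP from a corpus.
triples = [
    ("supply", "gas", "object"), ("supply", "gas", "object"),
    ("supply", "oil", "object"), ("supply", "oil", "object"),
    ("export", "oil", "object"),
    ("be", "thing", "subject"),
] + [("have", "part", "object")] * 50   # an overly frequent, uninformative triple

def filter_triples(triples, min_count=2, max_count=10):
    """Keep triples whose frequency lies between the two thresholds,
    discarding both rare noise and overly common, uninformative patterns."""
    counts = Counter(triples)
    return {t: n for t, n in counts.items() if min_count <= n <= max_count}

kept = filter_triples(triples)
print(kept)  # the singleton and the 50-occurrence triples are dropped
```

The surviving triples would then be compared by a semantic distance measure to form the conceptual clusters.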
Text to Onto
This system (Maedche and Volz, 2000; Maedche and Staab, 2004) integrates the
KAON environment, an open-source ontology management infrastructure, with a
tool suite for building ontologies from an initial core ontology. It combines
knowledge acquisition and machine learning techniques to discover conceptual
structures. Terms are extracted according to their occurrence frequency and
distribution criteria. Semantic and hierarchical relations are extracted with
association rules or linguistic patterns, and the relationships are weighted according
to support and confidence criteria. The result of Text to Onto is a domain ontology.
The whole process is supervised by an ontologist.
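The support and confidence criteria used to weight candidate relations can be sketched as follows; the toy "documents" are invented, and this is the generic association-rule definition rather than Text to Onto's exact implementation.

```python
def support_confidence(transactions, a, b):
    """Support and confidence of the rule a -> b over term co-occurrence
    'transactions' (e.g. the terms appearing together in one document)."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n                       # how often a and b co-occur overall
    confidence = n_ab / n_a if n_a else 0.0  # how often b follows from a
    return support, confidence

docs = [{"oil", "pipeline"}, {"oil", "pipeline", "export"},
        {"oil", "price"}, {"gas", "pipeline"}]
print(support_confidence(docs, "oil", "pipeline"))  # support 0.5, confidence 2/3
```

Candidate relations with support and confidence above chosen thresholds would be proposed to the supervising ontologist.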
Text Storm and Clouds
This system (Pereira, 1998) has been developed for the semi-automatic construction
of a semantic network using text relevant to the target domain. It is composed of two
modules, TextStorm (Oliveira et al. 2001) and Clouds (Pereira et al. 2000), that
perform complementary activities. TextStorm is an NLP tool that extracts binary
predicates from text using syntactic and discourse knowledge. The predicates on
which the tool focuses are those that relate two concepts in a sentence. The
process works as follows: target domain text is provided to the system and is tagged
using WordNet to find all the parts of speech to which a word may belong. The text is
then parsed using an augmented grammar to obtain a lexical classification of the
words during the parsing process. Finally, TextStorm creates a list of extracted
terms that becomes the input for the Clouds tool. Clouds is responsible for the
construction of the semantic network in an interactive way. Using the list
of binary predicates extracted from the text, Clouds builds a hierarchical tree of
concepts, learning some particulars of the domain using two techniques: a best-
current-hypothesis-based algorithm (to learn the categories of the arguments of each
relation) and an Inductive Logic Programming-based algorithm (to learn the recurrent
context of each relation).
OntoLT
OntoLT is a Protégé plugin developed by Buitelaar et al. (2004) that automatically
extracts concepts (classes in Protégé) and relations (properties in Protégé)
from a collection of linguistically annotated texts. To do so, OntoLT uses
rules that map the linguistic entities within the text to the
classes/properties in Protégé. Using this tool requires a collection of
texts automatically annotated with linguistic information provided in XML format.
The annotation includes part-of-speech tags, morphological analysis, sentence
analysis and predicate-argument analysis. The annotation allows the automatic
extraction of linguistic entities that can be used to build an ontology of concepts,
subconcepts and relations of a specific domain. The mapping rules are defined through XPath
expressions (used to extract the requested elements or attributes from an XML
document). A mapping rule defines how to “map” linguistic entities within a corpus
(a collection of documents in XML) to classes and slots of Protégé. These rules
are implemented through the use of preconditions (XPath expressions). If all the
preconditions are satisfied, a set of linguistic units is generated and one or
more operators are activated to define in which way each of them will be
“mapped” into the corresponding class/slot of Protégé. OntoLT includes statistical
analysis functions for the construction of syntactic rules aimed at identifying
the linguistic units relevant to the target domain. For each linguistic
entity a rating is computed, and the linguistic entities more specific to
the domain corpus receive the higher ratings.
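A minimal sketch of an XPath-based mapping rule in this spirit is shown below. The annotation format, tag names and rule encoding are invented for illustration; they are not OntoLT's actual annotation schema or rule language.

```python
import xml.etree.ElementTree as ET

# A toy linguistically annotated document (invented tag names).
doc = ET.fromstring("""
<corpus>
  <sentence>
    <token pos="NN" lemma="energy"/>
    <token pos="NN" lemma="security"/>
    <token pos="VB" lemma="require"/>
  </sentence>
</corpus>""")

# A mapping rule: a precondition (an XPath selecting noun tokens) plus
# an operator saying how each match becomes a Protégé class proposal.
rule = {
    "precondition": ".//token[@pos='NN']",
    "operator": lambda tok: ("CreateClass", tok.get("lemma").capitalize()),
}

# If the precondition matches, the operator is applied to every hit.
proposals = [rule["operator"](tok) for tok in doc.findall(rule["precondition"])]
print(proposals)  # [('CreateClass', 'Energy'), ('CreateClass', 'Security')]
```

Statistical relevance scores over the corpus would then rank which of these proposals are specific enough to the domain to keep.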
CORPORUM Ontobuilder
Corporum Ontobuilder (http://ontoserver.cognit.no) extracts ontologies and
taxonomies from natural language texts. The tool uses many linguistic techniques
that drive the analysis and the information extraction. It extracts information from
structured and unstructured documents using Ontowrapper (Engels, 2002) and
Ontoextract (Engels, 2001). Ontowrapper extracts information from texts, while
OntoExtract obtains taxonomies from natural language texts in RDF format and is
also able to refine existing concept taxonomies.
JATKE
JATKE (available open source at http://jatke.opendfki.de/cgi-bin/trac.cgi) is a
Protégé plug-in that offers a unified platform for ontology construction. The
platform has a high degree of flexibility and allows a variety of approaches
to be combined, and even new ones to be created. In fact, it allows the combination
of pre-existing modules, guaranteeing a personalized set-up for each scenario. JATKE is
composed of three different modules (Information/Source, Evidence, Proposal) and
its main characteristics are as follows:
• It allows the final user to define the mix of learning algorithms that
best fits the domain or the desired objective, allowing a personalized set-up
of the different modules;
• Its modular design allows new modules to be developed rapidly, reusing the
existing ones;
• It supports a semi-automatic ontology building process, generating
proposals for modifications of an existing ontology; the user decides
whether or not to accept each system proposal (each one has a certain
degree of confidence).
The communication between the different modules is possible thanks to an
internal ontology that represents the internal domain of the system, in which every
type of data or command is represented as an instance. Within the system
there is the button “Proposal” (the last button in figure 2.2) that
activates a learning process. The proposals are divided into three categories:
• Class: a proposal concerning a class (a concept), which can be created,
deleted, renamed or moved;
• Slot: a proposal referring to a relation or to a property;
• Instance: a proposal related to an instance.
All the proposals are analyzed by the user, who accepts or rejects them. If the user
accepts a proposal, the adjustment is communicated to all the modules
participating in the process and, in this way, the ontology is updated.
Figure 2.2. The slot proposal in JATKE
Ontogen
Ontogen (Fortuna et al. 2005) is a semi-automatic, data-driven topic ontology editor
which integrates machine learning and text mining algorithms. OntoGen’s main
features are the automatic extraction of keywords from the documents given as
input to the system (the extracted keywords are “candidate concepts” of the
ontology) and the generation of concept suggestions. This system will be
discussed in detail in the next chapter.
Ontobuilder
This tool (Modica et al. 2001) helps users in the ontology building process, using
as sources semi-structured data coded in XML or HTML. It has been designed to
work as a web browser (see figure 2.3). Once the URL of a web page is given and
the page has been downloaded by the system, it is possible to extract an ontology
from that web source. The system is composed of three main modules: the user
interaction module, the observer module and the ontology modelling module. The
process that builds the ontology has two phases: a training phase, in which
an initial domain ontology is built using the data provided by the user (e.g. the user
suggests browsing websites that contain relevant domain information), and an
adaptation phase, in which the obtained ontology is gradually refined (for each
new site a candidate ontology is extracted and merged with the existing one).
Figure 2.3. The browser interface of Ontobuilder
OntoLearn
OntoLearn (Velardi 2004) is an ontology learning system based on NLP and
machine learning techniques, organized in three phases. The first phase is term
extraction: relevant terms (both single words and multi-word terms such as “credit
card”) are extracted with NLP and statistical techniques. Specific and generic
corpora are used to prune the terminology that is not specific to the domain. The
domain documents are used as input and the system, after parsing, extracts a list
of “syntactically plausible” terms (e.g. Adj+N). Two entropy-based measures are
used to evaluate the importance of each term: Domain Consensus and Domain
Relevance. The first is used to select only the terms referred to across the
document corpus. The second is used to select the terms belonging to the domain
of interest, and is calculated taking as a baseline a set of terms from different
domains. Finally, the extracted terms are filtered using a lexical cohesion measure
that quantifies the degree of association of all the words in a string of terms. The
second phase consists of the semantic interpretation of terms, which is based on
the principle of compositional interpretation, according to which, for example, the
meaning of a compound term such as “business plan” is derived by associating
each single term with the correct identifier of an existing ontology (e.g. WordNet).
In this phase a word sense disambiguation task is executed, applying an algorithm
called SSI (Structural Semantic Interconnections) that is based on syntactic
pattern matching. SSI produces a semantic graph which includes the selected
senses and the semantic interconnections between them. The third phase is the
extension and refinement of the starting ontology. Once the terms have been
semantically interpreted, they are organized in sub-trees and inserted at the
appropriate node of the initial ontology. Furthermore, some nodes of the initial
ontology are pruned to create a specific view of the ontology. Finally, the new
ontology is converted into OWL format.
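A rough sketch of a Domain Relevance-style measure is given below: a term's normalized frequency in the target domain divided by its maximum normalized frequency over all contrastive domains. The exact formula and the frequency counts are assumptions for illustration, not OntoLearn's published definition.

```python
def domain_relevance(term, target, freq_by_domain):
    """Ratio of a term's relative frequency in the target domain to its
    maximum relative frequency across all domains (1.0 = most at home
    in the target domain). Frequencies are hypothetical."""
    totals = {d: sum(f.values()) for d, f in freq_by_domain.items()}
    p = {d: freq_by_domain[d].get(term, 0) / totals[d] for d in freq_by_domain}
    m = max(p.values())
    return p[target] / m if m else 0.0

# Invented term counts for a target "energy" corpus and a contrastive one.
freqs = {
    "energy":  {"energy security": 30, "credit card": 1, "the": 300},
    "finance": {"energy security": 1, "credit card": 40, "the": 310},
}
print(domain_relevance("energy security", "energy", freqs))  # → 1.0
```

A term like "credit card" scores low here, which is exactly how the contrastive corpora prune non-specific terminology.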
WeDaX tool
WeDaX (Snoussi 2002) was developed by a group of researchers at the University of
Montreal (Canada). Its aim is to extract information from web pages using an
ontology to model the data to be extracted. More specifically, the web page is
converted to XML format, then a mapping with an ontological model is done (the
model definition and the mapping are done manually by the user through a graphical
interface). An automatic process extracts the information, and the result is an XML
document containing a set of standardized data on which queries can be executed.
WebKB
WebKB (Craven 2000) has been developed at Carnegie Mellon University (USA),
and its objective is to automatically create a computer-understandable knowledge
base by extracting information from web documents. In particular, this system
extracts the information it needs for knowledge base creation from an initial set of
web pages and then searches new websites in order to automatically populate the
knowledge base with new assertions. The tool needs two inputs: an ontology that
specifies classes, and examples of web pages representing the classes or the
instances of that ontology, in order to map the ontology to the Web.
SOBA (SmartWeb Ontology-based Annotation)
SOBA is a component of the SmartWeb system developed at the University of
Karlsruhe (Germany). It automatically populates a knowledge base starting from
the information extracted from web pages about football. It is composed of a
web crawler, components for linguistic annotation and a final module for the
transformation of the linguistic annotations into an ontological representation (see
Cimiano 2006).
RelExt tool
RelExt (http://www2.dfki.de/web/) is a tool for extracting relationships from a
collection of texts used for ontology building. Usually, domain ontologies rarely
model verbs as concept relations. The basic assumption of this system is that the
role of verbs as elements connecting concepts is evident. RelExt is a system
able to automatically identify relevant triples (a pair of concepts connected
by a relation) among the concepts of an existing ontology. The system
works by extracting relevant verbs and their arguments from a collection of domain-
specific documents and calculating, through a combination of statistical and
linguistic techniques, relevant relations with the concepts.
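The core idea of counting verb-mediated links between known concepts can be sketched as follows. The sentences, concept list and the crude subject-verb-object heuristic are invented; RelExt's actual linguistic processing is far richer.

```python
from collections import Counter

# Hypothetical tokenized sentences and a set of known ontology concepts.
concepts = {"country", "oil", "pipeline", "gas"}
sentences = [
    ["country", "exports", "oil"],
    ["country", "exports", "gas"],
    ["pipeline", "transports", "gas"],
    ["country", "exports", "oil"],
]

def extract_relation_triples(sentences, concepts):
    """Count (subject concept, verb, object concept) triples: a rough
    stand-in for RelExt's statistically ranked relation candidates."""
    triples = Counter()
    for s in sentences:
        # Toy heuristic: three-word sentences whose first and last
        # tokens are known concepts contribute one verb triple.
        if len(s) == 3 and s[0] in concepts and s[2] in concepts:
            triples[(s[0], s[1], s[2])] += 1
    return triples

print(extract_relation_triples(sentences, concepts).most_common(1))
# [(('country', 'exports', 'oil'), 2)]
```

Frequent triples like (country, exports, oil) would be proposed as candidate relations between the existing ontology concepts.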
Doddle
This system (Yamaguchy 1999) aims to construct domain ontologies, in particular
a hierarchically structured set of domain terms without concept definitions, reusing
a machine-readable dictionary (MRD) and adjusting it for specific domains. Since
Doddle just generates hierarchically structured sets of domain terms, it supports the
user in the concept categorization task and in concept name suggestion. The tool
deals with concept drift (the senses of concepts change depending on the
application domain). For this purpose, two strategies have been followed (match
result analysis and trimmed result analysis). Both try to identify which parts of the
initial ontology may stay and which should be moved. In order to analyze the concept drift
between an MRD and a domain ontology, two main activities are involved. The
first is the building of an initial model from an MRD, extracting
information about the relevant terms in a given domain. The second concerns the
management of concept drift, adjusting the initial model to the specific
domain.
SVETLAN
Svetlan (Chaelendar and Grau, 2000) is a domain-independent tool that creates
clusters of the words appearing in a text. The aim of this tool is to build a
hierarchy of concepts. Its learning method is based on a distributional approach:
nouns playing the same syntactic role in sentences with the same verb are grouped
together in the same class. The learning process follows three steps: syntactic
analysis, aggregation and filtering. In the first step the tool retrieves sentences from
the original text (it only accepts French texts in natural language) in order to find the
verbs inside each sentence, since the basic assumption is that verbs allow nouns
to be categorized. The output of this step is a list of triplets of verbs, nouns and the
syntactic relations between them. The aggregation step creates clusters of nouns with
similar meanings (using conceptual clustering techniques), and the filtering step is
based on the weights of the nouns inside the classes (the clusters created) and on
the removal of the nouns that are not relevant inside the groups.
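The aggregation step (grouping nouns that share a verb and a syntactic role, with occurrence counts usable as weights for filtering) can be sketched as follows; the triplets are invented English stand-ins for Svetlan's French parser output.

```python
from collections import defaultdict

# Hypothetical (verb, noun, syntactic_role) triplets from parsed sentences.
triplets = [
    ("import", "oil", "object"), ("import", "gas", "object"),
    ("import", "coal", "object"), ("regulate", "market", "object"),
    ("import", "oil", "object"),
]

def aggregate(triplets):
    """Group nouns that play the same syntactic role with the same verb,
    weighting each noun by how often it occurs in the class."""
    classes = defaultdict(lambda: defaultdict(int))
    for verb, noun, role in triplets:
        classes[(verb, role)][noun] += 1
    return {k: dict(v) for k, v in classes.items()}

classes = aggregate(triplets)
print(classes[("import", "object")])  # {'oil': 2, 'gas': 1, 'coal': 1}
```

The filtering step would then drop low-weight nouns from each class, leaving the distributionally coherent groups.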
TERMINAE
This system (Biebow et al. 1999) integrates linguistic and knowledge engineering
tools. The linguistic tool allows terminological forms to be defined from the analysis of
term occurrences in a corpus: the ontologists analyze the uses of the terms in the
corpus to define their meanings. The knowledge engineering tool, instead, helps to
represent terminological forms as concepts. TERMINAE uses a method to build
concepts from the study of the corresponding terms in a corpus. First it establishes,
using an extractor tool, a list of candidate terms which are proposed to the
ontologists (who select a set of terms). Then the ontologists conceptualize the
terms and analyze their uses in the corpus to define all their meanings.
Finally, the ontologists give a definition in natural language for each meaning and
then translate it into an implementation language.
The development of this kind of systems, and their alignment with the manual
“gold standard” ontologies, represents one of the main challenges for the near
future of the Semantic Web. To achieve this objective, the evaluation of the
systems and of their results plays a crucial role. The next chapter presents the
results of an evaluation of a semi-automatically generated domain ontology
against a manually built one.
Chapter 3: The Energy Domain Case Study
In this chapter, a concrete case of comparison between manual and semi-
automatic ontology construction is presented. The main objective is to present a
comparative evaluation of these two different approaches. The questions we’ll try to
answer are: which approach is more useful in the ontology building task? Are these
approaches alternatives, or can they be integrated? What are the pros and cons
of each one? And, finally, is it realistic, in the near future, to think of a complete
automation or, at least, a semi-automation of the whole process of
ontology building? In order to answer these questions, a set of experiments was
performed, in which a manually built ontology is compared with a semi-
automatically generated one. The ontology compared is a domain ontology for
energy security. It represents our “case study” and allows us to formulate some
conclusions about the two approaches. The reason why we chose to focus our
attention and our research efforts on the energy domain is the importance that
energy issues have nowadays gained in the global institutional and economic
agenda. In this light, the realization (with both the manual and the semi-automatic
approach) and the evaluation of a dedicated ontology for this field represent, at the
same time, a big challenge for ontology engineers and a great opportunity for the
institutional and economic decision makers of the energy domain. In fact, the
implementation of such an ontology in a dedicated information system may
improve both the precision and recall6 of search results in information retrieval
tasks. This improvement may help decision makers in the energy field get the
relevant information they need for making their decisions (in this view the
ontology would be used to create a sort of “dedicated ontology-based decision
support system for the energy domain”). In the following paragraphs we will
explain the methodology used for the energy domain modelling (paragraph 3.1).
Then, a more detailed description of the manually built energy ontology will be
presented (paragraph 3.2). Ontogen, the tool used for the semi-automatic ontology
generation, will be presented next (paragraph 3.3), followed by a presentation of
the semi-automatic ontology. Finally, the experimental results of the manually
built ontology will be compared with the results obtained by the semi-automatic
ontology building process (paragraph 3.4).
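The precision and recall measures mentioned in this chapter can be sketched, for a single query, as follows; the document identifiers are invented.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query: `retrieved` is the set of
    documents a system returned, `relevant` the gold-standard relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # how much of what was returned is relevant
    recall = hits / len(relevant) if relevant else 0.0       # how much of the relevant set was found
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision_recall(retrieved, relevant))  # precision 0.5, recall 2/3
```

A better ontology behind the retrieval system should raise both numbers, which is the improvement hypothesized above for energy-domain decision makers.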
3.1 Energy Domain Modelling Process
3.1.1 The modelling approach
Poesio (2005) states that there are at least two different research traditions in the
domain modelling literature. One school of thought supports the thesis of the need
for more rigorous logical and philosophical foundations for domain modelling
formalisms. Its aim is both to establish a “Tarskian semantics” for the formalisms
used in the domain ontologies (leading to description logics) and to have cleaner
domain ontologies (where the expression “clean ontology” stands for “ontology
with a clear semantics and based on sound philosophical and scientific principles”).
Supporters of this line of research argue that ontologies built according to
formalist principles are very beneficial for many NLP (Natural Language
Processing) applications such as, for example, information extraction from a
database using natural language or, vice versa, information extraction from texts
and its addition to a database. The second school of thought, instead (which Poesio
defines as “cognitive”), argues that the best way to identify epistemological
primitives is to study concept formation and learning in humans. Accordingly, the
best approach to the construction of domain ontologies is the use of machine
learning techniques to automatically extract ontologies from language corpora. This
approach has its philosophical foundations in the work of the late Wittgenstein
(Philosophicae Investigationes, 1959) and especially in his implicit critique, made
explicit from the psychological perspective by the work of Eleanor Rosch (Rosch,
1973), of the classic Aristotelian theory of concepts7, through the introduction of the
concepts of “language games” and “family resemblance”. These two concepts, in
fact, indirectly dealt a mortal blow to the Aristotelian theory, bringing out that
humans, and their memory, are not able to classify language and concepts according
to what the theory stated, partly because of the complexity intrinsic to language
itself. The failure of this theory, and in general of the pure formal logic approach as
a key to explaining the way in which humans think and categorize (Thaggard, 1998),
has been used as an argument against the “formalistic approach” to domain
modelling (the first one mentioned above).
6 Precision and recall are two standard measures of the Information Retrieval field. In a few words they represent, respectively, how relevant the documents retrieved by a system after a query are (precision) and how many relevant documents a system is able to retrieve out of the total of the relevant documents (recall). These two measures will be presented in a wider and more detailed manner in the following pages.
7 According to this theory a concept can be defined exclusively by a finite set of necessary and sufficient conditions.
Regarding the specific case of energy modelling,
it is not easy to classify our work within one of these two categories (Formalism vs
Empiricism), because both a bottom-up and a top-down strategy were used for the
energy ontology building. To be clearer: mainly a “bottom-up” ontology
building strategy8 was used. It was bottom-up because the domain knowledge
acquisition, and then the process of ontology creation, started from corpora of
textual documents and not from abstract conceptual knowledge or conjectures made
a priori by the ontologists about the world of the domain to be represented (in this
sense we could say that we privileged an empirical and language-based point of view).
On the other hand, a “top-down” approach was applied through the guidance of an
energy domain expert (Dr. Brenda Schaffer of the University of Haifa). This help has
been crucial for our work and allowed us to modify the ontology and the research
directions in a more meaningful manner, allowing us to better focus on specific
relevant aspects. For example, after the reviews of Dr. Schaffer, we mainly focused
our attention on the class “energy security” (which is only one of the 54 classes of
the ontology) and on its implications regarding, for example, geopolitical,
economic and environmental problems (we will see the class description and its
ontological implications in the next paragraph). To summarize, for the manual energy
ontology building we mainly used a corpus-based strategy. However, at the same
time, we have been guided by a domain expert, hence also applying a top-down
strategy. So our approach can perhaps be considered a “middle up-down” one. In the
following paragraphs the whole process of manual energy ontology building will be
described, step by step.
8 The work of manual ontology building has been shared with other Master Degree students of the University of Salerno who also worked, with different objectives and perspectives, on the same project I worked on. In order to facilitate the circulation of information about the common relevant issues of the project we created a wiki, which revealed itself to be an excellent instrument for knowledge sharing. To learn more about our different thesis work projects see http://energyontologythesiswork.wikispaces.com.
3.1.2 Information Sources and Tools
The work has been based on information extracted and inferred from a document
database of about 200 documents about energy. The documents have been
recovered from the web, mainly (but not only) from the online resources of the
most important energy agencies and associations in the world. The strategy used for
the document recovery was the following: in a first phase, the documents selected
were those recovered by means of the results of search queries on search
engines about energy policies, energy sources and energy use. In this first phase we
did not have specific knowledge about the energy domain, and so even the search
queries were not precise. The first 30 documents of the database were recovered
following this strategy. After reading them we started to acquire an initial knowledge
of the domain, and that allowed us to better evaluate the relevance of the documents
found by the search queries. So the query strategies also changed and started to
become more precise. In this second phase, in fact, we started to formulate queries using
the first classes inserted into the ontology (e.g. energy security + resources
affordability or energy security + reliability of supply) in order to retrieve more
relevant and specific documents. Then yet another strategy was followed. We
started to browse a list of websites considered “authoritative sources” on
energy issues and started to do a sort of “human crawling” work, in the sense that,
starting from the first pages of the selected sources (and following all the links
presented there), we recovered other documents and information. The full list of
the sources used is the following: the EIA (Energy Information Administration, see
http://eia.gov), which is the USA’s premier source of unbiased energy data, analysis
and forecasting; the IEA (International Energy Agency, http://iea.org), which acts as
energy policy advisor to 26 member countries; ENI (Ente Nazionale
Idrocarburi), which is one of the most important integrated energy companies in the
world, operating in the sectors of oil, gas, power generation, etc.; the OECD
(Organization for Economic Co-operation and Development), which is one of the
world’s largest providers of comparable statistics and economic and
social data; the Oil & Gas Journal, which delivers the latest international oil and
gas news; the FERC (Federal Energy Regulatory Commission, for the USA only),
which regulates and oversees energy industries in the economic, environmental, and
safety interests of the American public; the NREL (National Renewable Energy
Laboratory, also referred to the USA), which is America’s primary laboratory for
renewable energy and energy efficiency research and development (R&D);
British Petroleum, another important global energy company; and,
finally, the World Bank, which is one of the most important institutions interested in
the themes of energy security and their impact on global economic aspects.
The reason why we privileged these sources and not others is strictly linked to the
problem of the “trust” of the information provided on the web. In our opinion, the
information sources selected by the ontology engineers to discover information and
to acquire specific domain knowledge about the field to be represented are a
sort of “foundational bricks” for the ontology infrastructure, and so they have a
very important role. According to this view, we made this kind of selection
because we regarded these sources as “authoritative” in the energy field and,
therefore, as having a good degree of trust. Based on the domain information
extracted from the selected documents, we started to create the manual ontology
using the Protégé software editor, version 3.3.1 (http://protege.stanford.edu). We
chose this software for the following reasons:
• It has an extensible knowledge model. The internal representational primitives in
Protégé can be redefined declaratively. Protégé’s primitives - the elements of its
knowledge model - provide classes, instances of these classes, slots representing
attributes of classes and instances, and facets expressing additional information
about slots.
• It has a customizable user interface. The standard Protégé user interface
components for displaying and acquiring data can be replaced with new
components that best fit particular types of ontologies (e.g., for OWL).
• It allows importing ontologies in different formats. Several plug-ins are
available for importing ontologies in different formats into Protégé, including XML,
RDF, and OWL.
• It supports data entry. Protégé provides facilities whereby the system can
automatically generate data entry forms for acquiring instances of the concepts
defined by the source ontology.
• It offers many ontology authoring and management tools. The PROMPT tools are
Protégé plug-ins that allow developers to merge ontologies, to track changes in
ontologies over time, and to create views of ontologies. The Protégé internal
knowledge representation can be translated into the various representations used in
the different ontologies. Protégé has different back-end storage mechanisms,
including relational databases, XML, and flat files.
• It presents an extensible architecture that enables integration with other
applications. Protégé can be connected directly to external programs in order to use
its ontologies in intelligent applications, such as reasoning and classification
services.
• It provides a Java Application Programming Interface (API). System
developers can use the Protégé API to access and programmatically manipulate
Protégé ontologies. Protégé can be run as a stand-alone application or through a
Protégé client in communication with a remote server.
3.1.3 Definition of Domain Concepts
It is not easy to present explicitly the subdivision of the different steps followed for
the manual ontology generation, because different operations took place in
parallel (carried out by one or more persons involved in the project). However, in
general, the process of the energy ontology modelling was the following: in a first
phase, after the initial document reading and initial domain knowledge acquisition,
a first hierarchy of concept classes was constructed (the first version of the manual
ontology was formed by 16 classes, 11 properties, inverse functions included, and 40
instances; see table 3.1). The process followed for the domain concept
definition was based on the manual extraction of the main keywords present in
the texts read and on the inferences we made from them (e.g. for each keyword we
considered its hyponyms, hypernyms, meronyms, and, finally, the potential related
concepts that could be extracted). Then, after the creation of the “infrastructure” of
the ontology concepts, we started to insert more detailed knowledge into the
ontology, defining instances and different types of relationships (properties in
Protégé) in order to put more meaningful information into the ontology.
Classes: Energy Domain, Risks, Solutions, Energy Security, Reliability to Supply, Friendliness to Environment, Energy Sources, Primary and Secondary Sources, Nuclear, Nuclear Weapon Proliferation, Nuclear Energy Proliferation, Country, Infrastructure, Renewable and Non Renewable Sources.
Instances: Alcohol Fuel, Biofuel, Ethanol, Coke, Diesel Fuel, Gasoline, JetFuel, Biomass, Corn, Geothermal, Photovoltaic, Solid Waste, Solar, Waste, Water, Wind, Wood, Anthracite, Coal, Bituminous Coal, Lignite, Oil, Propane, Natural Gas, Peat, Belarus, USA, China, Russia, India, Canada, Brazil, Venezuela, Nigeria, Saudi Arabia, Norway, Mexico, UK, Uzbekistan, Algeria, Kuwait.
Properties: is a type of, includes, is mostly exported by, exports, is one of the major producers of, is mostly produced by, is used to, is made into, causes, is caused by, is useful in case of.
Table 3.1: Classes, Properties and Instances of the first version of the manual Ontology
3.1.4 Horizontal Links
After defining the properties and relations, we started to create horizontal links
between different classes, subclasses and instances. Horizontal links are properties
or relations which link individuals or classes that are not in a hierarchical
relationship. This kind of relation is very important because it allows the ontology to
represent information of a higher level compared with the subsumption (“is-a”
relations) and similarity information usually provided by knowledge representation
systems. An example of the usual information provided is the following: “Margherita
is a type of pizza”. Through horizontal links, instead, it is possible to represent richer
information, such as the following: “In Naples there is the venue X, in front of the
place Y, that prepares a fantastic pizza Margherita”. The improvement that
horizontal links bring to the quality of the information that can be provided within
these systems is therefore evident. To create the horizontal links we used logical
properties with one or more arguments. An example of a logical property with one
argument is the following: “Mario is a funny guy”, where “Mario” is the argument and
“to be a funny guy” is the property. An example of a logical property with two
arguments (this kind of property is usually called, in the logic literature, a “relation”)
is the following: “Saudi Arabia is one of the major exporters of oil”, where “Saudi
Arabia” and “oil” represent the arguments and “to be one of the major exporters of”
is the asserted relation.
This kind of property allowed us to create relations within the ontology involving up
to 3 classes (a concrete example will be shown in the next paragraph with the
ontology description). Table 3.2 shows the full statistics of the horizontal links
present within the ontology. Most of them link 2 classes and are therefore based on
properties with two arguments.
Horizontal links between 2 classes: 96
Horizontal links involving up to 3 classes: 12
Table 3.2. Horizontal Links Data
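The logical properties with two arguments described above can be sketched as simple triples (a minimal Python illustration, not Protégé's actual data model; the property names are taken from table 3.1):

```python
# Minimal sketch: horizontal links as (subject, property, object) triples,
# together with declared inverse properties, as described in the text.
# (This is an illustration, not Protege's internal representation.)

links = [
    ("Saudi Arabia", "is one of the major exporters of", "Oil"),
    ("Scarce Resources Dependency", "cause", "Growth of energy price"),
]

# Declared inverse properties, as in Protege (property -> inverse).
inverses = {"cause": "is caused by"}

def with_inverses(triples, inv):
    """Return the given triples plus the ones implied by inverse properties."""
    out = list(triples)
    for s, p, o in triples:
        if p in inv:
            out.append((o, inv[p], s))
    return out

all_links = with_inverses(links, inverses)
for s, p, o in all_links:
    print(f"{s} {p} {o}")
```

Asserting the property “cause” thus automatically yields the inverse statement “Growth of energy price is caused by Scarce Resources Dependency”, mirroring the inverse functions used in the ontology.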
Figure 3.1 shows an example of the high degree of interconnection created within
the ontology using horizontal links. An example easily seen in the figure (on the
right-hand side) is the link between the class “Solutions” and the class “Friendliness
to Environment”. It shows, and gives us the information, that there is one instance of
the class Solutions (we recognize that it is an instance because the colour of the link
is pink) that is linked to the issue of friendliness to the environment (and it is
probably linked in a positive manner, in the sense that one of the solutions to the
energy problems goes in the direction of friendliness to the environment).
Figure 3.1. Horizontal Links within the ontology
3.1.5 Ontology Population
Once the ontology structure was defined, ontology population phase started
launching crawlers on the web in order to retrieve other meaningful documents
about energy domain, both using automatic lexical acquisition systems in order to
update the ontology with new concepts and new lexicon. The first strategy, with the
crawlers, allowed us to recover relevant documents not previously retrieved with
the search queries. The second strategy was followed through the use of an
automatic keyphrase extraction system called LAKE (D’Avanzo et al. 2004) that
will be described in detail in the next chapter. This system was used to extract
relevant keywords from the documents given in input (the input documents were
that retrieved using the crawlers strategy) and new lexicon and new concepts, then
manually inserted into the ontology, were recovered using the extracted keywords.
This kind of work has been mainly done for a specific class of the ontology: the
energy security class.
3.1.6 Logical Expressions
In order to insert different types of information into the ontology, we used logical
expressions. To create these expressions we used the quantifiers of first-order
logic, as introduced by Frege (1879), in order to introduce more and better
inferential mechanisms into the system. The introduction of first-order descriptive
logic expressions was possible because of the choice of building the energy
ontology using OWL DL. This language, in fact, allows the ontology to be enriched
by inserting formal descriptive expressions. OWL DL can be viewed as an
expressive Description Logic, and an OWL DL ontology is the equivalent of a
Description Logic knowledge base. We used two kinds of expressions to specify, in
a formal manner, the information provided to the system: expressions with the
Existential and with the Universal quantifier:
1. (∃) Expressions with the Existential quantifier: these specify, for a set of
individuals, the existence of a (at least one) relationship along a given property to
an individual that is a member of a specific class.
2. (∀) Expressions with the Universal quantifier: these constrain the
relationships along a given property to individuals that are members of a specific
class (in other terms, they are used to state that “all the individuals of the class X
have a certain property Y”).
A simple example of a “double translation” of a natural-language statement about
the energy domain, first into formal logic and then into an OWL DL construction, is
the following:
• Verbal proposition: Some fossil fuels cause some environmental
consequences or some risks for the energy domain.
• First-order predicate logic: ∃x (F(x) ∧ (Ce(x) ∨ Cr(x))) (with the existential
quantifier the conjunction, not the implication, must be used, otherwise the formula
would be satisfied by any individual that is not a fossil fuel).
• Protégé OWL DL construction: Fossil Fuels cause some (Environmental
Consequences or Risks); see figure 3.2 for an example.
Figure 3.2. A logic expression inserted into the Energy Ontology
Another reason that guided our choice of the OWL DL language is the fact that
there are many implementations available for reasoning tests, and that was
important for evaluating the consistency of the ontology. The OWL Plugin of
Protégé, in fact, provides direct access to DL reasoners such as RacerPro (which
will be discussed in the next paragraph).
The current user interface supports two types of DL reasoning: Consistency
checking and classification (subsumption).
Consistency checking (i.e., the test whether a class could have instances) can be
invoked either for all classes with a single mouse click, or for selected classes only.
Inconsistent classes are marked with a red bordered icon.
Classification (i.e., inferring a new subsumption tree from the asserted definitions)
can be invoked with the classify button on a one-shot basis. When the classify
button is pressed, the system determines the OWL species, because some reasoners
are unable to handle the ontologies in OWL Full. If the ontology is in OWL Full
(e.g., because metaclasses are used) the system attempts to convert the ontology
temporarily into OWL DL. The OWL Plugin supports editing some features of
OWL Full (e.g., assigning ranges to annotation properties, and creating
metaclasses). These are easily detected and can be removed before the data are sent
to the classifier. Once the ontology has been converted into OWL DL, a full
consistency check is performed, because inconsistent classes cannot be classified
correctly.
Finally, the classification results are stored until the next invocation of the classifier,
and can be browsed separately. Classification can be invoked either for the whole
ontology, or for selected sub-trees only. In the latter case, the transitive closure of
all accessible classes is sent to the classifier. This may return an incomplete
classification because it does not take incoming edges into account, but in many
cases it provides a reasonable approximation without having to process the whole
ontology. OWL files store only the subsumptions that have been asserted by the
user. However, experience has shown that, in order to edit and correct their
ontologies, users need to distinguish between what they have asserted and what the
classifier has inferred. Many users may find it more natural to navigate the inferred
hierarchy, because it displays the semantically correct position of all the classes.
The OWL Plugin addresses this need by displaying both hierarchies and making
available extensive information on the inferences made during classification. After
classification the OWL Plugin displays an inferred classification hierarchy beside
the original asserted hierarchy. The classes that have changed their superclasses are
highlighted in blue, and moving the mouse over them explains the changes.
Furthermore, a complete list of all changes suggested by the classifier is shown in
the upper right area, similar to a list of compiler messages. A click on an entry
navigates to the affected class. Also, the conditions widget can be switched between
asserted and inferred conditions. All this allows the users to analyze the changes
quickly. The correctness of the manual ontology has been evaluated using the first
of the above-mentioned tasks: consistency checking.
3.1.7 Racer Pro
RacerPro stands for Renamed ABox and Concept Expression Reasoner
Professional. Its origins are within the area of description logics. Since description
logics provide the foundation of international approaches to standardizing ontology
languages in the context of the Semantic Web, RacerPro is also used as a system
for managing Semantic Web ontologies based on OWL. It can be used as a reasoning
engine for ontology editors such as Protégé. An important aspect of this system is
the ability to process OWL documents. The following services are provided for
OWL ontologies and RDF data descriptions:
• Check the consistency of an OWL ontology and a set of data descriptions.
• Find implicit subclass relationships induced by the declarations in the
ontology.
• Find synonyms for resources (either classes or instance names).
• Since extensional information from OWL documents (OWL instances and
their interrelationships) needs to be queried for client applications, an OWL-
QL query processing system is available as an open-source project for
RacerPro.
• HTTP client for retrieving imported resources from the web. Multiple
resources can be imported into one ontology.
• Incremental query answering for information retrieval tasks (retrieve the
next n results of a query).
In addition, RacerPro supports the adaptive use of computational resources:
answers which require few computational resources are delivered first, and user
applications can decide whether computing all answers is worth the effort. In order
to evaluate the manually built ontology, we first connected this tool to the ontology
in OWL (the process is shown in figure 3.3) and then ran it to check the ontology
consistency.
Figure 3.3. Ontology connection with the RacerPro reasoner
The process of ontology evaluation was carried out during the whole process of
ontology building, until the last update of the ontology. This allowed us to correct
immediately the logical inconsistencies we found during the process. During the last
evaluation, no logical inconsistencies or taxonomy errors were found, which means
that the manually built ontology has an internal logical coherence.
3.2 Manual approach: Energy Ontology Description
This paragraph describes the classes of the manual energy ontology. Some of
them turned out to be more “meaningful” than others because of their high level of
interconnection with other classes and instances. The full data of the manually built
ontology are given in table 3.3.
Table 3.3. Energy Ontology Data
Documents: 200; Classes: 54; Instances: 121; Properties: 30 (inverse functions
included)
3.2.1 Energy Domain
For the energy ontology we adopted a super-class named Energy_Domain
composed of 6 sub-classes: Country, EnergySecurity, EnergySources,
Infrastructures, Market and EnvironmentalConsequences (see figures 3.4 and 3.5
below).
Figure 3.4: Energy domain subclasses in the “Classes view” tab of Protégé
Figure 3.5: A visualization of the Energy domain subclasses with the Ontoviz plugin
At the same level as the energy domain super-class, two transverse classes, Risks
and Solutions, have been inserted (figure 3.6). This choice was made because
these two categories were present in many of the classes of the ontology, and so it
was not possible to clearly decide in which classes to include them and in which
not. With this strategy, instead, it was possible to create many horizontal relations
between these classes and the other classes of the ontology.
For the concept “Risks” we considered as subclasses different types of issues,
such as: economic problems (e.g. market instabilities, market cartels, growth in
energy prices, etc.), geopolitical problems (e.g. natural gas, oil and nuclear
disputes) and technical problems.
For “Solutions”, instead, we mainly considered as subclasses the government
actions and policies used to avoid energy problems. We divided this class into 3
sub-classes: long-term, medium-term and short-term policies.
Figure 3.6: Energy Ontology Superclasses
3.2.2 Energy Sources
We divided this class into Primary, Secondary and Nuclear sources, following the
most common distinction made in the literature on these issues. Primary sources
do not need to be subjected to any conversion or transformation process to be
used. Secondary energy sources, instead, have to be transformed from one form to
another to be used. Each of these two classes has further subclasses, given by the
distinction between renewable and non-renewable sources. For the nuclear class,
the subclasses nuclear energy sources and nuclear weapon proliferation have been
identified. This last class allowed us to create horizontal links with the class
“geopolitical problems”, which is a subclass of the superclass “Risks”.
Figure 3.7. Energy Sources Classification
3.2.3 The case of the Hydrogen: Renewable or not Renewable?
One of the main problems encountered during the ontology building task was the
difficulty of classifying some entities as belonging to one class or another. A
meaningful example of this type of classification problem is what we have called
the “case of the hydrogen”. The problem was the following: the different sources
we consulted classified hydrogen in different ways, because hydrogen can be both
renewable and non-renewable, depending on the source from which it is extracted.
So we had the problem that one concept of the ontology was a member, at the
same time, of one class and of its opposite, which, of course, is not allowed. In
order to overcome this problem we created different “hydrogen entities”: one was
“hydrogen from renewable sources (solar, wind, etc.)” and the other was “hydrogen
from non-renewable sources (hydrocarbons, etc.)”. Figure 3.8 shows the way in
which we solved the problem.
Figure 3.8: the Classification of Hydrogen
3.2.4 Country
The Country class has 2 subclasses, covering OPEC members and non-OPEC
members. It is an important class because it allows the creation of horizontal
relationships with the energy sources class or with the economic and
environmental policy aspects. This class can be modified and modelled in a
different manner, considering different classification criteria. For example, if we
want to focus on the geopolitical problems (which, of course, involve the countries)
related to the natural gas energy source, a different classification can be proposed
by a domain expert and the number of subclasses can easily be enlarged.
3.2.5 Energy Security Class
Energy Security is the class of the ontology on which we mainly focused during the
last part of the manual ontology building, following the suggestions of the energy
domain expert Dr. Brenda Schaffer. This issue has risen to the top of the agenda
among policy makers, international organizations and business because, in the last
decade, there has been a sustained growth in the demand for energy that raises
serious concerns about the long-term availability of reliable and affordable
supplies.
Energy security has broad economic, political and societal consequences. A lack of
energy security can exacerbate geopolitical tensions and impede development.
Because of the interconnection of this concept with many other aspects
(geopolitical and economic problems, technical aspects, etc.), we spent
considerable time and effort on the classification of this class, because there is no
clear definition of it. It is a sort of “multi-faceted concept”, because it concerns
different issues. So we created for this concept an infrastructure able to link it with
other concepts within the ontology. The relations created allowed the connection of
this concept with up to 3 other classes. The main direct subclasses of energy
security are presented in figure 3.9.
Figure 3.9: Energy Security direct subclasses (Reliability of Supply, Scarce
Resources Dependency, Resources Affordability, Friendliness to Environment)
Figure 3.10 exemplifies the way we created horizontal links with the other classes
of the ontology. For example, we linked the concept “energy security” with the
concept “geopolitical problems”, which is a sub-concept of the superclass “Risks”.
Moreover, by creating these links, we were able to connect three classes of the
ontology: Energy Security (Reliability of Supply is its subclass), Country and
Geopolitical Problems. The created infrastructure, indeed, allowed us to provide
this kind of information: “Cut off in supply for USA (which is an energy security
subclass) is caused by the USA vs Venezuela dispute (which is a subclass of
geopolitical problems), and this fact involves USA and Venezuela (which are
instances of the class Country)”. In the figure, the arrows represent the horizontal
links (the position of the classes in the figure does not show their real position
within the ontology: e.g. Risks and Energy Security are not at the same level). USA
and Venezuela are two instances and are shown in blue; the classes are in yellow.
The property which links “Cut off in supply for USA” and the subclass “Venezuela
vs USA dispute” is “is caused by” (there is also the inverse property “cause”), and it
is represented by the black arrow in the figure. The property which links the
instances “USA” and “Venezuela” with the classes in the figure (red arrows) is
“involve” (with its inverse function “is involved in”), and an example of the
information provided by these links is the following: “Venezuela and USA are
involved in a dispute between them that caused a cut off in supply for the USA”.
Figure 3.10: Ontology connection between Energy Security and Geopolitical Problems
Another example of a horizontal connection created is shown in figure 3.11. The
link is between an energy security subclass (Scarce Resources Dependency) and
economic problems such as market cartels, growth of energy prices and market
instabilities (Economic Problems is one of the subclasses of the superclass Risks).
The horizontal links are represented, in the figure, by the bidirectional arrows. The
arrows indicate the property “cause” (and its inverse function “is caused by”). So, in
this case, the created links allowed us to provide, for example, this kind of
information within the ontology: “Scarce Resources Dependency causes growth of
energy prices” and its inverse “Growth of energy prices is caused by scarce
resources dependency”. The same holds for the other classes linked in the figure.
Figure 3.11 Energy security and Economic Problems
3.2.6 Infrastructures
This class represents the infrastructures used for energy transformation, extraction
and transportation. For this class 10 instances have been identified, which are
listed here: Oil Tanker, Pipelines, Turbine, Drilling Equipment, Barge, Heliostats,
Methane Pipelines, Ship, Train, Refinery.
3.2.7 Environmental Consequences
“Environmental Consequences” is a very important issue within the energy domain.
Within the energy ontology we created this class, identifying 14 instances referring
to this issue. They are listed here: climate change, pollution, deforestation,
desertification, global warming, higher global temperatures, bird flight patterns,
CO2 emissions, damage to views, flooding, droughts, increased rains, greenhouse
gas emissions, impacts on weather, urbanization.
Many of these instances are highly correlated with one another. For example,
climate change refers to any significant change in measures of climate (such as
temperature, precipitation or wind) lasting for an extended period (decades or
longer). It may result from:
1. natural factors, such as changes in the sun's intensity or slow changes in
the Earth's orbit around the sun;
2. natural processes within the climate system (e.g. changes in ocean
circulation);
3. human activities that change the atmosphere's composition (e.g. through
burning fossil fuels) and the land surface (e.g. deforestation,
reforestation, urbanization, desertification, etc.)
This last possibility allows us to link the instances “desertification”,
“deforestation”, “urbanization” and “climate change” through information such as
the following: “Deforestation is one of the causes of climate change”. Even this
information is provided by means of a horizontal link between instances
belonging to the same class (in fact, even in this case the link is horizontal,
because the instances are all at the same level of the hierarchy).
3.2.8 Energy Use
For the class Energy Use we followed the major distinction made in the literature on
this matter, identifying 5 subclasses, one for each sector in which energy is
consumed. The subclasses identified are: Commercial, Electric, Industrial,
Transportation, Residential. For these subclasses we created some horizontal links
regarding the energy sources mainly utilized for the different uses (linking, in this
way, the concepts of “Energy Sources” with the above-mentioned “Energy Use”
subclasses). For example, for residential use mainly non-renewable sources are
consumed, and the information provided through the horizontal link is the following:
“Residential Use mainly makes use of Non Renewable Sources” (the property
identified for this link is “mainly use”, with its inverse function “is mainly used”).
3.3 A Semi automatic approach for ontology building
There are pros and cons to manual ontology construction. The pros are that human
expertise and background knowledge of a specific domain direct the definition of
the concept taxonomy and the creation of relevant properties. Moreover, it enables
the insertion of meaningful information into the ontological system. On the other
hand, one of the major disadvantages of manual ontology construction is the effort
required from ontology designers. Indeed, for an ontology engineer, to manually
construct, without any “help”, a complex knowledge representation system such as
an ontology means spending a lot of time in: reading documents, extracting the
most relevant keywords for each document, inserting the keywords into the
ontology in a proper manner (without creating confusion between class keywords
and instance keywords), inferring relations between terms (represented as classes
or as instances), etc. Furthermore, manual ontology building implies that the
ontology builder is a specialist in ontology construction or, at least, a domain expert
with ontology engineering skills who is familiar with the relevant ontology editors
(e.g. Protégé, OntoStudio, etc.). The question is: how many domain specialists in
the various domains also know something about ontologies?
In order to overcome these problems, different systems able to build ontologies in
an automatic or semi-automatic way have been developed in recent years. They
may start from dictionaries, knowledge bases, semi-structured schemata, relational
schemata or unstructured textual documents (see Chapter 2 for a rapid overview).
One of these, briefly described in the previous chapter, is OntoGen. It has been
developed by Blaz Fortuna, Marco Grobelnik and Dunja Mladenic of the Jožef
Stefan Institute, Ljubljana (Fortuna et al. 2007). It is a semi-automatic, data-driven
topic ontology9 editor. It integrates machine learning and text mining
algorithms to reduce both the time spent by the user on the ontology building and
the complexity of the task itself (Fortuna et al. 2006). It is “semi-automatic” because
it automatically performs some tasks (e.g. suggesting concepts and concept
names, assigning instances to concepts, etc.), but, at the same time, it allows the
user to keep full control of the system, because he/she can accept, adjust (even
manually) or reject the modifications that are made automatically. It is “data-driven”
because the system is “guided” by the data given by the user during the initial
phase of the ontology construction. The data provided by the user is a corpus of
documents given as input to the system in order to generate the ontology and, for
that reason, they reflect the domain knowledge for which the user is building the
ontology. The documents are represented in a bag-of-words (BOW) representation,
in which each document is encoded as a vector of term frequencies. The terms are
weighted with the TFxIDF measure, and the similarity between two documents is
calculated with the cosine similarity measure, which is defined by the cosine of the
angle between two bag-of-words vectors. TFxIDF is a quite common linguistic
frequency measure: it combines the frequency of a term within a document (TF)
with the frequency of that term in the corpus (hence integrating the relative
importance of the term within a document with how well the term represents the
document in the corpus).
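The representation described above can be sketched as follows (a minimal illustration with invented toy documents; OntoGen's actual implementation details may differ):

```python
import math
from collections import Counter

# Minimal bag-of-words sketch: documents as TFxIDF-weighted term vectors,
# compared with cosine similarity. Toy documents for illustration only.
docs = [
    "oil supply security and oil price",
    "renewable energy sources wind and solar",
    "oil price and energy market",
]
tokenized = [d.split() for d in docs]

def tfidf(doc_tokens, corpus):
    """TFxIDF vector of one document, as a {term: weight} dict."""
    n = len(corpus)
    vec = {}
    for term, freq in Counter(doc_tokens).items():
        df = sum(1 for d in corpus if term in d)  # document frequency
        vec[term] = freq * math.log(n / df)       # TF x IDF weight
    return vec

def cosine(a, b):
    """Cosine of the angle between two bag-of-words vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(t, tokenized) for t in tokenized]
# The first document (about oil prices) is closer to the third than
# to the second (about renewables): shared terms like "and", present
# in every document, get IDF = 0 and so do not contribute.
print(cosine(vecs[0], vecs[2]), cosine(vecs[0], vecs[1]))
```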
Ontogen’s work is based on the automatic extraction of keywords from the
documents given as input by the user for the ontology generation. There are two
keyword extraction methods applied by the system. The first, shown under the
label “keywords” in the system’s interface (see figure 3.12), uses centroid vectors.
The centroid is the sum of all the vectors of the documents inside the topic. The
keywords selected are those with the highest weights in the centroid vector.
9 A topic ontology is a set of topics (or concepts) connected with different types of relations. Each topic includes a set of related documents.
Figure 3.12. Keywords extraction methods with centroid vectors.
The set of keywords extracted with this method is composed of the most
descriptive words of the concept’s documents (or instances).
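The centroid method can be sketched as follows (a simplified illustration with invented documents, in which raw term counts stand in for the TFxIDF-weighted vectors):

```python
from collections import Counter

# Centroid keyword extraction, as described above: the centroid of a
# topic is the sum of its documents' term vectors, and the suggested
# keywords are the terms with the highest centroid weights.
# (Simplified: raw term counts stand in for the TFxIDF weights.)
topic_docs = [
    "energy security and supply",
    "security of oil supply",
    "energy supply risks",
]

def centroid_keywords(docs, k=3):
    centroid = Counter()
    for d in docs:
        centroid.update(d.split())  # sum of the document vectors
    return [term for term, _ in centroid.most_common(k)]

print(centroid_keywords(topic_docs))
```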
The second method, shown under the label “SVM keywords” (see figure 3.13),
uses a Support Vector Machine classifier. This method is used to extract keywords
describing a selected concept. The classifier is trained as follows: let A be the topic
to be described with the keywords. All the documents from the concepts that have
A as a subtopic are marked as negative, while the documents under A are marked
as positive. The linear SVM classifier is then trained on these documents and
classifies the centroid of the topic A. With this method, the set of extracted
keywords is composed of the most distinctive words for the selected concept with
respect to its sibling concepts in the hierarchy (Fortuna et al. 2005).
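The idea can be sketched with a simple linear classifier (here a perceptron stands in for the linear SVM, and the documents are invented for illustration): documents under the topic are positive examples, documents of the sibling topics are negative, and the most distinctive keywords are the terms with the largest positive weights of the learned linear separator.

```python
# Sketch of SVM-style keyword extraction: train a linear classifier to
# separate the topic's documents (positive) from its siblings' documents
# (negative), then read off the terms with the largest positive weights.
# A simple perceptron stands in for the linear SVM used by OntoGen.

pos_docs = ["oil supply security", "security of energy supply"]
neg_docs = ["solar panel technology", "wind turbine technology"]

vocab = sorted({t for d in pos_docs + neg_docs for t in d.split()})
index = {t: i for i, t in enumerate(vocab)}

def bow(doc):
    v = [0.0] * len(vocab)
    for t in doc.split():
        v[index[t]] += 1.0
    return v

# Perceptron training: nudge the weight vector towards misclassified
# positives and away from misclassified negatives.
w = [0.0] * len(vocab)
data = [(bow(d), 1) for d in pos_docs] + [(bow(d), -1) for d in neg_docs]
for _ in range(20):
    for x, y in data:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]

# Distinctive keywords: terms with the largest positive weights.
keywords = sorted(vocab, key=lambda t: -w[index[t]])[:3]
print(keywords)
```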
Figure 3.13. Keyword extraction methods with Support Vector Machine Classifier.
The keywords extracted by one of the above methods are used to suggest possible
topics, subtopics or topic names for the ontology building. Suggestion generation is
one of the main features of OntoGen. It can be provided in two different manners:
with unsupervised or with supervised methods. In the unsupervised approach the
system provides concept and sub-concept suggestions using keywords extracted
from the documents of the selected topics with Latent Semantic Indexing (LSI) or
the k-means algorithm (the method is chosen by the user). The main advantage of
the unsupervised method is that it requires very little input from the user (only the
number of clusters must be indicated). With the supervised method, instead, the
concept suggestion is “driven” by the user, in the sense that using this method
implies that the user has an initial idea of what a concept or a sub-concept of the
ontology should be. With
this knowledge background he/she can enter a query and use it to train the system
using relevant documents, as explained below (see figure 3.14).
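The unsupervised suggestion step can be illustrated with a toy k-means clustering of bag-of-words vectors (a sketch with invented documents: the user supplies only the number of clusters, and each resulting group of documents is a candidate sub-concept; OntoGen additionally uses TFxIDF weighting and cosine similarity):

```python
# Toy k-means sketch of the unsupervised concept suggestion: the user
# supplies only the number of clusters; each resulting cluster of
# documents is a candidate sub-concept.

docs = [
    "oil supply security",
    "oil price security",
    "solar wind renewable",
    "wind solar energy",
]
vocab = sorted({t for d in docs for t in d.split()})

def bow(doc):
    return [doc.split().count(t) for t in vocab]

def dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10):
    centers = [list(v) for v in vectors[:k]]  # naive initialisation
    groups = [vectors]
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest center.
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda i: dist(v, centers[i]))].append(v)
        # Update step: each center becomes the mean of its group.
        for i, g in enumerate(groups):
            if g:
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return groups

vecs = [bow(d) for d in docs]
clusters = kmeans(vecs, k=2)
print([len(c) for c in clusters])  # the oil documents and the renewables
                                   # documents end up in separate clusters
```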
Figure 3.14: An example of a user query with the supervised method for suggestion
generation in OntoGen
Once the user enters his/her query in the form of keywords or keyphrases, the
system starts asking him/her questions about whether a particular document
belongs to the selected concept, and the user can select the Yes or No buttons as
answers (see figure 3.15).
Figure 3.15. Training phase in OntoGen using the supervised method for concept suggestion
In other words, after the user’s query, the system starts an active learning process
based on the feedback provided by the user during a training phase. In this phase
the user has to provide information to the system, answering with positive or
negative feedback (the Yes or No buttons) the question: “Does this document
belong to the concept?”. The system, on its side, uses a machine learning
technique for the semi-automatic acquisition of the user’s knowledge: it refines the
suggested concept after each reply of the user and, once the user is satisfied with
the suggestions, the concept is constructed and added to the ontology as a
sub-concept of a selected concept.
3.3.1 Semi Automatic Ontology Generation
OntoGen has been used for the semi-automatic generation of a part of the energy
domain ontology. The part chosen is the energy security class, which is a sort of
micro-ontology within the general energy domain (it is formed by 14 classes in
total). The semi-automatic ontology for the pilot-study experiments has been
created from 25 selected domain documents about energy security. The system
was able to build an ontology of 10 concepts. A screenshot of the generated
concepts is presented in figure 3.16. Figure 3.17 shows the semi-automatically
generated ontology exported in OWL format within Protégé.
Figure 3.16: A screenshot of the semi-automatic generated concepts
Figure 3.17. The Semi-automatic generated ontology exported in OWL
Chapter 4: Evaluation and Experiments
There are different approaches to ontology evaluation and comparison. The major
distinction is between qualitative and quantitative ones (Brewster et al., 2004).
Qualitative evaluations can be done by presenting users with an ontology, or a part
of an ontology, and asking them to rate it. The problem with this approach is that it
does not use any standard criteria for the evaluation. The quantitative approach,
instead, allows a semi-automatically or automatically generated ontology to be
evaluated and compared with an existing “gold standard”, represented for instance
by a manually built ontology, using different measures. Maedche and Staab (2002)
used an evaluation method based on comparing the ontology to a “gold standard”
which may be an ontology itself. Porzel and Malaka (2004) support the idea of
evaluating an ontology through the concrete results of its application. Lozano-Tello
and Gomez-Perez (2004) proposed an evaluation approach based on the human
evaluation of the ontology, in which the humans try to assess how well the
ontology fits a set of
predefined criteria and standards. Brank et al (2005) support the idea that it is often
more practical to evaluate different levels of the ontology separately. They
identified the following levels for an evaluation: lexical, vocabulary or data layer
level in which the focus is on which concepts or instances have been included into
the ontology and the vocabulary used to identify this concepts; hierarchy or
taxonomy level: here the evaluation is focused on the is-a relations; other semantic
relations: the evaluation is focused on different types of semantic relation (is a
relations are not considered); context or application level: at his level the context of
which an ontology can be part is taken into account (e.g. an ontology may be part of
72
a larger collection of other ontologies); syntactic level: the ontology is described in
a particular formal language and is evaluated if the language used for the ontology
matches the syntactic requirement of that language; structure, architecture, design
level: this kind of evaluation usually performed entirely manually and consists of
the observation of certain pre-defined design principles. The structure evaluation
involves the organization of the ontology and its suitability for further
developments. Guarino and Welty (2005) consider different aspects to the ontology
evaluation. They point out that several philosophical notions (such as essentiality,
rigidity, unity etc) can be used to better understand and evaluate the nature of
different types of semantic relations that commonly appear in an ontology. A
drawback of this approach is that it requires manual intervention by trained humans
familiar with the above mentioned notions.
We evaluated the semi-automatic ontology by comparing it with the manually
generated one, using precision and recall measures: as reported by
Gangemi et al. (2005), this is one of the most common approaches to ontology
evaluation, thanks to its simplicity.
4.1 Experiments for the Pilot Study
We conducted a pilot case study for the Energy Domain, using 25 documents about
energy security. The pilot study allowed us to evaluate the semi-automatic approach
to ontology building (we used Ontogen for the semi-automatic ontology generation)
and to uncover a set of pros and cons with respect to a manual approach. The
pros were the following: the ontology construction task is simplified by the
semi-automatic system and, in this way, it doesn’t require specific expertise in
ontology engineering. Ontogen offers suggestions for topics or subtopics of the
ontology as sets of keywords extracted from a corpus of documents. This, in turn,
allows the discovery of relevant classes that were not even considered in the manual
approach. In this view, Ontogen could be used in parallel with the manual ontology
construction, suggesting to the ontology engineer classes and instances missed during
the manual step. Regarding the cons: the ontology relations among concepts are very
poor (only is-a relations and “similarity” relations between concepts are allowed),
so other meaningful relations among different concepts are missing. Furthermore,
there is often no alignment between the keywords extracted by Ontogen, which also
serve as class names, and those assigned manually by humans. This problem is
probably common to all keyword extraction systems integrated with Ontogen and
might be reduced by using more linguistically based keyword extraction systems
such as LAKE (D’Avanzo, Kuflik 2005).
A deeper evaluation of the semi-automatically generated energy ontology against the
manual gold standard one has been carried out with two experiments. We compared the
two ontologies using a quantitative approach, calculating precision and recall
measures in order to see whether there is a partial overlap between them.
4.2 Precision and Recall for a quantitative evaluation
Precision and recall are two standard, well-known measures in the Information
Retrieval field. They are used, for instance, to evaluate the effectiveness of a search
engine, in order to understand whether the documents retrieved after a query are
relevant or not with respect to the information need of the user who issued the query
(see Table 4.1).
They are calculated as follows:
• PRECISION = Relevant Retrieved / Retrieved = R,R / (R,R + NR,R)
• RECALL = Relevant Retrieved / Relevant = R,R / (R,R + R,NR)
These two measures have been adapted for the evaluation task regarding the
comparison between manual and semi-automatic ontology building. So, in this
case, the precision value indicates how many relevant semi-automatically generated
concepts are produced out of the total of the semi-automatically generated concepts.
Recall, instead, reflects how many relevant concepts are generated out of the
total of the relevant, manually built, concepts (it is a “coverage” measure). The
evaluation has been conducted only on a part of the energy ontology: the energy
security class. This class, as shown previously, has other subclasses and many
horizontal relations with other classes within the energy domain ontology.

                RELEVANT   NOT RELEVANT
RETRIEVED       R,R        NR,R
NOT RETRIEVED   R,NR       NR,NR

Table 4.1. Precision and recall
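The adapted measures can be sketched as a simple computation over two sets of concept names. The concept names below are hypothetical, for illustration only:

```python
# Illustrative sketch (not the evaluation code used in the thesis):
# adapted precision and recall over two sets of concept names.

def precision_recall(generated, gold):
    """Precision = relevant generated / all generated concepts;
    recall = relevant generated / all gold-standard concepts."""
    relevant = generated & gold          # concepts matching the gold standard
    return len(relevant) / len(generated), len(relevant) / len(gold)

# Hypothetical concept names, for illustration only.
semi_auto = {"energy security", "oil", "pipeline", "gas", "market"}
manual = {"energy security", "oil", "gas", "geopolitical problems",
          "supply", "infrastructure", "transit states", "reserves",
          "demand", "prices"}

p, r = precision_recall(semi_auto, manual)
print(round(p, 2), round(r, 2))  # 3 matches out of 5 generated and 10 gold -> 0.6 0.3
```

In the actual evaluation the matching was done by hand, counting synonyms as matches as well; the set intersection above corresponds to the strict “term matching” case.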
The results of the first experiment show a precision value of 40% and a recall of
30% (figure 4.1). In this case, not only the generated classes that fully overlapped
the manually generated concepts were considered “good”, but also the synonyms of
the terms representing a class. These values are in line with other comparative
results for the evaluation of semi-automatically generated concepts against gold
standard ones. Weber and Buitelaar (2005), in fact, evaluated an automatically
generated ontology of Sports Events against a manually generated “gold standard”
one and presented similar results: 32% precision and 46% recall. So they have a
better recall value and a worse precision one. We also considered the F-Measure
values for a better comparison (the F-Measure is the harmonic mean of the precision
and recall values). The results for this first experiment are the following: we have
an F-Measure of 0.34 while they present a value of 0.37.
[Bar chart: Precision = 40%; Recall = 30%]
Figure 4.1. Experiment n.1: Precision and Recall results.
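The harmonic-mean computation used for the comparison above can be sketched as follows (a minimal illustration of the F-Measure formula, not the evaluation code actually used):

```python
# Sketch of the F-Measure: the harmonic mean of precision and recall.

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Experiment n. 1: precision 40%, recall 30%.
print(round(f_measure(0.40, 0.30), 2))  # -> 0.34, as reported in the text
```

The harmonic mean penalizes imbalance between the two values, which is why a system with better recall but worse precision (such as the Weber and Buitelaar result) can still end up with a comparable F value.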
After the first, a second pilot experiment was carried out considering semantically
based matching criteria for the measurement. In the second experiment, in fact, even
the manual instances of the energy security class were considered as “concepts”
of the class itself. That operation was possible because Ontogen doesn’t make any
conceptual distinction between the classes (representing the concepts) and the
instances of the manual ontology. So, in order to have a real comparison between the
relevant conceptual entities generated by the two approaches, this kind of
operation was considered legitimate. Moreover, even some semantically related
concepts of the energy security class, not formally present in the ontological
space of that class but at other levels of the ontology, were considered as
“concepts” of the energy security class (e.g. the concept “geopolitical problems”,
not directly present within the hierarchy of energy security but semantically
linked to this concept by many horizontal links, was considered a “concept”
of energy security). Of course, with these manipulations, the numerator and the
denominator of the recall calculation changed, and so we obtained a different,
better recall value with respect to the first experiment. The precision value also
improved with these adjustments because, by considering as relevant even entities
not counted in the previous measurement, the numerator increases while the
denominator stays the same. The new results are shown below in figure 4.2. The
comparison between the results of experiments n. 1 and n. 2 is presented in
figures 4.3 and 4.4.
77
[Bar chart: Precision = 45%; Recall = 40%]
Figure 4.2. Experiment n. 2. Precision and Recall Results
[Bar chart: First Recall Value = 30%; Second Recall Value = 40%]
Figure 4.3. Experiments n 1-2: Recall value comparison considering semantic adjustment
[Bar chart: First Precision Value = 40%; Second Precision Value = 45%]
Figure 4.4. Precision Values Comparison considering semantic adjustments.
The results of this first pilot study on the Energy Domain ontology are quite aligned
with other evaluation results obtained for this task of comparing manual
“gold standard” ontologies against semi-automatically or automatically generated
ones (e.g. Weber et al., 2005).
In the next paragraph the results of new experiments will be shown, conducted with
more documents given as input to the Ontogen system in order to let it work better.
Our expectation before these new experiments was that the number of generated
concepts, and of relevant generated concepts, would increase, with repercussions on
the precision and recall measures as well. These expectations have been
substantially confirmed by the new results.
4.3 New Experiments
After the first pilot study, two new experiments were carried out for the evaluation of
the semi-automatic ontology building with Ontogen. In these experiments more
documents were provided to the system for the ontology generation
(50 instead of the previous 25). The class considered for the evaluation was again
the “energy security” class within the energy domain ontology.
As expected, the results showed some improvement because, generally, the more
documents are given as input to this kind of system, the better it works.
For the new experiments we followed the same method used for the pilot study: in
the first step (the first experiment) we calculated the overlap between the manual
and the semi-automatic ontology exclusively considering the “concept name
alignment” between the two (and, as in the pilot study, synonyms were
considered “good” and counted). So, in this first case, the comparison was
mainly based on a sort of “term matching” between the manual terms used to
represent a concept within the ontology and those semi-automatically created.
Then, for the second experiment, more semantically motivated matching criteria
were considered for the calculation of the precision and recall values and, as verified
in the pilot study (and as expected), these two values increased.
Figure 4.5 shows the precision and recall results of the first new
experiment. We note that the precision value (42%) is quite similar to the precision
value of the first experiment conducted for the pilot study (40%). The recall value,
instead, had a meaningful increase (38% compared to the previous 30% of
the pilot study).
[Bar chart: Precision = 42%; Recall = 38%]
Figure 4.5. Precision and Recall results with 50 documents as input
The results of the second experiment, shown in figure 4.6, tend instead to confirm
the trend of the pilot study: both precision and recall grew when considering more
tolerant semantics-based matching criteria.
[Bar chart: Precision = 50%; Recall = 42%]
Figure 4.6. Experiment n. 4: Results
From the comparison of the experimental results, two kinds of relevant information
emerge. The first is that the precision value, if calculated without any semantic
adjustment, does not seem to be much affected by the increase in the number of
documents given as input to the system for the semi-automatic ontology generation.
The new results, in fact, are not so different from those obtained using 25 documents
to generate the ontology: the range of variation is only 2 percentage points for this
value (from 40% to 42%, see figure 4.7).
[Bar chart: Precision = 40% (Exp. 1); 42% (Exp. 3)]
Figure 4.7. Precision Value Trend for the Experiments 1 and 3
The second relevant piece of information is the opposite behaviour of the recall
value in passing from the ontology built with 25 documents to that built with 50
(considering the results obtained without semantic adjustments). Recall, in fact,
seems to be more influenced by the increased number of documents (see figure 4.8):
this value passed from the 30% of the first experiment conducted for the pilot
study to the 38% of experiment n. 3 (the first of the two new experiments).
So, at first sight, it seems that, given more input documents, the system
improves more on the recall measure than on the precision one.
[Bar chart: Recall = 30% (Exp. 1); 38% (Exp. 3)]
Figure 4.8: Experiments n.1 and 3, Recall Value Trend
Considering the precision and recall results with semantic adjustments (passing
from 25 to 50 documents), we can observe that the situation doesn’t change much:
both measures register an improvement. As for precision, its value passes from the
45% of experiment n. 2 of the pilot study (with 25 documents) to the 50% of
experiment n. 4 with 50 documents (figure 4.9). The recall value, instead, passes
from the 40% of the pilot study to 42% (figure 4.10).
[Bar chart: Precision = 45% (Exp. 2); 50% (Exp. 4)]
Figure 4.9: Precision Value Trend for the Experiments 2 and 4.
[Bar chart: Recall = 40% (Exp. 2); 42% (Exp. 4)]
Figure 4.10. Recall Value Trend for the Experiments 2 and 4
Putting the obtained results in a more global perspective, i.e. comparing them with
other similar evaluation results, we can draw other interesting considerations
about the experiments. The first is that the measured precision values (considering
both the non-semantically enhanced and the semantically enhanced results) are
always better than those of other similar evaluations, such as the research of
Weber et al. (2005). On the other hand, the recall values are always lower when
compared with the Weber et al. results. Considering the F-Measure for both
experimental situations (with non-semantic and with semantically enhanced criteria)
allowed us to make a better comparison between the obtained results and the
Weber et al. results. Figure 4.11 shows the “F” values obtained with the
non-semantic criteria (i.e. experiments n. 1 and 3) compared with the “F” of the
Weber et al. research.
[Bar chart: F = 0.34 (Exp. 1); 0.37 (Weber et al.); 0.42 (Exp. 3)]
Figure 4.11. F Measure Comparison considering non semantic criteria
The figure shows an improvement of F in passing from the situation with 25
documents (experiment n. 1) to that with 50 documents (experiment n. 3). The “F”
obtained in experiment n. 3 scores higher than the F obtained in the Weber et al.
research, and this can be considered a quite satisfying result.
Figure 4.12 shows the values of “F” obtained with the semantically
enhanced criteria (experiments n. 2 and 4). As shown in the graph, the F
values are better than the Weber result.
[Bar chart: F = 0.39 (Exp. 2); 0.37 (Weber et al.); 0.45 (Exp. 4)]
Figure 4.12. F Measure comparison considering semantic adjustment
A global overview of the F-Measure results is given in figure 4.13. We note that
only the F-Measure value of the first experiment scores a bit lower than the
Weber et al. result (represented by the second bar in the graph), while the other
three values always present a better result.
[Bar chart, F-Measure values: Exp. 1 = 0.34; Weber et al. = 0.37; Exp. 2 = 0.39; Exp. 3 = 0.42; Exp. 4 = 0.45]
Figure 4.13. F-Measure
Chapter 5. Discussion and Conclusions
This last chapter is organized as follows: in the first part (paragraph 5.1) some
aspects concerning the methodology used for the experimental research will be
briefly discussed. Within this part we will try to answer some potentially critical
questions that could be raised about the work conducted. Then (in paragraph 5.2)
a concrete proposal to improve Ontogen’s activity will be presented and motivated.
The last part of the chapter (par. 5.3) is dedicated to the conclusions of the whole
work.
5.1 About the methodology
In this paragraph we will try to foster a discussion about the methodology used for
the experiments, simulating a sort of question-and-answer dialogue with an
“advocatus diaboli”. The aim of this paragraph is to clarify some aspects of the
methodology used for the research, trying to answer some hypothetical
critical questions that could be raised while reading this work. The questions we
will try to answer are presented in the following list:
1. Do you think that the choice of calculating the precision and recall values
considering even what you have called “semantic based criteria” is justifiable?
2. Do you think that the semantic criteria you chose can be considered
universally accepted?
3. You have evaluated a micro-ontology. Do you think it is possible to
generalize the results?
4. Do you think that other evaluation methods for the comparison of the manual
versus the semi-automatic approach could be used?
5. The semi-automatically generated ontology has been built with 50 documents.
Don’t you think that they are too few?
6. Do you think that the documents given as input to the system all have the
same value regarding the knowledge that can be extracted?
7. How could this research be extended in future work? What could be the
interesting topics to investigate?
8. According to the obtained results, how do you think the Ontogen
software could be improved?
Answers:
1. Precision and recall are classical measures in information retrieval; hence,
redefining and using them in the context of semi-automatic ontology generation
makes the results intuitively understandable to the IR community. Calculating
precision and recall considering even aspects related to the semantics of the
concepts makes sense, because it allows a deeper evaluation of the comparison
between the conceptual entities generated by the two different ontology building
approaches. If we stopped at the first “term matching based” evaluation (i.e. the
results of experiments n. 1 and n. 3) we wouldn’t have a complete picture of the
real differences between the manual and the semi-automatic approach, because,
for example, a semi-automatically generated term semantically close to a term of
the manual ontology, but not formally represented within it, wouldn’t be counted
as relevant (this is the case of the concept “geopolitical problems” which we
encountered before). And that would be a mistake, because all those terms
representing relevant “aspects of meaning” (De Mauro, 1992) of a concept of the
ontology wouldn’t be counted as “good”.
2. Our evaluation has been done on a particular class: energy security. In the
literature there isn’t a clear definition of this concept, because it involves
different aspects and fields. In our case, for example, we considered (in
experiments n. 2 and 4) the subclass “geopolitical problems” as a sub-concept
of energy security because it is semantically related to it. Another
consideration that could be made is the following: concepts such as
“geopolitical problems” are not formally included within the hierarchy of
energy security because the objective pursued with the creation of the
ontology was to build an ontology for the energy domain, not for energy
security. But surely, if the target had been to build an energy security
ontology, then this kind of concept would have been formally inserted within
the hierarchy of the energy security class. So, about the acceptance of the
criteria: they probably wouldn’t be accepted in general but, in our opinion,
this kind of choice (and its evaluation) can be made mainly by the
ontologists and the domain experts concretely involved in the domain
ontology building, because these problems refer to a specific task in a
specific domain.
3. The obtained results can be considered a sort of direction indicator. They are
specific to a given domain and to Ontogen. Of course they must be verified
on a larger scale to reach a higher level of generalization strength. The
fact that a micro-ontology has been evaluated doesn’t represent a problem
for the generalization of the results, provided the methodology followed for the
evaluation is clear and accepted. To that end we have based the evaluation on
a well-known approach to ontology evaluation, based on two quantitative
measures well known in Information Retrieval: precision and recall.
4. As briefly mentioned in the previous chapter, there are many approaches to
ontology evaluation, and therefore other aspects could be considered
for the evaluation. An interesting option could be the following: to use an
automatic evaluation system for the calculation of precision and recall based
on the “term matching criteria” (the system would only have to perform a “term
matching” between the concept-terms used in the two ontologies and then
calculate the values).
5. The documents considered to build the ontology are the same ones considered by
the humans for the manual task. The pilot study has been done on an
ontology generated with 25 annotated documents (representing our
independent variable) because, as mentioned in different parts of this work,
the main activity on which Ontogen bases the concept creation is
the extraction of keywords from the documents given as input. Keyword
extraction systems usually need only a few training documents to learn a
model and start operating with an acceptable margin of error (e.g. Witten et
al. 1999 demonstrated that the keyword extraction system based on the KEA
algorithm, see Frank et al. 1999, needs only 20-25 documents to achieve
very good results). Of course we also ran other experiments with more
documents (experiments n. 3 and 4) to evaluate the degree of variation of the
precision and recall measures.
6. The sources given as input to the system are important for the
knowledge extraction operations. They could be selected and provided
by a domain expert. A future experiment could regard the
creation of a semi-automatic ontology exclusively from documents selected
by a domain expert, in order to see whether there are meaningful differences
in the knowledge extracted, and then represented in the ontology, in
comparison with a semi-automatic ontology built without considering
documents provided by domain experts.
7. An interesting aspect to investigate is up to what number of documents there
is an improvement of the precision and recall measures (i.e. when the
differences stabilize).
8. Ontogen’s activity could be improved by considering different approaches to
keyword extraction. This issue will be explained in the next paragraph.
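The automatic term-matching evaluator proposed in answer 4 could be sketched as follows; the normalization step, the synonym table and all concept names are assumptions made for the example:

```python
# Hypothetical sketch of an automatic term-matching evaluator: concept
# names from the two ontologies are normalized, optionally mapped
# through a hand-made synonym table, and then matched; precision and
# recall are computed from the matches.

def normalize(term, synonyms):
    term = term.strip().lower()
    return synonyms.get(term, term)   # map a term to its canonical synonym

def evaluate(generated, gold, synonyms=None):
    synonyms = synonyms or {}
    gen = {normalize(t, synonyms) for t in generated}
    ref = {normalize(t, synonyms) for t in gold}
    matched = gen & ref
    return len(matched) / len(gen), len(matched) / len(ref)

synonyms = {"petroleum": "oil"}           # illustrative synonym entry
generated = ["Petroleum", "pipeline", "market"]
gold = ["oil", "pipeline", "geopolitical problems", "supply"]
precision, recall = evaluate(generated, gold, synonyms)
print(precision, recall)  # 2 matches: precision 2/3, recall 2/4
```

Such a tool would automate only the strict “term matching” case (with an optional synonym table); semantic adjustments like those of experiments n. 2 and 4 would still require a human judgment.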
5.2 Proposal: a linguistically motivated keyphrases extraction
system for Ontogen
The keyword extraction process represents one of the major features on which the
Ontogen software for semi-automatic ontology building is based. This module of
the system, in fact, extracts from the documents given as input a set of
keywords that represents a list of “candidate concepts” for the ontology
under construction (it is then the user who chooses whether these candidate
keywords are representative and “good” enough to become concepts of the
ontology). Moreover, the extracted keywords also represent the “default
value” assigned to the names of the new concepts of the semi-automatically
generated ontology. And that, as pointed out in the previous pages, is one of the
cons of Ontogen’s activity when compared with manual concept name assignment.
This kind of misalignment could be reduced by making some changes to the Ontogen
keyword extraction module: that would allow a better keyword extraction
process and, therefore, better results in the concept name assignment and,
consequently, in the alignment of concept names with the manual concepts.
One of the major cons of Ontogen’s keyword extraction module is that it mainly
retrieves uni-grams, i.e. terms composed of only one word (the system allows
setting the length of the terms to extract, but it mainly extracts uni-grams), which
aren’t representative of the concept terms used by humans to classify a domain.
Our hypothesis is that the attention should be focused on keyphrase extraction
rather than on uni-gram keywords. The manually extracted concepts, in fact, are
mainly represented by keyphrases rather than by single words. Keyphrases are
“linguistic units usually larger than a word but smaller than a full sentence”
(Caropreso et al., 2002). According to this view, an interesting proposal for the
improvement of Ontogen’s results could be the integration, within the system’s
package, of a linguistically motivated keyphrase extraction system evaluated in
international competitions, such as LAKE (D’Avanzo et al. 2005).
LAKE’s methodology (the acronym stands for Linguistic Analysis based Knowledge
Extractor) is based on Keyphrase Extraction (KE) and makes use of a learning
algorithm to select linguistically motivated keyphrases from a list of candidates,
which are then merged to form a document summary. The underlying hypothesis of
this system is that linguistic information is beneficial for the task, an
assumption completely new with respect to the KE state of the art. KE is performed
as a supervised machine learning task: a classifier is trained on documents
annotated with known keyphrases. The trained classifier is subsequently applied to
documents for which no keyphrases are assigned: each candidate term from these
documents is classified either as a keyphrase or as a non-keyphrase. Both the
training and the extraction processes choose a set of candidate keyphrases (i.e.
potential terms) from their input document and then calculate the values of the
features for each candidate.
The LAKE system works as follows: first, a set of linguistically motivated candidate
phrases is identified. Then, a learning device chooses the best phrases. Finally,
the keyphrases at the top of the ranking are merged to form a summary. The candidate
phrases generated by LAKE are sequences of Part-of-Speech tags containing Multiword
expressions and Named Entities (e.g. proper names, locations and dates). These
elements are defined as “patterns” and are stored in a patterns database; from there,
the main work is done by a learning device. The linguistic database makes LAKE
unique in its category. The system consists of three main components (see
LAKE’s architecture in figure 5.1): a Linguistic Pre-Processor, a Candidate Phrase
Extractor and a Candidate Phrase Scorer.
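The pattern-based identification of candidate phrases described above can be sketched in a very simplified form. The tag set, the patterns database and the sentence below are illustrative assumptions for the example, not LAKE’s actual data:

```python
# Simplified sketch of pattern-based candidate phrase extraction:
# (token, POS) pairs are matched against a small "patterns database"
# of POS sequences.  All data here is illustrative.

# (token, POS) pairs as a linguistic pre-processor might produce them
tagged = [("the", "DET"), ("energy", "NOUN"), ("security", "NOUN"),
          ("of", "ADP"), ("transit", "NOUN"), ("states", "NOUN")]

# a tiny patterns database: POS sequences that form candidate phrases
patterns = [("NOUN", "NOUN"), ("ADJ", "NOUN")]

def candidate_phrases(tagged, patterns):
    found = []
    for size in {len(p) for p in patterns}:
        for i in range(len(tagged) - size + 1):
            window = tagged[i:i + size]
            # keep the window if its POS sequence matches a pattern
            if tuple(pos for _, pos in window) in patterns:
                found.append(" ".join(tok for tok, _ in window))
    return found

print(candidate_phrases(tagged, patterns))
# -> ['energy security', 'transit states']
```

In LAKE the candidates selected this way are then scored by the learning device; here the sketch stops at the identification step.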
Figure 5.1. LAKE Architecture.
The system accepts a document as input. The document is first processed by the
Linguistic Pre-Processor, which tags the whole document, identifying Named
Entities and Multiwords as well. Then candidate phrases are identified based on the
pattern database. Up to this point the process is the same for the training and
extraction stages. In the training stage, however, the system is furnished with
annotated documents, i.e. documents for which keyphrases are supplied by the
author. The Candidate Phrase Scorer module is equipped with a procedure that
looks, for each author-supplied keyphrase, for a candidate phrase that can be
matched, thus identifying positive and negative examples. The model that comes out
of this step is then used in the extraction stage. The last module, once the
keyphrases are extracted, uses a scoring mechanism to select the most
representative ones.
The LAKE system has been evaluated with encouraging results in many international
competitions, such as DUC (Document Understanding Conferences) sponsored
by DARPA, and in different fields (e.g. knowledge management, see D’Avanzo et
al. 2007) and tasks: Single Document Summarization (D’Avanzo et al. 2004), Text
Categorization (2005), Summaries for Small Screen Devices (D’Avanzo and Kuflik,
2005), Multi-Document Summarization (D’Avanzo et al. 2007a).
Especially its application in the multi-document summarization task could be very
useful for the realization of relevant short summaries of documents through the
extraction of representative keyphrases.
A future work that could be done to verify the soundness of the proposal consists in
evaluating the LAKE system against the keyword extraction module of Ontogen.
All the data necessary for the comparison experiment are available: there are the
manual “gold standard” keyphrases extracted to name the concepts of the ontology,
and the Ontogen-extracted keywords for each document or cluster of documents.
What should be done is to evaluate the LAKE results on the same documents given
as input to Ontogen and then to make a double comparison of the LAKE and
Ontogen module results against the manually extracted keyphrases inserted in the
ontology.
5.3 Conclusions
This work started with a set of questions regarding the relation between the
manual and the semi-automatic approach to ontology building. Some of them
are reported here: “…Are these approaches (manual vs semi-automatic) alternative
or can they be integrated?... Is it realistic, in the near future, to think of a complete
automatization or, at least, of a semi-automatization of the whole process of
ontology building?”.
In order to answer these questions, this work presented a concrete comparison
between the manual and the semi-automatic building of a domain specific
ontology. The comparison has been made through an evaluation of the results
obtained by the two different methods.
Precision and Recall, adapted to the specific task and calculated under two
different experimental conditions, have been used to measure the degree of
overlap between the manually and the semi-automatically generated ontologies.
The results showed that it is possible to semi-automatically generate a domain
ontology from documents in a specific domain, that the ontology improves as
more documents are used, and that it may improve further by introducing some
semantics into the process. However, there is still a great difference between the
semi-automatic ontology and a manual one, a result that is similar to what is already
known. From the obtained results some indications can be extracted. The first
concerns the gap between the human and the semi-automatic system results
(Ontogen in this specific case): in the ontology building task this difference still
appears high and difficult to bridge. But the results appear quite encouraging if
read in perspective and compared with other similar research evaluations for the
same task (e.g. the work of Weber et al., 2005). However, the use of semantics
seems to be the right way to improve the process, so finally a motivated proposal,
which needs to be tested, to improve the keyword extraction module of the Ontogen
system has been made.
Nowadays, an interesting use of semi-automatic systems like OntoGen lies in
their integration with the manual approach. One of the main
capabilities of OntoGen is its ability to discover potentially relevant concepts that
humans have not considered during their categorization, so it can be useful to
support manual ontology building. In our opinion, however, this aid can be
provided only in a second phase, after the realization of a first skeleton model of
the domain world to be represented within the ontology. In this way such systems
can be very useful for enriching and populating existing ontologies or initial
ontology skeletons.
Of course, one of the major challenges for Semantic Web research in the years
to come is the improvement of the automatic and semi-automatic
systems used for ontology building, in order to allow an easy construction and
proliferation of ontologies. And, in order to achieve a real, continuous improvement
of those systems, and a gradual alignment with the manual “gold standard”
results, the evaluation phase plays a role of strategic importance. In fact, only by evaluating
the results of these systems will it be possible to discover where and how
to modify them so that they work better. An important aspect
of this issue is the lack of a globally standardized approach to
ontology evaluation. The field is fragmented into different
approaches, and this situation makes it difficult to compare the results of
studies carried out with different methods. What should be
done, then, is to create a comprehensive evaluation framework accepted and used as a
standard. And, of course, the way to achieve this
objective is always the same: research.
References
Agrawal R., Imielinski T., Swami A. “Mining association rules between sets of
items in large databases”. In Proceedings of the ACM SIGMOD Conference on
Management of Data, pp. 207-216, 1993.
Alfonseca E., Manandhar S. “Extending a Lexical Ontology by a Combination of
Distributional Semantic Signatures”. EKAW02, Sigüenza, Spain, published in
Lecture Notes in Artificial Intelligence 2473, 2002.
Aussenac-Gilles N., Biebow B., Szulman S. “D’une méthode à un guide pratique
de modélisation de connaissances à partir de textes”. Terminologie et IA, TIA
2003. Ed. Rousselot. Strasbourg (pp. 41-53).
Aussenac-Gilles N., Biebow B., Szulman S. “Corpus Analysis for Conceptual
Modelling”. Workshop on Ontologies and Text, Knowledge Engineering and
Knowledge Management: Methods, Models and Tools, 12th International
Conference EKAW00, Juan-les-Pins, France. Springer Verlag. 2000.
Bachimont B., Isaac A., Troncy R. “Semantic commitment for designing
ontologies: a proposal”. In A. Gomez-Perez and V.R. Benjamins (eds.): EKAW
2002, LNAI 2473, pp 114-121, Springer Verlag, 2002.
Berners-Lee T., Hendler J., Lassila O. (2001). “The Semantic Web”. Scientific
American 284(5): 34-43.
Berners-Lee T. “Weaving the Web: the original design and the ultimate destiny of
the World Wide Web by its inventor”. HarperCollins Publishers. New York. 1999.
Berners-Lee T. “L'architettura del nuovo Web”, Feltrinelli, 2001.
Biebow B., Szulman S. (1999). “TERMINAE: a linguistic-based tool for the
building of a domain ontology”. In EKAW ’99. Proceedings of the 11th European
Workshop on Knowledge Acquisition, Modelling and Management. Dagstuhl,
Germany, LNCS, pp. 49-66. Berlin. Springer-Verlag.
Bisson G., Nedellec C., Cañamero D. “Designing Clustering Methods for
Ontology Building. The Mo’K Workbench”, 2000.
Brank J., Grobelnik M., Mladenic D. “A Survey of Ontology Evaluation
Techniques”. In: SIKDD 2005 at multiconference IS 2005, Ljubljana, Slovenia,
October 2005.
Brewster C., Alani H., Dasmahapatra S., Wilks Y. “Data Driven Ontology
Evaluation”. Proceedings of LREC, 2004.
Buitelaar P., Cimiano P., Grobelnik M., Sintek M. “Ontology Learning from
Text”. Tutorial at ECML/PKDD, Porto, October 2005.
Buitelaar P., Olejnik D., Sintek M. “A Protégé Plug-in for Ontology Extraction
from Text based on Linguistic Analysis”. In Proceedings of the 1st European
Semantic Web Symposium (ESWS), Heraklion, Greece, 2004.
Caropreso M.F., Matwin S., Sebastiani F. (2001). “A learner-independent
evaluation of the usefulness of statistical phrases for automated text categorization”.
In Amita G. Chin (Ed.), Text Databases and Document Management: Theory and
Practice (pp. 78-102). Hershey (US): Idea Group Publishing.
de Chalendar G., Grau B. “A system to classify words in context – SVETLAN”. In S.
Staab, A. Maedche, C. Nedellec, P. Wiemer-Hastings (eds). Proceedings of the
Workshop on Ontology Learning, 14th European Conference on Artificial
Intelligence, ECAI’00, Berlin, Germany, August 20-25, 2000.
Cimiano P., Frank A., Racioppa S., Buitelaar P. “SOBA: SmartWeb Ontology-
based Annotation”, Proceedings of the Demo Session at the International Semantic
Web Conference, November 2006.
Craven M., Di Pasquo D., Freitag D., McCallum A., Mitchell T., Nigam K.,
Slattery S., “Learning to Construct Knowledge Bases from the World Wide Web”,
Artificial Intelligence, 118(1-2), pp 69-114, 2000.
D'Avanzo E., Elia A., Kuflik T., Vietri S. LAKE system at DUC 2007. In
Proceedings of Human Language Technology Conference/North American chapter
of the Association for Computational Linguistics Annual Meeting (HLT-NAACL
2007). Rochester, NY, April 22-27, 2007.
D'Avanzo E., Kuflik T., Elia A., Lieto A., Preziosi R. Where Does Text Mining
Meet Knowledge Management? A Case Study. In itAIS07, Springer, 2007.
D'Avanzo E., Kuflik T. Linguistic Summaries on Small Screens. In: Data Mining
VI. Zanasi A., Brebbia C., Ebcken N. (eds.), WIT Press, pp. 195-204, 2005.
D'Avanzo E., Magnini B. A Keyphrase-Based Approach to Summarization: the
LAKE System at DUC-2005. DUC Workshop. Proceedings of Human Language
Technology Conference / Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP 2005). Vancouver, B.C., Canada, October 6-8, 2005.
D'Avanzo E., Magnini B., Vallin A.. Keyphrase Extraction for Summarization
Purposes: The LAKE System at DUC-2004. DUC Workshop. Proceedings of
Human Language Technology conference / North American chapter of the
Association for Computational Linguistics annual meeting (HLT/NAACL 2004).
Boston, May 2-7, 2004
De Mauro T. Minisemantica. Laterza. Roma. 1992.
Deitel A., Faron C., Dieng R. “Learning ontologies from RDF annotations”. In
Proceedings of the IJCAI Workshop on Ontology Learning, Seattle, 2001.
Engels R. (2001). Corporum OntoExtract. Ontology Extraction Tool. Deliverable 6.
Ontoknowledge. http://www.ontoknowledge.org/del.shtml
Engels R. (2002). Corporum OntoWrapper. Extraction of structured information
from web-based resources. Deliverable 7. Ontoknowledge.
http://www.ontoknowledge.org/del.shtml
Faure D., Poibeau T. “First experiment using semantic knowledge learned by
Asium for information extraction using INTEX”. In: S. Staab, A. Maedche, C.
Nedellec, P. Wiemer-Hastings (eds). Proceedings of the Workshop on Ontology
Learning, 14th European Conference on Artificial Intelligence, ECAI00, Berlin,
2000.
Fortuna B., Mladenic D., Grobelnik M. “Visualization of text document
corpus”. Informatica 29 (2006), 497-502.
Fortuna B., Grobelnik M., Mladenic D. “Background Knowledge for Ontology
Construction”. Poster at WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
Fortuna B., Grobelnik M., Mladenic D. “OntoGen: Semi-automatic Ontology
Editor”. HCI International 2007, July 2007, Beijing.
Fortuna B., Grobelnik M., Mladenic D. “Semi-automatic Construction of Topic
Ontology”. Semantics, Web and Mining, Joint International Workshop, EWMF
2005 and KDO 2005, Porto, Portugal, October 3-7, 2005, Revised Selected Papers.
Fortuna B., Grobelnik M., Mladenic D. “Semi-automatic Data-driven Ontology
Construction System”. Proceedings of the 9th International multi-conference
Information Society IS-2006, Ljubljana, Slovenia.
Fortuna B., Grobelnik M., Mladenic D. “System for Semi-automatic Ontology
Construction”. Demo at ESWC 2006, June 11-14, 2006, Budva, Montenegro.
Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning C.G.
"Domain-specific keyphrase extraction" Proceedings of Sixteenth International
Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San
Francisco, CA, pp. 668-673, 1999.
Frege G. “Begriffsschrift, eine der arithmetischen nachgebildete Formelsprache
des reinen Denkens”, Halle a. S.: Louis Nebert. Translated as Concept Script, a
formal language of pure thought modelled upon that of arithmetic, by S. Bauer-
Mengelberg in J. van Heijenoort (ed.), From Frege to Gödel: A Source Book in
Mathematical Logic, 1879-1931, Cambridge, MA: Harvard University Press, 1967.
Gangemi A., Catenacci C., Ciaramita M., Lehmann J. “A Theoretical
Framework for Ontology Evaluation and Validation”. In Proceedings of SWAP.
2005.
Gómez-Pérez, A. and D. Manzano-Macho. “A survey of ontology learning
methods and techniques”. OntoWeb Project. 2003.
Gruber T. “A Translation Approach to Portable Ontology Specifications”.
Knowledge Acquisition, 5(2), 1993, pp. 199-220.
Guarino N., Welty C. “Evaluating Ontological Decisions with OntoClean”.
Communications of the ACM, 45(2): 61-66, February 2002.
Hahn U., Schulz S. “Towards Very Large Terminological Knowledge Bases: A
Case Study from Medicine”. Canadian Conference on AI, 2000, pp. 176-186.
Harris Z. S. “Papers in Structural and Transformational Linguistics”, Reidel,
Dordrecht, 1970.
Hearst, M. “Automatic acquisition of hyponyms from large text corpora”. In
Proceedings of the 14th International Conference on Computational Linguistics.
Nantes, France, 1992
Jannink J., Wiederhold G. “Ontology maintenance with an algebraic
methodology: A case study”. In Proceedings of AAAI Workshop on ontology
management, July 1999.
Khan L., Luo F. “Ontology Construction for Information Systems”. In
Proceedings of the 14th IEEE International Conference on Tools with Artificial
Intelligence, pp. 122-127, Washington DC, November 2002.
Kashyap V. “Design and Creation of Ontologies for Environmental Information
Retrieval”. 20th Workshop on Knowledge Acquisition, Modeling and Management.
Alberta. Canada. October 1999.
Kietz J., Maedche A., Volz R. “A Method for Semi-Automatic Ontology
Acquisition from a Corporate Intranet”. In: Aussenac-Gilles N., Biebow B.,
Szulman S. (eds), EKAW00 Workshop on Ontologies and Text, Juan-les-Pins, France.
CEUR Workshop Proceedings 51: 4.1-4.14, Amsterdam, 2000.
Lozano-Tello A., Gomez-Perez A. “Ontometric: A Method to Choose the
Appropriate Ontology”, Journal of Database Management, 15(2): 1-18, 2004.
Maedche A., Staab S. “Ontology Learning”. IEEE Intelligent Systems, Special
Issue on the Semantic Web, 16(2), 2001.
Maedche A., Staab S. (2001). “Ontology Learning for the Semantic Web”. IEEE
Intelligent Systems, Special Issue on the Semantic Web.
Maedche A., Volz R. “The Text-to-Onto Ontology Extraction and Maintenance
Environment”. Proceedings of the ICDM Workshop on integrated data mining and
knowledge management, San Jose, California, USA.
Maedche A., Staab S. “Measuring Similarity between Ontologies”. Proceedings
of EKAW02, LNAI vol. 2473, 2002.
Modica G., Gal A., Jamil H.M. “The Use of Machine Generated Ontologies in
Dynamic Information Seeking”. In Proceedings of the Sixth International Conference
on Cooperative Information Systems, Springer-Verlag, 2001.
Navigli R., Velardi P. “Learning Domain Ontologies from Document Warehouses
and Dedicated Web Sites”. Computational Linguistics, 30(2), MIT Press, 2004.
Nobecourt J. “A method to build formal ontologies from text”. In EKAW00,
Workshop on ontologies and text, Juan-Les-Pins, France. 2000.
Oliveira A., Pereira F.C., Cardoso A. “Automatic Reading and Learning from
Text”. In Proceedings of the International Symposium on Artificial Intelligence,
ISAI01, December 2001.
Oliveira A., Pereira F.C., Cardoso A. “Extracting Concept Maps with Clouds”.
Argentine Symposium on Artificial Intelligence, ASAI00, Buenos Aires, Argentina,
2000.
Papatheodorou C., Vassiliou A., Simon B. “Discovery of Ontologies for Learning
Resources Using Word-based Clustering”. Copyright by AACE. Reprinted from
ED-MEDIA with the permission of AACE. Denver, USA, 2002.
Pereira F.C., “Modelling Divergent Production: a Multi Domain Approach”.
European Conference on Artificial Intelligence, ECAI 98, Brighton, UK, 1998.
Poesio M. “Domain Modelling and NLP: Formal ontologies? Lexica? Or a bit of
both?” Journal of Applied Ontology. pp.27-33. IOS press. 2005.
Porzel R., Malaka R. “A Task-based Approach for Ontology Evaluation”.
Proceedings of ECAI, 2004.
Reinberger M.L., Spyns P. “Discovering Knowledge in Text for the Learning of
DOGMA-inspired Ontologies”. Proceedings of OLP04 (Workshop on Ontology
Learning and Population at ECAI), 2004.
Rigau G., Rodriguez H. and Agirre E. “Building accurate semantic taxonomies
from Monolingual MRDs”. Proceedings of the 17th International Conference on
Computational Linguistics, Montreal, Canada, 1998.
Rosch E. “Cognitive Representation of Semantic Categories”. Journal of
Experimental Psychology 104: 573-605, 1975.
Rubin D., Hewett M., Oliver D.E., Klein T.E., Altman R.B. “Automatic data
acquisition into ontologies from pharmacogenetics relational data sources”.
Proceedings of the Pacific Symposium on Biocomputing, Lihue, HI, 2002.
Schutz, A., Buitelaar, P. “RelExt: A Tool for Relation Extraction from Text in
Ontology Extension”. 4th ISWC, Galway, 2005.
Snoussi H., Magnin L. and Nie J.Y., “Towards an Ontology-based Web Data
Extraction”, Centre de recherche informatique de Montréal (Canada), Université
de Montréal (Canada), 2002.
Soergel D., Aussenac-Gilles N. “Text analysis for ontology and terminology
engineering”. Journal of Applied Ontology. pp. 35-46. IOS press. 2005.
Stojanovic L., Stojanovic N., Volz R. “Migrating Data-Intensive Web Sites into
the Semantic Web”. Proceedings of the 17th ACM Symposium on Applied Computing
(SAC), ACM Press, pp. 1100-1107, 2002.
Suryanto H., Compton P. “Discovery of Ontologies from Knowledge Bases”. In
Proceedings of the First International Conference on Knowledge Capture, Victoria,
BC, Canada, October 2001.
Weber N., Buitelaar P. “Web-based Ontology Learning with ISOLDE”. In
Proceedings of the Workshop on Web Content Mining with Human Language at the
International Semantic Web Conference, Athens, GA, USA, Nov. 2006.
Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G.
"KEA: Practical automatic keyphrase extraction." Proceedings of DL '99, pp. 254-
256, 1999.
Wittgenstein L., “Philosophical Investigations”. Edited by G. E. M. Anscombe,
and R. Rhees, and translated by G.E.M. Anscombe. Oxford: Blackwell, 1953.
Yamaguchi T. “Constructing Domain Ontologies based on Concept Drift Analysis”.
In Proceedings of the IJCAI’99 Workshop on Ontologies and Problem Solving
Methods: Lessons Learned and Future Trends. Stockholm, Sweden, 1999.