Faculty of Letters and Philosophy
Master's Degree Course in Corporate and Public Communication
Degree Thesis in
Computer Science for Electronic Commerce
Manually vs semiautomatic domain specific ontology building
Supervisor: Prof. Ernesto D'Avanzo
Candidate: Antonio Lieto (matr. 0320400079)
Co-supervisor: Prof. Tsvi Kuflik
Academic year 2007-2008
Acknowledgements
There are many people I have to thank for the realization of this work. First of all I want to thank my advisor, Prof. Ernesto D'Avanzo. He was always available to me during this year of thesis research, giving me crucial advice for the development of this work (regarding both the theoretical and the experimental part). Another special person I wish to thank is Prof. Tsvi Kuflik. His comments and considerations about my work were very important for its improvement. If any errors or mistakes remain, I am the only one to blame: their guidance was excellent.
I also want to thank Dr. Brenda Schaffer for her suggestions on the manual ontology building, Professors Roberto Cordeschi and Annibale Elia for their constant interest in the whole project, which involved other Master's Degree students of the University of Salerno, and Prof. Marcello Frixione for his interesting seminar on the evolution of semantic networks.
Finally (last but not least) I have to thank my family, who have always been with me in happy and difficult moments.
This work is dedicated to my grandmother Maria who, unfortunately, is no longer among us.
Contents

Chapter 1. The problem and the research question
Chapter 2. Methods and Tools for the semi-automatic or automatic ontology generation
  2.1 Different approaches to the ontology generation
    2.1.1 Methods and techniques for the ontology generation from text
    2.1.2 Methods and techniques for the ontology generation from dictionaries
    2.1.3 Methods and techniques for the ontology generation from a knowledge base
    2.1.4 Methods and techniques for the ontology generation from semi-structured schemata
    2.1.5 Methods and techniques for the ontology generation from relational schemata
  2.2 Research projects and tools for the ontology generation
Chapter 3. The Energy Domain Case Study
  3.1 Energy Domain Modelling Process
    3.1.1 The modelling approach
    3.1.2 Information Sources and Tools
    3.1.3 Definition of Domain Concepts
    3.1.4 Horizontal Links
    3.1.5 Ontology Population
    3.1.6 Logical Expressions
    3.1.7 Racer Pro
  3.2 Manual approach: Energy Ontology Description
    3.2.1 Energy Domain
    3.2.2 Energy Sources
    3.2.3 The case of Hydrogen: Renewable or not Renewable?
    3.2.4 Country
    3.2.5 Energy Security Class
    3.2.6 Infrastructures
    3.2.7 Environmental Consequences
    3.2.8 Energy Use
  3.3 A Semi-automatic approach for ontology building
    3.3.1 Semi-Automatic Ontology Generation
Chapter 4. Evaluation and Experiments
  4.1 Pilot Study Experiments
  4.2 Precision and Recall for a quantitative evaluation
  4.3 New Experiments
Chapter 5. Discussions and Conclusions
  5.1 About the methodology
  5.2 Proposal: a linguistically motivated keyphrase extraction system for Ontogen
  5.3 Conclusions
Chapter 1. The problem and the research question
Ontologies have gained a lot of attention in recent years as tools for knowledge representation. However, in information and computer science there are many definitions of what an ontology is. Gruber (1993), for example, defines an ontology as "an explicit specification of a conceptualization" (where a conceptualization "is an abstract, simplified view of the world that we wish to represent for some purpose"), pointing out the relative simplicity of the represented knowledge in comparison with the complexity of the knowledge itself. Tim Berners-Lee (2001) gives a more concrete definition, describing an ontology as "a document or file that formally defines the relations among terms", underlining, in this way, the importance of the (formally defined) relational aspect between the elements composing the ontology. In the simplest terms, an ontology can be defined as a formal knowledge representation system (KRS) composed of three main elements: classes (also called concepts or topics), instances (individuals that belong to a class) and properties (which link classes and instances, allowing information about the represented world to be inserted into the ontology).
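These three building blocks can be illustrated with a small sketch in Python; all class, instance and property names below are invented for illustration and are not taken from the energy ontology discussed later:

```python
# A toy knowledge representation system with the three elements named above.

# Classes (concepts), each mapped to its superclass (None = top level)
classes = {
    "EnergySource": None,
    "RenewableSource": "EnergySource",
    "Country": None,
}

# Instances: individuals belonging to a class
instances = {
    "solar_power": "RenewableSource",
    "italy": "Country",
}

# Properties: (subject, property, object) triples describing the world
properties = [
    ("italy", "uses", "solar_power"),
]

def is_a(individual, cls):
    """True if the individual belongs to cls directly or via a superclass."""
    current = instances.get(individual)
    while current is not None:
        if current == cls:
            return True
        current = classes.get(current)
    return False
```

Real ontology languages such as OWL add much richer constructs (restrictions, axioms, disjointness), but the class/instance/property skeleton is the same.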
Obtaining a structured representation of information through ontologies is one of the main objectives in realizing the so-called Semantic Web1 (T.B. Lee et al., 2001). In the context of the Semantic Web, in fact, ontologies are expected to play an important role in helping automated processes to access information. In
1 According to Tim Berners-Lee (1999) the Semantic Web is an extension of the current web in which information is given a well-defined meaning. This, in his vision, should enable machines to "understand" the semantics of web resources and, therefore, to behave more "intelligently" in their search activities.
particular, ontologies are expected to provide structured vocabularies that explicate the relationships between different terms, allowing intelligent agents (and humans) to interpret their meaning. Another important aspect of the role of ontologies is linked to the issue of information overload. One of the problems of the current World Wide Web, in fact, is that a large part of the information returned to users in response to an explicit query is irrelevant. Through the implementation of ontologies within dedicated information systems (e.g. search engines) this problem can be reduced or, in the future, solved completely (at least in theory), because the ontology architecture (which is usually hierarchical) should be able to define the unique path that a query must follow to arrive at the web resources containing the desired information.
Since 2006, RDF2, RDF Schema3 and OWL4 have generally been considered the standard Semantic Web languages. In particular OWL, the language used for the concrete manual ontology building case study (which will be introduced in the following pages), is the most expressive of the three. This means that it increases the number (and the quality) of the inferences that software agents are able to make.
Ontology building is a very complicated activity for several reasons. First, it requires time-consuming work by experts. Second, the classification task is not as simple as it seems. Finally, the incredible speed at which knowledge develops in the real world constrains ontology engineers to continuously
2 See http://www.w3.org/RDF/
3 See http://www.w3.org/TR/rdf-schema/
4 See http://www.w3.org/OWL
update and enrich the generated ontologies with new concepts, terms and lexicon. In this way an ontology often becomes a "never ending work" which constantly requires manual effort and resources to be built and maintained. In recent years, tools and methods have been developed to try to solve, automatically or semi-automatically, the problems related to manual ontology building (an overview of such approaches, tools and techniques will be presented in the second chapter). The research question of this work is the following: is it possible, with the current tools and methods, to substitute (fully or even partially) the human activity in a complex task such as ontology building? We will try to answer this question through experimental results from a concrete case study in which a manually built domain-specific ontology has been compared with a semi-automatically built one.
The objective of this work is to present a concrete case study evaluating the manual approach to ontology building compared with the semi-automatic one. This thesis work was developed in three phases: in the first, a manual energy domain ontology was created. Then a part of this ontology was semi-automatically generated using the software Ontogen. Finally, the two approaches to ontology building were compared through a quantitative evaluation based on precision and recall measures. This work is structured as follows: chapter 2 examines some methods and tools used for semi-automatic or automatic ontology generation, chapter 3 is dedicated to the description of the energy domain case study and chapter 4 to the experiments and the evaluation phase. The last chapter is dedicated to the discussion of the results and to the conclusions.
Chapter 2: Methods and Tools for the semi-automatic or
automatic ontology generation
Manual ontology building is a time-consuming activity that requires a lot of effort for domain knowledge acquisition and domain knowledge modelling. In order to overcome these problems many methods have been developed, including systems and tools that, using text mining and machine learning techniques, allow ontologies to be generated automatically or semi-automatically. The research field that studies these issues is usually called "ontology generation", "ontology extraction" or "ontology learning" (Maedche et al., 2001). It studies the methods and techniques used to:
• construct an ontology ex novo, automatically or semi-automatically;
• enrich or adapt an existing ontology using different sources.
The ontology learning process is useful for different reasons: first of all to accelerate the process of knowledge acquisition, second to reduce the time needed to update an existing ontology, and finally to accelerate the whole process of ontology building. Buitelaar (2005) proposes an incremental stratification of this process (see figure 2.1):
Figure 2.1. Stratification of the ontology learning process in (Buitelaar 2005)
Starting from the lowest level, each phase can be considered as input to the next one. The phases identified are the following:
1. Extraction of relevant terms and their synonyms from a textual corpus for a target domain;
2. Identification of concepts (the third step of the scale proposed by Buitelaar);
3. Derivation of a hierarchy of the previously identified concepts;
4. Identification of non-taxonomic relations between the concepts;
5. Adjustment of the ontology with new instances, concepts and properties (ontology population);
6. Discovery of new rules and axiomatic relations between concepts and properties.
Most of the approaches iterate these steps, both to integrate user feedback into the process of ontology generation and to re-use the newly acquired knowledge as a knowledge base for future iterations.
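As an illustration, the lowest layer of this process (term extraction) can be sketched with a deliberately naive frequency count; the toy corpus and stopword list below are invented:

```python
import re
from collections import Counter

def extract_terms(corpus, stopwords, top_n=5):
    """Naive layer-1 term extraction: the most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", corpus.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [term for term, _ in counts.most_common(top_n)]

corpus = ("Renewable energy sources such as solar energy and wind energy "
          "reduce dependence on fossil fuel imports.")
stopwords = {"such", "as", "and", "on", "the"}
terms = extract_terms(corpus, stopwords, top_n=3)  # "energy" ranks first
```

Real term-extraction systems use part-of-speech filters and statistical termhood measures rather than raw frequency, but the input/output shape of the layer is the same.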
2.1 Different approaches to the ontology generation
Maedche and Staab (2001) proposed a classification of the systems used for automatic or semi-automatic ontology building. It is based on the type of input that the systems use to initiate the process of ontology generation. The authors distinguish between ontology generation from text, from dictionaries, from a knowledge base, from semi-structured schemata and from relational schemata. The most widely used approaches to ontology extraction from text, as reported in Perez (2004), are the following:
Pattern-based extraction (Hearst 1992): this approach usually uses heuristic methods that examine the text for distinctive lexico-syntactic patterns. A relation is recognized and extracted if a sequence of words within the text matches a pattern. The basic idea of this approach is very simple: to define a regular pattern that captures the expressions present in the text and maps the results of the matching into a semantic structure, such as a taxonomy of relations among concepts.
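The idea can be sketched with a single regular expression for the classic "X such as Y" construction (the example sentences are invented):

```python
import re

# One lexico-syntactic pattern: "<hypernym> such as <hyponym>"
PATTERN = re.compile(r"(\w+) such as (\w+)")

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    return [(m.group(2), m.group(1)) for m in PATTERN.finditer(text)]

pairs = extract_hyponyms(
    "Fuels such as coal are burnt in power plants. "
    "Poets such as Shakespeare wrote sonnets."
)
```

Hearst's original patterns are more elaborate (they cover noun-phrase lists and several constructions such as "including" and "especially"), but this captures the matching mechanism.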
Association rules: initially defined to extract information from databases in the data mining field (Agrawal et al., 1993), association rules have been used in (Maedche, 2001) to discover non-taxonomic relations between concepts, using a concept hierarchy as knowledge base.
Conceptual clustering: in this approach concepts are grouped according to their semantic similarity in order to build hierarchies. The semantic similarity can be calculated with different methods; for example, it may be calculated according to the distributional approach: the smaller the distance between the linguistic distributions of two words, the more similar the concepts (see Faure et al. 2000).
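A minimal sketch of the distributional approach, assuming simple fixed-window co-occurrence counts and cosine similarity (the toy corpus is invented):

```python
import math
from collections import Counter

def context_vector(tokens, target, window=2):
    """Count the words co-occurring with `target` within +/- `window` tokens."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tokens = ("coal is burnt in plants and oil is burnt in engines "
          "while solar is collected by panels").split()
sim_coal_oil = cosine(context_vector(tokens, "coal"), context_vector(tokens, "oil"))
sim_coal_solar = cosine(context_vector(tokens, "coal"), context_vector(tokens, "solar"))
# "coal" and "oil" share more context words than "coal" and "solar"
```

The resulting similarities would then feed an agglomerative clustering step to build the hierarchy.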
Ontology pruning: the aim of this approach is to build a domain ontology based on various sources (Kietz et al., 2000). It includes the following steps: first, a generic core ontology is used as the basic structure for a domain-specific ontology. Then, a dictionary containing important domain terms is used for domain concept acquisition, and these concepts are classified into the generic core ontology. Finally, domain-specific and general text corpora are used to remove non-domain-specific concepts, following the heuristic that domain-specific concepts should be more frequent in a domain-specific corpus than in a generic one.
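The pruning heuristic can be sketched as a relative-frequency ratio test; the corpora, the smoothing and the threshold value are invented for illustration:

```python
from collections import Counter

def prune_generic(candidates, domain_tokens, generic_tokens, threshold=2.0):
    """Keep a candidate only if its relative frequency in the domain corpus
    is at least `threshold` times its (smoothed) relative frequency in the
    generic corpus."""
    dom, gen = Counter(domain_tokens), Counter(generic_tokens)
    kept = []
    for term in candidates:
        dom_rf = dom[term] / len(domain_tokens)
        gen_rf = (gen[term] + 1) / (len(generic_tokens) + 1)  # add-one smoothing
        if dom_rf / gen_rf >= threshold:
            kept.append(term)
    return kept

domain = "pipeline gas pipeline oil gas supply pipeline".split()
generic = "the weather today is good and the news is on tv".split()
kept = prune_generic(["pipeline", "gas", "the"], domain, generic)  # drops "the"
```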
Concept learning: with this approach a given taxonomy is incrementally enriched by acquiring new concepts from textual documents (see Hahn et al. 2000).
The above are the main approaches for ontology generation from text. However, as mentioned earlier, the Maedche and Staab classification identifies four other classes of sources used as input for ontology extraction. Among them, ontology generation from dictionaries is based on the use of machine-readable dictionaries to extract relevant concepts and the relations among them; the methods and tools used for this task are based on linguistic and semantic analysis, and WordNet is usually the dictionary employed. Ontology learning from a knowledge base aims at generating an ontology using an existing knowledge base as source. Ontology extraction from semi-structured data, instead, has as its objective the extraction of an ontology from sources with a pre-defined structure, such as XML Schemas. Finally, ontology learning from relational schemata aims to learn an ontology by extracting relevant concepts and relations from knowledge stored in databases. The following pages examine the different methods, tools and techniques available in the literature for automatic or semi-automatic ontology generation starting from these different data sources.
2.1.1 Methods and techniques for the ontology generation from text
Since the late fifties many approaches have been proposed to extract terms, concepts and relationships from text, but since the mid-nineties these efforts have gained new momentum thanks to the availability of more sophisticated statistical and NLP techniques. These techniques are now used in many approaches to ontology generation (Soergel et al., 2005).
Reinberger's method (Reinberger et al., 2004) supports ontology extraction from text. Its aim is to create an initial skeleton of an ontology, to be refined later by analysts. It is based on a three-step process. First, a textual domain corpus is parsed with a shallow parser; then noun phrases and their relations with other verb phrases or noun phrases are listed. In the third step, clustering techniques group terms that share similar relations into the same classes and the ontology is created. Khan and Luo's method (Khan and Luo, 2002) aims to build a domain ontology starting from text documents, using clustering techniques and the WordNet ontology. The hierarchy is created by grouping documents with similar content (the documents are provided by the user) within the same cluster and then arranging the clusters into a hierarchy using an algorithm called SOTA. After building the hierarchy of clusters, a concept (or topic) is assigned to each cluster with a bottom-up concept assignment mechanism (the assignment starts from the leaf nodes of the hierarchy). Then the assigned topic is associated with the appropriate concept in WordNet. Finally, the concepts of the internal nodes are assigned using the descendant nodes and their hypernyms in WordNet.
Another method for building ontologies, which takes into account linguistic techniques coming from differential semantics, was proposed by Bachimont (2002). In this method the construction of the ontology follows a three-step process. First, there is a semantic normalization, where the user chooses the relevant terms of a domain and normalizes their meaning, expressing the similarities and differences of each notion with respect to its neighbours. These terms are then placed into a hierarchy (and the user has to justify her/his decisions). In the second step there is a knowledge formalization phase in which, using the taxonomy obtained in the first step, the various terms are disambiguated so that a domain expert can carry out a formalization of the knowledge. The third step is the operationalization: the created taxonomy is transcribed into a specific knowledge representation language.
Nobecourt (Nobecourt, 2000) presents an approach to build domain ontologies from text using text mining techniques and a corpus. This method is based on two activities: modelling and representation. The modelling activity is based on the extraction of relevant domain terms (the "conceptual primitives") from a corpus. After this operation, domain experts look for relevant domain terms in the list and identify the main sub-domains of the ontology. These terms are modelled as concepts and constitute the first skeleton of the ontology. Later, these concepts are described in natural language, constituting, in this way, a new source of documents (a new corpus) from which a new list of primitives can be extracted, in an iterative process used to gradually refine the skeleton of the ontology. The representation activity consists of the translation of the modelling schemata into an implementation language. This method is technologically supported by the platform TERMINAE (Biebow et al., 1999), which will be discussed further later.
Kietz et al. (2000) proposed a generic method to discover a domain ontology from given heterogeneous sources using natural language analysis techniques. It is a semi-automatic approach to ontology building, in the sense that the user takes an active part in the process. The authors propose to extract ontologies starting from a core ontology (e.g. SENSUS, WordNet, etc.) and enriching it with new domain-specific concepts. The user has to specify which documents should be used to refine the core ontology. New concepts are identified by applying NL techniques to the suggested documents. The resulting enriched core ontology is then pruned and focused on a specific domain through the removal of general concepts, using several statistical approaches. Finally, relations between concepts are learnt by applying learning methods and are added to the resulting ontology. This process is cyclic, because the resulting ontology can be refined by applying the method iteratively.
Aussenac-Gilles et al. (2000) suggest a method that allows the creation of a domain model using NLP tools and linguistic techniques for the analysis of corpora. This method uses text as a starting point, but may also use other existing ontologies or terminological resources to build the ontology. It performs ontology learning at three levels: the linguistic level, the normalization level and the formal level. The first consists of the extraction of terms and lexical resources from text. These elements are then clustered and converted into concepts and semantic relations at the normalization level. Finally, concepts and relations are formalized by means of a formal language. The process is composed of four phases. The first two are referred to as the linguistic level and include the corpus constitution (the authors point out the importance of a domain expert's aid in the selection of the domain-specific corpus) and the linguistic study, which focuses on the selection of adequate linguistic tools for the analysis of the text. This phase yields domain terms, lexical relations and a set of synonyms. The third phase is the normalization. The result of this phase is a conceptual model expressed in the form of a semantic network. It is divided into two sub-phases: a linguistic phase and a conceptual one. During the linguistic phase the ontology engineer chooses the terms and the lexical relations (e.g. hyponyms) that have to be modelled; then s/he adds a natural language definition for these terms, considering the senses that they have in the text and defining, for each sense, some identifying labels. If there are several meanings, the most relevant for the domain are kept. During the conceptual phase, concepts and semantic relations are defined in a normalized form using the labels of concepts and relations. The last phase is the formalization, which includes ontology validation and implementation. The evaluation of the knowledge learnt is made by the user and by a domain expert. Once the ontology has been evaluated it can be implemented (following this approach, for example, an ontology about fiberglass manufacturing was built and implemented for a private company; see Aussenac-Gilles et al. 2003).
Hearst's (1992) method aims to automatically acquire lexical hyponymy relations from corpora in order to build a general-domain thesaurus, using WordNet to verify its performance. The process exploits a set of predefined, easily recognizable lexico-syntactic patterns. The method aims at discovering such patterns, and the author suggests that other lexical relations will be acquirable in the same way. All of them will be used to build the thesaurus. The method proposes the following five-step procedure to automatically discover new patterns:
• Decide on a lexical relation of interest.
• Gather a list of terms for which this relation is known to hold (this step can be done automatically using this method).
• Find documents in the corpus where these expressions occur syntactically near each other, and record the whole environment (the environment is defined by the linguistic space where the defined expressions appear: e.g. in the simplest case "Poets such as Shakespeare" is the linguistic environment of the pattern "such as", used to indicate a hyponymy relation between "poet" and "Shakespeare"; in more complicated cases the environment can consist of sentences or full periods).
• Find commonalities among these environments and hypothesize that they yield patterns indicating the relation of interest.
• Once a new pattern has been identified, use it to gather more instances of the target relation and restart the process from step 2.
To validate this acquisition method, the author proposes a comparison with the information found in WordNet. For example, if two terms presented in the thesaurus as being in a hyponymy relation are also linked hierarchically in WordNet, then the thesaurus entry is verified.
Alfonseca and Manandhar (2002) proposed a method based on the distributional semantics hypothesis (Harris, 1971), which states that "the meaning of a word is highly correlated to the linguistic context in which it appears". From this point of view, the context of a certain concept can be encoded as a vector of context words containing the words that co-occur with that concept and their frequencies (a topic signature). These topic signatures are then clustered, and a distance measure such as TFIDF or chi-square is used to separate the different word senses. The obtained signatures can be compared with the topic signatures of an existing ontology (e.g. WordNet), identifying, in this manner, the hypernym candidates. To do this, a top-down classification algorithm is used.
2.1.2 Methods and techniques for the ontology generation from dictionaries
2.1.2.1 Jannink and Wiederhold’s approach
This approach (Jannink and Wiederhold, 1999) aims to convert dictionary data into a graph structure to support the generation of a domain or task ontology. It uses an algebraic extraction technique to generate the graph structure and to create thesaurus entries for all the words defined in the graph. According to its purpose, only headwords and definitions having many-to-many relations are considered. This results in a directed graph with two properties: each headword and its definition are grouped in a node, and each word in a definition node is an arc to the node having that word as headword. The basic hypothesis of this approach is that the structural relationships between terms are relevant to their meaning. They are extracted in a three-step process in which a statistical approach and the PageRank algorithm are used to produce, as output, a set of terms related by the strength of the association of the arcs between them.
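A plain power-iteration PageRank over such a headword graph can be sketched as follows; the toy dictionary graph is invented, and this is not the authors' actual extraction algebra:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {node: [outgoing arcs]}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in graph.items():
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Toy headword graph: an arc means the headword's definition uses that word
graph = {
    "energy": ["fuel", "power"],
    "fuel": ["energy"],
    "power": ["energy"],
    "turbine": ["power", "energy"],
}
ranks = pagerank(graph)  # "energy" ends up with the highest rank
```

Words whose definitions are heavily referenced by other definitions (like "energy" here) accumulate rank, signalling centrally important terms.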
2.1.2.2 Rigau’s method
This method (Rigau et al., 1998) consists in the extraction of lexical ontologies from dictionaries. Its main goal is to semi-automatically develop versions of WordNet for the Spanish language. It is based on two procedures: the analysis of dictionary definitions and the word sense disambiguation of genus words. In a first phase, each definition in a monolingual dictionary is analyzed to find a hypernym of the word being defined (also called the genus word). Then a word sense disambiguation procedure is applied to the genus word to discover which of its meanings is used. This method was developed as part of the EuroWordNet project, which aimed at developing lexical ontologies for several European languages.
2.1.3 Methods and techniques for the ontology generation from a knowledge base
2.1.3.1 Suryanto and Compton's approach
This approach (Suryanto and Compton, 2001) aims at generating an ontology from a knowledge base of rules. The authors propose an algorithm to extract a taxonomy of classes, where a class is a set of different rule paths that arrive at the same conclusion, and a rule path for a node n consists of all the conditions from all predecessor rules plus the new conditions of the particular rule of node n. The approach takes the initial trees and creates a set of classes, trying to discover relations among them. Three types of relations are considered: subsumption, mutual exclusivity and similarity. The central idea of this approach is to group all the rules in each class and calculate a quantitative measure for each relation between each couple of classes. This quantitative measure provides the confidence with which the relation is believed to exist (for example, class A subsumes class B, with a certain confidence measure, if class A only exists when class B exists but not the other way around). With the set of classes and relations created, the class taxonomy is built up. The whole process is evaluated by an expert.
2.1.4 Methods and techniques for the ontology generation from semi-structured schemata
2.1.4.1 Papatheodoru and colleagues' method
This method (Papatheodoru, 2002) aims to build taxonomies from domain repositories written in XML or RDF, using a data mining approach called cluster mining. The cluster mining approach first tries to group similar metadata into clusters and then, by processing these clusters, extracts a controlled vocabulary used to build the taxonomy (Perez 2004). The steps proposed by this method are:
1. Data collection and pre-processing, where the main objective is to select the appropriate keywords from the metadata files. This makes it possible to discover similarities between the documents; for that purpose, words such as articles or prepositions are dropped.
2. Pattern discovery: in this step the cluster mining approach is used to discover and build clusters of similar documents and to extract representative keywords from the documents' content.
3. Pattern post-processing and evaluation: the keywords extracted in the previous step are examined and measured with a statistical approach, and the best keywords (the most representative of the content of the clusters) are selected. These keywords provide the vocabulary necessary to form the concepts of the taxonomy.
2.1.4.2 Deitel and colleagues' approach
Deitel et al. (2001) present an approach for learning ontologies from RDF annotations of web resources. It focuses on learning new domain concepts from the whole RDF graph, enriching, in this way, the ontology to which the RDF annotations belong. To extract the description of a resource from the graph, this approach follows a criterion called the description of length n of a resource, which is "the largest connected subgraph in the whole RDF graph containing all possible paths of length smaller than or equal to n starting from or ending at the considered resource". The proposed steps for building a hierarchy based on resource descriptions are the following:
1. Extract the resource descriptions of length one and repeat the process, incrementing the length until the maximum path in the graph is covered.
2. Extract the resource descriptions of length one from the whole RDF graph (these descriptions form a set of RDF triples: resources, properties and values).
3. Iteratively generalize all possible pairs of triples: the generalization of two triples is the most specific triple subsuming them.
4. Construct the intensions5 of length one: the triples sharing a same extension are grouped together.
5. Build the generalization hierarchy based on the inclusion relations between the node extensions.
6. Repeat the process, incrementing the length of the resource descriptions to be extracted.
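The "description of length n" criterion can be sketched as an expanding collection of triples around a resource; the triples below are invented, and this simplified version ignores the generalization steps:

```python
def description(triples, resource, n):
    """Collect the triples lying on paths of length <= n that start from
    or end at `resource` (generalization steps are ignored here)."""
    frontier, reached, collected = {resource}, {resource}, set()
    for _ in range(n):
        new_frontier = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                collected.add((s, p, o))
                for node in (s, o):
                    if node not in reached:
                        new_frontier.add(node)
                        reached.add(node)
        frontier = new_frontier
    return collected

triples = [
    ("doc1", "about", "solar"),
    ("solar", "type", "RenewableSource"),
    ("RenewableSource", "subClassOf", "EnergySource"),
]
d1 = description(triples, "doc1", 1)  # only the triple touching doc1
d2 = description(triples, "doc1", 2)  # one more step into the graph
```

Increasing n widens the neighbourhood of the resource that is considered part of its description, which is what the iterative steps above exploit.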
2.1.5 Methods and techniques for the ontology generation from relational schemata
2.1.5.1 Kashyap's method
Kashyap (1999) uses database schemas to build an initial ontology, which is then refined through a collection of queries made by users. The process is interactive, because an expert is involved in deciding which classes and properties are important for the domain ontology, and iterative, because it is repeated as many times as necessary. The process has two phases. In the first, the database schemas are analyzed in detail; at the end of this phase a new database schema is created and, through reverse engineering techniques, its content is mapped into an ontology. In the second phase, the ontology built from the database schemas is refined by means of user queries, which allow attributes to be added or deleted, new entities to be created, and so on.
5 An intension may include redundant triples, one being more general than another. It is cleaned up by deleting triples subsuming another one (Perez, 2004).
2.1.5.2 Rubin and colleagues' approach
The method of Rubin et al. (2002) has as its objective the automation of the process of creating instances and their values using data extracted from external relational sources. This method uses an XML Schema as the interface between the ontology and the data sources. The process allows the links between the ontology and the data acquisition to be updated automatically when the ontology changes. This approach needs the following components: an ontology (with domain classes and relations among them), an XML Schema (the interface between the ontology and the data acquisition), and an XML translator (to convert external incoming data into XML). The method is based on a four-step process:
1. An ontology model of a domain must be created (e.g. with Protegé).
2. An XML Schema must be generated from the ontology (once the ontology is built and the constraints on the properties are made explicit, the XML Schema is sufficiently determined and can be written directly from the ontology).
3. The data acquired from the external resources must be put into an XML document using the syntax specified in the XML Schema.
4. The ontology is updated and the changes are propagated.
2.1.5.3 Stojanovic and colleagues' approach
Stojanovic et al. (2002) try to build light ontologies from conceptual database
schemas using a mapping process. The ontology generation follows a five-step
process:
1. The information from a relational schema is captured through reverse
engineering. This process tries to preserve as much information as possible
from the database schema.
2. The information obtained is analyzed by applying a set of mapping rules in
order to build ontological entities. These rules specify the way in which
elements migrate from the database into the ontology. The rules are
applied in the following order: concept creation, then inheritance and relation
creation, so as to build the ontology incrementally.
3. The ontology is created through the application of the rules mentioned in the
previous step.
4. The ontology is evaluated and refined.
5. The ontological instances are created on the basis of the tuples of the
relational database.
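The rule ordering of step 2 (concepts first, then inheritance, then relations) can be sketched on a toy relational schema. The table names and the specific rule heuristics below are illustrative assumptions, not the paper's exact rule set.

```python
# A toy relational schema: table name -> columns and foreign keys.
# Names are invented for illustration.
schema = {
    "source":     {"columns": ["id", "name"], "fks": {}},
    "renewable":  {"columns": ["id"], "fks": {"id": "source"}},  # PK that is also an FK
    "production": {"columns": ["id", "source_id", "country"], "fks": {"source_id": "source"}},
}

def map_schema(schema):
    ontology = {"concepts": [], "is_a": [], "relations": []}
    # Rule 1 (concept creation): every table becomes a concept.
    for table in schema:
        ontology["concepts"].append(table)
    # Rule 2 (inheritance): a table whose primary key is also a
    # foreign key is read as a subconcept of the referenced table.
    for table, info in schema.items():
        if "id" in info["fks"]:
            ontology["is_a"].append((table, info["fks"]["id"]))
    # Rule 3 (relation creation): remaining foreign keys become
    # relations between concepts.
    for table, info in schema.items():
        for col, target in info["fks"].items():
            if col != "id":
                ontology["relations"].append((table, col, target))
    return ontology

print(map_schema(schema))
```

Step 5 would then read the tables' tuples and instantiate the concepts produced here.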
2.5 Research Projects and tools for the ontology generation
Automatic and semi-automatic generation of ontologies from document corpora or
from other types of data collections is nowadays one of the research challenges of
the Semantic Web. Many research projects have been developed and many
prototypes and tools have been created for this task. The following section
overviews the major tools.
Mo’K Workbench
Mo’K Workbench (Bisson, 2000) is a tool that semi-automatically creates ontologies
from a textual corpus using different conceptual clustering techniques. It does not
need previous semantic knowledge (e.g. an existing ontology) and, by applying
NLP techniques, it extracts sets of triples from the document corpus. Each triple
is formed by a verb, a word and the syntactic role of that word within the
sentence. Mo’K then calculates the number of occurrences of each triple,
removing from the list the triples with too many or too few occurrences.
Finally it calculates the semantic distances between the triples to form
conceptual clusters.
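The frequency-based pruning step can be sketched as follows; the triples and the thresholds are invented for illustration and do not reproduce Mo'K's actual parameters.

```python
from collections import Counter

# Hypothetical (verb, word, syntactic_role) triples, as Mo'K would
# extract them with NLP from a corpus.
triples = [
    ("supply", "gas", "object"), ("supply", "gas", "object"),
    ("supply", "oil", "object"), ("supply", "oil", "object"),
    ("export", "oil", "object"),
    ("be", "thing", "subject"),
] + [("have", "part", "object")] * 50   # an overly frequent, uninformative triple

def filter_triples(triples, min_count=2, max_count=10):
    """Keep triples whose frequency lies between the two thresholds,
    discarding both rare noise and overly common, uninformative patterns."""
    counts = Counter(triples)
    return {t: n for t, n in counts.items() if min_count <= n <= max_count}

kept = filter_triples(triples)
print(kept)  # the singleton and the 50-occurrence triples are dropped
```

The surviving triples would then be compared by a semantic distance measure to form the conceptual clusters.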
Text to Onto
This system (Maedche and Volz, 2000; Maedche and Staab, 2004) integrates the
KAON environment, an open-source ontology management infrastructure, with a
tool suite for building ontologies from an initial core ontology. It combines
knowledge acquisition and machine learning techniques to discover conceptual
structures. Terms are extracted according to their occurrence frequency and
distribution criteria. Semantic and hierarchical relations are extracted with
association rules or linguistic patterns, and the relationships are weighted according
to support and confidence criteria. The result of Text to Onto is a domain ontology.
The whole process is supervised by an ontologist.
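The support and confidence criteria used to weight candidate relations can be sketched as follows; the toy "documents" are invented, and this is the generic association-rule definition rather than Text to Onto's exact implementation.

```python
def support_confidence(transactions, a, b):
    """Support and confidence of the rule a -> b over term co-occurrence
    'transactions' (e.g. the terms appearing together in one document)."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n                       # how often a and b co-occur overall
    confidence = n_ab / n_a if n_a else 0.0  # how often b follows from a
    return support, confidence

docs = [{"oil", "pipeline"}, {"oil", "pipeline", "export"},
        {"oil", "price"}, {"gas", "pipeline"}]
print(support_confidence(docs, "oil", "pipeline"))  # support 0.5, confidence 2/3
```

Candidate relations with support and confidence above chosen thresholds would be proposed to the supervising ontologist.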
Text Storm and Clouds
This system (Pereira, 1998) has been developed for the semi-automatic construction
of a semantic network using text relevant to the target domain. It is composed of two
modules, TextStorm (Oliveira et al. 2001) and Clouds (Pereira et al. 2000), that
perform complementary activities. TextStorm is an NLP tool that extracts binary
predicates from text using syntactic and discourse knowledge. The predicates on
which the tool focuses are those that relate two concepts in a sentence. The
process works as follows: target domain text is provided to the system and is tagged
using WordNet to find all the parts of speech to which a word may belong. The text is
then parsed using an augmented grammar to obtain a lexical classification of the
words during the parsing process. Finally, TextStorm creates a list of extracted
terms that becomes the input for the Clouds tool. Clouds is responsible for the
construction of the semantic network in an interactive way. Using the list
of binary predicates extracted from the text, Clouds builds a hierarchical tree of
concepts, learning some particulars of the domain using two techniques: a best-
current-hypothesis-based algorithm (to learn the categories of the arguments of each
relation) and an Inductive Logic Programming-based algorithm (to learn the recurrent
context of each relation).
OntoLT
OntoLT is a Protégé plugin developed by Buitelaar et al. (2004) that automatically
extracts concepts (classes in Protégé) and relations (properties in Protégé)
from a collection of linguistically annotated texts. To do so, OntoLT uses
rules that map the linguistic entities within the text to the
classes/properties in Protégé. Using this tool requires a collection of
texts automatically annotated with linguistic information provided in XML format.
The annotation includes part-of-speech tags, morphological analysis, sentence
analysis and predicate-argument analysis. The annotation allows the automatic
extraction of linguistic entities that can be used to build an ontology of concepts,
subconcepts and relations of a specific domain. The mapping rules are defined through XPath
expressions (used to extract the requested elements or attributes from an XML
document). A mapping rule defines how to “map” linguistic entities within a corpus
(a collection of documents in XML) to classes and slots of Protégé. These rules
are implemented through the use of preconditions (XPath expressions). If all the
preconditions are satisfied, a set of linguistic units is generated and one or
more operators are activated to define in which way each of them will be
“mapped” into the corresponding class/slot of Protégé. OntoLT includes statistical
analysis functions for the construction of syntactic rules aimed at identifying
the linguistic units relevant to the target domain. For each linguistic
entity a rating is computed, and the linguistic entities more specific to
the domain corpus receive the higher ratings.
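A minimal sketch of an XPath-based mapping rule in this spirit is shown below. The annotation format, tag names and rule encoding are invented for illustration; they are not OntoLT's actual annotation schema or rule language.

```python
import xml.etree.ElementTree as ET

# A toy linguistically annotated document (invented tag names).
doc = ET.fromstring("""
<corpus>
  <sentence>
    <token pos="NN" lemma="energy"/>
    <token pos="NN" lemma="security"/>
    <token pos="VB" lemma="require"/>
  </sentence>
</corpus>""")

# A mapping rule: a precondition (an XPath selecting noun tokens) plus
# an operator saying how each match becomes a Protégé class proposal.
rule = {
    "precondition": ".//token[@pos='NN']",
    "operator": lambda tok: ("CreateClass", tok.get("lemma").capitalize()),
}

# If the precondition matches, the operator is applied to every hit.
proposals = [rule["operator"](tok) for tok in doc.findall(rule["precondition"])]
print(proposals)  # [('CreateClass', 'Energy'), ('CreateClass', 'Security')]
```

Statistical relevance scores over the corpus would then rank which of these proposals are specific enough to the domain to keep.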
CORPORUM Ontobuilder
Corporum Ontobuilder (http://ontoserver.cognit.no) extracts ontologies and
taxonomies from natural language texts. The tool uses many linguistic techniques
that drive the analysis and the information extraction. It extracts information from
structured and unstructured documents using Ontowrapper (Engels, 2002) and
Ontoextract (Engels, 2001). Ontowrapper extracts information from texts, while
OntoExtract obtains taxonomies from natural language texts in RDF format and is
also able to refine existing concept taxonomies.
JATKE
JATKE (available open source at http://jatke.opendfki.de/cgi-bin/trac.cgi) is a
Protégé plug-in that offers a unified platform for ontology construction. The
platform has a high degree of flexibility and allows a variety of approaches
to be combined, and even new ones to be created. In fact, it allows the combination
of pre-existing modules, guaranteeing a personalized set-up for each scenario. JATKE is
composed of three different modules (Information/Source, Evidence, Proposal) and
its main characteristics are as follows:
• It allows the final user to define the mix of learning algorithms that
best fits the domain or the desired objective, allowing a personalized set-up
of the different modules;
• Its modular design allows new modules to be developed rapidly, reusing the
existing ones;
• It supports a semi-automatic ontology building process, generating
proposals for modifications of an existing ontology; the user decides
whether or not to accept each system proposal (each one has a certain
degree of confidence).
The communication between the different modules is possible thanks to an
internal ontology that represents the internal domain of the system, in which every
type of data or command is represented as an instance. Within the system
there is the button “Proposal” (the last button in figure 2.2) that
activates a learning process. The proposals are divided into three categories:
• Class: a proposal concerning a class (a concept), which can be created,
deleted, renamed or moved;
• Slot: a proposal referring to a relation or to a property;
• Instance: a proposal related to an instance.
All the proposals are analyzed by the user, who accepts or rejects them. If the user
accepts a proposal, the adjustment is communicated to all the modules
participating in the process and, in this way, the ontology is updated.
Figure 2.2. The slot proposal in JATKE
Ontogen
Ontogen (Fortuna et al. 2005) is a semi-automatic, data-driven topic ontology editor
which integrates machine learning and text mining algorithms. OntoGen’s main
features are the automatic extraction of keywords from the documents given as
input to the system (the extracted keywords are “candidate concepts” of the
ontology) and the generation of concept suggestions. This system will be
discussed in detail in the next chapter.
Ontobuilder
This tool (Modica et al. 2001) helps users in the ontology building process, using
as sources semi-structured data coded in XML or HTML. It has been designed to
work as a web browser (see figure 2.3). Once the URL of a web page is given and
the page has been downloaded by the system, it is possible to extract an ontology
from that web source. The system is composed of three main modules: the user
interaction module, the observer module and the ontology modelling module. The
process that builds the ontology has two phases: a training phase, in which
an initial domain ontology is built using the data provided by the user (e.g. the user
suggests browsing websites that contain relevant domain information), and an
adaptation phase, in which the obtained ontology is gradually refined (for each
new site a candidate ontology is extracted and merged with the existing one).
Figure 2.3. The browser interface of Ontobuilder
OntoLearn
OntoLearn (Velardi 2004) is an ontology learning system based on NLP and
machine learning techniques, organized in three phases. The first phase is term
extraction: relevant terms (both single words and multi-word terms such as “credit
card”) are extracted with NLP and statistical techniques. Specific and generic
corpora are used to prune the terminology that is not specific to the domain. The
domain documents are used as input and the system, after parsing, extracts a list
of “syntactically plausible” terms (e.g. Adj+N). Two entropy-based measures are
used to evaluate the importance of each term: Domain Consensus and Domain
Relevance. The first is used to select only the terms referred to across the
document corpus. The second is used to select the terms belonging to the domain
of interest, and is calculated taking as a baseline a set of terms from different
domains. Finally, the extracted terms are filtered using a lexical cohesion measure
that quantifies the degree of association of all the words in a string of terms. The
second phase consists of the semantic interpretation of terms, which is based on
the principle of compositional interpretation, according to which, for example, the
meaning of a compound term such as “business plan” is derived by associating
each single term with the correct identifier of an existing ontology (e.g. WordNet).
In this phase a word sense disambiguation task is executed, applying an algorithm
called SSI (Structural Semantic Interconnections) that is based on syntactic
pattern matching. SSI produces a semantic graph which includes the selected
senses and the semantic interconnections between them. The third phase is the
extension and refinement of the starting ontology. Once the terms have been
semantically interpreted, they are organized in sub-trees and inserted at the
appropriate node of the initial ontology. Furthermore, some nodes of the initial
ontology are pruned to create a specific view of the ontology. Finally, the new
ontology is converted into OWL format.
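A rough sketch of a Domain Relevance-style measure is given below: a term's normalized frequency in the target domain divided by its maximum normalized frequency over all contrastive domains. The exact formula and the frequency counts are assumptions for illustration, not OntoLearn's published definition.

```python
def domain_relevance(term, target, freq_by_domain):
    """Ratio of a term's relative frequency in the target domain to its
    maximum relative frequency across all domains (1.0 = most at home
    in the target domain). Frequencies are hypothetical."""
    totals = {d: sum(f.values()) for d, f in freq_by_domain.items()}
    p = {d: freq_by_domain[d].get(term, 0) / totals[d] for d in freq_by_domain}
    m = max(p.values())
    return p[target] / m if m else 0.0

# Invented term counts for a target "energy" corpus and a contrastive one.
freqs = {
    "energy":  {"energy security": 30, "credit card": 1, "the": 300},
    "finance": {"energy security": 1, "credit card": 40, "the": 310},
}
print(domain_relevance("energy security", "energy", freqs))  # → 1.0
```

A term like "credit card" scores low here, which is exactly how the contrastive corpora prune non-specific terminology.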
WeDaX tool
WeDaX (Snoussi 2002) was developed by a group of researchers at the University of
Montreal (Canada). Its aim is to extract information from web pages using an
ontology to model the data to be extracted. More specifically, the web page is
converted to XML format, then a mapping with an ontological model is done (the
model definition and the mapping are done manually by the user through a graphical
interface). An automatic process extracts the information, and the result is an XML
document containing a set of standardized data on which queries can be executed.
WebKB
WebKB (Craven 2000) has been developed at Carnegie Mellon University (USA),
and its objective is to automatically create a computer-understandable knowledge
base by extracting information from web documents. In particular, this system
extracts the information it needs for knowledge base creation from an initial set of
web pages and then searches new websites in order to automatically populate the
knowledge base with new assertions. The tool needs two inputs: an ontology that
specifies classes, and examples of web pages representing the classes or the
instances of that ontology, in order to map the ontology to the Web.
SOBA (SmartWeb Ontology-based Annotation)
SOBA is a component of the SmartWeb system developed at the University of
Karlsruhe (Germany). It automatically populates a knowledge base starting from
the information extracted from web pages about football. It is composed of a
web crawler, components for linguistic annotation and a final module for the
transformation of the linguistic annotations into an ontological representation (see
Cimiano 2006).
RelExt tool
RelExt (http://www2.dfki.de/web/) is a tool for extracting relationships from a
collection of texts used for ontology building. Usually, domain ontologies rarely
model verbs as concept relations. The basic assumption of this system is that the
role of verbs as elements connecting concepts is evident. RelExt is a system
able to automatically identify relevant triples (a pair of concepts connected
by a relation) among the concepts of an existing ontology. The system
works by extracting relevant verbs and their arguments from a collection of domain-
specific documents and calculating, through a combination of statistical and
linguistic techniques, relevant relations with the concepts.
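The core idea of counting verb-mediated links between known concepts can be sketched as follows. The sentences, concept list and the crude subject-verb-object heuristic are invented; RelExt's actual linguistic processing is far richer.

```python
from collections import Counter

# Hypothetical tokenized sentences and a set of known ontology concepts.
concepts = {"country", "oil", "pipeline", "gas"}
sentences = [
    ["country", "exports", "oil"],
    ["country", "exports", "gas"],
    ["pipeline", "transports", "gas"],
    ["country", "exports", "oil"],
]

def extract_relation_triples(sentences, concepts):
    """Count (subject concept, verb, object concept) triples: a rough
    stand-in for RelExt's statistically ranked relation candidates."""
    triples = Counter()
    for s in sentences:
        # Toy heuristic: three-word sentences whose first and last
        # tokens are known concepts contribute one verb triple.
        if len(s) == 3 and s[0] in concepts and s[2] in concepts:
            triples[(s[0], s[1], s[2])] += 1
    return triples

print(extract_relation_triples(sentences, concepts).most_common(1))
# [(('country', 'exports', 'oil'), 2)]
```

Frequent triples like (country, exports, oil) would be proposed as candidate relations between the existing ontology concepts.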
Doddle
This system (Yamaguchy 1999) aims to construct domain ontologies, in particular
a hierarchically structured set of domain terms without concept definitions, reusing
a machine-readable dictionary (MRD) and adjusting it for specific domains. Since
Doddle just generates hierarchically structured sets of domain terms, it supports the
user in the concept categorization task and in concept name suggestion. The tool
deals with concept drift (the senses of concepts change depending on the
application domain). For this purpose, two strategies have been followed (match
result analysis and trimmed result analysis). Both try to identify which parts of the
initial ontology may stay and which should be moved. In order to analyze the concept drift
between an MRD and a domain ontology, two main activities are involved. The
first is the building of an initial model from an MRD, extracting
information about the relevant terms in a given domain. The second concerns the
management of concept drift, adjusting the initial model to the specific
domain.
SVETLAN
Svetlan (Chaelendar and Grau, 2000) is a domain-independent tool that creates
clusters of the words appearing in a text. The aim of this tool is to build a
hierarchy of concepts. Its learning method is based on a distributional approach:
nouns playing the same syntactic role in sentences with the same verb are grouped
together in the same class. The learning process follows three steps: syntactic
analysis, aggregation and filtering. In the first step the tool retrieves sentences from
the original text (it only accepts French texts in natural language) in order to find the
verbs inside each sentence, since the basic assumption is that verbs allow nouns
to be categorized. The output of this step is a list of triplets of verbs, nouns and the
syntactic relations between them. The aggregation step creates clusters of nouns with
similar meanings (using conceptual clustering techniques), and the filtering step is
based on the weights of the nouns inside the classes (the clusters created) and on
the removal of the nouns that are not relevant inside the groups.
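The aggregation step (grouping nouns that share a verb and a syntactic role, with occurrence counts usable as weights for filtering) can be sketched as follows; the triplets are invented English stand-ins for Svetlan's French parser output.

```python
from collections import defaultdict

# Hypothetical (verb, noun, syntactic_role) triplets from parsed sentences.
triplets = [
    ("import", "oil", "object"), ("import", "gas", "object"),
    ("import", "coal", "object"), ("regulate", "market", "object"),
    ("import", "oil", "object"),
]

def aggregate(triplets):
    """Group nouns that play the same syntactic role with the same verb,
    weighting each noun by how often it occurs in the class."""
    classes = defaultdict(lambda: defaultdict(int))
    for verb, noun, role in triplets:
        classes[(verb, role)][noun] += 1
    return {k: dict(v) for k, v in classes.items()}

classes = aggregate(triplets)
print(classes[("import", "object")])  # {'oil': 2, 'gas': 1, 'coal': 1}
```

The filtering step would then drop low-weight nouns from each class, leaving the distributionally coherent groups.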
TERMINAE
This system (Biebow et al. 1999) integrates linguistic and knowledge engineering
tools. The linguistic tool allows terminological forms to be defined from the analysis of
term occurrences in a corpus: the ontologists analyze the uses of the terms in the
corpus to define their meanings. The knowledge engineering tool, instead, helps to
represent terminological forms as concepts. TERMINAE uses a method to build
concepts from the study of the corresponding terms in a corpus. First it establishes,
using an extractor tool, a list of candidate terms which are proposed to the
ontologists (who select a set of terms). Then the ontologists conceptualize the
terms and analyze their uses in the corpus to define all their meanings.
Finally, the ontologists give a definition in natural language for each meaning and
then translate it into an implementation language.
The development of this kind of systems, and their alignment with the manual
“gold standard” ontologies, represents one of the main challenges for the near
future of the Semantic Web. To achieve this objective, the evaluation of the
systems and of their results plays a crucial role. The next chapter presents the
results of an evaluation of a semi-automatically generated domain ontology
against a manually built one.
Chapter 3: The Energy Domain Case Study
In this chapter, a concrete case of comparison between manual and semi-
automatic ontology construction is presented. The main objective is to present a
comparative evaluation of these two different approaches. The questions we’ll try to
answer are: which approach is more useful in the ontology building task? Are these
approaches alternatives, or can they be integrated? What are the pros and cons
of each one? And, finally, is it realistic, in the near future, to think of a complete
automation or, at least, a semi-automation of the whole process of
ontology building? In order to answer these questions, a set of experiments was
performed, in which a manually built ontology is compared with a semi-
automatically generated one. The ontology compared is a domain ontology for
energy security. It represents our “case study” and allows us to formulate some
conclusions about the two approaches. The reason why we chose to focus our
attention and our research efforts on the energy domain is the importance that
energy issues have nowadays gained in the global institutional and economic
agenda. In this light, the realization (with both the manual and the semi-automatic
approach) and the evaluation of a dedicated ontology for this field represent, at the
same time, a big challenge for ontology engineers and a great opportunity for the
institutional and economic decision makers of the energy domain. In fact, the
implementation of such an ontology in a dedicated information system may
improve both the precision and recall6 of search results in information retrieval
tasks. This improvement may help decision makers in the energy field get the
relevant information they need for making their decisions (in this view the
ontology would be used to create a sort of “dedicated ontology-based decision
support system for the energy domain”). In the following paragraphs we will
explain the methodology used for the energy domain modelling (paragraph 3.1).
Then, a more detailed description of the manually built energy ontology will be
presented (paragraph 3.2). Ontogen, the tool used for the semi-automatic ontology
generation, will be presented next (paragraph 3.3), followed by a presentation of
the semi-automatic ontology. Finally, the experimental results of the manually
built ontology will be compared with the results obtained by the semi-automatic
ontology building process (paragraph 3.4).
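The precision and recall measures mentioned in this chapter can be sketched, for a single query, as follows; the document identifiers are invented.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query: `retrieved` is the set of
    documents a system returned, `relevant` the gold-standard relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # how much of what was returned is relevant
    recall = hits / len(relevant) if relevant else 0.0       # how much of the relevant set was found
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision_recall(retrieved, relevant))  # precision 0.5, recall 2/3
```

A better ontology behind the retrieval system should raise both numbers, which is the improvement hypothesized above for energy-domain decision makers.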
3.1 Energy Domain Modelling Process
3.1.1 The modelling approach
Poesio (2005) states that there are at least two different research traditions in the
domain modelling literature. One school of thought supports the thesis of the need
for more rigorous logical and philosophical foundations for domain modelling
formalisms. Its aim is both to establish a “Tarskian semantics” for the formalisms
used in the domain ontologies (leading to description logics) and to have cleaner
domain ontologies (where the expression “clean ontology” stands for “ontology
with a clear semantics and based on sound philosophical and scientific principles”).
Supporters of this line of research argue that ontologies built according to
formalist principles are very beneficial for many NLP (Natural Language
Processing) applications such as, for example, information extraction from a
database using natural language or, vice versa, information extraction from texts
and its addition to a database. The second school of thought, instead (which Poesio
defines as “cognitive”), argues that the best way to identify epistemological
primitives is to study concept formation and learning in humans. Accordingly, the
best approach to the construction of domain ontologies is the use of machine
learning techniques to automatically extract ontologies from language corpora. This
approach has its philosophical foundations in the work of the late Wittgenstein
(Philosophicae Investigationes, 1959) and especially in his implicit critique, made
explicit from the psychological perspective by the work of Eleanor Rosch (Rosch,
1973), of the classic Aristotelian theory of concepts7, through the introduction of the
concepts of “language games” and “family resemblance”. These two concepts, in
fact, indirectly dealt a mortal blow to the Aristotelian theory, bringing out that
humans, and their memory, are not able to classify language and concepts according
to what the theory stated, partly because of the complexity intrinsic to language
itself. The failure of this theory, and in general of the pure formal logic approach as
a key to explaining the way in which humans think and categorize (Thaggard, 1998),
has been used as an argument against the “formalistic approach” to domain
modelling (the first one mentioned above).
6 Precision and recall are two standard measures of the Information Retrieval field. In a few words they represent, respectively, how relevant the documents retrieved by a system after a query are (precision) and how many relevant documents a system is able to retrieve out of the total of the relevant documents (recall). These two measures will be presented in a wider and more detailed manner in the following pages.
7 According to this theory a concept can be defined exclusively by a finite set of necessary and sufficient conditions.
Regarding the specific case of energy modelling,
it is not easy to classify our work within one of these two categories (Formalism vs
Empiricism), because both a bottom-up and a top-down strategy were used for the
energy ontology building. To be clearer: mainly a “bottom-up” ontology
building strategy8 was used. It was bottom-up because the domain knowledge
acquisition, and then the process of ontology creation, started from corpora of
textual documents and not from abstract conceptual knowledge or conjectures made
a priori by the ontologists about the world of the domain to be represented (in this
sense we could say that we privileged an empirical and language-based point of view).
On the other hand, a “top-down” approach was applied through the guidance of an
energy domain expert (Dr. Brenda Schaffer of the University of Haifa). This help has
been crucial for our work and allowed us to modify the ontology and the research
directions in a more meaningful manner, allowing us to better focus on specific
relevant aspects. For example, after the reviews of Dr. Schaffer, we mainly focused
our attention on the class “energy security” (which is only one of the 54 classes of
the ontology) and on its implications regarding, for example, geopolitical,
economic and environmental problems (we will see the class description and its
ontological implications in the next paragraph). To summarize, for the manual energy
ontology building we mainly used a corpus-based strategy. However, at the same
time, we have been guided by a domain expert, hence also applying a top-down
strategy. So our approach can perhaps be considered a “middle up-down” one. In the
following paragraphs the whole process of manual energy ontology building will be
described, step by step.
8 The work of manual ontology building has been shared with other Master Degree students of the University of Salerno who also worked, with different objectives and perspectives, on the same project I worked on. In order to facilitate the circulation of information about the common relevant issues of the project we created a wiki, which revealed itself to be an excellent instrument for knowledge sharing. To learn more about our different thesis work projects see http://energyontologythesiswork.wikispaces.com.
3.1.2 Information Sources and Tools
The work has been based on information extracted and inferred from a document
database of about 200 documents about energy. The documents have been
recovered from the web, mainly (but not only) from the online resources of the
most important energy agencies and associations in the world. The strategy used for
the document recovery was the following: in a first phase, the documents selected
were those recovered by means of the results of search queries on search
engines about energy policies, energy sources and energy use. In this first phase we
did not have specific knowledge about the energy domain, and so even the search
queries were not precise. The first 30 documents of the database were recovered
following this strategy. After reading them we started to acquire an initial knowledge
of the domain, and that allowed us to better evaluate the relevance of the documents
found by the search queries. So the query strategies also changed and started to
become more precise. In this second phase, in fact, we started to formulate queries using
the first classes inserted into the ontology (e.g. energy security + resources
affordability or energy security + reliability of supply) in order to retrieve more
relevant and specific documents. Then yet another strategy was followed. We
started to browse a list of websites considered “authoritative sources” on
energy issues and started to do a sort of “human crawling” work, in the sense that,
starting from the first pages of the selected sources (and following all the links
presented there), we recovered other documents and information. The full list of
the sources used is the following: the EIA (Energy Information Administration, see
http://eia.gov), which is the USA’s premier source of unbiased energy data, analysis
and forecasting; the IEA (International Energy Agency, http://iea.org), which acts as
energy policy advisor to 26 member countries; ENI (Ente Nazionale
Idrocarburi), which is one of the most important integrated energy companies in the
world, operating in the sectors of oil, gas, power generation, etc.; the OECD
(Organization for Economic Co-operation and Development), which is one of the
world’s largest providers of comparable statistics and economic and
social data; the Oil & Gas Journal, which delivers the latest international oil and
gas news; the FERC (Federal Energy Regulatory Commission, for the USA only),
which regulates and oversees energy industries in the economic, environmental, and
safety interests of the American public; the NREL (National Renewable Energy
Laboratory, also referred to the USA), which is America’s primary laboratory for
renewable energy and energy efficiency research and development (R&D);
British Petroleum, another important global energy company; and,
finally, the World Bank, which is one of the most important institutions interested in
the themes of energy security and their impact on global economic aspects.
The reason why we privileged these sources and not others is strictly linked to the
problem of the “trust” of the information provided on the web. In our opinion, the
information sources selected by the ontology engineers to discover information and
to acquire specific domain knowledge about the field to be represented are a
sort of “foundational bricks” for the ontology infrastructure, and so they have a
very important role. According to this view, we made this kind of selection
because we regarded these sources as “authoritative” in the energy field and,
therefore, as having a good degree of trust. Based on the domain information
extracted from the selected documents, we started to create the manual ontology
using the Protégé software editor, version 3.3.1 (http://protege.stanford.edu). We
chose this software for the following reasons:
• It has an extensible knowledge model. The internal representational primitives in
Protégé can be redefined declaratively. Protégé’s primitives - the elements of its
knowledge model - provide classes, instances of these classes, slots representing
attributes of classes and instances, and facets expressing additional information
about slots.
• It has a customizable user interface. The standard Protégé user interface
components for displaying and acquiring data can be replaced with new
components that best fit particular types of ontologies (e.g., for OWL).
• It allows importing ontologies in different formats. Several plug-ins are
available for importing ontologies in different formats into Protégé, including XML,
RDF, and OWL.
• It supports data entry. Protégé provides facilities whereby the system can
automatically generate data entry forms for acquiring instances of the concepts
defined by the source ontology.
• It offers many ontology authoring and management tools. The PROMPT tools are
Protégé plug-ins that allow developers to merge ontologies, to track changes in
ontologies over time, and to create views of ontologies. The Protégé internal
knowledge representation can be translated into the various representations used in
the different ontologies. Protégé has different back-end storage mechanisms,
including relational databases, XML, and flat files.
• It presents an extensible architecture that enables integration with other
applications. Protégé can be connected directly to external programs in order to use
its ontologies in intelligent applications, such as reasoning and classification
services.
• It provides a Java Application Programming Interface (API). System
developers can use the Protégé API to access and programmatically manipulate
Protégé ontologies. Protégé can be run as a stand-alone application or through a
Protégé client in communication with a remote server.
3.1.3 Definition of Domain Concepts
It is not easy to present explicitly the subdivision of the different steps followed for
the manual ontology generation, because different operations took place in
parallel (carried out by one or more persons involved in the project). However, in
general, the process of the energy ontology modelling was the following: in a first
phase, after the initial document reading and initial domain knowledge acquisition,
a first hierarchy of concept classes was constructed (the first version of the manual
ontology was formed by 16 classes, 11 properties, inverse functions included, and 40
instances; see table 3.1). The process followed for the domain concept
definition was based on the manual extraction of the main keywords present in
the texts read and on the inferences we made from them (e.g. for each keyword we
considered its hyponyms, hypernyms, meronyms, and, finally, the potential related
concepts that could be extracted). Then, after the creation of the “infrastructure” of
the ontology concepts, we started to insert more detailed knowledge into the
ontology, defining instances and different types of relationships (properties in
Protégé) in order to put more meaningful information into the ontology.
Classes: Energy Domain, Risks, Solutions, Energy Security, Reliability to Supply, Friendliness to Environment, Energy Sources, Primary and Secondary Sources, Nuclear, Nuclear Weapon Proliferation, Nuclear Energy Proliferation, Country, Infrastructure, Renewable and Non Renewable Sources.
Instances: Alcohol Fuel, Biofuel, Ethanol, Coke, Diesel Fuel, Gasoline, JetFuel, Biomass, Corn, Geothermal, Photovoltaic, Solid Waste, Solar, Waste, Water, Wind, Wood, Anthracite, Coal, Bituminous Coal, Lignite, Oil, Propane, Natural Gas, Peat, Belarus, USA, China, Russia, India, Canada, Brazil, Venezuela, Nigeria, Saudi Arabia, Norway, Mexico, UK, Uzbekistan, Algeria, Kuwait.
Properties: is a type of, includes, is mostly exported by, exports, is one of the major producers of, is mostly produced by, is used to, is made into, causes, is caused by, is useful in case of.
Table 3.1: Classes, Properties and Instances of the first version of the manual Ontology
3.1.4 Horizontal Links
After defining the properties and relations, we started to create horizontal links
between different classes, subclasses and instances. Horizontal links are properties
or relations which link individuals or classes that are not in a hierarchical
relationship. This kind of relation is very important because it allows the ontology to
represent information of a higher level compared with the subsumption (“is-a”
relations) and similarity information usually provided by knowledge representation
systems. An example of the usual information provided is the following: “Margherita
is a type of pizza”. Through horizontal links, instead, it is possible to represent richer
information, such as the following: “In Naples there is the venue X, in front of the
place Y, that prepares a fantastic pizza Margherita”. The improvement that
horizontal links bring to the quality of the information that can be provided within
these systems is therefore evident. To create the horizontal links we used logical
properties with one or more arguments. An example of a logical property with one
argument is the following: “Mario is a funny guy”, where “Mario” is the argument and
“to be a funny guy” is the property. An example of a logical property with two
arguments (this kind of property is usually called, in the logic literature, a “relation”)
is the following: “Saudi Arabia is one of the major exporters of oil”, where “Saudi
Arabia” and “oil” represent the arguments and “to be one of the major exporters of”
is the asserted relation.
This kind of property allowed us to create relations within the ontology involving up
to 3 classes (a concrete example will be shown in the next paragraph with the
ontology description). Table 3.2 shows the full statistics of the horizontal links
present within the ontology. Most of them link 2 classes and are therefore based on
properties with two arguments.
Horizontal links between 2 classes: 96
Horizontal links involving up to 3 classes: 12
Table 3.2. Horizontal Links Data
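The logical properties with two arguments described above can be sketched as simple triples (a minimal Python illustration, not Protégé's actual data model; the property names are taken from table 3.1):

```python
# Minimal sketch: horizontal links as (subject, property, object) triples,
# together with declared inverse properties, as described in the text.
# (This is an illustration, not Protege's internal representation.)

links = [
    ("Saudi Arabia", "is one of the major exporters of", "Oil"),
    ("Scarce Resources Dependency", "cause", "Growth of energy price"),
]

# Declared inverse properties, as in Protege (property -> inverse).
inverses = {"cause": "is caused by"}

def with_inverses(triples, inv):
    """Return the given triples plus the ones implied by inverse properties."""
    out = list(triples)
    for s, p, o in triples:
        if p in inv:
            out.append((o, inv[p], s))
    return out

all_links = with_inverses(links, inverses)
for s, p, o in all_links:
    print(f"{s} {p} {o}")
```

Asserting the property “cause” thus automatically yields the inverse statement “Growth of energy price is caused by Scarce Resources Dependency”, mirroring the inverse functions used in the ontology.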
Figure 3.1 shows an example of the high degree of interconnection created within
the ontology using horizontal links. An example easily seen in the figure (on the
right-hand side) is the link between the class “Solutions” and the class “Friendliness
to Environment”. It shows, and gives us the information, that there is one instance of
the class Solutions (we recognize that it is an instance because the colour of the link
is pink) that is linked to the issue of friendliness to the environment (and it is
probably linked in a positive manner, in the sense that one of the solutions to the
energy problems goes in the direction of friendliness to the environment).
Figure 3.1. Horizontal Links within the ontology
3.1.5 Ontology Population
Once the ontology structure was defined, ontology population phase started
launching crawlers on the web in order to retrieve other meaningful documents
about energy domain, both using automatic lexical acquisition systems in order to
update the ontology with new concepts and new lexicon. The first strategy, with the
crawlers, allowed us to recover relevant documents not previously retrieved with
the search queries. The second strategy was followed through the use of an
automatic keyphrase extraction system called LAKE (D’Avanzo et al. 2004) that
will be described in detail in the next chapter. This system was used to extract
relevant keywords from the documents given in input (the input documents were
that retrieved using the crawlers strategy) and new lexicon and new concepts, then
manually inserted into the ontology, were recovered using the extracted keywords.
This kind of work has been mainly done for a specific class of the ontology: the
energy security class.
3.1.6 Logical Expressions
In order to insert different types of information into the ontology, we used logical
expressions. To create these expressions we used the quantifiers of first-order
logic, as introduced by Frege (1879), in order to introduce more and better
inferential mechanisms into the system. The introduction of first-order descriptive
logic expressions was possible because of the choice of building the energy
ontology using OWL DL. This language, in fact, allows the ontology to be enriched
by inserting formal descriptive expressions. OWL DL can be viewed as an
expressive Description Logic, and an OWL DL ontology is the equivalent of a
Description Logic knowledge base. We used two kinds of expressions to specify, in
a formal manner, the information provided to the system: expressions with the
Existential and with the Universal quantifier:
1. (∃) Expressions with the Existential quantifier: these specify, for a set of
individuals, the existence of a (at least one) relationship along a given property to
an individual that is a member of a specific class.
2. (∀) Expressions with the Universal quantifier: these constrain the
relationships along a given property to individuals that are members of a specific
class (in other terms, they are used to state that “all the individuals of the class X
have a certain property Y”).
A simple example of a “double translation” of a natural-language statement about
the energy domain, first into formal logic and then into an OWL DL construction, is
the following:
• Verbal proposition: Some fossil fuels cause some environmental
consequences or some risks for the energy domain.
• First-order predicate logic: ∃x (F(x) ∧ (Ce(x) ∨ Cr(x))) (with the existential
quantifier the conjunction, not the implication, must be used, otherwise the formula
would be satisfied by any individual that is not a fossil fuel).
• Protégé OWL DL construction: Fossil Fuels cause some (Environmental
Consequences or Risks); see figure 3.2 for an example.
Figure 3.2. A logic expression inserted into the Energy Ontology
Another reason that guided our choice of the OWL DL language is the fact that
there are many implementations available for reasoning tests, and that was
important for evaluating the consistency of the ontology. The OWL Plugin of
Protégé, in fact, provides direct access to DL reasoners such as RacerPro (which
will be discussed in the next paragraph).
The current user interface supports two types of DL reasoning: Consistency
checking and classification (subsumption).
Consistency checking (i.e., the test whether a class could have instances) can be
invoked either for all classes with a single mouse click, or for selected classes only.
Inconsistent classes are marked with a red bordered icon.
Classification (i.e., inferring a new subsumption tree from the asserted definitions)
can be invoked with the classify button on a one-shot basis. When the classify
button is pressed, the system determines the OWL species, because some reasoners
are unable to handle the ontologies in OWL Full. If the ontology is in OWL Full
(e.g., because metaclasses are used) the system attempts to convert the ontology
temporarily into OWL DL. The OWL Plugin supports editing some features of
OWL Full (e.g., assigning ranges to annotation properties, and creating
metaclasses). These are easily detected and can be removed before the data are sent
to the classifier. Once the ontology has been converted into OWL DL, a full
consistency check is performed, because inconsistent classes cannot be classified
correctly.
Finally, the classification results are stored until the next invocation of the classifier,
and can be browsed separately. Classification can be invoked either for the whole
ontology, or for selected sub-trees only. In the latter case, the transitive closure of
all accessible classes is sent to the classifier. This may return an incomplete
classification because it does not take incoming edges into account, but in many
cases it provides a reasonable approximation without having to process the whole
ontology. OWL files store only the subsumptions that have been asserted by the
user. However, experience has shown that, in order to edit and correct their
ontologies, users need to distinguish between what they have asserted and what the
classifier has inferred. Many users may find it more natural to navigate the inferred
hierarchy, because it displays the semantically correct position of all the classes.
The OWL Plugin addresses this need by displaying both hierarchies and making
available extensive information on the inferences made during classification. After
classification the OWL Plugin displays an inferred classification hierarchy beside
the original asserted hierarchy. The classes that have changed their superclasses are
highlighted in blue, and moving the mouse over them explains the changes.
Furthermore, a complete list of all changes suggested by the classifier is shown in
the upper right area, similar to a list of compiler messages. A click on an entry
navigates to the affected class. Also, the conditions widget can be switched between
asserted and inferred conditions. All this allows the users to analyze the changes
quickly. The correctness of the manual ontology has been evaluated using the first
of the above-mentioned tasks: consistency checking.
3.1.7 Racer Pro
RacerPro stands for Renamed ABox and Concept Expression Reasoner
Professional. Its origins are within the area of description logics. Since description
logics provide the foundation of international approaches to standardizing ontology
languages in the context of the Semantic Web, RacerPro is also used as a system
for managing Semantic Web ontologies based on OWL. It can be used as a reasoning
engine for ontology editors such as Protégé. An important aspect of this system is
the ability to process OWL documents. The following services are provided for
OWL ontologies and RDF data descriptions:
• Check the consistency of an OWL ontology and a set of data descriptions.
• Find implicit subclass relationships induced by the declarations in the
ontology.
• Find synonyms for resources (either classes or instance names).
• Since extensional information from OWL documents (OWL instances and
their interrelationships) needs to be queried for client applications, an OWL-
QL query processing system is available as an open-source project for
RacerPro.
• HTTP client for retrieving imported resources from the web. Multiple
resources can be imported into one ontology.
• Incremental query answering for information retrieval tasks (retrieve the
next n results of a query).
In addition, RacerPro supports the adaptive use of computational resources:
answers which require few computational resources are delivered first, and user
applications can decide whether computing all answers is worth the effort. In order
to evaluate the manually built ontology, we first connected this tool to the ontology
in OWL (the process is shown in figure 3.3) and then ran it to check the ontology
consistency.
Figure 3.3. Ontology connection with the RacerPro reasoner
The process of ontology evaluation was carried out during the whole process of
ontology building, until the last update of the ontology. This allowed us to correct
immediately the logical inconsistencies we found during the process. During the last
evaluation, no logical inconsistencies or taxonomy errors were found, which means
that the manually built ontology has an internal logical coherence.
3.2 Manual approach: Energy Ontology Description
This paragraph describes the classes of the manual energy ontology. Some of
them turned out to be more “meaningful” than others because of their high level of
interconnection with other classes and instances. The full data of the manually built
ontology are given in table 3.3.
Table 3.3. Energy Ontology Data
Documents: 200; Classes: 54; Instances: 121; Properties: 30 (inverse functions
included)
3.2.1 Energy Domain
For the energy ontology we adopted a super-class named Energy_Domain
composed of 6 sub-classes: Country, EnergySecurity, EnergySources,
Infrastructures, Market and EnvironmentalConsequences (see figures 3.4 and 3.5
below).
Figure 3.4: Energy domain subclasses in the “Classes view” tab of Protégé
Figure 3.5: A visualization of the Energy domain subclasses with the Ontoviz plugin
At the same level as the energy domain super-class, two transverse classes, Risks
and Solutions, have been inserted (figure 3.6). This choice was made because
these two categories were present in many of the classes of the ontology, and so it
was not possible to clearly decide in which classes to include them and in which
not. With this strategy, instead, it was possible to create many horizontal relations
between these classes and the other classes of the ontology.
For the concept “Risks” we considered as subclasses different types of issues,
such as: economic problems (e.g. market instabilities, market cartels, growth in
energy prices, etc.), geopolitical problems (e.g. natural gas, oil and nuclear
disputes) and technical problems.
For “Solutions”, instead, we mainly considered as subclasses the government
actions and policies used to avoid energy problems. We divided this class into 3
sub-classes: long-term, medium-term and short-term policies.
Figure 3.6: Energy Ontology Superclasses
3.2.2 Energy Sources
We divided this class into Primary, Secondary and Nuclear sources, following the
most common distinction made in the literature on these issues. Primary sources
do not need to be subjected to any conversion or transformation process to be
used. Secondary energy sources, instead, have to be transformed from one form to
another to be used. Each of these two classes has further subclasses, given by the
distinction between renewable and non-renewable sources. For the nuclear class,
the subclasses nuclear energy sources and nuclear weapon proliferation have been
identified. This last class allowed us to create horizontal links with the class
“geopolitical problems”, which is a subclass of the superclass “Risks”.
Figure 3.7. Energy Sources Classification
3.2.3 The case of the Hydrogen: Renewable or not Renewable?
One of the main problems encountered during the ontology building task was the
difficulty of classifying some entities as belonging to one class or another. A
meaningful example of this type of classification problem is what we have called
the “case of the hydrogen”. The problem was the following: the different sources
we consulted classified hydrogen in different ways, because hydrogen can be both
renewable and non-renewable, depending on the source from which it is extracted.
So we had the problem that one concept of the ontology was a member, at the
same time, of one class and of its opposite, which, of course, is not allowed. In
order to overcome this problem we created different “hydrogen entities”: one was
“hydrogen from renewable sources (solar, wind, etc.)” and the other was “hydrogen
from non-renewable sources (hydrocarbons, etc.)”. Figure 3.8 shows the way in
which we solved the problem.
Figure 3.8: the Classification of Hydrogen
3.2.4 Country
The Country class has 2 subclasses, covering OPEC members and non-OPEC
members. It is an important class because it allows the creation of horizontal
relationships with the energy sources class or with the economic and
environmental policy aspects. This class can be modified and modelled in a
different manner, considering different classification criteria. For example, if we
want to focus on the geopolitical problems (which, of course, involve the countries)
related to the natural gas energy source, a different classification can be proposed
by a domain expert and the number of subclasses can easily be enlarged.
3.2.5 Energy Security Class
Energy Security is the class of the ontology on which we mainly focused during the
last part of the manual ontology building, following the suggestions of the energy
domain expert Dr. Brenda Schaffer. This issue has risen to the top of the agenda
among policy makers, international organizations and business because, in the last
decade, there has been a sustained growth in the demand for energy that raises
serious concerns about the long-term availability of reliable and affordable
supplies.
Energy security has broad economic, political and societal consequences. A lack of
energy security can exacerbate geopolitical tensions and impede development.
Because of the interconnection of this concept with many other aspects
(geopolitical and economic problems, technical aspects, etc.), we spent
considerable time and effort on the classification of this class, because there is no
clear definition of it. It is a sort of “multi-faceted concept”, because it concerns
different issues. So we created for this concept an infrastructure able to link it with
other concepts within the ontology. The relations created allowed the connection of
this concept with up to 3 other classes. The main direct subclasses of energy
security are presented in figure 3.9.
Figure 3.9: Energy Security direct subclasses (Reliability of Supply, Scarce
Resources Dependency, Resources Affordability, Friendliness to Environment)
Figure 3.10 exemplifies the way we created horizontal links with the other classes
of the ontology. For example, we linked the concept “energy security” with the
concept “geopolitical problems”, which is a sub-concept of the superclass “Risks”.
Moreover, by creating these links, we were able to connect three classes of the
ontology: Energy Security (Reliability of Supply is its subclass), Country and
Geopolitical Problems. The created infrastructure, indeed, allowed us to provide
this kind of information: “Cut off in supply for USA (which is an energy security
subclass) is caused by the USA vs Venezuela dispute (which is a subclass of
geopolitical problems), and this fact involves USA and Venezuela (which are
instances of the class Country)”. In the figure, the arrows represent the horizontal
links (the position of the classes in the figure does not show their real position
within the ontology: e.g. Risks and Energy Security are not at the same level). USA
and Venezuela are two instances and are shown in blue; the classes are in yellow.
The property which links “Cut off in supply for USA” and the subclass “Venezuela
vs USA dispute” is “is caused by” (there is also the inverse property “cause”), and it
is represented by the black arrow in the figure. The property which links the
instances “USA” and “Venezuela” with the classes in the figure (red arrows) is
“involve” (with its inverse function “is involved in”), and an example of the
information provided by these links is the following: “Venezuela and USA are
involved in a dispute between them that caused a cut off in supply for the USA”.
Figure 3.10: Ontology connection between Energy Security and Geopolitical Problems
Another example of a horizontal connection created is shown in figure 3.11. The
link is between an energy security subclass (Scarce Resources Dependency) and
economic problems such as market cartels, growth of energy prices and market
instabilities (Economic Problems is one of the subclasses of the superclass Risks).
The horizontal links are represented, in the figure, by the bidirectional arrows. The
arrows indicate the property “cause” (and its inverse function “is caused by”). So, in
this case, the created links allowed us to provide, for example, this kind of
information within the ontology: “Scarce Resources Dependency causes growth of
energy prices” and its inverse “Growth of energy prices is caused by scarce
resources dependency”. The same holds for the other classes linked in the figure.
Figure 3.11 Energy security and Economic Problems
3.2.6 Infrastructures
This class represents the infrastructures used for energy transformation, extraction
and transportation. For this class 10 instances have been identified, which are
listed here: Oil Tanker, Pipelines, Turbine, Drilling Equipment, Barge, Heliostats,
Methane Pipelines, Ship, Train, Refinery.
3.2.7 Environmental Consequences
“Environmental Consequences” is a very important issue within the energy domain.
Within the energy ontology we created this class, identifying 14 instances referring
to this issue. They are listed here: climate change, pollution, deforestation,
desertification, global warming, higher global temperatures, bird flight patterns,
CO2 emissions, damage to views, flooding, droughts, increased rains, greenhouse
gas emissions, impacts on weather, urbanization.
Many of these instances are highly correlated with one another. For example,
climate change refers to any significant change in measures of climate (such as
temperature, precipitation or wind) lasting for an extended period (decades or
longer). It may result from:
1. natural factors, such as changes in the sun's intensity or slow changes in
the Earth's orbit around the sun;
2. natural processes within the climate system (e.g. changes in ocean
circulation);
3. human activities that change the atmosphere's composition (e.g. through
burning fossil fuels) and the land surface (e.g. deforestation,
reforestation, urbanization, desertification, etc.)
This last possibility allows us to link the instances “desertification”,
“deforestation”, “urbanization” and “climate change” through information such as
the following: “Deforestation is one of the causes of climate change”. Even this
information is provided by means of a horizontal link between instances
belonging to the same class (in fact, even in this case the link is horizontal,
because the instances are all at the same level of the hierarchy).
3.2.8 Energy Use
For the class Energy Use we followed the major distinction made in the literature on
this matter, identifying 5 subclasses, one for each sector in which energy is
consumed. The subclasses identified are: Commercial, Electric, Industrial,
Transportation, Residential. For these subclasses we created some horizontal links
regarding the energy sources mainly utilized for the different uses (linking, in this
way, the concepts of “Energy Sources” with the above-mentioned “Energy Use”
subclasses). For example, for residential use mainly non-renewable sources are
consumed, and the information provided through the horizontal link is the following:
“Residential Use mainly makes use of Non Renewable Sources” (the property
identified for this link is “mainly use”, with its inverse function “is mainly used”).
3.3 A Semi automatic approach for ontology building
There are pros and cons to manual ontology construction. The pros are that human
expertise and background knowledge of a specific domain direct the definition of
the concept taxonomy and the creation of relevant properties. Moreover, it enables
the insertion of meaningful information into the ontological system. On the other
hand, one of the major disadvantages of manual ontology construction is the effort
required from ontology designers. Indeed, for an ontology engineer, to manually
construct, without any “help”, a complex knowledge representation system such as
an ontology means spending a lot of time in: reading documents, extracting the
most relevant keywords for each document, inserting the keywords into the
ontology in a proper manner (without creating confusion between class keywords
and instance keywords), inferring relations between terms (represented as classes
or as instances), etc. Furthermore, manual ontology building implies that the
ontology builder is a specialist in ontology construction or, at least, a domain expert
with ontology engineering skills who is familiar with the relevant ontology editors
(e.g. Protégé, OntoStudio, etc.). The question is: how many domain specialists in
the various domains also know something about ontologies?
In order to overcome these problems, different systems able to build ontologies in
an automatic or semi-automatic way have been developed in recent years. They
may start from dictionaries, knowledge bases, semi-structured schemata, relational
schemata or unstructured textual documents (see Chapter 2 for a rapid overview).
One of these, briefly described in the previous chapter, is OntoGen. It has been
developed by Blaz Fortuna, Marco Grobelnik and Dunja Mladenic of the Jožef
Stefan Institute, Ljubljana (Fortuna et al. 2007). It is a semi-automatic, data-driven
topic ontology9 editor. It integrates machine learning and text mining
algorithms to reduce both the time spent by the user on the ontology building and
the complexity of the task itself (Fortuna et al. 2006). It is “semi-automatic” because
it automatically performs some tasks (e.g. suggesting concepts and concept
names, assigning instances to concepts, etc.), but, at the same time, it allows the
user to keep full control of the system, because he/she can accept, adjust (even
manually) or reject the modifications that are made automatically. It is “data-driven”
because the system is “guided” by the data given by the user during the initial
phase of the ontology construction. The data provided by the user is a corpus of
documents given as input to the system in order to generate the ontology and, for
that reason, they reflect the domain knowledge for which the user is building the
ontology. The documents are represented in a bag-of-words (BOW) representation,
in which each document is encoded as a vector of term frequencies. The terms are
weighted with the TFxIDF measure, and the similarity between two documents is
calculated with the cosine similarity measure, which is defined by the cosine of the
angle between two bag-of-words vectors. TFxIDF is a quite common linguistic
frequency measure: it combines the frequency of a term within a document (TF)
with the frequency of that term in the corpus (hence integrating the relative
importance of the term within a document with how well the term represents the
document in the corpus).
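The representation described above can be sketched as follows (a minimal illustration with invented toy documents; OntoGen's actual implementation details may differ):

```python
import math
from collections import Counter

# Minimal bag-of-words sketch: documents as TFxIDF-weighted term vectors,
# compared with cosine similarity. Toy documents for illustration only.
docs = [
    "oil supply security and oil price",
    "renewable energy sources wind and solar",
    "oil price and energy market",
]
tokenized = [d.split() for d in docs]

def tfidf(doc_tokens, corpus):
    """TFxIDF vector of one document, as a {term: weight} dict."""
    n = len(corpus)
    vec = {}
    for term, freq in Counter(doc_tokens).items():
        df = sum(1 for d in corpus if term in d)  # document frequency
        vec[term] = freq * math.log(n / df)       # TF x IDF weight
    return vec

def cosine(a, b):
    """Cosine of the angle between two bag-of-words vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(t, tokenized) for t in tokenized]
# The first document (about oil prices) is closer to the third than
# to the second (about renewables): shared terms like "and", present
# in every document, get IDF = 0 and so do not contribute.
print(cosine(vecs[0], vecs[2]), cosine(vecs[0], vecs[1]))
```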
Ontogen’s work is based on the automatic extraction of keywords from the
documents given as input by the user for the ontology generation. There are two
keyword extraction methods applied by the system. The first, shown under the
label “keywords” in the system’s interface (see figure 3.12), uses centroid vectors.
The centroid is the sum of all the vectors of the documents inside the topic. The
keywords selected are those with the highest weights in the centroid vector.
9 A topic ontology is a set of topics (or concepts) connected with different types of relations. Each topic includes a set of related documents.
Figure 3.12. Keywords extraction methods with centroid vectors.
The set of keywords extracted with this method is composed of the most
descriptive words of the concept’s documents (or instances).
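The centroid method can be sketched as follows (a simplified illustration with invented documents, in which raw term counts stand in for the TFxIDF-weighted vectors):

```python
from collections import Counter

# Centroid keyword extraction, as described above: the centroid of a
# topic is the sum of its documents' term vectors, and the suggested
# keywords are the terms with the highest centroid weights.
# (Simplified: raw term counts stand in for the TFxIDF weights.)
topic_docs = [
    "energy security and supply",
    "security of oil supply",
    "energy supply risks",
]

def centroid_keywords(docs, k=3):
    centroid = Counter()
    for d in docs:
        centroid.update(d.split())  # sum of the document vectors
    return [term for term, _ in centroid.most_common(k)]

print(centroid_keywords(topic_docs))
```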
The second method, shown under the label “SVM keywords” (see figure 3.13),
uses a Support Vector Machine classifier. This method is used to extract keywords
describing a selected concept. The classifier is trained as follows: let A be the topic
to be described with the keywords. All the documents from the concepts that have
A as a subtopic are marked as negative, while the documents under A are marked
as positive. The linear SVM classifier is then trained on these documents and
classifies the centroid of the topic A. With this method, the set of extracted
keywords is composed of the most distinctive words for the selected concept with
respect to its sibling concepts in the hierarchy (Fortuna et al. 2005).
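The idea can be sketched with a simple linear classifier (here a perceptron stands in for the linear SVM, and the documents are invented for illustration): documents under the topic are positive examples, documents of the sibling topics are negative, and the most distinctive keywords are the terms with the largest positive weights of the learned linear separator.

```python
# Sketch of SVM-style keyword extraction: train a linear classifier to
# separate the topic's documents (positive) from its siblings' documents
# (negative), then read off the terms with the largest positive weights.
# A simple perceptron stands in for the linear SVM used by OntoGen.

pos_docs = ["oil supply security", "security of energy supply"]
neg_docs = ["solar panel technology", "wind turbine technology"]

vocab = sorted({t for d in pos_docs + neg_docs for t in d.split()})
index = {t: i for i, t in enumerate(vocab)}

def bow(doc):
    v = [0.0] * len(vocab)
    for t in doc.split():
        v[index[t]] += 1.0
    return v

# Perceptron training: nudge the weight vector towards misclassified
# positives and away from misclassified negatives.
w = [0.0] * len(vocab)
data = [(bow(d), 1) for d in pos_docs] + [(bow(d), -1) for d in neg_docs]
for _ in range(20):
    for x, y in data:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]

# Distinctive keywords: terms with the largest positive weights.
keywords = sorted(vocab, key=lambda t: -w[index[t]])[:3]
print(keywords)
```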
Figure 3.13. Keyword extraction methods with Support Vector Machine Classifier.
The keywords extracted by one of the above methods are used to suggest possible
topics, subtopics or topic names for the ontology building. Suggestion generation is
one of the main features of OntoGen. It can be provided in two different manners:
with unsupervised or with supervised methods. In the unsupervised approach the
system provides concept and sub-concept suggestions using keywords extracted
from the documents of the selected topics with Latent Semantic Indexing (LSI) or
the k-means algorithm (the method is chosen by the user). The main advantage of
the unsupervised method is that it requires very little input from the user (only the
number of clusters must be indicated). With the supervised method, instead, the
concept suggestion is “driven” by the user, in the sense that using this method
implies that the user has an initial idea of what a concept or a sub-concept of the
ontology should be. With
this knowledge background he/she can enter a query and use it to train the system
using relevant documents, as explained below (see figure 3.14).
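The unsupervised suggestion step can be illustrated with a toy k-means clustering of bag-of-words vectors (a sketch with invented documents: the user supplies only the number of clusters, and each resulting group of documents is a candidate sub-concept; OntoGen additionally uses TFxIDF weighting and cosine similarity):

```python
# Toy k-means sketch of the unsupervised concept suggestion: the user
# supplies only the number of clusters; each resulting cluster of
# documents is a candidate sub-concept.

docs = [
    "oil supply security",
    "oil price security",
    "solar wind renewable",
    "wind solar energy",
]
vocab = sorted({t for d in docs for t in d.split()})

def bow(doc):
    return [doc.split().count(t) for t in vocab]

def dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10):
    centers = [list(v) for v in vectors[:k]]  # naive initialisation
    groups = [vectors]
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest center.
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[min(range(k), key=lambda i: dist(v, centers[i]))].append(v)
        # Update step: each center becomes the mean of its group.
        for i, g in enumerate(groups):
            if g:
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return groups

vecs = [bow(d) for d in docs]
clusters = kmeans(vecs, k=2)
print([len(c) for c in clusters])  # the oil documents and the renewables
                                   # documents end up in separate clusters
```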
Figure 3.14: An example of a user query with the supervised method for suggestion
generation in OntoGen
Once the user enters his/her query in the form of keywords or keyphrases, the
system starts asking him/her questions about whether a particular document
belongs to the selected concept, and the user can select the Yes or No buttons as
answers (see figure 3.15).
Figure 3.15. Training phase in OntoGen using the supervised method for concept suggestion
In other words, after the user’s query, the system starts an active learning process
based on the feedback provided by the user during a training phase. In this phase
the user has to provide information to the system, answering with positive or
negative feedback (the Yes or No buttons) the question: “Does this document
belong to the concept?”. The system, on its side, uses a machine learning
technique for the semi-automatic acquisition of the user’s knowledge: it refines the
suggested concept after each reply of the user and, once the user is satisfied with
the suggestions, the concept is constructed and added to the ontology as a
sub-concept of a selected concept.
3.3.1 Semi Automatic Ontology Generation
OntoGen has been used for the semi-automatic generation of a part of the energy
domain ontology. The part chosen is the energy security class, which is a sort of
micro-ontology within the general energy domain (it is formed by 14 classes in
total). The semi-automatic ontology for the pilot-study experiments has been
created from 25 selected domain documents about energy security. The system
was able to build an ontology of 10 concepts. A screenshot of the generated
concepts is presented in figure 3.16. Figure 3.17 shows the semi-automatically
generated ontology exported in OWL format within Protégé.
Figure 3.16: A screenshot of the semi-automatic generated concepts
Figure 3.17. The Semi-automatic generated ontology exported in OWL
Chapter 4: Evaluation and Experiments
There are different approaches to ontology evaluation and comparison. The major
distinction is between qualitative and quantitative ones (Brewster et al., 2004).
Qualitative evaluations can be done by presenting users with an ontology, or a part
of an ontology, and asking them to rate it. The problem with this approach is that it
does not use any standard criteria for the evaluation. The quantitative approach,
instead, allows a semi-automatically or automatically generated ontology to be
evaluated and compared with an existing “gold standard”, represented for instance
by a manually built ontology, using different measures. Maedche and Staab (2002)
used an evaluation method based on comparing the ontology to a “gold standard”
which may be an ontology itself. Porzel and Malaka (2004) support the idea of
evaluating an ontology through the concrete results of its application. Lozano-Tello
and Gomez-Perez (2004) proposed an evaluation approach based on the human
evaluation of the ontology, in which the humans try to assess how well the
ontology fits a set of
predefined criteria and standards. Brank et al (2005) support the idea that it is often
more practical to evaluate different levels of the ontology separately. They
identified the following levels for an evaluation: lexical, vocabulary or data layer
level in which the focus is on which concepts or instances have been included into
the ontology and the vocabulary used to identify this concepts; hierarchy or
taxonomy level: here the evaluation is focused on the is-a relations; other semantic
relations: the evaluation is focused on different types of semantic relation (is a
relations are not considered); context or application level: at his level the context of
which an ontology can be part is taken into account (e.g. an ontology may be part of
72
a larger collection of other ontologies); syntactic level: the ontology is described in
a particular formal language and is evaluated if the language used for the ontology
matches the syntactic requirement of that language; structure, architecture, design
level: this kind of evaluation usually performed entirely manually and consists of
the observation of certain pre-defined design principles. The structure evaluation
involves the organization of the ontology and its suitability for further
developments. Guarino and Welty (2005) consider different aspects to the ontology
evaluation. They point out that several philosophical notions (such as essentiality,
rigidity, unity etc) can be used to better understand and evaluate the nature of
different types of semantic relations that commonly appear in an ontology. A
drawback of this approach is that it requires manual intervention by trained humans
familiar with the above mentioned notions.
We evaluated the semi-automatic ontology by comparing it with the manually
generated one, using precision and recall measures: as reported by
Gangemi et al. (2005), this is one of the most common approaches to ontology
evaluation, thanks to its simplicity.
4.1 Experiments for the Pilot Study
We conducted a pilot case study for the Energy Domain, using 25 documents about
energy security. The pilot study allowed us to evaluate the semi-automatic approach
to ontology building (we used Ontogen for the semi-automatic ontology generation)
and to uncover a set of pros and cons with respect to a manual approach. The
pros were the following: the ontology construction task is simplified by the
semi-automatic system and, in this way, it doesn’t require specific expertise in
ontology engineering. Ontogen offers suggestions for topics or subtopics of the
ontology as sets of keywords extracted from a corpus of documents. This, in turn,
allows the discovery of relevant classes that were not even considered in the manual
approach. In this view, Ontogen could be used in parallel with the manual ontology
construction, suggesting to the ontology engineer classes and instances missed during
the manual step. Regarding the cons: the ontology relations among concepts are very
poor (only is-a relations and “similarity” relations between concepts are allowed),
so other meaningful relations among different concepts are missing. Furthermore,
there is often no alignment between the keywords extracted by Ontogen, which also
serve as class names, and those assigned manually by humans. This problem is
probably common to all keyword extraction systems integrated with Ontogen and
might be reduced by using more linguistically based keyword extraction systems
such as LAKE (D’Avanzo, Kuflik 2005).
A deeper evaluation of the semi-automatically generated energy ontology against the
manual gold standard one has been carried out with two experiments. We compared the
two ontologies using a quantitative approach, calculating precision and recall
measures in order to see whether there is a partial overlap between them.
4.2 Precision and Recall for a quantitative evaluation
Precision and recall are two standard, well-known measures in the Information
Retrieval field. They are used, for instance, to evaluate the effectiveness of a search
engine, in order to understand whether the documents retrieved after a query are
relevant or not with respect to the information need of the user who issued the query
(see Table 4.1).
They are calculated as follows:
• PRECISION = Relevant Retrieved / Retrieved = R,R / (R,R + NR,R)
• RECALL = Relevant Retrieved / Relevant = R,R / (R,R + R,NR)
These two measures have been adapted for the evaluation task regarding the
comparison between manual and semi-automatic ontology building. So, in this
case, the precision value indicates how many relevant semi-automatically generated
concepts are produced out of the total of the semi-automatically generated concepts.
Recall, instead, reflects how many relevant concepts are generated out of the
total of the relevant, manually built, concepts (it is a “coverage” measure). The
evaluation has been conducted only on a part of the energy ontology: the energy
security class. This class, as shown previously, has other subclasses and many
horizontal relations with other classes within the energy domain ontology.

                RELEVANT   NOT RELEVANT
RETRIEVED       R,R        NR,R
NOT RETRIEVED   R,NR       NR,NR

Table 4.1. Precision and recall
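The adapted measures can be sketched as a simple computation over two sets of concept names. The concept names below are hypothetical, for illustration only:

```python
# Illustrative sketch (not the evaluation code used in the thesis):
# adapted precision and recall over two sets of concept names.

def precision_recall(generated, gold):
    """Precision = relevant generated / all generated concepts;
    recall = relevant generated / all gold-standard concepts."""
    relevant = generated & gold          # concepts matching the gold standard
    return len(relevant) / len(generated), len(relevant) / len(gold)

# Hypothetical concept names, for illustration only.
semi_auto = {"energy security", "oil", "pipeline", "gas", "market"}
manual = {"energy security", "oil", "gas", "geopolitical problems",
          "supply", "infrastructure", "transit states", "reserves",
          "demand", "prices"}

p, r = precision_recall(semi_auto, manual)
print(round(p, 2), round(r, 2))  # 3 matches out of 5 generated and 10 gold -> 0.6 0.3
```

In the actual evaluation the matching was done by hand, counting synonyms as matches as well; the set intersection above corresponds to the strict “term matching” case.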
The results of the first experiment show a precision value of 40% and a recall of
30% (figure 4.1). In this case, not only the generated classes that fully overlapped
the manually generated concepts were considered “good”, but also the synonyms of
the terms representing a class. These values are in line with other comparative
results for the evaluation of semi-automatically generated concepts against gold
standard ones. Weber and Buitelaar (2005), in fact, evaluated an automatically
generated ontology of Sports Events against a manually generated “gold standard”
one and presented similar results: 32% precision and 46% recall. So they have a
better recall value and a worse precision one. We also considered the F-Measure
values for a better comparison (the F-Measure is the harmonic mean of the precision
and recall values). The results for this first experiment are the following: we have
an F-Measure of 0.34 while they present a value of 0.37.
[Bar chart: Precision = 40%; Recall = 30%]
Figure 4.1. Experiment n.1: Precision and Recall results.
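The harmonic-mean computation used for the comparison above can be sketched as follows (a minimal illustration of the F-Measure formula, not the evaluation code actually used):

```python
# Sketch of the F-Measure: the harmonic mean of precision and recall.

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Experiment n. 1: precision 40%, recall 30%.
print(round(f_measure(0.40, 0.30), 2))  # -> 0.34, as reported in the text
```

The harmonic mean penalizes imbalance between the two values, which is why a system with better recall but worse precision (such as the Weber and Buitelaar result) can still end up with a comparable F value.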
After the first, a second pilot experiment was carried out considering semantically
based matching criteria for the measurement. In the second experiment, in fact, even
the manual instances of the energy security class were considered as “concepts”
of the class itself. That operation was possible because Ontogen doesn’t make any
conceptual distinction between the classes (representing the concepts) and the
instances of the manual ontology. So, in order to have a real comparison between the
relevant conceptual entities generated by the two approaches, this kind of
operation was considered legitimate. Moreover, even some semantically related
concepts of the energy security class, not formally present in the ontological
space of that class but at other levels of the ontology, were considered as
“concepts” of the energy security class (e.g. the concept “geopolitical problems”,
not directly present within the hierarchy of energy security but semantically
linked to this concept by many horizontal links, was considered a “concept”
of energy security). Of course, with these manipulations, the numerator and the
denominator of the recall calculation changed, and so we obtained a different,
better recall value with respect to the first experiment. The precision value also
improved with these adjustments because, by considering as relevant even entities
not counted in the previous measurement, the numerator increases while the
denominator stays the same. The new results are shown below in figure 4.2. The
comparison between the results of experiments n. 1 and n. 2 is presented in
figures 4.3 and 4.4.
77
[Bar chart: Precision = 45%; Recall = 40%]
Figure 4.2. Experiment n. 2. Precision and Recall Results
[Bar chart: First Recall Value = 30%; Second Recall Value = 40%]
Figure 4.3. Experiments n 1-2: Recall value comparison considering semantic adjustment
[Bar chart: First Precision Value = 40%; Second Precision Value = 45%]
Figure 4.4. Precision Values Comparison considering semantic adjustments.
The results of this first pilot study on the Energy Domain ontology are quite aligned
with other evaluation results obtained for this task of comparing manual
“gold standard” ontologies against semi-automatically or automatically generated
ones (e.g. Weber et al., 2005).
In the next paragraph the results of new experiments will be shown, conducted with
more documents given as input to the Ontogen system in order to let it work better.
Our expectation before these new experiments was that the number of generated
concepts, and of relevant generated concepts, would increase, with repercussions on
the precision and recall measures as well. These expectations have been
substantially confirmed by the new results.
4.3 New Experiments
After the first pilot study, two new experiments were carried out for the evaluation of
the semi-automatic ontology building with Ontogen. In these experiments more
documents were provided to the system for the ontology generation
(50 instead of the previous 25). The class considered for the evaluation was again
the “energy security” class within the energy domain ontology.
As expected, the results showed some improvement because, generally, the more
documents are given as input to this kind of system, the better it works.
For the new experiments we followed the same method used for the pilot study: in
the first step (the first experiment) we calculated the overlap between the manual
and the semi-automatic ontology exclusively considering the “concept name
alignment” between the two (and, as in the pilot study, synonyms were
considered “good” and counted). So, in this first case, the comparison was
mainly based on a sort of “term matching” between the manual terms used to
represent a concept within the ontology and those semi-automatically created.
Then, for the second experiment, more semantically motivated matching criteria
were considered for the calculation of the precision and recall values and, as verified
in the pilot study (and as expected), these two values increased.
Figure 4.5 shows the precision and recall results of the first new
experiment. We note that the precision value (42%) is quite similar to the precision
value of the first experiment conducted for the pilot study (40%). The recall value,
instead, had a meaningful increase (38% compared to the previous 30% of
the pilot study).
[Bar chart: Precision = 42%; Recall = 38%]
Figure 4.5. Precision and Recall results with 50 documents as input
The results of the second experiment, shown in figure 4.6, tend instead to confirm
the trend of the pilot study: both precision and recall grew when considering more
tolerant semantics-based matching criteria.
[Bar chart: Precision = 50%; Recall = 42%]
Figure 4.6. Experiment n. 4: Results
From the comparison of the experimental results, two kinds of relevant information
emerge. The first is that the precision value, if calculated without any semantic
adjustment, does not seem to be much affected by the increase in the number of
documents given as input to the system for the semi-automatic ontology generation.
The new results, in fact, are not so different from those obtained using 25 documents
to generate the ontology: the range of variation is only 2 percentage points for this
value (from 40% to 42%, see figure 4.7).
[Bar chart: Precision = 40% (Exp. 1); 42% (Exp. 3)]
Figure 4.7. Precision Value Trend for the Experiments 1 and 3
The second relevant piece of information is the opposite behaviour of the recall
value in passing from the ontology built with 25 documents to that built with 50
(considering the results obtained without semantic adjustments). Recall, in fact,
seems to be more influenced by the increased number of documents (see figure 4.8):
this value passed from the 30% of the first experiment conducted for the pilot
study to the 38% of experiment n. 3 (the first of the two new experiments).
So, at first sight, it seems that, given more input documents, the system
improves more on the recall measure than on the precision one.
[Bar chart: Recall = 30% (Exp. 1); 38% (Exp. 3)]
Figure 4.8: Experiments n.1 and 3, Recall Value Trend
Considering the precision and recall results with semantic adjustments (passing
from 25 to 50 documents), we can observe that the situation doesn’t change much:
both measures register an improvement. As for precision, its value passes from the
45% of experiment n. 2 of the pilot study (with 25 documents) to the 50% of
experiment n. 4 with 50 documents (figure 4.9). The recall value, instead, passes
from the 40% of the pilot study to 42% (figure 4.10).
[Bar chart: Precision = 45% (Exp. 2); 50% (Exp. 4)]
Figure 4.9: Precision Value Trend for the Experiments 2 and 4.
[Bar chart: Recall = 40% (Exp. 2); 42% (Exp. 4)]
Figure 4.10. Recall Value Trend for the Experiments 2 and 4
Putting the obtained results in a more global perspective, i.e. comparing them with
other similar evaluation results, we can draw other interesting considerations
about the experiments. The first is that the measured precision values (considering
both the non-semantically enhanced and the semantically enhanced results) are
always better than those of other similar evaluations, such as the research of
Weber et al. (2005). On the other hand, the recall values are always lower when
compared with the Weber et al. results. Considering the F-Measure for both
experimental situations (with non-semantic and with semantically enhanced criteria)
allowed us to make a better comparison between the obtained results and the
Weber et al. results. Figure 4.11 shows the “F” values obtained with the
non-semantic criteria (i.e. experiments n. 1 and 3) compared with the “F” of the
Weber et al. research.
[Bar chart: F = 0.34 (Exp. 1); 0.37 (Weber et al.); 0.42 (Exp. 3)]
Figure 4.11. F Measure Comparison considering non semantic criteria
The figure shows an improvement of F in passing from the situation with 25
documents (experiment n. 1) to that with 50 documents (experiment n. 3). The “F”
obtained in experiment n. 3 scores higher than the F obtained in the Weber et al.
research, and this can be considered a quite satisfying result.
Figure 4.12 shows the values of “F” obtained with the semantically
enhanced criteria (experiments n. 2 and 4). As shown in the graph, the F
values are better than the Weber result.
[Bar chart: F = 0.39 (Exp. 2); 0.37 (Weber et al.); 0.45 (Exp. 4)]
Figure 4.12. F Measure comparison considering semantic adjustment
A global overview of the F-Measure results is given in figure 4.13. We note that
only the F-Measure value of the first experiment scores a bit lower than the
Weber et al. result (represented by the second bar in the graph), while the other
three values always present a better result.
[Bar chart, F-Measure values: Exp. 1 = 0.34; Weber et al. = 0.37; Exp. 2 = 0.39; Exp. 3 = 0.42; Exp. 4 = 0.45]
Figure 4.13. F-Measure
Chapter 5. Discussion and Conclusions
This last chapter is organized as follows: in the first part (paragraph 5.1) some
aspects concerning the methodology used for the experimental research will be
briefly discussed. Within this part we will try to answer some potentially critical
questions that could be raised about the work conducted. Then (in paragraph 5.2)
a concrete proposal to improve Ontogen’s activity will be presented and motivated.
The last part of the chapter (par. 5.3) is dedicated to the conclusions of the whole
work.
5.1 About the methodology
In this paragraph we will try to foster a discussion about the methodology used for
the experiments, simulating a sort of question-and-answer dialogue with an
“advocatus diaboli”. The aim of this paragraph is to clarify some aspects of the
methodology used for the research, trying to answer some hypothetical
critical questions that could be raised while reading this work. The questions we
will try to answer are presented in the following list:
1. Do you think that the choice of calculating the precision and recall values
considering even what you have called “semantic based criteria” is justifiable?
2. Do you think that the semantic criteria you chose can be considered
universally accepted?
3. You have evaluated a micro-ontology. Do you think it is possible to
generalize the results?
4. Do you think that other evaluation methods for the comparison of the manual
versus the semi-automatic approach could be used?
5. The semi-automatically generated ontology has been built with 50 documents.
Don’t you think that they are too few?
6. Do you think that the documents given as input to the system all have the
same value regarding the knowledge that can be extracted?
7. How could this research be extended in future work? What could be the
interesting topics to investigate?
8. According to the obtained results, how do you think the Ontogen
software could be improved?
Answers:
1. Precision and recall are classical measures in information retrieval; hence,
redefining and using them in the context of semi-automatic ontology generation
makes the results intuitively understandable to the IR community. Calculating
precision and recall considering even aspects related to the semantics of the
concepts makes sense, because it allows a deeper evaluation of the comparison
between the conceptual entities generated by the two different ontology building
approaches. If we stopped at the first “term matching based” evaluation (i.e. the
results of experiments n. 1 and n. 3) we wouldn’t have a complete picture of the
real differences between the manual and the semi-automatic approach, because,
for example, a semi-automatically generated term semantically close to a term of
the manual ontology, but not formally represented within it, wouldn’t be counted
as relevant (this is the case of the concept “geopolitical problems” which we
encountered before). And that would be a mistake, because all those terms
representing relevant “aspects of meaning” (De Mauro, 1992) of a concept of the
ontology wouldn’t be counted as “good”.
2. Our evaluation has been done on a particular class: energy security. In the
literature there isn’t a clear definition of this concept, because it involves
different aspects and fields. In our case, for example, we considered (in
experiments n. 2 and 4) the subclass “geopolitical problems” as a sub-concept
of energy security because it is semantically related to it. Another
consideration that could be made is the following: concepts such as
“geopolitical problems” are not formally included within the hierarchy of
energy security because the objective pursued with the creation of the
ontology was to build an ontology for the energy domain, not for energy
security. But surely, if the target had been to build an energy security
ontology, then this kind of concept would have been formally inserted within
the hierarchy of the energy security class. So, about the acceptance of the
criteria: they probably wouldn’t be accepted in general but, in our opinion,
this kind of choice (and its evaluation) can be made mainly by the
ontologists and the domain experts concretely involved in the domain
ontology building, because these problems refer to a specific task in a
specific domain.
3. The obtained results can be considered a sort of direction indicator. They are
specific to a given domain and to Ontogen. Of course they must be verified
on a larger scale to reach a higher level of generalization strength. The
fact that a micro-ontology has been evaluated doesn’t represent a problem
for the generalization of the results, provided the methodology followed for the
evaluation is clear and accepted. To that end we have based the evaluation on
a well-known approach to ontology evaluation, based on two quantitative
measures well known in Information Retrieval: precision and recall.
4. As briefly mentioned in the previous chapter, there are many approaches to
ontology evaluation, and therefore other aspects could be considered
for the evaluation. An interesting option could be the following: to use an
automatic evaluation system for the calculation of precision and recall based
on the “term matching criteria” (the system would only have to perform a “term
matching” between the concept-terms used in the two ontologies and then
calculate the values).
5. The documents considered to build the ontology are the same ones considered by
the humans for the manual task. The pilot study has been done on an
ontology generated with 25 annotated documents (representing our
independent variable) because, as mentioned in different parts of this work,
the main activity on which Ontogen bases the concept creation is
the extraction of keywords from the documents given as input. Keyword
extraction systems usually need only a few training documents to learn a
model and start operating with an acceptable margin of error (e.g. Witten et
al. 1999 demonstrated that the keyword extraction system based on the KEA
algorithm, see Frank et al. 1999, needs only 20-25 documents to achieve
very good results). Of course we also ran other experiments with more
documents (experiments n. 3 and 4) to evaluate the degree of variation of the
precision and recall measures.
6. The sources given as input to the system are important for the
knowledge extraction operations. They could be selected and provided
by a domain expert. A future experiment could regard the
creation of a semi-automatic ontology exclusively from documents selected
by a domain expert, in order to see whether there are meaningful differences
in the knowledge extracted, and then represented in the ontology, in
comparison with a semi-automatic ontology built without considering
documents provided by domain experts.
7. An interesting aspect to investigate is up to what number of documents there
is an improvement of the precision and recall measures (i.e. when the
differences stabilize).
8. Ontogen’s activity could be improved by considering different approaches to
keyword extraction. This issue will be explained in the next paragraph.
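The automatic term-matching evaluator proposed in answer 4 could be sketched as follows; the normalization step, the synonym table and all concept names are assumptions made for the example:

```python
# Hypothetical sketch of an automatic term-matching evaluator: concept
# names from the two ontologies are normalized, optionally mapped
# through a hand-made synonym table, and then matched; precision and
# recall are computed from the matches.

def normalize(term, synonyms):
    term = term.strip().lower()
    return synonyms.get(term, term)   # map a term to its canonical synonym

def evaluate(generated, gold, synonyms=None):
    synonyms = synonyms or {}
    gen = {normalize(t, synonyms) for t in generated}
    ref = {normalize(t, synonyms) for t in gold}
    matched = gen & ref
    return len(matched) / len(gen), len(matched) / len(ref)

synonyms = {"petroleum": "oil"}           # illustrative synonym entry
generated = ["Petroleum", "pipeline", "market"]
gold = ["oil", "pipeline", "geopolitical problems", "supply"]
precision, recall = evaluate(generated, gold, synonyms)
print(precision, recall)  # 2 matches: precision 2/3, recall 2/4
```

Such a tool would automate only the strict “term matching” case (with an optional synonym table); semantic adjustments like those of experiments n. 2 and 4 would still require a human judgment.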
5.2 Proposal: a linguistically motivated keyphrases extraction
system for Ontogen
The keyword extraction process represents one of the major features on which the
Ontogen software for semi-automatic ontology building is based. This module of
the system, in fact, extracts from the documents given as input a set of
keywords that represents a list of “candidate concepts” for the ontology
under construction (it is then the user who chooses whether these candidate
keywords are representative and “good” enough to become concepts of the
ontology). Moreover, the extracted keywords also represent the “default
value” assigned to the names of the new concepts of the semi-automatically
generated ontology. And that, as pointed out in the previous pages, is one of the
cons of Ontogen’s activity when compared with manual concept name assignment.
This kind of misalignment could be reduced by making some changes to the Ontogen
keyword extraction module: that would allow a better keyword extraction
process and, therefore, better results in the concept name assignment and,
consequently, in the alignment of concept names with the manual concepts.
One of the major cons of Ontogen’s keyword extraction module is that it mainly
retrieves uni-grams, i.e. terms composed of only one word (the system allows
setting the length of the terms to extract, but it mainly extracts uni-grams), which
aren’t representative of the concept terms used by humans to classify a domain.
Our hypothesis is that the attention should be focused on keyphrase extraction
rather than on uni-gram keywords. The manually extracted concepts, in fact, are
mainly represented by keyphrases rather than by single words. Keyphrases are
“linguistic units usually larger than a word but smaller than a full sentence”
(Caropreso et al., 2002). According to this view, an interesting proposal for the
improvement of Ontogen’s results could be the integration, within the system’s
package, of a linguistically motivated keyphrase extraction system evaluated in
international competitions, such as LAKE (D’Avanzo et al. 2005).
LAKE’s methodology (the acronym stands for Linguistic Analysis based Knowledge
Extractor) is based on Keyphrase Extraction (KE) and makes use of a learning
algorithm to select linguistically motivated keyphrases from a list of candidates,
which are then merged to form a document summary. The underlying hypothesis of
this system is that linguistic information is beneficial for the task, an
assumption completely new with respect to the KE state of the art. KE is performed
as a supervised machine learning task: a classifier is trained on documents
annotated with known keyphrases. The trained classifier is subsequently applied to
documents for which no keyphrases are assigned: each candidate term from these
documents is classified either as a keyphrase or as a non-keyphrase. Both the
training and the extraction processes choose a set of candidate keyphrases (i.e.
potential terms) from their input document and then calculate the values of the
features for each candidate.
The LAKE system works as follows: first, a set of linguistically motivated candidate
phrases is identified. Then, a learning device chooses the best phrases. Finally,
the keyphrases at the top of the ranking are merged to form a summary. The candidate
phrases generated by LAKE are sequences of Part-of-Speech tags containing Multiword
expressions and Named Entities (e.g. proper names, locations and dates). These
elements are defined as “patterns” and are stored in a patterns database; from there,
the main work is done by a learning device. The linguistic database makes LAKE
unique in its category. The system consists of three main components (see
LAKE’s architecture in figure 5.1): a Linguistic Pre-Processor, a Candidate Phrase
Extractor and a Candidate Phrase Scorer.
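The pattern-based identification of candidate phrases described above can be sketched in a very simplified form. The tag set, the patterns database and the sentence below are illustrative assumptions for the example, not LAKE’s actual data:

```python
# Simplified sketch of pattern-based candidate phrase extraction:
# (token, POS) pairs are matched against a small "patterns database"
# of POS sequences.  All data here is illustrative.

# (token, POS) pairs as a linguistic pre-processor might produce them
tagged = [("the", "DET"), ("energy", "NOUN"), ("security", "NOUN"),
          ("of", "ADP"), ("transit", "NOUN"), ("states", "NOUN")]

# a tiny patterns database: POS sequences that form candidate phrases
patterns = [("NOUN", "NOUN"), ("ADJ", "NOUN")]

def candidate_phrases(tagged, patterns):
    found = []
    for size in {len(p) for p in patterns}:
        for i in range(len(tagged) - size + 1):
            window = tagged[i:i + size]
            # keep the window if its POS sequence matches a pattern
            if tuple(pos for _, pos in window) in patterns:
                found.append(" ".join(tok for tok, _ in window))
    return found

print(candidate_phrases(tagged, patterns))
# -> ['energy security', 'transit states']
```

In LAKE the candidates selected this way are then scored by the learning device; here the sketch stops at the identification step.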
Figure 5.1. LAKE Architecture.
The system accepts a document as input. The document is first processed by the
Linguistic Pre-Processor, which tags the whole document, identifying Named
Entities and Multiwords as well. Then candidate phrases are identified based on the
pattern database. Up to this point the process is the same for the training and
extraction stages. In the training stage, however, the system is furnished with
annotated documents, i.e. documents for which keyphrases are supplied by the
author. The Candidate Phrase Scorer module is equipped with a procedure that
looks, for each author-supplied keyphrase, for a candidate phrase that can be
matched, thus identifying positive and negative examples. The model that comes out
of this step is then used in the extraction stage. The last module, once the
keyphrases are extracted, uses a scoring mechanism to select the most
representative ones.
The LAKE system has been evaluated with encouraging results in many international
competitions, such as DUC (Document Understanding Conferences) sponsored
by DARPA, and in different fields (e.g. knowledge management, see D’Avanzo et
al. 2007) and tasks: Single Document Summarization (D’Avanzo et al. 2004), Text
Categorization (2005), Summaries for Small Screen Devices (D’Avanzo and Kuflik,
2005), Multi-Document Summarization (D’Avanzo et al. 2007a).
Especially its application in the multi-document summarization task could be very
useful for the realization of relevant short summaries of documents through the
extraction of representative keyphrases.
A future work that could be done to verify the soundness of the proposal consists in
evaluating the LAKE system against the keyword extraction module of Ontogen.
All the data necessary for the comparison experiment are available: there are the
manual “gold standard” keyphrases extracted to name the concepts of the ontology,
and the Ontogen-extracted keywords for each document or cluster of documents.
What should be done is to evaluate the LAKE results on the same documents given
as input to Ontogen and then to make a double comparison of the LAKE and
Ontogen module results against the manually extracted keyphrases inserted in the
ontology.
5.3 Conclusions
This work started with a set of questions regarding the relation between the
manual and the semi-automatic approach to ontology building. Some of them
are reported here: “…Are these approaches (manual vs semi-automatic) alternative
or can they be integrated?... Is it realistic, in the near future, to think of a complete
automatization or, at least, of a semi-automatization of the whole process of
ontology building?”.
In order to answer these questions, this work presented a concrete comparison
between the manual and the semi-automatic building of a domain specific
ontology. The comparison has been made through an evaluation of the results
obtained by the two different methods.
Precision and Recall, adapted to the specific task and calculated under two
different experimental conditions, have been used to measure the degree of
overlap between the manually and the semi-automatically generated ontologies.
The results showed that it is possible to semi-automatically generate a domain
ontology from documents in a specific domain, that the ontology improves as
more documents are used, and that it may improve further by introducing some
semantics into the process. However, there is still a great difference between the
semi-automatic ontology and a manual one, a result that is similar to what is already
known. From the obtained results some indications can be extracted. The first
concerns the gap between the human and the semi-automatic system results
(Ontogen in this specific case): in the ontology building task this difference still
appears high and difficult to bridge. But the results appear quite encouraging if
read in perspective and compared with other similar research evaluations for the
same task (e.g. the work of Weber et al., 2005). However, the use of semantics
seems to be the right way to improve the process, so finally a motivated proposal,
which needs to be tested, to improve the keyword extraction module of the Ontogen
system has been made.
Nowadays, an interesting use of semi-automatic systems like OntoGen lies in
their integration with the manual approach. One of the main
capabilities of OntoGen is its ability to discover potentially relevant concepts that
humans have not considered during their categorization, so it can be useful to
support manual ontology building. In our opinion, however, this aid can be
provided only in a second phase, after the realization of a first skeleton model of
the domain world to be represented within the ontology. In this way such systems
can be very useful for enriching and populating existing ontologies or initial
ontology skeletons.
Of course, one of the major challenges for Semantic Web research in the years
to come is the improvement of the automatic and semi-automatic
systems used for ontology building, in order to allow an easy construction and
proliferation of ontologies. And, in order to achieve a real, continuous improvement
of those systems, and a gradual alignment with the manual “gold standard”
results, the evaluation phase plays a role of strategic importance. In fact, only by evaluating
the results of these systems will it be possible to discover where and how
to modify them so that they work better. An important aspect
of this issue is the lack of a globally standardized approach to
ontology evaluation. The field is fragmented into different
approaches, and this situation makes it difficult to compare the results of
studies carried out with different methods. What should be
done, then, is to create a comprehensive evaluation framework accepted and used as a
standard. And, of course, the way to achieve this
objective is always the same: research.
References
Agrawal R., Imielinski T., Swami A. “Mining association rules between sets of
items in large databases”. In Proceedings of the ACM SIGMOD Conference on
Management of Data, pp. 207-216, 1993.
Alfonseca E., Manandhar S. “Extending a Lexical Ontology by a Combination of
Distributional Semantic Signatures”. EKAW02, Sigüenza, Spain, published in
Lecture Notes in Artificial Intelligence 2473, 2002.
Aussenac-Gilles N., Biebow B., Szulman S. “D’une méthode à un guide pratique
de modélisation de connaissances à partir de textes”. Terminologie et IA, TIA
2003. Ed. Rousselot. Strasbourg (pp. 41-53).
Aussenac-Gilles N., Biebow B., Szulman S. “Corpus Analysis for Conceptual
Modelling”. Workshop on Ontologies and Text, Knowledge Engineering and
Knowledge Management: Methods, Models and Tools, 12th International
Conference EKAW00, Juan-les-Pins, France. Springer Verlag. 2000.
Bachimont B., Isaac A., Troncy R. “Semantic commitment for designing
ontologies: a proposal”. In A. Gomez-Perez and V.R. Benjamins (eds.): EKAW
2002, LNAI 2473, pp 114-121, Springer Verlag, 2002.
Berners-Lee T., Hendler J., Lassila O. (2001). “The Semantic Web”. Scientific
American 284(5): 34-43.
Berners-Lee T. “Weaving the Web: the original design and the ultimate destiny of
the World Wide Web by its inventor”. HarperCollins Publishers. New York. 1999.
Berners-Lee T. “L'architettura del nuovo Web”, Feltrinelli, 2001.
Biebow B., Szulman S. (1999). “TERMINAE: a linguistic-based tool for the
building of a domain ontology”. In EKAW ’99. Proceedings of the 11th European
Workshop on Knowledge Acquisition, Modelling and Management. Dagstuhl,
Germany, LNCS, pp. 49-66. Berlin. Springer-Verlag.
Bisson G., Nedellec C., Cañamero D. “Designing Clustering Methods for
Ontology Building. The Mo’K Workbench”, 2000.
Brank J., Grobelnik M., Mladenic D. “A Survey of Ontology Evaluation
Techniques”. In: SIKDD 2005 at multiconference IS 2005, Ljubljana, Slovenia,
October 2005.
Brewster C., Alani H., Dasmahapatra S., Wilks Y. “Data Driven Ontology
Evaluation”. Proceedings of LREC, 2004.
Buitelaar P., Cimiano P., Grobelnik M., Sintek M. “Ontology Learning from
Text”. Tutorial at ECML/PKDD, Porto, October 2005.
Buitelaar P., Olejnik D., Sintek M. “A Protégé Plug-in for Ontology Extraction
from Text based on Linguistic Analysis”. In Proceedings of the 1st European
Semantic Web Symposium (ESWS), Heraklion, Greece, 2004.
Caropreso M.F., Matwin S., Sebastiani F. (2001). “A learner-independent
evaluation of the usefulness of statistical phrases for automated text categorization”.
In Amita G. Chin (Ed.), Text Databases and Document Management: Theory and
Practice (pp. 78-102). Hershey (US): Idea Group Publishing.
de Chalendar G., Grau B. “A system to classify words in context – SVETLAN”. In S.
Staab, A. Maedche, C. Nedellec, P. Wiemer-Hastings (eds). Proceedings of the
Workshop on Ontology Learning, 14th European Conference on Artificial
Intelligence, ECAI’00, Berlin, Germany, August 20-25, 2000.
Cimiano P., Frank A., Racioppa S., Buitelaar P. “SOBA: SmartWeb Ontology-
based Annotation”, Proceedings of the Demo Session at the International Semantic
Web Conference, November 2006.
Craven M., Di Pasquo D., Freitag D., McCallum A., Mitchell T., Nigam K.,
Slattery S., “Learning to Construct Knowledge Bases from the World Wide Web”,
Artificial Intelligence, 118(1-2), pp 69-114, 2000.
D'Avanzo E., Elia A., Kuflik T., Vietri S. LAKE system at DUC 2007. In
Proceedings of Human Language Technology Conference/North American chapter
of the Association for Computational Linguistics Annual Meeting (HLT-NAACL
2007). Rochester, NY, April 22-27, 2007.
D'Avanzo E., Kuflik T., Elia A., Lieto A., Preziosi R. Where Does Text Mining
Meet Knowledge Management? A Case Study. In itAIS07, Springer, 2007.
D'Avanzo E., Kuflik T. Linguistic Summaries on Small Screens. In: Data Mining
VI. Zanasi A., Brebbia C., Ebcken N. (eds.), WIT Press, pp. 195-204, 2005.
D'Avanzo E., Magnini B. A Keyphrase-Based Approach to Summarization: the
LAKE System at DUC-2005. DUC Workshop. Proceedings of Human Language
Technology Conference / Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP 2005). Vancouver, B.C., Canada, October 6-8, 2005.
D'Avanzo E., Magnini B., Vallin A.. Keyphrase Extraction for Summarization
Purposes: The LAKE System at DUC-2004. DUC Workshop. Proceedings of
Human Language Technology conference / North American chapter of the
Association for Computational Linguistics annual meeting (HLT/NAACL 2004).
Boston, May 2-7, 2004
De Mauro T. Minisemantica. Laterza. Roma. 1992.
Deitel A., Faron C., Dieng R. “Learning ontologies from RDF annotations”. In
Proceedings of the IJCAI Workshop on Ontology Learning, Seattle, 2001.
Engels R. (2001). Corporum OntoExtract. Ontology Extraction Tool. Deliverable 6.
Ontoknowledge. http://www.ontoknowledge.org/del.shtml
Engels R. (2002). Corporum OntoWrapper. Extraction of structured information
from web-based resources. Deliverable 7. Ontoknowledge.
http://www.ontoknowledge.org/del.shtml
Faure D., Poibeau T. “First experiment using semantic knowledge learned by
Asium for information extraction using INTEX”. In: S. Staab, A. Maedche, C.
Nedellec, P. Wiemer-Hastings (eds). Proceedings of the Workshop on Ontology
Learning, 14th European Conference on Artificial Intelligence, ECAI00, Berlin,
2000.
Fortuna B., Mladenic D., Grobelnik M. “Visualization of text document
corpus”. Informatica 29 (2006), 497-502.
Fortuna B., Grobelnik M., Mladenic D. “Background Knowledge for Ontology
Construction”. Poster at WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
Fortuna B., Grobelnik M., Mladenic D. “OntoGen: Semi-automatic Ontology
Editor”. HCI International 2007, July 2007, Beijing.
Fortuna B., Grobelnik M., Mladenic D. “Semi-automatic Construction of Topic
Ontology”. Semantics, Web and Mining, Joint International Workshop, EWMF
2005 and KDO 2005, Porto, Portugal, October 3-7, 2005, Revised Selected Papers.
Fortuna B., Grobelnik M., Mladenic D. “Semi-automatic Data-driven Ontology
Construction System”. Proceedings of the 9th International multi-conference
Information Society IS-2006, Ljubljana, Slovenia.
Fortuna B., Grobelnik M., Mladenic D. “System for Semi-automatic Ontology
Construction”. Demo at ESWC 2006, June 11-14, 2006, Budva, Montenegro.
Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning C.G.
"Domain-specific keyphrase extraction" Proceedings of Sixteenth International
Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San
Francisco, CA, pp. 668-673, 1999.
Frege G. “Begriffsschrift, eine der arithmetischen nachgebildete Formelsprache
des reinen Denkens”, Halle a. S.: Louis Nebert. Translated as Concept Script, a
formal language of pure thought modelled upon that of arithmetic, by S. Bauer-
Mengelberg in J. van Heijenoort (ed.), From Frege to Gödel: A Source Book in
Mathematical Logic, 1879-1931, Cambridge, MA: Harvard University Press, 1967.
Gangemi A., Catenacci C., Ciaramita M., Lehmann J. “A Theoretical
Framework for Ontology Evaluation and Validation”. In Proceedings of SWAP.
2005.
Gómez-Pérez, A. and D. Manzano-Macho. “A survey of ontology learning
methods and techniques”. OntoWeb Project. 2003.
Gruber T. “A Translation Approach to Portable Ontology Specifications”.
Knowledge Acquisition, 5(2), 1993, pp. 199-220.
Guarino N., Welty C. “Evaluating Ontological Decisions with OntoClean”.
Communications of the ACM, 45(2): 61-66, February 2002.
Hahn U., Schulz S. “Towards Very Large Terminological Knowledge Bases: A
Case Study from Medicine”. Canadian Conference on AI, 2000, pp. 176-186.
Harris Z. S. “Papers in Structural and Transformational Linguistics”, Reidel,
Dordrecht, 1970.
Hearst, M. “Automatic acquisition of hyponyms from large text corpora”. In
Proceedings of the 14th International Conference on Computational Linguistics.
Nantes, France, 1992
Jannink J., Wiederhold G. “Ontology maintenance with an algebraic
methodology: A case study”. In Proceedings of AAAI Workshop on ontology
management, July 1999.
Khan L., Luo F. “Ontology Construction for Information Systems”. In
Proceedings of the 14th IEEE International Conference on Tools with Artificial
Intelligence, pp. 122-127, Washington DC, November 2002.
Kashyap V. “Design and Creation of Ontologies for Environmental Information
Retrieval”. 20th Workshop on Knowledge Acquisition, Modeling and Management.
Alberta. Canada. October 1999.
Kietz J., Maedche A., Volz R. “A Method for Semi-Automatic Ontology
Acquisition from a Corporate Intranet”. In: Aussenac-Gilles N., Biebow B.,
Szulman S. (eds), EKAW00 Workshop on Ontologies and Text, Juan-les-Pins, France.
CEUR Workshop Proceedings 51: 4.1-4.14, Amsterdam, 2000.
Lozano-Tello A., Gomez-Perez A. “Ontometric: A Method to Choose the
Appropriate Ontology”, Journal of Database Management, 15(2): 1-18, 2004.
Maedche A., Staab S. “Ontology Learning”. IEEE Intelligent Systems, Special
Issue on the Semantic Web, 16(2), 2001.
Maedche A., Staab S. (2001). “Ontology Learning for the Semantic Web”. IEEE
Intelligent Systems, Special Issue on the Semantic Web.
Maedche A., Volz R. “The Text-to-Onto Ontology Extraction and Maintenance
Environment”. Proceedings of the ICDM Workshop on integrated data mining and
knowledge management, San Jose, California, USA.
Maedche A., Staab S. “Measuring Similarity between Ontologies”. Proceedings
of EKAW02, LNAI vol. 2473, 2002.
Modica G., Gal A., Jamil H.M. “The Use of Machine Generated Ontologies in
Dynamic Information Seeking”. In Proceedings of the Sixth International Conference
on Cooperative Information Systems, Springer-Verlag, 2001.
Navigli R., Velardi P. “Learning Domain Ontologies from Document Warehouses
and Dedicated Web Sites”. Computational Linguistics, 30(2), MIT Press, 2004.
Nobecourt J. “A method to build formal ontologies from text”. In EKAW00,
Workshop on ontologies and text, Juan-Les-Pins, France. 2000.
Oliveira A., Pereira F.C., Cardoso A. “Automatic Reading and Learning from
Text”. In Proceedings of the International Symposium on Artificial Intelligence,
ISAI01, December 2001.
Oliveira A., Pereira F.C., Cardoso A. “Extracting Concept Maps with Clouds”.
Argentine Symposium on Artificial Intelligence, ASAI00, Buenos Aires, Argentina,
2000.
Papatheodorou C., Vassiliou A., Simon B. “Discovery of Ontologies for Learning
Resources Using Word-based Clustering”. Copyright by AACE. Reprinted from
ED-MEDIA with the permission of AACE. Denver, USA, 2002.
Pereira F.C., “Modelling Divergent Production: a Multi Domain Approach”.
European Conference on Artificial Intelligence, ECAI 98, Brighton, UK, 1998.
Poesio M. “Domain Modelling and NLP: Formal ontologies? Lexica? Or a bit of
both?” Journal of Applied Ontology. pp.27-33. IOS press. 2005.
Porzel R., Malaka R. “A Task-based Approach for Ontology Evaluation”.
Proceedings of ECAI, 2004.
Reinberger M.L., Spyns P. “Discovering Knowledge in Text for the Learning of
DOGMA-inspired Ontologies”. Proceedings of OLP04 (Workshop on Ontology
Learning and Population at ECAI), 2004.
Rigau G., Rodriguez H. and Agirre E. “Building accurate semantic taxonomies
from Monolingual MRDs”. Proceedings of the 17th International Conference on
Computational Linguistics, Montreal, Canada, 1998.
Rosch E. “Cognitive Representation of Semantic Categories”. Journal of
Experimental Psychology 104: 573-605, 1975.
Rubin D., Hewett M., Oliver D.E., Klein T.E., Altman R.B. “Automatic data
acquisition into ontologies from pharmacogenetics relational data sources”.
Proceedings of the Pacific Symposium on Biocomputing, Lihue, HI, 2002.
Schutz, A., Buitelaar, P. “RelExt: A Tool for Relation Extraction from Text in
Ontology Extension”. 4th ISWC, Galway, 2005.
Snoussi H., Magnin L. and Nie J.Y., “Towards an Ontology-based Web Data
Extraction”, Centre de recherche informatique de Montréal (Canada), Université
de Montréal (Canada), 2002.
Soergel D., Aussenac-Gilles N. “Text analysis for ontology and terminology
engineering”. Journal of Applied Ontology. pp. 35-46. IOS press. 2005.
Stojanovic L., Stojanovic N., Volz R. “Migrating Data-Intensive Web Sites into
the Semantic Web”. Proceedings of the 17th ACM Symposium on Applied Computing
(SAC), ACM Press, pp. 1100-1107, 2002.
Suryanto H., Compton P. “Discovery of Ontologies from Knowledge Bases”. In
Proceedings of the First International Conference on Knowledge Capture, Victoria,
BC, Canada, October 2001.
Weber N., Buitelaar P. “Web-based Ontology Learning with ISOLDE”. In
Proceedings of the Workshop on Web Content Mining with Human Language at the
International Semantic Web Conference, Athens, GA, USA, Nov. 2006.
Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G.
"KEA: Practical automatic keyphrase extraction." Proceedings of DL '99, pp. 254-
256, 1999.
Wittgenstein L., “Philosophical Investigations”. Edited by G. E. M. Anscombe,
and R. Rhees, and translated by G.E.M. Anscombe. Oxford: Blackwell, 1953.
Yamaguchi T. “Constructing Domain Ontologies based on Concept Drift Analysis”.
In Proceedings of the IJCAI’99 Workshop on Ontologies and Problem Solving
Methods: Lessons Learned and Future Trends. Stockholm, Sweden, 1999.