An incremental approach for discovering medical knowledge from texts
Rafael Valencia-García, Juana María Ruiz-Sánchez, Pedro José Vivancos-Vicente, Jesualdo Tomás Fernández-Breis*, Rodrigo Martínez-Béjar
Departamento de Ingeniería de la Información y las Comunicaciones, Facultad de Informática, Universidad de Murcia, Murcia 30071, Spain
Abstract
Vast amounts of medical knowledge reside within text documents, so the automatic extraction of such knowledge would certainly be
beneficial for clinical activities. This work presents a user-centred approach for the incremental extraction of knowledge from text, based
on both knowledge technologies and natural language processing techniques. In this approach, ontologies are used to provide a formal,
structured, reusable and shared knowledge representation. The system has been used to extract clinical knowledge from texts concerning
Oncology.
© 2003 Elsevier Ltd. All rights reserved.
Keywords: Knowledge acquisition; Knowledge representation; Ontologies
1. Introduction
Extracting knowledge directly from free text is a
challenging objective. Achieving it would allow knowledge
to be extracted easily and without the intervention of
knowledge engineers. In this work, we
present an approach for building ontologies from texts in a
supervised mode. This work extends the work presented in
Ruiz-Sanchez, Valencia-Garcia, Fernandez-Breis, Martinez-Bejar,
and Compton (2003), where the authors described a software
tool for building ontologies from natural language texts. The
principal idea sustaining our approach is that, in natural
language, relationships between knowledge entities are
usually associated with verbs.
In this work, we also introduce a system that tries to
discover terms representing concepts by means of a
database that relates these concepts with linguistic
expressions. This system is based on an architecture formed
by three modules: morphological analysis, concept search,
and inference. Accordingly, our system stores verbs that
represent a relationship between concepts so that this
knowledge can be identified automatically whenever it
reappears. The system makes use of two knowledge
technologies, as well as the grammar category of the words
in the current sentence, to infer knowledge entities
(i.e. concepts, attributes and values) from a text fragment.
The system has been applied to an Oncology domain and the
results of this experiment are discussed in this paper.
The knowledge technologies are MCRDR and ontologies.
RDR (Compton, Horn, Quinlan, & Lazarus, 1989) is a
technology that provides case-based reasoning, which consists
in using previous situations to solve current problems. Most
CBR methodologies present maintenance problems; RDR, however,
overcomes this problem and allows experts to build
and maintain knowledge bases without support. MCRDR
(Kang, 1996) is an extension to RDR that allows for working
with multiple conclusions. On the other hand, ontologies are
commonly defined as specifications of domain knowledge
conceptualisations (Van Heijst, Schreiber, & Wielinga,
1997). Due to the very nature of ontology, there is not a
unique (valid) manner to define ontologies (Musen, 1998).
Moreover, several definitions have historically been given to
the term ontology, although it is commonly considered to be
an enumeration of the relevant concepts in an application
area, as well as a definition of classes of concepts and
relationships among these classes (Fernandez-Breis,
Castellanos-Nieves, Valencia-Garcia, Vivancos-Vicente,
Martinez-Bejar, & De las Heras-Gonzalez, 2001). According to
some authors, an advantage of ontologies is the possibility of
making a mathematical study on their properties, among
0957-4174/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2003.09.001
Expert Systems with Applications 26 (2004) 291–299
www.elsevier.com/locate/eswa
* Corresponding author. Tel.: +34-968367345; fax: +34-968364151.
E-mail addresses: [email protected] (J.T. Fernandez-Breis);
[email protected] (R. Valencia-Garcia); [email protected] (J.M. Ruiz-Sanchez);
[email protected] (P.J. Vivancos-Vicente); [email protected]
(R. Martinez-Bejar).
which, shareability and reusability should be pointed out
(Gomez-Perez & Benjamins, 1999).
The structure of the paper is as follows. Section 2
describes the knowledge technologies used in this approach.
Section 3 presents a detailed description of the knowledge
acquisition methodology developed. Section 4 describes the
new version of KAText (v.2). The validation experiment is
discussed in Section 5. Finally, Section 6 contains the
conclusions of this work.
2. Knowledge technologies applied
In this work, an approach based on ontologies and
Ripple-Down Rules has been used. On the one hand,
ontologies have been used to model the medical knowledge
extracted from the texts. In particular, the ontology model
used in this work is an extension of that presented in
(Ruiz-Sanchez et al., 2003). There, ontologies are represented
through sets of concepts having the following
properties. First, concepts are defined through a set of
attributes and a set of interconceptual relations. These
relations are of the following types: taxonomy, mereology,
equivalency, dependency, topology, causality, functionality,
similarity, conditionality, purpose, and chronology.
These are the most common relations in problems
(Gomez-Perez, Moreno, Pazos, & Sierra-Alonso, 2000),
although more relations between knowledge entities are
possible. The extension made to this ontological model with
respect to previous works is to allow users to define their own
types of relations. This feature allows our ontologies to
cover more domain knowledge. However, the user-defined
relations lack the formal properties that the pre-defined ones
have. Moreover, structural axioms are also derived from the
very structure of the ontologies.
On the other hand, Ripple-Down Rules (Compton et al.,
1989) have been the technology selected for supporting the
reasoning processes of this approach. This methodology has
been widely used for medical purposes (see, for instance
Buchanan, 1986; Horn, Compton, Lazarus, & Quinlan,
1985). An RDR system is based on a binary tree with rules at
each node. Each child rule replaces the conclusion given by
its parent rule. When an input case satisfies a rule, it is then
evaluated against that rule's child rules. The last fired rule
provides the conclusion given by the RDR system, and the input
case for which that conclusion is given is called the
cornerstone case. When the system provides an incorrect
answer, a new rule is added to the RDR tree to cover this
situation. If a parent rule is satisfied but its child rule is
not, the new rule is added to the false branch of the child
rule in order to preserve the rule evaluation sequence in
subsequent executions. In our case, the antecedents of the
rules will be linguistic expressions and the conclusions will
be the ontological entities (i.e. concepts, attributes, values,
or relations) extracted from the linguistic expression. This issue
will be further discussed in Sections 3 and 4. However, we
do not use RDR but an extension called MCRDR (Kang,
1996). In this case, the tree is n-ary so that each node may
have multiple successors. In particular, all children of a
satisfied parent rule are evaluated, allowing for multiple
conclusions.
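The n-ary evaluation just described can be sketched as follows. This is a minimal illustration under our own naming, not the paper's implementation: conditions are plain predicates, and the false-branch bookkeeping of a full MCRDR system is omitted.

```python
# Minimal sketch of MCRDR inference: an n-ary rule tree in which every
# satisfied rule passes the case to all of its children, and the deepest
# satisfied rule on each path contributes a conclusion. Rule contents
# are illustrative only.

class Rule:
    def __init__(self, condition, conclusion):
        self.condition = condition    # predicate over a case (a dict here)
        self.conclusion = conclusion  # given only if no child refines it
        self.children = []

    def add_refinement(self, rule):
        self.children.append(rule)

def infer(rules, case):
    """Return the conclusions of the deepest satisfied rules."""
    conclusions = []
    for rule in rules:
        if rule.condition(case):
            refined = infer(rule.children, case)
            # if no child rule fires, this rule's own conclusion stands
            conclusions.extend(refined if refined else [rule.conclusion])
    return conclusions

# usage: two top-level rules, the first refined by a child rule
r1 = Rule(lambda c: c["verb"] == "is a", "IS-A relation")
r1.add_refinement(Rule(lambda c: c["left"] == "Noun Noun",
                       "IS-A, compound concept"))
r2 = Rule(lambda c: c["verb"] == "causes", "CAUSALITY relation")

print(infer([r1, r2], {"verb": "is a", "left": "Noun Noun"}))
# ['IS-A, compound concept']
```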
Ontologies and RDR have previously been brought
together to develop KBSs (Martinez-Bejar, Ibanez-Cruz,
Compton, & Mihn Cao, 2001). However, in this work the
combination of both technologies has a different purpose.
In our approach, RDR will be used to learn and infer
knowledge from medical texts, and the extracted and
suggested knowledge will be expressed in an ontological
manner, that is, through the ontological entities defined in
the previously described ontology model.
3. A framework for acquiring knowledge
3.1. Assumptions
The aim of this work is to present a technique for
extracting knowledge from natural language texts. More
precisely, this work has focused on the implementation of a
system which builds an ontology from a text. So, an implicit
assumption (Assumption 1) is that ontologies can be used to
represent knowledge. As the whole text can be very long, it
seemed convenient to divide it into smaller fragments in order
to facilitate its processing. Furthermore, the approach
presented here is based on the idea that, in natural language,
relationships are usually associated with verbs. Also, as is
normally done, we decided to divide texts into sentences.
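The text-division step assumed here can be sketched naively as follows. This is an illustration only: real clinical text needs abbreviation and numbering handling, and how the paper's system delimits fragments is not specified, so paragraphs are used as a stand-in.

```python
import re

# Naive sketch of dividing a document into fragments (paragraphs here,
# as an assumption) and each fragment into sentences.

def split_fragments(text):
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def split_sentences(fragment):
    # split after '.', '!' or '?' when followed by whitespace and a capital
    return [s.strip()
            for s in re.split(r'(?<=[.!?])\s+(?=[A-Z])', fragment)
            if s.strip()]

doc = ("Breast cancer is a disease. Early detection is key.\n\n"
       "Ultrasonography is a method.")
frags = split_fragments(doc)
print(len(frags))                 # 2
print(split_sentences(frags[0]))
# ['Breast cancer is a disease.', 'Early detection is key.']
```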
In this approach, the expert is in charge of building the
ontology. This gives rise to another assumption (Assumption 2):
experts can extract semantic relationships from text.
This expert must have expertise in the specific task
described throughout the text, so the expert is connected to
both the system and the text by the task itself.
In a way, we can say that knowledge resides inside the text.
So, there is another implicit assumption (Assumption 3) in
this sentence, namely, that text can contain knowledge.
Ontologies make it possible to divide knowledge into
categories such as concepts, attributes, relationships, rules,
axioms, etc. These knowledge entities can appear explicitly in
the text, although sometimes knowledge is only referred to
implicitly. The system attempts to find only explicit
knowledge in the text.
Taking the above assumptions into account, the
starting point is a system composed of three components,
namely, an empty concept knowledge base, an empty verb
knowledge base and an empty MCRDR sub-system. At this
stage, the system is hence unable to find any knowledge in
the input text, and the expert has to introduce knowledge
manually. The expert's task is then to identify
the relationship associated with the main verb (if any) of the
current sentence and all the other knowledge entities of
the current fragment of text. He/she will also tell the system
the expressions in which they appear. The expression-knowledge
associations for the concepts and relations are
stored thereafter in the system (the expressions associated
with concepts in the conceptual knowledge base, and the
expressions associated with relationships in the verbal
knowledge base), to be used for finding new knowledge
thereafter. An MCRDR sub-system is then created
and maintained by the system to be used for
acquiring the knowledge entities which participate in the
relationship in question.
The MCRDR sub-system rules are based on the
relationship represented by the main verb in the current
sentence, the grammar category of the knowledge
entities which appear in that sentence, and the position
of these knowledge entities with respect to the verb in the
sentence.
The system shows the relationship and the knowledge
entities in the current sentence, and the expert will just
have to confirm the results output by the system. If the
real conclusion is different from the conclusion inferred
by the system, then the expert has to introduce the
correct conclusion. After that, the system will add the
new case to the MCRDR sub-system and will update
the rules.
3.2. The knowledge acquisition process
The knowledge acquisition process is divided into three
sequential phases, which have been implemented in three
separate modules: morphological analysis, concept search
and inference (Fig. 1).
3.2.1. Morphological analysis module
The main objective of the first phase in our approach is to
obtain the grammar category of each word in the current
sentence, so the system uses a tagger (i.e. a morphological
analyser) for this purpose.
For example, if the system gets the sentence “Cardiac
glycosides have been used in the treatment of cardiac disease
for more than 200 years”, after the morphological analysis
phase the system will obtain all the words in the sentence
labelled with their grammar category in that sentence:
Cardiac [Adjective] glycosides [Noun] have [Verb
auxiliary] been [Verb auxiliary] used [Verb lexical] in
[Preposition] the [Determiner] treatment [Noun]
of [Preposition] cardiac [Adjective] disease [Noun]
for [Preposition] more [Adverb] than [Conjunction]
200 [Numeral] years [Noun]
As our approach was conceived to be language-independent,
we decided to develop a language-independent tagger (at
least for Western-origin languages) that uses the
C4.5 learning algorithm (Quinlan, 1993) to infer the
grammar categories of the words in a sentence (Ruiz-Sanchez
et al., 2003).
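The paper's tagger learns its decisions with C4.5; as a stand-in, the tagging step can be sketched with a small hand-made lexicon and a crude fallback. The lexicon entries, tag names, and fallback policy below are our assumptions, not the authors' tagger.

```python
# Illustrative stand-in for the morphological analysis phase: tag each
# word from a tiny lexicon, guessing Numeral for digits and Noun
# otherwise. The real system infers categories with C4.5.

LEXICON = {
    "cardiac": "Adjective", "glycosides": "Noun", "have": "Verb auxiliary",
    "been": "Verb auxiliary", "used": "Verb lexical", "in": "Preposition",
    "the": "Determiner", "treatment": "Noun", "of": "Preposition",
    "disease": "Noun", "for": "Preposition", "more": "Adverb",
    "than": "Conjunction", "years": "Noun",
}

def tag(sentence):
    tagged = []
    for word in sentence.split():
        category = LEXICON.get(word.lower())
        if category is None:
            category = "Numeral" if word.isdigit() else "Noun"  # fallback guess
        tagged.append((word, category))
    return tagged

sentence = "Cardiac glycosides have been used in the treatment of cardiac disease"
print(tag(sentence)[:2])  # [('Cardiac', 'Adjective'), ('glycosides', 'Noun')]
```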
3.2.2. The concept search module
The first goal of this subsystem is to find linguistic
expressions that represent concepts. The associations
Fig. 1. The knowledge acquisition process.
between linguistic expressions and concepts are stored in a
database called the conceptual knowledge base.
The search process is quite simple, and its result is a list
containing all the expressions of the fragment already
contained in the concept knowledge base.
In this approach, it is assumed that some words are
semantically meaningless. These words usually have
the following grammar categories: Preposition, Conjunction,
Interjection, Particle, Pronoun and Determiner. So the
system will only search for concepts associated with Nouns,
Adjectives and Adverbs.
With all this, the system works as follows. First, it takes
each word in the current sentence whose grammar category is
noun, adjective or adverb and looks for similar words among
the expressions already existing in the concept knowledge
base. Then, for each expression of the concept knowledge
base similar to the current word, if it is considered to be an
acceptable expression, these actions are performed: (1) obtain
and sort the knowledge associated with the expression in the
knowledge base; (2) create a new expression that matches the
knowledge-base expression and associate the previously sorted
knowledge with it as possible knowledge; and (3) add the new
expression to the list of fragment expressions together with
its associated knowledge.
Obviously, there might be cases where no good
options are found. In that case, the user has to be
provided with the possibility of defining new knowledge
associated with the expression. Alternatively, these
expressions might also be straightforwardly ignored.
This implies that the system needs to provide that
possibility to the user.
The 'similar' function referred to above is in charge of
identifying which expressions in the knowledge base are
similar to the current word of the fragment. In its simplest
form, it would be an 'equal' function. Nevertheless, such a
function cannot deal with compound expressions by itself;
therefore a function of the type 'isPrefix' is needed. The
'isPrefix' function checks whether or not the current word is
a prefix of another expression.
It would also be desirable for the function to deal
with word families (forms associated with a single
lemma/lexeme) and other language peculiarities. For
instance, if the expression 'causes' already exists in the
knowledge base and the current fragment contains the
word 'caused', it would be desirable for the system to
realise that both words actually allude to the same verb
(lemma). This issue might be partially addressed using
part-of-speech taggers and lemmatisers. Here, a word
in the current fragment is 'similar' to an expression in
the knowledge base if the expression starts with the
current word.
The 'acceptable' function, also referred to above, is an
extension of the 'similar' one. As the 'similar' function
can be very permissive, the 'acceptable' function is
introduced in order to determine whether the current word
and a similar expression are not just 'similar by chance'.
The 'isPrefix' function has an important drawback: if the
current word is the adjective 'cardiac', any expression
starting with 'cardiac', such as 'cardiac arrest', 'cardiac
disease', or 'cardiac glycosides', will be (a candidate to be)
considered similar.
Therefore, this function limits the number of acceptable
options amongst the similar ones. It has been designed with a
strong requirement: an existing expression in the database is
acceptable only if it actually appears in the current
fragment.
Current words in a text fragment are always single
constituents. However, database expressions can contain
more than one word (multiple-word expressions). If a
word is acceptable, then the current fragment must
contain all the words of the database expression. That
is, the current word needs to be enlarged to cover all the
words of the database expression, creating a new object
that contains all of them.
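The 'similar' and 'acceptable' functions can be sketched as follows, using the cardiac example from the text. This is our reading of the description above: 'acceptable' is taken to require the whole multi-word expression to appear starting at the current word's position, which also realises the enlargement step; the knowledge base contents are a stand-in.

```python
# Sketch of the concept-search functions. 'similar': a knowledge-base
# expression is similar to the current word if the expression starts
# with that word (the 'isPrefix' idea). 'acceptable': a similar
# expression is kept only if all of its words appear, in order, in the
# fragment starting at the current word's position.

KNOWLEDGE_BASE = ["cardiac disease", "cardiac glycosides",
                  "treatment", "treatment of cardiac disease"]

def similar(word, kb):
    return [expr for expr in kb if expr.startswith(word.lower())]

def acceptable(candidates, frag_words, i):
    out = []
    for expr in candidates:
        words = expr.split()
        if [w.lower() for w in frag_words[i:i + len(words)]] == words:
            out.append(expr)
    return out

frag = ("Cardiac glycosides have been used in the treatment "
        "of cardiac disease for more than 200 years").split()

print(acceptable(similar("Cardiac", KNOWLEDGE_BASE), frag, 0))
# ['cardiac glycosides']
cands = acceptable(similar("treatment", KNOWLEDGE_BASE), frag, 7)
print(max(cands, key=len))  # 'treatment of cardiac disease'
```

Picking the longest acceptable match, as in the last line, reproduces the behaviour of the worked example, although the paper does not state this tie-breaking rule explicitly.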
If different association possibilities exist in the database,
the system sorts them and displays them. The existence
of more than one possibility for associating knowledge is
usually due to the following reasons:
† Domain-dependency: the meanings given to a term can
vary according to the domain in which it is used.
† Person-dependency: different experts are likely to assign
different meanings to the same expression.
† Spatial location: if an expression has recently been used
with a specific meaning and the same expression appears
again, then it is very likely that both occurrences mean
the same.
Whenever different possibilities are considered as
inferred knowledge from an expression, the system
rearranges them according to the previous three factors.
Amongst those factors, spatial location operates in
two different ways: the system considers whether an
expression has already been used in the same text file
and/or in the current textual fragment (the latter case is
given the highest priority).
The various sorting criteria are characterised by three
parameters, namely, (1) who recognises the knowledge,
(2) the type of domain, and (3) whether the expression
belongs to the same fragment and/or text. In particular,
there are currently 11 different possible sorting
criteria. Once knowledge sorting has concluded, the
concept search phase ends. At this point, the system will
have processed both the current fragment and
the set of expressions present in its database. Additionally,
the inferred knowledge will have been sorted
according to the above criteria in an attempt to overcome
ambiguity.
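The ranking of ambiguous candidates can be sketched as a sort over priority flags. This is only an illustration of the idea: the field names, the candidate records and the ordering of the lower-priority factors are our assumptions, not the paper's 11 concrete criteria.

```python
# Illustrative sketch of ambiguity resolution: candidate meanings for an
# expression are ranked by where and by whom they were last used, with
# same-fragment use given the highest priority, as described above.

def rank(candidates, expert, domain):
    def key(c):
        return (c["in_fragment"],       # highest priority: same fragment
                c["in_file"],           # then: same text file
                c["expert"] == expert,  # then: same expert
                c["domain"] == domain)  # then: same domain
    return sorted(candidates, key=key, reverse=True)

candidates = [
    {"meaning": "Treatment (procedure)", "in_fragment": False,
     "in_file": True, "expert": "e1", "domain": "oncology"},
    {"meaning": "Treatment (drug course)", "in_fragment": True,
     "in_file": True, "expert": "e2", "domain": "oncology"},
]
print(rank(candidates, "e1", "oncology")[0]["meaning"])
# Treatment (drug course)
```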
As an example, let us suppose that the concept knowledge
base contains the following linguistic expression-concept
associations:
Linguistic expressions Concepts
Cardiac disease Cardiac disease
Cardiac glycosides Cardiac glycoside
… …
Treatment Treatment
Treatment of cardiac disease Treatment of cardiac disease
… …
The system gets the sentence “Cardiac glycosides have
been used in the treatment of cardiac disease for more than
200 years.”.
Suppose that the tagger has previously labelled all the
words in the sentence with their grammar category:
Cardiac [Adjective] glycosides [Noun] have [Verb
auxiliary] been [Verb auxiliary] used [Verb lexical] in
[Preposition] the [Determiner] treatment [Noun] of
[Preposition] cardiac [Adjective] disease [Noun] for
[Preposition] more [Adverb] than [Conjunction] 200
[Numeral] years [Noun]
The system will then look only for concepts
associated with Nouns, Adjectives, and Adverbs. The first
word in the sentence, 'Cardiac', has been labelled as an
adjective, so the system will look for similar linguistic
expressions in the concept knowledge base.
The set of similar expressions is {cardiac disease,
cardiac glycosides}. The system then executes the
acceptable function, obtaining that 'Cardiac glycoside' is
a concept.
The next word to be processed is 'treatment', because none
of the intervening words is labelled as a noun, an adjective,
or an adverb.
Similar: {treatment, treatment of cardiac disease}
Acceptable: {treatment of cardiac disease}
The next word is 'years', which has no similar expression
in the concept knowledge base.
The system will thus have found two concepts in the
current sentence: 'cardiac glycosides' and 'treatment of
cardiac disease'.
3.2.3. The MCRDR module
As we have mentioned earlier, in natural language
relationships between knowledge entities are usually
associated with verbs. So, this phase is mainly concerned
with obtaining relationships between concepts. However,
the system can also infer other knowledge categories,
such as concepts, attributes, or values.
This subsystem is formed by a knowledge base that
contains linguistic expressions representing relationships,
and by an MCRDR sub-system that infers the
participants in those relationships.
The system takes the word(s) in the current sentence
labelled as a verb and looks the corresponding linguistic
expression up in the knowledge base to obtain the
relationship type most often associated with this expression
(the verb in the current sentence). This search in the verb
knowledge base uses the similar and acceptable functions
described in the concept search phase. Obviously, there
might be cases where no good options are found. In that
case, the user has to be provided with the possibility of
defining new knowledge associated with the expression.
Once the system has found the relationship associated
with the main verb in the current sentence, it will
use the MCRDR sub-system to acquire knowledge by
means of the grammar category of the words, their
position in the current sentence, and the relation
associated with the verb, if any.
The system creates a case formed by the relationship
represented by the verb and the category of the other
words in the sentence. For example, in the sentence “Breast
cancer is a disease in which cancer cells are found in the
tissues of the breast”, the expression 'is a' represents an
'IS-A' relationship between two concepts ('Breast cancer'
and 'disease').
So, the system should have the following case:
Breast[Noun] cancer[Noun] is[Verb auxiliary] a[Preposi-
tion] disease[Noun] in[Preposition] which[Pronoun] can-
cer[Noun] cells[Noun] are[Verb auxiliary] found[Verb
lexical] in[Preposition] the[Determiner] tissues[Noun]
of[Preposition] the[Determiner] breast[Noun]
This case would be processed by the MCRDR sub-system
and would generate a conclusion.
Once the system obtains a conclusion, the expert
has to identify the knowledge entities and the relationship(s),
if any, in the current sentence in order to check
whether that conclusion is incorrect (i.e. whether it yields a
wrong or an incomplete ontology). Then, the system will
identify the differences between the previously stored
case and the new one. After that, the system will update
the MCRDR sub-system rules, and introduce into
the knowledge base the relationship and the expression
associated with it.
We can illustrate how the MCRDR subsystem works
through the following example. Initially, both the
MCRDR sub-system and the knowledge base, which
contains verbs and the type of relationship associated with
them, are assumed to be empty. Then, let us assume
that the system gets the sentence “Ultrasonography is a
method that sends high frequency sound waves into the
breast”, which has previously been analysed by the tagger
to obtain the grammar categories of each word in that
sentence.
Ultrasonography[Noun] is[Verb auxiliary] a[Preposi-
tion] method[Noun] that[Pronoun] sends[Verb lexical]
high[Adjective] frequency[Noun] sound[Noun] waves
[Noun] into[Preposition] the[Determiner] breast[Noun].
As both the knowledge base and the MCRDR sub-system
were still empty, the system did not infer any knowledge.
So, the expert identifies that the expression 'is a' is
associated with an 'IS-A' relationship between the concepts
'Ultrasonography' and 'method'.
The system will then store the expression-relationship
association in the knowledge base and introduce the new
case into the MCRDR sub-system, creating the following
rule:
If relation = 'IS-A' and category(pos(verb) - 1) = Noun
and category(pos(verb) + 1) = Noun then
{Concept(word(pos(verb) - 1)), Concept(word(pos(verb) + 1)),
Relation(IS-A, word(pos(verb) - 1), word(pos(verb) + 1))}
Here 'pos(verb)' is a function that returns the position of
the verb in the current sentence, the function 'category(i)'
returns the category of the word at position i, and the
function 'word(i)' returns the word at position i.
The system will then continue to analyse the text. Let us
now suppose that the system gets the sentence
Breast[Noun] cancer[Noun] is[Verb auxiliary]
a[Preposition] disease[Noun] in[Preposition] which[Pro-
noun] cancer[Noun] cells[Noun] are[Verb auxiliary]
found[Verb lexical] in[Preposition] the[Determiner]
tissues[Noun] of[Preposition] the[Determiner] breast
[Noun]
Given this situation, the system will look in the
knowledge base for expressions similar to the verb 'is'.
At that point, the knowledge base contained only one
expression similar to 'is', namely 'is a'. Now the system has
to obtain the acceptable expressions from the similar ones. We
can see that 'is a' is an acceptable expression in the current
sentence. After all this processing, the system will infer that
the expression 'is a' represents an 'IS-A' relationship, and a
rule is activated inferring that 'cancer IS-A disease'.
Let us suppose now that the expert realises that this is
wrong and corrects the system, asserting that 'Breast cancer
IS-A disease'. In this situation, the system will take the
generated case and compute the differences between the two
cases. There will only be one difference: the left-hand concept
is formed by two nouns, not by a single noun.
So the system will update the rules in the MCRDR sub-system,
inserting the following new rule:
If relation = 'IS-A' and category(pos(verb) - 1) = Noun
and category(pos(verb) - 2) = Noun and
category(pos(verb) + 1) = Noun then
{Concept(word(pos(verb) - 2) + word(pos(verb) - 1)),
Concept(word(pos(verb) + 1)),
Relation(IS-A, (word(pos(verb) - 2) + word(pos(verb) - 1)), word(pos(verb) + 1))}
Here the operator '+' concatenates two linguistic
expressions.
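The two rules above can be sketched together as follows. We assume, following the worked cases, that pos(verb) marks the start of the verbal expression ('is a') and that 'pos(verb) + 1' denotes the first position after that whole expression; the function names and the tagged sentences are our encoding, not the paper's implementation.

```python
# Sketch of the two MCRDR rules above. The refinement rule (two nouns
# before the verb form one compound concept) is tried before the
# original rule, mirroring child-before-parent evaluation in MCRDR.

def apply_is_a_rules(tagged, verb_expr="is a"):
    """tagged: list of (word, category) pairs; returns an inferred
    (relation, left concept, right concept) triple, or None."""
    words = [w for w, _ in tagged]
    cats = [c for _, c in tagged]
    ve = verb_expr.split()
    # pos(verb): start of the verbal expression in the sentence
    v = next(i for i in range(len(words))
             if [w.lower() for w in words[i:i + len(ve)]] == ve)
    after = v + len(ve)  # first position after the verbal expression
    # refinement rule (child): two nouns before the verb form one concept
    if v >= 2 and cats[v - 2] == cats[v - 1] == "Noun" and cats[after] == "Noun":
        return ("IS-A", words[v - 2] + " " + words[v - 1], words[after])
    # original rule (parent): a single noun on each side of the verb
    if cats[v - 1] == "Noun" and cats[after] == "Noun":
        return ("IS-A", words[v - 1], words[after])
    return None

sentence1 = [("Ultrasonography", "Noun"), ("is", "Verb auxiliary"),
             ("a", "Preposition"), ("method", "Noun")]
sentence2 = [("Breast", "Noun"), ("cancer", "Noun"), ("is", "Verb auxiliary"),
             ("a", "Preposition"), ("disease", "Noun")]
print(apply_is_a_rules(sentence1))  # ('IS-A', 'Ultrasonography', 'method')
print(apply_is_a_rules(sentence2))  # ('IS-A', 'Breast cancer', 'disease')
```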
As can be noticed, with this subsystem it is only
necessary to store in the knowledge base the linguistic
expressions that represent relationships, although the
technique can infer other knowledge entities. It is also
worth noting that if the MCRDR sub-system contains
several cases and the knowledge base contains several
verbs, then the system will be capable of inferring many
relationships and knowledge entities just from their
respective grammar categories.
4. The KAText tool
The approach described in Sections 2 and 3 has been
implemented in a software tool capable of acquiring medical
knowledge from texts. The system receives text files as input,
and its output is the list of knowledge entities contained in
each text file. Each text is divided into different fragments,
and the system processes one fragment per iteration. The
system is also capable of dealing with multiple knowledge
domains and multiple domain subtasks, since it contains
mechanisms for managing such features and for keeping the
knowledge extracted for one domain/task independent of the
knowledge obtained for a different one. The final user of the
tool is an expert in a particular domain who wants to extract
the knowledge contained in texts. Each expert is associated
with one or more domains, and a concrete working session is
identified by the tuple (expert, domain, task, text file);
that is, a knowledge acquisition session is performed by an
expert, in a specific domain, for a particular task, and on the
text file containing the knowledge to extract.
4.1. Operation modes
Two working modes have been provided in this tool,
namely, maintenance mode and query mode. Let us now
describe briefly both modes.
Maintenance mode: In this mode, users can add new
experts, new domains and new tasks, create knowledge-task
associations, save sessions in the database, resume incomplete
previous sessions, etc. The user makes knowledge explicit
with the help of the tool, because the tool makes knowledge
suggestions to the user. Suggestions are based on the
application of the natural language recognition techniques
described earlier in this paper.
Query mode: In this mode, users can neither perform
management activities, nor access saved sessions, nor save
new ones. The user cannot input new knowledge, since
ontologies are built automatically.
4.2. Operational description
When an expert session is active, the knowledge
interaction between expert and system takes place through
two windows, namely, Relation and Inference (see Fig. 2).
The Relation window contains the previous, current, and next
sentence, as well as the expression, if any, which is
considered the verbal form of the sentence. The system is
capable of inferring concepts, attributes, values and
relations and such knowledge is shown to the user in the
Inference window.
5. Performance of the system
A case study concerning the system described in the
previous sections has been carried out in an oncology
domain; to be more precise, the domain is breast cancer.
Breast cancer is the most frequent cancer in women, and
early detection is key to cure and survival. Therefore, it is
very important to know the detection and prevention methods.
The ontology of this domain contains information about how a
woman can prevent breast cancer and explore her breast. The
following experiment was carried out.
Four PhD students were asked to use the system with a
corpus of 4648 words. The aim of this experiment was to
analyse whether the tool was capable of learning and
suggesting the correct knowledge to the users. For this
purpose, the corpus was divided into six fragments (each
having an identical number of sentences), for a total of 310
sentences. The following tables show the accuracy of the
knowledge suggestions proposed by the system to the users.
The accuracy score is the number of knowledge entities
suggested by the system and accepted by the expert user,
divided by the total number of knowledge entities existing in
the text. To complete the description of the conditions in
which the experiment was run, it should be noted that the
knowledge base was initially empty; that is, each student
started with an empty knowledge base, whose content grew
as the knowledge acquisition process evolved.
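The accuracy score defined above, and its accumulated counterpart over the fragments seen so far, can be sketched as follows. The entity counts below are invented for illustration and are not the experiment's data.

```python
# Sketch of the accuracy measure: accepted suggestions divided by the
# total knowledge entities in the text, reported as a percentage, with
# an accumulated score pooled over all fragments processed so far.

def accuracy(accepted, total):
    return 100.0 * accepted / total

def accumulated(blocks):
    """blocks: list of (accepted, total) pairs, one per fragment."""
    acc = sum(a for a, _ in blocks)
    tot = sum(t for _, t in blocks)
    return accuracy(acc, tot)

blocks = [(45, 100), (67, 100), (75, 100)]  # invented per-fragment counts
print(round(accuracy(*blocks[0]), 2))  # 45.0
print(round(accumulated(blocks), 2))   # 62.33
```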
Table 1 displays the partial results for each fragment.
From the results contained in that table, the learning
capability of the system can be affirmed, since the accuracy
of the system increases with the number of fragments
processed. Table 2 includes the accumulated results of the
learning process. Two
Fig. 2. The system’s modus operandi.
Table 1
Evolution of the system accuracy
Block Student1 Student2 Student3 Student4
1 44.69 51.07 44.44 49.15
2 67.2 72.79 68.34 67.34
3 74.63 90.32 63.06 76.69
4 83.44 83.49 64.05 79.66
5 83.41 87.03 77.89 88.95
6 82.30 88.0 82.47 92.55
Table 2
Accumulated system accuracy
Accumulated Student1 Student2 Student3 Student4
1 44.69 51.07 44.44 49.15
2 55.61 61.81 56.98 57.40
3 61.03 69.02 58.77 62.06
4 66.35 72.18 60.30 65.87
5 70.60 75.37 64.95 71.18
6 72.15 77.18 67.03 73.49
hypotheses (Hypotheses 1 and 2) were formulated at the
beginning of this experiment:
Hypothesis 1 (Similarity). Provided that the four students
had similar expertise on the topic and worked on the
same pieces of text, it was expected that they would obtain
similar results in this experiment. This working hypothesis
is confirmed by the results obtained, since the students
obtained similar scores and the accuracy curve is similar
for each of them. Therefore, it can be stated that the system
allows people with similar knowledge to obtain similar
results. This is a desirable property, because it implies that
experts (assuming scalability from the PhD students used in
this experiment) identify the same knowledge entities despite
making their knowledge explicit in different manners.
Hypothesis 2 (Usefulness). The system is useful. In order to
validate this hypothesis, the final accuracy scores obtained
by the system for each student have to be discussed. The
system obtained its best score for student 2, with 77.18%,
while the final scores for the other students ranged from
67.03 to 73.49%; the average final score was thus 72.46%.
Given the initial conditions of the experiment, this result
can be considered a good one: since the knowledge base
was empty at the beginning, the system was unable to
suggest knowledge in the first text fragments. Table 2
confirms this, since the system achieves an accuracy score
below 50% in the first block, whereas this score is larger for
the subsequent blocks, in which the system possesses more
knowledge and is therefore capable of making better
suggestions. From block 3 onwards, the per-block accuracy
of the system is, on average, greater than 75%, so the results
can be considered promising. However, experiments on a
larger corpus should be performed in order to confirm the
usefulness of the system that these results suggest.
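The average final score quoted above can be checked directly from the last row of Table 2:

```python
# Final accumulated scores (block 6 of Table 2), one per student.
final_scores = [72.15, 77.18, 67.03, 73.49]

# Mean of the four final scores, rounded to two decimals.
average = round(sum(final_scores) / len(final_scores), 2)
print(average)  # 72.46, the average reported in the text
```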
6. Discussion and conclusions
This work has presented a new methodology for acquiring
knowledge from texts. The approach is based on the use of
the grammatical categories of words, a
small verbal knowledge base, and a small conceptual
knowledge base for performing inferences. Three basic
assumptions were initially formulated: (1) ontologies are
useful for representing knowledge; (2) experts are capable
of specifying ontologies; and (3) knowledge resides in text.
The construction of ontologies is an important issue for
the knowledge acquisition and knowledge representation
communities. One of the hottest research trends in this area
is ontology learning from Web documents, which is
considered an important activity to promote the Semantic
Web (Alani et al., 2003). Most ontology learning
approaches (Omelayenko, 2001) are mainly concerned
with generating taxonomies, and the ontology construction
process is human-driven. Knowledge acquisition from texts
is a process that has already been considered part of the
ontology learning process (Maedche & Staab, 2001). In our
approach, one of the main knowledge sources, that is,
natural language texts, is used to acquire ontological
elements in a semi-automatic way. Another feature of this
approach is that it works with multiple semantic relations
(not only taxonomies). We are currently planning to use
WordNet (Miller, 1990) to verify the correctness of the
relations inferred by our system and to obtain new
relations, as is done by Navigli, Velardi, and Gangemi
(2003), who use WordNet to interpret semantic terms and
to identify mainly taxonomic and similarity relations.
An experiment has been carried out with the objective of
assessing whether the system is useful for extracting
knowledge from texts. The results, though they still do not
reflect the real potential of the approach since the
experiment was performed at a small scale, are very
promising. Thus, a larger validation of the system is
planned, applying the system to texts from different
medical domains and using statistical methods for
analysing the results obtained. Moreover, we intend to
extend the system to cover axioms. The main foreseeable
problem concerning axioms, however, is that the number of
participants in an axiom is a priori unknown. Nevertheless,
the amount of axioms present in a text is not significant
compared to the amount of other knowledge entities.
Acknowledgements
We thank the Spanish Ministry for Science and
Technology for its support for the development of the
system through projects TIC-2002-03879, FIT-150200-
2001-320, FIT-070000-2001-785, FIT-150500-2003-503,
FIT-110100-2003-73, FIT-150500-2003-499 and Seneca
Foundation through projects PL/3/FS/00 and PI-
16/0085/FS/01.
References
Alani, H., Kim, S., Millard, D. E., Weal, M. J., Hall, W., Lewis, P. H., &
Shadbolt, N. R. (2003). Automatic ontology-based knowledge extraction
from Web documents. IEEE Intelligent Systems, 18(1), 14–21.
Buchanan, B. (1986). Expert systems: working systems and the research
literature. Expert Systems, 3, 32–51.
Compton, P., Horn, R., Quinlan, R., & Lazarus, L. (1989). Maintaining an
expert system. In J. R. Quinlan (Ed.), Applications of expert systems
(pp. 366–385). London: Addison-Wesley.
Fernandez-Breis, J. T., Castellanos-Nieves, D., Valencia-Garcia, R.,
Vivancos-Vicente, P. J., Martınez-Bejar, R., & De las Heras-Gonzalez,
M. (2001). Towards Scott domains-based topological ontology models.
An application to a cancer domain. In Proceedings of the International
Conference on Formal Ontology in Information Systems (pp. 127–138).
Maine, USA.
Gomez-Perez, A., & Benjamins, V. R. (1999). Overview of knowledge
sharing and reuse components: ontologies and problem-solving
methods. in Proceedings of the IJCAI-99 workshop on Ontologies
and Problem-Solving Methods (KRR5) Stockholm, Sweden.
Gomez-Perez, A., Moreno, A., Pazos, J., & Sierra-Alonso, A. (2000).
Knowledge maps: an essential technique for conceptualization. Data
and Knowledge Engineering, 33, 169–190.
Horn, K., Compton, P. J., Lazarus, L., & Quinlan, J. R. (1985). An expert
system for the interpretation of thyroid assays in a clinical laboratory.
Australian Computer Journal, 17, 7–11.
Kang, B. (1996). Multiple classification ripple down rules. PhD Thesis,
University of New South Wales.
Maedche, A., & Staab, S. (2001). Ontology learning for the semantic web.
IEEE Intelligent Systems, 16(2), 72–79.
Martınez-Bejar, R., Ibanez-Cruz, F., Compton, P., & Minh Cao, T. (2001).
An easy-maintenance, reusable approach for building knowledge-based
systems. Expert Systems with Applications, 20(2), 153–162.
Miller, G. A. (1990). WordNet: an on-line lexical database. International
Journal of Lexicography, 3(4).
Musen, M. A. (1998). Domain ontologies in software engineering: use of
Protege with the EON architecture. Methods of Information in
Medicine, 37, 540–550.
Navigli, R., Velardi, P., & Gangemi, A. (2003). Ontology learning and its
application to automated terminology translation. IEEE Intelligent
Systems, January/February, 22–31.
Omelayenko, B. (2001). Learning of ontologies for the web: the analysis of
existent approaches. Proceedings of the International Workshop on
Web Dynamics, London, UK.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo,
CA: Morgan Kaufmann.
Ruiz-Sanchez, J. M., Valencia-Garcıa, R., Fernandez-Breis, J. T., Martınez-
Bejar, R., & Compton, P. (2003). An approach for incremental
knowledge acquisition from text. Expert Systems with Applications,
25(2), 77–86.
Van Heijst, G., Schreiber, A. T., & Wielinga, B. J. (1997). Using explicit
ontologies in KBS development. International Journal of Human–
Computer Studies, 46, 183–292.