An incremental approach for discovering medical knowledge from texts

Rafael Valencia-García, Juana María Ruiz-Sánchez, Pedro José Vivancos-Vicente, Jesualdo Tomás Fernández-Breis*, Rodrigo Martínez-Béjar

Departamento de Ingeniería de la Información y las Comunicaciones, Facultad de Informática, Universidad de Murcia, Murcia 30071, Spain

Abstract

Vast amounts of medical knowledge reside within text documents, so that the automatic extraction of such knowledge would certainly be

beneficial for clinical activities. A user-centred approach for the incremental extraction of knowledge from text, which is based on both

knowledge technologies and natural language processing techniques, is presented in this work. In this approach, ontologies are used to

provide a formal, structured, reusable and shared knowledge representation. The system has been used to extract clinical knowledge from

texts concerning Oncology.

© 2003 Elsevier Ltd. All rights reserved.

Keywords: Knowledge acquisition; Knowledge representation; Ontologies

1. Introduction

Extracting knowledge directly from free text is a challenging task. Achieving this objective would allow knowledge to be extracted easily and without the intervention of knowledge engineers. In this work, we present an approach for building ontologies from texts in a supervised mode. This work extends that presented in Ruiz-Sánchez, Valencia-García, Fernández-Breis, Martínez-Béjar, & Compton (2003), where the authors described a software tool for ontology building from natural language texts. The principal idea underlying our approach is that, in natural language, relationships between knowledge entities are usually associated with verbs. In this work, we also introduce a system that tries to discover terms which represent concepts by means of a database that relates these concepts to linguistic expressions. This system is based on an architecture formed by three modules: morphological analysis, concept search, and inference. Accordingly, our system stores the verbs that represent a relationship between concepts so that this knowledge can be identified automatically whenever it reappears. The system makes use of two knowledge technologies, as well as the grammar category of the words in the current sentence, to infer knowledge entities (i.e. concepts, attributes and values) from a text fragment. The system has been applied to an Oncology domain and the results of this experiment are discussed in this paper.

The knowledge technologies are MCRDR and ontologies. RDR (Compton, Horn, Quinlan, & Lazarus, 1989) is a technology that provides case-based reasoning (CBR), which consists of using previous situations to solve current problems. Most CBR methodologies present maintenance problems. However, RDR overcomes this problem and allows experts to build and maintain knowledge bases without support. MCRDR (Kang, 1996) is an extension to RDR that allows for working with multiple conclusions. On the other hand, ontologies are commonly defined as specifications of domain knowledge conceptualisations (Van Heijst, Schreiber, & Wielinga, 1997). Due to the very nature of ontology, there is no unique (valid) manner to define ontologies (Musen, 1998). Moreover, several definitions have historically been given to the term ontology, although it is commonly considered to be an enumeration of the relevant concepts in an application area, as well as a definition of classes of concepts and relationships among these classes (Fernández-Breis, Castellanos-Nieves, Valencia-García, Vivancos-Vicente, Martínez-Béjar, & De las Heras-González, 2001). According to some authors, an advantage of ontologies is the possibility of making a mathematical study of their properties, among which shareability and reusability should be pointed out (Gómez-Pérez & Benjamins, 1999).

0957-4174/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.

doi:10.1016/j.eswa.2003.09.001

Expert Systems with Applications 26 (2004) 291–299

www.elsevier.com/locate/eswa

* Corresponding author. Tel.: +34-968367345; fax: +34-968364151.

E-mail addresses: [email protected] (J.T. Fernández-Breis); [email protected] (R. Valencia-García); [email protected] (J.M. Ruiz-Sánchez); [email protected] (P.J. Vivancos-Vicente); [email protected] (R. Martínez-Béjar).

The paper is structured as follows. In Section 2, the knowledge technologies used in this approach are described. The knowledge acquisition methodology developed is described in detail in Section 3. In Section 4, the new version of KAText (v.2) is described. The validation experiment is discussed in Section 5. Finally, Section 6 contains the conclusions of this work.

2. Knowledge technologies applied

In this work, an approach based on ontologies and Ripple-Down Rules (RDR) has been used. On the one hand, ontologies have been used to model the medical knowledge extracted from the texts. In particular, the ontology model used in this work is an extension of that presented in (Ruiz-Sánchez et al., 2003). There, ontologies are represented through sets of concepts having the following properties. First, concepts are defined through a set of attributes and a set of interconceptual relations. These relations are of the following types: taxonomy, mereology, equivalency, dependency, topology, causality, functionality, similarity, conditionality, purpose, and chronology. These are the most common relations in problems (Gómez-Pérez, Moreno, Pazos, & Sierra-Alonso, 2000), although other relations between knowledge entities are possible. The extension made here to the ontological model used in previous works is to allow users to define their own types of relations. This feature allows our ontologies to cover more domain knowledge. However, the user-defined relations do not have the formal properties that the pre-defined ones have. Moreover, structural axioms are also derived from the structure of the ontologies themselves.

On the other hand, Ripple-Down Rules (Compton et al., 1989) have been the technology selected for supporting the reasoning processes of this approach. This methodology has been widely used for medical purposes (see, for instance, Buchanan, 1986; Horn, Compton, Lazarus, & Quinlan, 1985). An RDR system is based on a binary tree with a rule at each node. Each child rule replaces the conclusion given by its parent rule. When an input case satisfies a rule, it is then evaluated against the child rules of that rule. The conclusion of the last fired rule is the conclusion given by the RDR system, and the input case for which that conclusion is given is called the cornerstone case. When the system provides an incorrect answer, a new rule is added to the RDR tree to cover this situation. If a parent rule is satisfied but its child rule is not, the new rule is added to the false branch of the child rule in order to preserve the rule evaluation sequence in subsequent executions. In our case, the antecedents of the rules are linguistic expressions and the conclusions are the ontological entities (i.e. concepts, attributes, values, or relations) extracted from those expressions. This issue is discussed further in Sections 3 and 4. However, we do not use RDR but an extension called MCRDR (Kang, 1996). In MCRDR, the tree is n-ary, so that each node may have multiple successors. In particular, all children of a satisfied parent rule are evaluated, allowing for multiple conclusions.
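The MCRDR evaluation scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the KAText implementation: the `Rule` class, the `evaluate_mcrdr` function, the dictionary-based case format and the toy conditions are all hypothetical names introduced here. Each rule holds a condition, a conclusion and a list of child rules that refine it; every satisfied branch contributes the conclusion of its deepest satisfied rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Case = Dict[str, str]  # hypothetical case format: attribute name -> value


@dataclass
class Rule:
    condition: Callable[[Case], bool]
    conclusion: str
    children: List["Rule"] = field(default_factory=list)


def evaluate_mcrdr(rules: List[Rule], case: Case) -> List[str]:
    """Collect one conclusion per satisfied branch of an n-ary rule tree."""
    conclusions: List[str] = []
    for rule in rules:
        if not rule.condition(case):
            continue
        # All children of a satisfied rule are evaluated.
        refined = evaluate_mcrdr(rule.children, case)
        # A satisfied child replaces the conclusion of its parent.
        conclusions.extend(refined if refined else [rule.conclusion])
    return conclusions


# Toy rule base (assumed for illustration only).
rules = [
    Rule(
        lambda c: c["relation"] == "IS-A",
        "two single-word concepts linked by IS-A",
        children=[
            Rule(
                lambda c: c["before_verb"] == "Noun Noun",
                "compound concept linked by IS-A",
            )
        ],
    ),
    Rule(lambda c: c["after_verb"] == "Noun", "concept after the verb"),
]

simple_case = {"relation": "IS-A", "before_verb": "Noun", "after_verb": "Noun"}
compound_case = {"relation": "IS-A", "before_verb": "Noun Noun", "after_verb": "Noun"}

print(evaluate_mcrdr(rules, simple_case))
print(evaluate_mcrdr(rules, compound_case))
```

In a single-conclusion RDR tree only one path would be followed; here both top-level rules can fire for the same case, which is what allows several knowledge entities to be concluded from one sentence.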

Ontologies and RDR have previously been brought together to develop KBSs (Martínez-Béjar, Ibáñez-Cruz, Compton, & Mihn Cao, 2001). However, in this work the combination of both technologies has a different purpose. In our approach, RDR is used to learn and infer knowledge from medical texts, and the extracted and suggested knowledge is expressed in an ontological manner, that is, through the ontological entities defined in the previously described ontology model.

3. A framework for acquiring knowledge

3.1. Assumptions

The aim of this work was to present a technique for extracting knowledge from natural language texts. More precisely, this work has focused on the implementation of a system which builds an ontology from a text. So, an implicit assumption (Assumption 1) is that ontologies can be used to represent knowledge. As the whole text can be very long, it seemed convenient to divide it into smaller fragments in order to facilitate its processing. Furthermore, the approach presented here is based on the idea that relationships are usually associated with verbs in natural language. Also, we decided, as is normally done, to divide texts into sentences.

In this approach, the expert is in charge of building the ontology. This gives rise to another assumption (Assumption 2): experts can extract semantic relationships from text. This expert must have expertise in the specific task described in the text, and the expert is associated with both the system and the text by the task itself. In a way, we can say that knowledge resides inside the text. So, there is another implicit assumption (Assumption 3) in this statement, namely, that text can contain knowledge.

Ontologies allow knowledge to be divided into categories such as concepts, attributes, relationships, rules, axioms, etc. These knowledge entities can appear explicitly in the text, although sometimes knowledge is only referred to implicitly. The system attempts to find only explicit knowledge in the text.

Taking into account the above assumptions, the starting point is a system composed of three modules, namely, an empty concept knowledge base, an empty verb knowledge base and an empty MCRDR sub-system. At this stage, the system is hence unable to find any knowledge in the input text and the expert has to introduce knowledge manually. The expert's task is then to identify the relationship associated with the main verb (if any) of the current sentence and all the other knowledge entities of the current fragment of text. He/she will also tell the system the expressions in which they appear. The expression-knowledge associations for the concepts and relations are then stored in the system (the expressions associated with concepts in the conceptual knowledge base, and the expressions associated with relationships in the verb knowledge base) in order to be used for subsequent knowledge findings. An MCRDR sub-system is then created and maintained by the system in order to be used for acquiring the knowledge entities which participate in the relationship in question.

The rules of the MCRDR sub-system are based on the relationship represented by the main verb in the current sentence, the grammar category of the knowledge entities which appear in that sentence, and the position of these knowledge entities with respect to the verb in the sentence.

The system shows the relationship and the knowledge

entities in the current sentence, and the expert will just

have to confirm the results output by the system. If the

real conclusion is different from the conclusion inferred

by the system, then the expert has to introduce the

correct conclusion. After that, the system will add the

new case into the MCRDR sub-system and will update

the rules.

3.2. The knowledge acquisition process

The knowledge acquisition process is divided into three

sequential phases, which have been implemented in three

separate modules: morphological analysis, concept search

and inference (Fig. 1).

3.2.1. Morphological analysis module

The main objective of the first phase in our approach is to

get the grammar category of each word in the current

sentence. So the system has to use a tagger (i.e.

morphological analyser) for this purpose.

For example, if the system gets the sentence “Cardiac

glycosides have been used in the treatment of cardiac disease

for more than 200 years”, after the morphological analysis

phase the system will obtain all the words in the sentence

labelled with the grammar category in that sentence:

Cardiac [Adjective] glycosides [Noun] have [Verb

auxiliary] been [Verb auxiliary] used [Verb lexical] in

[Preposition] the [Determiner] treatment [Noun]

of [Preposition] cardiac [Adjective] disease [Noun]

for [Preposition] more [Adverb] than [Conjunction]

200 [Numeral] years [Noun]

As our approach was conceived as language-independent, we decided to develop a language-independent tagger (at least in terms of Western-origin languages) that uses the C4.5 learning algorithm (Quinlan, 1993) to infer the grammar categories of the words in a sentence (Ruiz-Sánchez et al., 2003).

3.2.2. The concept search module

The first goal of this subsystem is to find linguistic expressions that represent concepts. The associations between linguistic expressions and the concepts are stored in a database, called the conceptual knowledge base.

Fig. 1. The knowledge acquisition process.

The search process is quite simple, and its result is a list containing all the expressions of the fragment that are already contained in the concept knowledge base. In this approach it is assumed that some words are semantically meaningless. These words usually have the following grammar categories: Preposition, Conjunction, Interjection, Particle, Pronoun and Determiner. So the system will only search for concepts associated with Nouns, Adjectives and Adverbs.

With all this, the system works as follows. First, it takes each word in the current sentence whose grammar category is noun, adjective or adverb, and looks for similar words in the expressions already existing in the concept knowledge base. Then, for each expression of the concept knowledge base that is similar to the current word and is considered to be acceptable, these actions are performed: (1) obtain and sort the knowledge associated with the expression in the knowledge base; (2) create a new expression that matches the concept knowledge-base expression and associate the previously sorted knowledge with it as possible knowledge; and (3) add the new expression, with its associated knowledge, to the list of fragment expressions.

Obviously, there might be cases where no good

options are found. In that case, the user has to be

provided with the possibility of defining new knowledge

associated with the expression. Alternatively, these

expressions might also be straightforwardly ignored.

This implies that the system needs to provide that

possibility to the user.

The similar function referred to above is in charge of identifying which expressions of the knowledge base are similar to the current word of the fragment. In its simplest case, it would be an 'equal' function. Nevertheless, such a function cannot deal with compound expressions by itself; therefore a function of the type 'isPrefix' is needed. The 'isPrefix' function checks whether the current word is the initial substring (prefix) of a knowledge-base expression.

It would also be desirable that the function could deal

with word families (types associated to a single

lemma/lexeme) and other language peculiarities. For

instance, if the expression ‘causes’ already exists in the

knowledge base and the current fragment contains the

word ‘caused’, it would be desirable that the system

realised that both words actually allude to the same verb

(lemma). This could be partially achieved using part-of-speech taggers and lemmatisers. In this work, a word in the current fragment is 'similar' to an expression in the knowledge base if the expression starts with the current word.

The acceptable function, also referred to above, is an extension of the 'similar' one. As the 'similar' function can be very permissive, the 'acceptable' function is introduced in order to determine whether the current word and a similar expression are not just 'similar by chance'. The 'isPrefix' function has an important drawback: if the current word is the adjective 'cardiac', any expression starting with 'cardiac', such as 'cardiac arrest', 'cardiac disease', or 'cardiac glycosides', will be a candidate to be considered similar.

Therefore, the 'acceptable' function limits the number of acceptable options amongst the similar ones. It has been designed with a strong requirement: an existing expression in the database is acceptable only if it actually appears in the current fragment.

Current words in a text fragment are always single

constituents. However, database expressions can contain

more than one word (multiple-word expressions). If a

word is acceptable, then the current fragment will

contain all the words of the database expression. That

is, the current word needs to be enlarged to cover all the

words of the database expression, creating a new object

that contains all the words.
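The 'similar' and 'acceptable' functions and the enlargement step can be sketched as follows. This is a minimal sketch under stated assumptions, not the KAText code: the function names, the lower-casing of words, the requirement that an acceptable expression occurs exactly at the current position, and the preference for the longest acceptable expression are all assumptions made here for illustration. The knowledge base and the tagged sentence are the ones used in the paper's own example.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, grammar category)
CONTENT_CATEGORIES = {"Noun", "Adjective", "Adverb"}


def similar(word: str, expressions: List[str]) -> List[str]:
    """'isPrefix'-style test: expressions that start with the current word."""
    return [e for e in expressions if e.startswith(word)]


def acceptable(position: int, words: List[str], candidates: List[str]) -> List[str]:
    """Keep the candidates whose words all appear in the fragment at `position`."""
    kept = []
    for expression in candidates:
        parts = expression.split()
        if words[position:position + len(parts)] == parts:
            kept.append(expression)
    return kept


def find_concepts(tokens: List[Token], knowledge_base: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (expression, concept) pairs found in a tagged sentence."""
    words = [word.lower() for word, _ in tokens]
    found = []
    i = 0
    while i < len(tokens):
        if tokens[i][1] in CONTENT_CATEGORIES:
            options = acceptable(i, words, similar(words[i], list(knowledge_base)))
            if options:
                # Assumption: prefer the longest acceptable expression.
                best = max(options, key=lambda e: len(e.split()))
                found.append((best, knowledge_base[best]))
                # Enlarge the current word to cover the whole expression.
                i += len(best.split())
                continue
        i += 1
    return found


knowledge_base = {
    "cardiac disease": "Cardiac disease",
    "cardiac glycosides": "Cardiac glycoside",
    "treatment": "Treatment",
    "treatment of cardiac disease": "Treatment of cardiac disease",
}

sentence = [
    ("Cardiac", "Adjective"), ("glycosides", "Noun"), ("have", "Verb auxiliary"),
    ("been", "Verb auxiliary"), ("used", "Verb lexical"), ("in", "Preposition"),
    ("the", "Determiner"), ("treatment", "Noun"), ("of", "Preposition"),
    ("cardiac", "Adjective"), ("disease", "Noun"), ("for", "Preposition"),
    ("more", "Adverb"), ("than", "Conjunction"), ("200", "Numeral"), ("years", "Noun"),
]

for expression, concept in find_concepts(sentence, knowledge_base):
    print(expression, "->", concept)
```

Skipping past the words covered by an accepted expression is what prevents 'disease' from being looked up again after 'treatment of cardiac disease' has been recognised.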

If different association possibilities in the database exist,

the system sorts them out and displays them. The existence

of more than one possibility for associating knowledge is

likely due to the following reasons:

• Domain-dependency: the different meanings given to a term can vary according to the domain in which it is used.

• Person-dependency: it is likely that various experts assign different meanings to the same expression.

• Spatial-location: if an expression has been used recently with a specific meaning and the same expression appears again, then it is very likely that both expressions mean the same.

Whenever different possibilities are considered as

inferred knowledge from an expression, the system

rearranges them according to the previous three factors.

Amongst those factors, the spatial location interacts in

two different ways. The system considers whether an

expression has already been used in the same text file

and/or in the current textual fragment (this case is given

the highest priority).

The various sorting criteria are characterised by three parameters, namely, (1) who recognises the knowledge, (2) the type of domain and (3) whether the expression belongs to the same fragment and/or text. In particular, there are currently 11 different possible sorting criteria. Once knowledge sorting has concluded, the concept search phase ends. At this point, the system will have processed both the current fragment and the set of expressions present in its database. Additionally, the inferred knowledge will have been sorted according to the above criteria in an attempt to overcome ambiguity.
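One way to picture this kind of ranking is as a sort on a tuple of flags. The sketch below is purely illustrative: the paper does not list its 11 sorting criteria, so the `Association` record, the `rank` function and the relative order of the expert and domain factors are assumptions introduced here. Only the rule that use in the current fragment has the highest priority comes from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Association:
    """Hypothetical record of one expression-knowledge association."""
    concept: str
    expert: str
    domain: str
    in_current_fragment: bool
    in_current_text: bool


def rank(options: List[Association], expert: str, domain: str) -> List[Association]:
    """Sort candidate associations, most plausible first."""
    def priority(a: Association):
        # Tuples compare element by element, so earlier flags dominate.
        return (
            a.in_current_fragment,   # highest priority, as stated in the text
            a.in_current_text,
            a.expert == expert,      # assumed order of the remaining factors
            a.domain == domain,
        )
    return sorted(options, key=priority, reverse=True)


options = [
    Association("Concept A", "expert2", "Cardiology", False, False),
    Association("Concept B", "expert1", "Oncology", False, True),
    Association("Concept C", "expert2", "Oncology", True, True),
]

for association in rank(options, expert="expert1", domain="Oncology"):
    print(association.concept)
```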

As an example, let us suppose that the concept knowledge base contains the following linguistic expression-concept associations:

Linguistic expressions          Concepts
Cardiac disease                 Cardiac disease
Cardiac glycosides              Cardiac glycoside
…                               …
Treatment                       Treatment
Treatment of cardiac disease    Treatment of cardiac disease
…                               …

The system gets the sentence “Cardiac glycosides have

been used in the treatment of cardiac disease for more than

200 years.”.

Suppose that the tagger has previously labelled all the

words in the sentence with their grammar category:

Cardiac [Adjective] glycosides [Noun] have [Verb

auxiliary] been [Verb auxiliary] used [Verb lexical] in

[Preposition] the [Determiner] treatment [Noun] of

[Preposition] cardiac [Adjective] disease [Noun] for

[Preposition] more [Adverb] than [Conjunction] 200

[Numeral] years [Noun]

The system will then look only for concepts associated with Nouns, Adjectives, and Adverbs. The first word in the sentence, 'Cardiac', has been labelled as an adjective, so the system will look for similar linguistic expressions in the concept knowledge base. The set of similar expressions is {cardiac disease, cardiac glycosides}. The system then executes the acceptable function, obtaining that 'cardiac glycosides' is the acceptable expression and hence that Cardiac glycoside is a concept.

The next word to be processed is 'treatment', because the words in between are either part of the expression just accepted or are not labelled as a noun, an adjective or an adverb.

Similar: {treatment, treatment of cardiac disease}

Acceptable: {treatment of cardiac disease}

The next word is 'years', which does not have any similar expression in the concept knowledge base.

The system will then have found two concepts in the current sentence: 'cardiac glycosides' and 'treatment of cardiac disease'.

3.2.3. The MCRDR module

As mentioned earlier, in natural language relationships between knowledge entities are usually associated with verbs. So, this phase is mainly concerned with obtaining relationships between concepts. However, the system can also infer other knowledge categories such as concepts, attributes, or values.

This subsystem is formed by a knowledge base that contains linguistic expressions representing relationships, and by an MCRDR sub-system proper, which infers the participants of the relationships.

The system takes the word(s) in the current sentence labelled as a verb and looks up the linguistic expression in the knowledge base to obtain the type of relationship most frequently associated with this expression (the verb in the current sentence). This search in the verb knowledge base uses the similar and acceptable functions described in the concept search phase. Obviously, there might be cases where no good options are found. In that case, the user has to be provided with the possibility of defining new knowledge associated with the expression.

Once the system has found the relationship associated

to the main verb in the current sentence, the system will

use the MCRDR sub-system to acquire knowledge by

means of the grammar category of the words, their

position in the current sentence, and the relation

associated to the verb, if any.

The system creates a case formed by the relationship represented by the verb and the category of the other words in the sentence. For example, in the sentence "Breast cancer is a disease in which cancer cells are found in the tissues of the breast", the expression 'is a' represents an 'IS-A' relationship between two concepts ('Breast cancer' and 'disease').

So, the system should have the following case:

Breast[Noun] cancer[Noun] is[Verb auxiliary] a[Preposition] disease[Noun] in[Preposition] which[Pronoun] cancer[Noun] cells[Noun] are[Verb auxiliary] found[Verb lexical] in[Preposition] the[Determiner] tissues[Noun] of[Preposition] the[Determiner] breast[Noun]

This case would be processed by the MCRDR sub-

system and would generate a conclusion.

Once the system gets a conclusion, the expert has to identify the knowledge entities and the relationship(s), if any, in the current sentence in order to check whether that conclusion is incorrect (i.e. whether it yields a wrong or an incomplete ontology). If so, the system will identify the differences between the previously stored case and the new one. After that, the system will update the MCRDR sub-system rules, and introduce into the knowledge base the relationship and the expression associated with it.

We can illustrate how the MCRDR sub-system works through the following example. Initially, both the MCRDR sub-system and the knowledge base, which contains verbs and the type of relationship associated with them, are assumed to be empty. Then, let us assume that the system gets the sentence "Ultrasonography is a method that sends high frequency sound waves into the breast", which has previously been analysed by the tagger to obtain the grammar category of each word in that sentence.

Ultrasonography[Noun] is[Verb auxiliary] a[Preposition] method[Noun] that[Pronoun] sends[Verb lexical] high[Adjective] frequency[Noun] sound[Noun] waves[Noun] into[Preposition] the[Determiner] breast[Noun].

As both the knowledge base and the MCRDR sub-system are still empty, the system does not infer any knowledge. So, the expert identifies that the expression 'is a' is associated with an 'IS-A' relationship between the concepts 'Ultrasonography' and 'method'.

The system will then store the expression-relationship association in the knowledge base and introduce the new case into the MCRDR sub-system, thereby creating the following rule:

If relation = 'IS-A' and category(pos(verb) - 1) = Noun and category(pos(verb) + 1) = Noun then {Concept(word(pos(verb) - 1)), Concept(word(pos(verb) + 1)), Relation(IS-A, word(pos(verb) - 1), word(pos(verb) + 1))}

Here ‘pos(verb)’ is a function that returns the position of

the ‘verb’ in the current sentence. The function ‘category(i)’

returns the category of the word in the position i, the

function word(i) returns the word in the position i.

The system will then continue to analyse the text. Let us

now suppose that the system gets the sentence

Breast[Noun] cancer[Noun] is[Verb auxiliary] a[Preposition] disease[Noun] in[Preposition] which[Pronoun] cancer[Noun] cells[Noun] are[Verb auxiliary] found[Verb lexical] in[Preposition] the[Determiner] tissues[Noun] of[Preposition] the[Determiner] breast[Noun]

Given this situation, the system will look in the knowledge base for expressions similar to the verb 'is'. At this point, the knowledge base contains only one expression similar to 'is', namely 'is a'. Now the system has to get the acceptable expressions from the similar ones. We can see that 'is a' is an acceptable expression in the current sentence. After this process, the system will infer that the expression 'is a' represents an 'IS-A' relationship, and a rule is activated inferring that 'cancer IS-A disease'.

Let us now suppose that the expert realises that this is wrong and corrects the system, asserting that 'Breast cancer IS-A disease'. In this situation, the system will take the case generated and obtain the differences between the two cases. There is only one difference: the first concept is formed by two consecutive nouns, and not by a single noun. So the system will update the rules in the MCRDR sub-system, inserting the following new rule:

If relation = 'IS-A' and category(pos(verb) - 1) = Noun and category(pos(verb) - 2) = Noun and category(pos(verb) + 1) = Noun then {Concept(word(pos(verb) - 2) + word(pos(verb) - 1)), Concept(word(pos(verb) + 1)), Relation(IS-A, (word(pos(verb) - 2) + word(pos(verb) - 1)), word(pos(verb) + 1))}

Here, when applied to words, the operator '+' concatenates two linguistic expressions.
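The two rules of this example can be written as small Python functions to make the position arithmetic concrete. This is a sketch under assumptions rather than the tool's code: the function names are hypothetical, the verb expression 'is a' is treated as a single unit spanning the token indices `start` to `end` (so that pos(verb) - 1 is the token just before it and pos(verb) + 1 the token just after it), and the second rule is assumed to act as a refinement that overrides the first when both are satisfied.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]          # (word, grammar category)
Relation = Tuple[str, str, str]  # (relation type, first concept, second concept)


def rule_single_noun(tokens: List[Token], start: int, end: int,
                     relation: str) -> Optional[Relation]:
    """First rule: Noun, verb expression, Noun."""
    if relation != "IS-A" or start < 1 or end >= len(tokens):
        return None
    before, after = tokens[start - 1], tokens[end]
    if before[1] == "Noun" and after[1] == "Noun":
        return (relation, before[0], after[0])
    return None


def rule_compound_noun(tokens: List[Token], start: int, end: int,
                       relation: str) -> Optional[Relation]:
    """Second rule: Noun Noun, verb expression, Noun."""
    base = rule_single_noun(tokens, start, end, relation)
    if base is None or start < 2 or tokens[start - 2][1] != "Noun":
        return None
    # The '+' operator of the rule: concatenate the two words into one expression.
    left = tokens[start - 2][0] + " " + tokens[start - 1][0]
    return (relation, left, base[2])


def infer(tokens: List[Token], start: int, end: int, relation: str) -> Optional[Relation]:
    """Assumption: the more specific rule replaces the general one when it fires."""
    return (rule_compound_noun(tokens, start, end, relation)
            or rule_single_noun(tokens, start, end, relation))


# Opening tokens of the two example sentences; 'is a' is the verb expression.
ultrasonography = [
    ("Ultrasonography", "Noun"), ("is", "Verb auxiliary"),
    ("a", "Preposition"), ("method", "Noun"), ("that", "Pronoun"),
]
breast_cancer = [
    ("Breast", "Noun"), ("cancer", "Noun"), ("is", "Verb auxiliary"),
    ("a", "Preposition"), ("disease", "Noun"), ("in", "Preposition"),
]

print(infer(ultrasonography, 1, 3, "IS-A"))
print(rule_single_noun(breast_cancer, 2, 4, "IS-A"))  # the conclusion the expert rejects
print(infer(breast_cancer, 2, 4, "IS-A"))             # after the new rule is added
```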

As can be seen, with this subsystem it is only necessary to store in the knowledge base the linguistic expressions that represent relationships, although the technique can also infer other knowledge entities. It is also worth noting that, if the MCRDR sub-system contains several cases and the knowledge base contains several verbs, then the system will be capable of inferring many relationships and knowledge entities by knowing only their respective grammar categories.

4. The KAText tool

The approach described in Sections 2 and 3 has been implemented in a software tool that is capable of acquiring medical knowledge from texts. The system receives text files as input, and its output is the list of knowledge entities contained in each text file. Each text is divided into fragments, and the system processes one fragment per iteration. The system is also capable of dealing with multiple knowledge domains and multiple domain subtasks, since it contains mechanisms for managing such features and for keeping the knowledge extracted for one domain/task independent of the knowledge obtained for a different domain/task. The final user of the tool is an expert in a particular domain who wants to extract the knowledge contained in texts. Each expert is associated with one or more domains. A concrete working session is identified by the tuple (expert, domain, task, text file); that is, a knowledge acquisition session is performed by an expert, in a specific domain, for a particular task, and on the text file containing the knowledge to extract.
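The session tuple and the separation of knowledge by domain and task can be illustrated with a small Python sketch. It is not the tool's actual data model: the class names `Session` and `KnowledgeStore`, the in-memory dictionary, and the example values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Set, Tuple


@dataclass(frozen=True)
class Session:
    """A working session identified by (expert, domain, task, text file)."""
    expert: str
    domain: str
    task: str
    text_file: str


class KnowledgeStore:
    """Keeps extracted entities separate for each (domain, task) pair."""

    def __init__(self) -> None:
        self._by_scope: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, session: Session, entity: str) -> None:
        self._by_scope[(session.domain, session.task)].add(entity)

    def entities(self, domain: str, task: str) -> Set[str]:
        # Return a copy so callers cannot modify the stored set.
        return set(self._by_scope.get((domain, task), set()))


store = KnowledgeStore()
s1 = Session("expert_a", "oncology", "prevention", "breast_cancer.txt")
s2 = Session("expert_a", "cardiology", "prevention", "heart.txt")
store.add(s1, "breast cancer")
store.add(s2, "hypertension")
```

Keying the store on the (domain, task) pair means that knowledge recorded in one session never appears when a different domain or task is queried.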

4.1. Operation modes

Two working modes have been provided in this tool,

namely, maintenance mode and query mode. Both modes are briefly described below.

Maintenance mode: In this mode, users can add new experts, new domains and new tasks, make knowledge-task associations, save sessions in the database, work with incomplete previous sessions, etc. The user makes knowledge explicit with the help of the tool, because the tool makes knowledge suggestions to the user. Suggestions are based on the application of the natural language recognition techniques described earlier in this paper.

Query mode: In this mode, users can neither perform management activities, nor access saved sessions, nor save new ones. The user cannot input new knowledge since ontologies are automatically built.

R. Valencia-Garcıa et al. / Expert Systems with Applications 26 (2004) 291–299296


4.2. Operational description

When an expert session is active, the interaction between expert and system takes place through two windows, namely, Relation and Inference (see Fig. 2). The Relation window contains the previous, current, and next sentence, as well as the expression, if any, which is considered the verbal form for the sentence. The system is capable of inferring concepts, attributes, values and relations, and such knowledge is shown to the user in the Inference window.

5. Performance of the system

A case study of the system described in the previous sections has been carried out in an oncology domain, namely breast cancer. Breast cancer is the most frequent cancer in women. Early detection is key to cure and survival, so it is very important to know the detection and prevention methods. The ontology of this domain contains information about how a woman can examine her breasts and about prevention. The following experiment has been carried out.

Four PhD students were asked to use the system with a corpus of 4648 words. The aim of this experiment was to analyze whether the tool was capable of learning and suggesting the correct knowledge to the users. For this purpose, the corpus was divided into six fragments (each having an identical number of sentences), for a total of 310 sentences. The following tables show the accuracy of the knowledge suggestions proposed by the system to the users. The accuracy score is the number of knowledge entities suggested by the system and accepted by the expert user, divided by the total number of knowledge entities existing in the text, expressed as a percentage. To complete the description of the experimental conditions, it should be noted that the knowledge base was initially empty: each student started with an empty knowledge base, whose content grew as the knowledge acquisition process evolved.
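The accuracy score defined above can be written as a short Python function. The sketch also shows one possible way of accumulating the score over several fragments, by pooling the counts; this pooling is an assumption, since the text does not state how the accumulated values are obtained. The function names and the example counts are made up for illustration and are not experimental data.

```python
from typing import Iterable, Tuple


def accuracy(accepted: int, total: int) -> float:
    """Accepted suggestions divided by all entities in the text, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= accepted <= total:
        raise ValueError("accepted must lie between 0 and total")
    return 100.0 * accepted / total


def accumulated_accuracy(blocks: Iterable[Tuple[int, int]]) -> float:
    """Accuracy over several fragments, pooling (accepted, total) counts (assumed)."""
    blocks = list(blocks)
    return accuracy(sum(a for a, _ in blocks), sum(t for _, t in blocks))


# Illustrative counts only: (accepted suggestions, entities in the fragment).
per_block = [(3, 4), (9, 10)]
first_block_score = accuracy(*per_block[0])
pooled_score = accumulated_accuracy(per_block)
```

Under the pooling assumption, fragments containing more knowledge entities weigh more in the accumulated score than fragments containing fewer.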

Fig. 2. The system’s modus operandi.

In Table 1, the partial results for each fragment are displayed. From the results contained in that table, the learning capability of the system can be affirmed, since the accuracy of the system generally increases with the number of fragments processed. Table 2 includes the accumulated results of the learning process.

Table 1
Evolution of the system accuracy (%)

Block   Student1   Student2   Student3   Student4
1       44.69      51.07      44.44      49.15
2       67.2       72.79      68.34      67.34
3       74.63      90.32      63.06      76.69
4       83.44      83.49      64.05      79.66
5       83.41      87.03      77.89      88.95
6       82.30      88.0       82.47      92.55

Table 2
Accumulated system accuracy (%)

Accumulated   Student1   Student2   Student3   Student4
1             44.69      51.07      44.44      49.15
2             55.61      61.81      56.98      57.40
3             61.03      69.02      58.77      62.06
4             66.35      72.18      60.30      65.87
5             70.60      75.37      64.95      71.18
6             72.15      77.18      67.03      73.49

Two hypotheses (Hypothesis 1 and Hypothesis 2) were formulated at the beginning of this experiment:

Hypothesis 1 (Similarity). Given that the four students had similar expertise on the topic and worked on the same pieces of text, it was expected that they would obtain similar results in this experiment. This working hypothesis is confirmed by the results obtained in the experiment, since similar results are obtained and the accuracy curve is similar for each student. Therefore, it can be stated that the system allows people with similar knowledge to obtain similar results. This is a desirable property because it implies that experts (assuming scalability from the PhD students used in this experiment) identify the same knowledge entities despite making their knowledge explicit in different manners.

Hypothesis 2 (Usefulness). The system is useful. In order to validate this hypothesis, the final accuracy scores obtained by the system for each student have to be discussed. The system obtained its best score, 77.18%, for student 2, while the scores for the other three students ranged from 67.03% to 73.49%. The average score over the four students was therefore 72.46%. Given the initial conditions of the experiment, this result can be considered a good one: the knowledge base was empty at the beginning, so the system was unable to suggest knowledge in the first text fragments. Table 2 supports this explanation, since the system reaches an average accuracy score below 50% in the first block, whereas the score is higher for the subsequent blocks. In the later blocks the system possesses more knowledge, so it is capable of making better suggestions. For block 3 (Table 1), the accuracy of the system is, on average, greater than 75%, so we may conclude that the results are encouraging. However, experiments on a larger corpus should be performed in order to confirm the usefulness of the system suggested by the results of this experiment.
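The averages quoted in this discussion can be re-derived from the values in Tables 1 and 2. The short Python sketch below does so; the variable names are illustrative, and the figures are copied from the tables.

```python
from statistics import mean

# Per-block accuracy (%) for Students 1-4, copied from Table 1.
TABLE_1 = {
    1: [44.69, 51.07, 44.44, 49.15],
    2: [67.2, 72.79, 68.34, 67.34],
    3: [74.63, 90.32, 63.06, 76.69],
    4: [83.44, 83.49, 64.05, 79.66],
    5: [83.41, 87.03, 77.89, 88.95],
    6: [82.30, 88.0, 82.47, 92.55],
}

# Final accumulated accuracy (%) for Students 1-4, last row of Table 2.
TABLE_2_FINAL = [72.15, 77.18, 67.03, 73.49]

block_means = {block: mean(scores) for block, scores in TABLE_1.items()}
final_mean = mean(TABLE_2_FINAL)
# Student numbers start at 1, list indices at 0.
best_student = max(range(len(TABLE_2_FINAL)), key=lambda i: TABLE_2_FINAL[i]) + 1
```

The mean of the four final accumulated scores is 72.4625, which agrees with the reported 72.46%; the block 1 mean is below 50 and the block 3 mean is above 75, as stated in the text.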

6. Discussion and conclusions

The methodology presented in this work offers a new way of acquiring knowledge from texts. This approach

is based on the use of the grammar categories of words, a

small verbal knowledge base, and a small conceptual

knowledge base for performing inferences. Three basic

assumptions were initially formulated: (1) ontologies are

useful for representing knowledge; (2) experts are capable

of specifying ontologies; and (3) knowledge resides in text.

The construction of ontologies is an important issue for

the knowledge acquisition and knowledge representation

communities. One of the hottest research trends in this area

is ontology learning from Web documents, which is

considered an important activity to promote the Semantic

Web (Alani et al., 2003). Most ontology learning

approaches (Omelayenko, 2001) are mainly concerned

with generating taxonomies and the ontology construction

process is human-driven. Knowledge acquisition from texts

is a process that has already been considered part of the

ontology learning process (Maedche & Staab, 2001). In our

approach, one of the main knowledge sources, that is,

natural language texts, is used to acquire ontological

elements in a semi-automatic way. Another feature of this

approach is that it works with multiple semantic relations

(not only taxonomy). We are currently planning to use WordNet (Millar, 1990) to verify the correctness of the relations inferred by our system and to obtain new relations, as is done by Navigli, Velardi, and Gangemi (2003), who use WordNet to interpret semantic terms and to identify mainly taxonomic and similarity relations.

An experiment has been carried out to assess whether the system is useful for extracting knowledge from texts. The results are very promising, although they still do not reflect the real potential of the approach, since the experiment has been performed at a small scale. Thus, a larger validation of the system is planned by applying the system to texts from different medical domains and by using statistical methods for analysing the results obtained. Moreover, we intend to extend the system to cover axioms. The main problem foreseen concerning axioms is that the number of participants in an axiom is unknown a priori. However, the number of axioms present in a text is not significant compared to the number of other knowledge entities.

Acknowledgements

We thank the Spanish Ministry for Science and

Technology for its support for the development of the

system through projects TIC-2002-03879, FIT-150200-

2001-320, FIT-070000-2001-785, FIT-150500-2003-503,

FIT-110100-2003-73, FIT-150500-2003-499 and Seneca

Foundation through projects PL/3/FS/00 and PI-

16/0085/FS/01.

References

Alani, H., Kim, S., Millard, D. E., Weal, M. J., Hall, W., Lewis, P. H., &

Shadbolt, N. R. (2003). IEEE Intelligent Systems, January/February,

14–21.

Buchanan, B. (1986). Expert systems: working systems and the research

literature. Expert Systems, 3, 32–51.

Compton, P., Horn, R., Quinlan, R., & Lazarus, L. (1989). Maintaining an

expert system. In J. R. Quinlan (Ed.), Applications of expert systems

(pp. 366–385). London: Addison-Wesley.

Fernández-Breis, J. T., Castellanos-Nieves, D., Valencia-García, R., Vivancos-Vicente, P. J., Martínez-Béjar, R., & De las Heras-González, M. (2001). Towards Scott domains-based topological ontology models. An application to a cancer domain. In Proceedings of International Conference on Formal Ontology in Information Systems, Maine, USA, pp. 127–138.

Gomez-Perez, A., & Benjamins, V. R. (1999). Overview of knowledge sharing and reuse components: ontologies and problem-solving methods. In Proceedings of the IJCAI-99 workshop on Ontologies and Problem-Solving Methods (KRR5), Stockholm, Sweden.

Gomez-Perez, A., Moreno, A., Pazos, J., & Sierra-Alonso, A. (2000).

Knowledge maps: an essential technique for conceptualization. Data

and Knowledge Engineering, 33, 169–190.

Horn, K., Compton, P. J., Lazarus, L., & Quinlan, J. R. (1985). An expert

system for the interpretation of thyroid assays in a clinical laboratory.

Australian Computer Journal, 17, 7–11.

Kang, B. (1996). Multiple classification ripple down rules. PhD Thesis, University of New South Wales.

Maedche, A., & Staab, S. (2001). Ontology learning for the semantic web.

IEEE Intelligent Systems, 16(2), 72–79.

Martínez-Béjar, R., Ibanez-Cruz, F., Compton, P., & Mihn Cao, T. (2001). An easy-maintenance, reusable approach for building knowledge-based systems. Expert Systems with Applications, 20(2), 153–162.

Millar, A. (1990). WordNet: an on-line lexical resource. Journal of

Lexicography, 3(4).

Musen, M. A. (1998). Domain ontologies in software engineering: use of

Protege with the EON architecture. Methods of Information in

Medicine, 37, 540–550.

Navigli, R., Velardi, P., & Gangemi, A. (2003). Ontology learning and its

application to automated terminology translation. IEEE Intelligent

Systems, January/February, 22–31.

Omelayenko, B. (2001). Learning of ontologies for the web: the analysis of

existent approaches. Proceedings of the International Workshop on

Web Dynamics, London, UK.

Quinlan, J. R. (1993). C4.5: programs for Machine Learning. San Mateo:

Morgan Kaufmann.

Ruiz-Sánchez, J. M., Valencia-García, R., Fernández-Breis, J. T., Martínez-Béjar, R., & Compton, P. (2003). An approach for incremental knowledge acquisition from text. Expert Systems with Applications, 25(2), 77–86.

Van Heijst, G., Schreiber, A. T., & Wielinga, B. J. (1997). Using explicit

ontologies in KBS development. International Journal of Human–

Computer Studies, 45, 183–292.
