
A KNOWLEDGE BASED INFORMATION STORAGE AND

RETRIEVAL SYSTEM FOR NATURAL LANGUAGES

A Thesis

Presented to

The Faculty of Graduate Studies

of

The University of Guelph

In partial fulfilment of requirements

for the degree of

Master of Science

September, 1999

© Deyan Xu, 1999


National Library of Canada

Acquisitions and Bibliographic Services

395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


ABSTRACT

A KNOWLEDGE BASED INFORMATION STORAGE AND RETRIEVAL

SYSTEM FOR NATURAL LANGUAGES

Deyan Xu

University of Guelph, 1999

Advisor:

Professor J. G. Linders

Natural languages are expressive languages used by humans for communication. However, from an information processing viewpoint, natural languages are basically unstructured, which means they are not readily suitable for machine processing. Conceptual graphs, first introduced by John Sowa in 1984, provide a rich knowledge representation schema intended to structure and encode natural languages.

This thesis is concerned with the building of a knowledge based management system that is able to effectively store and retrieve knowledge from natural languages. It discusses three points that are essential for natural language processing:

1. a conceptual language to structure and encode natural language
2. a repository to store the information that is represented in the conceptual language
3. a matcher that determines whether one statement in the conceptual language is the same as, or an instance of, a second statement.

Other aspects such as indexing and querying techniques are also presented.


DEDICATED TO MY PARENTS AND MY WIFE

Who have made this possible


Acknowledgements

I would like to thank my supervisor, Dr. Linders, for his guidance and continued support throughout my graduate program. Throughout the planning and implementation of the thesis, Dr. Linders gave me tremendously valuable suggestions and help. Gratitude is also extended to Dr. Wilson and Dr. Wang, members of my thesis exam committee, for their constructive criticism of the thesis. I also wish to thank my fellow graduate student Finnegan Southey for his great support and help.

My sincere appreciation is also extended to my wife and my parents, whose support and confidence have carried me through the hard days and helped me to complete this work successfully.

I also wish to thank my sisters and my parents-in-law for their great support throughout my university career.

Finally, I would like to express my gratitude to the University of Guelph and to all those mentioned above.


Table of Contents

Acknowledgements

Table of Contents

List of Figures

Chapter 1 Introduction

1.1 The Motivation

1.2 Analysis of the Problems

1.3 Essential Points in Building a Knowledge Based System

1.4 Overview of the Thesis

Chapter 2 Review of Current Approaches to Information Processing

2.1 Inverted File Methods

2.2 Adaptive Methods

2.3 Similarity Measures, Clustering

2.4 Issues in Representation of Knowledge

2.5 Syntactic Methods

2.6 Semantic Methods

2.7 Frame-Based Methods

2.8 Graph-Based Methods


Chapter 3 Conceptual Graph as Knowledge Encoding Schema

3.1 Percepts

3.2 Concepts and Conceptual Relations

3.3 Referent

3.4 Conceptual Graph

3.5 Canonical Graphs and Canonical Formation Rules

3.6 Contexts

3.7 The CGIF Representation of Conceptual Graphs

3.8 Why Use Conceptual Graphs

3.9 The Notio Java Package for Modelling Conceptual Graphs

Chapter 4 An Object-Oriented Knowledge base for Conceptual Graphs

4.1 Why Do We Need a Knowledge base for Conceptual Graphs?

4.2 Relational Database for Conceptual Graphs

4.3 Problems with Relational Database for Conceptual Graphs

4.4 Object-Oriented Technology: A Better Solution

4.4.1 Basic Features of Object-Oriented Modelling (OOM)

4.4.2 Abstract Data Typing Maps Concept Type

4.4.3 Object Identity Maps Individual Marker

4.4.4 Inheritance Maps Canonical Formation Rule

4.4.5 Encapsulation Maps Context

4.4.6 Representing Object and Object Class



Chapter 5 Design and Implementation of the System

5.1 Overview of the System

5.2 Design of Conceptual Graph Object

5.3 Design of Concept Object

5.4 Design of Relation Object

5.5 Indexing

5.6 Query of Conceptual Graphs

5.7 Graph Matching Mechanism of the System

5.8 Design of the System

Chapter 6 An Example

6.1 The Example

6.2 Building of Conceptual Graphs from the Example

6.3 Conceptual Graphs in CGIF Format

6.4 Searching of Conceptual Graphs

6.4.1 Loose Match

6.4.2 Exact Match

6.5 Conclusion


Chapter 7 Summary

7.1 Summary

7.2 Conclusion

7.3 Future Work

Appendix

A. Object Diagrams of the System

B. Detailed Example Test Script

C. References


List of Figures

Figure 2.1 An Example of Inverted Files

Figure 2.2 AND/OR tree for text classification

Figure 3.1 A conceptual graph

Figure 3.2 Two canonical graphs

Figure 3.3 Join of the two canonical graphs

Figure 3.4 The simplification of figure 3.3

Figure 3.5 A conceptual graph containing a context

Figure 4.1 An example of conceptual graph for storage in a RDB

Figure 4.2 the conceptual graph for a meeting

Figure 4.3 expanded view of the meeting context

Figure 4.4 class definition for Car

Figure 4.5 an instance of car CA 163 1998

Figure 5.1 the Selection Manual

Figure 5.2 the conceptual graph object in memory

Figure 5.3 the concept object in memory

Figure 5.4 the relation graph object in memory

Figure 5.5 a 'BTreeNode' object in disk

Figure 5.6 the system class diagram

Figure 6.1 to Figure 6.7


CHAPTER 1

Introduction

1.1 The Motivation

Today we live in the information era, with so many news reports, conference proceedings and research articles to be read, not to mention all of the information available on the WWW. No matter how many articles we read and how many web sites we visit, we still lack information that we need, while much time is wasted on reading unwanted material. Without an effective information or knowledge storage and retrieval facility, we will be deluged by the information flood. This problem exists today and will get worse if no new effective information or knowledge retrieval systems are developed. Thus, an effective knowledge retrieval system that is able to retrieve knowledge from natural language, like English text, is urgently needed.

1.2 Analysis of the Problems

To cope with this problem, many kinds of information management systems and tools have been developed to store and retrieve information. However, most of them are only able to store entities and relations between these entities, or objects and associations between these objects. The retrieval methods they use are basically keyword matching.


Keyword searching and frequency distributions do not capture the meaning behind the words. Thus, they are inherently limited in distinguishing relevant from irrelevant information. All languages have many complicated, unsystematic features that confound and confuse simple, word-based information retrieval systems. The same object may be described in more than one way, and one word may have different meanings in different phrases. These features limit the ability of traditional information retrieval systems to retrieve information. Actually, very few of them are able to store and retrieve information based on the meaning of the natural languages that express knowledge. Natural languages are expressive languages that all people can understand. They are used daily to communicate and to acquire information, and are also the basis of reasoning. Natural languages are more understandable by humans than any other kind of expression. However, natural languages are basically unstructured, which means that they are not readily suitable for machine processing and hence can be computationally intractable.

These observations lead to the objective of this thesis, which is concerned with the building of a knowledge based management system that is able to effectively store and retrieve the representation of knowledge using conceptual graphs. However, there are some problems in building such a system. Since the information in news, articles and the WWW has limited structure, these basically unstructured text files have no fixed keys for searching. Indexing of unstructured data also presents a major problem: it requires someone to read the document and provide keys manually. Most importantly, it is necessary to know how to let a computer system know the "meaning" of the text. If a computer system does not know the "meaning" of the text, the system is basically just another traditional information retrieval system. It will inherit all the drawbacks of a typical information retrieval system.

1.3 Essential Points in Building a Knowledge Based System

In order to break through the keyword barrier, a modelling technique is needed that is able to convert natural language into a knowledge representation and to structure the language. In this way a computer based system can be used to understand the meaning of the texts it processes. Secondly, in order to retrieve the information in a knowledge based system, a method must be developed that is able to determine whether one statement in the knowledge representation is the same as, or an instance of, a second statement in the knowledge representation.

The modelling technique presented in this thesis is conceptual graphs, introduced by John Sowa in 1984. Conceptual graphs form a knowledge representation language based on linguistics and on the semantic networks used in artificial intelligence. A conceptual graph is a finite, connected, bipartite graph with two kinds of nodes: "concepts" and "conceptual relations". The purpose of the conceptual graph is to form a bridge between natural language and a knowledge expression format that is readable by a computer. With the conceptual graph, the literal meaning of a natural language sentence can be mapped to a diagram that is computationally tractable.
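As a rough illustration of the bipartite structure described above, the following minimal Python sketch stores concept nodes and conceptual relation nodes separately and checks that every arc joins a relation to a concept. The class names, the `is_bipartite` helper and the sample sentence are assumptions made for this illustration only; they are not the representation used by the system in this thesis, and the sketch does not check connectedness.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    """A concept node: a type label plus an optional referent."""
    type: str
    referent: Optional[str] = None  # None stands for a generic concept


@dataclass(frozen=True)
class Relation:
    """A conceptual relation node linking an ordered tuple of concepts."""
    type: str
    args: Tuple[Concept, ...]


@dataclass
class ConceptualGraph:
    """Hypothetical container holding the two kinds of nodes."""
    concepts: List[Concept] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def is_bipartite(self) -> bool:
        """Every arc must join a relation node to a concept node of this graph."""
        return all(
            isinstance(arg, Concept) and arg in self.concepts
            for rel in self.relations
            for arg in rel.args
        )


# Illustrative sentence (not taken from the thesis): "A cat sits on a mat."
cat, sit, mat = Concept("Cat"), Concept("Sit"), Concept("Mat")
graph = ConceptualGraph(
    concepts=[cat, sit, mat],
    relations=[Relation("Agnt", (sit, cat)), Relation("Loc", (sit, mat))],
)
print(graph.is_bipartite())
```

Because relations only ever point at concepts, two concepts are never linked directly, which is what makes the graph bipartite.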


The retrieval method presented here is conceptual graph matching. In order for this matching process to be feasible, the underlying knowledge representation must be canonical. This means that all sentences with the same basic meaning must be parsed into the same knowledge representation. In this way, the matcher can determine whether two conceptual statements represent the same knowledge.

1.4 Overview of the Thesis

The goal of the thesis is to build a knowledge based management system that is able to store and retrieve knowledge. The system uses conceptual graphs to encode "knowledge" as representations of texts. The system also demonstrates a prototype search engine that is able to perform very precise searches of the knowledge stored in the system.

Chapter 2 is a review of current approaches to information processing. It first introduces several successful traditional information retrieval techniques, including inverted file methods, adaptive methods, similarity measures and clustering. It then presents several knowledge representations, such as syntactic methods, semantic methods, frame-based methods and graph-based methods.

Chapter 3 briefly discusses conceptual graphs as a knowledge encoding schema. It presents some basic features of conceptual graphs and the advantages of using conceptual graphs to model natural languages.


Chapter 4 first presents a relational model that handles conceptual graphs, then discusses some key points of the object-oriented modelling technique. Issues in applying an object-oriented technique to the management of conceptual graphs are thoroughly discussed. The advantages of the object-oriented modelling approach to conceptual graphs over the relational modelling approach are also presented.

Chapter 5 demonstrates the design and implementation of the conceptual graph knowledge base system. Some key issues, such as the representation of conceptual graph objects both in memory and on disk, indexing methods, and querying of the knowledge base, are also discussed.

Chapter 6 shows an example and illustrates how the conceptual graph knowledge base system works. Factors that affect the search result are also discussed. A conclusion based on the example is given at the end of the chapter.

Chapter 7 is a summary. It contains conclusions and a discussion of future research directions.


CHAPTER 2

Review of Current Approaches to Information Processing

A few decades ago, before the invention of the computer, the method used to locate individual texts was to read large collections of newspapers, books, reports and articles, and to make notes in an attempt to remember the contents for later retrieval. Obviously, when the size of the collections exceeds the capacity of manual methods and the limits of human memory, this method will not work. With the invention of the computer, the storage and retrieval of such information was achieved with the help of various information retrieval systems. In this way, it was possible to extend the limits of information storage and retrieval by orders of magnitude.

The underlying method that these information retrieval systems use is to search through the entire collection of information to find words and phrases that identify a text containing the information being sought. The crucial problem of this traditional information retrieval technology is that the system relies solely on the presence or absence of a word. Often the searchers do not know the "meaning" of the words and phrases they are searching for. This limits their ability to distinguish relevant from irrelevant texts. Some research efforts have attempted to improve retrieval performance by indexing on phrases rather than on words, by adding synonym information, and by using word frequencies. However, the gains from these refinements have been limited. The limits of word-based retrieval systems have previously been explored in [Lesk 85] and [Metzler et al. 84]¹.

This problem exists because there is no perfect correlation between matching words and matching meaning. Hence, a possible way to improve retrieval performance is to give up on matching words and to match concepts instead. Furthermore, in traditional information retrieval systems, the connections between information objects are mostly boolean operators such as AND, OR and NOT. As we know, the relations between information objects are much richer than these boolean operators. In knowledge base systems such as those based on conceptual graphs, the information objects are modelled as concepts and the connections between these concepts are modelled as conceptual relations. Thus, a knowledge base system should be able to effectively model the logical expressions of natural languages. In order to effectively store and retrieve natural languages, three things are required:

1. a conceptual language to structure and encode natural languages
2. a knowledge base to store the information that is represented in the conceptual language
3. a matcher that determines whether one statement in the conceptual language is the same as, or an instance of, a second statement in the conceptual language

¹ M. E. Lesk, "SIGIR 85," ACM SIGIR Forum, Vol. 18, No. 2-4, Fall 1985, pp. 10-15. D. P. Metzler, T. Noreault, L. Richey and B. Heidorn, "Dependency Parsing for Information Retrieval," Research and Development in Information Retrieval, C. J. van Rijsbergen, ed., Cambridge University Press, July 1984, pp. 313-324.


This chapter is divided into two parts. In the first part, we briefly review the traditional methods of information retrieval systems. In the second part, we present some knowledge representations of natural languages. With the help of these knowledge encoding schemas, we are able to process natural language texts intelligently.

2.1 Inverted File Methods

The most successful and relevant traditional information retrieval systems are based on inverted files. The basic idea is to trade storage space for retrieval time. The database is viewed as a collection of files. An alphabetized list of words is created. For each occurrence of a word in a file, an entry is created on the list for that word with a pointer back to the file. In some systems, the pointer indicates the position in the file where the word occurs, while in other systems the pointer merely indicates that the word appears in the file one or more times. Common words such as "the" and "of" are excluded from the indexing. Figure 2.1 is an example of an inverted file.

At retrieval time, the words in the query are looked up in the inverted file. Then the lists of documents containing the words are intersected to produce the list of texts matching the query. Since the list lookup can be done in constant time using hash tables, or logarithmic time using sorted lists or trees, the time required to process a query depends mainly on the number of documents containing each search term.


File 1: The quick brown fox jumped over the lazy dog.

File 2: My brown Volkswagen Fox is no dog when it comes to performance.

File 3: Lazy evaluation can improve system performance, and reduce the number of jumps in your code.

Inverted Index    (File #, Position)
brown             (1, 3) (2, 2)
code              (3, 15)
dog               (1, 9) (2, 7)
evaluation        (3, 2)
fox               (1, 4) (2, 4)
improve           (3, 4)
jumped            (1, 5)
jumps             (3, 12)
lazy              (1, 8) (3, 1)
performance       (2, 12) (3, 6)
quick             (1, 2)
reduce            (3, 8)
volkswagen        (2, 3)

Figure 2.1 An Example of Inverted Files
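To make the construction concrete, the following minimal Python sketch builds a positional inverted index over the three sample files and answers a conjunctive query by intersecting the per-word file sets. It is an illustration under stated assumptions, not the implementation of any particular system: the function names are hypothetical, and the stop list contains only the two words named in the text, so the resulting index holds more entries than the excerpt shown in Figure 2.1.

```python
import re
from collections import defaultdict

# The three sample files from Figure 2.1.
FILES = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "My brown Volkswagen Fox is no dog when it comes to performance.",
    3: "Lazy evaluation can improve system performance, "
       "and reduce the number of jumps in your code.",
}

# Assumed stop list: the text names only "the" and "of".
STOP_WORDS = {"the", "of"}


def build_inverted_index(files):
    """Map each word to a list of (file number, word position) pairs."""
    index = defaultdict(list)
    for file_no, text in files.items():
        words = re.findall(r"[a-z]+", text.lower())
        # Positions count every word, including the excluded common words.
        for position, word in enumerate(words, start=1):
            if word not in STOP_WORDS:
                index[word].append((file_no, position))
    return index


def files_containing(index, word):
    """Return the set of file numbers in which the word occurs."""
    return {file_no for file_no, _ in index.get(word.lower(), [])}


def lookup_all(index, query_words):
    """Intersect the per-word file sets to answer a conjunctive query."""
    sets = [files_containing(index, word) for word in query_words]
    return set.intersection(*sets) if sets else set()


index = build_inverted_index(FILES)
print(index["brown"])
print(sorted(lookup_all(index, ["brown", "dog"])))
```

A Python dictionary is a hash table, so each word lookup takes constant expected time, and the cost of a query is dominated by the size of the posting lists being intersected.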

There are some variations of this method. One variation is to add the notion of proximity, usually implemented as an adjacency operator or a within operator. For systems that store the position of each occurrence of the index terms, the adjacency operator can be implemented by checking that the position of the second term is exactly one more than the position of the first term. For example, if we have the query:

[ brown ADJ fox ]

File 1 in Figure 2.1 will pass the adjacency test, but File 2 fails since the words are at positions 2 and 4. The same idea can be used to implement a WITHIN operator that is true only if the first and second terms are within some specified number of words of each other.
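The two proximity tests can be sketched in a few lines of Python over positional postings. The postings below are copied from Figure 2.1; the function names and the choice to treat WITHIN as order-independent are assumptions made for this sketch.

```python
# Positional postings for two terms, copied from Figure 2.1:
# each entry is (file number, word position).
POSTINGS = {
    "brown": [(1, 3), (2, 2)],
    "fox": [(1, 4), (2, 4)],
}


def adjacent(postings, first, second):
    """Files where `second` occurs exactly one position after `first`."""
    hits = set()
    for file_a, pos_a in postings.get(first, []):
        for file_b, pos_b in postings.get(second, []):
            if file_a == file_b and pos_b == pos_a + 1:
                hits.add(file_a)
    return hits


def within(postings, first, second, distance):
    """Files where the two terms occur at most `distance` words apart.

    Assumption: order does not matter for WITHIN in this sketch.
    """
    hits = set()
    for file_a, pos_a in postings.get(first, []):
        for file_b, pos_b in postings.get(second, []):
            if file_a == file_b and 0 < abs(pos_b - pos_a) <= distance:
                hits.add(file_a)
    return hits


print(sorted(adjacent(POSTINGS, "brown", "fox")))   # [1]
print(sorted(within(POSTINGS, "brown", "fox", 2)))  # [1, 2]
```

File 2 fails the adjacency test because "brown" and "fox" are two positions apart, but it passes a WITHIN test with a distance of two.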

The other variation is to add "boolean keyword queries". That means that the sets of texts matching each term in the query are combined using the set operations of intersection, union and complementation to produce a final set of retrieved documents. For example, if we perform the query:

[ (jump$ OR perform$) AND (NOT code) ]

on the three files in Figure 2.1, Files 1 and 2 will be retrieved, but not File 3. The "$" operator indicates that any word with the given prefix should be matched.
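The boolean query above maps directly onto set operations. The following Python sketch evaluates it against the three sample files; for brevity it scans a per-file word set instead of a full inverted index, and the `matching` helper is a hypothetical name introduced for this illustration.

```python
import re

# The three sample files from Figure 2.1.
FILES = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "My brown Volkswagen Fox is no dog when it comes to performance.",
    3: "Lazy evaluation can improve system performance, "
       "and reduce the number of jumps in your code.",
}

# Per-file word sets stand in for the inverted index in this sketch.
WORDS = {n: set(re.findall(r"[a-z]+", text.lower())) for n, text in FILES.items()}
ALL_FILES = set(FILES)


def matching(term):
    """Files matching a term; a trailing "$" requests a prefix match."""
    if term.endswith("$"):
        prefix = term[:-1]
        return {n for n, words in WORDS.items()
                if any(word.startswith(prefix) for word in words)}
    return {n for n, words in WORDS.items() if term in words}


# [ (jump$ OR perform$) AND (NOT code) ]
# OR is set union, AND is intersection, NOT is complement.
result = (matching("jump$") | matching("perform$")) & (ALL_FILES - matching("code"))
print(sorted(result))  # [1, 2]
```

File 3 matches both prefix terms but is removed by the complement of the set for "code".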

The main advantages of this method are fast retrieval and easy implementation. Thus, inverted file methods are used in many commercial systems, such as Stairs, Dialog, Lexis and Status.

For a detailed discussion of this technique, please see Salton's Modern Information Retrieval [Salton & McGill 83].

2.2 Adaptive Methods

Another word-based technique is called "adaptive methods". The main idea of this method is that, after an initial query, the user selects the relevant articles from the retrieved set and the similarity measures are recalculated. The revised measures are used to query the database again, resulting in a new set of documents, and the process continues until the search converges or the user is satisfied with the documents retrieved up to that point.

The query reformulation process is thus based on the following two operations:

1. Terms that occur in documents previously identified as relevant by the user are added to the original query vectors, or alternatively the weight of such terms is increased by an appropriate factor in constructing the new query statements.
2. At the same time, terms occurring in documents previously identified as irrelevant by the user are deleted from the original query statements, or the weight of such terms is appropriately reduced.

The effect of such a query alteration process is to move the query in the direction of the relevant items and away from the irrelevant ones. Thus the user is able to retrieve more wanted and fewer unwanted items in later searches.
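The two reformulation operations can be sketched as a single update of a weighted query vector. The Python below is a simplified illustration only: the `boost` and `penalty` values, the function name and the sample documents are assumptions, since the text says only that weights are changed "by an appropriate factor".

```python
def reformulate(query, relevant_docs, irrelevant_docs, boost=0.5, penalty=0.5):
    """Return a new query vector (term -> weight) after one feedback round.

    Terms from relevant documents are added or gain weight; terms from
    irrelevant documents lose weight and are deleted once their weight
    is no longer positive. The boost and penalty values are hypothetical.
    """
    new_query = dict(query)
    for doc in relevant_docs:
        for term in set(doc):
            new_query[term] = new_query.get(term, 0.0) + boost
    for doc in irrelevant_docs:
        for term in set(doc):
            if term in new_query:
                new_query[term] -= penalty
    return {term: weight for term, weight in new_query.items() if weight > 0}


# Hypothetical example: the user wants the animal, not the car.
query = {"jaguar": 1.0, "speed": 0.5}
relevant = [["jaguar", "cat", "jungle"]]
irrelevant = [["jaguar", "car", "speed"]]
print(reformulate(query, relevant, irrelevant))
```

After one round the query gains "cat" and "jungle" and drops "speed", so it moves toward the relevant document and away from the irrelevant one.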

2.3 Similarity Measures, Clustering

This method uses a similarity measure based on word frequency to determine whether a document is similar to other documents known to be relevant to the user's query. To find the degree of similarity between two documents, the method is to:

1. determine the frequencies of each index term in the document collection;
2. for any two documents, view the frequency lists for those documents as vectors in multidimensional space, and calculate the cosine of the angle between the two vectors.
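The cosine measure in step 2 can be sketched as follows in Python. The sketch uses raw per-document term counts as the frequency vectors and a naive tokenizer; both are simplifying assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each word occurs in a text (naive tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[term] * freq_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = term_frequencies("lazy dog")
doc_b = term_frequencies("lazy fox")
print(round(cosine_similarity(doc_a, doc_b), 2))  # 0.5
```

Documents with identical word distributions score 1, and documents with no words in common score 0.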

The problem with this approach is that a typical user query will not have enough words to give a statistically meaningful frequency vector. Thus, this method only works for measuring the differences between two documents. The SMART information retrieval system from Cornell [Salton & McGill 83] uses this method. It incorporates some suffix removal rules to calculate frequencies based on the stem rather than the whole word. Word frequency systems have been proposed as alternatives to boolean keyword search, and in some cases have demonstrated improved recall and precision.

2.4 Issues in Representation of Knowledge

The existence of the NOT operator in boolean keyword queries is a clue to problems with keyword based queries: it is a partially successful attempt to deal with output overload. For example, to find texts about fighting and war, but ignore texts about crime, one might try the following query:

[ war AND (NOT drugs) ]

to avoid seeing stories like:

Our government vows to fight a war against drugs.


But then, the following would be missed:

Many people died during the second world war because of the lack of drugs.

The complexity of natural language, including ambiguity, synonymy and metaphor, combines to reduce the effectiveness of today's keyword-based retrieval systems. Thus, in order to effectively retrieve information in natural languages, a knowledge representation method must be used. As Goldstein and Papert² said in the article "Artificial Intelligence, Language, and the Study of Knowledge":

The fundamental difficulties facing researchers in the field today are not limitations due to hardware, but rather questions about how to represent large amounts of knowledge in ways that still allow the effective use of individual facts.

No consensus has yet been reached on the best method for representing knowledge, and research is continuing in order to develop more efficient ways to store information to conserve memory and processing. In the following sections, we will review some knowledge representation methods that are used to extract knowledge from natural language.

² Ira Goldstein and Seymour Papert, "Artificial Intelligence, Language, and the Study of Knowledge," Cognitive Science, Vol. 1, No. 1 (1977).


2.5 Syntactic Methods

Syntax analysis focuses on the relationships between linguistic expressions and is concerned with the rules (grammars) for the interaction between various natural language units such as words and phrases. Early artificial intelligence efforts to produce "question answering" systems used some knowledge of English syntax and semantics to retrieve information from databases in response to queries in natural language.

Raphael's SIR program for "Semantic Information Retrieval" [Raphael 68] used an internal model based on words and word associations linked in a "general manner so that no particular relations are more significant than others." The relations used were:

Set-inclusion

Part-whole relationship

Numeric quantity associated with the part-whole relation

Set membership

Left-to-right spatial relations

Ownership

One advantage of using a syntactic method is that we can preserve the syntactic relations between words. For example, if we wish to retrieve documents about computer science with the previously mentioned inverted file method, we might try this query:

[ comput$ AND science ]

which would match at least the following phrases:

the computer science department

the discipline of computing science

the science of computing

but would also match across phrases, such as:

the use of computers in materials science

Just using the query:

[ comput$ ADJ science ]

will exclude materials science, but miss science of computing. What we really want to say is that the word "compute" must modify the word "science", so that we restrict the sciences the query should match.

Fagan³ has investigated the use of syntactic information to identify phrases for indexing. He compared statistically derived indexing phrases with phrases derived using the PEG English grammar and the PLNLP programming language [Jensen 86]. He concluded that although phrases selected by using frequency and co-occurrence methods did not consistently improve retrieval performance, syntax-based selection methods can generate more useful phrases that do improve retrieval performance, improving recall and precision by a small amount.

³ J. L. Fagan, "Automatic Phrase Indexing for Document Retrieval: An Examination of Syntactic and Non-Syntactic Methods," Proceedings of the Tenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 1987, pp. 91-101.

J. L. Fagan, "Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods," PhD dissertation, Cornell, September 1987.

2.6 Semantic Methods

The idea of semantic rules is to include "semantic markers" in the definitions of each sense of the words in the dictionary. The semantic markers attached to a word would be used to restrict the ways in which it could combine with other words. The "RUBRIC" retrieval system is a "semantics only" method for deciding whether a document is relevant to a given query.⁴ RUBRIC uses document/query pairs. The approach is to use rules that provide evidence for relevance or irrelevance. Such a system can deal with constructions that confuse syntax-only systems.

For example, consider a query about terrorists where one does not want information about war to be retrieved. Figure 2.2 shows an AND/OR tree for a rule one might find in a semantics-only classification system for the concept of "terrorist". The branches of the tree are labeled with certainty factors between 0 and 1 that indicate how strongly the sub-trees are related to the root concept. These certainty factors are assigned by the system designer. Using the figure shown, if a text contains the word "terrorist" it gets a score of 0.8 for being about terrorists. For the word "hijack" it gets 0.6. If it contained both words, the score would be 0.8 + 0.6 - 0.8 x 0.6 = 0.92.
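
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration of the combination rule a + b - a x b as used in the example, not the RUBRIC implementation; the names combine, score and EVIDENCE are hypothetical, and only the two weights quoted in the text are used.

```python
# Certainty factors taken from the example in the text.
EVIDENCE = {"terrorist": 0.8, "hijack": 0.6}


def combine(a, b):
    """Combine two certainty factors in [0, 1] as a + b - a*b."""
    return a + b - a * b


def score(text, evidence):
    """Accumulate the certainty that `text` is about the root concept."""
    words = set(text.lower().split())
    total = 0.0
    for word, factor in evidence.items():
        if word in words:
            total = combine(total, factor)
    return total


print(round(score("the terrorist tried to hijack a plane", EVIDENCE), 2))
```

Because the rule is symmetric and never exceeds 1, each additional piece of evidence raises the score without the order of the words mattering.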

⁴ R. M. Tong, L. A. Appelbaum, V. N. Askman and J. F. Cunningham, "RUBRIC III: An Object-Oriented Expert System for Information Retrieval," Second Annual Conference on Expert Systems in Government, McLean, VA, October 1986.

Semantic markers have proved to be very useful in natural language processing. They have been widely applied and, in many cases, they select appropriate word senses successfully.

Figure 2.2 AND/OR tree for text classification

2.7 Frame-Based Methods

The notion of frame [Minsky 75]⁵ is a method for understanding vision, natural language and other areas of AI. Frames provide a convenient structure for representing objects that are typical to a given situation, such as stereotypes. The basic characteristic of a frame is that it represents related knowledge about a narrow subject, which has much default knowledge. In frame theory, the knowledge base is decomposed into pieces of knowledge, which are the data structures that represent stereotypical situations.

⁵ M. Minsky, "A Framework for Representing Knowledge," in The Psychology of Computer Vision, P. Winston, ed., McGraw-Hill, New York, 1975.

The basic components of a frame-based representation facility include:

1. Structure. The frame can capture the basic organizational principles. It contains the hierarchies of objects (components) and the attributes of objects, which can be inherited from other frames. It incorporates sets of attribute descriptions called slots. These structures can be used to unify and denote a loose collection of objects, related ideas, concepts, facts and experiences.

2. Processing Feature. When we process the natural language text we "understand it" by filling in the appropriate slots in the frames. The slot provides space for computation. It can contain a default value, a restriction on the value to be added, a procedure activated to compute a needed value, or a rule activated when certain conditions are met. Properties, relationships, and events can be fitted into slots of an object from conditions and situations; restrictions can be attached to the slots to trigger a sequence of actions by the program.

3. Reasoning Services. The frame-based representation can perform inferences as part of its assertion and retrieval operation.
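
The three components listed above can be sketched as a small data structure. The Python below is an illustrative sketch only, under the assumption that a frame is a set of named slots with a parent frame to inherit from; the class name Frame, the slot names and the numbers are all hypothetical and do not come from any system described in the text.

```python
class Frame:
    """A frame: named slots, with inheritance from an optional parent frame."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.values = {}        # slot -> value filled in while reading text
        self.defaults = {}      # slot -> default value
        self.restrictions = {}  # slot -> predicate a new value must satisfy
        self.if_needed = {}     # slot -> procedure that computes a value

    def _lookup(self, table, slot):
        """Search this frame, then its ancestors, for an entry in `table`."""
        frame = self
        while frame is not None:
            entries = getattr(frame, table)
            if slot in entries:
                return entries[slot]
            frame = frame.parent
        return None

    def fill(self, slot, value):
        """Fill a slot, enforcing any (possibly inherited) restriction."""
        check = self._lookup("restrictions", slot)
        if check is not None and not check(value):
            raise ValueError(f"{value!r} violates the restriction on {slot!r}")
        self.values[slot] = value

    def get(self, slot):
        """Retrieval with inference: value, else procedure, else default."""
        value = self._lookup("values", slot)
        if value is not None:
            return value
        procedure = self._lookup("if_needed", slot)
        if procedure is not None:
            return procedure(self)
        return self._lookup("defaults", slot)


# Hypothetical frames and numbers, for illustration only.
trip = Frame("trip")
trip.defaults["vehicle"] = "car"
trip.defaults["speed_kmh"] = 80
trip.restrictions["distance_km"] = lambda v: v >= 0
trip.if_needed["duration_h"] = lambda f: f.get("distance_km") / f.get("speed_kmh")

flight = Frame("flight", parent=trip)
flight.defaults["vehicle"] = "airplane"
flight.defaults["speed_kmh"] = 800
flight.fill("distance_km", 8000)

print(flight.get("vehicle"), flight.get("duration_h"))
```

Here the flight frame inherits the restriction and the duration procedure from trip but overrides two defaults, so retrieving duration_h performs a small inference rather than a plain lookup.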

One of the more widely used frame-based systems is Schank's⁶ conceptual dependency theory, often abbreviated "CD". A conceptual dependency graph is a relation between primitive objects that are either actions, states, or noun-like "picture producers."

⁶ R. C. Schank, N. M. Goldman, C. J. Rieger and C. K. Riesbeck, Conceptual Information Processing, North-Holland, Amsterdam, Fundamental Studies in Computer Science, Vol. 3, 1975.

CD theory⁷ is based on two principles:

1. CD representations should allow effective inference, by associating a fixed set of

inference rules with each CD primitive;

2. CD representations should be independent of any particular human language.

To illustrate the first principle, consider the following sentence:

John flew from Toronto to Shanghai

One would want to be able to tell from the representation that before the event John was in Toronto and afterward he was in Shanghai, and that the same was true of the airplane.

To fulfill the second principle, Schank outlined a short list of primitive actions and showed how to represent sentences as graphs built from these primitives. For example, the CD primitive ptrans, which stands for "physical transfer," indicates motion of a physical object from one place to another. Using the case frame representation, the various components of the CD graph are related using the following cases or "slots":

ACTOR     the initiating agent of an action
OBJECT    the thing affected by the action
INST      the instrument or means by which the action is effected
FROM      the source of the action
DEST      the destination of the action

⁷ E. Rich, Artificial Intelligence, McGraw-Hill, New York, McGraw-Hill Series in Artificial Intelligence, 1983.

So the above sample sentence would be represented by the following CD graph:

[CD graph: john <=> ptrans, with object (O) john, instrument (I) airplane, and direction (D) from Toronto to Shanghai]

which literally means "John physically transferred himself from Toronto to Shanghai using an airplane as conveyance." (Here <=> denotes the relation between actor and action; the arc labelled O indicates the object of an action; the arc labelled I indicates the instrumental conceptualization for an action; and the arc labelled D, with its two branches, indicates the direction of an object within an action.)

No one actually draws such graphs any more, since we can represent the same CD graph as a case frame as follows:

(cd (actor (john))
    (<=> (ptrans))
    (object (john))
    (inst (airplane))
    (from (Toronto))
    (dest (Shanghai)))
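
The first CD principle, that a fixed set of inference rules is attached to each primitive, can be sketched by treating the case frame as a dictionary. The Python below is a hedged illustration: the function ptrans_inferences is hypothetical, the OBJECT slot from the list of cases is used for the thing moved, and assuming that the instrument travels with the object is a simplification that only suits conveyances such as the airplane here.

```python
# The case frame for "John flew from Toronto to Shanghai".
cd = {
    "actor": "john",
    "<=>": "ptrans",
    "object": "john",
    "inst": "airplane",
    "from": "Toronto",
    "dest": "Shanghai",
}


def ptrans_inferences(frame):
    """Locations implied by a ptrans event, before and after it.

    Assumption: the instrument is a conveyance that moves with the object.
    """
    if frame["<=>"] != "ptrans":
        raise ValueError("not a ptrans conceptualization")
    moved = [frame["object"]]
    if "inst" in frame:
        moved.append(frame["inst"])
    before = {thing: frame["from"] for thing in moved}
    after = {thing: frame["dest"] for thing in moved}
    return before, after


before, after = ptrans_inferences(cd)
print(before)
print(after)
```

Because the rule is written against the primitive and its slots rather than against the English verb, the same inference would apply to any sentence that is encoded with ptrans.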

The details are beyond the scope of this thesis. For more information on CD theory and frame-based methods, please refer to [R. C. Schank, N. M. Goldman, C. J. Rieger and C. K. Riesbeck, Conceptual Information Processing, Fundamental Studies in Computer Science, Elsevier Press].

2.8 Graph-Based Methods

Another approach to representing knowledge is through the use of graph structures. There are many varieties of graph-based representations, such as semantic nets and conceptual graphs. Semantic nets were first developed for AI as a way of representing human memory and language understanding [Quillian 68]⁸. Quillian used semantic nets to analyze the meaning of words in sentences. Since then, semantic nets have been applied to many problems involving knowledge representation.

The structure of a semantic net is shown graphically in terms of nodes and the arcs connecting them. Nodes are often referred to as objects and the arcs as links or edges. The links of a semantic net are used to express relationships. Nodes are generally used to represent physical objects, concepts, or situations. For a detailed discussion of semantic nets, refer to [Quillian 68].
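
A semantic net as described above is simply a set of labelled arcs between nodes. The following Python sketch stores it as an adjacency list; the class SemanticNet, the link labels ("is-a", "can", "has") and the example nodes are hypothetical, and following "is-a" links to inherit properties is a common convention assumed here rather than something stated in the text.

```python
from collections import defaultdict


class SemanticNet:
    """Nodes connected by labelled, directed links."""

    def __init__(self):
        self.links = defaultdict(list)  # node -> [(label, target node)]

    def add(self, source, label, target):
        self.links[source].append((label, target))

    def related(self, node, label):
        """Targets reached from `node` by links carrying `label`."""
        return [target for lab, target in self.links[node] if lab == label]

    def lookup(self, node, label):
        """Like related(), but climbs 'is-a' links until something is found."""
        seen = set()
        frontier = [node]
        while frontier:
            current = frontier.pop(0)
            if current in seen:
                continue
            seen.add(current)
            found = self.related(current, label)
            if found:
                return found
            frontier.extend(self.related(current, "is-a"))
        return []


net = SemanticNet()
net.add("canary", "is-a", "bird")
net.add("bird", "is-a", "animal")
net.add("bird", "can", "fly")
net.add("animal", "has", "skin")

print(net.lookup("canary", "can"), net.lookup("canary", "has"))
```

Nothing in the structure itself fixes what "is-a" or "has" mean; the meaning lives entirely in the procedures that interpret the labels.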

⁸ M. R. Quillian, "Semantic Memory," Semantic Information Processing, ed. by Marvin Minsky, The MIT Press, pp. 227-270, 1968.

Although semantic nets can be very useful in representing knowledge, they have limitations such as the lack of link name standards. This makes it difficult to understand what the net is really designed for and whether it was designed in a consistent manner. For a semantic net to represent definitive knowledge, that is, knowledge that can be defined, the link and node names must be rigorously defined.

The last, but not least, knowledge representation method we present is Conceptual Graphs [Sowa 84]⁹. This method is widely used in natural language encoding, where it has many advantages. We will use it to extract the meanings of natural language texts in this thesis and will discuss it in more detail in the following chapter.

⁹ J. F. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, Reading, MA, 1984.

CHAPTER 3

Conceptual Graphs as Knowledge Encoding Schema

Conceptual structures, as developed by John Sowa (1984), provide a rich knowledge representation schema intended to incorporate many concepts found in natural and formal languages. A conceptual graph is an abstract representation for logic with nodes called concepts and conceptual relations, linked together by arcs. The direction of the arcs determines the relations between the two objects they connect. Within the graphs, concept nodes represent entities, attributes, states, and events, while relation nodes show how the concepts are interconnected. We explain some of the notation used in conceptual graphs in the following sections.

3.1 Percepts

Perception is the process of building a working model that represents and interprets sensory input. The model has two components: a sensory part formed from a mosaic of percepts, each of which matches some aspect of the input; and a more abstract part called a conceptual graph, which describes how the percepts are combined to form the mosaic. Percepts are fragments of images that fit together like the pieces of a jigsaw puzzle. A conceptual graph describes the way percepts are assembled. Conceptual relations specify the role that each percept plays: one percept may match a part of an icon to the right or left of another percept; a percept for a color may be combined with a percept of a shape to form a graph that represents a colored shape.

3.2 Concepts and Conceptual Relations

The term "concept" is defined as: "a node in a conceptual graph that refers to an entity, a set of entities, or a range of entities". A concept is a basic unit for representing an entity or state. Every concept has a concept type t and a referent r. The concept types are organized in a hierarchy according to levels of generality. The referent is basically the entity or entities that a concept references. For example, "person", "country" and "city" are concept types; "John", "Canada" and "Guelph" are referents. If we pair concept types with referents, we have specific concepts: [person: 'John'], [country: 'Canada'] and [city: 'Guelph'].

A conceptual relation always connects two concepts; it shows that some relationship holds between their referents. For example:

[PERSON: 'John'] ← (AGNT) ← [READING] → (OBJECT) → [NEWSPAPER: 'Toronto Star'].

This linear expression of a conceptual graph represents the sentence John is reading the "Toronto Star". The relations in this conceptual graph are AGNT and OBJECT. The AGNT relation shows that John is the agent of READING; OBJECT shows that the "Toronto Star" is the object of READING.

3.3 Referent

The concept box is divided in two parts: a type field on the left and a referent field on the right. The concept [PERSON: 'John'] is an individual concept with type PERSON and referent John. The concept [READING] is called a generic concept, because it does not identify a particular individual; it specifies only the type, not the individual. The referent r of a concept c is a pair <q, d>, where q is called the quantifier of c, and d is called the designator of c.
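
The type field, the referent field and the type hierarchy can be sketched together as follows. This Python is an illustration under stated assumptions: the SUPERTYPE table is a hypothetical fragment of a hierarchy, the quantifier part of the referent is omitted, and the class Concept is not the Notio API or any other library's.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fragment of a type hierarchy: type -> immediate supertype.
SUPERTYPE = {"MAN": "PERSON", "PERSON": "ENTITY", "CITY": "ENTITY", "ENTITY": None}


def is_subtype(concept_type, ancestor):
    """True if concept_type equals ancestor or lies below it in the hierarchy."""
    while concept_type is not None:
        if concept_type == ancestor:
            return True
        concept_type = SUPERTYPE.get(concept_type)
    return False


@dataclass(frozen=True)
class Concept:
    type: str
    referent: Optional[str] = None  # None marks a generic concept

    @property
    def is_generic(self):
        return self.referent is None

    def linear(self):
        """Render the concept box in linear notation."""
        if self.is_generic:
            return f"[{self.type}]"
        return f"[{self.type}: '{self.referent}']"


john = Concept("PERSON", "John")
reading = Concept("READING")
print(john.linear(), reading.linear(), is_subtype("MAN", "PERSON"))
```

An individual concept carries a designator in its referent field, while a generic concept leaves it empty and so stands for some unspecified individual of its type.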

3.4 Conceptual Graph

A conceptual graph is a bipartite graph that has two kinds of nodes called concepts and conceptual relations. A conceptual relation link specifies the role that each percept plays. Figure 3.1 shows a conceptual graph that describes the sentence John is reading the "Toronto Star" with a microfiche.

[Figure 3.1: graphical form of a conceptual graph with the concepts Person: John, Reading, Newspaper: Toronto Star and Microfiche Reader]

Figure 3.1 is the graphic representation of a conceptual graph. It uses boxes to represent concepts and circles to represent conceptual relations. The advantage of this representation is readability, but it is hard to type, difficult for a computer to process, and takes up a lot of space. So more often we use the linear notation, which uses square brackets for the concepts and rounded parentheses for the conceptual relations.

A conceptual graph "g" with "n" conceptual relations can be constructed from n star graphs, one for each conceptual relation in g. Since Figure 3.1 has three conceptual relations, it could be constructed from the following three star graphs, which are represented in the linear form (LF):

[Person: John] ← (Agent) ← [Reading]
[Reading] → (Object) → [Newspaper: Toronto Star]
[Reading] → (Inst) → [Microfiche Reader].

These three star graphs constitute a disconnected conceptual graph. To form a connected CG, they could be joined by overlaying the three identical concepts of type [Reading] to form the conceptual graph of Figure 3.1.

[Reading] -
    (Agent) → [Person: John]
    (Object) → [Newspaper: Toronto Star]
    (Inst) → [Microfiche Reader].
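
Overlaying the shared [Reading] concept amounts to grouping the star graphs by that concept. The sketch below is a minimal Python illustration, assuming each star graph is stored as a (concept, relation, concept) triple read in the direction of the arrows; the function names are hypothetical and ASCII "->" stands in for the arrow.

```python
# One triple per star graph: (source concept, relation, target concept).
stars = [
    ("Reading", "Agent", "Person: John"),
    ("Reading", "Object", "Newspaper: Toronto Star"),
    ("Reading", "Inst", "Microfiche Reader"),
]


def join_on(star_graphs, head):
    """Overlay the stars that share the concept `head`; return its arcs."""
    return [(rel, tail) for source, rel, tail in star_graphs if source == head]


def linear_form(head, arcs):
    """Render the joined graph in multi-line linear notation."""
    lines = [f"[{head}] -"]
    lines += [f"    ({rel}) -> [{tail}]" for rel, tail in arcs]
    return "\n".join(lines) + "."


print(linear_form("Reading", join_on(stars, "Reading")))
```

The printed result has the same shape as the joined linear form above: one head concept followed by one indented line per conceptual relation.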

The arrows on the arcs indicate the expected direction for reading the graph. For conceptual relations whose names are nouns or abbreviations of nouns, the following conventions are commonly used:

1. When a graph is read in the direction of the arrows, the arc pointing towards the circle is read as "has a", and the one pointing away from the circle is read "which is".

2. When a graph is read against the flow of the arrows, the arc pointing away from the circle is read "is a", and the one pointing towards the circle is read "of".

So according to these rules, the above conceptual graph can be read:

1. read in the direction of the arrows:

Reading has an agent, which is John;
Reading has an object, which is "Toronto Star";
Reading has an instrument, which is Microfiche Reader.

2. read against the flow of the arrows:

John is an agent of Reading;
"Toronto Star" is an object of Reading;
Microfiche Reader is an instrument of Reading.
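
The two reading conventions are mechanical enough to generate. The Python below is a small sketch, assuming each relation is held as a (concept, relation name, concept) triple in arrow direction; the helper names are hypothetical, and the article is chosen by a simple vowel test that suits the relation names used here.

```python
GRAPH = [
    ("Reading", "agent", "John"),
    ("Reading", "object", '"Toronto Star"'),
    ("Reading", "instrument", "Microfiche Reader"),
]


def article(noun):
    """Choose 'a' or 'an' by the first letter (adequate for these names)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def read_with_arrows(source, relation, target):
    """Convention 1: '<source> has a <relation>, which is <target>'."""
    return f"{source} has {article(relation)} {relation}, which is {target}"


def read_against_arrows(source, relation, target):
    """Convention 2: '<target> is a <relation> of <source>'."""
    return f"{target} is {article(relation)} {relation} of {source}"


for triple in GRAPH:
    print(read_with_arrows(*triple))
for triple in GRAPH:
    print(read_against_arrows(*triple))
```

Both readings are produced from the same triples, which is the sense in which the graph, and not either English rendering, carries the meaning.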

Conceptual graphs are independent of the surface language. Thus no matter how the sentence is phrased or what language is used, it should be represented by the same conceptual graph. For instance, the conceptual graph shown in Figure 3.1 also represents the sentence: John is using a Microfiche Reader to read the "Toronto Star".

3.5 Canonical Graphs and Canonical Formation Rules

A conceptual graph is a combination of concept nodes and relation nodes where every arc of every conceptual relation is linked to a concept. But not all such combinations make sense. Some of them include absurd combinations like the following:

[CAR] → (STATE) → [LAUGHING].

This is an odd, unusual, or perhaps meaningless graph that may be read "A car has a state of laughing". To rule out such sentences, Katz and Fodor (1963) developed a theory of semantics that imposes selectional constraints on permissible combinations of words.

To distinguish the meaningful graphs that represent real or possible situations in the external world, certain graphs are declared to be canonical. Through experience, each person develops a world view represented in canonical graphs. One source of the graphs is observation: the assembler may combine certain concepts in perception. Since that combination is true of a real situation, it must be canonical. Another source is the derivation of new canonical graphs from other canonical graphs by formation rules. The formation rules are the rules of copy, restrict, unrestrict, join, simplify and detach.

The join rule merges identical concepts. Two graphs may be joined by overlaying one graph on top of the other, so that the two identical concepts merge into a single concept. As a result, all the conceptual relations that had been linked to either concept are linked to the single merged concept.

When two concepts are joined, some relations in the resulting graph may become redundant. One of each pair of duplicates can then be deleted by the rule of simplification: when two relations of the same type are linked to the same concepts in the same order, they assert the same information; one of them may therefore be erased.

For example, suppose we have the following two canonical graphs:

[Figure: one graph with the concepts MAN, EAT and FAST; another with the concepts PERSON: John, EAT and APPLE]
Figure 3.2 Two canonical graphs

Figure 3.2 shows two canonical graphs. The first one may be read "A man is eating fast", and the second "A person, John, is eating an apple". We can then use the formation rules to join the two graphs as shown in Figure 3.3. For the concepts to be identical before they are merged, both are first restricted to [MAN: John].

[Figure: the joined graph, with the concepts MAN: John, EAT, FAST and APPLE]
Figure 3.3 Join of the two canonical graphs in Figure 3.2

After simplifying the joined graph above according to the simplification rule, we have a new canonical graph:

[Figure: the simplified graph, with the concepts MAN: John, EAT, FAST and APPLE]
Figure 3.4 The simplification of Figure 3.3
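
The restrict, join and simplify steps of this example can be traced in code. The following Python is a sketch under several assumptions: a graph is a pair (concepts, relations) with hypothetical node identifiers, the relation name MANR for "fast" is assumed because the figures are not legible, and restrict simply overwrites a node's label without consulting a type hierarchy.

```python
def restrict(graph, node, concept_type, referent):
    """Restrict rule (simplified): specialise a node's type and referent.
    A full version would check the type hierarchy; this sketch does not."""
    concepts, relations = graph
    updated = dict(concepts)
    updated[node] = (concept_type, referent)
    return updated, list(relations)


def join(graph, keep, drop):
    """Join rule: merge node `drop` into the identical node `keep`."""
    concepts, relations = graph
    if concepts[keep] != concepts[drop]:
        raise ValueError("join requires identical concepts")

    def redirect(node):
        return keep if node == drop else node

    merged = {n: c for n, c in concepts.items() if n != drop}
    return merged, [(r, redirect(a), redirect(b)) for r, a, b in relations]


def simplify(graph):
    """Simplification rule: erase duplicate relations on the same concepts."""
    concepts, relations = graph
    unique = []
    for relation in relations:
        if relation not in unique:
            unique.append(relation)
    return concepts, unique


# The two graphs of the example, held as one disconnected graph.
concepts = {
    "m": ("MAN", None), "e1": ("EAT", None), "f": ("FAST", None),
    "p": ("PERSON", "John"), "e2": ("EAT", None), "a": ("APPLE", None),
}
relations = [
    ("AGNT", "e1", "m"), ("MANR", "e1", "f"),      # a man is eating fast
    ("AGNT", "e2", "p"), ("OBJECT", "e2", "a"),    # John is eating an apple
]

graph = (concepts, relations)
graph = restrict(graph, "m", "MAN", "John")
graph = restrict(graph, "p", "MAN", "John")
graph = join(graph, "m", "p")
joined = join(graph, "e1", "e2")
simplified = simplify(joined)
print(len(joined[1]), len(simplified[1]))
```

After both joins the AGNT relation appears twice on the same pair of concepts, which is exactly the redundancy that simplification removes.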

3.6 Contexts

A context is a concept that contains one or more nested conceptual graphs that describe the referent. The concept of type "Situation" is an example of a context. Figure 3.5 shows the conceptual graph that expresses the sentence "I suggest that you take the exam."

[Figure: a Proposition context containing a nested graph with the concepts Person: You, Take and Exam]
Figure 3.5 A conceptual graph containing a context

3.7 The CGIF Representation of Conceptual Graphs

A conceptual graph can be represented in several ways. The representations mentioned above are good for humans in that they are readable. However, they are not computer readable. Another representation of conceptual graphs, CGIF, was introduced to solve this problem.

CGIF, which stands for the Conceptual Graph Interchange Form, is a representation for conceptual graphs intended for transmitting conceptual graphs across networks and between IT systems that use different internal representations. All features in the formal CG definitions are represented in CGIF, and the comment fields permit informal extensions, such as formatting information for graphical displays. The primary design goal for CGIF is high-speed generation, transmission, and parsing of conceptual graphs sent between computer systems. The CGIF syntax ensures that all necessary syntactic and semantic information about a symbol is available before the symbol is used; therefore, all translations can be performed during a single pass through the input stream. When a conceptual graph is represented in CGIF, the grammar rules permit several different options, all of which are logically equivalent. For example, the CG in Figure 3.1 for John is reading the "Toronto Star" with a microfiche can be represented in CGIF as:

[Reading *x] (Agnt ?x [Person 'John']) (Object ?x [Newspaper 'Toronto Star']) (Inst ?x [Microfiche Reader])
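
A string of this shape is easy to generate in a single pass. The Python below is a sketch that imitates only the form of the example above: a defining label (*x) on the head concept and a bound label (?x) in every relation. It is not a conforming CGIF writer, the function to_cgif is hypothetical, and the sketch emits ?x in each relation so that every relation stays tied to the head concept.

```python
def to_cgif(head_type, label, arcs):
    """Serialise one head concept and its relations in the example's style.

    arcs is a list of (relation, (concept type, referent or None)).
    """
    parts = [f"[{head_type} *{label}]"]
    for relation, (concept_type, referent) in arcs:
        if referent is None:
            concept = f"[{concept_type}]"
        else:
            concept = f"[{concept_type} '{referent}']"
        parts.append(f"({relation} ?{label} {concept})")
    return " ".join(parts)


arcs = [
    ("Agnt", ("Person", "John")),
    ("Object", ("Newspaper", "Toronto Star")),
    ("Inst", ("Microfiche Reader", None)),
]
print(to_cgif("Reading", "x", arcs))
```

Because the defining label is written before any relation refers to it, a reader of the string never meets ?x before *x, which is the single-pass property described above.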

3.8 Why Use Conceptual Graphs

In the previous sections, we briefly overviewed some of the main aspects of conceptual graphs. As we can see, the conceptual graph is a knowledge-rich representation of natural language. It has expressive power sufficient to encode any fact or concept that is encodable in any other formal, symbolic system. This means that conceptual graphs may serve as a common medium of representation for diverse kinds of knowledge. The conceptual structures that encode information may themselves serve as a guide for information retrieval. From a given node, nodes representing related entities are found simply by following pointers from the node to its neighbors. In this way, a conceptual graph provides its own meaning-bearing indexing system; that is, the indexes are no longer based on key words but on concepts or relations between concept nodes. Labels on arcs and nodes are meaningful to graph-manipulating procedures: they provide guidance to help traverse the conceptual graph in search of information relevant to a task.

The conceptual graph's ease and expressiveness for general and specific concepts is a major attraction over other formalisms, such as rules and logic, for building knowledge bases. Several efforts have demonstrated the power of conceptual graphs to perform natural language processing and build domain-focused knowledge-based systems [Fargues et al. 86; Sowa & Way 86; Garner & Tsui 86; Morton & Baldwin 85]. A conceptual graph-based knowledge system is believed to be a flexible and powerful approach to building a foundational knowledge base of general concepts [Berg-Cross & Price 89]. As conceptual graphs have various means of representation to meet various needs, they are convenient both for human reading and for computer processing. For these reasons, we chose conceptual graphs as our natural language encoding schema and built a knowledge base system that is able to store and retrieve conceptual graphs.

3.9 The 'Notio' Java Package for Modelling Conceptual Graphs

Notio is a Java class library for constructing and manipulating conceptual graphs. It was designed and implemented by Finnegan Southey¹⁰ at the University of Guelph. Currently, the package provides facilities for operations on single graphs or pairs of graphs. It provides support for the management of individual graphs, not large groups of graphs. Most of its operations act on only one or two graphs. As such, it is ideal as a basis for CG editors or as a representation for data retrieved from a large-scale system. This thesis will make use of Notio in building the conceptual graph knowledge base system.

¹⁰ Finnegan Southey, ICCS 1999, "Notio - A Java API for developing CG tools", University of Guelph, Computing and Information Science, Guelph ON, Canada.

CHAPTER 4

An Object-Oriented Knowledge base for Conceptual Graphs

In chapter 3, we discussed the basic concepts of conceptual graphs. Conceptual graphs are a graphic representation of logic that is used to structure knowledge embedded in natural languages so that the knowledge can be processed by a computer. With the help of this knowledge encoding system, we are able to extract and encode meanings from natural language text. Having this knowledge representation tool is not enough, however. We still need a knowledge base to store and manage the conceptual graphs.

4.1 Why Do We Need a Knowledge base for Conceptual Graphs?

It is obvious that financial institutions such as banks need a database system to store information such as account number, encrypted password, customer name, address, telephone number and account balance. For retrieval purposes, there is also a need to build indexes based on account number or on name and address combined. A traditional relational database system can do a very good job for this type of data management.

However, for research purposes, it is desirable to have a database (i.e. a repository) to store research articles that are written in natural language. Can we also use a traditional relational database to do the job? The answer is both Yes and No. Yes, because we may employ the methods discussed in chapter two, such as inverted file methods, adaptive methods and similarity measures. No, because the retrieval mechanisms for these methods are all basically "key word" matching. Thus the retrieval result is usually unsatisfactory: either irrelevant information is retrieved or useful and relevant information is missed.

To overcome the "keyword barrier", we employ knowledge representations, such as conceptual graphs, to model and structure natural language texts. Thus, we are able to extract and encode meaning from the texts and make intelligent search possible. So instead of storing natural language texts directly in the database, we may store conceptual graphs that contain the meaning of the texts. Thus we need a knowledge base system to store the conceptual graphs and an indexing method to index the conceptual graphs for later efficient retrieval.

Can a normal relational database do the job? Some research has been done on how to apply a normal relational database to conceptual graphs. Brian Bowen and Pavel Kocura¹¹ have shown that conceptual graphs can be stored in a relational database and managed by the relational database system. It is worth having a look at their research since we are going to do similar work but with a different approach. The following is a synopsis of their work.

¹¹ B. A. Bowen and P. Kocura, "Implementing Conceptual Graphs in a RDBMS", ICCS 1993, Loughborough University, Department of Computer Studies, UK.

4.2 Relational Database for Conceptual Graphs

The way in which conceptual graphs are physically stored is of obvious importance in terms of their efficiency of retrieval, as well as the efficiency of operations upon them. Conceptual graphs can consist of many concepts and relations and they are variable in size, which creates problems when attempting to store them in fixed-field relational tables. To solve these problems, Brian Bowen and Pavel Kocura first fragment the graph into concepts and relations, then store the concepts in one table and the relations in another.

All graph-string tables in their system are bi-tables, and have a number of domains in common. In the concept tables, the domains common to all uses are:

TYPE        A type label
INDVMARK    A marker, either generic or individual

In the relation table, they use FROMPOS (FROM POSITION) and TOPOS (TO POSITION) to represent the direction of the arrow in the conceptual graph. The common domains in the relation table are:

FROMPOS     The marker in the From position
RELATION    A marker, either generic or individual
TOPOS       The marker in the To position

The above are the core domains that allow us to store simple dyadically connected graphs. The following example shows how their method stores a conceptual graph in a relational database. Consider the conceptual graph in Figure 4.1:

[Figure: a conceptual graph with the concepts MAN: Brian, WOMAN: Petra, PERSON: Clovis, INTELLIGENT and BEAUTIFUL]
Figure 4.1 An example of a conceptual graph for storage in an RDB

which would be fragmented into two relational tables as shown in the following:

Type           Indv Mark
MAN            Brian
WOMAN          Petra
PERSON         Clovis
INTELLIGENT    [illegible]
BEAUTIFUL      [illegible]

From           Relation    To
Brian          CHILD OF    Clovis
Petra          CHILD OF    Clovis
[illegible]    ATTR        [illegible]
[illegible]    ATTR        [illegible]

The numbered asterisks represent internal markers assigned by the system in order to ensure that the graph is logically connected when fragmented.
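
The fragmentation into a concept table and a relation table can be sketched with an in-memory SQLite database. This Python is an illustration, not Bowen and Kocura's implementation: the table and column names follow the domains named in the text, the marker format (*1, *2) is an assumption, and the two ATTR rows (which concept each attribute belongs to) are hypothetical because the figure is not legible.

```python
import sqlite3

# (type, referent); None marks a generic concept needing an internal marker.
CONCEPTS = [
    ("MAN", "Brian"), ("WOMAN", "Petra"), ("PERSON", "Clovis"),
    ("INTELLIGENT", None), ("BEAUTIFUL", None),
]
# (index of From concept, relation, index of To concept).
# The two ATTR rows are assumed for illustration only.
RELATIONS = [
    (0, "CHILD OF", 2), (1, "CHILD OF", 2),
    (0, "ATTR", 3), (1, "ATTR", 4),
]


def fragment(concepts, relations):
    """Split a graph into rows for the concept and relation tables."""
    markers = []
    counter = 0
    for _, referent in concepts:
        if referent is None:
            counter += 1
            markers.append(f"*{counter}")  # internal marker for a generic concept
        else:
            markers.append(referent)
    concept_rows = [(ctype, m) for (ctype, _), m in zip(concepts, markers)]
    relation_rows = [(markers[a], rel, markers[b]) for a, rel, b in relations]
    return concept_rows, relation_rows


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE concept (type TEXT, indvmark TEXT)")
db.execute("CREATE TABLE relation (frompos TEXT, relation TEXT, topos TEXT)")
concept_rows, relation_rows = fragment(CONCEPTS, RELATIONS)
db.executemany("INSERT INTO concept VALUES (?, ?)", concept_rows)
db.executemany("INSERT INTO relation VALUES (?, ?, ?)", relation_rows)

children = [row[0] for row in db.execute(
    "SELECT frompos FROM relation WHERE relation = ? AND topos = ? "
    "ORDER BY frompos", ("CHILD OF", "Clovis"))]
print(children)
```

A lookup over a single relation is one SELECT, but reassembling a whole graph from these rows, or matching one graph against another, needs repeated self-joins on the marker columns.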

Those core domains constitute the core graph database, which is used to store general information. Most of the tables in their system also have extra domains that allow them to store additional information needed for the data being stored. The extra information may include TYPEDEFN and RELDEFN, which store type and relation definitions respectively. Each tuple has an extra domain to record the definition that the entry is part of.
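The core domains above can be illustrated with a minimal Java sketch. It is not Bowen and Kocura's implementation: the class `RelationTableSketch`, its `Row` type and the method names are hypothetical, and an in-memory list stands in for the relational table. Only the two CHILD OF rows that survive in the example are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a relation table with the core domains
// FROMPOS, RELATION and TOPOS; an in-memory list stands in for the RDB table.
class RelationTableSketch {
    static final class Row {
        final String fromPos;
        final String relation;
        final String toPos;

        Row(String fromPos, String relation, String toPos) {
            this.fromPos = fromPos;
            this.relation = relation;
            this.toPos = toPos;
        }
    }

    private final List<Row> rows = new ArrayList<Row>();

    // Store one dyadic arc: the marker in the From position, the relation,
    // and the marker in the To position.
    void addArc(String fromPos, String relation, String toPos) {
        rows.add(new Row(fromPos, relation, toPos));
    }

    // Return the To markers reachable from a marker through a given relation.
    List<String> targets(String fromPos, String relation) {
        List<String> result = new ArrayList<String>();
        for (Row r : rows) {
            if (r.fromPos.equals(fromPos) && r.relation.equals(relation)) {
                result.add(r.toPos);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RelationTableSketch table = new RelationTableSketch();
        table.addArc("Brian", "CHILD OF", "Clovis");
        table.addArc("Petra", "CHILD OF", "Clovis");
        System.out.println(table.targets("Brian", "CHILD OF"));
    }
}
```

Because each row holds exactly one From marker and one To marker, a table of this shape can only record relations between two concepts, which is why the scheme is limited to dyadically connected graphs.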

4.3 Problems with Relational Database for Conceptual Graphs

Brian Bowen and Pavel Kocura proposed a way to solve some of the problems that arise when trying to apply relational database technology to conceptual graphs. However, they do not solve all the problems. By nature, a normal relational database has difficulty in modelling the data domains of conceptual graphs. Such domains share some common features:

1. they contain complex objects that are very difficult to represent in a relational database (RDB);

2. they require more manipulative power than the relational model can provide. The DML (data manipulation language) is primarily concerned with efficient querying and maintenance of the database, but has little expressive power; the standard operations of the relational model do not have the expressive power required, e.g. the RDBMS is unable to perform matches based on graphs, concepts and relations;

3. such domains often require the modelling of complicated interrelationships and constraints associated with the objects being modelled; the constraint mechanisms of RDBs are completely unable to cope with such requirements.

Most importantly, the method they proposed is not able to store all kinds of conceptual graphs. It is restricted to simple dyadic connected graphs.

4.4 Object-Oriented Technology - A Better Solution

Although relational databases can be applied to conceptual graphs, they do not fit naturally, and some restrictions have to be imposed. Now we turn to the other alternative: the object-oriented technique. Is an object-oriented knowledge base better for conceptual graphs? The answer is "Yes". In the following few sections, we show the reasons why this approach is better.

4.4.1 Basic Features of Object-Oriented Modelling (OOM)

There are four basic features in object-oriented modelling: Abstract Data Typing, Object Identity, Inheritance, and Encapsulation. We present these four features below because, as we will see later, they are very similar to the features of CGs. Thus, OOM can be applied very naturally to CG issues.

1. Abstract Data Typing (ADT) models various classes in object-oriented knowledge base applications, where each class instance has a protocol: a set of messages to which it can respond. With abstract data types there is a clear separation between the external interface of a data type and its internal implementation. The implementation of an abstract data type is hidden. Hence, alternative implementations could be used for the same abstract data type without changing its interface. This provides a rich mechanism for recording design information for related data and behavior. We can use the notion of an abstract data type to encapsulate data and behavior so that we export only external services while hiding the implementation details of these services. Abstract data typing allows the construction of complex software systems through reusable components: the classes. Thus, through abstract data typing, programming becomes modularized and extendible. Abstract data typing supports a much more natural representation of real-world problems: the dominant components are the objects rather than the procedures. Abstract data typing allows objects of the same structure and behavior to share representation (instance variables) and code (methods).

Abstract data typing is a useful feature when we use an object-oriented method to model conceptual graphs. Since conceptual graphs have concept types, any concept type can be modelled with a corresponding ADT in an OODB.
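The separation between interface and implementation can be sketched in Java as follows. This is an illustration only: the interface `MarkerSet` and both implementing classes are hypothetical names introduced here, not part of any system discussed in this thesis.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AdtSketch {
    // External interface (the "protocol"): the only thing callers see.
    interface MarkerSet {
        void add(String marker);
        boolean contains(String marker);
        int size();
    }

    // One hidden implementation: a list that is scanned linearly.
    static final class ListMarkerSet implements MarkerSet {
        private final List<String> items = new ArrayList<String>();
        public void add(String marker) { if (!items.contains(marker)) items.add(marker); }
        public boolean contains(String marker) { return items.contains(marker); }
        public int size() { return items.size(); }
    }

    // An alternative implementation behind the same interface: a hash set.
    static final class HashMarkerSet implements MarkerSet {
        private final Set<String> items = new HashSet<String>();
        public void add(String marker) { items.add(marker); }
        public boolean contains(String marker) { return items.contains(marker); }
        public int size() { return items.size(); }
    }

    // Client code depends only on the interface, not on either implementation.
    static int countDistinct(MarkerSet set, String[] markers) {
        for (String m : markers) set.add(m);
        return set.size();
    }

    public static void main(String[] args) {
        String[] markers = {"Brian", "Petra", "Brian"};
        System.out.println(countDistinct(new ListMarkerSet(), markers));
        System.out.println(countDistinct(new HashMarkerSet(), markers));
    }
}
```

Swapping one implementation for the other requires no change to `countDistinct`, which is the property described above.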


2. Object Identity is the property of an object that distinguishes it from all other objects. With object identity, objects can contain or refer to other objects. Object identity allows the same object to be referenced through attributes of many other objects. This is called referential sharing. In programming languages, identity is usually realized through memory addresses. In databases, identity is realized through identifier keys. User-specified names are used in both languages and databases to give unique names to objects. Each of these schemes compromises identity. Object identity clarifies, enhances, and extends the notions of pointers in conventional programming languages, foreign keys in databases, and file names in operating systems. Using object identity, programmers can dynamically construct arbitrary graph-structured composite or complex objects, that is, objects that are constructed from sub-objects. Objects can be created and disposed of at run time. In some cases objects can even become persistent and be reaccessed in subsequent programs.
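Referential sharing can be illustrated with a short Java sketch. The classes `Person` and `Child` are hypothetical and exist only for this illustration; the names Brian, Petra and Clovis are borrowed from the example in Figure 4.1.

```java
class IdentitySketch {
    static final class Person {
        String name;
        Person(String name) { this.name = name; }
    }

    static final class Child {
        final String name;
        final Person parent;   // a reference to the parent object, not a copy of it

        Child(String name, Person parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    public static void main(String[] args) {
        Person clovis = new Person("Clovis");
        Child brian = new Child("Brian", clovis);
        Child petra = new Child("Petra", clovis);

        // Both children refer to the very same object (referential sharing).
        System.out.println(brian.parent == petra.parent);

        // A second object with equal state is still a different object.
        Person other = new Person("Clovis");
        System.out.println(other == clovis);
    }
}
```

The `==` operator in Java compares object identity rather than state, so the first comparison holds and the second does not, even though both `Person` objects carry the same name.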

3. Inheritance is a technique that lets us specify some parts of a system incrementally. It means that subclasses can inherit the instance variables and methods of their superclasses. It captures an "is a" relationship. Through inheritance, new software modules (e.g. classes) can be built on top of an existing hierarchy of modules. Inheriting behavior enables code sharing and reusability. Most existing object-oriented systems allow developers to extend an application by specializing existing components (in most cases, classes) of their application. Specialization is a top-down approach to the development of object-oriented database applications. Generalization is the complement of specialization: it is a bottom-up approach that creates classes that are generalizations (or superclasses) of existing subclasses. There are three facets of inheritance that characterize most of the approaches used by object-oriented languages:

a) Visibility of inherited variables and methods: some object-oriented languages allow the direct manipulation of instance variables. Other languages distinguish between public and private instance variables. With inheritance, there is a third alternative called subclass-visible.

b) Method overriding: a subclass can override an inherited method. In other words, a method called "M" in class "C" can be overridden by a totally different method, also called "M", in a subclass of "C".

c) Multiple inheritance: multiple inheritance is a mechanism that allows a class to inherit from more than one immediate parent. With multiple inheritance, a class can have more than one immediate predecessor, so the class inheritance hierarchy becomes a directed acyclic graph (compare with single inheritance, for which the class inheritance hierarchy is a tree).

Inheritance is also a useful feature when we use object-oriented methods to model conceptual graphs. In conceptual graphs, all concept types are organized in a hierarchy tree according to levels of generality. The inheritance feature can be used to model this concept type hierarchy.
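As a hedged illustration, a small concept type hierarchy can be written as a Java class hierarchy. The types `Entity`, `Person`, `Man` and `Woman` and the helper `isSubtype` are assumptions made for this sketch; the thesis does not prescribe these classes.

```java
class TypeHierarchySketch {
    static class Entity {
        String describe() { return "entity"; }
    }

    static class Person extends Entity {
        @Override String describe() { return "person"; }   // overrides Entity.describe()
    }

    static class Man extends Person {
        @Override String describe() { return "man"; }      // overrides Person.describe()
    }

    static class Woman extends Person { }                  // inherits describe() from Person

    // True if 'sub' is the same type as 'sup' or a specialization of it.
    static boolean isSubtype(Class<?> sub, Class<?> sup) {
        return sup.isAssignableFrom(sub);
    }

    public static void main(String[] args) {
        System.out.println(isSubtype(Man.class, Person.class));
        System.out.println(isSubtype(Person.class, Man.class));
        System.out.println(new Woman().describe());
    }
}
```

Here the language's own subclass test answers the question "is MAN a subtype of PERSON?", so the type hierarchy does not have to be stored and searched separately.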


4. Encapsulation protects object integrity because it limits access to only those services explicitly exported for an object. Encapsulation utilizes implementation independence to hide the implementation of an object. Encapsulation thus allows implementation details to change without requiring any change to programs that access objects through the exported services. This same principle also allows objects within an object set to have different implementations. This leads to appropriate uses for overriding, overloading, dynamic binding and polymorphism in inheritance hierarchies, in which more specialized objects can have more efficient implementations for operations. Encapsulation also serves as an interface, which lets the encapsulated object control which services are available and when they are available.

These fundamental features of the object-oriented modelling technique happen to map directly onto the features of conceptual graphs. We discuss these similarities in the following sections.

4.4.2 Abstract Data Typing Maps Concept Type

The first coincidence is that abstract data typing in object-oriented modelling maps naturally to the concept type of conceptual graphs. For example, in order to model the concept type "person", we may create an abstract data type: the class "person". In that class, we may include some attributes that describe a "person". Such attributes may include name, gender, age, height, weight, etc.

4.4.3 Object Identity Maps Individual Marker

The second coincidence is that object identity in object-oriented modelling can be mapped exactly to the individual marker. Both are unique identifiers of an individual or an object in a system. Both are generated internally by the system, and they are not usable outside the system. For externally printable references, an individual may also have a name or serial number that appears after the type label in the concept box. For example, in Java we introduce a new object into a system by using the keyword "new":

Car myFirstCar = new Car(serialNumber); // assume we have already defined the class "Car"

In this way, a new object of type "Car" is created and a unique identifier is assigned to it by the system. The variable "myFirstCar" is an external reference to the object.

The above car can also be represented in a conceptual graph like this:

[Car: serialNumber]

The serial number is an external printable reference. It refers to the particular car in the system.

4.4.4 Inheritance Maps Canonical Formation Rule

The third coincidence is that the inheritance mechanism of object-oriented modelling is already a fundamental feature of conceptual graphs, supported by the canonical formation rules. A canonical graph can be derived from another canonical graph by formation rules. The formation rules are the rules of copy, restrict, unrestrict, join, simplify, and detach.

For example, a concept of type "tiger" can be derived from the concept "animal" if we add some restrictions on "animal" such as "lives on land", "with four legs", "eats meat", etc. In object-oriented modelling, the object class "tiger" can be derived from the class "animal". The restrictions in the conceptual graphs may become the attributes of the derived class.

4.4.5 Encapsulation Maps Context

The fourth coincidence is that the encapsulation mechanism of object-oriented modelling can be matched to the contexts of conceptual graphs. Contexts encapsulate object descriptions in a way that exactly reflects the structure of object-oriented modelling. We may already be familiar with encapsulation in object-oriented modelling, and probably new to contexts. Let us look at the following example, which shows how contexts encapsulate an object. Figure 4.2 is a conceptual graph indicating that a meeting was held on 1 Oct. 1998.

[Figure: the concept MEETING linked by the relation OCCUR to the concept DATE: 1 Oct. 1998]

Figure 4.2 The conceptual graph for a meeting

The concept box with the label MEETING says that there exists a meeting, but it does not specify any details of what happened. The OCCUR relation indicates that it occurred on the date 1 Oct. 1998. To see the details of the meeting, it is necessary to open the box and look inside. The box may be expanded as shown in Figure 4.3.

The expanded box says that there are 20 attendants in the meeting and that the chairman of the meeting is John. He gives a speech to all the attendants.

[Figure: the MEETING context expanded; inside the box are the concepts CHAIRMAN: John, ATTENDANT: {*}@20 and SPEECH, and the box is still linked to DATE: 1 Oct. 1998]

Figure 4.3 Expanded view of the meeting context
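The correspondence between a context and an encapsulated object can be sketched in Java. The class `Meeting`, its field names and its methods are hypothetical; they simply mirror the closed view of Figure 4.2 and the expanded view of Figure 4.3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MeetingContextSketch {
    static final class Meeting {
        private final String date;
        private final String chairman;
        private final List<String> attendants = new ArrayList<String>();

        Meeting(String date, String chairman) {
            this.date = date;
            this.chairman = chairman;
        }

        void addAttendant(String name) { attendants.add(name); }

        // Closed view: all that the unexpanded MEETING box reveals.
        String summary() { return "MEETING on " + date; }

        // Expanded view: details are reachable only through exported services.
        String getChairman() { return chairman; }
        int attendantCount() { return attendants.size(); }
        List<String> getAttendants() { return Collections.unmodifiableList(attendants); }
    }

    public static void main(String[] args) {
        Meeting meeting = new Meeting("1 Oct. 1998", "John");
        for (int i = 1; i <= 20; i++) meeting.addAttendant("attendant" + i);
        System.out.println(meeting.summary());
        System.out.println(meeting.getChairman() + " chairs " + meeting.attendantCount() + " attendants");
    }
}
```

The private fields play the role of the contents of the box: a caller sees only the summary unless it asks for the details, and even then it receives a read-only view of the attendant list.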

4.4.6 Representing Objects and Object Classes

Other aspects of object-oriented modelling can also be matched to those of conceptual graphs. One such aspect is the distinction between an object class and the instances of that class. This feature can also be found in conceptual graphs. For example, in Java, we define an object class like this:

public class Car {
    private String model;          // the model of the car
    private String engineNumber;   // engine has a serial number
    private int wheelSize;         // the size of the wheels
    private String chassisNumber;  // chassis has a serial number

    public Car(String model, String engineNumber, int wheelSize, String chassisNumber) {
        this.model = model;
        this.engineNumber = engineNumber;
        this.wheelSize = wheelSize;
        this.chassisNumber = chassisNumber;
    }

    // ... other methods
}

In conceptual graphs, the definition of a class "Car" may look like this:

[Figure: the definition context Car: ∀ *c, containing the generic concepts MODEL: *m, ENGINE: *e and WHEELSIZE: {*}@16]

Figure 4.4 Class definition for Car

Figure 4.4 shows a sample definition for the object class Car. The object definition has a universal quantifier ∀ to show that it applies to every car *c. Inside the definition, the car *c is a kind of model *m, and it has as parts an engine *e, a set of 16-inch wheels *w, and a body *b. The concepts in the class definition are generic concepts that say that some engine or body must exist for each car, but they do not specify their names or other identifiers.

An instance of the object class Car can be created in Java by using the keyword "new". For example:

Car aCar = new Car("Mustang", "V6", 4, "JKL333");

The same instance can also be specified in conceptual graphs as in the following diagram:

[Figure: the Car context for the individual CA 163 1998, containing MODEL: Mustang, ENGINE: 728EClZS, WHEEL: {*}@16 and CHASSIS: JKL333]

Figure 4.5 An instance of car CA 163 1998

From the above discussion we conclude that an object-oriented approach fits conceptual graphs naturally, and thus it is better to use an object-oriented database than a relational database to store conceptual graphs.

CHAPTER 5

Design and Implementation of the System

In Chapters 3 and 4, we reviewed some basic concepts concerning conceptual graphs and object-oriented databases. In this chapter, we discuss the design and implementation issues of the object-oriented knowledge retrieval system. We use the Object Modelling Technique (OMT) method to design the system and use Java to implement it. It has been tested on JDK 1.1.6. The executable program is named 'CGBase', an abbreviation for "Conceptual Graph Knowledge Base". We chose Java for the following reasons:

1. it is an object-oriented language; in Chapter 4 we discussed the advantages of using object-oriented modelling techniques for conceptual graphs;

2. it is portable to any platform that provides a Java virtual machine;

3. currently most CG tools are implemented in Java.

The guidelines for designing the system are:

1. The system should provide efficient means of storing and searching conceptual graphs. The search should accurately retrieve knowledge based on user inputs.

2. A user-friendly interface should be used, which enables both experienced and inexperienced users to use the system.

The detailed system design that follows reflects these guidelines.

5.1 Overview of the System

The system is divided into three components, namely:

1. an interface between the user and the system. This interface contains a menu that prints the options available for the user to select, as shown in Figure 5.1.

*************************************************
* 1. Create a new conceptual graph knowledge base
* 2. Open an existing conceptual graph knowledge base
* 3. Load a conceptual graph knowledge base
* 4. Search the conceptual graph knowledge base
* 5. Update a conceptual graph
* 6. Show all conceptual graphs
* 7. Close a knowledge base
* 8. Close all knowledge bases
* 0. Exit
*************************************************

Figure 5.1 The selection menu

2. a conceptual graph management kernel. The kernel consists of four object classes: (a) a conceptual graph loader that loads conceptual graphs into the knowledge base; (b) an index builder that extracts useful information from the graphs and builds indexes with this information; (c) a graph searcher that searches the knowledge base for the matched graphs with the help of the indexes; (d) a graph update editor that updates a conceptual graph.

3. a B-tree management system that holds and manages indexes.

In this project, each conceptual graph knowledge base consists of two files. One file is the knowledge base file, with the file name extension '.cgd'. This file contains all conceptual graphs written in CGIF format. The other file is the knowledge base index file, with the file name extension '.ind'. This file maintains all the index information about the knowledge base.

In our approach, we first create a knowledge base for the conceptual graphs that are in CGIF format, then populate the knowledge base with the input conceptual graphs. While the conceptual graphs are being loaded into the knowledge base, we parse each graph, extract the useful information and build indexes with that information.

The system also provides two ways to retrieve the conceptual graphs stored in the knowledge base, namely:

1. "terms" based search: the user may enter any concepts or relations, or a combination of concepts and relations. The system will search for any conceptual graphs that contain this information. The search is based on matching individual concepts and conceptual relations.

2. knowledge based search: the user may enter a conceptual graph. The system will search for any conceptual graphs that match the graph the user entered. The search is based on graph matching.

For example, if the knowledge base contains three conceptual graphs:

(1) [DOG*x'Molly'][MAN*y'Arthur'](LOVES?x?y)

(2) [DOG*x'Molly'][MAN*y'Arthur'][BONE*z](LOVES?x?y)(THROWS?y?z)(CATCHES?x?z)

(3) [DOG*x](LOVES?x[BONE])

Now, if the user just enters: dog

The system will find all conceptual graphs that contain the concept 'dog'. In this case all three conceptual graphs will be printed. The user, however, may narrow down the search by providing more information. For example, the user may enter: dog + loves + arthur. In this case only the first two conceptual graphs will be found and printed.

For the exact search, the user must enter a conceptual graph in CGIF format. The system will take the input graph and search the knowledge base using graph matching.
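The "terms" based search described above can be sketched as an intersection of per-term graph sets. This is a simplified illustration and not the CGBase implementation: the class `TermSearchSketch` is hypothetical, an in-memory map stands in for the index file, and the terms of each graph are supplied by hand instead of being parsed from CGIF.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

class TermSearchSketch {
    // term (lower case) -> ids of the graphs that contain it
    private final Map<String, TreeSet<Integer>> index = new HashMap<String, TreeSet<Integer>>();

    void addGraph(int graphId, String... terms) {
        for (String term : terms) {
            String key = term.toLowerCase(Locale.ROOT);
            TreeSet<Integer> ids = index.get(key);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                index.put(key, ids);
            }
            ids.add(graphId);
        }
    }

    // Query of the form "dog + loves + arthur": every term must be present.
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String raw : query.split("\\+")) {
            TreeSet<Integer> ids = index.get(raw.trim().toLowerCase(Locale.ROOT));
            if (ids == null) return new ArrayList<Integer>();   // unknown term: no match
            if (result == null) result = new TreeSet<Integer>(ids);
            else result.retainAll(ids);
        }
        return result == null ? new ArrayList<Integer>() : new ArrayList<Integer>(result);
    }

    public static void main(String[] args) {
        TermSearchSketch kb = new TermSearchSketch();
        kb.addGraph(1, "DOG", "Molly", "MAN", "Arthur", "LOVES");
        kb.addGraph(2, "DOG", "Molly", "MAN", "Arthur", "BONE", "LOVES", "THROWS", "CATCHES");
        kb.addGraph(3, "DOG", "LOVES", "BONE");
        System.out.println(kb.search("dog"));
        System.out.println(kb.search("dog + loves + arthur"));
    }
}
```

With the three example graphs loaded, the query "dog" selects all three graphs and "dog + loves + arthur" selects only graphs (1) and (2), in line with the behaviour described above.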


The system also allows an existing conceptual graph to be updated. In this case, we first search for the conceptual graph that we want to update. When we enter the new conceptual graph, the system replaces the old graph with the new one.

5.2 Design of a Conceptual Graph Object

Objects in memory and objects in a database differ semantically. Historically, object-oriented systems and languages assumed that all the objects reside in a large virtual memory and, as such, never bothered to develop concepts for managing objects in a database (Stefik and Bobrow, 1986). As in a relational database, integrity constraints such as uniqueness of objects, admissibility of null values, domain types of attributes and relationships between objects also have to be applied to an object-oriented database. The design of conceptual graphs in memory and on disk is different. In memory, a conceptual graph is an object that consists of concepts, relations and comments, which are also objects. Figure 5.2 shows the conceptual graph in memory.

[Figure: a Conceptual Graph object composed of Concepts, Relations and Comments]

Figure 5.2 The conceptual graph object in memory

The above diagram shows that a conceptual graph consists of zero or more concepts, zero or more conceptual relations and zero or more comments.

At first glance, storing objects on a disk is easy, since Java supports writing an object directly to a file. However, objects written in this way cannot be retrieved randomly, since Java does not support random access to serialized objects in a file. Random access to the information stored on a disk is very basic and is a must for any knowledge base system. Thus, we have to find another way to store objects (conceptual graphs) on a disk.

The solution is as follows: since everything on the disk is in the form of byte streams, it is impossible to retain data structures and pointers on disk. Thus, we have to store conceptual graphs as byte streams. The length of each conceptual graph byte stream is variable, in sharp contrast with a relational database, in which all records consist of a fixed number of fixed-length fields. With fixed-length fields, a waste of disk space is inevitable, and in some situations the waste can be very significant, since the sizes of conceptual graphs can differ greatly. This is one of the major drawbacks of the relational database solution.
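One way to store variable-length byte streams while keeping random access is to prefix each record with its length and remember the offset at which it starts. The sketch below illustrates this idea with `java.io.RandomAccessFile`. The record layout (a 4-byte length followed by UTF-8 bytes) and the class `RecordFileSketch` are assumptions made for the illustration; the actual layout of the knowledge base file is not specified here.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class RecordFileSketch {
    // Append one graph as [int length][bytes] and return the offset where it starts.
    static long append(File file, String graphText) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long offset = raf.length();
            raf.seek(offset);
            byte[] bytes = graphText.getBytes(StandardCharsets.UTF_8);
            raf.writeInt(bytes.length);
            raf.write(bytes);
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Jump straight to a stored offset and read back exactly one record.
    static String readAt(File file, long offset) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            byte[] bytes = new byte[raf.readInt()];
            raf.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("cgbase-sketch", ".tmp");
        file.deleteOnExit();
        long first = append(file, "[DOG*x](LOVES?x[BONE])");
        long second = append(file, "[DOG*x'Molly'][MAN*y'Arthur'](LOVES?x?y)");
        System.out.println(readAt(file, second));   // read the second record first
        System.out.println(readAt(file, first));
    }
}
```

Each record occupies only as many bytes as the graph needs, and the returned offsets are the kind of address an index can store to reach a graph without scanning the file.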

5.3 Design of a Concept Object

Concepts consist of four components, namely a type, a quantifier, a designator and a comment. The latter two components constitute the referent of the concept. Quantifiers are represented by a class that implements the Macro interface. Macros can either use a simple placeholder object with a name, or can actually provide an executable operation that 'executes' the macro and changes the graph accordingly. Figure 5.3 is the graphical representation of a concept object.

[Figure: a Concept object composed of Type, Quantifier, Designator and Comment]

Figure 5.3 The concept object in memory

5.4 Design of Relation Object

Relations consist of a type and arguments (arcs), which are an ordered list of Concepts. Figure 5.4 shows the relation object.

[Figure: a Relation object]

Figure 5.4 The relation object in memory
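The in-memory objects of Sections 5.2 to 5.4 can be sketched as plain Java classes. This is a simplified illustration with hypothetical class and field names: in particular, the quantifier is reduced to a string here, whereas the text above represents quantifiers with a class implementing the Macro interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GraphObjectsSketch {
    static final class Concept {
        final String type;        // e.g. "DOG"
        final String quantifier;  // simplified to a string; may be null
        final String designator;  // e.g. "Molly"; null for a generic concept
        final String comment;     // may be null

        Concept(String type, String quantifier, String designator, String comment) {
            this.type = type;
            this.quantifier = quantifier;
            this.designator = designator;
            this.comment = comment;
        }
    }

    static final class Relation {
        final String type;              // e.g. "LOVES"
        final List<Concept> arguments;  // ordered list of arcs

        Relation(String type, Concept... arguments) {
            this.type = type;
            this.arguments = Arrays.asList(arguments);
        }
    }

    static final class ConceptualGraph {
        final List<Concept> concepts = new ArrayList<Concept>();
        final List<Relation> relations = new ArrayList<Relation>();
        final List<String> comments = new ArrayList<String>();
    }

    // Builds the graph [DOG*x'Molly'][MAN*y'Arthur'](LOVES?x?y) by hand.
    static ConceptualGraph example() {
        Concept dog = new Concept("DOG", null, "Molly", null);
        Concept man = new Concept("MAN", null, "Arthur", null);
        ConceptualGraph graph = new ConceptualGraph();
        graph.concepts.add(dog);
        graph.concepts.add(man);
        graph.relations.add(new Relation("LOVES", dog, man));
        return graph;
    }

    public static void main(String[] args) {
        ConceptualGraph g = example();
        Relation r = g.relations.get(0);
        System.out.println(r.type + " has " + r.arguments.size() + " arguments");
    }
}
```

The relation holds references to the same `Concept` objects that the graph lists, so the arcs rely on object identity rather than on copied markers.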

5.5 Indexing

The process of constructing document surrogates/tags by assigning identifiers to text items is known as indexing [Salton 83]. Indexing provides a means to organize and facilitate the retrieval of information. Some systems use a single index. The drawback of this index method is obvious: users have to provide information on that particular index in order to retrieve information. For example, if a database stores information about all employees of a company, it can use a single index on names. For such an index system, a user must know an employee's name in order to retrieve information about him or her. Any other information will not help. It is useful to have more than one index; however, the trade-off is disk space.

Indexes for this knowledge base system are deliberately designed to provide users with as many ways to retrieve conceptual graphs as possible. The system uses a triple index system. It constructs the indexes based on "concept type", "name designator of a referent" and "relation type". The indexes are organized into a B-tree structure. Each node in the B-tree consists of three objects, namely "data", "pointer" and "addrVector". The "data" determines the position at which the node should be inserted. The "pointer" is an array of pointers that point to its sub-nodes. The "addrVector" is a vector that contains the addresses at which the searched information is stored in the knowledge base. The disk representation of a B-tree node is shown in Figure 5.5.

Number of addresses (integer) | Length of the value of the 'data' (integer) | Value of the 'data' (String) | address value (long integer) | address value (long integer) | ...

Figure 5.5 A 'BTreeNode' object on disk
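The field list of Figure 5.5 can be illustrated with `java.io.DataOutputStream`. The sketch assumes that the fields appear in the order listed (address count, length of the data value, the data value, then the addresses) and that the length is counted in UTF-8 bytes; both are assumptions, and the class `NodeLayoutSketch` is hypothetical rather than the 'BTreeNode' class of the system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class NodeLayoutSketch {
    // Encode as [int address count][int data length][data bytes][long address]...
    static byte[] encode(String data, long[] addresses) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            byte[] key = data.getBytes(StandardCharsets.UTF_8);
            out.writeInt(addresses.length);
            out.writeInt(key.length);
            out.write(key);
            for (long address : addresses) out.writeLong(address);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read back the 'data' value of an encoded node.
    static String decodeData(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            in.readInt();                          // skip the address count
            byte[] key = new byte[in.readInt()];
            in.readFully(key);
            return new String(key, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read back the address list of an encoded node.
    static long[] decodeAddresses(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            long[] addresses = new long[in.readInt()];
            in.readFully(new byte[in.readInt()]);  // step over the data value
            for (int i = 0; i < addresses.length; i++) addresses[i] = in.readLong();
            return addresses;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] node = encode("dog", new long[] {0L, 26L});
        System.out.println(decodeData(node) + " -> " + decodeAddresses(node).length + " addresses");
    }
}
```

Writing the counts before the variable-length parts lets a reader know how many bytes to consume for the data value and how many addresses follow it.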


5.6 Query of Conceptual Graphs

Querying is an important part of a knowledge base system. It is effectively the interface between the user and the system. In designing the query format for this system, we take the following fact into consideration: the user can be either inexperienced or experienced with conceptual graphs. Inexperienced users may not know conceptual graph terms such as concept, relation and referent. For them, natural language is the primary means of communication. Thus, they want to use natural language to query the conceptual graph knowledge base. Experienced users, however, may know conceptual graphs quite well. They may wish to use their own "language", that is, conceptual graphs, to do the query. Such users tend to believe that the results are usually more accurate if they use conceptual graph terminology in the query language.

Based on the above discussion, the system implements two kinds of query:

1. query with natural language, which provides a fuzzy match;

2. query with a conceptual graph, which provides a more exact graph match.

For the first kind of query, a user may enter any information about the conceptual graphs that they want to retrieve. The information may include concept types, relation types and name designators. The system takes the query, searches the indexes for matching elements, and finally reads the conceptual graphs from the knowledge base and prints them on the screen.

For the second kind of query, a user may enter a conceptual graph they want to retrieve. The graph may be entered either from the keyboard or from a file. The system takes two steps to finish the retrieval process. The first step is to take the graph and extract useful information. It then searches the indexes to find all the matching elements. In this step, many graphs may be found. In the second step, the system refines the results by performing graph matching. Thus, only those graphs that represent the same meaning as the input graph will be found.

5.7 Graph Matching Mechanism of the System

Graph matching between conceptual graphs is an essential feature. Different systems provide different types of matching capabilities, often depending on the way that they store the graphs. Applications may require many different forms of matching in order to extract the information required.

Since the main focus of this thesis is to build a conceptual graph knowledge base system, the topic of graph matching is not deeply explored here. This conceptual graph knowledge base system provides users with one form of matching. The matching scheme the system uses is to match concept types, relation types and name designators of concepts. Under this matching scheme, if a user queries the knowledge base system with a conceptual graph, only those graphs that match all three of these components will be retrieved. We chose this matching scheme because it represents the core meaning of most sentences we use. For example:

"Man Arthur loves his dog Molly."

In this sentence, there are two concepts and one relation involved, namely [man: 'Arthur'], [dog: 'Molly'] and the relation (loves). The core meaning of the above sentence is captured by those simple concepts and relations. A sentence like:

"Man Arthur loves dog Molly."

will also be matched to the above sentence according to our matching scheme, and the meaning of the two sentences is indeed the same. But a sentence like:

"Dog Molly loves man Arthur."

will not match since the meaning is quite different.
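A much simplified version of this matching scheme can be sketched in Java. The sketch is an assumption-laden illustration, not the system's matcher: each relation is reduced to a normalized key made of the relation type and its ordered, typed and named arguments, and two graphs are taken to match when their sets of keys are equal. The class `GraphMatchSketch` and its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class GraphMatchSketch {
    // Normalize one relation with its ordered arguments; for example
    // key("loves", "man:Arthur", "dog:Molly") gives "loves(man:arthur,dog:molly)".
    static String key(String relationType, String... typedNames) {
        StringBuilder sb = new StringBuilder(relationType).append('(');
        for (int i = 0; i < typedNames.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(typedNames[i]);
        }
        return sb.append(')').toString().toLowerCase(Locale.ROOT);
    }

    // A graph is represented here only by the set of its relation keys.
    static Set<String> graph(String... relationKeys) {
        Set<String> g = new HashSet<String>();
        for (String k : relationKeys) g.add(k);
        return g;
    }

    // Graphs match when they contain the same relations over the same concept
    // types and name designators, with the arguments in the same order.
    static boolean matches(Set<String> query, Set<String> stored) {
        return query.equals(stored);
    }

    public static void main(String[] args) {
        Set<String> stored = graph(key("loves", "man:Arthur", "dog:Molly"));
        Set<String> same = graph(key("LOVES", "MAN:Arthur", "DOG:Molly"));
        Set<String> reversed = graph(key("loves", "dog:Molly", "man:Arthur"));
        System.out.println(matches(same, stored));
        System.out.println(matches(reversed, stored));
    }
}
```

Because the argument order is part of each key, "Dog Molly loves man Arthur" produces a different key from "Man Arthur loves dog Molly" and the two graphs do not match, while differences in letter case are ignored.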

5.8 Design of the System

Object-oriented design is a way of thinking about software based on abstractions that exist in the real world. Object-oriented design emphasizes the objects and the relationships between these objects. An object incorporates both data structure and behavior. This is in contrast to conventional programming, in which data structure and behavior are only loosely connected. The design of the system is shown in Figure 5.6.

[Class diagram; surviving labels: Search, BTree, BTreeNode, with "contains" and "uses" relationships]

Figure 5.6 The system class diagram.

As we can see, the system consists of six object classes. The class 'CGBase' is the interface between the user and the system. It accepts instructions and inputs from the user. If the user wants to perform management operations on the knowledge base, a 'CGIFLoader' object is created; if the user wants to search or update a graph, a 'Search' object is created. The class 'CGBase' may create and contain one or more 'CGIFLoader' objects. It may also create one or more 'Search' objects.

The class 'CGIFLoader' is the main class of the system. This class is able to create a conceptual graph knowledge base, populate the knowledge base and build indexes on the knowledge base with the help of the 'BTree' class. It also performs file management functions such as "open" and "close", both for the knowledge base file and for the index file. Each 'CGIFLoader' object contains one 'BTree' object, a B-tree structure that holds the index of the knowledge base that this 'CGIFLoader' manages.

The class 'Search' has two main functions. The first is to find the matched graphs and print them on the standard output. The second is to update an existing graph at the user's request. This class gets the 'BTree' object from the 'CGIFLoader', searches through the index tree to find the addresses of the matched graphs, then randomly accesses the knowledge base file, reads the matched graphs into memory and prints them on the standard output. If the user wants to update a graph, it allows the user to enter the new graph and inserts the new graph into the knowledge base. At the same time, it updates the index.

The class 'BTree', along with 'BTreeNode' and 'BTreeElement', forms the index structure. This structure is able to read the index file from the disk and build the index structure automatically. It is also able to write itself to the disk. Each 'BTree' object consists of zero or more objects of the 'BTreeNode' class, and each 'BTreeNode' object consists of exactly one object of the 'BTreeElement' class.
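The role of the index can be sketched as follows. This is a hedged illustration, not the thesis code: the class name is hypothetical, and java.util.TreeMap (a red-black tree) is used purely as a stand-in for the 'BTree', 'BTreeNode' and 'BTreeElement' classes. It assumes that each index key is a string and that each key maps to the byte offsets of the graphs in the knowledge base file that contain it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for the BTree/BTreeNode/BTreeElement trio.
final class GraphIndexSketch {

    // key (for example a concept type or a name) -> offsets of graphs in the KB file
    private final TreeMap<String, List<Long>> addressesByKey = new TreeMap<>();

    // Record that the graph stored at 'address' contains 'key'.
    void insert(String key, long address) {
        List<Long> addresses = addressesByKey.computeIfAbsent(key, k -> new ArrayList<>());
        if (!addresses.contains(address)) { // keep each address once per key
            addresses.add(address);
        }
    }

    // Offsets of every graph indexed under 'key'; empty if the key is unknown.
    List<Long> lookup(String key) {
        List<Long> addresses = addressesByKey.get(key);
        if (addresses == null) {
            return Collections.<Long>emptyList();
        }
        return Collections.unmodifiableList(addresses);
    }
}
```

In a design like the one described, the offsets returned by a lookup would then be passed to `RandomAccessFile.seek` on the knowledge base file so that only the matching graphs are read from disk.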

CHAPTER 6

An Example

6.1 The Example

This example illustrates how the conceptual graph knowledge base management system works. The example is an abstract of an article from the "Proceedings of Thirteenth International Conference on Data Engineering". We chose this example because it is an ordinary article written in English with no special features of its own; it is typical of most English articles. This means the conceptual graph knowledge base management system can be applied to any ordinary English article.

Here is the article:

Object-oriented database systems (OODBMS) offer powerful modelling concepts as required by advanced application domains like CAX or office automation. Typical applications have to handle large and complex structured objects which frequently change their value and their structure. As the structure is described in the schema of the database, support for schema evolution is a highly required feature. Therefore, a set of schema update primitives must be provided which can be used to perform the required changes, even in the presence of populated databases and running applications.

In this paper, we use the versioning approach to schema evolution to support schema updates as a complex design task. The presented propagation mechanism is based on conversion functions that map objects between different types and can be used to support schema evolution and schema integration.

6.2 Building of Conceptual Graphs from the Example

In order to capture the meaning of this article, we build seven conceptual graphs based on

the article. Here are the conceptual graphs:

Figure 6.1 [conceptual graph; surviving labels: OODBMS, Offer, Object, Modelling-concepts, Require, Application-domain]

Figure 6.2 [conceptual graph; surviving labels: Large, Complex, Structured, Structure, Value, Change]

Figure 6.3 [conceptual graph; surviving labels: Schema, Describe, Database, Object, Structure]

Figure 6.4 [conceptual graph; surviving labels: Support, Schema-evolution, Feature, Required]

Figure 6.5 [conceptual graph; surviving labels: Schema-update-primitive, Perform, Object, Change, Condition, Required]

Figure 6.6 [conceptual graph; surviving labels: Approach, Target, Agent, Object, Character, Complex]

Figure 6.7 [conceptual graph; surviving labels: Object, Agent, Character, Different, Type]

6.3 Conceptual Graphs in CGIF Format

Since the conceptual graph knowledge base system only accepts conceptual graphs in CGIF format, we also create the CGIF form of the above conceptual graphs, as follows:

1. [Offer *x] (Agent ?x [OODBMS]) (Object ?x [Modelling-Concepts *u]) [Require *y] (Agent ?y [Application-domain]) (Object ?y?u)

2. [Handle *m] [Large *o] [Complex *p] [Structured *q] [Change *t] (Agent ?m [Application]) (Object ?m [Object *n]) (Character ?n?o) (Character ?n?p) (Character ?n?q) (Have ?n [Value *s]) (Have ?n [Structure *x]) (Character ?s?t) (Character ?x?t)

3. [Structure *x] [Describe *y] [Schema *z] [Database *u] (Agent ?y?z) (Object ?y?x) (In ?z?u)

4. [Support *u] [Schema-evolution *v] [Require *x] [Feature *y] (Object ?u?v) (Definition ?u?y) (Condition ?y?x)

5. [Schema-update-primitive *u] [Provide *v] [Perform *w] [Change *x] [Populated *y] [Database *z] [Running *p] [Application *q] [Required *o] (Object ?v?u) (Agent ?w?u) (Object ?w?x) (Condition ?x?o) (Condition ?w?z) (Condition ?w?q) (State ?z?y) (State ?q?p)

6. [Versioning *u] [Approach *v] [Schema-evolution *w] [Support *x] [Schema-update *y] [Complex *z] [Design-task *s] (Method ?v?u) (Target ?v?w) (Agent ?x?v) (Object ?x?y) (Character ?s?z) (Classify ?s?x)

7. [Present *p] [Propagation-mechanism *q] [Conversion-functions *s] [Map *t] [Object *u] [Different *v] [Type *w] [Be-used *x] [Support *y] [Schema-evolution *z] [Schema-integration *r] (State ?q?p) (Based-on ?q?s) (Agent ?t?s) (Object ?t?u) (Character ?u?w) (State ?w?v) (Patient ?x?q) (Target ?x?y) (Object ?y?z) (Object ?y?r)

We also populate the system with other CGIFs:

1. CGIF for Figure 3.1:

[Person *x'John'] [Reading *y] (Agent ?y?x) (Object ?y [Newspaper 'Toronto-Star']) (Instrument ?y [Microfiche-Reader])

2. CGIF for Figure 3.4:

[Man *x'John'] [Eat *y] [Fast *z] [Apple *u] (Agent ?y?x) (Manner ?y?z) (Object ?y?u)

3. CGIF for "Dog Molly loves bone thrown by man Arthur.":

[Dog *x'Molly'] [Man *y'Arthur'] (Loves ?x [Bone *z]) (Throwed-by ?y?z)

6.4 Searching of Conceptual Graphs

We first create a knowledge base and populate it with the above ten conceptual graphs. We then search for conceptual graphs. Two kinds of search mechanism are available: fuzzy (loose) match and knowledge based (exact) match.

6.4.1 Fuzzy Match

For fuzzy match, the search result is based on the mapping of individual concepts and relations. We enter the queries:

1. Support + schema-evolution

Result: conceptual graphs #4, #6 and #7 are returned.

2. Support + schema-evolution + schema-integration

Result: only conceptual graph #7 is returned.

As we can see, the search results depend entirely on matches of individual concepts and relations.
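A minimal Java sketch of this kind of term-based search is given here. It is an illustration under stated assumptions rather than the thesis implementation: the class name is hypothetical, each stored graph is reduced to the set of its concept and relation labels, comparison is case-insensitive, and '+' is taken to mean that every term must be present, which is consistent with the two results above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the thesis code.
final class FuzzyMatchSketch {

    // graph number -> lower-cased concept and relation labels found in that graph
    private final Map<Integer, Set<String>> labelsByGraph = new LinkedHashMap<>();

    // Register a graph by number together with the labels it contains.
    void add(int graphNumber, String... labels) {
        Set<String> normalized = new HashSet<>();
        for (String label : labels) {
            normalized.add(label.toLowerCase(Locale.ROOT));
        }
        labelsByGraph.put(graphNumber, normalized);
    }

    // Query terms are joined with '+'. A graph is returned only if it
    // contains every term; results come back in insertion order.
    List<Integer> search(String query) {
        Set<String> terms = new HashSet<>();
        for (String term : query.split("\\+")) {
            terms.add(term.trim().toLowerCase(Locale.ROOT));
        }
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, Set<String>> entry : labelsByGraph.entrySet()) {
            if (entry.getValue().containsAll(terms)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Loading the labels of graphs #1, #4, #6 and #7 from Section 6.3 into such a structure and searching for "Support + schema-evolution" selects #4, #6 and #7; adding "schema-integration" narrows the result to #7.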

6.4.2 Knowledge Based Match

For knowledge based match, the search result is based on the matching scheme discussed in Section 5.7. Two graphs are considered to match only when both graphs represent the same meaning. Let's look at the queries:

Query 1: Find information about object-oriented database systems that offer modelling concepts which are required by application domains.

From this input query, we can build a conceptual graph query:

[OODBMS *o] [Offer *p] [Modelling-Concepts *q] [Require *r] [Application-domain *s] (Agent ?p?o) (Object ?p?q) (Agent ?r?s) (Object ?r?q)

Using this query to query the system, we get graph #1 as the result.

Query 2: Find information concerning support for schema evolution as a required feature.

Now we build a conceptual graph query:

[Support *x] (Object ?x [Schema-evolution]) (Definition ?x [Feature *y]) (Condition ?y [Require])

This query yields graph #4.

Query 3: Find information about using the versioning approach to schema evolution that supports schema updates as a design task.

We build a conceptual graph query as follows:

[Approach *v] (Method ?v [Versioning]) (Target ?v [Schema-evolution]) (Agent [Support *x] ?v) (Object ?x [Schema-update]) [Design-task *s] (Character ?s [Complex]) (Classify ?s?x)

Now the system returns graph #6.

We can enter queries for the other graphs in the same way and retrieve exactly the graph we want to find.

When we query the knowledge base, only one CG that matches the query CG is returned for each query. We can see that the query CGs are all different from the CGs stored in the system. Yet the system can still find the matching one. This is because, although the query CGs and the CGs in the system look different, they all represent the same meaning. It is reasonable not to expect all people to derive the same set of conceptual graphs from an article, but we can expect people to write different sets of conceptual graphs that represent the same meaning based on a given article. For a detailed test example, please refer to Appendix B.

6.5 Conclusion

In this example, the fuzzy search provides a looser match. It is suitable for users who are trying to find articles on a set of similar topics. It is also designed for inexperienced conceptual graph users, since the queries are written in English words. If the user wants to find a specific article, he/she can use the exact match.

The knowledge based match is based on core graph matching and is designed for experienced conceptual graph users, since it accepts only conceptual graphs as its query language. It is much more accurate than the fuzzy match. The system yields very high accuracy in searches with knowledge based graph matching. However, in the real world it may not be able to search conceptual graphs with such high accuracy. The major factors that largely affect the search accuracy will be:

1. a person's understanding of the article from which he/she derives conceptual graphs

2. the ability to correctly derive conceptual graphs from the article

To solve the second problem, we first need a standard for conceptual graphs. With a conceptual graph standard, we will be able to derive conceptual graphs from texts in a consistent way. Second, we need software that is able to automatically build conceptual graphs from a given text, since conceptual graphs derived by humans inherently lack consistency. One text can be represented by different conceptual graphs by different people even though they all follow the standard.

As we can see from the example, the system is able to find the right conceptual graph even if the input conceptual graph does not look exactly the same as the conceptual graph stored in the knowledge base. This verifies that the system does allow different representations of conceptual graphs as long as the two conceptual graphs represent the same meaning.

CHAPTER 7

Summary

7.1 Summary

Conceptual structures, as developed by Sowa, are a very rich knowledge representation language intended to incorporate many concepts in natural and formal languages. The conceptual graphs stored in computers must be made more readily accessible and manageable. This thesis serves that purpose. The thesis has explored and demonstrated some of the issues involved in building a conceptual graph knowledge base system with object-oriented design and programming technology. Such issues include knowledge representation methods, modelling and encoding techniques for natural languages, knowledge retrieval techniques and knowledge base indexing techniques. Object-oriented technology has increasingly been applied in modelling and building complex information and knowledge systems. Thus, some object-oriented design and modelling techniques are also presented.

7.2 Conclusion

The research in this thesis demonstrates a knowledge base system that is able to encode knowledge in conceptual graphs and also demonstrates the use of that knowledge through a simple query language. The conceptual graph is a promising technique in knowledge representation, especially in natural language structuring and encoding. It is a graphical language designed for the interchange of knowledge between humans and computers. It can be processed effectively by a computer. It can also be stored and managed by an object-oriented knowledge base system; in fact, object-oriented technology is a more natural fit for conceptual graphs than a traditional relational database.

The accuracy in retrieving conceptual graphs stored in the knowledge base system largely depends on the consistency in the building of query CGs and the CGs in the knowledge base. Theoretically, if we guarantee 100% consistency in the building of query CGs and the CGs in the knowledge base, the system will yield high accuracy in the retrieval of CGs. The results of our experiment confirm this accuracy.

7.3 Future Work

The conceptual graphs used in the thesis project are in CGIF. However, a conceptual graph can be expressed in many formats, such as display form (DF), a graphic format that is much easier for humans to understand but difficult for computers to process, and linear form (LF), a more compact notation than the display form. Future research can be expected to build and include a "conceptual graph format translator", which takes in one of these formats and translates it to any of the other formats. Thus the conceptual graph knowledge base management system, in collaboration with the translator, will be able to store and manage conceptual graphs in any of these formats.

Future research can also be expected to build conceptual graph models of legacy databases, use data mining techniques for knowledge extraction, and explore new algorithms and techniques for CG matching. Then, instead of providing just knowledge based match, the system will also provide different layers of graph matching; such matches can be based on sub-graphs, individual concepts, relations, concept or relation types, etc.

Appendix:

A. Object Diagrams of the System

openedDBFileVector: Vector
loaderTable: Hashtable

Figure 1. Class diagram for 'CGBase'

parseGraph(KnowledgeBase, TranslationContext, String)

cgifloader: CGIFLoader
key: Vector
dbFile: RandomAccessFile
root: BTreeNode
resultvector: Vector
search()
findMatch(Vector, Vector)
printGraph(Vector)
update(Vector, int)
Insert(String, long)
deleteOld(long, long)

Figure 3. Class diagram for 'Search'

BTree
root: BTreeNode
size: integer
fileName: String
indexFile: File
insert(BTreeElement)
readIndex()

data: BTreeElement
addrvector: Vector
left: BTreeElement
right: BTreeElement
totalElements: int
insert(BTreeElement, long, long)
isNew(Vector, long, long)
delete(long, long)

Figure 5. Class diagram for 'BTreeNode'

BTreeElement
value: String

Figure 6. Class diagram for 'BTreeElement'

B. Detailed Example Test Script

1. Create a Knowledge Base

We run the system by typing: java CGBase at the prompt. The following menu appears:

* 1. Create a new conceptual graph knowledge base *
* 2. Open an existing conceptual graph knowledge base *
* 3. Load a conceptual graph knowledge base *
* 4. Search the conceptual graph knowledge base *
* 5. Update a conceptual graph *
* 6. Show all conceptual graphs *
* 7. Close a knowledge base *
* 8. Close all knowledge base *
* 0. Exit *

Your selection is: 1

Please enter a knowledge base name: example

2. Populate the knowledge base

After we enter a knowledge base name, the above main menu appears again.

Your selection is: 3

Please enter the knowledge base name you want to populate: example

Please enter the file name from which to load the CGIF: cg1

(we assume a conceptual graph is already saved in the file: cg1)

If successful we will see: Populate knowledge finished.

If the conceptual graph is not in the right format we will see: the input conceptual graph is not in right formatter.

If the input file does not exist we will see: The input file does not exist.

3. Search the Conceptual Knowledge Base

Your selection is: 4

Please enter the knowledge base name in which you want to search: example

* 1. Fuzzy search *
* 2. Knowledge based search *

We perform exact search by entering: 2

* 1. Enter a graph from keyboard *
* 2. Enter a graph from a file *

If we choose to enter a graph from the keyboard: 1

Please enter the graph in CGIF formatter:

[OODBMS *o] [Offer *p] [Modelling-Concepts *q] [Require *r] [Application-domain *s] (Agent ?p?o) (Object ?p?q) (Agent ?r?s) (Object ?r?q)

Assume the knowledge base is already populated with the ten conceptual graphs mentioned in Chapter 6. Then the system will display the search results:

There is 1 graph(s) found to be match:

1. [Offer *x] (Agent ?x [OODBMS]) (Object ?x [Modelling-Concepts *u]) [Require *y] (Agent ?y [Application-domain]) (Object ?y?u)

C. Reference:

Finnegan Southey and Jim G. Linders, 1999
Notio -- A Java API for developing CG tools. ICCS 1999.

T. Mueck, M. Polaschek, 1997
The Multikey Type Index for Persistent Object Sets. The 13th International Conference on Data Engineering, ICDE97.

K. Peltonen, 1997
Adding Full Text Indexing to the Operation System. The 13th International Conference on Data Engineering, ICDE97.

A. Sistla, O. Wolfson, S. Chamberlain, S. Dao, 1997
Modeling and Querying Moving Objects. The 13th International Conference on Data Engineering, ICDE97.

D. Konopnicki, O. Shmueli, 1997
W3QS -- A system for WWW Querying. The 13th International Conference on Data Engineering, ICDE97.

David W. Embley, 1997
Object Database Development, Concepts and Principles. Addison Wesley Longman, Inc.

Douglas K. Barry, 1996
The Object Database Handbook: How to Select, Implement, and Use Object-Oriented Databases. Katherine Schowalter Press.

J. D. Ullman, 1988
Principles of Database and Knowledge Base Systems. Vol. 1. Computer Science Press, 1988.

J. D. Ullman, 1998
Principles of Database and Knowledge Base Systems. Vol. 2. Computer Science Press, 1988.

G. M. White, 1990
Natural Language Understanding and Speech Recognition. Communications of the ACM, Vol. 33, August 1990.

J. F. Sowa, 1984
Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, MA, 1984.

J. F. Sowa, 1990
Knowledge representation in Database, Expert system, and Natural Language. Artificial Intelligence in Database and Information System, edited by R. A. Meersman et al., North-Holland, 1990.

Timothy E. Nagle, Janice and Laurie L. Gerholz, 1992
Conceptual Structures: Current Research and Practice. First published by Ellis Horwood Limited, 1992. Edited by Timothy E. Nagle, Janice and Laurie L. Gerholz.

E. R. Tello
Object-Oriented Programming for Artificial Intelligence: A Guide to Tools and System Design. Addison-Wesley.

W. Kim, 1991
Object-Oriented Database System: Strengths and Weakness. Journal of Object-Oriented Programming, July 1991.

J. C. Giarratano, 1989
Expert Systems: Principles and Programming. PWS-KENT Pub. Co., Boston, 1989.

K. R. Dittrich, 1986
Object-Oriented Database System: The Notion and the Issues. Proceedings of International Workshop on Object-Oriented Database System, edited by K. R. Dittrich and U. Dayal. IEEE Computer Society Press, 1986.

Guy W. Mineau, Bernard Moulin, John F. Sowa, 1993
Conceptual Graphs for Knowledge Representation. First International Conference on Conceptual Structures, ICCS'93, Quebec City, Canada, August 1993. Proceedings. Edited by G. Goos and J. Hartmanis.

William M. Tepfenhart, Judith P. Dick, John F. Sowa, 1994
Conceptual Structures: Current Practices. Second International Conference on Conceptual Structures, ICCS'94, College Park, Maryland, USA, August 1994. Proceedings.

Gerard Ellis, Robert Levinson, William Rich, John F. Sowa, 1995
Conceptual Structures: Applications, Implementation and Theory. Third International Conference on Conceptual Structures, ICCS'95, Santa Cruz, CA, USA, August 1995. Proceedings.

Dickson Lukose, Harry Delugach, Mary Keeler, Leroy Searle, John Sowa, 1997
Conceptual Structures: Fulfilling Peirce's Dream. Fifth International Conference on Conceptual Structures, ICCS'97, Seattle, Washington, USA, August 1997. Proceedings.

G. Salton, M. Lesk, 1986
Computer Evaluation of Indexing and Text Processing. ACM, Vol. 29, No. 7, July 1986.

M. L. Mauldin, 1991
Conceptual Information Retrieval: A Case Study in Adaptive Partial Parsing. Kluwer Academic Publishers, Norwell, MA, 1991.

C. Faloutsos, H. V. Jagadish, 1992
B-Tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, BC, August 1992.

C. Faloutsos, H. V. Jagadish, 1992
Hybrid Index Organizations for Text Database. EDBT '92, March 1992.

P. Chen, 1985
Entity-Relationship Approach: The Use of the ER Concept in Knowledge Representation. North-Holland, Amsterdam, 1985.