Context-based ontology building support in clinical domains using formal concept analysis


TECHNICAL COMMUNICATION


Guoqian Jiang a,*, Katsuhiko Ogasawara b, Akira Endoh a, Tsunetaro Sakurai a

a Department of Medical Informatics, Hokkaido University Graduate School of Medicine, North 15, West 7, Kita-ku, Sapporo 060-8638, Japan
b Department of Radiological Technology, Hokkaido University College of Medical Technology, Sapporo, Japan

Received 18 March 2003; received in revised form 19 May 2003; accepted 2 June 2003

KEYWORDS: Knowledge representation; Information retrieval; Formal concept analysis; Natural language processing; Medical records; Knowledge acquisition

Summary Objective: Ontology in clinical domains is becoming a core research field in medical informatics. The objective of this study is to explore the potential role of formal concept analysis (FCA) in context-based ontology building support in a clinical domain (here, cardiovascular medicine). Methodology: We developed an ontology building support system that integrates an FCA module with a natural language processing (NLP) module. The user interface of the system was developed as a Protégé-2000 Java tab plug-in. A collection of 368 textual discharge summaries and a standard dictionary of Japanese diagnostic terms (MEDIS ver2.0) were used as the main knowledge sources. A preliminary evaluation was conducted to show the usefulness of the system. Results: MEDIS-based medical concept extraction was stable and highly precise. Of the compound medical phrases extracted, 73 ± 14% (mean ± S.D.) were sufficiently meaningful to form a medical concept from a clinical perspective. Also, 57.7% of the attribute implication pairs (i.e. medical concept pairs) extracted were identified as positive from a clinical perspective. Conclusion: Under the framework of our ontology building support system using FCA, clinical experts could access a large body of both linguistic information and context-based knowledge that was demonstrated to be useful in supporting their ontology building tasks.
© 2003 Elsevier Ireland Ltd. All rights reserved.

*Corresponding author. Tel.: +81-11-706-6017; fax: +81-11-700-5608. E-mail address: [email protected] (G. Jiang).

International Journal of Medical Informatics (2003) 71, 71–81. www.elsevier.com/locate/ijmedinf
1386-5056/03/$ - see front matter © 2003 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/S1386-5056(03)00092-3

1. Introduction

Our previous studies on clinicians’ information needs in Japan have shown that clinicians strongly wish to access and retrieve patient-related data from computerized hospital information systems for their clinical practice and clinical research [1,2]. Although the computerized patient record (CPR) system has commonly been proposed as the long-term solution to these problems, how to establish comprehensive patient record data with adequate support functions is still an open question. In particular, with the advance of evidence-based medicine in recent years, how to develop evidence-based decision support tools that provide relevant and up-to-date evidence to clinicians has attracted much attention in the medical informatics community [3,4]. Some studies have demonstrated that computer-generated, concept-oriented methods using a knowledge-based system can reduce clinicians’ information overload and improve the accuracy of clinical data retrieval [5,6]. In the past decade, the emergence of second-generation knowledge-based systems has provided more explicit and more maintainable frameworks for encoding and applying clinical knowledge [7,8]. Knowledge management in a knowledge-based system involves knowledge acquisition, as well as its organization, structuring, refinement and distribution to domain specialists, which can be addressed by means of ontologies [9].

Ontologies are defined in the literature in various ways. One prevailing definition of an ontology is a specification of a conceptualization that is designed for reuse across multiple applications [10,11]. Here we adopt the definition used by Stanford Medical Informatics, i.e. an ontology is a model of a particular field of knowledge: the concepts and their attributes, as well as the relationships between the concepts [12]. Well-defined medical concepts (i.e. a medical ontology) can be applied to the task of constructing computable models for medical domains in order to build the knowledge-representation system in the domain [13–16]. Ontology in clinical domains is becoming a core research field in medical informatics [8,17,18]. Building formal specialized clinical terminologies (the ontologies) is a difficult and time-consuming task. Natural language processing (NLP) tools have already contributed extensively to the construction of ontologies [19,20]. Morphological knowledge obtained from NLP has long been acknowledged as important for medical language processing, medical information indexing and ontology development [21,22]. Besides linguistic knowledge, researchers in knowledge engineering and design domains have recognized the importance of the situation in which the expert acts, i.e. the context in which the expert applies the knowledge [23]. Formal concept analysis (FCA) has been advocated for representing and processing context knowledge in clinical domains, for example in describing patient cases, interpreting therapeutic decisions and representing rules [24].

In this paper, we describe an ontology building support system for a clinical domain that integrates linguistic knowledge with domain-specific context knowledge using FCA. The objective of the study is to explore the potential role of FCA in context-based ontology building support in a clinical domain (here, cardiovascular medicine).

The structure of this paper is as follows. In Section 2, we introduce the main knowledge sources used in our system. In Section 3, we describe the system construction, which includes an NLP module and an FCA module. In Section 4, we present the design of a preliminary evaluation of the system’s usefulness. In Section 5, we describe the major results of the system evaluation. In Section 6, we discuss the issues of ontology building support and the limitations of the study. Section 7 contains concluding remarks and future directions.

2. Materials

The full text of discharge summaries from the domain of cardiovascular medicine was taken as a sample corpus for a clinical domain. The corpus is made up of 368 consecutive discharge summaries from 2001 from the Department of Cardiovascular Medicine, Hokkaido University Hospital. Patient names were carefully removed from the discharge summaries to protect privacy. A tokenizer that we programmed was used to transform the corpus into an XML file with the structure shown in Fig. 1. This knowledge source was mainly used for evaluating our ontology building support system in this clinical domain.

A standard dictionary of Japanese diagnostic terms, MEDIS version 2.0, released by MEDIS-DC Japan [25], was used for our domain-dependent terminology extraction tool described in the following section. The MEDIS dictionary includes the standard headings, the codes for electronic exchange, the ICD-10 codes, the synonyms, the compatible terms and the acceptable modifiers [26]. All information was organized into three tables: a MAIN table with 18 805 terms, a MODIFIER table with 1158 terms and a KEYWORD table with 64 181 terms.

3. System construction

Our ontology building support system includes an NLP module that provides linguistic knowledge and an FCA module that provides context knowledge. Fig. 2 shows the main architecture of the system. The user interface was developed as a Protégé-2000 tab plug-in titled ‘FCA Ontology Builder’ that integrates with the Protégé-2000 ontology editing environment through the Protégé knowledge model API [27]. The linguistic knowledge obtained from the NLP module and the context knowledge from the FCA module are presented to domain experts in three tabbed panes, as shown in Figs. 4–6.

3.1. Natural language processing module

A Japanese morphological analysis system, ChaSen version 2.1, developed by Nara Institute of Science and Technology and distributed as freeware, was used as the core component of the NLP module [28]. ChaSen is a non-Java standalone system, so a Java interface named Morphological Analyzer Connectivity Driver-model (MACD), developed at the same institute as ChaSen, was used to connect ChaSen with the Protégé-2000 platform [29].

On this basis, a domain-dependent term recognition method was realized as follows. Since ChaSen allows a user-defined dictionary to be added, we automatically transformed the three tables of MEDIS version 2.0 into the ChaSen dictionary format with a Java program that we wrote. We obtained three medical dictionaries with the extension *.dic and put them into the dictionary directory. The parts of speech of the terms in these dictionaries were defined as Noun-Diagnostic Term-Basic, Noun-Diagnostic Term-Modifier and Noun-Diagnostic Term-Keyword, respectively. In addition, the weights of the diagnostic terms were set so that, among the three dictionaries, terms from the MAIN table received the highest priority and terms from the KEYWORD table the lowest. Following the rules for dictionary creation in ChaSen, we modified the part-of-speech definition file and added the three newly defined part-of-speech names under the sublevel of ‘Noun’. Finally, all dictionaries were recompiled using the ChaSen dictionary compiler. ChaSen could then recognize the medical terms in our newly defined dictionaries in the same way as those in the original dictionaries.

Compound words are an important characteristic of Japanese, and a compound word may represent a new concept different from its atom words [30,31], so a simple heuristic algorithm was introduced to extract compound nouns from a given text. Using part-of-speech information from morphological processing, a compound noun (or phrase) is formed from more than one noun occurring consecutively in the text. When at least one of the atom words has the part-of-speech ‘Noun-Diagnostic Term’, we describe the compound noun as a compound medical phrase.
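The heuristic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the token format, the part-of-speech strings and the function name `extract_compound_nouns` are hypothetical and are not taken from the system itself, and English tokens joined by spaces stand in for Japanese morphemes.

```python
from typing import List, Tuple

# Hypothetical part-of-speech prefix; the actual ChaSen tag strings may differ.
DIAG_PREFIX = "Noun-Diagnostic Term"


def extract_compound_nouns(tokens: List[Tuple[str, str]]) -> List[Tuple[str, bool]]:
    """Return (phrase, is_medical) for every run of two or more consecutive nouns.

    tokens is a list of (surface, part_of_speech) pairs, as a morphological
    analyzer might produce them.
    """
    results: List[Tuple[str, bool]] = []
    run: List[Tuple[str, str]] = []

    def flush() -> None:
        # Only runs with more than one noun count as compounds.
        if len(run) > 1:
            phrase = " ".join(surface for surface, _ in run)
            is_medical = any(pos.startswith(DIAG_PREFIX) for _, pos in run)
            results.append((phrase, is_medical))
        run.clear()

    for surface, pos in tokens:
        if pos.startswith("Noun"):
            run.append((surface, pos))
        else:
            flush()
    flush()  # handle a noun run that ends the text
    return results


# Invented example tokens.
tokens = [
    ("patient", "Noun-General"),
    ("had", "Verb"),
    ("exertional", "Noun-General"),
    ("dyspnea", "Noun-Diagnostic Term-Basic"),
    ("and", "Conjunction"),
    ("blood", "Noun-General"),
    ("pressure", "Noun-General"),
]
compounds = extract_compound_nouns(tokens)
```

In this toy input, a single noun on its own is ignored, and only the run that contains a diagnostic term is flagged as a compound medical phrase.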

3.2. Formal concept analysis (FCA) module

FCA is a mathematical approach to data analysis based on lattice theory. It provides a way to identify groupings of objects with shared properties [32]. A complete description of FCA is given in [33].
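For readers unfamiliar with FCA, the following minimal sketch shows the two derivation operators and a brute-force enumeration of formal concepts on a toy context. The documents and attributes are invented, the function names are our own, and the enumeration is exponential in the number of objects, so it illustrates the definition rather than a practical algorithm.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

# Toy formal context: objects (documents) mapped to their attributes (concepts).
# The data are invented for illustration only.
context: Dict[str, Set[str]] = {
    "doc1": {"chest pain", "angina pectoris"},
    "doc2": {"chest pain", "angina pectoris", "diabetes mellitus"},
    "doc3": {"chest pain"},
}


def common_attributes(objects: Set[str]) -> FrozenSet[str]:
    """Attributes shared by all given objects (all attributes if the set is empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def common_objects(attributes: FrozenSet[str]) -> FrozenSet[str]:
    """Objects that have every given attribute."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)


def formal_concepts() -> Set[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    objs = list(context)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            intent = common_attributes(set(subset))
            extent = common_objects(intent)
            concepts.add((extent, intent))
    return concepts


all_concepts = formal_concepts()
```

Each resulting pair groups a maximal set of documents with the maximal set of attributes they share, which is the "grouping of objects with shared properties" referred to above.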

The FCA module itself was developed using an ontology approach based on the Protégé-2000 system. First, we constructed an ontology to express the basic components of the FCA method, including formal object, formal attribute, formal context and formal concept. The ontology was developed as an independent Protégé-2000 project, and users can include it in their own projects conveniently by employing the Protégé-2000 ‘include projects’ function. In Protégé-2000, an ontology is represented as a set of classes with their associated slots. Fig. 3 shows the class hierarchy of the FCA ontology.

Fig. 1 The XML file structure of discharge summary.

Fig. 2 The system architecture.

Fig. 3 The class hierarchy of formal concept analysis ontology made by OntoViz plug-in of Protégé-2000.

Here the class Medical Concept represents the medical concept extracted by the NLP module, and its slots represent the linguistic information, including its name, pronunciation and part of speech. The class is also used to represent the concept of formal attribute. The class Medical Document represents the medical documents used in NLP processing, and its slots include its name and content. The class is also used to represent the concept of formal object. The class Formal Context represents the concept of formal context, and its slots represent a binary relation between the class Medical Document and the class Medical Concept. The class Formal Concept represents the concept of formal concept, and its slots represent the extent objects, the intent attributes, the label attributes in the concept lattice, and the label attributes of its direct super concepts and sub concepts.
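As a rough illustration of this data model, the four classes and their slots could be mirrored as plain Python data classes. This is a sketch only: the system itself stores them as Protégé-2000 classes and slots, and the field names, method names and example values below are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class MedicalConcept:
    """Formal attribute: a medical concept extracted by the NLP module."""
    name: str
    pronunciation: str
    part_of_speech: str


@dataclass(frozen=True)
class MedicalDocument:
    """Formal object: a document processed by the NLP module."""
    name: str
    content: str


@dataclass
class FormalContext:
    """Binary relation between documents and the concepts they mention."""
    incidence: Set[Tuple[MedicalDocument, MedicalConcept]] = field(default_factory=set)

    def relate(self, document: MedicalDocument, concept: MedicalConcept) -> None:
        self.incidence.add((document, concept))

    def concepts_of(self, document: MedicalDocument) -> Set[MedicalConcept]:
        return {c for d, c in self.incidence if d == document}


@dataclass(frozen=True)
class FormalConcept:
    """Extent, intent and the lattice labels described in the text."""
    extent: FrozenSet[MedicalDocument]
    intent: FrozenSet[MedicalConcept]
    labels: FrozenSet[str] = frozenset()
    super_concept_labels: FrozenSet[str] = frozenset()
    sub_concept_labels: FrozenSet[str] = frozenset()


# Placeholder values, not taken from the corpus.
doc = MedicalDocument(name="summary-001", content="(text)")
angina = MedicalConcept("angina pectoris", "(reading)", "Noun-Diagnostic Term-Basic")
fc = FormalContext()
fc.relate(doc, angina)
```

Making the document and concept classes immutable lets them be stored in sets, which matches their roles as formal objects and formal attributes.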

Secondly, knowledge acquisition was carried out to fill the slot values of each class using the information from the NLP module. This could be done both automatically and manually. In Fig. 4, a batch importing button named ‘Begin importing’ automatically imports the medical concepts with the part-of-speech ‘Noun-Diagnostic term’ and all medical documents as instances of the corresponding classes, and sets their binary relations as well. Users can also select individual concepts, such as those from the compound noun list, and import them one by one by drag-and-drop or button-clicking (the button is not shown in Fig. 4). The ‘Create Formal Concept’ button calculates the formal concepts according to the instances of the class Formal Context. Finally, the knowledge from the FCA was represented as a list of attribute implication pairs and a lattice diagram, called a Hasse diagram, that shows the local context related to a specific medical concept (a formal attribute) (see Figs. 5 and 6).
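One possible reading of an attribute implication pair is sketched below. We assume, without confirmation from the text, that a pair (super attribute, sub attribute) is listed whenever every document containing the sub attribute also contains the super attribute; the function name and the toy context are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def implication_pairs(context: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (super_attribute, sub_attribute) pairs.

    A pair is emitted when every document containing sub_attribute also
    contains super_attribute (an assumed reading of "attribute implication").
    """
    # Invert the context: attribute -> set of documents mentioning it.
    extent: Dict[str, Set[str]] = {}
    for doc, attrs in context.items():
        for attr in attrs:
            extent.setdefault(attr, set()).add(doc)

    pairs = []
    for sub in sorted(extent):
        for sup in sorted(extent):
            if sub != sup and extent[sub] <= extent[sup]:
                pairs.append((sup, sub))
    return pairs


# Invented toy context for illustration.
toy_context = {
    "doc1": {"chest pain", "myocardial infarction"},
    "doc2": {"chest pain"},
    "doc3": {"chest pain", "intercostal neuralgia"},
}
pairs = implication_pairs(toy_context)
```

Under this reading, two attributes with identical extents would produce a pair in both directions; whether the system lists such pairs once or twice is not stated.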

4. System evaluation

The goal of our system is to support ontology building in a clinical domain, so the opinions of clinical experts are critically important. As an example, the textual discharge summaries selected from the Department of Cardiovascular Medicine were processed by our system, and five clinicians (two from cardiology, one from nuclear medicine, one from psychiatry and one from surgery) evaluated the results.

A preliminary evaluation was designed in two parts: one for the usefulness of the linguistic information extracted by the NLP module of our system, the other for the usefulness of the attribute implication pairs obtained from FCA.

Fig. 4 The linguistic information display in Protégé-2000. The 20 diagnostic term concepts with the highest frequencies were displayed in a table titled ‘Noun List’, including (from the bottom of the table) chest pain, heart failure, palpitation, angina pectoris, arrhythmia, allergy, edema, shortness of breath, diabetes mellitus, heart murmur, anemia, hyperlipidemia, cardiac enlargement, syncope, dyspnea, jaundice, bradycardia, atrial fibrillation, ischaemic heart disease, lymphadenopathy.

Fig. 5 Context information display by formal concept lattice in Protégé-2000. The dotted lines point out the English concepts translated from the corresponding Japanese concepts.

Fig. 6 Attribute implication information display in Protégé-2000.

4.1. Usefulness of linguistic information

Firstly, in order to verify the performance of the NLP module, we evaluated the precision of all extracted medical concepts with the part-of-speech ‘Noun-Diagnostic term’. Secondly, we used a questionnaire to evaluate the clinical validity of the extracted compound medical phrases, as a measure of the usefulness of the compound noun algorithm. The essential item on the questionnaire was: ‘The following compound medical phrases were extracted automatically from the discharge summary texts. Do you agree that, from a clinical perspective, the compound medical phrase is meaningful to form a medical concept that is different from its atomic medical term(s) (indicated by a *)?: 1 yes, 2 neutral, 3 no’. The questionnaire contained 300 compound medical phrases, selected randomly from all compound medical phrases extracted from the example textual discharge summaries.

4.2. Usefulness of attribute implication pairs

We also evaluated the clinical validity of the extracted attribute implication pairs, as a measure of their usefulness for ontology building, with this question: ‘The following medical concept pairs were extracted automatically from the discharge summary texts. Do you agree that there is a relationship between the pair from the clinical perspective?: 5 strongly yes, 4 yes, 3 neutral, 2 no, 1 strongly no’. The questionnaire contained 300 attribute implication pairs (i.e. medical concept pairs), each representing a kind of semantic relationship. The pairs were randomly selected from the list of attribute implication pairs extracted from the example textual discharge summaries.

5. Results

A total of 368 textual discharge summary documents contained in an XML file were processed by our system. In the 640 246 word tokens processed, 33 523 word types were found.

In the NLP module, 772 diagnostic term concepts and 429 modifier concepts were extracted and imported as instances of the class Medical Concept. The precision of the 1201 medical concepts, as evaluated by us, reached 100%. The recall of the medical concepts was not evaluated directly in this study. The 20 diagnostic term concepts with the highest frequencies were displayed in a table titled ‘Noun List’, as shown in Fig. 4. In total, 4724 compound medical phrases were extracted from the texts, and 300 phrases were selected randomly from them for evaluation. The ratio of the answer [YES] among the five evaluators was 73 ± 14% (mean ± S.D.). Table 1 shows an example list of 20 compound medical phrases identified by our compound noun algorithm.

Table 1 An example list of 20 compound medical phrases identified by the system (translated from Japanese)

Atomic medical item(s) | Compound medical phrases
Tonsilla | Tonsillectomy
Heart failure | Heart failure symptom
Chest | Chest pressure sensation
Peritoneum | Peritoneal dialysis
Cardiac muscle | Myocardial degeneration
Prostata | Prostatic hypertrophy
Urticaria | Senile urticaria
Kidney, acute | Acute aggravation of renal failure
Mild | Mild enlargement
Diffused | Diffused narrowing
Dyspnea | Exertional dyspnea
Acute, Inferior | Acute inferior infarction
Crus, Hemorrhage | Crus ecchymosis
Essential | Essential hypertension
Blood vessel | Blood vessel elasticity
Pericardium | Pericardial hyperplasia
Stomach | Gastrectomy
Multiple | Multiple sclerosis
Vagus | Vagal reflex
Aorta | Aorta calcification

In the FCA module, a formal context with 368 objects (medical documents) and 1201 attributes (medical concepts, including diagnostic term concepts and modifier concepts) was formed. After calculation, 1158 formal concepts were created, and 24 324 attribute implication pairs were derived from these formal concepts. From the clinical perspective, this kind of implication is more meaningful when the super attribute is a diagnostic term concept. Therefore, we filtered out the pairs whose super attribute is a modifier concept, which left 7666 attribute implication pairs, as shown in Fig. 6. Five clinicians evaluated the clinical validity of 300 attribute implication pairs randomly selected from the 7666 pairs. The ratio of the answers [5 strongly yes] or [4 yes] among the five evaluators was 36 ± 11% (mean ± S.D.). The cumulative number of pairs answered [5 strongly yes] or [4 yes] was 173, accounting for 57.7% (173/300). Table 2 shows an example list of 20 attribute implication pairs that were answered as ‘strongly yes’ by the evaluators.
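The two summary figures can be related by a small sketch. It assumes, as our interpretation rather than a statement from the study, that the mean ± S.D. is taken over the evaluators' individual positive ratios using the sample standard deviation, and that the cumulative count includes a pair when at least one evaluator scored it 4 or 5. The ratings are invented.

```python
from statistics import mean, stdev
from typing import List, Tuple


def positive_ratio_stats(ratings: List[List[int]], threshold: int = 4) -> Tuple[float, float, float]:
    """ratings[e][i] is evaluator e's 1-5 score for pair i.

    Returns (mean of per-evaluator positive ratios, sample S.D., cumulative ratio).
    """
    per_evaluator = [
        sum(score >= threshold for score in row) / len(row) for row in ratings
    ]
    n_items = len(ratings[0])
    # Assumed meaning of "cumulative": positive for at least one evaluator.
    cumulative = sum(
        any(row[i] >= threshold for row in ratings) for i in range(n_items)
    ) / n_items
    return mean(per_evaluator), stdev(per_evaluator), cumulative


# Invented scores: three evaluators, four pairs.
ratings = [
    [5, 2, 1, 3],
    [4, 4, 1, 2],
    [1, 2, 1, 5],
]
avg, sd, cumulative = positive_ratio_stats(ratings)

# The reported cumulative percentage follows directly from the counts in the text.
reported = round(173 / 300 * 100, 1)
```

Under this interpretation the cumulative ratio can never be lower than the per-evaluator mean, which is consistent with the reported 36% and 57.7%.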

6. Discussion

Identifying the relevant medical concepts and the types of their relations is the main task of ontology construction in a clinical domain, here cardiovascular medicine. In this paper, we presented a framework for an ontology building support system in that domain that provides linguistic knowledge and context knowledge to support domain experts in their ontology building tasks.

Although the Japanese medical society has taken measures to standardize health terminology over the last 50 years, the focus has been on nomenclature alone, without semantic structure [29]. The MEDIS working group is engaged in building a clinical hierarchical classification of the diagnostic terms based on the semantic structure of the ICD-10 codes; however, the difficulty of representing the semantic structure well with a one-way classification has been pointed out [34]. Tanaka suggested using a frame model to represent a clinically meaningful disease hierarchy by mapping Japanese disease concepts to UMLS concepts and encoding the concepts with the UMLS semantic type and the concept unique identifier (CUI) [35]. Ontologies differ from controlled terminologies in that they represent the relevant concepts and relationships in a domain, whereas terminologies simply restrict the words used to describe the domain [36,37]. We chose Protégé-2000 as the ontology editing environment in our system because two goals have driven its design and development: (1) achieving interoperability with other knowledge-representation systems, and (2) being an easy-to-use and configurable knowledge-acquisition tool [38]. Protégé-2000 has been designed as an open foundation upon which end users can build tailored knowledge-acquisition functionality by creating appropriate ‘plug-ins’ [39]. After several updates, Protégé-2000 now supports languages other than English well [40]. Our ontology building support system uses a Protégé-2000 tab plug-in as the user interface to display the elicited knowledge, while allowing domain experts to take full advantage of the Protégé knowledge model for manual (or semi-automatic) ontology modeling and visualization. By integrating with Protégé-2000, the system provides a convenient way for domain experts to bring Japanese medical concepts from a Japanese medical corpus into an ontology editing environment, and to use the ontology building features of Protégé-2000 to represent the semantic structure of medical knowledge in a multi-way classification.

A domain-dependent term recognition method was developed in our system to identify the medical concepts in a clinical domain. A clinically oriented standard dictionary of medical terms (MEDIS version 2.0) was used as one of the main knowledge sources. The dictionary is one of the most comprehensive sets of diagnostic terms with ICD-10 codes, so we believe that this machine-readable source should be used as the foundation for further Japanese ontology building in clinical domains. The core component of the NLP module in our system is the ChaSen Japanese morphological analysis system, which is known for its high performance (F-value about 98%) [41]. The precision of the medical concepts extracted reached 100%, and the recall can be calculated as 96% when an F-value of 98% is assumed. The results indicate that ChaSen could stably identify the terms included in its dictionaries. Therefore, our domain-dependent term recognition method could stably identify and extract the medical concepts included in our MEDIS-based dictionaries.
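The recall figure follows from the definition of the F-value, assuming it denotes the balanced F-measure F = 2PR / (P + R); solving for recall gives R = FP / (2P − F). A minimal check of the arithmetic (the function name is our own):

```python
def recall_from_f(precision: float, f_value: float) -> float:
    """Solve F = 2PR / (P + R) for R, giving R = F * P / (2P - F)."""
    denominator = 2 * precision - f_value
    if denominator <= 0:
        raise ValueError("no valid recall for this precision / F-value pair")
    return f_value * precision / denominator


# Precision of 100% and an assumed F-value of 98%, as in the text.
recall = recall_from_f(1.0, 0.98)
print(round(recall, 3))  # 0.961
```

With P = 1.0 and F = 0.98 this gives 0.98 / 1.02, i.e. roughly 96%.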

Table 2 An example list of 20 attribute implication pairs identified by the system (translated from Japanese)

Hemorrhage | consumption coagulopathy
Bronchopneumonia | viral
Rheumatic fever | acquired
Jaundice | acute hepatitis
Colon polyps | primary
Sarcoidosis | ocular sarcoidosis
Heart failure | inferior myocardial infarction
Neoplasm | benign neoplasm
Chest pain | intercostal neuralgia
Diabetic renal disease | diabetic complications
Carcinoma | biliary carcinoma
Hypertrophic cardiomyopathy | familial cardiomyopathy
Superficial gastritis | superficial
Gastric cancer | metastases to lymph nodes
Angina pectoris | myocardial infarction
Pericarditis | infectious pericarditis
Headache | hypertensive encephalopathy
Hepatic encephalopathy | convulsive seizure
Heart failure | familial cardiomyopathy
Chest pain | myocardial infarction


In addition, we implemented a compound noun algorithm that is able to identify and extract new medical concepts beyond the range of the dictionaries. This simple heuristic has been used as a good approximation of core, nonrecursive noun phrases when identifying key concepts in biomedical literature [42]. For example, ‘extra systoles’ and ‘dyspnea’ are basic diagnostic terms identified by the dictionaries. Using the algorithm, we identified further concepts related to these terms, such as ‘ventricular extra systole’ and ‘exertional dyspnea’, that may also interest the domain experts (or users). More promisingly, ‘IS-A’ relations can be constructed automatically between a basic diagnostic term and its related compound diagnostic terms, e.g. ‘ventricular extra systole’ IS-A ‘extra systole’. More examples are shown in Table 1. The usefulness of the algorithm was evaluated in this study, and the results indicate that about 73% of the compound medical phrases extracted are meaningful, from a clinical perspective, to form a medical concept that is different from its atomic medical terms.
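How such IS-A candidates might be generated is sketched below. The rule used here, that a compound is proposed as IS-A a basic term only when the term is the final component (the head) of the compound, is our own assumption and not the system's documented method; it is chosen because a phrase such as ‘heart failure symptom’ should not be classified under ‘heart failure’.

```python
from typing import Iterable, List, Tuple


def is_a_candidates(
    basic_terms: Iterable[str], compounds: Iterable[str]
) -> List[Tuple[str, str, str]]:
    """Propose (compound, 'IS-A', basic_term) when the basic term is the head.

    English phrases are matched on a trailing " term"; for unsegmented
    Japanese strings a plain suffix match would play the same role.
    """
    terms = list(basic_terms)
    relations = []
    for compound in compounds:
        for term in terms:
            if compound != term and compound.endswith(" " + term):
                relations.append((compound, "IS-A", term))
    return relations


# Example phrases taken from the text and Table 1.
basic = ["extra systole", "dyspnea", "heart failure"]
phrases = ["ventricular extra systole", "exertional dyspnea", "heart failure symptom"]
relations = is_a_candidates(basic, phrases)
```

Candidates produced this way would still need review by a domain expert before being added to the ontology.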

Word frequency in a domain corpus is a good index for domain experts to judge the importance of a concept in the domain. The diagnostic term concepts with the highest frequencies shown in Fig. 4 are likely to be very important in the domain of cardiovascular medicine. Johansson compared word frequencies in a domain-specific corpus with those in a balanced reference corpus, employing a χ²-test and relative frequency ratios to determine the domain-specific concepts [43]. Word frequency information can also be used statistically to establish a domain-independent method for domain-specific term recognition [9].
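A relative frequency ratio of the kind mentioned above can be sketched as follows. This is a generic illustration, not a reproduction of the cited method: the additive constant is a crude device of our own to avoid division by zero, and the counts are invented.

```python
from typing import Dict


def relative_frequency_ratio(
    term: str,
    domain_counts: Dict[str, int],
    reference_counts: Dict[str, int],
    smoothing: float = 1.0,
) -> float:
    """Ratio of a term's relative frequency in the domain corpus to that in a reference corpus."""
    domain_total = sum(domain_counts.values())
    reference_total = sum(reference_counts.values())
    domain_rate = (domain_counts.get(term, 0) + smoothing) / (domain_total + smoothing)
    reference_rate = (reference_counts.get(term, 0) + smoothing) / (reference_total + smoothing)
    return domain_rate / reference_rate


# Invented counts for a small domain corpus and a larger reference corpus.
domain = {"angina": 50, "the": 950}
reference = {"angina": 1, "the": 9999}

ratio_angina = relative_frequency_ratio("angina", domain, reference)
ratio_the = relative_frequency_ratio("the", domain, reference)
```

A ratio well above 1 suggests a domain-specific term, while function words stay close to 1.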

The theory of FCA allows an adequate representation of the underlying semantics in the sense of the concept definition [23]. In the FCA module of our system, the context knowledge is mainly represented by the attribute implication pairs and the formal concept lattices. Information about attribute implications has been proposed as a way to represent knowledge that helps in the query definition process and complements the generally available domain knowledge [30]. In our system, an elicited attribute implication pair represents a kind of semantic relationship between two medical concepts (attributes) that could interest the domain experts and then be identified by them. In evaluating the clinical validity of the attribute implication pairs extracted from the sample textual discharge summaries, about 36% of the medical concept pairs were, on average, judged by the clinical experts to have a relationship from a clinical perspective. When we counted cumulatively the medical concept pairs identified as positive across all five evaluators, about 57.7% of the pairs were positive. We think the cumulative value represents the usefulness of the medical concepts extracted by our system more reasonably, because we found that the semantic relationship between a medical concept pair was judged differently by clinicians from different domains. For example, two evaluators from cardiovascular medicine answered ‘no’ for a relationship between the pair ‘Bladder carcinoma | Sleep disorder’, whereas an evaluator from psychiatric medicine answered ‘yes’. Above all, we believe that this kind of knowledge is important and useful for domain experts in identifying the relations among medical concepts in their ontology building tasks in a clinical domain.

A typical task for which concept lattices are useful is to unfold given data, making their conceptual structure visible and accessible, in order to find patterns, regularities, exceptions, etc. [32]. For example, from the lattice shown in Fig. 5, domain experts can identify semantic relations of the concept ‘Angina pectoris’, such as its relations with the anatomical concepts ‘blood vessel’, ‘chest’ and ‘heart’, with the clinical symptoms ‘bradycardia’ and ‘chest pain’, and with the other diseases ‘hyperlipidemia’, ‘myocardial ischemia’ and ‘diabetes mellitus’. Formal concept lattices have also been applied to context-based ontology building in several recent studies. Richards et al. used concept lattices to clarify commonalities and differences between concepts to aid discussions between those involved in developing the ontology [22]. Stumme et al. proposed using a pruned concept lattice, which has the same degree of detail as the two source ontologies, to support bottom-up ontology merging [44]. Complexity is a major problem for concept lattice display, so we did not compute the whole concept lattice for the whole context. To reduce the complexity to a certain degree, we constructed the concept lattice only for one medical concept (here, an attribute) at a time, according to the interest and selection of the domain experts.
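One way to build such a local lattice is sketched below, assuming that "local" means the sub-context of documents that mention the selected concept; the text does not specify the exact construction, so this restriction, the function names and the toy data are our own. The brute-force enumeration is exponential and omits the empty-extent bottom concept.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent, intent)


def local_concepts(context: Dict[str, Set[str]], selected: str) -> List[Concept]:
    """Formal concepts of the sub-context of documents that mention `selected`."""
    sub = {doc: attrs for doc, attrs in context.items() if selected in attrs}
    docs = list(sub)
    found: Set[Concept] = set()
    # Brute force over document subsets; acceptable only for small local contexts.
    for r in range(1, len(docs) + 1):
        for group in combinations(docs, r):
            intent = frozenset(set.intersection(*(sub[d] for d in group)))
            extent = frozenset(d for d in docs if intent <= sub[d])
            found.add((extent, intent))
    return sorted(found, key=lambda c: (-len(c[0]), sorted(c[1])))


def hasse_edges(concepts: List[Concept]) -> List[Tuple[Concept, Concept]]:
    """(lower, upper) pairs where upper directly covers lower by extent inclusion."""
    edges = []
    for low in concepts:
        for up in concepts:
            if low[0] < up[0] and not any(low[0] < mid[0] < up[0] for mid in concepts):
                edges.append((low, up))
    return edges


# Invented toy context.
toy = {
    "doc1": {"angina pectoris", "chest pain"},
    "doc2": {"angina pectoris", "chest pain", "hyperlipidemia"},
    "doc3": {"angina pectoris", "diabetes mellitus"},
    "doc4": {"heart failure"},
}
concepts = local_concepts(toy, "angina pectoris")
edges = hasse_edges(concepts)
```

Restricting the context first keeps the number of concepts small, and the edges returned are those that would be drawn as lines in a Hasse diagram.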

Although many efforts have been made to find automatic methods, a fully automatic ontology construction method is still not realistic [19]. In particular, given the heterogeneity of clinical data, the physician’s interpretation of clinical data written in unstructured free-text language is very difficult to standardize and thus difficult to mine [45]. Therefore, the ontology engineering community has emphasized how to simplify and accelerate ontology construction using different knowledge discovery processes. However, the knowledge discovery process is a non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns from large collections of data [45,46]. It is necessary to ensure the suitability of the knowledge extracted and to determine whether it reflects the application data requirements through an efficient evaluation during the knowledge discovery process [47]. Our system evaluation was accomplished by considering the clinical validity as judged by the physicians. We believe the results reflect the usefulness of the system in the ontology building support task in clinical domains.

One of the main limitations of our study is that we provide only the basic linguistic and context knowledge information under the system framework. Given the extensibility of our framework, additional linguistic knowledge from the NLP module could be incorporated into the system, such as a domain-independent method of term recognition. More complicated FCA algorithms could also be incorporated, such as a nested Hasse diagram to show a large concept lattice.

7. Conclusion and future work

In this paper, we described a framework for an ontology building support system in a clinical domain that integrates linguistic knowledge information with context knowledge information using FCA. Under the framework of our ontology building support system using FCA, the clinical experts could reach a mass of both linguistic information and context-based knowledge information that was demonstrated as useful to support their ontology building tasks in a clinical domain.

Future work should be undertaken to further the efficient evaluation of more linguistic and context knowledge (using FCA) suitable for ontology construction tasks in clinical domains. Based on the evaluation, the system could be extended with more powerful Japanese ontology building support functions.

Acknowledgements

The authors are very grateful to the clinicians involved in the evaluation process of our study and to the Department of Cardiovascular Medicine, Hokkaido University Hospital, for its cooperation in providing the discharge summaries. The authors are also very grateful to Dr Richard Berwick for his kind help in English-language editing of the manuscript.

References

[1] G. Jiang, K. Ogasawara, A. Endoh, T. Sakurai, Comparison of computer-based information support of clinical research in Chinese and Japanese hospitals: a postal survey of clinicians’ views, Methods Inf. Med. 41 (2002) 141–146.

[2] G. Jiang, K. Ogasawara, A. Endoh, T. Sakurai, How well do clinicians use computer-based information for clinical practice? Postal survey of clinicians’ views in Japan, Proc. AMIA Symp. (2000) 1038.

[3] V.G. Allen, J.F. Arocha, V. Patel, Evaluating evidence against diagnostic hypothesis in clinical decision making by students, residents and physicians, Int. J. Med. Inf. 51 (2–3) (1998) 91–105.

[4] I. Sim, P. Gorman, R.A. Greenes, R.B. Haynes, B. Kaplan, H. Lehmann, P.C. Tang, Clinical decision support systems for the practice of evidence-based medicine, J. Am. Med. Inf. Assoc. 8 (6) (2001) 527–534.

[5] Q. Zeng, J.J. Cimino, K.H. Zou, Providing concept-oriented views for clinical data using a knowledge-based system, J. Am. Med. Inf. Assoc. 9 (2002) 294–305.

[6] E.A. Mendonca, J.J. Cimino, S.B. Johnson, Y.H. Seol, Accessing heterogeneous sources of evidence to answer questions, J. Biomed. Inf. 34 (2001) 85–98.

[7] M.A. Musen, Scalable software architectures for decision support, Methods Inf. Med. 38 (1999) 229–238.

[8] W.W. Stead, R.A. Miller, M.A. Musen, W.R. Hersh, Integration and beyond: linking information from disparate sources and into workflow, J. Am. Med. Inf. Assoc. 7 (2000) 135–145.

[9] G. Nenadic, H. Mima, I. Spasic, S. Ananiadou, J. Tsujii, Terminology-driven literature mining and knowledge acquisition in biomedicine, Int. J. Med. Inf. (2002) 1–16.

[10] B.C. Chandrasekaran, J.R. Josephson, V.R. Benjamins, What are ontologies, and why do we need them, IEEE Intell. Syst. 1 (1999) 20–26.

[11] R. McEntire, P. Karp, N. Abernethy, D. Benton, G. Helt, M. DeJongh, R. Kent, A. Kosky, S. Lewis, D. Hodnett, E. Neumann, F. Olken, D. Pathak, P. Tarczy-Hornoch, L. Toldo, T. Topaloglou, An evaluation of ontology exchange languages for bioinformatics, Proc. Int. Conf. Intell. Syst. Mol. Biol. 8 (2000) 239–250.

[12] M.A. Musen, Domain ontologies in software engineering: use of protege with the EON architecture, Methods Inf. Med. 37 (1998) 540–550.

[13] J.F. Sowa, Knowledge Representation: Logical, Philosophical and Computational Foundations, Brooks Cole Publishing, Pacific Grove, CA, 1999.

[14] M.A. Musen, S.W. Tu, A.K. Das, Y. Shahar, EON: A component-based approach to automation of protocol-directed therapy, J. Am. Med. Inf. Assoc. 3 (1996) 367–388.

[15] P.D. Stetson, L.K. McKnight, S. Bakken, C. Curran, T.T. Kubose, J.J. Cimino, Development of an ontology to model medical errors, information needs, and the clinical communication space, J. Am. Med. Inf. Assoc. 9 (2002) S86–S91.

[16] P.A. De Clercq, A. Hasman, J.A. Blom, H.H. Korsten, Design and implementation of a framework to support the development of clinical guideline, Int. J. Med. Inf. 64 (2001) 285–318.

[17] M.A. Musen, Medical informatics: searching for underlying components, Methods Inf. Med. 41 (2002) 12–19.

[18] J.R. Hajdukiewicz, K.J. Vicente, D.J. Doyle, P. Milgram, C.M. Burns, Modeling a medical environment: an ontology for integrated medical informatics design, Int. J. Med. Inf. 62 (1) (2001) 79–99.

[19] S. Le Moigno, J. Charlet, D. Bourigault, P. Degoulet, M. Jaulent, Terminology extraction from text to build an ontology in surgical intensive care, Proc. AMIA Symp. (2002) 430–434.

[20] A. Maedche, R. Volz, The TEXT-TO-ONTO ontology extraction and maintenance system, ICDM-Workshop on Integrating Data Mining and Knowledge Management, San Jose, CA, USA, 2001, pp. 11.

[21] C. Lovis, R. Baud, A.M. Rassinoux, J.R. Scherrer, Medical dictionaries for patient encoding systems: a methodology, Artif. Intell. Med. 14 (1998) 201–214.

[22] N. Grabar, P. Zweigenbaum, A general method for shifting linguistic knowledge from structured terminologies, Proc. AMIA Symp. (2000) 310–314.

[23] D. Richards, S.J. Simoff, Design ontology in context: a situated cognition approach to conceptual modeling, Artif. Intell. Eng. 15 (2001) 121–136.

[24] M. Schnabel, Representing and processing medical knowledge using formal concept analysis, Methods Inf. Med. 41 (2002) 160–167.

[25] MEDIS URL: http://www.medis.or.jp/.

[26] K. Ohe, K. Hatano, Y. Kumazawa, H. Matsumoto, Open software for registering standard disease term into HIS with the capability of adding modifiers and conversion to ICD-10 code, The Proceedings of the 21st Joint Conference of Medical Informatics in Japan, 2001, pp. 829–830 (in Japanese).

[27] Protege-2000 URL: http://protege.stanford.edu/index.html.

[28] ChaSen URL: http://chasen.aist-nara.ac.jp/.

[29] MACD URL: http://chasen.aist-nara.ac.jp/macd/.

[30] Y. Liu, Y. Satomura, Building a controlled health vocabulary in Japanese, Methods Inf. Med. 40 (2001) 307–314.

[31] K. Ohe, Requirements of practical standard of disease concepts for electronic information exchanging, Proceedings of the 20th Joint Conference of Medical Informatics in Japan, 2000, pp. 37–40 (in Japanese).

[32] B. Diaz-Agudo, P.A. Gonzalez-Calero, Formal concept analysis as a support technique for CBR, Knowledge-Based Syst. 14 (2001) 163–171.

[33] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer, Berlin, 1997 (ISBN 3-540-62771-5).

[34] K. Hatano, A. Hamada, M. Kashiwagi, T. Tashiro, A. Watanabe, M. Sato, T. Sasaki, K. Ohe, Development of clinical disease name classification based on ICD10-based code set of disease names, Proceedings of the 20th Joint Conference of Medical Informatics in Japan, 2001, pp. 739–741 (in Japanese).

[35] M. Tanaka, Study of construction of disease hierarchy by frame model, Proceedings of the 20th Joint Conference of Medical Informatics in Japan, 2001, pp. 810–812 (in Japanese).

[36] T. Gruber, Toward principles of the design of ontologies used for knowledge sharing, in: Knowledge Systems Laboratory, Stanford University, 1993.

[37] I. Yeh, P.D. Karp, N.F. Noy, R.B. Altman, Knowledge acquisition, consistency checking and concurrency control for Gene Ontology (GO), Bioinformatics 19 (2) (2003) 241–248.

[38] N.F. Noy, R.W. Fergerson, M.A. Musen, The knowledge model of Protege-2000: combining interoperability and flexibility, Second International Conference on Knowledge Engineering and Knowledge Management (EKAW 2000), Juan-les-Pins, France, 2000.

[39] M.A. Musen, R.W. Fergerson, W.E. Grosso, et al., Component-based support for building knowledge-acquisition systems, Conference on Intelligent Information Processing (IIP 2000) of the International Federation for Information Processing World Computer Congress (WCC 2000), Beijing, 2000, SMI technical report (SMI-2000-0838).

[40] J.H. Gennari, M.A. Musen, R.W. Fergerson, W.E. Grosso, M. Crubezy, H. Eriksson, N.F. Noy, S.W. Tu, The evaluation of Protege: an environment for knowledge-based systems development, SMI technical report (SMI-2002-0943), 2002.

[41] A. Masayuki, M. Yuji, Extended Models and Tools for High-performance Part-of-Speech Tagger. Proceedings of COLING 2000, July 2000.

[42] W.H. Majoros, G.M. Subramanian, M.D. Yandell, Identification of key concepts in biomedical literature using a modified Markov heuristic, Bioinformatics 19 (3) (2003) 402–407.

[43] P. Johansson, Extraction of ontology concepts based on word frequencies, Project Report at http://hem.fyristorg.com/pontus.johansson/gslt/statmet.html.

[44] G. Stumme, A. Maedche, Merging ontologies by means of formal concept analysis, First International Workshop on Databases, Documents, and Information Fusion, Magdeburg, Germany, April 2001.

[45] K.J. Cios, G.W. Moore, Uniqueness of medical data mining, Artif. Intell. Med. 26 (2002) 1–24.

[46] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, Boston, 1996.

[47] E.A. Mendonca, J.J. Cimino, Automated knowledge extraction from MEDLINE citations, Proc. AMIA Symp. (2000) 575–579.
