
Lecture Notes in Artificial Intelligence 2780
Edited by J. G. Carbonell and J. Siekmann

Subseries of Lecture Notes in Computer Science

Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo

Michel Dojat, Elpida Keravnou, Pedro Barahona (Eds.)

Artificial Intelligence in Medicine

9th Conference on Artificial Intelligence in Medicine in Europe, AIME 2003
Protaras, Cyprus, October 18-22, 2003
Proceedings


Series Editors

Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors

Michel Dojat
CHU de Grenoble – Pavillon B
Unité mixte INSERM-UJF U594 "Neuroimagerie Fonctionnelle et Métabolique"
BP 217, 38043 Grenoble Cedex 9, France
E-mail: [email protected]

Elpida Keravnou
University of Cyprus, Department of Computer Science
P.O. Box 20537, Nicosia 1678, Cyprus
E-mail: [email protected]

Pedro Barahona
Universidade Nova de Lisboa
Faculdade de Ciências e Tecnologia, Departamento de Informática
2829-516 Caparica, Portugal
E-mail: [email protected]

Cataloging-in-Publication Data applied for

A catalog record for this book is available from the Library of Congress.

Bibliographic information published by Die Deutsche Bibliothek.
Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet at <http://dnb.ddb.de>.

CR Subject Classification (1998): I.2, I.4, J.3, H.2.8, H.4, H.3

ISSN 0302-9743
ISBN 3-540-20129-7 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH

http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik
Printed on acid-free paper SPIN: 10931479 06/3142 5 4 3 2 1 0

Preface

The European Society for Artificial Intelligence in Medicine (AIME) was established in 1986 with two main goals: 1) to foster fundamental and applied research in the application of Artificial Intelligence (AI) techniques to medical care and medical research, and 2) to provide a forum for reporting significant results achieved at biennial conferences. Additionally, AIME helps the medical industry identify new AI techniques with high potential for integration into new products. A major activity of this society has been a series of international conferences, from Marseille (FR) in 1987 to Cascais (PT) in 2001, held biennially over the last 16 years.

The AIME conference provides a unique opportunity to present and improve the international state of the art of AI in medicine from both a research and an applications perspective. For this purpose, the AIME conference includes invited lectures, contributed papers, system demonstrations, tutorials and workshops. The present volume contains the proceedings of the AIME 2003 conference, the ninth conference on Artificial Intelligence in Medicine in Europe, held in Cyprus, October 18–22, 2003.

In the AIME 2003 conference announcement, we encouraged authors to submit original contributions to the development of theory, techniques, and applications of AI in medicine, including the evaluation of health care programs. Theoretical papers should include a prospective part about possible applications to medical problem solving. Technical papers should describe the novelty of the proposed approach, its assumptions, and its pros and cons compared to alternative techniques. Application papers should present sufficient information to allow the evaluation of the practical benefits of the proposed system or methodology.

The call for papers for AIME 2003 resulted in 65 papers. All papers were carefully evaluated by at least two independent referees from the program committee with support from additional reviewers. Submissions came from 18 countries, 5 of them outside Europe. This confirms the international interest in an AI in medicine conference. The reviewers judged the originality, the quality, and the significance of the proposed research, as well as its presentation and its relevance to the AIME conference. All submissions were ranked based on two aspects: the overall recommendation of each reviewer and a quantitative score obtained from all aspects of the detailed review. In general, the two aspects were in agreement: a highly positive recommendation corresponded to a high quantitative score. In the very few cases where discrepancies appeared, a careful evaluation of each review and a deep examination of the paper were performed by the program committee and the organizing committee chair before reaching a final decision.

As a result, 24 papers were accepted as full papers (a 37% acceptance rate) for oral presentation. Each of them received a high overall ranking and two positive recommendations, of which at least one was highly positive. Ten pages have been allocated to each full paper in this volume. In addition, 26 papers were accepted as short papers for poster presentation. Each of them also received two positive recommendations. Five pages have been allocated to each short paper in this volume. All accepted papers were organized under nine themes during the AIME 2003 conference and in this volume. These themes reflect the current interests of researchers in AI in medicine. Temporal reasoning, from the interpretation of high-frequency data to the modeling of high-level abstractions, is a persistent research theme. AI techniques for image processing seem very promising, in particular for neuroimaging applications. The construction of ontologies based on medical terminologies or free texts has generated theoretical (logic-based) and technical papers. The growing medical interest in protocol- and guideline-based care is motivating the development of specific frameworks and methodologies for their representation, verification, learning, and sharing. Probabilistic networks and Bayesian models remain representational frameworks well adapted to medical information and a dynamic research field. The need for computerized assistance for medical decision making, from diagnosis to treatment planning, has encouraged several application papers. Finally, machine learning, data mining, and knowledge discovery appear as central techniques for data analysis in various medical domains.

The modeling, using computerized techniques, of biological systems from genetic networks to highly cognitive mechanisms, is still largely debated and has been since the beginning of AI. Two speakers were invited to discuss these points in the light of the more recent results in computer simulation of biological phenomena and robotics. Two extended abstracts of these invited lectures are included at the end of this volume.

We would like to emphasize the high quality of the papers selected in this volume, demonstrating the vitality and diversity of research in Artificial Intelligence in medicine and the interest of specific media, literature, and conferences devoted to this field.

We would like to thank all the people and institutions who contributed to the success of the AIME 2003 conference: the authors, the members of the program committee as well as additional reviewers, all the members of the organizing committee, and the invited speakers Zoltan Szallasi and Philippe Gaussier. Moreover, we would like to thank the organizers of the two workshops, Ameen Abu-Hanna and Jim Hunter, and Peter Lucas. Finally, we would like to thank the University of Cyprus for sponsoring the conference.

July 2003
Michel Dojat
Elpida Keravnou
Pedro Barahona

Organization

Program Chair: Michel Dojat (INSERM, France)

Organizing Chair: Elpida Keravnou (University of Cyprus, Cyprus)

Workshops Chair: Pedro Barahona (Universidade Nova de Lisboa, Portugal)

Program Committee

Klaus-Peter Adlassnig (Austria)
Amparo Alonso-Betanzos (Spain)
Steen Andreassen (Denmark)
Robert Baud (Switzerland)
Riccardo Bellazzi (Italy)
Enrico Coiera (Australia)
Carlo Combi (Italy)
Rolf Engelbrecht (Germany)
Henrik Eriksson (Sweden)
John Fox (United Kingdom)
Catherine Garbay (France)
Werner Horn (Austria)
Jim Hunter (United Kingdom)
Nada Lavrac (Slovenia)
Peter Lucas (The Netherlands)
Johan van der Lei (The Netherlands)
Silvia Miksch (Austria)
Constantinos Pattichis (Cyprus)
Silvana Quaglini (Italy)
Alan L. Rector (United Kingdom)
Steve Rees (Denmark)
Basilio Sierra (Spain)
Yuval Shahar (Israel)
Christos N. Schizas (Cyprus)
Mario Stefanelli (Italy)
Costas Spyropoulos (Greece)
Thomas Uthmann (Germany)
Mario Veloso (Portugal)
Blaz Zupan (Slovenia)

Additional Reviewers

Ivano Azzini
Dragan Gamberger
Katharina Kaiser
Efthyvoulos Kyriacou
Cristiana Larizza
Costas Neocleous
Georgios Paliouras
Joerg Reiner
Filip Zelezny

Table of Contents

Temporal Reasoning

On-Line Extraction of Successive Temporal Sequences from ICU High-Frequency Data for Decision Support Information
Sylvie Charbonnier

Quality Assessment of Hemodialysis Services through Temporal Data Mining
Riccardo Bellazzi, Cristiana Larizza, Paolo Magni, and Roberto Bellazzi

Idan: A Distributed Temporal-Abstraction Mediator for Medical Databases
David Boaz and Yuval Shahar

Prognosis of Approaching Infectious Diseases
Rainer Schmidt and Lothar Gierl

Modeling Multimedia and Temporal Aspects of Semistructured Clinical Data
Carlo Combi, Barbara Oliboni, and Rosalba Rossato

NEONATE: Decision Support in the Neonatal Intensive Care Unit – A Preliminary Report
Jim Hunter, Gary Ewing, Yvonne Freer, Robert Logie, Paul McCue, and Neil McIntosh

Abstracting the Patient Therapeutic History through a Heuristic-Based Qualitative Handling of Temporal Indeterminacy
Jacques Bouaud, Brigitte Seroussi, and Baptiste Touzet

Ontology, Terminology

How to Represent Medical Ontologies in View of a Semantic Web?
Christine Golbreich, Olivier Dameron, Bernard Gibaud, and Anita Burgun

Using Description Logics for Managing Medical Terminologies
Ronald Cornet and Ameen Abu-Hanna

Ontology for Task-Based Clinical Guidelines and the Theory of Granular Partitions
Anand Kumar and Barry Smith

Speech Interfaces for Point-of-Care Guideline Systems
Martin Beveridge, John Fox, and David Milward

Text Categorization prior to Indexing for the CISMEF Health Catalogue
Alexandrina Rogozan, Aurelie Neveol, and Stefan J. Darmoni

Bodily Systems and the Modular Structure of the Human Body
Barry Smith, Igor Papakin, and Katherine Munn

Image Processing, Simulation

Multi-agent Approach for Image Processing: A Case Study for MRI Human Brain Scans Interpretation
Nathalie Richard, Michel Dojat, and Catherine Garbay

Qualitative Simulation of Shock States in a Virtual Patient
Altion Simo and Marc Cavazza

3D Segmentation of MR Brain Images into White Matter, Gray Matter and Cerebro-Spinal Fluid by Means of Evidence Theory
Anne-Sophie Capelle, Olivier Colot, and Christine Fernandez-Maloigne

A Knowledge-Based System for the Diagnosis of Alzheimer’s Disease
Sebastian Oehm, Thomas Siessmeier, Hans-Georg Buchholz, Peter Bartenstein, and Thomas Uthmann

Guidelines, Clinical Protocols

DEGEL: A Hybrid, Multiple-Ontology Framework for Specification and Retrieval of Clinical Guidelines
Yuval Shahar, Ohad Young, Erez Shalom, Alon Mayaffit, Robert Moskovitch, Alon Hessing, and Maya Galperin

Experiences in the Formalisation and Verification of Medical Protocols
Mar Marcos, Michael Balser, Annette ten Teije, Frank van Harmelen, and Christoph Duelli

Enhancing Conventional Web Content with Intelligent Knowledge Processing
Rory Steele and John Fox

Linking Clinical Guidelines with Formal Representations
Peter Votruba, Silvia Miksch, and Robert Kosara

Computerised Advice on Drug Dosage Decisions in Childhood Leukaemia: A Method and a Safety Strategy
Chris Hurt, John Fox, Jonathan Bury, and Vaskar Saha

The NewGuide Project: Guidelines, Information Sharing and Learning from Exceptions
Paolo Ciccarese, Ezio Caffi, Lorenzo Boiocchi, Assaf Halevy, Silvana Quaglini, Anand Kumar, and Mario Stefanelli

Managing Theoretical Single-Disease Guideline Recommendations for Actual Multiple-Disease Patients
Gersende Georg, Brigitte Seroussi, and Jacques Bouaud

Informal and Formal Medical Guidelines: Bridging the Gap
Marije Geldof, Annette ten Teije, Frank van Harmelen, Mar Marcos, and Peter Votruba

Terminology, Natural Language

Rhetorical Coding of Health Promotion Dialogues
Floriana Grasso

Learning Derived Words from Medical Corpora
Pierre Zweigenbaum and Natalia Grabar

Learning-Free Text Categorization
Patrick Ruch, Robert Baud, and Antoine Geissbuhler

Knowledge-Based Query Expansion over a Medical Terminology Oriented Ontology on the Web
Linda Fatima Soualmia, Catherine Barry, and Stefan J. Darmoni

Linking Rules to Terminologies and Applications in Medical Planning
Sanjay Modgil

Machine Learning

Classification of Ovarian Tumors Using Bayesian Least Squares Support Vector Machines
Chuan Lu, Tony Van-Gestel, Johan A.K. Suykens, Sabine Van-Huffel, Dirk Timmerman, and Ignace Vergote

Attribute Interactions in Medical Data Analysis
Aleks Jakulin, Ivan Bratko, Dragica Smrke, Janez Demsar, and Blaz Zupan

Combining Supervised and Unsupervised Methods to Support Early Diagnosis of Hepatocellular Carcinoma
Federica Ciocchetta, Rossana Dell’Anna, Francesca Demichelis, Amar Paul Dhillon, Alberto Quaglia, and Andrea Sboner

Analysis of Gene Expression Data by the Logic Minimization Approach
Dragan Gamberger and Nada Lavrac

A Journey trough Clinical Applications of Multimethod Decision Trees
Petra Povalej, Mitja Lenic, Milojka Molan Stiglic, Maja Skerbinjek Kavalar, Jernej Zavrsnik, and Peter Kokol

Probabilistic Networks, Bayesian Models

Detailing Test Characteristics for Probabilistic Networks
Danielle Sent and Linda C. van der Gaag

Bayesian Learning of the Gas Exchange Properties of the Lung for Prediction of Arterial Oxygen Saturation
David Murley, Stephen Rees, Bodil Rasmussen, and Steen Andreassen

Hierarchical Dirichlet Learning – Filling in the Thin Spots in a Database
Steen Andreassen, Brian Kristensen, Alina Zalounina, Leonard Leibovici, Uwe Frank, and Henrik C. Schønheyder

A Bayesian Neural Network Approach for Sleep Apnea Classification
Oscar Fontenla-Romero, Bertha Guijarro-Berdinas, Amparo Alonso-Betanzos, Ana del Rocıo Fraga-Iglesias, and Vicente Moret-Bonillo

Probabilistic Networks as Probabilistic Forecasters
Linda C. van der Gaag and Silja Renooij

Finding and Explaining Optimal Treatments
Concha Bielza, Juan A. Fernandez del Pozo, and Peter Lucas

Case Based Reasoning, Decision Support

Acquisition of Adaptation Knowledge for Breast Cancer Treatment Decision Support
Jean Lieber, Mathieu d’Aquin, Pierre Bey, Amedeo Napoli, Maria Rios, and Catherine Sauvagnac

Case Based Reasoning for Medical Decision-Support in a Safety Critical Environment
Isabelle Bichindaritz, Carol Moinpour, Emin Kansu, Gary Donaldson, Nigel Bush, and Keith M. Sullivan

Constraint Reasoning in Deep Biomedical Models
Jorge Cruz and Pedro Barahona

Interactive Decision Support for Medical Planning
David W. Glasspool, John Fox, Fortunato D. Castillo, and Victoria E.L. Monaghan

Compliance with the Hyperlipidaemia Consensus: Clinicians versus the Computer
Wouter P. van Rijsinge, Linda C. van der Gaag, Frank Visseren, and Yolanda van der Graaf

WoundCare: A Palm Pilot-Based Expert System for the Treatment of Pressure Ulcers
Douglas D. Dankel, Mark Connor, and Zulma Chardon

VIE-DIAB: A Support Program for Telemedical Glycaemic Control
Christian Popow, Werner Horn, Birgit Rami, and Edith Schober

Data Mining, Knowledge Discovery

Drifting Concepts as Hidden Factors in Clinical Studies
Matjaz Kukar

Multi-relational Data Mining in Medical Databases
Amaury Habrard, Marc Bernard, and Francois Jacquenet

Invited Talks

Is It Time to Trade “Wet-Work” for Network?
Zoltan Szallasi

Robots as Models of the Brain: What Can We Learn from Modelling Rat Navigation and Infant Imitation Games?
Philippe Gaussier, Pierre Andry, Jean Paul Banquet, Mathias Quoy, Jacqueline Nadel, and Arnaud Revel

Author Index

M. Dojat, E. Keravnou, and P. Barahona (Eds.): AIME 2003, LNAI 2780, pp. 1–10, 2003. © Springer-Verlag Berlin Heidelberg 2003

On-Line Extraction of Successive Temporal Sequences from ICU High-Frequency Data for Decision Support Information

Sylvie Charbonnier

Laboratoire d'Automatique de Grenoble BP 46, 38402 St Martin d’Hères France [email protected]

tel: (33) 476-82-64-15 fax: (33) 476-82-63-88

Abstract. This paper presents a method to extract successive temporal sequences on line from high-frequency data monitored in ICU. Successive temporal sequences are expressions such as: “the systolic blood pressure is steady at 120 mmHg from time t0 until time t1; it is increasing from 120 mmHg to 160 mmHg from time t1 to time t2 …”. The method uses a segmentation algorithm that was developed previously and a classification of the segments into temporal patterns. It has seven tuning parameters that are rather easy to tune because they have a physical meaning. The results obtained on simulated data are quite satisfactory. Sequences extracted from real biological data recorded during 14 hours from different patients were approved by two clinicians. These temporal sequences can help the health care personnel to make decisions in alarm situations, or can be used as inputs to intelligent alarm systems using inferences on the data.

1 Introduction

False alarms generated by monitoring systems are extremely numerous in Intensive Care Units. Indeed, most of the alarm detection procedures included in these systems consist in triggering an alarm when a variable crosses a preset level, which is very sensitive to artefacts. These false alarms are an extra burden on the health care personnel, as reported in the literature ([1], [2]).

Over the past decade, some work has been done to develop intelligent alarm systems for ICU, their goal being to assist clinicians in the interpretation of an alarm situation ([3], [4]). Reliable intelligent alarm systems require signal-to-symbol conversion but also temporal pattern extraction as a prior step, so as to include time in the decision ([5]-[11]).

In this paper, we present a method to extract successive temporal sequences on line from high-frequency biological parameters. Successive temporal sequences are semi-quantitative information explaining the temporal behaviour of a variable, such as: “the systolic blood pressure is steady at 120 mmHg from time t0 until time t1; it is increasing from 120 mmHg to 160 mmHg from time t1 to time t2 …”. These sequences highlight the important patterns of change in the data. They may be used as inputs to intelligent alarm systems or as a support to physicians during alarm situations, by helping them to take a prompter decision. The paper is organised as follows. The method is described in the first part; results on simulated and real biological data are presented in the second part and discussed in the third part.

2 Presentation of the Method

The method developed to extract successive temporal sequences on line consists of 4 steps, carried out in the following order:

1. On-line segmentation of the data into linear segments
2. Classification of the last calculated segment into one of 9 temporal shapes: steady, increasing, decreasing, positive or negative step, positive or negative step+slope, concave or convex transient
3. Transformation of shapes into semi-quantitative trend patterns
4. Aggregation of the current trend pattern with the previous ones to form the successive sequences

2.1 On-Line Segmentation of the Data

A segmentation algorithm has been developed previously. It splits the data into successive line segments of the form y(t) = p(t-to) + yo, where to is the time when the segment begins, p is its slope and yo is the ordinate at time to.

The segmentation algorithm determines on line the moment when the current linear approximation is no longer acceptable and when to calculate the new line segment that best fits the data, using the least-squares criterion. The cumulative sum (CUSUM) technique is used to detect whether the linear approximation is still acceptable. The algorithm is tuned with 5 parameters: 2 for the decomposition into segments (called th1 and th2) and 3 for the rejection of artefacts. As long as the cusum (the integral of the differences between the line segment and the new data) is below th1, the linear model is considered correct. When the cusum exceeds th1, data are stored, and when the cusum crosses th2, a new line segment is calculated using the stored data. We proposed values for these parameters that perform well on biological parameters routinely monitored in ICU, such as the heart rate, the oxygen saturation rate, the arterial systolic and diastolic pressures, the respiratory rate, the minute ventilation and the maximal pressure in the airways. A complete description of the algorithm can be found in [12].
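The CUSUM logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of [12]: the three artefact-rejection parameters are omitted, the absolute value of the cusum is compared with th1 and th2, and the names `fit_line`, `segment_online` and `init_len`, as well as the choice to discard stored data when the cusum falls back below th1, are hypothetical.

```python
def fit_line(ts, ys):
    """Least-squares line through the points; returns (p, to, yo)."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    var_t = sum((t - t_mean) ** 2 for t in ts)
    cov = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    p = 0.0 if var_t == 0 else cov / var_t
    # Ordinate of the fitted line at the first stored instant.
    return p, ts[0], y_mean + p * (ts[0] - t_mean)


def segment_online(samples, th1, th2, init_len=5):
    """Yield (p, to, yo) each time a new line segment is calculated.

    samples: iterable of (t, y) pairs in time order.
    init_len: number of samples used to fit the very first segment
              (an assumption of this sketch).
    """
    samples = list(samples)
    head = samples[:init_len]
    p, to, yo = fit_line([t for t, _ in head], [y for _, y in head])
    yield p, to, yo
    cusum, stored = 0.0, []
    for t, y in samples[init_len:]:
        # Cumulative sum of the differences between data and current line.
        cusum += y - (p * (t - to) + yo)
        if abs(cusum) > th1:
            stored.append((t, y))   # the linear model is in doubt: store data
        else:
            stored.clear()          # assumption: the model is trusted again
        if abs(cusum) > th2 and len(stored) >= 2:
            # The model is rejected: fit a new segment on the stored data.
            p, to, yo = fit_line([s[0] for s in stored], [s[1] for s in stored])
            yield p, to, yo
            cusum, stored = 0.0, []
```

For instance, a signal that is constant at 100 and then rises by 1 per sample from t = 50 produces a first flat segment followed by a second segment of slope 1 once the cusum has crossed th2.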

2.2 Classification of the New Segment into a Temporal Shape

Once a new segment has been calculated by the segmentation algorithm, the segment forms a shape with the preceding one that can be classified into one of 9 temporal shapes: increasing, decreasing, steady, positive step, negative step, positive step+slope, negative step+slope, concave transient, convex transient.

Figure 1 shows the features extracted from the new segment just calculated and the previous one, which are used for the classification. The output of the segmentation algorithm is some information on the current segment i: the slope p(i), the starting point to(i), and the ordinate yo(i) at time to(i).

The current time, when the new segment has just been detected and calculated, is tc(i). The new shape associated with the new segment starts at time tb(i) = to(i) - Ts, with Ts the sampling period, i.e. it starts at the end of the previous segment. A shape is described by the following features:

- the total increase (or decrease) observed during the shape, named I, calculated as the difference between the value at the end of the shape and the value at its beginning:

$$I(i) = y_c(i) - y_b(i) = \{p(i)\,[t_c(i) - t_o(i)] + y_o(i)\} - \{p(i-1)\,[t_b(i) - t_o(i-1)] + y_o(i-1)\}$$

- the increase (or decrease) due to the discontinuity (or the step), named Id, calculated as the difference between the value at the beginning of the new segment and the value at the end of the previous segment:

$$I_d(i) = y_o(i) - y_b(i) = y_o(i) - \{p(i-1)\,[t_b(i) - t_o(i-1)] + y_o(i-1)\}$$

- the increase (or decrease) due to the slope, named Is, calculated as the difference between the value at the end of the segment and the value at the beginning of the segment:

$$I_s(i) = y_c(i) - y_o(i) = p(i)\,[t_c(i) - t_o(i)]$$

Fig. 1. Features extracted from a new segment and the previous one which are used for classification
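As an illustration, the three features I, Id and Is can be computed directly from the parameters of two consecutive segments. The sketch below follows the definitions given in the text; the names `Segment`, `ShapeFeatures` and `shape_features` are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    p: float   # slope
    to: float  # starting time of the segment
    yo: float  # ordinate at time to


class ShapeFeatures(NamedTuple):
    tb: float  # starting time of the shape
    yb: float  # value at the end of the previous segment
    yc: float  # value at the current time tc
    I: float   # total increase over the shape
    Id: float  # increase due to the discontinuity
    Is: float  # increase due to the slope


def shape_features(prev: Segment, cur: Segment, tc: float, Ts: float) -> ShapeFeatures:
    """Compute the shape features from the previous and current segments."""
    tb = cur.to - Ts                           # the shape starts one sample earlier
    yb = prev.p * (tb - prev.to) + prev.yo     # previous line evaluated at tb
    yc = cur.p * (tc - cur.to) + cur.yo        # current line evaluated at tc
    return ShapeFeatures(tb, yb, yc, I=yc - yb, Id=cur.yo - yb, Is=yc - cur.yo)
```

With a flat previous segment at 120 and a new segment starting at 130 with slope 2, evaluated 10 time units later, the features are Id = 10, Is = 20 and I = Id + Is = 30.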

The classification is achieved thanks to a hierarchical tree which is presented in figure 2. Its rules are the following:

First node: If the absolute value of Id is greater than a threshold thc, the shape is discontinuous: it is a “step”, a “step+slope” or a “transient”. Otherwise, the shape is continuous: it is a “steady”, an “increasing” or a “decreasing”.

Second node: If the shape is continuous and the absolute value of I is less than a threshold ths, the shape is “steady”. Otherwise, it is an “increasing” or a “decreasing”, depending on the sign of I.

Third node: If the shape is discontinuous and the absolute value of the increase due to the slope, Is, is less than the threshold ths, the shape is a “step”, positive if Id is positive, negative if Id is negative. Otherwise, it is a “transient” if the signs of Id and Is are opposite, or a “step+slope” if the signs are the same.

Thus, the classification requires only 2 parameters, thc and ths. For all the biological variables we considered, we set thc equal to ths. This threshold has a physical meaning for a clinician: when a variation observed on the variable is greater than this threshold, the variable is considered to have increased or decreased.

Fig. 2. Hierarchical tree
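The three nodes of the tree can be written as a short function. This is a minimal sketch, assuming that the thresholds are compared with absolute values; the function name `classify_shape` is hypothetical. Because the text does not state which sign combination corresponds to a concave or a convex transient, the two transients are labelled here by their direction of change, as in Section 2.3.

```python
def classify_shape(I, Id, Is, thc, ths):
    """Classify a shape from its features I, Id and Is."""
    if abs(Id) > thc:
        # First node: the shape is discontinuous.
        if abs(Is) < ths:
            # Third node: no significant slope after the jump, so it is a step.
            return "positive step" if Id > 0 else "negative step"
        if (Id > 0) != (Is > 0):
            # Jump and slope have opposite signs: transient.
            if Id > 0:
                return "increasing/decreasing transient"
            return "decreasing/increasing transient"
        # Jump and slope have the same sign: step followed by a slope.
        return "positive step+slope" if Id > 0 else "negative step+slope"
    # Second node: the shape is continuous.
    if abs(I) < ths:
        return "steady"
    return "increasing" if I > 0 else "decreasing"
```

Setting thc equal to ths, as done in the paper for all the variables considered, leaves a single threshold to tune per variable.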

2.3 Transformation of Shapes into Semi-quantitative Trend Pattern

The classification of the segment into a temporal shape provides symbolic information. Information on the starting instant of the shape - tb(i) -, the value of the variable at this moment - yb(i) -, the value of the variable one sampling period later - yo(i) - and the value of the variable at the end of the shape - ye(i) - can be kept and associated with the information on the shape to provide semi-quantitative information.

For example, a step is described by [Step, tb(i), yb(i), yo(i), ye(i)]. A steady may be described by [steady, tb(i), yb(i), ye(i)] since, in this case, yb(i) ≈ yo(i).

If a shape is a continuous one, the symbolic information is equivalent to the symbolic trend of the variable. If it is a discontinuous one, the shape is split in 2 parts, each part being associated with a trend information.

Thus, a positive (respectively negative) step described by [Step, tb(i), yb(i), yo(i), ye(i)] becomes [Increasing (respectively Decreasing), tb(i), yb(i), yo(i)] + [Steady, tb(i)+Ts, yo(i), ye(i)].

A positive (respectively negative) step+slope becomes [Increasing (respectively Decreasing), tb(i), yb(i), yo(i)] + [Increasing (respectively Decreasing), tb(i)+Ts, yo(i), ye(i)].

An increasing/decreasing (respectively decreasing/increasing) transient becomes [Increasing (respectively Decreasing), tb(i), yb(i), yo(i)] + [Decreasing (respectively Increasing), tb(i)+Ts, yo(i), ye(i)].

So, the 9 shapes are reduced to 3 trend patterns: steady, increasing, decreasing, each associated with three quantitative values: the time of the beginning, the value of the variable at the beginning, and the value of the variable at the end.
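The reduction from 9 shapes to 3 trend patterns can be sketched as a lookup table. The shape labels, the tuple layout (trend, start time, start value, end value) and the function name `to_trend_patterns` are assumptions made for this illustration.

```python
CONTINUOUS = {"steady", "increasing", "decreasing"}

# Each discontinuous shape is split into two trend patterns:
# the jump itself, then the part that follows it.
SPLIT = {
    "positive step": ("increasing", "steady"),
    "negative step": ("decreasing", "steady"),
    "positive step+slope": ("increasing", "increasing"),
    "negative step+slope": ("decreasing", "decreasing"),
    "increasing/decreasing transient": ("increasing", "decreasing"),
    "decreasing/increasing transient": ("decreasing", "increasing"),
}


def to_trend_patterns(shape, tb, yb, yo, ye, Ts):
    """Return the list of (trend, t_begin, y_begin, y_end) for one shape."""
    if shape in CONTINUOUS:
        # A continuous shape is already a trend pattern.
        return [(shape, tb, yb, ye)]
    first, second = SPLIT[shape]
    return [(first, tb, yb, yo), (second, tb + Ts, yo, ye)]
```

A positive step from 120 to 130 that then stays near 130 thus yields one increasing pattern lasting a single sampling period, followed by a steady pattern.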

2.4 Aggregation of Trend Patterns

Aggregating trend patterns consists in associating the current trend pattern with the previous one to form the longest possible temporal sequence. The aggregating rules are the following:

If the previous sequence is [increasing, tb(i-1), yb(i-1), ye(i-1)]:

- If the current trend pattern is [increasing, tb(i), yb(i), ye(i)], then the new sequence is [increasing, tb(i-1), yb(i-1), ye(i)].
- If the current trend pattern is steady or decreasing, then it cannot be aggregated and it starts a new sequence: [increasing, tb(i-1), yb(i-1), ye(i-1)] ; [steady, tb(i), yb(i), ye(i)].

The possible aggregations are:

- Increasing + increasing = increasing
- Decreasing + decreasing = decreasing
- Steady + steady = steady if the increase of the global sequence [ye(i)-yb(i-1)] is below the threshold ths, the threshold used to separate the shapes steady and increasing. Otherwise, steady + steady = increasing (or decreasing).

This last rule makes it possible to detect slow trends in the signal. Such a trend takes more time to be detected, since it requires the association of at least two steady trend patterns. Yet this is not a major drawback, since the appearance of a slow trend does not mean an immediate danger for the patient.

Let us suppose that the last two sequences extracted are [increasing, tb(i-1), yb(i-1), ye(i-1)] ; [steady, tb(i), yb(i), ye(i)]. This means, in natural language, that the variable increased from tb(i-1) until tb(i), from yb(i-1) to ye(i-1), and that it has been steady at the value yb(i) since tb(i). For two consecutive sequences, the value at the end of the previous sequence is equal to the value at the beginning of the new sequence, i.e. yb(i) = ye(i-1). The time at which a sequence ends is equal to the time at which the following sequence starts.
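The aggregation rules of this section can be sketched as follows. Sequences and trend patterns are represented as tuples (trend, t_begin, y_begin, y_end); this layout and the function name `aggregate` are assumptions, and the global increase of two steady patterns is compared with ths in absolute value so that slow decreases are handled symmetrically.

```python
def aggregate(sequences, pattern, ths):
    """Return the list of sequences after adding the current trend pattern."""
    if not sequences:
        return [pattern]
    trend, tb, yb, ye = pattern
    last_trend, last_tb, last_yb, _ = sequences[-1]
    if trend == last_trend and trend in ("increasing", "decreasing"):
        # Same direction: extend the previous sequence up to the new end value.
        merged = (trend, last_tb, last_yb, ye)
    elif trend == last_trend == "steady":
        delta = ye - last_yb  # increase of the global sequence
        if abs(delta) < ths:
            merged = ("steady", last_tb, last_yb, ye)
        else:
            # Two steady patterns that drift apart reveal a slow trend.
            new_trend = "increasing" if delta > 0 else "decreasing"
            merged = (new_trend, last_tb, last_yb, ye)
    else:
        # Different trends cannot be aggregated: start a new sequence.
        return sequences + [pattern]
    return sequences[:-1] + [merged]
```

Calling `aggregate` once per new trend pattern keeps the list of sequences as short as the rules allow; whether a slow trend obtained from two steady patterns should in turn be merged with the sequence preceding it is not specified in the text, and this sketch does not do so.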


3 Results

3.1 Results on Simulated Data

To analyse the ability of the method to extract successive sequences, it was first tested on a set of simulated data. The simulated data correspond to 3 common situations that may occur on the SpO2 signal monitored from a patient hospitalised in ICU:

- Situation 1: a transitory hypoxic event simulated by a decrease of SpO2 from 96% to 86% in 30 sampling periods, followed by an increase from 86% to 96% in 30 sampling periods.

- Situation 2: a change in the patient’s state simulated by an increase of 4%, from 93% to 97%, lasting 180 sampling periods.

- Situation 3: a slow decrease simulated by a decrease of 5%, from 99% to 94%, lasting 1500 sampling periods.

Gaussian white noise with variance 0.5 was added to the root signals. The signal-to-noise ratio of the simulations, calculated as the ratio of the variance of the root signal to the variance of the noise, was about 6. For each of the 3 situations, 50 sets of simulated data were created and the method was applied to each set. The result obtained for each set was classified into one of 3 categories:

- Minimal sequences detection (MSD) when the successive sequences extracted corresponded to the minimal number of sequences required to correctly describe the data (4 for situation 1, 3 for situation 2, 3 for situation 3).

- Readable sequences (RS) when the successive sequences extracted are more than the minimal number of sequences required, with a few additional steady sequences between 2 increasing (or decreasing) sequences.

- Unreadable sequences (US) when some erroneous increasing-decreasing sequences appeared.
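The simulation protocol described above can be sketched as follows for situation 1. This is a minimal sketch, not the authors' code: the length of the steady baseline before and after the event is not given in the text and is an assumption here, as are the function names and the use of fixed random seeds.

```python
import random
import statistics


def ramp(start, end, n):
    """n samples moving linearly from start to end (both included)."""
    return [start + (end - start) * k / (n - 1) for k in range(n)]


def hypoxic_event(baseline_len=30):
    """Situation 1: 96% -> 86% in 30 samples, then 86% -> 96% in 30 samples.

    baseline_len (steady samples before and after the event) is an
    assumption; the text does not specify it.
    """
    steady = [96.0] * baseline_len
    return steady + ramp(96.0, 86.0, 30) + ramp(86.0, 96.0, 30) + steady


def add_noise(root, variance=0.5, seed=None):
    """Add Gaussian white noise of the given variance to a root signal."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    return [y + rng.gauss(0.0, sigma) for y in root]


def snr(root, noise_variance=0.5):
    """Ratio of the variance of the root signal to the noise variance."""
    return statistics.pvariance(root) / noise_variance


root = hypoxic_event()
datasets = [add_noise(root, seed=s) for s in range(50)]  # 50 simulated sets
print(len(datasets), round(snr(root), 1))
```

The signal-to-noise ratio obtained this way depends on the assumed baseline length, so it should not be expected to match the value reported above exactly.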

The results obtained are summarized in Table 1. They show that the method is able to extract correct sequences for the 3 situations considered. The results are quite good when the change to detect is rapid, and slightly less good when the change is slow.

Table 1. Results on simulated data (number of data sets in each category, out of 50 per situation)

Situation        MSD   RS   US
Hypoxic event     50    0    0
State change      46    4    0
Slow decrease     42    7    1

3.2 Results on Real Data

The method was applied to real biological data recorded from different patients admitted to the ICUs of two French hospitals, Lyon-Sud and Lille.

Sequences extracted from Spo2 data recorded every second, or every five seconds, during 14 hours from 6 different patients were presented to two clinicians. The sequences were extracted with the same values of the tuning parameters for all the patients. Both clinicians judged that the sequences correctly explained the behaviour of the variable and that no important changes were missed.


As an example, a set of data is presented in Fig. 3, corresponding to 33 minutes of recording from an artificially ventilated patient whose sedative drug perfusion had been stopped 2 hours earlier. The patient is in the process of awakening. The results are presented for 5 biological variables (Spo2, heart rate, systolic blood pressure, respiratory rate and minute ventilation) recorded with a sampling rate of 1 Hz.

During the first 10 minutes, the patient’s state is steady. Then the patient awakes because of some care given by a nurse and becomes agitated, then gets back to sleep 15 minutes later. These different stages are clearly visible on the sequences extracted from the 5 variables.

During the first 10 minutes and the last five minutes of the recording, the sequences extracted from the five variables report them to be steady at normal physiological values. In between, for 15 minutes, the variables vary.

Spo2 drops below 90% three times during this period. According to physicians, the first event is a transitory hypoxic event, consecutive to a tracheal suction, the second and third decreases are artefacts due to a loss of the signal because of the patient’s agitation. The first artefact lasts 12s and the second 5s.

In the sequences extracted, the first decrease is reported as a decrease from 97% to 88% lasting 80s, the second is reported as a decrease from 97% to 80% in 5s, and the third is not reported because it has been filtered by the segmentation algorithm.

The first hypoxic event, detected as starting at time 620s by the method, is preceded by a decrease in the minute ventilation from 11 to 6 l/min, lasting 60s and starting at time 600s. This decrease is explained by the tracheal suction. The decrease in minute ventilation is immediately followed by an increase to 16 l/min. Concomitantly with the decrease in minute ventilation, a rapid increase of the respiratory rate from 20 to 23, then from 23 to 30 breaths per minute, is reported.

The patient’s agitation during the care is also visible on the heart rate, which increases from 82 bpm to 103 bpm at time 700s, and on the systolic pressure, which drastically increases from 150 mmHg to 210 mmHg, then slowly oscillates around 200 mmHg. Then, all the variables decrease more or less rapidly to their initial values.

4 Discussion

The preceding example shows how the sequence extraction method could be useful in an intelligent alarm system. Firstly, information is given on the time evolution of the variables, which could make it possible to eliminate some false alarms. For example, the second hypoxic event is reported as a decrease in Spo2 of 16% in 5s, which does not correspond to a physiological decrease. One could imagine developing, in the future, an artefact rejection method using the sequences extracted. Secondly, the association of sequences from different monitored variables could be used in knowledge-based systems. The successive sequences are composed of semi-quantitative temporal information that could easily be transformed into qualitative temporal information by replacing the value of the variable by an indication of the normality of the value. The sequence would become “the variable is decreasing from t0 until t1 from normal value to low value”. The association of qualitative sequences could trigger some decision rules. For example, the first decrease in Spo2 is reported with a concomitant decrease in minute ventilation, which can activate a decision rule of the form “if a decrease in Spo2 to an abnormal value is reported concomitantly with a decrease in minute ventilation to a low value then the patient is hypoxic because of a respiratory problem”.
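The transformation into qualitative information and the decision rule quoted above can be sketched in Python. The sequence values are those reported in Section 3.2 for the first hypoxic event; the normal ranges, the dictionary layout and the function names are hypothetical and serve only to illustrate the idea.

```python
def qualify(value, low, high):
    """Replace a numerical value by an indication of its normality."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def overlaps(a, b):
    """True if two sequences share part of their time span."""
    return a["tb"] < b["te"] and b["tb"] < a["te"]


def respiratory_hypoxia(spo2, mv, spo2_range, mv_range):
    """Rule: Spo2 decreases to an abnormal value concomitantly with a
    decrease of minute ventilation to a low value."""
    spo2_end = qualify(spo2["ye"], *spo2_range)
    mv_end = qualify(mv["ye"], *mv_range)
    return (spo2["trend"] == "decreasing" and spo2_end != "normal"
            and mv["trend"] == "decreasing" and mv_end == "low"
            and overlaps(spo2, mv))


# Sequences reported in Section 3.2 (tb/te in seconds).
spo2 = {"trend": "decreasing", "tb": 620, "te": 700, "yb": 97, "ye": 88}
mv = {"trend": "decreasing", "tb": 600, "te": 660, "yb": 11, "ye": 6}

# Normal ranges below are assumptions made for this sketch only.
fired = respiratory_hypoxia(spo2, mv, spo2_range=(90, 100), mv_range=(7, 12))
print(fired)
```

Here “concomitantly” is read as an overlap of the two time spans; a knowledge-based system could of course use a stricter or looser temporal relation.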


Fig. 3. Spo2, Minute ventilation, respiratory rate, systolic blood pressure, heart rate signals and the corresponding sequences.


The method is tuned with seven parameters, which are easy to tune because they have a physical meaning for the physician. Five of these parameters are used to tune the segmentation algorithm (two for the decomposition into segments and three for the rejection of large artefacts) and two to tune the classification into temporal shapes. The same tuning of parameters applied to different patients gave correct results. This is interesting because it tends to show that a training period, during which the parameters would be adapted to a patient, is not necessary.

The method is developed to work on line. The delay of detection of a change in the data depends on the tuning of the segmentation algorithm. As presented in [12], the delay of detection is not fixed but depends on the importance of the change: the more important the change, the quicker the detection. This is very interesting in the context of patient monitoring because an important change can mean that the patient’s life is at stake and should be detected very quickly.

The information given on the trend corresponds to three classes only: steady, increasing, decreasing. This means that a slow decrease in the data followed by a rapid decrease will be reported as a single decreasing sequence. It may be interesting to add two more classes, “increasing rapidly” and “decreasing rapidly”, to have a more accurate description of the data evolution. However, describing the temporal behaviour with three classes has the advantage of simplicity.

The sequences extracted are strongly dependent on the segments extracted by the segmentation algorithm. The localisation of the beginning (or the end) of an increasing (or decreasing) sequence may not be very accurate, since it depends on the beginning time of the segment calculated by the segmentation algorithm. Moreover, when a decrease starts, the variable can vary slowly at the beginning and the corresponding segment may be classified as steady. The decreasing sequence will then start with the following segment classified as decreasing, and may thus start with a delay.

The combination of abstractions for neighbouring segments is rather simple. Only the combination of two consecutive sequences is analysed. We made this choice to make the aggregation step easy to perform on line. Consequently, the sequences extracted may not be optimal. For example, when a short steady segment appears between two decreasing segments, these three segments are not merged to create a single decreasing segment. This could be a future development of our work.

Compared to [11], the method developed extracts the trend in the signal and combines the information on the trend with the value of the signal. This can make it easier to eliminate variations that do not threaten the patient’s life. Compared to [10], our method does not require a pre-filtering of the data and, compared to [9], it can be processed on line.

5 Conclusion

In this paper, we propose a methodology to extract on line successive temporal sequences from high frequency monitored variables. The method uses a segmentation algorithm that was developed previously and a classification of the segments into temporal patterns. It is tuned rather easily by seven parameters, which take different values from one biological variable to another. The method does not require any training period or any data filtering before its application to a patient. It is applied with the same set of tuning values for any patient.


The results obtained on simulated data are satisfactory and the results obtained on Spo2 data were approved by two clinicians.

These temporal sequences can help the health care personnel to take decisions in alarm situations, or can be used as inputs to intelligent alarm systems.

Acknowledgment

Special thanks are expressed to Drs. P.Y. Carry and J.P. Perdrix from the Intensive Care Unit of CHU Lyon Sud and to M.C. Chambrin from the INSERM for their help in the analysis of the results.

This work is supported by the national network for health technologies 2000-2003, from the French research ministry.

References

1. O’Carrol, T.: Survey of alarms in an intensive therapy units. Anaesthesia 41 (86) 742-744

2. Beneken, J., Van der A.A. J.: Alarms and their limits in monitoring. J. Clin. Monit. 5 (89) 205-210

3. Coiera, E.: Intelligent monitoring and control of dynamic physiological systems. Artificial Intelligence in Medicine 5 (93) 1-8

4. Uckun, S.: Intelligent systems in patient monitoring and therapy management A survey of research projects. International Journal of Clinical Monitoring and Computing 11 (94) 241-25

5. Steimann, F.: The interpretation of time-varying data with DIAMON-1. Artificial Intelligence in Medicine 8 (96) 343-357

6. Avent, R., Charlton, J.: A critical review of trend-detection methodologies for biomedical systems. Critical Reviews in Biomedical Engineering 17 (90) 621-659

7. Haimowitz, I., Phillip, P., L., Kohane, I.: Clinical monitoring using regression-based trend templates. Artificial Intelligence in Medicine 7 (95) 473-496

8. Shahar, Y.: A framework for knowledge-based temporal abstraction. Artificial Intelligence in Medicine 90 (97) 79-133

9. Hunter, J., McIntosh, N.: Knowledge based event detection in complex time series data. AIDM’99, Lecture Notes in Artificial Intelligence 1620 (99) 271-280

10. Salatian, A., Hunter, J.R.W.: Deriving trends in historical and real-time continuously sampled medical data. Journal of Intelligent Information Systems 13 (99) 47-71

11. Calvelo, D., Chambrin, M.,C., Pomorski, D., Ravaux, P.: Towards symbolisation using data-driven extraction of local trends for ICU monitoring. Artificial Intelligence in Medicine 1-2 (2000) 203-223

12. Charbonnier, S., Becq, G., Biot, L., Carry, P., Perdrix, J.P.: Segmentation algorithm for ICU continuously monitored clinical data. 15th World IFAC congress (2002)


Quality Assessment of Hemodialysis Services through Temporal Data Mining

Riccardo Bellazzi1, Cristiana Larizza1, Paolo Magni1, and Roberto Bellazzi2

1 Dip. Informatica e Sistemistica, Università di Pavia, via Ferrata 1, 27100, Pavia Italy {Riccardo.Bellazzi,Cristiana.Larizza,Paolo.Magni}@unipv.it

2 Unità Operativa di Nefrologia e Dialisi, S.O Vigevano, A.O. Pavia Corso Milano 19, 27029, Vigevano Italy

Abstract. This paper describes a research project that deals with the definition of methods and tools for the assessment of the clinical performance of a hemodialysis service on the basis of time series data automatically collected during the monitoring of hemodialysis sessions. While simple statistical summaries are computed to assess basic outcomes, Intelligent Data Analysis and Temporal Data mining techniques are applied to gain insight and to discover knowledge on the causes of unsatisfactory clinical results. In particular, different techniques, comprising multi-scale filtering, Temporal Abstractions, association rules discovery and subgroup discovery are applied on the time series. The paper describes the application domain, the basic goals of the project and the methodological approach applied for time series data analysis. The current results of the project, obtained on the data coming from more than 2500 dialysis sessions of 33 patients monitored for seven months, are also shown.

1 Introduction

Health care institutions routinely collect a large quantity of clinical information about patients’ status, physicians’ actions (therapies, surgeries) and health care processes (admissions, discharges, exam requests). Despite the abundance of this kind of data, their practical use is still limited to reimbursement and accounting procedures and sometimes to epidemiological studies. The general claim of researchers is that the potential for generalization of those data, which we will refer to as process data, is very weak, since they are not collected in controlled clinical trials. However, the growing interest in knowledge management within health care institutions has highlighted the crucial role of process data for organizational learning [1,2]. One of the aspects of organizational learning is represented by the assessment of the quality of a hospital service, in particular in relationship to certain performance indicators [3]. Such performance indicators can be related either to the efficiency of the hospital department or to the efficacy of the treatment delivered. In this paper we are interested in the use of data mining tools for assessing the efficacy of the treatment delivered by a Hospital Hemodialysis Department (HHD) on the basis of the process data routinely collected during hemodialysis sessions. HHDs manage chronic patients who undergo blood depuration (hemodialysis) through an extra-corporal circuit three times a week for four hours. The data accumulated over time for each patient contain the set of variables that are monitored during each dialysis session. In other words, the data collected are time series (inter-session data) of multidimensional time series (intra-session data). Those process data are typically neglected during clinical treatment, since they are synthesized by a few clinical indicators observed at the beginning and at the end of each treatment session. Such clinical indicators are usually related to the well-being of patients, and do not contain detailed information about the quality of the treatment, in terms, for example, of blood depuration efficiency or nurse interventions during the dialysis itself. The goal of an auditing system for quality assessment is therefore to fully exploit the process data that may be automatically collected in order to: i) assess the performance of the overall HHD; ii) assess the performance achieved for each patient; iii) highlight problems and understand their reasons. Steps i)-iii) require first defining a suitable set of automatically computable performance indicators and then analyzing the dialysis temporal patterns, by studying both inter- and intra-dialysis data. In particular, the design and implementation of this system require the use of methodological tools to perform two different temporal data mining tasks [4]: a) the discovery of patient-specific relationships between the time patterns of monitoring variables and the dialysis performance indexes; b) the extraction of HHD-specific relationships between the time patterns of monitoring variables and the dialysis performance indexes.

In this paper, we present both a new method for the discovery of patient-specific temporal patterns and a new system for quality assessment of dialysis sessions; the system is currently used in the clinical routine. In particular, the paper first describes the application domain and the basic goals of the project; then, it presents the methodological approach applied for time series data analysis and the results obtained.

2 End Stage Renal Failure and Hemodialysis

End stage renal disease (ESRD) is a severe chronic condition that corresponds to the final stage of kidney failure. In ESRD, kidneys are no longer able to clear blood from metabolites and to eliminate water from the body. Without medical intervention, ESRD leads to death. In 1999 the ESRD incidence in Italy was 130 cases per million [5]. The elective treatment of ESRD is kidney transplant. Blood-filtering dialysis treatment is provided as a suitable alternative to transplant for people on the waiting list or for people who cannot be transplanted at all, such as elderly people. Two main classes of dialysis treatments are nowadays available: hemodialysis (HD) and peritoneal dialysis. More than 80% of the ESRD patients are treated with HD. In the HD treatment the blood passes through an extra-corporal circuit where metabolites (e.g. urea) are eliminated, the acid-base equilibrium is re-established and the water in excess is removed. Typically, HD is performed by exchanging solutes through a semi-permeable membrane (dialysis) and by removing water through a negative pressure gradient (ultrafiltration); a device called hemodialyzer regulates the overall procedure. In general, HD patients undergo a dialysis session for four hours three times a week. The dialysis treatment has very high costs and is extremely demanding from an organizational viewpoint [6]. Rather interestingly, a potential solution to improve the efficiency of dialysis services is represented by Information Technology, as reported in the literature [7-9]. In this paper we are interested in the exploitation of the recent advances in the implementation of hemodialyzers, which allow an automated monitoring of dialysis sessions [8]. In particular we have implemented an auditing system designed to summarize the dialysis sessions from a clinical quality viewpoint. This tool is aimed at automatically extracting meaningful patterns from the data that may be useful for assessing the dialysis sessions from an organizational learning perspective, i.e. for periodically evaluating the HHD performance, both over all patients and for each individual patient.

3 A System for Quality Assessment of Hemodialysis Centers

3.1 Measurements

Our system for quality assessment of HD sessions is based on the automatic measurement of 13 variables, which reflect three main aspects of the HD process: 1) Efficiency of the removal of protein catabolism products (urea, creatinine); such efficiency is indirectly evaluated by measuring the blood flow in the extracorporeal circuit (QB), the body weight loss (WL) and the dialysis time (T). 2) Efficiency of the extracorporeal circuit of the dialyzer; this aspect is evaluated by measuring the pressure of the circuit before (arterial pressure, AP) and after (venous pressure, VP) the dialyzer (i.e. the device where the exchange of water and solutes is performed) and the output pressure of the dialyzer (OP). 3) Body water reduction and hypotension episodes. The monitoring of body water and of the patient’s arterial pressure is performed by measuring the water flow through the dialyzer (UF), the systolic and diastolic pressures (SP, DP), the cardiac frequency (CF), the hemoglobin concentration (Hb) and the estimated plasma volume (PVol). The conductivity of the dialyzer solution (CD) is also monitored, to keep track of water exchanges due to osmosis. The body water reduction is monitored through the (already mentioned) weight loss, too. All those parameters are monitored during each session. Finally, for each dialysis session, the physician defines a set of prescriptions, that is the collection of hemodialyzer settings and HD targets that should be followed and reached at the end of the dialysis.

3.2 Data Summaries for Quality Assessment

In order to assess the performance of each dialysis session, we compute a suitable summary of the intra-dialysis time series. In particular, each session is synthesized through the classical nonparametric statistical indexes: the median and the 10th and 90th percentiles of each monitored variable. After having calculated the median values, we obtain a new multidimensional time series, in which each point is the vector of the median values of the 13 monitoring variables, computed on the data collected during a dialysis session. We will refer to this time series as the median time series.
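The session summary and the construction of the median time series can be sketched as follows. This is only an illustration: the function names and the sample values are made up, only two of the 13 variables are shown, and the percentile interpolation rule (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
import statistics


def summarize_session(samples):
    """samples maps a variable name to the values measured in one session.

    Returns {variable: (10th percentile, median, 90th percentile)}.
    """
    summary = {}
    for name, values in samples.items():
        # n=10 yields the 9 deciles; the first and last ones are the
        # 10th and 90th percentiles.
        deciles = statistics.quantiles(values, n=10, method="inclusive")
        summary[name] = (deciles[0], statistics.median(values), deciles[-1])
    return summary


def median_time_series(sessions):
    """One vector of medians per session, in chronological order."""
    return [{name: stats[1] for name, stats in summarize_session(s).items()}
            for s in sessions]


# Made-up intra-session data for two sessions and two variables.
sessions = [
    {"QB": [298, 300, 301, 299, 300], "VP": [150, 155, 160, 158, 152]},
    {"QB": [280, 285, 290, 288, 284], "VP": [170, 172, 175, 171, 169]},
]
series = median_time_series(sessions)
print(series)
```

Each element of `series` corresponds to one point of the median time series, i.e. one dialysis session.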

Exploiting the median values, it is possible to assess the quality of a session by performing a comparison between a set of reference values and a set of dialysis outcome parameters. In more detail, we consider six parameters: i) the median levels of QB, VP, AP, that we will denote as QB*, VP*, AP*; ii) the time difference between the prescribed dialysis time and the effective one (ΔT); iii) the difference between the prescribed weight loss and the weight loss measured at the end of the dialysis (ΔL); iv) the difference between the weight reached at the end of the dialysis and the ideal weight prescribed by the physician (ΔW).

A general index of success is derived by judging as positive a treatment in which all (AND) the logic conditions of Table 1 are satisfied:


Table 1. Outcome parameters and the corresponding logical conditions for their assessment

Parameter   Condition
QB*         Not less than 2% of the prescription
VP*         Less than or equal to 350 ± 3 mmHg
AP*         Greater than or equal to -250 mmHg
ΔT          Less than or equal to 2% of the prescription
ΔL          Less than or equal to 7% of the prescription
ΔW          Less than or equal to 5% of the prescription

If any of the conditions is not satisfied, the dialysis is considered to have failed. The failure is determined by one or more failure parameters, i.e. the outcome parameters that do not satisfy the condition. The parameters of Table 1 have been derived on the basis of the available background knowledge.
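The success index and the list of failure parameters can be sketched as below. This is a sketch under explicit assumptions, not the system's actual implementation: the input is assumed to already contain the deviations expressed as fractions of the corresponding prescription, the QB* condition is read as “the median blood flow is at most 2% below the prescription”, and the ± 3 mmHg tolerance on VP* is ignored. All field names are hypothetical.

```python
def failure_parameters(session):
    """Return the Table 1 parameters whose condition is violated.

    An empty list means that all (AND) conditions hold, i.e. the
    session is judged successful.
    """
    checks = {
        # Assumed reading: median QB at most 2% below the prescribed value.
        "QB*": session["qb_shortfall"] <= 0.02,
        "VP*": session["vp_median"] <= 350,    # mmHg
        "AP*": session["ap_median"] >= -250,   # mmHg
        # Deviations below are fractions of the prescription (assumption).
        "dT": session["dT"] <= 0.02,
        "dL": session["dL"] <= 0.07,
        "dW": session["dW"] <= 0.05,
    }
    return [name for name, ok in checks.items() if not ok]


def failure_rate(sessions):
    """Fraction of failed sessions, e.g. for a monthly audit."""
    failed = sum(1 for s in sessions if failure_parameters(s))
    return failed / len(sessions)


# Made-up sessions for illustration.
good = {"qb_shortfall": 0.01, "vp_median": 180, "ap_median": -150,
        "dT": 0.0, "dL": 0.03, "dW": 0.01}
bad = {"qb_shortfall": 0.06, "vp_median": 180, "ap_median": -150,
       "dT": 0.0, "dL": 0.10, "dW": 0.01}
print(failure_parameters(good), failure_parameters(bad))
```

The same function can be applied to the sessions of one patient or of the whole center, which corresponds to the two audit levels.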

In an audit session, that is typically performed monthly (but may be performed weekly or even daily), the physician can easily calculate the percentage of failures at the center or at the patient level.

3.3 Looking for Temporal Patterns and Knowledge Discovery

A crucial aspect of our project is to be able to provide clinicians and nurses with a deeper insight into the HD temporal patterns, in order to discover the reasons for failures, derive associations between monitoring variables and understand if there are relationships between monitoring variables and failures that hold at the center (population) level. To this end, we have defined a temporal data mining strategy to analyze the data based on the time series of the median values of each dialysis session. Such strategy is based on the following steps: A) Extraction of basic Temporal Abstractions (States and Trends) from the median time series; the series are pre-processed for trend detection through a multi-scale filtering method; B) Search for associations between Temporal Abstractions and failures; these associations may be interpreted as association/classification rules, which may hold on a single patient; C) Discovery of subgroups of patients that show the same associations between monitoring variables and failure parameters. The remaining part of the paper will describe in detail the steps which have been implemented in our auditing system.

3.3.1 Representing the Time Series through Temporal Abstractions

Temporal Abstractions (TA) are techniques exploited to extract specific patterns from temporal data; such patterns should represent a meaningful summary of the data and should be conveniently used to derive features that characterize the dynamics of the system under observation [10,11]. The goal of the TA mechanisms is the identification of intervals corresponding to specific patterns shown by the data that should represent a condition that occurs and evolves during specific time periods. Our TA extraction technique is based on the analysis of time-stamped data and refers to two different categories of TAs: Basic and Complex TAs. Basic TAs represent simple patterns derived from numeric or symbolic uni-dimensional time series. In particular, we extract Trend abstractions, representing an increase, decrease or stationary trend of a numerical time series, and State abstractions, representing qualitative patterns of low, high, normal values of a numerical or symbolic time series. Complex TAs represent complex patterns of uni-dimensional or multi-dimensional time series which correspond to specific temporal relationships among Basic TAs. The relationships investigated with Complex TAs include the temporal operators defined in the Allen algebra [12]. In the hemodialysis problem, we use Basic TAs to summarize the dynamics of each variable during each session. Before running the TA mechanisms, the median time series is pre-processed in order to robustly detect trend TAs.

3.3.2 Pre-processing of the Median Time-Series through Multi-scale Filtering Methods

One of the major defects of applying trend detection algorithms directly to the raw data is the dependence of the result on the sampling frequency and on the measurement errors. Usually, trend detection is performed on a filtered series in order to reduce these problems; however, the choice of the filter can strongly condition the trend detection results. In particular, the outputs of the filtering algorithms depend on smoothing parameters, which reflect the prior knowledge available both on the process that generates the data and on the measurement noise.

In our case, while it is possible to assume that the noise on the calculated median values is intrinsically low or absent, we do not have any information about the degree of regularity that can be expected in order to properly evaluate the trends from a clinical viewpoint. In other words, since the analysis of the median time series in HD is a new approach to dialysis quality control, we cannot rely on existing knowledge about the dynamics underlying the data generation process. For this reason, we resorted to a new robust strategy, based on a multi-scale smoothing filter. Multi-scale filtering can be performed through a variety of techniques, which have been proposed in the majority of cases in the image processing field. In our case, we resorted to the so-called discrete wavelets [13].

- The smoothing filter chosen is the discrete 1-D wavelet decomposition with Daubechies wavelets.

- Thanks to the multi-scale nature of wavelet decomposition, five different smoothed series are reconstructed from the median time series using different wavelet transform coefficients. These series correspond to the first five wavelet scale values of the discrete wavelet transform. Each scale is related to a different smoothing level.

- For each of the five time series, the trend is detected on the basis of a standard method [10]. In this way it is possible to assign to each time point of the filtered time series a TA within the set {decreasing, stationary, increasing}.

- On the basis of a voting strategy, each time point of the median time series is assigned to one TA {decreasing, stationary, increasing}: the TA is confirmed if it is found at the majority of scales. The trend detection so obtained is robust since it is independent of the smoothing scale used for filtering the curve.
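The voting step of the procedure above can be sketched as follows, taking as input the trend labels already detected at each of the five scales (the wavelet filtering and the trend detection themselves are not reproduced here). The text does not say what happens when no label reaches a majority; returning `None` in that case is an assumption of this sketch, as are the function name and the example labels.

```python
from collections import Counter


def vote(labels_per_scale):
    """Majority vote across smoothing scales.

    labels_per_scale is a list of label sequences, one per scale, all of
    the same length, with labels "D" (decreasing), "S" (stationary) and
    "I" (increasing). Returns one label per time point, or None when no
    label is found at a strict majority of scales (assumed fallback).
    """
    n_scales = len(labels_per_scale)
    result = []
    for labels in zip(*labels_per_scale):
        label, count = Counter(labels).most_common(1)[0]
        result.append(label if count > n_scales / 2 else None)
    return result


# Made-up trend labels for 4 sessions at 5 smoothing scales.
scales = [
    ["I", "I", "S", "D"],
    ["I", "S", "S", "D"],
    ["I", "I", "S", "S"],
    ["S", "I", "D", "I"],
    ["I", "S", "S", "I"],
]
voted = vote(scales)
print(voted)
```

With five scales, a label is confirmed when it is found at three scales or more.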

3.3.3 Search for Associations between TAs and Outcomes

Once we have derived the state and trend TAs for each monitoring variable, we want to look for possible associations between the TAs and the failure parameters. We would like to obtain rules of the kind “IF the Trend of Venous pressure is increasing THEN dialysis fails due to insufficient weight loss”. To achieve this goal it is possible to search for the co-occurrences of TAs and failures and, then, to select the most frequent ones. The search algorithm described in this section has been inspired by the work of F. Hoppner [14] and by the well-known PRISM algorithm [15].

In order to describe the search algorithm we exploited, it is necessary to introduce some definitions and notations.

Given the median time series Vj of a variable j, the state TAs for Vj can be represented as the collection of time intervals in which Vj is either high (H) or normal (N) or low (L), while the trend TAs for Vj can be represented as the collection of time intervals in which Vj is either increasing (I), stationary (S) or decreasing (D).

Given two or more TAs, for example “Vj is N” and “Vi is I”, we can easily calculate their intersection as the intersection of their time intervals; the time span (TSji(N,I)) of such intersection corresponds to the number of dialysis sessions in which both TAs occur.

The intersection can be calculated also for one or more TAs with any failure parameter. In this case, given the abstraction “Vj is S”, and the failure parameter O=Oi, we denote the TS of their intersection as TSjo(S, Oi).

Finally, let us note that TSjjj(H,N,L) and TSjjj(I,S,D) are equal to zero.

The search procedure aims to define rules of the form A → Oi, where A is the body and Oi is the head of the rule: in our case A is any conjunction of TAs (e.g. “Vi is L” and “Vi is D” and “Vj is H”), and Oi is a failure parameter (e.g. failure due to ΔL)¹. A is here interpreted as the intersection of the TAs involved in the conjunction. It is therefore possible to calculate the time span of A (TSA) resorting to the definition given above.

We define the support (SU) of a rule as the number of sessions in which the rule holds (i.e. SU= TSAOi) and we define the confidence (CF) of the rule as the conditional probability P(Oi|A), which may be calculated dividing SU by TSA, i.e. CF= TSAOi/TSA.
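As a small worked example of these two definitions, each TA and each failure parameter can be represented by the set of dialysis sessions in which it holds, so that time spans are simply sizes of set intersections. The session numbers and the names below are made up for illustration.

```python
def support_and_confidence(body_tas, failure_sessions):
    """Support and confidence of the rule A -> Oi.

    body_tas is a list of sets of session indices, one per TA in the
    body A; failure_sessions is the set of sessions failed due to Oi.
    """
    body = set.intersection(*body_tas)     # sessions where A holds
    su = len(body & failure_sessions)      # SU = TSAOi
    cf = su / len(body) if body else 0.0   # CF = TSAOi / TSA
    return su, cf


# Made-up data: sessions where each TA or failure is observed.
vp_increasing = {3, 4, 5, 6, 7, 8}
qb_low = {5, 6, 7, 8, 9}
failed_weight_loss = {6, 7, 8, 9, 10}

su, cf = support_and_confidence([vp_increasing, qb_low], failed_weight_loss)
print(su, cf)
```

Here the body holds in sessions 5 to 8 (TSA = 4) and the failure co-occurs in sessions 6 to 8, giving a support of 3 and a confidence of 0.75.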

Resorting to the definitions given above, we run a search strategy which recursively constructs rules with the maximal body having SU>SUmin and CF>CFmin, where SUmin and CFmin are suitable threshold values.

The strategy works as follows:

- Step 1. Select a single patient. Select a failure parameter Oi. Put all the TAs for all variables in the set A0.

- Step 2. Compute the intersection of each member of A0 with Oi. Put the results with SU>SUmin and CF>CFmin in the set A1 and in the basic set B. Set the counter k to 1.

- Step 3. Do:

  o Put in the set Ak+1 the intersection of each of the TAs in Ak with each of the TAs in B and with Oi, such that SU>SUmin and CF>CFmin.

  o Set k=k+1

  while Ak is not empty.

- Step 4. Put A=Ak-1. The rule A → Oi contains the maximum number of basic TAs, i.e. it is the rule with the most complex body. Let us note that the algorithm allows more than one rule to be derived for each Oi.
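The four steps above can be sketched in Python for one patient and one failure parameter. This is a minimal sketch, not the authors' implementation: each TA is represented by the set of sessions in which it holds, a single confidence threshold is used at every level, and all names and data are hypothetical.

```python
def search_rules(tas, failure_sessions, su_min, cf_min):
    """Return the maximal rule bodies (as frozensets of TA names).

    tas maps a TA name (e.g. "VP is I") to the set of session indices in
    which it holds; failure_sessions is the set of sessions failed due
    to the selected failure parameter Oi.
    """
    def holds(body):
        span = set.intersection(*(tas[name] for name in body))
        su = len(span & failure_sessions)
        cf = su / len(span) if span else 0.0
        return su > su_min and cf > cf_min

    # Step 2: single TAs passing both thresholds form A1 and the basic set B.
    level = {frozenset([name]) for name in tas if holds(frozenset([name]))}
    basic = {name for body in level for name in body}

    # Step 3: extend each body with one more basic TA until no body survives.
    maximal = set()
    while level:
        maximal = level
        level = {body | {name} for body in maximal for name in basic
                 if name not in body and holds(body | {name})}

    # Step 4: the last non-empty level holds the most complex bodies.
    return maximal


# Made-up data for one patient.
tas = {
    "VP is I": {1, 2, 3, 4, 5},
    "QB is L": {2, 3, 4, 5, 9},
    "AP is H": {7, 8},
}
failures = {2, 3, 4, 5, 6}
rules = search_rules(tas, failures, su_min=2, cf_min=0.7)
print(rules)
```

Mutually exclusive TAs of the same variable never end up in the same body, because their intersection is empty and therefore has zero support.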

1 Although the search procedure looks for rules with the same head and may be thus interpreted as a supervised learning problem, its final goal is not to derive predictive rules, but to extract a description of the co-occurrences of abstractions and failures. For this reason, we use the term association rules to denote the outputs of the algorithm.


The derived rules with their support can be shown to the users also in graphical form, thus highlighting the temporal location of the sessions in which each rule holds. Moreover, it is also possible to represent the association rules as Complex TAs, in which both the conjunction and the implications are interpreted as a contemporaneous occurrence of TAs and failures.

3.3.4 Search for Predictive Models at the Population Level

The algorithm described in the previous section is useful to derive a description of the single patient behaviour over time. In order to derive a model at the population level, it is possible to resort to a different strategy. Since in the approach described in the previous sections each variable in each dialysis has only one state and one trend abstraction which holds, we may easily derive a matrix M of features, where the columns represent the state and the trend value of each variable and each row represents a dialysis session. The matrix M is completed with the patient number, and the different values of the outcomes of each dialysis session. The matrix M can be used to investigate the relationships between the outcomes and the dynamic behavior of the monitored variables at the population level. However, such an investigation must take into account the fact that the rows of M are not independent from each other. In particular, two kinds of dependencies are present: the data may belong either to one patient or to different patients and the values of consecutive abstracted values may be related to each other since they belong to the same episodes. In order to avoid these problems, we randomly sampled the rows of M to obtain a submatrix Ms, in which the new data are no longer correlated. On the basis of those data it is possible to apply a recently proposed algorithm for subgroup discovery [16], able to describe at the population level the subgroups which show peculiar behaviors.

4 Results

The system described in the previous section is undergoing a clinical evaluation at the Limited Assistance Center located in Mede, Italy, which is clinically managed by the Unit of Nephrology and Dialysis of the Hospital of Vigevano, Italy. The current version of the software contains the basic auditing procedures that allow physicians to assess the percentages of success and to visualize both the time series of the median values and the time series of each variable in each session. Together with the association rules search, several graphical data navigation procedures have been implemented, based on simple plots, histograms and tables.

The extraction of rules at the population level has instead been tested off-line using the Data Mining Server from the Rudjer Boskovic Institute, Croatia [17].

The methods presented in the previous section have been applied to the data coming from 33 patients monitored for 3-8 months, for a total of 2527 dialysis sessions. Each monitored variable was sampled every 1 to 15 minutes. Table 2 shows a synthesis of the dialysis center performance. For each outcome, the number of failures and the percentage of the total number of dialysis sessions are reported.

Note that several failures often occur in the same session. This explains why the overall number of failures is lower than the sum of the numbers of failures of the individual outcomes.


Table 2. Outcomes assessment

                      QB*   ΔT    ΔL    VP*   AP*         ΔW   Overall
# Failures            620   321   169   152   1           60   992
% of total dialysis   23    12    6     6     0 (0.004)   2    39

Search for Associations between TAs and Outcomes. The search strategy described in the previous section was implemented with SUmin equal to the maximum between 4 and 40% of the failed sessions, while the minimum confidence was set to 0.5 for extracting the basic set B and to 0.9 for deriving the association rules. This step of the analysis allowed us to derive 18 association rules on the data of 7 patients, while no rules were found for the other patients. Almost all rules are related to VP* (17 of 18). Two examples of the rules derived for two different patients are shown below:

Patient 1: IF Trend of AP* is Decreasing, State of CF* is LOW and Trend of DP* is Increasing THEN VP* FAILS

Support: 15 sessions, Confidence: 1, Total number of failed sessions: 36

Patient 9: IF State of SP* is HIGH and State of DP* is HIGH THEN VP* FAILS

Support: 30 sessions, Confidence: 1, Total number of failed sessions: 56

The first rule describes a situation in which, for patient 1, there is an increasing trend of both the systemic pressure (DP*) and the hemodialyzer blood circuit pressure (AP*; see footnote 2); these are clinically relevant reasons that justify a value of VP* outside the normal range. The second rule describes the fact that, for patient 9, the patient's hypertension problems cause VP* failures; this fact, although not proved by specific clinical studies, can be justified on the basis of available clinical knowledge.
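The thresholds and rule statistics used in this step can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the session encoding, the function names, and the reading of support as "number of sessions in which both the rule body holds and the outcome fails" are assumptions made for the example.

```python
def min_support(n_failed_sessions, floor=4, fraction=0.4):
    """SUmin as stated in the text: the maximum between 4 and 40% of the failed sessions."""
    return max(floor, fraction * n_failed_sessions)

def rule_stats(sessions, body, outcome):
    """Return (support, confidence) of the rule 'IF body THEN outcome FAILS'.

    support    = sessions where every TA in body holds and the outcome fails
    confidence = support / sessions where every TA in body holds
    """
    body_hits = [s for s in sessions if body <= s["tas"]]
    support = sum(1 for s in body_hits if outcome in s["failures"])
    confidence = support / len(body_hits) if body_hits else 0.0
    return support, confidence

# Hypothetical sessions: each holds a set of TAs and a set of failed outcomes.
decreasing_ap = ("AP", "trend", "Decreasing")
increasing_dp = ("DP", "trend", "Increasing")
sessions = [
    {"tas": {decreasing_ap, increasing_dp}, "failures": {"VP"}},
    {"tas": {decreasing_ap, increasing_dp}, "failures": {"VP"}},
    {"tas": {("AP", "trend", "Steady")}, "failures": set()},
    {"tas": {decreasing_ap}, "failures": set()},
]
body = {decreasing_ap, increasing_dp}
support, confidence = rule_stats(sessions, body, "VP")

# Both reported rules clear the threshold implied by their failed-session counts.
patient_1_ok = 15 >= min_support(36)
patient_9_ok = 30 >= min_support(56)
```

With the figures reported above, 40% of 36 failed sessions is 14.4 and 40% of 56 is 22.4, so the supports of 15 and 30 sessions are consistent with the stated threshold.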

The overall set of extracted rules is currently under evaluation by physicians. We plan to carry out a formal evaluation of the results involving at least three nephrologists.

Search for Subgroups at the Population Level. Thanks to the subgroup discovery algorithm implemented in the Data Mining Server, it was possible to derive subgroups for several target attributes. The results for some causes of failure are reported below:

Failure of ΔW (sensitivity 20%, specificity 100%): State of T* is LOW and State of Hb* is HIGH and State of SP* is LOW

Failure of ΔT (sensitivity 35%, specificity 99.5%): State of WL* is LOW and State of T* is NOT NORMAL and State of OP* is NOT HIGH

Failure of VP* (sensitivity 14%, specificity 100%): State of CD* is LOW and State of DP* is HIGH

2 Let us note that AP always assumes negative values, and a decreasing episode corresponds to an increasing episode of the absolute value.


These rules turn out to be easily explainable on the basis of the available clinical knowledge. ΔL often fails due to hypotension problems (SP* is Low and Hb* is High); ΔT* is highly related to a low WL*, while VP failures are related to hypertension problems, which cause an increase in the pressures of the hemodialyzer hematic circuit. The information extracted is clinically relevant, since it highlights the reasons for the problems that the dialysis center has to face, and therefore it may properly guide therapeutic decisions.
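The sensitivity and specificity quoted for each subgroup description can be computed as in the following minimal Python sketch. The row layout and column names are hypothetical; the sketch only illustrates the standard definitions (sensitivity = failed sessions covered by the description / all failed sessions; specificity = successful sessions not covered / all successful sessions) and is not the Data Mining Server's code.

```python
def sensitivity_specificity(rows, condition, target):
    """Evaluate a subgroup description against a Boolean failure column."""
    tp = fn = tn = fp = 0
    for row in rows:
        covered = condition(row)   # does the subgroup description hold?
        failed = row[target]       # did the outcome fail in this session?
        if failed and covered:
            tp += 1
        elif failed:
            fn += 1
        elif covered:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Hypothetical rows, illustrating the description
# "State of CD* is LOW and State of DP* is HIGH" for failures of VP*.
rows = [
    {"CD_state": "LOW", "DP_state": "HIGH", "fail_VP": True},
    {"CD_state": "NORMAL", "DP_state": "HIGH", "fail_VP": True},
    {"CD_state": "NORMAL", "DP_state": "NORMAL", "fail_VP": False},
    {"CD_state": "LOW", "DP_state": "NORMAL", "fail_VP": False},
]

def description(row):
    return row["CD_state"] == "LOW" and row["DP_state"] == "HIGH"

sens, spec = sensitivity_specificity(rows, description, "fail_VP")
```

A description with low sensitivity and very high specificity, like those reported above, covers only a fraction of the failures but almost never fires on a successful session.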

5 Discussion and Future Developments

The project described in this paper applies a set of Artificial Intelligence techniques to address the needs of a clinical center in terms of data summarization and quality assessment. The auditing system is now in clinical use and it is planned to re-engineer the software for its widespread dissemination. It might be interesting to note that, if we consider the performance of the dialysis center from the beginning of the system use (17/06/2002), the percentage of failures decreased from 47.6% (first two months) to 32.5% (last two months). This result seems to show a potential impact of the use of the auditing system on the performance of the clinical center. Clearly, all the results described in this paper need to be assessed through an evaluation on a larger data set.

From a methodological viewpoint, several issues still have to be investigated. First, the significance and usefulness of the extracted association rules must be formally evaluated. Moreover, it would be interesting to investigate the feasibility of extracting rules whose body is composed of a complex temporal relationship between the TAs, instead of the conjunction of co-occurring TAs. This may lead to rules such as, for example, “IF Trend of AP is decreasing BEFORE State of VP is High THEN VP fails” [14]. Finally, our aim will also be to identify patients at “risk of failures”, and to develop instruments to prevent unsuccessful outcomes. To this end, we are working on a probabilistic model for describing the temporal evolution of the patients in the TAs state space.

Acknowledgements

This work is part of the project “Analysis, Information Visualization and Visual Query in Databases for Clinical Monitoring”, funded by the Italian Ministry of Education. We gratefully acknowledge Maria Grazia Albanesi, Daniele Pennisi, Andrea Pedotti, Antonio Panara and Davide Lazzari for their methodological and technical contributions. We also thank the team of the Unit of Nephrology and Dialysis of the Mede and Vigevano hospitals.

References

1. Stefanelli, M.: The socio-organizational age of artificial intelligence in medicine. Artif Intell Med. 23 (2001) 25-47.

2. Abidi, S.S.: Knowledge management in healthcare: towards ’knowledge-driven’ decision-support services. Int J Med Inf. 63 (2001) 5-18.

3. Zoccali, C.: Medical knowledge, quality of life and accreditation of quality in health care. The perspective of the clinical nephrologist. Int J Artif Organs. 11 (1998) 717-20.

4. Roddick, J.F., Spiliopoulou, M.: A survey of temporal knowledge discovery paradigms and methods. IEEE T. Knowl. Data En. 14 (2002) 750-766.

5. Registro Italiano di Dialisi e Trapianto, http://www.sin-italia.org

6. McFarlane, P.A., Mendelssohn, D.C.: A call to arms: economic barriers to optimal dialysis care. Perit Dial Int 20 (2000) 7-12.

7. Moncrief, J.W.: Telemedicine in the care of the end-stage renal disease patients. Adv Ren Replace Ther 5 (1998) 286-291.

8. Ronco, C., Brendolan, A., Bellomo, R.: Online monitoring in continuous renal replacement therapies. Kidney Int 56 (1999) S8-14.

9. Bellazzi, R., Magni, P., Bellazzi, R.: Improving dialysis services through information technology: from telemedicine to data mining. Medinfo. 10(Pt 1) (2001) 795-9.

10. Shahar, Y.: A Framework for Knowledge-Based Temporal Abstraction. Art. Int. 90 (1997) 79-133.

11. Bellazzi, R., Larizza, C., Riva, A.: Temporal Abstractions for Interpreting Diabetic patients monitoring data. Intelligent Data Analysis 2 (1998) 97-122.

12. Allen, J.F.: Towards a general theory of action and time. Artificial Intelligence 23 (1984) 123-154.

13. Cohen, A., Daubechies, I., Jawert, B., Vial, P.: Biorthogonal basis of compactly supported wavelets. Comm. Pure Appl. Math. 45 (1992) 485-560.

14. Höppner, F.: Discovery of Temporal Patterns - Learning Rules about the Qualitative Behaviour of Time Series. Proc. of the 5th PKDD, LNAI 2168 (2001) 192-203.

15. Witten, I., Frank, E.: Data Mining, Academic Press, 2000.

16. Gamberger, D., Lavrac, N.: Expert-guided subgroup discovery: Methodology and Application. J Artif Intell Res. 17 (2002) 501-527.

17. Gamberger, D., Šmuc, T.: Data Mining Server [http://dms.irb.hr/]. Zagreb, Croatia: Rudjer Boskovic Institute, Laboratory for Information Systems (2001).


Idan: A Distributed Temporal-Abstraction Mediator for Medical Databases

David Boaz and Yuval Shahar

Department of Information Systems Engineering Ben Gurion University, Beer Sheva 84105, Israel {dboaz,yshahar}@bgumail.bgu.ac.il

Abstract. Many clinical domains involve the collection of different types and substantial numbers of data over time. This is especially true when monitoring chronic patients. It is highly desirable to assist human users (e.g., clinicians, researchers), or decision support applications (e.g., diagnosis, therapy, quality assessment), who need to interpret large amounts of time-oriented data by providing a useful method for querying not only raw data, but also its abstractions. A temporal-abstraction database mediator is a modular approach designed for answering abstract, time-oriented queries. Our approach focuses on the integration of multiple time-oriented data sources, domain-specific knowledge sources, and computation services. The mediator mediates abstract time-oriented queries from any application to the appropriate distributed components that can answer these queries. We describe a highly modular, distributed implementation of the temporal database mediator architecture in the medical domain, and provide insights regarding the effective implementation and application of such an architecture.

1 Introduction: The Need for Integration of Data and Knowledge in Clinical Practice

Most clinical tasks require measurement and capture of numerous patient data of multiple types, often on electronic media. Making diagnostic or therapeutic decisions requires context-sensitive interpretation of these data. Most stored data include a time stamp in which the particular datum was valid. Temporal trends and patterns in clinical data add significant insights to static analysis. Thus, it is desirable to automatically create abstractions (short, informative, and context-sensitive interpretations) of time-oriented clinical data, and to be able to answer queries about such abstractions. Providing these abilities would benefit both a human physician and an automated decision-support tool (e.g., patient management, quality assessment and clinical research). To be both meaningful and concise, a summary cannot use only time points, such as dates when data were collected; it must be able to also aggregate significant features over intervals of time.

An approach that fulfills these requirements must intelligently integrate knowledge sources, data sources and computational services. Thus, an appropriate approach must fulfill the following desiderata: The architecture must be modular, enabling independent modification and updates of its components. It must support knowledge and data sharing. As much as possible, all components should use standard medical vocabularies. Data, knowledge and computational services might be integrated in multiple configurations. Therefore, the architecture must be distributed and, preferably, accessible over the Web, to answer the needs of care providers and clinical researchers. The computational method chosen must be able to exploit domain-specific knowledge and must be able to support several modes of interaction by various applications that use its services. In particular, an interactive computational mode is often highly desirable, in addition to a standard batch (off-line) mode.

2 Background: Temporal Mediation

Temporal reasoning involves generation of conclusions from time-oriented data, possibly based on the domain-specific knowledge. Temporal-data maintenance involves storage, query and retrieval of time-oriented data.

To support clinical needs, both tasks must be performed, often at the same time. Thus, it is reasonable to create a service that can mediate time-oriented queries from decision support applications to patient databases. A general approach, called a temporal-database mediator, was previously suggested in an early work describing the Tzolkin system [1]. This approach encapsulates the temporal reasoning and the temporal maintenance capabilities in a reusable software component that can answer raw or abstract, time-oriented queries. Such a system is called a mediator because it serves as an intermediate layer of processing between client applications and databases [2]. As a result, the mediator is tied neither to a particular application nor to a particular database [3]. Furthermore, the temporal reasoning method encapsulates the task-specific reasoning algorithm that uses the domain-specific knowledge (thus, Tzolkin is really a temporal-abstraction mediator). Reusing the mediator in a new application involves modifications of only the domain-specific data and knowledge.

There are multiple advantages to the use of a task-specific, domain-independent temporal-abstraction mediator. However, to fully exploit all theoretical advantages, the mediator needs to be fully modular and reusable with respect to the distributed data, knowledge, and computational services, and use as much as possible standard controlled medical vocabularies to support sharing of data and knowledge. All of these extensions have been implemented in the Idan architecture introduced here.

3 Idan1: A Modular Distributed Temporal-Abstraction Mediator

Idan is an architecture that fully implements the temporal-abstraction mediation approach. Idan integrates a set of (1) time-oriented data sources; (2) domain-specific knowledge sources; (3) vocabulary servers; (4) a computational process specific to the task of abstraction of time-oriented data using domain-specific knowledge; and (5) a controller that integrates all services (Fig. 1).

1 Idan is the Hebrew word for era, or a long time period.


Fig. 1. The Idan architecture. User applications submit time-oriented queries to the temporal mediator. The temporal mediator, using data from the appropriate local data source and temporal-abstraction knowledge from the appropriate domain-specific knowledge base, answers these queries. Plain arrows indicate the “uses” relation. Z-like arrows denote remote connections. Dotted arrows denote control links. KB = knowledge base, DB = database

The default computational method Idan uses is the Knowledge-Based Temporal Abstraction (KBTA) method [4, 5, 6, 7]. Domain-specific knowledge is represented in the knowledge base, using the KBTA ontology. Ground terms in the knowledge bases come from a set of standard domain-specific vocabularies. Domain experts maintain the temporal-abstraction knowledge using a graphical knowledge-acquisition tool, a distributed version of a tool we previously evaluated [8]. Each knowledge base stores a set of knowledge objects and is accessible through a knowledge service. Clinical data are accessed through distributed Data Access Modules (DAMs). Each DAM encapsulates the patient database and exposes a simple data-query language that is used mainly for local, raw-data queries. The simple data-query language uses standard terms, which enable querying a DAM without knowing how a term is actually stored in the database. The DAM accesses a local mapping table, maintained by the local database owner, which lists mappings between local terms and measurement units, and standard medical terms and units. In addition, the local data-source schema is mapped into a virtual view of a time-oriented database.

To answer an abstract query from a user application, a top-level module, the controller, manages the query processing flow between the application and the various services. Each application selects a configuration of a computation module, knowledge services, and DAMs. (Note that selecting a computational module, such as the KBTA, constrains the knowledge bases appropriate for it.)

[Fig. 1 diagram labels: temporal mediator, controller, abstraction service, knowledge service, KB, maintenance & validation, search & retrieval, knowledge-acquisition tool, domain expert, domain vocabulary server, user application, local data source site, Data Access Module, term mapper, mapping table, virtual schema adaptor, DB, local data source owner.]


3.1 The Knowledge-Based Temporal Abstraction Method

We define the temporal-abstraction (TA) task as follows [4-7]: The input includes a set of time-stamped (clinical) parameters (e.g., blood glucose values), events (e.g., insulin injections), and abstraction goals (e.g., therapy of patients who have insulin-dependent diabetes). The output includes a set of interval-based, context-specific parameters at the same or at a higher level of abstraction and their respective values (e.g., "5 weeks of grade III bone-marrow toxicity during therapy with AZT").

Fig. 2 shows an example of input for the TA task, and the resulting output, in the case of a patient who is being treated by a clinical protocol for treatment of chronic graft-versus-host disease (GVHD), a complication of bone-marrow transplantation.
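To make the input and output of the TA task concrete, the following Python sketch turns time-stamped values into interval-based state abstractions. It is a deliberately simplified illustration, not the KBTA method: it ignores events, contexts, and the knowledge-based interpolation that KBTA provides, and the thresholds, the `max_gap` parameter, and all names are hypothetical, non-clinical values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    start: int   # first time point (days) covered by the abstraction
    end: int     # last time point (days) covered by the abstraction
    value: str   # abstract state holding over [start, end]

def classify(value, low, high):
    """Map a raw value to a qualitative state using two cut-offs."""
    if value < low:
        return "LOW"
    if value > high:
        return "HIGH"
    return "NORMAL"

def abstract_states(samples, low, high, max_gap):
    """Merge consecutive samples with the same state into intervals.

    samples: list of (time, value) pairs sorted by time.
    Two samples are joined only if they are at most max_gap apart.
    """
    intervals = []
    for t, v in samples:
        state = classify(v, low, high)
        last = intervals[-1] if intervals else None
        if last and last.value == state and t - last.end <= max_gap:
            last.end = t
        else:
            intervals.append(Interval(t, t, state))
    return intervals

# Hypothetical platelet counts (day, count in thousands).
platelets = [(0, 200), (3, 180), (6, 90), (9, 80), (30, 85)]
result = abstract_states(platelets, low=100, high=400, max_gap=7)
# result holds three intervals: NORMAL over days 0-3, LOW over days 6-9, LOW at day 30
```

The last sample is not merged with the preceding LOW interval because the 21-day gap exceeds `max_gap`; in KBTA, this kind of decision is driven by domain-specific temporal knowledge rather than by a fixed constant.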

3.2 The Standard Medical Vocabularies Service

We have created a vocabulary service that serves as a search engine for a set of distributed, web-based standard medical vocabulary servers that we have implemented. The vocabulary servers use the ICD-9-CM code for diagnoses, the most common diagnostic coding system; the CPT code for procedures, the most common coding system for reporting (used by most health-insurance companies); and the LOINC vocabulary for laboratory tests and physical signs (selected as the standard for laboratory observations for HL7 and HIPAA). We are currently adding a drug-ontology server.

Using a standard vocabulary is a key concept in our framework; it enables us to share knowledge bases that are not specific to a particular set of data-source terms, but can be applied to any clinical database that stores similar domain-specific data types. The vocabulary server is used by local data-source owners to associate local data-source concepts with standard medical concepts, and also by medical experts to associate clinical terms in the knowledge base, with standardized medical terms.

[Fig. 2 plot: platelet counts and granulocyte counts versus time (days), with intervals above the data for BMT, the PAZ protocol, the potential CGVHD context, and the bone-marrow-toxicity grades B[0], B[1], B[2], B[3], B[0], B[0].]

Fig. 2. Temporal abstraction in a medical domain. Raw data are plotted over time at the bottom. External events and the abstractions computed from the data are plotted as intervals above the data. External events (medical interventions), context intervals, and abstraction (derived concept) intervals are each drawn with their own symbol; • = platelet counts; Δ = granulocyte counts; BMT = bone-marrow transplantation (an external event); PAZ = a therapy protocol for treating chronic graft-versus-host disease (CGVHD), a complication of BMT; B[n] = bone-marrow–toxicity grade n, an abstraction of platelet and granulocyte counts


3.3 The Data Access Service

Each data-access module (DAM) accesses a local clinical database, referred to as a data source, since its structure need not be known. The DAM deals with three problems: (1) the internal schema of the data source might be unknown to the querying applications; (2) the local data-source vocabulary might be unknown to these applications and/or non-standard; (3) the local measurement units might be unknown or non-standard. For example, a local database might store hemoglobin values in a particular table, call them “HGB”, and use a non-standard unit to store their values.

These problems are solved by the following methods: (1) We enable every local data-source owner (who best knows her database) to implement a Virtual Schema Adaptor that maps the local schema to a standardized time-oriented data structure that we have defined (writing the adaptor requires DBA technical skills). The adaptor returns, for a given patient identifier and local vocabulary term, a set of raw data. (2) We have developed a tool that enables local data-source owners to create a term-mapping table, which maps each local vocabulary term into a standard vocabulary term in one of our vocabularies. (3) The mapping tool stores, in the term-mapping table, the units in which the local term is measured and their functional transformation to standard units, if needed.

The DAM is responsible for processing raw data queries, using the term-mapping table and the virtual schema adaptor. Fig. 3 outlines the details of the data and control flows involved in processing a query in a data source.

Fig. 3. A conceptual schema of the components of a local data source and their functionalities: (1) The DAM receives a data request that carries the patient identifier Patient, the requested standard term StdTerm, and the requested output unit OutUnit. (2, 3) The DAM selects from the term-mapping table the term and unit, LocalTerm and LocalUnit, used at this site to represent and measure the concept. (4) The DAM sends a request to the virtual schema adaptor to select all of the patient’s data of type LocalTerm. (5) The Data are returned to the DAM; if LocalUnit and OutUnit are different, then a value transformation is necessary. (6, 7) The DAM gets from the transformation-function library the appropriate transformation function, TransFunc. (8) The DAM applies TransFunc to the Data. (9) The result is returned, using the originally specified term and unit. Dotted ellipse = modules under the local data owner’s responsibility
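The nine steps of Fig. 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Idan's implementation (which the paper reports was written for .Net): the class and attribute names, the dictionary-based term-mapping table, and the example standard term "hemoglobin" with a g/L to g/dL conversion are all hypothetical.

```python
class DataAccessModule:
    """Hypothetical DAM: resolves a standard term, fetches local data, converts units."""

    def __init__(self, term_map, adaptor, transforms):
        self.term_map = term_map      # StdTerm -> (LocalTerm, LocalUnit)
        self.adaptor = adaptor        # callable: (patient, local_term) -> [(time, value)]
        self.transforms = transforms  # (LocalUnit, OutUnit) -> conversion function

    def query(self, patient, std_term, out_unit):       # step 1: request arrives
        local_term, local_unit = self.term_map[std_term]  # steps 2-3: term-mapping lookup
        data = self.adaptor(patient, local_term)          # steps 4-5: virtual schema adaptor
        if local_unit != out_unit:
            func = self.transforms[(local_unit, out_unit)]  # steps 6-7: get TransFunc
            data = [(t, func(v)) for t, v in data]          # step 8: apply it
        return data                                       # step 9: result in requested unit

# Hypothetical local site: hemoglobin is stored as "HGB" in g/L.
local_db = {("p1", "HGB"): [(1, 130.0), (2, 125.0)]}

def virtual_schema_adaptor(patient, local_term):
    return local_db.get((patient, local_term), [])

dam = DataAccessModule(
    term_map={"hemoglobin": ("HGB", "g/L")},
    adaptor=virtual_schema_adaptor,
    transforms={("g/L", "g/dL"): lambda v: v / 10},
)
converted = dam.query("p1", "hemoglobin", "g/dL")
```

Keeping the adaptor and the mapping table as injected dependencies mirrors the division of responsibility described above: the local data owner supplies both, while the querying application only ever sees standard terms and units.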



Fig. 4. The concept dependency tree of post-BMT myelotoxicity, part of the oncology-domain knowledge base. Directed arcs represent dependency relations

3.4 The Knowledge Service and the Knowledge-Acquisition Tool

The knowledge-base service supports acquisition, validation, and retrieval of the knowledge needed to support abstract time-oriented data. Each knowledge service accesses a particular knowledge base and supports the following capabilities:

• Maintenance of knowledge objects and classes. This module enables domain experts to modify knowledge objects (e.g., definition and properties of bone-marrow toxicity in a specific context) and to easily create additional knowledge frames that reuse (by inheritance) most of the knowledge in the frame that subsumes them. The module also performs semantic validation of the knowledge to ensure that the knowledge is consistent and complete.

• Application of specialized knowledge-base operators, such as getDependencyTree(Concept), which returns the concept dependency tree of Concept (Fig. 4).

• Performance of search and retrieval. This operator enables searching for existing knowledge objects by their attribute values.

We have also implemented a graphical Knowledge-Acquisition Tool (KAT). The KAT facilitates the acquisition and maintenance of a temporal-abstraction knowledge base, such as defining several types of basic functions that derive an abstract concept from a set of intermediate ones, or defining, in declarative fashion, complex temporal patterns in a specialized constraint-based language [9]. The tool is distributed and web-based, thus facilitating sharing and reuse of domain-specific knowledge, and displays results of search by the knowledge service.

3.5 The Temporal-Abstraction Module: Alma2

Alma is Idan’s default computational module responsible for all temporal reasoning tasks. Alma uses the KBTA method [4, 6, 7] as the reasoning method, augmented by a mechanism that implements the CAPSUL temporal-pattern-matching constraint-based language [9]. Alma is an object that contains two slots, or properties:

1. Current-Knowledge, which contains a sub-set of the knowledge base, asserted by the controller. Alma can only access this knowledge.

2 Alma is an Aramaic word meaning “hence”. It is typically used as part of a logical argument.

[Fig. 4 diagram nodes: myelotoxicity.post_bmt; platelet_state.post_bmt and wbc_state.post_bmt; platelet and wbc; post_bmt context; bmt.]


2. Fact-Base, which contains a set of Facts about primitive concepts in the database, asserted by the controller, and a set of Facts about abstract concepts, generated by Alma at runtime. A Fact is a <patient, concept, time-interval, value> tuple, and denotes the value of concept during time-interval for patient. The Facts in the Fact-Base are organized in a special data structure that sorts all Facts by temporal order. Given a fact, we can easily find all facts occurring after it, or the next (temporally) fact of the same type. Such a data structure is crucial for inference operations such as temporal interpolation [6].
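One way to obtain the temporal-order lookups described for the Fact-Base is to keep, per patient and concept, a list of facts sorted by start time and to locate neighbours by binary search. The Python sketch below illustrates that idea only; Alma itself is reported to be written in Prolog, and the class name `FactBase`, its methods, and the flattened `Fact` layout (interval split into `start` and `end`) are assumptions made for the example.

```python
import bisect
from collections import defaultdict, namedtuple

# A Fact as in the text: <patient, concept, time-interval, value>,
# with the interval stored as start and end.
Fact = namedtuple("Fact", "patient concept start end value")

class FactBase:
    """Hypothetical fact store keeping facts of each type in temporal order."""

    def __init__(self):
        self._facts = defaultdict(list)  # (patient, concept) -> facts sorted by start

    def assert_fact(self, fact):
        facts = self._facts[(fact.patient, fact.concept)]
        starts = [f.start for f in facts]
        facts.insert(bisect.bisect_right(starts, fact.start), fact)

    def facts_after(self, fact):
        """All facts of the same type that start after the given fact."""
        facts = self._facts[(fact.patient, fact.concept)]
        starts = [f.start for f in facts]
        return facts[bisect.bisect_right(starts, fact.start):]

    def next_fact(self, fact):
        """The temporally next fact of the same type, or None."""
        later = self.facts_after(fact)
        return later[0] if later else None

fb = FactBase()
a = Fact(123, "wbc", 10, 10, 3.1)
b = Fact(123, "wbc", 2, 2, 4.0)
c = Fact(123, "wbc", 20, 20, 2.5)
for fact in (a, b, c):  # asserted out of temporal order on purpose
    fb.assert_fact(fact)
```

Rebuilding the list of start times on every call keeps the sketch short but costs linear time; a production structure would maintain that index incrementally.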

Alma contains the following functional behaviors, or methods:

1. hasKnowledgeAbout?(Concept): returns true if knowledge about Concept was already asserted into the Current-Knowledge slot.

2. assertKnowledge!(Concepts): adds a set of knowledge Concepts to the Current-Knowledge slot.

3. assertData!(Facts): adds a set of Facts to the Fact-Base slot.

4. getPrimitivesNeeded(Patient, Concept): accepts a patient and a concept, and returns a list of all the raw-data types that are currently needed, for this patient, to compute Concept. This function is used by the controller for minimizing the number of queries to the data source. If a primitive concept (raw-data type) needed to compute a specific abstract concept for a given patient already exists in the Fact-Base, it is redundant to retrieve it again from the data source. Note that Alma can determine what data are needed by accessing the Current-Knowledge slot (in which, typically, the knowledge about the concept, including its derivation tree, already exists). Thus, to compute one abstraction derived from (among other concepts) hemoglobin values, after computing another such abstraction, hemoglobin values need not be retrieved again.

5. select(Patient, Concept, Constraints): returns a set of Facts regarding Patient and Concept, where time-interval and value satisfy Constraints. Select first generates all the facts regarding Patient and Concept and stores them in the Fact-Base. The generation is performed using a goal-directed, recursive top-down, then bottom-up evaluation. First, the derivation tree is descended until concepts that exist in the Fact-Base are reached (e.g., raw data asserted by the controller). Then, the derivation tree is ascended while computing each abstract concept from its deriving concepts, which by that time must exist in the Fact-Base (having been computed along the way). Finally, the Fact-Base is filtered for the facts that satisfy Constraints. Note that the Select algorithm uses the Fact-Base for caching to optimize performance, both within the same query and over several queries, since the facts remain in the Fact-Base. Caching saves considerable time during the reasoning process, even though the Select function initially computes all facts, some of which are potentially unneeded, for the following reasons: (1) applications are often interested in retrieving the entire set of facts of a specific concept, and (2) consecutive queries by an application are often semantically related (e.g., require abstract concepts derived from concepts already requested by a previous query). The automatic caching provided by the Fact-Base caters for this common situation.

6. hold?(Patient, Concept, Constraints): a Boolean predicate that returns true if there exists (or can be derived) at least one Fact regarding Patient and Concept, where time-interval and value satisfy Constraints.
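The goal-directed evaluation and caching behind getPrimitivesNeeded and select can be illustrated with a small Python sketch. It is a strong simplification and not Alma's algorithm: each concept has a single value instead of a set of interval-based Facts, contexts such as post_bmt are omitted, and the class `MiniAlma`, the derivation functions, and the numeric thresholds are hypothetical.

```python
class MiniAlma:
    """Hypothetical, simplified reasoner: one value per (patient, concept)."""

    def __init__(self, derivations):
        # derivations: abstract concept -> (deriving concepts, function of their values)
        self.derivations = derivations
        self.fact_base = {}   # (patient, concept) -> value; doubles as the cache
        self.computed = []    # trace of abstractions actually computed

    def assert_data(self, patient, concept, value):
        self.fact_base[(patient, concept)] = value

    def primitives_needed(self, patient, concept):
        """Raw concepts still missing from the fact base for this patient."""
        if (patient, concept) in self.fact_base:
            return set()
        if concept not in self.derivations:
            return {concept}
        parents, _ = self.derivations[concept]
        needed = set()
        for parent in parents:
            needed |= self.primitives_needed(patient, parent)
        return needed

    def _evaluate(self, patient, concept):
        """Descend until cached facts are reached, then compute on the way back up."""
        key = (patient, concept)
        if key not in self.fact_base:
            parents, func = self.derivations[concept]
            values = [self._evaluate(patient, p) for p in parents]
            self.fact_base[key] = func(*values)
            self.computed.append(concept)
        return self.fact_base[key]

    def select(self, patient, concept, constraint):
        value = self._evaluate(patient, concept)
        return [(patient, concept, value)] if constraint(value) else []

# Hypothetical, non-clinical derivation rules.
derivations = {
    "platelet_state": (["platelet"], lambda v: "low" if v < 50 else "normal"),
    "wbc_state": (["wbc"], lambda v: "very_low" if v < 1.0 else "normal"),
    "myelotoxicity": (
        ["platelet_state", "wbc_state"],
        lambda p, w: "grade_3" if p == "low" and w == "very_low" else "grade_0",
    ),
}
alma = MiniAlma(derivations)
needed_before = alma.primitives_needed(123, "myelotoxicity")
alma.assert_data(123, "platelet", 30)
alma.assert_data(123, "wbc", 0.5)
result = alma.select(123, "myelotoxicity", lambda v: v == "grade_3")
```

A later call such as `alma.select(123, "wbc_state", ...)` finds wbc_state already in the fact base and computes nothing new, which is the caching effect described in item 5.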


3.6 The Controller

The Controller is the top-level module that accepts queries from client applications, using a query language which is a subset of the pattern-matching language used internally within the Alma module, and coordinates the interaction between the three Idan core services (computation, knowledge, and data). The Controller is responsible for calling each service in the appropriate order, for ensuring that each service has the necessary data or knowledge to complete its task, and for returning the results of a query to the requesting client application.

When a client application connects to the Idan mediator through the controller, a session object is created for this client. The session stores the specifications of the computational, knowledge, and data services that the client wants to work with. Each session creates an instance of Alma. During the dialog, the Current-Knowledge and Fact-Base slots are populated in that instance. Holding different sessions for different clients enables Idan to create separate workspaces for each application.

The most important controller operator is Fetch, which answers a client query. Consider a typical scenario. (1) A client starts a new session with the Controller, using knowledge source KS and data source DS. The Controller creates an instance of Alma. (2) The client sends a query: fetch(patient=123, concept=myelotoxicity, constraints={value=grade_3}). To process this query, the Controller proceeds as follows. (2.1) It asks the new Alma instance whether it has knowledge about myelotoxicity, using hasKnowledgeAbout?(myelotoxicity). Since the session is new, Alma returns No. (2.2) The Controller retrieves the myelotoxicity dependency tree (see Fig. 4) from KS, using getDependencyTree(myelotoxicity), and asserts it into Alma, using assertKnowledge!. (2.3) The Controller asks Alma which raw concepts are needed for processing myelotoxicity, using getPrimitivesNeeded(123, myelotoxicity), which returns {platelet, wbc, bmt}. (2.4) The data about platelet, wbc and bmt of patient 123 are retrieved from DS (using their standardized terms and their required units) and are asserted into Alma, using the assertData! method. (2.5) Alma processes the query select(123, myelotoxicity, {value=grade_3}), and the result is then returned to the client. (3) The client sends another fetch query: fetch(123, wbc_state, {value=very_low and duration > 2weeks}). The Controller processes the query in the same manner, but since the knowledge about wbc_state was already retrieved and asserted in step 2.2 (as part of the myelotoxicity dependency tree), and since the wbc data of patient 123 were already asserted, Alma requests no knowledge or data and computes select(123, wbc_state, {value=very_low and duration > 2weeks}).
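The Fetch scenario above can be traced with the following Python sketch. It is an illustration under stated assumptions rather than Idan's controller: the method names are Python-safe spellings of those in the text, `StubAlma` is a stand-in that only tracks asserted knowledge and data, the knowledge and data sources are plain functions, and the edges of the dependency tree are guessed from the node labels of Fig. 4.

```python
class StubAlma:
    """Stand-in for Alma that records knowledge and data but does no reasoning."""

    def __init__(self):
        self.knowledge = {}  # concept -> list of deriving concepts
        self.facts = {}      # (patient, concept) -> raw data

    def has_knowledge_about(self, concept):
        return concept in self.knowledge

    def assert_knowledge(self, tree):
        self.knowledge.update(tree)

    def assert_data(self, patient, concept, data):
        self.facts[(patient, concept)] = data

    def get_primitives_needed(self, patient, concept):
        parents = self.knowledge.get(concept, [])
        if not parents:  # primitive (raw) concept
            return set() if (patient, concept) in self.facts else {concept}
        needed = set()
        for parent in parents:
            needed |= self.get_primitives_needed(patient, parent)
        return needed

    def select(self, patient, concept, constraints):
        return f"select({patient}, {concept}, {constraints})"  # placeholder answer


class Controller:
    def __init__(self, knowledge_source, data_source):
        self.ks, self.ds = knowledge_source, data_source
        self.alma = StubAlma()  # one Alma instance per session
        self.log = []           # external requests issued so far

    def fetch(self, patient, concept, constraints):
        if not self.alma.has_knowledge_about(concept):            # step 2.1
            self.alma.assert_knowledge(self.ks(concept))          # step 2.2
            self.log.append(("knowledge", concept))
        needed = self.alma.get_primitives_needed(patient, concept)  # step 2.3
        for primitive in sorted(needed):
            self.alma.assert_data(patient, primitive, self.ds(patient, primitive))  # step 2.4
            self.log.append(("data", primitive))
        return self.alma.select(patient, concept, constraints)    # step 2.5


# Assumed dependency tree (edges inferred from the node labels of Fig. 4).
TREE = {
    "myelotoxicity": ["platelet_state", "wbc_state"],
    "platelet_state": ["platelet", "post_bmt"],
    "wbc_state": ["wbc", "post_bmt"],
    "post_bmt": ["bmt"],
    "platelet": [], "wbc": [], "bmt": [],
}

def knowledge_source(concept):
    return TREE

def data_source(patient, term):
    return []  # placeholder for a DAM query

controller = Controller(knowledge_source, data_source)
controller.fetch(123, "myelotoxicity", "value=grade_3")
first_log = list(controller.log)
controller.fetch(123, "wbc_state", "value=very_low and duration > 2weeks")
```

After the first call the log contains one knowledge request and three data requests (bmt, platelet, wbc); the second call adds nothing to it, matching step (3) of the scenario.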

3.7 Idan Implementation Notes

Idan was implemented using several environments. The medical vocabularies (CPT, ICD and LOINC) are stored in an MSSQL server. Alma was implemented in SICStus Prolog. The rest of the services, including the controller, were implemented in the Microsoft .Net environment, written in the C# programming language. The services interact with the controller using web services (network services). All communication is performed using XML documents. The knowledge acquisition tool interacts with the knowledge service using the .Net Remoting technology.


4 Discussion and Future Work

Idan is used by multiple applications in our projects. KNAVE-II, a distributed re-implementation of KNAVE [14], supports interactive knowledge-based visual exploration of time-oriented clinical databases by sending queries to the Idan controller, and displaying the resulting data and knowledge. DeGeL [11] is a distributed framework that supports clinical-guideline specification, retrieval, application, and quality assessment, by sending runtime queries about the current patient to the controller.

In the Tzolkin architecture [1], the temporal-reasoning and temporal-maintenance tasks were performed by different modules. The Resume module generated all abstractions and wrote them into a database; then, the Chronus module applied the query’s temporal constraints to the database (which now also included the abstractions), to generate the answer. A similar relationship exists between the RASTA [13] and Chronus-II [14] systems. The Idan architecture is more uniform, because a subset of the temporal- and value-constraints language used in the Alma reasoning process is used in the controller’s query interface. Thus, Alma can also process the query’s temporal constraints. Unifying both tasks avoids reimplementation of the constraint-satisfaction process and the use of a temporary storage space. Combined with the Alma goal-directed computation, Idan supports highly interactive applications, such as KNAVE-II, including even a capability for performing a “What-If?” dynamic sensitivity analysis that enables propagation of data and knowledge modifications.

Tzolkin was implemented in a fixed architecture (although the potential for future extension into distributed data and knowledge sources was noted). Idan is fully modular with respect to all services (data, knowledge and computation). Thus, it is very easy to add new data and knowledge sources, and even to replace the computational module, as long as a similar interface is preserved. Furthermore, the efficient, focused, goal-driven mode incorporated in the controller and in the Alma computational module, together with the full modularity, makes the Idan architecture highly scalable.

In the short term we intend to enhance the Idan mediator in several aspects:

• The mediator will support aggregate abstract queries that refer to a large set of patient records. We will explore several methods for reducing the response time, such as parallelizing the computations performed on different patients.

• We will enhance the support for “What If” dynamic sensitivity-analysis queries by better management of hypothetical modifications of data or knowledge.

• The mediator currently supports a limited form of explanation of given abstractions. The explanation consists of the knowledge and data types used. We intend to add more explicit data dependencies (as in a truth-maintenance system) to directly provide the data instances from which each fact was abstracted.

• We will explore the use of graphical metaphors, both for display of existing periodic and linear patterns, as needed in the KNAVE-II system, and for specification of new patterns, either in the knowledge acquisition tool, or as part of an interactive application that enables the user to formulate a new query. We intend to gain insights from previous research done by Chittaro and Combi [12].

• We intend to enhance the mediator query language to be as fully expressive as the knowledge-definition language used to define patterns in the knowledge base, which are then computed by the Alma temporal-abstraction module.


Acknowledgments

This research was supported in part by NIH award No. LM-06806. We thank Samson Tu and Martin O’Connor for useful discussions regarding the Chronus-II and RASTA systems, and Drs. Mary Goldstein, Susana Martins, Lawrence Basso, Herbert Kaizer, Aneel Advani, and Eitan Lunenfeld, for assessing the Idan and KNAVE-II systems.

References

1. Nguyen J.H., Shahar Y., Tu S.W., Das A.K., and Musen M.A. (1999). Integration of Temporal Reasoning and Temporal-Data Maintenance Into A Reusable Database Mediator to Answer Abstract, Time-Oriented Queries: The Tzolkin System. Journal of Intelligent Information Systems 13(1/2): 121–145.

2. Wiederhold G. (1992). Mediators in the architecture of future information systems. IEEE Computer 25: 38–50.

3. Wiederhold G. and Genesereth M. (1997). The Conceptual Basis of Mediation Services. IEEE Expert 12(5): 38–47.

4. Shahar Y. (1997). A framework for knowledge-based temporal abstraction. Artificial Intelligence 90(1–2): 79–133.

5. Shahar Y. and Musen M.A. (1996). Knowledge-based temporal abstraction in clinical domains. Artificial Intelligence in Medicine 8(3): 267–298.

6. Shahar Y. (1999). Knowledge-based temporal interpolation. Journal of Experimental and Theoretical Artificial Intelligence 11: 123–144.

7. Shahar Y. (1998). Dynamic temporal interpretation contexts for temporal abstraction. Annals of Mathematics and Artificial Intelligence 22(1–2): 159–92.

8. Shahar Y., Chen H., Stites D.P., Basso L.V., Kaizer H., Wilson D.M., and Musen M.A. (1999). Semiautomated acquisition of clinical temporal-abstraction knowledge. Journal of the American Medical Informatics Association 6: 494–511.

9. Chakravarty S. and Shahar Y. (2000). CAPSUL: A Constraint-Based Specification of Repeating Patterns in Time-Oriented Data. Annals of Mathematics and Artificial Intelligence (AMAI) 30: 3–22.

10. Shahar Y. and Cheng C. (2000). Model-Based Visualization of Temporal Abstractions. Computational Intelligence 16(2): 279–306.

11. Shahar Y., Young O., Shalom E., Mayaffit A., Moskovitch R., Hessing A., and Galperin M. (2003). DEGEL: A Hybrid, multiple-ontology framework for specification and retrieval of clinical guidelines. Proceedings of the Ninth Conference on Artificial Intelligence in Medicine Europe (AIME-03), Protaras, Cyprus.

12. Chittaro L. and Combi C. (2001). Representation of Temporal Intervals and Relations: Information Visualization Aspects and their Evaluation. C. Bettini and A. Montanari (eds): Proceedings of the Eighth International Symposium on Temporal Representation and Reasoning (TIME 2001). Los Alamitos, IEEE Computer Society Press, pp. 13–20.

13. O’Connor M.J., Grosso W.E., Tu S.W., and Musen M.A. (2001). RASTA: A Distributed Temporal Abstraction System to Facilitate Knowledge-Driven Monitoring of Clinical Databases. Proceedings of MEDINFO-2001, the Tenth World Congress on Medical Informatics, pp. 508–512, London, UK.

14. O’Connor M., Tu S.W., and Musen M.A. (2002). The Chronus II Temporal Database Mediator. Proceedings of the 2002 American Medical Informatics Fall Symposium (AMIA-2002), pp. 567–571, San Antonio, TX.


Prognosis of Approaching Infectious Diseases

Rainer Schmidt and Lothar Gierl

Universität Rostock, Institut für Medizinische Informatik und Biometrie, Rembrandtstr. 16 / 17, D-18055 Rostock, Germany

{rainer.schmidt,lothar.gierl}@medizin.uni-rostock.de

Abstract. A few years ago, we developed an early warning system concerning multiparametric kidney function courses. As methods we applied Temporal Abstraction and Case-based Reasoning. In our current project we apply very similar ideas. The goal of the TeCoMed project is to compute early warnings against forthcoming waves or even epidemics of infectious diseases in the German federal state of Mecklenburg-Western Pomerania. Furthermore, these warnings shall be sent to interested practitioners, pharmacists etc. We have developed a prognostic model for diseases that are characterised by cyclic, but irregular behaviour. So far, we have applied this model to influenza and bronchitis.

1 Introduction

A few years ago, we developed an early warning system concerning multiparametric kidney function courses [1], which inspired us to develop the prognostic model for TeCoMed [2]. The goal of the TeCoMed project is to compute early warnings against forthcoming waves or even epidemics of infectious diseases and to send them to interested practitioners, pharmacists etc. in the German federal state of Mecklenburg-Western Pomerania.

Since our method combines temporal abstraction with Case-based Reasoning, we briefly introduce both methods. Afterwards, we present the prognostic model for the TeCoMed project and its application to influenza.

Temporal abstraction has been a hot topic in Medical Informatics since the early 1990s. The main principles of temporal abstraction have been outlined by Shahar [3]. The idea is to describe a temporal sequence of values, actions or interactions in a more abstract form, which indicates a tendency in the status of a patient. For example, for monitoring the kidney function it is fine to provide a daily report of multiple kidney function parameters. However, abstracted information about the development of the kidney function over time is a huge improvement [1].

Case-based Reasoning means using previous experience to understand and solve new problems. When solving a new problem, a case-based reasoner remembers former, similar cases and attempts to modify their solutions to fit the new problem. The Case-based Reasoning cycle [4] consists of four steps: retrieving former similar cases, adapting their solutions to the current problem, revising a proposed solution, and retaining newly learned cases. However, there are two main subtasks in Case-based Reasoning [4]: retrieval, a search for similar cases, and adaptation, a modification of the solutions of retrieved cases.


2 TeCoMed

The goal of the TeCoMed project is to compute early warnings against forthcoming waves or even epidemics of infectious diseases in the German federal state of Mecklenburg-Western Pomerania. So far, we have mainly focused our research on influenza. The available data are written confirmations of unfitness for work, which have to be sent by affected employees to their employers and to their health insurance companies. These confirmations contain the diagnoses made by their doctors. Since 1997 we have received these data from the main German health insurance company.

2.1 Influenza

Many people believe influenza to be rather harmless. However, every year the influenza virus attacks over 100 million people worldwide [5]. The most lethal outbreak ever, the Spanish Flu in 1918, claimed 20-40 million lives worldwide, which is more than the Second World War claimed on both sides together [6]. In fact, influenza is the last of the classic plagues of the past that has yet to be brought under control [7]. Consequently, in recent years some of the most developed countries have started to set up influenza surveillance systems [e.g. 7, 8].

Usually, one influenza wave can be observed in Germany each winter (Fig. 1). However, the intensities of these waves vary greatly. In some years they are nearly unnoticeable (e.g. in the winter of 1997/98), while in other years doctors and pharmacists even run out of vaccine (e.g. in the winter of 1995/96).

Fig. 1. Influenza seasons in Mecklenburg-Western Pomerania from October till March. The 1st week corresponds to the 40th week of the calendar and the 14th week to the 1st week of the next year.


Influenza waves are difficult to predict, because they are cyclic, but not regular [9]. Because of the irregular cyclic behaviour, it is insufficient to determine average values based on former years and to give warnings as soon as such values are noticeably overstepped. So, we have developed a method that again combines temporal abstraction with Case-based Reasoning. The idea is to search for former, similar courses and to make use of them for the decision whether an early warning is appropriate.

Viboud [10] applies the method of analogues [11], which originally was developed for weather forecasting. It also takes former, similar courses into account. However, the continuations of the two most similar former courses are used to predict future values, e.g. the influenza incidences of next week. Instead, we intend to discover threatening influenza waves in advance and to provide early warnings against them.

2.2 Prognostic Model for TeCoMed

Since we believe that warnings can be appropriate about four weeks in advance, we consider courses that consist of four weekly incidences. However, so far this is just an assumption that might be changed in the future. Figure 2 shows the prognostic model for TeCoMed. It consists of four steps (the grey boxes on the right side).

Fig. 2. The prognostic model for TeCoMed.

[Figure 2 box labels: Weekly Incidences; Course Description Parameters; Temporal Abstraction; Warning if appropriate; List of All Former Courses; Retrieval: Distances; Most Similar Former Courses; Sufficient Similarity; Adaptation.]

Temporal Abstraction. For the first step, we have defined three trends concerning the changes over time from last week to this week, from two weeks ago to this week, and from three weeks ago to this week. The assessments for these three trends are "enormous decrease", "sharp decrease", "decrease", "steady", "increase", "sharp increase", and "enormous increase". They are based on the percentage of change. For example, the third, the long-term trend is assessed as "enormous increase" if the incidences are at least 50% higher than those three weeks ago. If they are only at least 30% higher, it is assessed as "sharp increase", and if they are only at least 5% higher, it is just an "increase".
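The trend assessment can be sketched as follows. The 5%, 30% and 50% thresholds are those given above for the long-term trend; using the same thresholds for the other two trends and mirroring them for decreases are assumptions of this sketch, as are the function names.

```python
# Percent thresholds stated in the text for the long-term trend.
LONG_TERM_THRESHOLDS = (5.0, 30.0, 50.0)


def assess_trend(earlier, current, thresholds=LONG_TERM_THRESHOLDS):
    """Map the percentage change from `earlier` to `current` onto one of seven labels."""
    if earlier <= 0:
        raise ValueError("the earlier incidence must be positive")
    low, mid, high = thresholds
    change = 100.0 * (current - earlier) / earlier
    size = abs(change)
    if size < low:
        return "steady"
    # Assumption: decreases use the same thresholds as increases.
    direction = "increase" if change > 0 else "decrease"
    if size >= high:
        return "enormous " + direction
    if size >= mid:
        return "sharp " + direction
    return direction


def abstract_course(weeks):
    """Return the (short-, medium-, long-term) trends of four weekly incidences, oldest first."""
    w1, w2, w3, w4 = weeks
    # Assumption: all three trends share the long-term thresholds.
    return (assess_trend(w3, w4), assess_trend(w2, w4), assess_trend(w1, w4))
```

For instance, a course that rises from 100 to 160 cases over the four weeks has a long-term change of 60% and is therefore assessed as an "enormous increase" in the long term.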

Together with the four weekly data these assessments are used to determine similarities between a query course and all courses stored in the case base. Our intention in using these two sorts of parameters is to ensure that a query course and an appropriate similar course are on the same level (similar weekly data) and that they show similar changes over time (similar assessments).

Searching for Similar Courses. So far, we sequentially compute distances between a query course and all courses stored in the case base. The considered attributes are the three nominal valued trend assessments and the four weekly incidences.

When comparing a current course with a former one, distances between equal assessments are valued as 0.0, between neighbouring ones as 0.5, and otherwise as 1.0 (e.g. "increase" and "sharp increase" are neighbouring). Additionally, we use weights; the values for the short-term trend are weighted with 2.0, those for the medium-term trend with 1.5, and those for the long-term trend with 1.0, because we believe that more recent developments should be more important than earlier ones.

For the weekly data, we compute for each week the absolute difference between the value of the query course and the value of the former course. Afterwards we divide the result by the value of the query course and weight it with the number of the week within the four-week course (e.g. the first week gets the weight 1.0, the current week gets 4.0).

Finally, the distance concerning the trend assessments and the distance concerning the incidences are added.
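The distance computation described above can be sketched as follows. The weights and the 0.0 / 0.5 / 1.0 scheme are taken from the text; the representation of a course as a dictionary and the assumption that "neighbouring" means adjacent in the ordered list of the seven assessments are choices of this sketch.

```python
# The seven assessments in order; adjacent entries are treated as neighbouring.
LABELS = ["enormous decrease", "sharp decrease", "decrease", "steady",
          "increase", "sharp increase", "enormous increase"]
TREND_WEIGHTS = (2.0, 1.5, 1.0)  # short-, medium-, long-term trend


def trend_distance(a, b):
    """0.0 for equal assessments, 0.5 for neighbouring ones, 1.0 otherwise."""
    gap = abs(LABELS.index(a) - LABELS.index(b))
    if gap == 0:
        return 0.0
    return 0.5 if gap == 1 else 1.0


def course_distance(query, former):
    """Sum of the weighted trend distance and the weighted incidence distance.

    Each course is a dict (hypothetical layout) with 'trends', the
    (short, medium, long) assessments, and 'weeks', the four weekly
    incidences, oldest first. Query incidences must be positive.
    """
    trend_part = sum(
        weight * trend_distance(q, f)
        for weight, q, f in zip(TREND_WEIGHTS, query["trends"], former["trends"])
    )
    incidence_part = sum(
        week * abs(q - f) / q  # relative difference, weighted by week number 1..4
        for week, (q, f) in enumerate(zip(query["weeks"], former["weeks"]), start=1)
    )
    return trend_part + incidence_part


query = {"trends": ("increase", "sharp increase", "enormous increase"),
         "weeks": [100, 100, 100, 100]}
former = {"trends": ("sharp increase", "sharp increase", "decrease"),
          "weeks": [100, 100, 100, 150]}
```

With these made-up courses the trend part is 2.0 × 0.5 + 1.5 × 0.0 + 1.0 × 1.0 = 2.0 and the incidence part is 4 × 50 / 100 = 2.0, so the total distance is 4.0.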

Sufficient Similarity. The result of computing distances is a very long list of all former four-week courses sorted according to their distances. For the decision whether a warning is appropriate, this list is not really helpful, because most of the former courses are rather dissimilar to the query course. So, the next step is to find the most similar ones. One idea might be to use a fixed number, e.g. the first two or three courses in the sorted list. Unfortunately, this has two disadvantages. First, even the most similar former course might not be similar enough, and secondly, conversely, the fourth, fifth etc. course might be nearly as similar as the first one.

So, we decided to filter the most similar courses by applying sufficient similarity conditions. So far, we use just two thresholds. First, the difference concerning the three trend assessments between the query course and a most similar course has to be below a threshold X. This condition guarantees similar changes over time. Secondly, the difference concerning the incidences of the current week must be below a threshold Y. This second condition guarantees a similar current level. Of course, further conditions concerning the incidences of the three preceding weeks might also be used.
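A minimal sketch of this filter is given below. The values of the thresholds X and Y are not stated in the text, so they are left as parameters; the dictionary keys are hypothetical names for the two differences.

```python
def sufficiently_similar(candidates, x, y):
    """Keep the former courses that satisfy both sufficient-similarity conditions.

    Each candidate is a dict (hypothetical layout) holding the difference of the
    three trend assessments ('trend_difference') and the difference of the
    current-week incidences ('current_week_difference') to the query course.
    """
    return [
        c for c in candidates
        if c["trend_difference"] < x and c["current_week_difference"] < y
    ]


# Made-up candidates, only to show the effect of the two conditions.
candidates = [
    {"id": "A", "trend_difference": 0.5, "current_week_difference": 0.1},
    {"id": "B", "trend_difference": 3.0, "current_week_difference": 0.1},
    {"id": "C", "trend_difference": 0.5, "current_week_difference": 0.9},
]
kept = sufficiently_similar(candidates, x=1.0, y=0.5)
```

Unlike a fixed cut-off after the first two or three courses, this filter may return an empty list or a list of any length.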

Adaptation. We now have a usually very small list that contains only the most similar former courses. However, the question arises how these courses can help to decide whether an early warning is appropriate. In Case-based Reasoning, the retrieval usually provides just the most similar case, whose solution has to be adapted to fit the query course. As in Compositional Adaptation [12], we take the solutions of several similar cases into account. To this end, we have marked those time points of the former courses where we, in retrospect, believed a warning would have been appropriate. This means that the solution of a four-week course is a binary mark: either a warning was appropriate or not.

For the decision to warn, we split the list of the most similar courses into two lists. One list contains the courses where a warning was appropriate; the second list contains the other ones. For each of these two lists we sum the reciprocal distances of its courses to obtain a sum of similarities. The decision about the appropriateness of a warning then depends on which of these two sums is bigger.
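The decision rule can be sketched in a few lines. The text does not say how a distance of zero or a tie between the two sums is handled; this sketch requires positive distances and does not warn on a tie, and both choices are assumptions.

```python
def warning_appropriate(similar_courses):
    """Decide on a warning from (distance, warned) pairs of the most similar courses.

    `warned` is the binary mark of a former course; `distance` must be positive.
    """
    courses = list(similar_courses)
    if any(distance <= 0 for distance, _ in courses):
        raise ValueError("distances must be positive to take reciprocals")
    warn_sum = sum(1.0 / distance for distance, warned in courses if warned)
    no_warn_sum = sum(1.0 / distance for distance, warned in courses if not warned)
    # Assumption: a tie (including an empty list) yields no warning.
    return warn_sum > no_warn_sum
```

For example, one course with a warning mark at distance 0.5 (similarity 2.0) outweighs two courses without a mark at distances 1.0 and 2.0 (similarities 1.0 + 0.5 = 1.5), so a warning would be issued.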

2.3 First Results

Our program computes early warnings of approaching influenza waves for the German federal state of Mecklenburg-Western Pomerania. Since we have received data only since 1997, our case base contains just six influenza periods. For each of them, using the other five periods as case base, our program computes exactly the desired warnings, and it computes no warnings where none are desired. However, the question arises whether it would be more adequate to warn earlier than we have done so far.

References

1. Schmidt, R., Pollwein, B., Gierl, L.: Medical Multiparametric Time Course Prognoses Applied to Kidney Function Assessments. Int J Med Inform 53 (2-3) (1999) 253-264

2. Schmidt, R., Gierl, L.: Case-based Reasoning for Prognosis of Threatening Influenza Waves. In: Perner, P. (ed.): Advances in Data Mining. LNAI 2394, Springer Berlin (2002) 99-107

3. Shahar, Y.: A Framework for Knowledge-Based Temporal Abstraction. Artificial Intelligence 90 (1997) 79-133

4. Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications 7 (1) (1994) 39-59

5. Nichol, K.L. et al.: The effectiveness of Vaccination against Influenza in Adults. New England Journal of Medicine 333 (1995) 889-893

6. Dowdle, W.R.: Informed Consent. Nelson-Hall, Inc., Chicago, Ill.

7. Prou, M. et al.: Exploratory Temporal-Spatial Analysis of Influenza Epidemics in France. In: Flahault, A. et al. (eds.): Abstracts of 3rd International Workshop on Geography and Medicine, Paris (2001) 17

8. Shindo, N. et al.: Distribution of the Influenza Warning Map by Internet. In: Flahault, A. et al. (eds.): Abstracts of 3rd International Workshop on Geography and Medicine, Paris (2001) 16

9. Farrington, C.P., Beale, A.D.: The Detection of Outbreaks of Infectious Disease. In: Gierl, L. et al. (eds.): GEOMED ’97, International Workshop on Geomedical Systems, Teubner Stuttgart (1997) 97-117

10. Viboud, C. et al.: Forecasting the spatio-temporal spread of influenza epidemics by the method of analogues. In: Abstracts of 22nd Annual Conference of the International Society of Clinical Biostatistics, Stockholm, August 20-24 (2001) 71

11. Lorenz, E.N.: Atmospheric predictability as revealed by naturally occurring analogues. J Atmos Sci 26 (1969)

12. Wilke, W., Smyth, B., Cunningham, P.: Using Configuration Techniques for Adaptation. In: Lenz, M. et al. (eds.): Case-Based Reasoning Technology, LNAI 1400. Springer Berlin (1998) 139-168

Modeling Multimedia and Temporal Aspects of Semistructured Clinical Data

Carlo Combi, Barbara Oliboni, and Rosalba Rossato

Dipartimento di Informatica, Università degli studi di Verona, Ca’ Vignal 2, Strada le Grazie 15, 37134 Verona, Italy

{combi,oliboni,rossato}@sci.univr.it

Abstract. In this paper, we propose a semistructured data model for representing multimedia and temporal clinical information. Motivations are provided, taken from the domain of cardiac angiography.

1 Introduction

During the last years, the amount of multimedia clinical data available electronically has been growing [3]. Data reside in different forms, and this information is accessible through different interfaces like Web browsers, database query languages, application-specific interfaces or data exchange formats. This information can be raw, like images or sounds, or structured, even though the structure can be implicit. Sometimes the structure exists but has to be extracted from the data. For this reason, this kind of information is called semistructured data [1].

In this regard, the eXtensible Mark-up Language (XML) is spreading as a general format for representing, exchanging and publishing information on the Web and more generally as a standard for representing semistructured data.

As for clinical data, XML has been extensively considered as a means for data exchange among clinical applications, for the specification, through a standard language, of widely accepted medical ontologies and taxonomies [5], and for the definition of suitable languages for clinical domains [4, 6]. Thus, the interest in XML-related technologies and methodologies in the medical informatics community is considerable; nevertheless, several theoretical and methodological issues related to the adoption of semistructured data models for medical data have not yet been considered with the same accuracy used in recent years for relational and object-oriented data models [2, 3]. Multimedia and temporal aspects of medical information have been studied in some detail: suitable data models, query languages, and systems have been proposed and applied to several clinical domains such as cardiology [3], radiology, and oncology [2]. In this paper, we mainly consider theoretical and methodological issues concerning the definition of a suitable semistructured data model where both temporal and multimedia features of clinical information are explicitly addressed. The proposed data model is named Multimedia Temporal Graphical Model (MTGM), and is an extension of the Temporal Graphical Model (TGM) presented in [7]. MTGM allows one to define multimedia presentations based on multimedia objects stored in a semistructured, temporal and multimedia database.

M. Dojat, E. Keravnou, and P. Barahona (Eds.): AIME 2003, LNAI 2780, pp. 36–40, 2003. © Springer-Verlag Berlin Heidelberg 2003

2 MTGM at Work

In this Section we describe the main features of MTGM, by showing how it works with a real example taken from a clinical scenario. In particular, we will represent a database containing information on cardiology patients undergoing cardiac angiographies.

Cardiac angiography is a technique adopted to study the situation of coronary vessels (coronary angiography) and the heart functionalities (left ventriculography). The result of a cardiac angiography consists of an X-ray movie, displaying, in different parts, both heart and coronary vessel functionalities [8].

Let us consider the following piece of stored information: on October 10, 2001, at 10:00 a.m. the physician visits Ed Bawer for the first time, who becomes, from this moment, his patient. Afterwards, Ed Bawer reported that from October 15, 2001, at 8:40 a.m. to October 15, 2001, at 11:20 a.m. and from October 15, 2001, at 4:00 p.m. to October 15, 2001, at 4:50 p.m. he suffered from light chest trouble, and the physician diagnosed this symptom as chest discomfort. From October 22, 2001, at 10:30 a.m. to October 22, 2001, at 12:45 p.m. Ed Bawer underwent a cardiac angiography. The cardiac angiography revealed a severe stenosis on the left main coronary artery segment.

Let us now consider how these data are represented through an MTGM graph. MTGM has complex nodes (depicted as rectangles), such as Patient, and atomic nodes (depicted as ovals), such as Name. Moreover, MTGM has the new node type “stream”: stream nodes contain multimedia (semistructured) information such as unstructured text, movies, and sounds; they are depicted as thick ovals and are a particular kind of atomic node. For example, the atomic node StreamFile containing the file “xa12.mpg”, which encodes the movie of the patient angiography, is a stream node. The valid time (i.e., the time at which a fact is true in the modeled world) of a complex node is represented in its label, while the valid time of an atomic (stream) node is represented in the label of the edge between the atomic node and its parent (“now” indicates that the object is currently true). Complex nodes are related to other complex nodes by labelled relational edges. The label of a relational edge is composed of the name of the relationship and its valid time.
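The node and edge kinds described above can be sketched as simple data structures. This is an illustration only, not the formal MTGM definition: the class names, the representation of a valid time as a list of half-open intervals, and the use of None for “now” are assumptions of the sketch. The example values are those of the Ed Bawer scenario.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

# A half-open interval [start, end); end=None stands for "now".
Interval = Tuple[datetime, Optional[datetime]]


def holds_at(valid_time: List[Interval], t: datetime) -> bool:
    """True if instant t falls into one of the intervals of the valid time."""
    return any(start <= t and (end is None or t < end) for start, end in valid_time)


@dataclass
class ComplexNode:
    label: str
    valid_time: List[Interval]  # carried in the node label


@dataclass
class AtomicNode:
    label: str
    value: str
    is_stream: bool = False  # stream nodes are a particular kind of atomic node


@dataclass
class Edge:
    name: str
    valid_time: List[Interval]  # for atomic nodes the valid time sits on the edge
    source: ComplexNode
    target: object  # ComplexNode for relational edges, AtomicNode otherwise


patient = ComplexNode("Patient", [(datetime(2001, 10, 10, 10, 0), None)])
symptom = ComplexNode("Symptom", [
    (datetime(2001, 10, 15, 8, 40), datetime(2001, 10, 15, 11, 20)),
    (datetime(2001, 10, 15, 16, 0), datetime(2001, 10, 15, 16, 50)),
])
name = AtomicNode("Name", "Ed Bawer")
has_name = Edge("HasProperty", patient.valid_time, patient, name)
patient_symptom = Edge("Patient_Symptom", symptom.valid_time, patient, symptom)
```

The valid time of Symptom is a union of two intervals, so a query for October 15, 2001, at noon finds the patient node valid but the symptom node not.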

MTGM allows us to compose a multimedia presentation starting from the nodes stored in an MTGM database. In this context we suppose that a physician is interested in the composition of videos and other clinical information about some patients. We define, for the given patient, a multimedia presentation consisting of three parts. In the first part the name of the patient is shown simultaneously with the textual description of the clinical situation of the patient. The second part shows the complete natural-language description of the symptom for the patient. Finally, the name of the patient and the video of his coronarography are shown simultaneously. The timeline of this multimedia presentation is shown in Figure 1. In order to represent a multimedia presentation such as the one shown in Figure 1, we have to solve some problems related to the composition of the presentation starting from a multimedia temporal database. First of all, a media object can be inserted several times in the same presentation. Each visualization of a media object may require different spatial and temporal coordinates. We therefore introduce setting nodes (depicted as thick squares) to represent the information related to the visualization of a media object.

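[Figure 1: timeline over the time points t0, t1, t2, t3 on the t axis, showing the media objects Name, Description, StreamFile p_symptom.rt, and StreamFile xa12.mpg with their mpis 1 to 5.]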

Fig. 1. The timeline of the multimedia presentation with its mpis.

For example, in the first part of the presentation the object Name can be shown at the top of the screen, while in the last one it can be visualized in the center of the screen. In order to recognize the different instances of the same media object in a given presentation, we introduce the concept of the media presentation identifier (mpi). In the example related to the object Name, we have two instances: the first has mpi = 1, the second has mpi = 5. The media presentation identifier also represents the order according to which the media objects are played in the presentation (e.g. the Name instance with mpi = 1 is the first media object visualized). As shown in Figure 2, we label the edges between atomic (stream) nodes and setting nodes by means of the suitable mpi. For example, because the object Name is played in two different places, we need two different pairs of spatial coordinates: the first (related to the first Name instance) is (1, 1) and the second is (20, 20). Thus, the edge between the object Name and the setting node C_x (which represents the coordinate on the x axis of the screen and which has value 1) is labelled {1}.
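The role of the mpi labels on setting edges can be illustrated with a small lookup. The coordinates (1, 1) for mpi 1 and (20, 20) for mpi 5 are those given above for the object Name; the tuple layout and the function name are assumptions of this sketch.

```python
# Setting edges of the object Name: (setting node, value, set of mpis on the edge).
NAME_SETTINGS = [
    ("C_x", 1, {1}),
    ("C_y", 1, {1}),
    ("C_x", 20, {5}),
    ("C_y", 20, {5}),
]


def settings_for(setting_edges, mpi):
    """Collect the setting values that apply to one instance (mpi) of a media object."""
    return {kind: value for kind, value, mpis in setting_edges if mpi in mpis}
```

Here settings_for(NAME_SETTINGS, 1) selects the coordinates of the first Name instance and settings_for(NAME_SETTINGS, 5) those of the second one, so a single Name node can appear at two different screen positions.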

The second problem we have to solve is related to the representation of multimedia constraints. For example, in the first part of the multimedia presentation shown in Figure 1, the object Name and the object Description must be played simultaneously. In order to represent a multimedia relationship between two media objects, we introduce a particular kind of edge: the multimedia edge (depicted as a thick edge). MTGM allows us to insert a multimedia edge between two complex objects, between two atomic objects if they are connected to the same parent, and between a complex object and an atomic object if the complex object is the parent of the atomic object.

[Figure 2: MTGM graph with the complex nodes 〈Patient, [10/10/01 10:00, now)〉, 〈Symptom, [15/10/01 8:40, 15/10/01 11:20) ∪ [15/10/01 16:00, 15/10/01 16:50)〉, 〈Coronarography, [22/10/01 10:30, 22/10/01 12:45)〉, 〈Diagnosis, [22/10/01 10:30, now)〉, and 〈Presentation, [25/09/02 11:20, 25/09/02 11:25)〉; the atomic and stream nodes Name, Description, StreamFile, and NamePres; and the setting nodes C_x, C_y, Type, Dur, and Rep. Edges carry HasProperty, Patient_Symptom, Patient_Coronarography, and Observation labels with their valid times, and multimedia edges carry T_Equals and T_Meets labels for presentation P1.]

Fig. 2. Multimedia temporal semistructured graph.

In the first and in the second cases, the edge label is structured as {〈PresName_i, {〈mpi_j, Rel_t, mpi_k, TimeInterval_r〉}〉}, where PresName_i is the name of the presentation in which the objects mpi_j and mpi_k are visualized, Rel_t represents the synchronization relationship between them, and TimeInterval_r represents the valid time of this edge. For example, in Figure 2 the edge between the complex objects Patient and Symptom is labelled with 〈P1, {〈1, T_Equals, 2, [25/09/02 11:20, 25/09/02 11:21)〉}〉 and the edge between the atomic object Description and the atomic object StreamFile is labelled with 〈P1, {〈2, T_Meets, 3, [25/09/02 11:20, 25/09/02 11:21)〉}〉. In the last case the edge label is structured as {〈PresName_i, {〈mpi_j, TimeInterval_r〉}〉}, where PresName_i is the name of the presentation in which the object mpi_j is visualized and TimeInterval_r represents the time of the visualization. For example, in Figure 2 the edge between the complex node Patient and the atomic object Name is labelled with {〈P1, {〈1, [25/09/02 11:20, 25/09/02 11:21)〉}, {〈5, [25/09/02 11:23, 25/09/02 11:25)〉}〉}.

The complex node Presentation represents the starting point of the presentation and has an atomic node NamePres representing its unique name w.r.t. the MTGM database. The Presentation node is connected by means of an edge to the first media object of the presentation. The valid time of the Presentation node represents the time interval in which the presentation has been visualized. For example, the presentation named “P1” reported in Figure 2 has been visualized by the physician in the interval [25/09/02 11:20, 25/09/02 11:25). The Presentation node is depicted with thick lines, as is the edge between the Presentation node itself and the Patient node. Thick lines highlight the (multimedia) nodes composing a multimedia presentation. Setting nodes can be related only to atomic and stream nodes composing a multimedia presentation and represent information about visualization. The label of an edge between an atomic (stream) node and a setting node is composed of a sequence of mpis. For example, in Figure 2 the labels between the node Description and its setting nodes Dur and Type are composed of a single mpi with value 2.

References

1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations toSemistrucutred Data and XML. Morgan Kaufmann, 1999.

2. A. F. Cardenas and J. D. N. Dionisio. A Unified Data Model for RepresentingMultimedia, Timeline and Simulation Data. IEEE Transactions on Knowledge andData Engeneering, 10(5):746–767, Sept-Oct 1998.

3. C. Combi, L. Portoni, and F. Pinciroli. User-oriented views in health care infor-mation systems. IEEE Transactions on Biomedical Engineering, 49(12):1387–1398,2002.

4. R. H. Dolin, L. Alschuler, F. Behlen, P. V. Biron, D. Essin S. Boyer, L. Hard-ing, T. Lincoln, J. E. Mattison, R. Sokolowski W. Rishel, J. Spinosa, and J. P.Williams. HL7 Document Patient Record Architecture: An XML Document Archi-tecture Based on a Shared Information Model. In AMIA Annual Symposium, pages52–56, 1999.

5. C. Grover, E. Klein, M. Lapata, and A. Lascarides. XML-Based NLP Tools forAnalysing and Annotating Medical Language. In Proceeding of the 2th InternationalWorkshop on NLP and XML (NLPXML-2002), 2002.

6. C. E. Kahn and N. de la Cruz. Extensible Markup Language (XML) in healthcare: integration of structured reporting and decision support. In Proceedings of theAMIA Annual Fall Symposium, pages 725–729, 1998.

7. B. Oliboni, E. Quintarelli, and L. Tanca. Temporal aspects of semistructured data.In Proceedings of The Eighth International Symposium on Temporal Representationand Reasoning (TIME-01), pages 119–127. IEEE Computer Society Press, 2001.

8. P.J. Scanlon, D.P. Faxon, J.L. Ritchie, R.J. Gibbons, and et al. ACC/AHA Guide-lines for Coronary Angiography. Journal of the American College of Cardiology,33(6):1756–1824, 1999.


NEONATE: Decision Support in the Neonatal Intensive Care Unit – A Preliminary Report

Jim Hunter1, Gary Ewing1, Yvonne Freer3, Robert Logie2, Paul McCue1, and Neil McIntosh3

1 Department of Computing Science, University of Aberdeen, King’s College, ABERDEEN, AB24 3UE, UK
{jhunter,gewing,pmccue}@csd.abdn.ac.uk

2 Department of Psychology, University of Aberdeen, King’s College, ABERDEEN, AB24 2UB, UK
[email protected]

3 Department of Neonatology, University of Edinburgh, Edinburgh, UK
[email protected], [email protected]

Abstract. The aim of the NEONATE project is to investigate sub-optimal decision making in the neonatal intensive care unit and to implement decision support tools which will draw the attention of nursing and clinical staff to situations where specific actions should be taken or avoided. We have collected over 400 patient-hours of data on 31 separate babies, including physiological parameters sampled every second, observations made by a research nurse of all the actions performed on the baby with an accuracy of a few seconds, and occasional descriptions of the appearance, mobility, sleep patterns, etc., of the baby. We describe our attempts to use this data to discover examples of sub-optimal behaviour.

1 Introduction

The original objectives of the NEONATE project (Hunter et al. 2003) were: (i) to identify situations in the Neonatal Intensive Care Unit (NICU) where sub-optimal performance might occur; (ii) to develop a number of data processing algorithms aimed at alerting the clinical staff to those situations; and (iii) to evaluate which approaches would be most effective in bringing about improvements in performance. It has been shown that simply displaying complex time series data does not automatically lead to improvements in patient care (Cunningham et al. 1998, McIntosh et al. 2000). The COGNATE project (Alberdi et al. 2000, 2001) concluded that some assistance in the form of additional data processing is necessary to support the decisions made by the clinical staff.

Identifying sub-optimal performance poses certain methodological problems. In complex domains such as medicine, judgments about what is sub-optimal can only be made by a recognized expert in the domain. However, if the expert is physically present on the ward, the normal (unobserved) behaviour of the junior decision-maker will almost certainly be changed by that presence; this would be in addition to the practical difficulty of obtaining long periods of an expert’s time. The alternative that we have adopted is to capture in real time as much information as possible about the baby, and to present this to the expert at a later time. This paper describes the details of this approach:

• Through interviews with medical staff, we established a lexicon of terms used to describe a baby and the management actions that can be taken.

• We collected as rich a data set as possible for a number of babies through many hours of on-ward observation, noting the actions that were taken and descriptions of baby state (“descriptors”), as well as acquiring physiological and other data.

• We identified important single actions (e.g. handbagging) and attempted to acquire protocols from our expert clinician describing the circumstances under which these actions should be taken.

We have come to the (somewhat unexpected) conclusion that getting experts to comment on the appropriateness of individual actions is not practicable and we will discuss the consequences of this.

2 Developing a Lexicon of Observations and Actions

It is clear that medical staff acquire data about a patient through seeing, hearing and touching the baby as much as (or perhaps more than) by referring to physiological data acquired from instruments; we will refer to this information as “descriptors”. In attempting to capture as complete a data set as possible, we considered it necessary to attempt to record these descriptors. A pre-requisite was to agree on a suitable lexicon (or ontology). We interviewed clinical staff at all levels, asking them to say how they would describe the current state of a baby to a colleague. Thirty-two staff were interviewed and 552 descriptors were generated. Senior clinical staff subsequently reviewed these lists for consistency and to remove synonyms and singletons (words used by only one person), thus reducing the list to 166 terms. These terms were grouped under seven headings: Bowels (and urine), Crying (and facial expression), Feeding, Movement, Size (including shape and weight), Skin (including colour), Sleep (and demeanour). Examples of the 32 descriptors for Skin include: Pink, Good Capillary Refill, Blue, Jaundiced, Dry.
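The reduction step (merging synonyms, then removing singletons) was carried out by senior clinical staff by hand. Purely as an illustration of the rule itself, the sketch below applies it mechanically; the function name, the synonym map and the sample responses are all hypothetical and do not come from the study data.

```python
from collections import defaultdict


def reduce_lexicon(responses, synonyms):
    """Merge synonyms, then drop terms used by only one interviewee.

    responses: dict mapping interviewee id -> iterable of raw terms
    synonyms:  dict mapping a raw term -> its preferred term (hypothetical map)
    """
    users = defaultdict(set)
    for person, terms in responses.items():
        for term in terms:
            canonical = synonyms.get(term.lower(), term.lower())
            users[canonical].add(person)
    # A singleton is a term used by exactly one person.
    return sorted(t for t, people in users.items() if len(people) > 1)


# Invented responses for illustration only.
responses = {
    "nurse_a": ["Pink", "Jaundiced", "Mottled"],
    "nurse_b": ["pink", "Yellow"],
    "doctor_c": ["Jaundiced", "Dry"],
}
synonyms = {"yellow": "jaundiced"}
lexicon = reduce_lexicon(responses, synonyms)
```

In this toy input, "Mottled" and "Dry" are each used by a single person and are dropped, while "Yellow" is folded into "jaundiced" before counting.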

In a similar way, interviews elicited 191 terms to describe the actions that can be taken. This was reduced to 51 terms, which were organised into a hierarchy; higher-level nodes in this hierarchy included intermediate abstractions such as: Care, Collect Data, Feeding, Respiration, Communication.

3 On-ward Data Collection

A research nurse was employed for approximately four months to observe the activity at one or more cots and to make as accurate a record as possible. The information captured was:

• the equipment used to monitor, ventilate, etc.;

• the actions taken by the medical staff (see above);

• occasional descriptions of observable state (descriptors) (see above);

• the alarm limits in force on the monitors;

• the settings on the various items of equipment (including the ventilator);

• the results of blood gas analysis and other laboratory results;

• the drugs administered.

Data were entered with a timing accuracy of a few seconds on a laptop computer using a specially written program called 'BabyWatch' running under Windows. All data (with one or two exceptions) were entered by selecting from pre-compiled lists. In addition, the research nurse could enter short free-text comments.

At the same time as data were being entered manually, the 'Badger' data collection system was automatically acquiring physiological data with a time resolution of one second. The actual parameters sampled depended on the monitoring in place but typically included heart rate, transcutaneous O2 and CO2, O2 saturation, core and peripheral temperatures, and blood pressures.

Before the observations began, a detailed protocol was established to set out how the study was to be conducted. This included guidelines for clock synchronisation, subject selection, descriptor recording and ethical considerations. Ewing et al. (2002a) describe this in more detail and Hunter (2002) describes the BabyWatch software.

4 Observational Results

Data collection started in mid October 2001 and finished in mid February 2002. We collected about 407 patient-hours of observations on 31 separate babies, consisting of over 32,000 individual data records. Details of the data collected are available in Ewing et al. (2002a).

No experiment goes exactly as planned, and a certain amount of post-processing was required. The BabyWatch and ‘Badger’ clocks had to be reconciled and obvious errors in data entry corrected; again, details of the post-experimental processing are contained in Ewing et al. (2002a).

An existing tool, the Time Series Workbench (TSW), was adapted to allow us to present all of this data together. In addition to the usual presentation of time-series physiological data, it displayed:

• periods where the nurse was observing; periods where specific actions were taking place; the presence of observations entered by the nurse; the administration of medication; the presence of blood gas and laboratory results; and the existence of comments;

• the hierarchy of actions: the basic problem is that there are too many actions to display easily, so our solution is to allow the user to select one action (or a subset of actions) to be displayed by interacting with this hierarchy;

• the comments entered by the research nurse.

In addition, a tool was developed within the TSW which allowed us to view the data from the perspective of a particular type of action, observation, etc. and to collect overall statistics.

We believe that our database linking physiological measurements to simultaneous observations is one of the richest to have been collected.


5 Clinical Protocol Development

Recall that our initial objective was to identify sub-optimal decisions. From our perspective, a decision manifests itself as an observed action or the absence of such an action. We looked initially at the action of ‘handbagging’, the manual ventilation of a baby. Overall we had 58 instances of this action with an average duration of 2 minutes. Handbagging often takes place more than once in a short period of time and we grouped related instances into ‘episodes’; we had 29 such episodes. Because handbagging causes extended fluctuations in most physiological parameters, we only considered the first action in a given episode.
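The criterion used to decide that instances are "related" is not given here. A minimal sketch of one plausible rule, assuming instances are grouped when the gap between them is below a threshold, might look as follows; the function name and the `max_gap` parameter are assumptions for illustration, not the study's actual procedure.

```python
def group_into_episodes(instances, max_gap):
    """Group (start, end) action times in seconds into episodes.

    Two consecutive instances belong to the same episode when the gap between
    the end of one and the start of the next is at most max_gap seconds.
    """
    episodes = []
    for start, end in sorted(instances):
        if episodes and start - episodes[-1][-1][1] <= max_gap:
            episodes[-1].append((start, end))
        else:
            episodes.append([(start, end)])
    return episodes


# Illustrative times only; max_gap=600 is an arbitrary choice, not a value from the study.
instances = [(0, 120), (300, 400), (5000, 5100)]
episodes = group_into_episodes(instances, max_gap=600)
first_actions = [episode[0] for episode in episodes]
```

Here the first two instances fall into one episode and the third starts a new one, so only two "first actions" would be retained for analysis.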

At first sight the methodology might appear obvious: given that we have almost complete data, get the expert to look at the episodes of handbagging and decide for each whether the action was performed optimally or not. However, this is incomplete: the expert’s attention is reasonably easily focused on the times when actions were taken (and perhaps should not have been), but we must also consider the occasions when an action should have been taken but wasn’t. Without additional support, this would have required the expert to inspect all of the times when that particular action was not taken (many hours’ worth of data), and this is just not practical for a busy expert. To focus attention on possible candidates for such times, we asked the expert to define a simple protocol for when handbagging should be carried out:

((OX < 3 or SO < 60) for at least 10 seconds) and (HR < 100)

where OX is transcutaneous oxygen, SO is oxygen saturation and HR is heart rate.

As with all knowledge acquisition, our expectation was that the first attempt at formalisation would be inaccurate and incomplete. To begin with we intended to take as our gold standard the actions that were actually taken by the clinical staff, i.e. we would assume that they always made the right decision. We expected that the protocol as implemented would generate false positives and negatives (with respect to the decisions actually made to act or not to act) as well as true positives and negatives. Once the protocol had been refined over several iterations, we anticipated that a change of emphasis would occur, in that our expert’s attention would be focused more and more on the ‘false’ positives and negatives, and we anticipated that (s)he would start to query whether they really were ‘false’. In other words, the assumption that the correct decision was always made would become increasingly subject to question, and we might decide in some cases that the protocol was correct and that the decision made was in some sense sub-optimal.
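To make the protocol concrete, the sketch below evaluates it on series sampled once per second. It is only one possible reading: it assumes that "for at least 10 seconds" means ten consecutive samples, that HR < 100 is tested at the current sample, and that there are no missing values. The function name `protocol_flags` and the synthetic data are hypothetical.

```python
def protocol_flags(ox, so, hr, min_seconds=10):
    """Evaluate the handbagging protocol on series sampled once per second.

    Returns one boolean per sample: True where (OX < 3 or SO < 60) has held for
    at least min_seconds consecutive samples and HR < 100 at that sample.
    """
    flags = []
    run = 0  # length of the current run of low-oxygen samples
    for ox_t, so_t, hr_t in zip(ox, so, hr):
        run = run + 1 if (ox_t < 3 or so_t < 60) else 0
        flags.append(run >= min_seconds and hr_t < 100)
    return flags


# Synthetic data: 5 normal seconds followed by 12 seconds of low OX and low HR.
ox = [5.0] * 5 + [2.0] * 12
so = [95.0] * 17
hr = [120.0] * 5 + [90.0] * 12
flags = protocol_flags(ox, so, hr)
```

With this input the protocol first fires at the tenth consecutive low-oxygen sample. Flags of this kind could then be compared with the recorded actions to obtain the true and false positives and negatives discussed above.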

6 (Somewhat Unexpected) Conclusion

We discovered that our experts were reluctant to comment on the appropriateness or otherwise of a specific single handbagging action (whether recommended or actual) without reviewing the way in which the respiratory function of the patient had been managed over a considerable period of time (including the ventilator settings, drugs, X-rays, suction, repositioning, etc). It was clear that to put pressure on them to come to a view based on purely local (in a temporal sense) information would be counter-productive. We now consider that we need to identify and formalise the protocol for respiratory management taken as a whole. Such a protocol will be much more complex; however we believe that languages such as ASBRU (Shahar et al. 1998) are sufficiently rich to express it. Ultimately we are convinced that in the highly complex environment of the ICU, protocols must represent complete management strategies if our medical experts are going to be willing to devote time to developing them, and if the end users are going to see their advice as appropriate.

References

Alberdi E, Gilhooly K, Hunter J, Logie R, Lyon A, McIntosh N and Reiss J, ‘Computerisation and Decision Making in Neonatal Intensive Care: A Cognitive Engineering Investigation’, Journal of Clinical Monitoring and Computing, Vol 16, No 2, pp 85-94, 2000.

Alberdi E, Becher J-C, Gilhooly K, Hunter J, Logie R, Lyon A, McIntosh N and Reiss J, ‘Expertise and the Interpretation of Computerized Physiological Data: Implications for the Design of Computerized Monitoring in Neonatal Intensive Care’, International Journal of Human Computer Studies, Vol 55, No 3, pp 191-216, 2001.

Cunningham S, Deere S, Symon A, Elton RA and McIntosh N, 'A Randomized, Controlled Trial of Computerized Physiologic Trend Monitoring in an Intensive Care Unit', Crit.Care Med, Vol 26, pp 2053-2060, 1998.

Ewing G, Ferguson L, Freer Y, Hunter J and McIntosh N, 'Observational Data Acquired on a Neonatal Intensive Care Unit', University of Aberdeen Computing Science Departmental Technical Report: TR 0205, 2002a.

Hunter J, 'BabyWatch User Manual', University of Aberdeen Computing Science Departmental Technical Report: TR 0206, 2002.

Hunter J, Logie R, McIntosh N, Ewing G and Freer Y, ‘NEONATE: Effective Decision Support in the Intensive Care Unit’, http://www.csd.abdn.ac.uk/~gewing/neonate/, 2003.

McIntosh N, Becher JC, Stenson BJ, Laing IA, Lyon AJ, and Badger P, 'The clinical diagnosis of pneumothorax is late: use of trend data and decision support might allow preclinical detection', Pediatric Research, Vol 48, pp 408-415, 2000.

Shahar Y, Miksch S, and Johnson P, ‘The Asgaard Project: A Task-Specific Framework for the Application and Critiquing of Time-Oriented Clinical Guidelines’, Artificial Intelligence in Medicine, Vol 14, pp 29-51, 1998.

Abstracting the Patient Therapeutic History through a Heuristic-Based Qualitative Handling of Temporal Indeterminacy

Jacques Bouaud, Brigitte Seroussi, and Baptiste Touzet

STIM, DPA/DSI/AP–HP, Paris, France
{jb,bs,bt}@biomath.jussieu.fr

Abstract. Applying a guideline-based therapeutic strategy in the context of a chronic disease requires the decision maker, physician or system, to have a clear picture, at the appropriate level of abstraction, of a patient’s particular therapeutic history. However, like most clinical data, information on past treatments is subject to temporal indeterminacy. We propose temporal abstraction mechanisms based on a simple qualitative and heuristic treatment of temporal indeterminacy on period bounds. Allen’s intervals are extended to unknown bounds, then the conditions for continuity and simultaneousness are analysed. The aim is to restore a patient’s therapeutic history, in the case of chronic diseases, to position her within guideline therapeutic recommendations.

1 Introduction

The consideration of time in the long-term management of chronic diseases adds a difficulty to therapeutic decision making, since each decision depends on decisions made and actions taken at previous consultations, as well as on patient outcomes of those actions. Even if clinical practice guidelines (CPGs) establish what should be the right therapeutic strategy to be adopted in a number of theoretical clinical situations, a clear picture of a patient’s therapeutic history is necessary to position her within the recommended sequence of therapies and to adopt the best next step of treatment. To take the proper therapeutic decision for any given patient suffering from a chronic disease, it is mandatory to know which drugs s/he has already received, how s/he responded, and the periods of administration, to determine the level of therapeutic combination: mono-, bi-, or tritherapy. Unfortunately, these elements of information are usually expressed independently, at a low level of abstraction (commercial names), in medical records. They are often incompletely temporally stamped. Besides, the date of the medical consultation that decided on a treatment is recorded as the starting date of the treatment, and physicians may omit to record when and why a given treatment has been given up. Such indeterminacies in clinical data impede the execution of guideline-based decision support systems and their acceptance in routine use.

The importance of synthesizing time-oriented clinical data has often been stressed [1] and some systems are dedicated to this task [2–5]. From Allen’s seminal work on interval-based temporal representations [6], many general theoretical models for temporal indeterminacy management have been proposed based on probabilities (e.g. [7]), logic (e.g. [8]), or fuzzy sets (e.g. [9]).



In this paper, we propose domain-based temporal abstraction mechanisms relying on a simple qualitative and heuristic approach to handle temporal indeterminacy on period bounds. Our objective is to build, from incomplete low-level temporal data on past treatments, a high-level representation of a patient’s therapeutic history, to be mapped to a guideline therapeutic strategy.

2 Method

The goal of this work is to enable the practical implementation of a guideline-based decision support system on a chronic disease, like arterial hypertension.

The first step is to abstract drug prescriptions that exist in patient medical records and to formulate them at the level of therapeutic classes expressed in CPGs. This time-independent abstraction of drugs is performed through the use of the ATC classification, developed by WHO’s Collaborating Centre for Drug Statistics Methodology.
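As an illustration of this abstraction step, the sketch below maps full ATC codes to their second-level group by prefix. The choice of the second ATC level, the function name `therapeutic_class` and the assumption that prescriptions have already been translated from commercial names to ATC codes are all made for this example; the text does not say which ATC level was actually used.

```python
# Second-level ATC groups commonly relevant to hypertension.
ATC_CLASSES = {
    "C03": "diuretics",
    "C07": "beta blocking agents",
    "C08": "calcium channel blockers",
    "C09": "agents acting on the renin-angiotensin system",
}


def therapeutic_class(atc_code):
    """Abstract a full ATC code to its second-level group (first three characters)."""
    return ATC_CLASSES.get(atc_code[:3].upper())


# Example prescriptions given as ATC codes (atenolol, hydrochlorothiazide, amlodipine).
prescriptions = ["C07AB03", "C03AA03", "C08CA01"]
classes = [therapeutic_class(code) for code in prescriptions]
```

Codes outside the small table return `None`, which a real system would have to handle explicitly.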

In the following steps, the aim is to clear up as many temporal indeterminacies as possible to abstract the patient’s therapeutic history and to position the patient within the guideline flow. Starting with the classical interval-based representation of temporal data introduced by Allen [6], we propose an extension to account for unknown bounds. Then we characterize continuity inference to determine the temporal range of a treatment over the patient’s history, and simultaneousness inference to determine the level of drug combination of the patient’s therapy.

2.1 Extension of Allen’s Representation Formalism

In Allen’s temporal framework, an event is defined by a time interval [a, b] characterized by a starting time point a and an end time point b. Allen’s interval algebra is governed by a set of 13 mutually exclusive relations on two intervals: “before (b), meets (m), overlaps (o), starts (s), during (d), finishes (f)”, their inverse relations, and “equals (e)”. We used the same interval representation but considered that bounds may be indeterminate with an unknown time point, noted “?”, leading to four basic types of intervals:

– intervals where both bounds are known, denoted [a, b],

– left or right semi-indeterminate intervals, defined when the starting (resp. end) point is unknown, denoted ]?, b] (resp. [a, ?[),

– fully indeterminate intervals, denoted ]?, ?[.
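The four interval types can be represented directly by allowing either bound to be missing. The following is a minimal sketch in which `None` stands for the unknown point “?”; the class name `Interval`, the integer time points and the labels returned by `kind` are illustrative choices, not notation from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Interval whose bounds may be unknown (None stands for '?')."""
    start: Optional[int]
    end: Optional[int]

    def kind(self) -> str:
        if self.start is not None and self.end is not None:
            return "determinate"            # [a, b]
        if self.start is None and self.end is not None:
            return "left-indeterminate"     # ]?, b]
        if self.start is not None:
            return "right-indeterminate"    # [a, ?[
        return "fully-indeterminate"        # ]?, ?[
```

For example, `Interval(None, 5).kind()` yields `"left-indeterminate"`.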

The exhaustive analysis of all the relations that could exist between these 4 types of intervals led to a total of 208 configurations (that is, 4 × 4 pairs of interval types combined with the 13 relations). After eliminating redundancies and taking into account symmetry, we finally considered a set of 28 basic configurations. For each of them, we qualitatively studied its possible interpretations in Allen’s classical framework to determine whether indeterminacy was transferred to intervals’ relationships. These interpretations are reported in Table 1.

Seven configurations correspond to Allen’s configurations; 6 configurations, though partially indeterminate, are unambiguous with respect to Allen’s semantics, e.g. R8 corresponds to “meets” whatever the indeterminate bound. For the 15 remaining configurations, data indeterminacy is transferred to configurations since there are multiple possible interpretations, e.g. R10 may be interpreted by any of Allen’s relationships.


Table 1. The 28 configurations and their possible interpretations as Allen’s relationships

[Graphical representations of intervals i and j not reproduced]

Conf. #   Possible interpretations
R1*       “b”
R2        “b”
R3        “b”
R4        “b”
R5*       “m”
R6        “m”
R7        “m”
R8        “m”
R9*       “o”
R10       “b, m, o, s, d, f, e”
R11       “b, m, o, d, f”
R12       “b, m, o, s, d”
R13*      “s”
R14       “s, e”
R15*      “d”
R16       “b, m, o, d, f”
R17       “b, m, o, s, d”
R18       “b, m, o, s, d, f, e”
R19       “o, d, f”
R20       “b, m, o, s, d, f, e”
R21       “o, s, d”
R22       “b, m, o, s, d, f, e”
R23*      “f”
R24*      “e”
R25       “f, e”
R26       “s, e”
R27       “f, e”
R28       “b, m, o, s, d, f, e”

2.2 Continuity Inference

The aim is to aggregate continuous prescriptions of the same therapeutic class. The first step is thus to characterize continuity between intervals to be able to formalize how to merge them. We define continuity between two intervals i and j, denoted cont(i, j), when i and j are not disjoint. Using Allen’s relation “before”, we have:

∀i, j, cont(i, j) ≡ ¬(before(i, j) ∨ before(j, i))

Formal Continuity. According to the different configurations, continuity may be formally established or indeterminate. Continuity is false for configurations whose interpretation is limited to the “before” relationship (R1–R4). It is true when possible interpretations do not include “before” (R5–R9, R13–R15, R19, R21, R23–R27). In the other 9 situations, formal continuity cannot be established and remains indeterminate.
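The three-way classification above follows mechanically from the interpretation sets of Table 1. The sketch below transcribes those sets (one letter per relation) and derives the status of each configuration; the names `INTERPRETATIONS` and `formal_cont`, and the use of `None` for "indeterminate", are choices made for this illustration.

```python
# Possible Allen interpretations per configuration, transcribed from Table 1.
INTERPRETATIONS = {
    "R1": "b", "R2": "b", "R3": "b", "R4": "b",
    "R5": "m", "R6": "m", "R7": "m", "R8": "m",
    "R9": "o", "R10": "bmosdfe", "R11": "bmodf", "R12": "bmosd",
    "R13": "s", "R14": "se", "R15": "d", "R16": "bmodf",
    "R17": "bmosd", "R18": "bmosdfe", "R19": "odf", "R20": "bmosdfe",
    "R21": "osd", "R22": "bmosdfe", "R23": "f", "R24": "e",
    "R25": "fe", "R26": "se", "R27": "fe", "R28": "bmosdfe",
}


def formal_cont(config):
    """Return False, True, or None (indeterminate) for a configuration."""
    interps = set(INTERPRETATIONS[config])
    if interps == {"b"}:
        return False      # only "before" is possible
    if "b" not in interps:
        return True       # "before" is excluded
    return None           # "before" is possible but not certain


indeterminate = [c for c in INTERPRETATIONS if formal_cont(c) is None]
```

The resulting list of indeterminate configurations is R10–R12, R16–R18, R20, R22 and R28, i.e. the 9 situations mentioned in the text.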

Heuristic Continuity. When continuity is indeterminate, we have defined a heuristic continuity. The principle is to adopt reasonable assumptions that eliminate the “before” interpretation. Taking into account contextual knowledge, i.e. the management of a chronic disease in primary care, we have assumed that the value of drug prescription duration, noted M, was adjusted to the recommended periodicity of medical consultations. For instance, in the domain of hypertension management in primary care, this period is 3 months.

In indeterminate configurations, semi-indeterminate intervals are given a duration of M. Some indeterminacies can then be cleared up. Considering configuration R11 for instance, if the distance between the start points of the two intervals is less than M, then the “before” interpretation is discarded, and continuity can be inferred. This heuristic principle may concern 5 configurations (R10–R12, R16, R17). Ambiguous interpretations are not resolved by this method for R18, R20, R22, and R28, which involve at least a fully indeterminate interval and therefore remain indeterminate.
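One way to read this heuristic is sketched below: a missing bound of a semi-indeterminate interval is replaced using the duration M, and continuity is inferred only if the completed intervals are not disjoint. This is an interpretation, not the authors' implementation; the names `fill` and `heuristic_cont`, the use of calendar dates, and M = 90 days as an approximation of 3 months are assumptions of this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

M = timedelta(days=90)  # assumed approximation of the 3-month prescription duration


@dataclass(frozen=True)
class Interval:
    start: Optional[date]  # None stands for an unknown bound "?"
    end: Optional[date]


def fill(iv, m=M):
    """Return (start, end) with a single missing bound completed using m,
    or None when both bounds are unknown."""
    if iv.start is None and iv.end is None:
        return None
    start = iv.start if iv.start is not None else iv.end - m
    end = iv.end if iv.end is not None else iv.start + m
    return start, end


def known_before(a, b):
    """True when a certainly ends before b starts (both bounds known)."""
    return a.end is not None and b.start is not None and a.end < b.start


def heuristic_cont(i, j, m=M):
    """Return True, False, or None (still indeterminate)."""
    if known_before(i, j) or known_before(j, i):
        return False
    fi, fj = fill(i, m), fill(j, m)
    if fi is None or fj is None:
        return None   # a fully indeterminate interval is involved
    if fi[1] < fj[0] or fj[1] < fi[0]:
        return None   # the assumed duration does not bridge the gap
    return True


i = Interval(date(2002, 1, 1), None)                  # [a, ?[
near = Interval(date(2002, 3, 1), date(2002, 6, 1))   # starts 59 days after i
far = Interval(date(2002, 6, 1), date(2002, 9, 1))    # starts 151 days after i
```

With these values, `heuristic_cont(i, near)` infers continuity because the start points are less than M apart, whereas `heuristic_cont(i, far)` leaves the question open.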

Continuity Calculation. When continuity holds between 2 intervals i and j, they are merged to yield a new interval. However, even in the case of formal continuity, the resulting interval remains indeterminate as soon as one of the arguments is indeterminate. Using the M threshold, a new operator fusion_h is defined, which heuristically interpolates i and j.

2.3 Simultaneousness Inference

The aim is to identify the therapeutic classes that have been prescribed together, thus adding their therapeutic effects, to infer the level of drug combination. Intervals of continuous administration of different therapeutic classes are compared to identify and quantify overlapping periods. The second step of our temporal abstraction is then to characterize interval simultaneousness to identify windows of mono-, bi- and tritherapy.

Two intervals i and j are simultaneous, denoted simult(i, j), when the intersection i ∩ j is a non-empty interval. Using cont and “meets”, we have:

∀i, j, simult(i, j) ≡ cont(i, j) ∧ ¬(meets(i, j) ∨ meets(j, i))

Formal Simultaneousness. Like continuity, simultaneousness may be formally established or indeterminate. Simultaneousness is false for configurations that are interpreted either as “before” or as “meets” (R1–R8). It is true for configurations that can never be interpreted in terms of these two relationships (R9, R13–R15, R19, R21, R23–R27). In the other 9 situations, formal simultaneousness cannot be established.
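The same kind of test as for continuity applies here, with “meets” excluded in addition to “before”. The short sketch below takes a set of possible interpretations (one letter per relation, as in Table 1) and returns the formal status; the function name `formal_simult` and the use of `None` for "indeterminate" are assumptions of this illustration.

```python
def formal_simult(interpretations):
    """Return False, True, or None (indeterminate) for a set of interpretations."""
    interps = set(interpretations)
    if interps <= {"b", "m"}:
        return False      # only "before" and/or "meets" are possible
    if not interps & {"b", "m"}:
        return True       # neither "before" nor "meets" is possible
    return None           # the intersection may or may not be empty


# Interpretation sets taken from Table 1.
status = {
    "R8": formal_simult("m"),
    "R9": formal_simult("o"),
    "R11": formal_simult("bmodf"),
    "R19": formal_simult("odf"),
}
```

R8 is thus not simultaneous, R9 and R19 are, and R11 remains formally indeterminate.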

Heuristic Simultaneousness. Similarly to heuristic continuity, we have defined a heuristic simultaneousness for configurations where interpretations by “before” and “meets” relationships can be discarded. We use the same mean duration for a drug prescription, M, to clear up some temporal indeterminacies and go further in inferring simultaneousness from the patient therapeutic history. This is possible for the same 5 configurations (R10–R12, R16, R17). Still, the same 4 cases remain indeterminate.

Simultaneousness Calculation. Similarly to continuity calculation, an operator cooc_h is defined to calculate the interval that corresponds to the heuristic intersection of i and j using M.


3 Discussion and Conclusion

The only way to clear up indeterminacy is to bring additional knowledge. In the case of temporal indeterminacy, probabilistic approaches mostly rely on probability distribution functions (PDFs) to characterize indeterminate instants [7]. However, providing such PDFs may not be possible when they are themselves unknown.

The proposed approach relies on a qualitative analysis of Allen’s relationships when some interval bounds are totally unknown. Theoretical works have been proposed for semi-interval-based representations [8] in the context of temporal reasoning. However, for pragmatic reasons and for the sake of simplicity, we adopted domain-based heuristic principles to clear up some temporal indeterminacy. We assumed that the standard value of drug prescription duration is adjusted to the recommended periodicity of medical consultations, i.e. 3 months in the special case of the follow-up of chronic diseases in primary care. This is a strong hypothesis. Though validated by GPs “on average”, it is not true when a new treatment leading to non-tolerated side-effects is proposed to a patient. In this case, the next medical consultation is generally earlier, but then the treatment stop might be reported as well as its cause. Another limitation of our method is that it does not account for other causes of temporal indeterminacy like time granularity mismatch [7]. Here, we assumed a consistent granularity on interval specifications, which is acceptable for chronic diseases in primary care.

In conclusion, we proposed to deal with temporal indeterminacy on period bounds by considering some knowledge-based heuristic principles dedicated to therapeutic prescription. This seems realistic since it makes it possible to make the most of the poorly time-stamped data available in medical records. It allows a synthetic representation of a patient’s therapeutic history to be built, which is a mandatory preliminary step to the implementation of guideline-based decision support for the management of chronic diseases.

References

1. Shahar, Y., Musen, M.A.: Knowledge-based temporal abstraction in clinical domains. Artif Intell Med 8 (1996) 267–298

2. Shahar, Y., Musen, M.A.: Resume: a temporal-abstraction system for patient monitoring. Comput Biomed Res 26 (1993) 255–273

3. Shahar, Y., Miksch, S., Johnson, P.: The Asgaard project: a task-specific framework for the application and critiquing of time-oriented guidelines. Artif Intell Med 14 (1998) 29–52

4. Duftschmid, G., Miksch, S., Gall, W.: Verification of temporal scheduling constraints in clinical practice guidelines. Artif Intell Med 25 (2002) 93–121

5. O’Connor, M.J., Tu, S.W., Musen, M.A.: The Chronus II temporal database mediator. J Am Med Inform Assoc 8 (2002) 567–571

6. Allen, J.F.: Maintaining knowledge about temporal intervals. Communications of the ACM 26 (1983) 832–843

7. Dyreson, C.E., Snodgrass, R.T.: Supporting valid-time indeterminacy. ACM Transactions on Database Systems 23 (1998) 1–57

8. Freksa, C.: Temporal reasoning based on semi-intervals. Artif Intell 54 (1992) 199–227

9. Badaloni, S., Giacomin, M.: A fuzzy extension of Allen’s interval algebra. In Lamma, E., Mello, P., eds.: AI*IA99, LNAI 1792, Berlin Heidelberg, Springer-Verlag (2000) 155–165


How to Represent Medical Ontologies in View of a Semantic Web?

Christine Golbreich1, Olivier Dameron2, Bernard Gibaud2, and Anita Burgun1

1 Laboratoire d’Informatique Médicale, Faculté de Médecine, Av. du Pr. Léon Bernard, 35043 Rennes, France
[email protected], [email protected]

2 Laboratoire IDM, UPRES-EA 3192, Faculté de Médecine, Av. du Pr. Léon Bernard, 35043 Rennes Cedex, France
{Olivier.Dameron,Bernard.Gibaud}@chu-rennes.fr

Abstract. The biomedical community has concrete needs for a future Semantic Web. An important issue is to know whether the W3C languages will meet its requirements. This paper aims at contributing to this question by evaluating two presently available languages, Protégé and DAML+OIL, on an actual ontology under development, the brain cortex ontology. It draws conclusions on their expressiveness, compares it to other ontology languages, in particular to the next standard OWL and the hybrid language CARIN-ALN, and discusses the main features that should be in a Web language for medical ontologies in view of a Semantic Web.

1 Introduction

With the development of the Web, as well as the proliferation of biomedical knowledge, end-users of the medical community may potentially access growing amounts of information. However, in practice it is still difficult to access such information in a satisfactory way, i.e. in a timely way and with minimum noise and silence, due to the limitations of currently available Web search and navigation tools. Thus, a major challenge for the Web is to evolve towards a “Semantic Web”, where information has more explicit semantics, enabling machines to make better use of available data and make it more easily accessible. Semantic markup of data is the key to reaching that goal. Ontologies play a central role, since they define the concepts to use for it. They provide a shared meaning, supposed to be re-usable for various applications and users. The biomedical community has major needs regarding the Web. However, representing medical ontologies raises some difficulties. The paper evaluates the expressiveness of two ontology languages presently available, Protégé and DAML+OIL, on a concrete medical ontology, in order to highlight features that seem important for Web ontologies in the biomedical domain, and to know whether the future standard OWL¹, Web Ontology Language, will be suited to medicine.

1 DAML+OIL and OWL sublanguage OWL-DL are quite similar

52 Christine Golbreich et al.

2 A Case Study in the Medical Domain

This study focuses on an ontology of brain anatomy that has been developed for neuroimaging. Medical imaging plays a prominent role in medicine, contributing to diagnosis, to treatment preparation and performance, and to follow-up. The extensive use of digital imaging equipment now makes it possible to produce semi-structured documents representing the observations made on the images by physicians or by Computer Assisted Detection (CAD) programs that analyze those images. Such observations describe the findings, their anatomical locations, and possibly the inferences leading to the conclusion. Furthermore, all these elements can be used to index and retrieve documents, and in this regard the relation to anatomy is of primary importance [19]. However, successful exploitation of this information, for clinical or research purposes, requires that the concepts involved have precise and shared semantics [3] and an explicit representation. This precision of image descriptions is needed to enable successful communication between healthcare professionals, and is critical for processing the documents with automatic tools such as CAD systems, e.g. to assess the evolution of a pathology between successive imaging studies. For example, to describe a lesion (shape, texture, etc.) and give its location with respect to neighbouring anatomical structures, one has to refer to anatomical concepts whose properties are well formalised. In particular, it is necessary to know whether the relations that are used (taxonomy, part-whole) are transitive, or the inverse of one another, or whether integrity constraints have to be checked. Besides, exploiting such documents for research purposes requires a consensus about the meaning of the information. Indeed, in most cases, the data are produced and stored in distributed and autonomous databases. Therefore, pooling them to apply a common process requires the data to be articulated around a common ontology [22].
This ontology must be adequately formalised to enable consistent and uniform data querying, coping with the semantic heterogeneity of the original data (e.g. various abstraction and granularity levels). For example, the observation « glioma located in the left postcentral gyrus » should be matched by the query « tumours located in the parietal lobe ». Such heterogeneous data can be reconciled thanks to suitable mappings based on an explicit representation of taxonomy and part-whole relations. Implemented and shared ontologies are thus needed for successful information processing in the medical imaging field, but they have specificities, some of which are exemplified in the following (§4).
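The matching of the glioma observation against the parietal-lobe query can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionaries IS_A and PART_OF, the class names, and the helper functions are assumptions made for illustration, and each concept is given at most one parent to keep the example short.

```python
# Hypothetical toy fragment of a taxonomy (is-a) and a partonomy (part-of).
# All names below are illustrative assumptions, not taken from the ontology itself.
IS_A = {
    "Glioma": "Tumour",
    "LeftParietalLobe": "ParietalLobe",
    "LeftPostcentralGyrus": "PostcentralGyrus",
}
PART_OF = {
    "LeftPostcentralGyrus": "LeftParietalLobe",
    "LeftParietalLobe": "LeftHemisphere",
}


def chain(term, relation):
    """Return term followed by everything reachable by repeatedly following relation."""
    result = [term]
    while term in relation:
        term = relation[term]
        result.append(term)
    return result


def matches(observation, query):
    """An observation (finding, location) matches a query (finding, location) when
    the finding is a kind of the queried finding and the location is, or is part of,
    a structure that is a kind of the queried location."""
    finding, location = observation
    query_finding, query_location = query
    if query_finding not in chain(finding, IS_A):
        return False
    return any(
        query_location in chain(container, IS_A)
        for container in chain(location, PART_OF)
    )


observation = ("Glioma", "LeftPostcentralGyrus")
print(matches(observation, ("Tumour", "ParietalLobe")))
```

The sketch relies on the part-whole relation being followed transitively and on the taxonomy being consulted at each level, which is why both kinds of relation need an explicit representation.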

3 Semantic Web Languages

Many languages, of varying expressiveness, might be used to formalize ontologies. W3C standards have been defined, and several other languages (§5) have been developed over the last few years; their analysis is outside the scope of this paper (see for instance [10]). OWL, the future Web Ontology Language standard, is part of the Semantic Web "stack" related to the W3C recommendations (see http://www.w3.org):

• XML/XMLS: XML [6] provides the syntax transport layer for structured documents but imposes no semantic constraints on the meaning of these documents. XMLS [23] is a language for defining the abstract structure of XML documents.


XML and XMLS might look sufficient for publishing or exchanging medical data, but only if people have previously agreed on the tag definitions.

• RDF/RDFS: RDF [18] is a simple data model for objects (resources) and relations between them, and provides a simple semantics for this data model. RDFS makes it possible to define classes, subclasses, properties, domains, and ranges. RDFS can be seen as a simple ontology language with rather poor semantics. It might be sufficient for performing only simple tasks on medical resources.

• DAML+OIL [5] provides the logical layer. It comes from DAML (DARPA Agent Markup Language) [11] and from OIL [8]. It borrows its intuitive modelling primitives from frames, its syntax from XML and RDF, and its formal semantics and reasoning from description logics (DL).

• OWL [17], following DAML+OIL, offers additional primitives for describing properties and classes, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equivalence, characteristics of properties (e.g. symmetry), and enumerated classes, along with a formal semantics. OWL provides three increasingly expressive sublanguages: OWL Lite, OWL DL, and OWL Full. OWL Lite is the least expressive; thus, according to its designers, “it should be simpler to provide tool support for OWL Lite than for the two others, and easier to provide a quick migration path for thesauri and other taxonomies”. OWL DL offers completeness (all solutions are guaranteed to be computed) and decidability (all computations finish in finite time). OWL Full, with maximal expressivity and the syntactic freedom of RDF, offers no computational guarantees.

• Further layers e. g. rules (http://www.dfki.uni-kl.de/ruleml) are expected.

4 Representing the Brain Ontology in Protégé and DAML+OIL

Two presently available languages have been used for the brain cortex anatomy ontology: a frame-based language supported by the Protégé-2000 editor, and DAML+OIL, based on the description logic SHIQ and supported by the OILEd editor.

4.1 Ontology in Protégé

Knowledge representation in Protégé is based on frames. Protégé-2000 [15] is a graphical and easy-to-use ontology-editing tool developed at Stanford University (http://protege.stanford.edu). The class inheritance hierarchy is visualised as a tree; multiple inheritance is allowed. Users define and organize classes, subsumption relationships, properties and property values. Metaclasses can be defined. A UMLS™ client [13] has been developed. It allows users who are developing and populating their knowledge base in Protégé to search and import UMLS [14] elements directly into Protégé-2000. Other on-line resources can be used in a similar manner for knowledge acquisition in Protégé, e.g. WordNet [7].

The definitions of the anatomical concepts in the following examples are based on anatomy atlases such as [16], as well as terminology sources such as NeuroNames [2]. For instance, a « brain hemisphere » is defined as an anatomical part of the cortex which is lateralized (i.e. located either on the right or on the left side), includes five anatomical subdivisions called lobes (frontal, temporal, parietal, occipital and limbic lobes) and occupies a specific region of space. « Left hemispheres » are represented in Protégé by the class LeftHemisphere, a subclass of Hemisphere and of LeftLateralizedAnatomicConcept, whose slots are inherited, some of them being overloaded: the slot hasSide is restricted to LeftSide, and the slot hasDirectAnatomicalPart is restricted to LeftLobe with the facets « at least » and « at most » both set to 5, etc. This representation only expresses that a hemisphere has 5 lobes, without distinguishing their types. It would be difficult (though possible) with frames to represent that a hemisphere has exactly one lobe of each type (frontal, temporal, etc.) (Ex14).

4.2 Ontology in DAML+OIL

Knowledge representation in DAML+OIL is based on the description logic SHIQ. DAML+OIL provides a more expressive language, including reasoning services, together with a friendly graphical user interface using metaphors common to frame-based systems. The OILEd editor [1] is a graphical ontology-editing tool2 developed by the University of Manchester. Users define classes, subsumption relations, and properties with type restrictions (Fig. 1). Complex descriptions can be used as slot values. Axioms allow additional knowledge to be represented, e.g. asserting that two classes are disjoint (Ex5). The next examples illustrate the rigorous formalisation of concepts and taxonomy, and the automatic classification supported by DAML+OIL.

Ex1. /An anatomical concept is composed of direct parts, which are anatomical concepts, and occupies exactly one region of space/
AnatomicalConcept := (∀ hasDirectAnatomicalPart AnatomicalConcept) ∧ (= 1 hasLocation SpaceRegion)

Ex2. /A lateralized concept is located either on the right side or on the left side of the brain; one can distinguish right-sided and left-sided lateralized concepts/
LateralizedAnatomicalConcept := AnatomicalConcept ∧ (= 1 hasSide LeftSide ∨ RightSide)
LeftLateralizedAnatomicalConcept := LateralizedAnatomicalConcept ∧ (∀ hasSide LeftSide)
(resp. RightLateralizedAnatomicalConcept)

Ex3. /A hemisphere is a lateralized concept whose direct parts are lobes, each part being of a distinct type/
Hemisphere := LateralizedAnatomicalConcept
∧ (∀ hasDirectAnatomicalPart Lobe)
∧ (= 1 hasDirectAnatomicalPart FrontalLobe)
∧ (= 1 hasDirectAnatomicalPart ParietalLobe)
∧ (= 1 hasDirectAnatomicalPart OccipitalLobe)
∧ (= 1 hasDirectAnatomicalPart LimbicLobe)
∧ (= 1 hasDirectAnatomicalPart TemporalLobe)

Ex4. /A left (resp. right) hemisphere is a hemisphere located on the left (resp. right) side/
LeftHemisphere := LeftLateralizedAnatomicalConcept ∧ Hemisphere

The LeftHemisphere concept is defined as a Hemisphere as well as a LeftLateralizedAnatomicalConcept, together with a number of restrictions on its direct parts (Ex3). Consequently, it has exactly 5 direct parts, which are LeftFrontalLobe, LeftParietalLobe, LeftOccipitalLobe, LeftLimbicLobe, and LeftTemporalLobe. Thus, the FaCT classifier automatically classifies it as subsumed by FiveDirectPartAnatomicalConcept, as shown in the post-classification hierarchy (Fig. 1), whereas it was initially subsumed only by LeftLateralizedAnatomicalConcept and Hemisphere.

2 For a survey of tools: Corcho, O., Fernandez-Lopez, M. and Gomez-Perez, A. Methodologies, tools and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering 2003

Fig. 1. Left: post-classification hierarchy. Right: definition with OILEd

4.3 List of Needed Primitives

These examples present the features that are covered by Protégé or DAML+OIL (Ex1 to Ex14) and those that are not (Ex15 to Ex18), making it possible to draw first conclusions about the expressiveness required of a Web language for medical ontologies. For each example, the number refers to the Protégé-2000 and DAML+OIL primitives that have been used (§5, Table 1 and Table 2, columns 2 and 3).

Ex5. disjointWith is needed to represent disjointness (#9)
/Hemisphere, Lobe, Gyrus and Sulcus are disjoint classes/
disjointWith (Hemisphere Lobe Gyrus Sulcus)

Ex6. disjunction is required (#2)
/Lateralized anatomical concepts are either right or left/
LateralizedAnatomicalConcept := AnatomicalConcept ∧ (= 1 hasSide LeftSide ∨ RightSide)

Ex7. negation is needed (#5) / Class of the anatomical concepts that are not lateralized /

NonLateralizedAnatomicalConcept:= AnatomicalConcept ∧ ¬ LateralizedAnatomicalConcept


Ex8. disjointUnionOf is a primitive needed to represent a partition of a concept into a list of concepts (#10)
/A side of the brain cortex is either right or left but not both/
disjointUnionOf(CortexSide LeftSide RightSide)
/A lobe is one of the following types: frontal, parietal, temporal, occipital, limbic lobe/
disjointUnionOf(Lobe FrontalLobe ParietalLobe TemporalLobe OccipitalLobe LimbicLobe)
/A hemisphere is either a right hemisphere or a left hemisphere/
disjointUnionOf(Hemisphere LeftHemisphere RightHemisphere)
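The meaning of disjointUnionOf can be illustrated on a finite set of individuals with a minimal Python sketch. This only checks one given interpretation and is not what a DL reasoner such as FaCT does; the function name and the individuals h1 and h2 are hypothetical.

```python
from itertools import combinations


def is_disjoint_union(whole, parts):
    """True if the sets in parts are pairwise disjoint and together cover whole."""
    pairwise_disjoint = all(a.isdisjoint(b) for a, b in combinations(parts, 2))
    covering = set().union(*parts) == whole
    return pairwise_disjoint and covering


# Hypothetical individuals: two hemispheres of one brain.
hemisphere = {"h1", "h2"}
left_hemisphere = {"h1"}
right_hemisphere = {"h2"}

print(is_disjoint_union(hemisphere, [left_hemisphere, right_hemisphere]))
```

Both conditions matter: dropping the disjointness test would accept a hemisphere that is both left and right, and dropping the covering test would accept a hemisphere that is neither.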

Ex9. ≡ (equivalence) is needed to represent class equivalence (#8)
/The left lobe concept is equivalent to left lateralized anatomical concept and lobe/
LeftLobe ≡ LeftLateralizedAnatomicalConcept ∧ Lobe

Ex10. ⊆ (subsumption) is needed to represent class or relation specialization hierarchies (#11)
/The relation hasDirectAnatomicalPart is a specialisation of hasAnatomicalPart/
hasDirectAnatomicalPart ⊆ hasAnatomicalPart

Ex11. transitive is needed for representing transitivity of relations (#14)
/has-part is transitive (hasDirectPart is not)/
Slot-def has-part Properties transitive

Representing such property characteristics (reflexivity, symmetry, transitivity) is required. For example, transitivity makes explicit the distinction between hasDirectAnatomicalPart and hasAnatomicalPart. The latter corresponds to the transitive closure of hasDirectAnatomicalPart: e.g. direct anatomical parts of hemispheres are lobes and direct anatomical parts of lobes are gyri, thus anatomical parts of hemispheres are lobes and gyri. DAML+OIL provides this possibility while Protégé does not.
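The relation between the two properties can be sketched in a few lines of Python that derive hasAnatomicalPart as the transitive closure of hasDirectAnatomicalPart. The dictionary layout and the function name are assumptions for illustration; a DL reasoner handles transitivity internally rather than by materialising the closure.

```python
def transitive_closure(direct):
    """Given a mapping whole -> set of direct parts, return whole -> set of all parts."""
    closure = {whole: set(parts) for whole, parts in direct.items()}
    changed = True
    while changed:
        changed = False
        for whole, parts in closure.items():
            # Collect the parts of every part already known for this whole.
            extra = set()
            for part in parts:
                extra |= closure.get(part, set())
            if not extra <= parts:
                parts |= extra
                changed = True
    return closure


# Class-level example taken from the text: hemispheres have lobes, lobes have gyri.
has_direct_anatomical_part = {"Hemisphere": {"Lobe"}, "Lobe": {"Gyrus"}}
has_anatomical_part = transitive_closure(has_direct_anatomical_part)
print(sorted(has_anatomical_part["Hemisphere"]))
```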

Ex12. inverse relation is needed (#13)
/inverse of hasLocation/
isLocatedIn inverseOf hasLocation

Ex13. equivalence of relations must be represented (#12)
/concept A is an anatomical part of a concept B if and only if the space occupied by A is a subspace of that occupied by B/
isAnatomicalPartOf ≡ (isLocatedIn ∘ isSubAreaOf ∘ hasLocation)

From this definition, constraints on body spaces can be inferred for two anatomical concepts A and B linked by isAnatomicalPartOf and, conversely, A isAnatomicalPartOf B can be inferred from their respective regions. Moreover, equivalence between relations is crucial for merging several Web ontologies.
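The composition in Ex13 can be illustrated with relations stored as sets of pairs. The sketch below assumes the composition is read right to left (hasLocation first, then isSubAreaOf, then isLocatedIn); the region identifiers r_gyrus and r_lobe and the helper compose are hypothetical.

```python
def compose(first, second):
    """Pairs (a, c) such that (a, b) is in first and (b, c) is in second."""
    return {(a, c) for (a, b1) in first for (b2, c) in second if b1 == b2}


# Hypothetical facts: each concept occupies one region, one region lies inside the other.
has_location = {("PostcentralGyrus", "r_gyrus"), ("ParietalLobe", "r_lobe")}
is_sub_area_of = {("r_gyrus", "r_lobe")}

# isLocatedIn is the inverse of hasLocation (Ex12).
is_located_in = {(region, concept) for (concept, region) in has_location}

# Apply hasLocation, then isSubAreaOf, then isLocatedIn.
is_anatomical_part_of = compose(compose(has_location, is_sub_area_of), is_located_in)
print(is_anatomical_part_of)
```

In this reading, part-whole facts between concepts are derived entirely from inclusion facts between the regions they occupy.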

Ex14. cardinality and non-exclusive constraints on relations have to be represented (#6)
/A hemisphere is a lateralized concept whose direct parts are lobes and which has exactly one lobe of each type/
Hemisphere := LateralizedAnatomicalConcept
∧ (∀ hasDirectAnatomicalPart Lobe)
∧ (= 1 hasDirectAnatomicalPart FrontalLobe)
∧ (= 1 hasDirectAnatomicalPart ParietalLobe)
∧ (= 1 hasDirectAnatomicalPart OccipitalLobe)
∧ (= 1 hasDirectAnatomicalPart LimbicLobe)
∧ (= 1 hasDirectAnatomicalPart TemporalLobe)

The minCardinality, maxCardinality and cardinality primitives of DAML+OIL allow such constraints to be represented, whereas frame-based languages do not.
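The difference between a global facet « exactly 5 lobes » and the qualified restrictions of Ex14 can be shown with a small Python check on one described individual. This is a closed-world illustration only, not DL reasoning; the function names are hypothetical.

```python
from collections import Counter

LOBE_TYPES = ("FrontalLobe", "ParietalLobe", "OccipitalLobe", "LimbicLobe", "TemporalLobe")


def has_five_lobes(direct_part_types):
    """Frame-style facet: at least 5 and at most 5 direct parts, all of them lobes."""
    return len(direct_part_types) == 5 and set(direct_part_types) <= set(LOBE_TYPES)


def has_one_lobe_of_each_type(direct_part_types):
    """Ex14-style restriction: only lobes, and exactly one lobe of each type."""
    counts = Counter(direct_part_types)
    only_lobes = set(counts) <= set(LOBE_TYPES)
    one_of_each = all(counts[lobe_type] == 1 for lobe_type in LOBE_TYPES)
    return only_lobes and one_of_each


# Five frontal lobes satisfy the global facet but not the qualified restrictions.
wrong_hemisphere = ["FrontalLobe"] * 5
print(has_five_lobes(wrong_hemisphere), has_one_lobe_of_each_type(wrong_hemisphere))
```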

The following examples exhibit needs that have not been covered by DAML+OIL and that should be satisfied.

Ex15. composition between relations is not provided but required
/a concept which has a location which is included in a region occupied by another concept C’/
isLocatedIn ∘ isSubAreaOf ∘ hasLocation

A possible solution for representing composition is using rules.

Ex16. n-ary relation is not provided but required
/Ternary relation: a sulcus is a separator for two lobes, or two gyri, or one gyrus and one lobe/
Separation := AnatomicalConcept ∧ (= 1 separator Sulcus) ∧ (= 2 separate Lobe ∨ Gyrus)   (1)
parts(S V) ∧ 1stPart(V A) ∧ 2ndPart(V B) → separation(S A B)   (2)

Frames and description logics allow only binary relations. Possible solutions for representing n-ary relations include relation reification, i.e. to represent it by a concept, e.g. Separation (1), and rules (2) like CARIN-ALN rules [20].

Ex17. rules are not provided but required (#15)
Rule #1: IF A is part of B THEN A has the same side as B
isAnatomicalPartOf(A B) ∧ hasSide(B C) → hasSide(A C)
Rule #2: IF C is part of D and A is not part of D and S separates A and C THEN S separates A and D
isAnatomicalPartOf(C D) ∧ ¬ isAnatomicalPartOf(A D) ∧ separation(S A C) → separation(S A D)
Rule #3: definition of a ternary predicate from roles (binary relations)
separate(S V) ∧ firstpart(V A) ∧ 2ndpart(V B) → separation(S A B)

Rules are required for complex properties which cannot be represented within the expressiveness of DLs. In this application, they may make it possible to express dependencies between relations and consistency constraints. For example, if a sulcus S separates two gyri G1 and G2 that belong to different lobes (G1 is part of L1, G2 is part of L2), then S separates G1 from L2, G2 from L1, and L1 from L2. Such a rule would generate 221 relations in the brain cortex ontology presented in [4].
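The propagation described in this example can be sketched as naive forward chaining of Rule #2 in Python. The sketch assumes that separation is symmetric in its last two arguments (represented here by an unordered pair) and that the part-of facts supplied are already complete; neither assumption is stated in the rule itself, and the function name is hypothetical.

```python
def apply_separation_rule(part_of, separations):
    """Apply Rule #2 until no new fact is derived.

    part_of: set of (part, whole) pairs.
    separations: set of (sulcus, frozenset({a, b})) facts, with a different from b.
    """
    derived = set(separations)
    changed = True
    while changed:
        changed = False
        for sulcus, pair in list(derived):
            for a in pair:
                (c,) = pair - {a}  # the other separated structure
                for part, whole in part_of:
                    # C is part of D, A is not part of D, S separates A and C.
                    if part == c and a != whole and (a, whole) not in part_of:
                        fact = (sulcus, frozenset({a, whole}))
                        if fact not in derived:
                            derived.add(fact)
                            changed = True
    return derived


part_of = {("G1", "L1"), ("G2", "L2")}
separations = {("S", frozenset({"G1", "G2"}))}
for sulcus, pair in sorted(apply_separation_rule(part_of, separations), key=str):
    print(sulcus, sorted(pair))
```

Starting from the single fact that S separates G1 and G2, the fixpoint contains the three additional separations named in the text.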

Ex18. metaclasses are not provided but required (#16)
/The class FrontalLobe, instance of the metaclass MetaAnatomicalConcept, is related by the property UMLS-ID to the UMLS Concept Unique Identifier C0016733/
<MetaAnatomicalConcept rdf:ID="FrontalLobe">
  <UMLS-ID rdf:resource="&rdfs;Literal">C0016733</UMLS-ID>
</MetaAnatomicalConcept>

Since metaclasses exist in Protégé, defining a metaclass with a slot UMLS-ID for connecting the ontology concepts to the UMLS™ concepts is possible in Protégé but not in DAML+OIL (although it will be legal in OWL Full).

4.4 Results

The use of Protégé-2000 and DAML+OIL for the brain cortex ontology has led to the following conclusions:

• First, representing the brain cortex anatomy ontology led to difficulties with both languages, but many limitations of Protégé are overcome by DAML+OIL, thanks to the enhanced expressiveness of SHIQ description logics versus frames.

• Next, it turns out that most DAML+OIL constructors (Table 1) and axioms (Table 2), in particular negation, disjunction, and inverse, were needed for the ontology and would certainly be needed in a Web language for biomedical ontologies.

• Equivalence of classes or relations, subclass and subproperty are key axioms for asserting relationships between classes and relations of separately developed ontologies, and are thus especially required for merging several Web ontologies.

• Finally, the main expressivity limitation of DAML+OIL, which the future Web Ontology Language should overcome, is the lack of rules (Ex15, Ex17), in particular to join relations. Metaclasses might be useful to connect ontologies to existing medical standards like UMLS™.


In conclusion, an expressive DL similar to DAML+OIL is required to express complex taxonomic knowledge; rules should make it possible to express dependencies between relations and to use predicates of arbitrary arity; and metaclasses might be useful to take advantage of existing medical standards.

5 Discussion

W3C standards, but also other formal languages, are available for Web ontologies. Table 1 and Table 2 compare the main constructors and axioms supported by Protégé-2000 and DAML+OIL, which is quite similar to OWL-DL, to those of OWL Lite, which is less expressive, and to those of CARIN-ALN [20], a hybrid language with rules.

Table 1. Main language constructors: + means ‘available’, - ‘lacking’, +/- ‘limited’

Constructor       Protégé 2000   DAML+OIL (~ OWL-DL)   Example   OWL Lite   CARIN-ALN
1. conjunction    +              +                     Ex4       +          +
2. disjunction    -              +                     Ex2       -          -
3. universal      +              +                     Ex1       +          +
4. existential    -              +                     -         +          -
5. negation       -              +                     Ex7       -          +/-
6. cardinality    +/-            +                     Ex14      +/-        +

Table 2. Main axioms (used in the brain cortex ontology)

Axiom                      Protégé 2000   DAML+OIL (~ OWL-DL)   Example   OWL Lite   CARIN-ALN
7. subsumption             +              +                     Ex10      +          +/-
8. class equivalence       -              +                     Ex9       +          -
9. disjointness            -              +                     Ex5       -          +/-
10. disjoint union         -              +                     Ex8       -          -
11. subproperty            +              +                     Ex10      +          -
12. property equivalence   -              +                     Ex13      +          -
13. inverse                +              +                     Ex12      +          -
14. transitivity           -              +                     Ex11      +          -
15. rule                   -              -                     Ex17      -          +
16. metaclass              +              -                     Ex18      -          -

From a formal point of view, DAML+OIL is roughly equivalent to the description logic SHIQ extended with the oneOf constructor and datatypes, together with a convenient set of algebraic axioms. It can make use of the FaCT system, which provides a reasoner with sound and complete tableau algorithms for reasoning on ontologies, and thus supports automatic tasks such as ontology consistency checking, concept classification, and instantiation. CARIN-ALN is based on the less expressive ALN description logic, but combines it with a powerful rule language. OntoClass provides for CARIN-ALN the same services as FaCT, but subsumption and satisfiability are polynomial instead of exponential. Moreover, thanks to its rules, CARIN-ALN can be used as a query language to consult heterogeneous information via mediators built with PICSEL [20]. The previous use case leads to the conclusion that, ideally, a hybrid language integrating an expressive DL with rules, similar to CARIN-ALN or TRIPLE [21], would benefit medical ontologies. Besides, it might be used as a Web query language to search for medical information. However, combining description logics with rules implies restricting the description logic part, the form of the rules, or both, in order to remain decidable and to have sound and complete algorithms [12]. An open question is how to define a relevant subclass of DL and a subset of rules to be integrated into a uniform language suited to Web medical applications. This study of expressiveness is a first step. To go further, the priority uses of a Semantic Web expected by the biomedical community, and its main requirements, should be specified: are decidable reasoning, sound and complete reasoning procedures, and efficient reasoning procedures necessary? Suitable modularization mechanisms for assembling separately developed medical ontologies (e.g. gene, protein, disease, etc.) are another important open question to be tackled.

6 Conclusion

A Web language for medical ontologies should have formal semantics and maximum expressiveness, so as to enable a fine and precise representation of both taxonomic and deductive knowledge, but also efficient means to reason with the large amounts of knowledge that characterize the biomedical domain: automatic classification and ontology consistency checking. A user-friendly interface like that of Protégé-2000 is another crucial feature for the medical community. Connecting ontologies with existing medical standards like UMLS is also required. Thus, the next standard OWL, which gathers both advantages, seems a good candidate. But it should be extended by a rule layer, for several purposes: representing dependencies between relations, constraints, consistency checking, etc. However, since expressiveness and tractability are opposed, a balance must be found that supports representation of the most important medical knowledge for the main Web uses in the biomedical domain.

References

1. Bechhofer S., Horrocks I., Goble C., Stevens R. OILEd: a Reason-able Ontology Editor for the Semantic Web. Proceedings of KI2001, Joint German/Austrian conference on Artificial Intelligence, Vienna. Springer LNAI Vol. 2174, (2001) 396-408

2. Bowden, DM and Martin, RF. NeuroNames Brain Hierarchy, Neuroimage, 2 (1995) 63-83

3. Brinkley J.F. and Rosse C. Imaging informatics and the Human Brain Project: the role of structure, Yearbook of Medical Informatics (2002) 131-148

4. Dameron O., Burgun A., Morandi X., Gibaud B. Modelling dependencies between relations to insure consistency of a cerebral cortex anatomy knowledge base. Proceedings of Medical Informatics in Europe (2003)

5. DAML+OIL Reference Description. Dan Connolly, Frank van Harmelen, Ian Horrocks, Deborah L. McGuinness, Peter F. Patel-Schneider, and Lynn Andrea Stein. W3C Note 18 December 2001. http://www.w3.org/TR/daml+oil-reference.

6. Extensible Markup Language (XML) 1.0 (Second Edition). Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler, eds. (2000). http://www.w3.org/TR/REC-xml.

7. Fellbaum C, ed. WordNet: an electronic lexical database. Cambridge, MIT Press (1998)

8. Fensel D., van Harmelen F., Horrocks I., McGuinness D.L., and Patel-Schneider P. F. OIL: An ontology infrastructure for the semantic web. IEEE Intell Systems, 16(2) 38-45 (2001)

9. Rector A., Nowlan W.A. and the GALEN Consortium, The GALEN Project. Computer Methods and Programs in Biomedicine, 45 (1993) 75-78

10. Gomez-Perez A., Corcho O., Ontology languages for the Semantic Web, IEEE Intelligent Systems (2002) 17, 4 54-60

11. Hendler, J., McGuinness, D.L. The DARPA Agent Markup Language. IEEE Intelligent Systems 16(6) (2000) 67-73

12. Levy A. Y, Rousset MC, The Limits on Combining Recursive Horn Rules with Description Logics, AAAI/IAAI, Vol. 1 (1996)

13. Li Q, Shilane P, Noy NF, Musen MA. Ontology acquisition from on-line knowledge sources. Proc. AMIA Symp. (2000) 497-501.

14. Lindberg D.A, Humphreys B.L., McCray AT. The Unified Medical Language System. Meth. Inf Med Aug; 32(4) (1993) 281-91

15. Noy N. F., Sintek M., Decker S., Crubezy M, Fergerson R. W., Musen M. A. Creating Semantic Web Contents with Protege-2000. IEEE Intelligent Systems 16(2) (2001) 60-71

16. Ono M, Kubik S and Abernathey. Atlas of the Cerebral Sulci. Georg Thieme Verlag, Thieme Medical Publishers Inc (1990)

17. OWL Web Ontology Language Reference Version 1.0. Dean M., Connolly D., van Harmelen F., Hendler J., Horrocks I., McGuinness D. L, Patel-Schneider P. F. and Stein L. A. W3C Working Draft 31 March 2003. http://www.w3.org/TR/owl-ref/

18. RDF/XML Syntax Specification (Revised). Dave Beckett, ed. W3C Working Draft 23 January 2003. http://www.w3.org/TR/rdf-syntax-grammar/.

19. Rosse C, Mejino JL, Modayur BR, Jakobovits R, Hinshaw KP, Brinkley JF. Motivation and organizational principles for anatomical knowledge representation: the digital anatomist symbolic knowledge base. J Am Med Inform Assoc. Jan-Feb 5(1) 17-40 (1998).

20. Rousset M-C, Bidault A, Froidevaux C, Gagliardi H, Goasdoué F, Reynaud C, Safar B. Construction de médiateurs pour intégrer des sources d’information multiples et hétérogènes : le projet PICSEL, Revue I3 : Information - Interaction - Intelligence (2002)

21. Sintek M., Decker S. TRIPLE: An RDF Query, Inference, and Transformation Language. DDLP'2001 Japan (2001)

22. Toga A.W. Neuroimage databases: the good, the bad and the ugly, Nature reviews neuroscience vol 3 (2002) 302-309

23. XML Schema Part 2: Datatypes. Paul V. Biron and Ashok Malhotra, eds. W3C Recommendation 02 May 2000. http://www.w3.org/TR/xmlschema-2/.


Using Description Logics for Managing Medical Terminologies

Ronald Cornet and Ameen Abu-Hanna

Dept. of Medical Informatics, Academic Medical Center, University of Amsterdam P.O. Box 22700 1100 DE Amsterdam, The Netherlands

{r.cornet,a.abu-hanna}@amc.uva.nl

Abstract. Medical terminological knowledge bases play an increasingly important role in medicine. As their size and complexity are growing, the need arises for a means to verify and maintain the consistency and correctness of their contents. This is important for their management as well as for providing their users with confidence about the validity of their contents. In this paper we describe a method for the detection of modeling errors in a terminological knowledge base. The method uses a Description Logic (DL) for the representation of the medical knowledge and is based on the migration from a frame-based representation to a DL-based one. It is characterized by initially using strong assumptions in concept definitions, thereby forcing the detection of concepts and relationships that might comprise a source of inconsistency. We demonstrate the utility of the approach in a real world case study of a terminological knowledge base in the Intensive Care domain and we discuss decisions pertaining to building DL-based representations.

1 Introduction

Medical terminological knowledge bases (TKBs) represent knowledge about medical concepts, relationships and terms. For example, a concept may be defined as “inflammation of the membranes of the brain or spinal cord”, and described by the synonymous terms “cerebrospinal meningitis” and “meningitis”. TKBs provide an invaluable source of structured medical knowledge, serving a range of purposes.

A frame-based representation is commonly used to express definitions of concepts. This formalism supports an intuitive way of knowledge modeling but it lacks explicit semantics, making it hard to automate reasoning. Examples of services expected from the utilization of the TKB include the classification of concepts and consistency checking of the TKB. To perform this automatically, a formal basis is needed for the knowledge representation formalism.

A seemingly attractive formalism to consider is that of Description Logics (DLs), a family of formal languages that are subsets of First Order Logic (FOL) and that provide for an object-oriented-like structure of concept definitions.

In this paper we explore a way for deploying DLs for supporting the reasoning services of classification and consistency checking of a medical TKB. Our starting point is that the TKB at hand is specified or implemented in a frame-based language. This is the case in the great majority of TKBs available today. In our approach we migrate the frame-based KB to a DL-based one. Because the frame-based representation is ambiguous, this migration requires making its semantics explicit. We have developed a method to perform this migration by posing explicit assumptions on semantics, e.g. of a frame slot. The idea is to start with strong assumptions about definitions in order to force the reasoning system to identify potentially inconsistently defined concepts. This identification is realized by exploiting the satisfiability services of a DL. Each unsatisfiable concept may indicate a too strong assumption but may also indicate errors in the original frame-based definition. Our hypothesis is that going through the migration process and performing satisfiability testing provides a serious contribution for maintaining the contents of medical TKBs. To assess this hypothesis, we have applied our method to a real world knowledge base of Reasons for Admission in Intensive Care, which has been developed in recent years at our department.

This paper is organized as follows. In Section 2 we provide preliminaries on Frame-based representation, Description Logics, and the differences between them. We describe our method in Section 3 and focus on error detection in Section 4. Section 5 reports on the results of this case study. We conclude with observations on application of our method, and on modeling medical terminological knowledge bases.

2 Frame-Based and Description Logic-Based Representations

Frames (Minsky 1981) provide a means of describing classes and instances, with slots of frames representing either relations to other classes, or properties of the represented class. Frames can represent subclasses by means of a KindOf relation, allowing slots (and any slot-fillers) to be inherited from the superclass by the subclass.
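Slot inheritance along the KindOf relation can be sketched in Python as follows. The Frame class and the example frames and slot fillers are illustrative assumptions, not the DICE implementation.

```python
class Frame:
    """A class frame with an optional superclass (KindOf) and its own slots."""

    def __init__(self, name, kind_of=None, slots=None):
        self.name = name
        self.kind_of = kind_of
        self.slots = dict(slots or {})

    def all_slots(self):
        """Own slots plus those inherited from superclasses; own slot-fillers take precedence."""
        inherited = self.kind_of.all_slots() if self.kind_of else {}
        return {**inherited, **self.slots}


# Hypothetical frames, loosely following the meningitis example used in this paper.
meningitis = Frame("Meningitis", slots={"location": "Meninges"})
viral_meningitis = Frame("Viral meningitis", kind_of=meningitis, slots={"etiology": "Virus"})

print(viral_meningitis.all_slots())
```

The subclass has to name its superclass explicitly; nothing in this structure allows a subclass relation to be inferred from the slots alone.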

As an example of a medical TKB, we will use the DICE knowledge base, which is developed at our department (de Keizer, Abu-Hanna et al. 1999). The DICE system (Diagnoses for Intensive Care Evaluation) represents knowledge in the domain of Intensive Care, with a focus on reasons for admission. Like many medical TKBs, it is organized around health problems, which are defined according to their anatomy, abnormality, etiology, and system (e.g. vascular system, digestive system), as shown in Figure 1. The model is implemented using class frames only.

The model provides the possibility of specifying two special facets of slots, namely transitivity (for example the “part of” slot is transitive), and refinability (for allowing choices of slot-fillers). Figure 2 shows an example of refinability, where the etiology of viral meningitis is indicated by our notation as OR(Virus), meaning that any subclass of virus is accepted here. The application will in that case present to the user the possible values (i.e. all viruses) and request the user to specify one or more viruses that caused the patient’s meningitis.

Description Logics (DLs) (Baader, Calvanese et al. 2003) provide fragments of FOL for formal definition of concepts. These definitions can either be primitive (specifying only necessary conditions), or non-primitive (specifying both necessary and sufficient conditions). For example, consider the following two axioms

Mother ⊑ Parent ; Mother ≡ Woman AND Parent

The first states that a mother is necessarily a parent, whereas the second states that a mother is necessarily both a woman and a parent, and that anyone who is a woman and a parent is necessarily a mother.
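Under the set-theoretic reading of DLs, the difference between the two kinds of definition can be illustrated on a small finite interpretation in Python: a primitive definition constrains the extension of Mother to be a subset, a non-primitive one fixes it exactly. The individuals and function names are hypothetical.

```python
# Hypothetical finite interpretation of the concepts Woman and Parent.
woman = {"mary", "ann"}
parent = {"mary", "john"}


def satisfies_primitive(mother):
    """Primitive definition: every mother is a parent (necessary condition only)."""
    return mother <= parent


def satisfies_non_primitive(mother):
    """Non-primitive definition: the mothers are exactly the women who are parents."""
    return mother == (woman & parent)


# An empty extension for Mother is allowed by the primitive definition only.
print(satisfies_primitive(set()), satisfies_non_primitive(set()))
print(satisfies_primitive({"mary"}), satisfies_non_primitive({"mary"}))
```

Only the non-primitive definition lets a reasoner conclude that mary is a mother from the facts that she is a woman and a parent.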


The formal, set-theoretic semantics of DLs provide statements with an unequivocal meaning, which makes reasoning with DL-based knowledge reproducible and application independent. Each DL is characterized by the concept and role constructors it allows for. Examples of concept constructors are AND (⊓), OR (⊔), NOT (¬), SOME (∃), ALL (∀), AT-LEAST (≥). For example: Happy Father ≡ Father AND (Rich OR At-Least 3 Children). Examples of role constructors are transitivity (e.g. for the “part of” role: if A part of B and B part of C, then A part of C), inverse roles (e.g. “is_caused_by” is the inverse role of “causes”), or role taxonomies (e.g. “has sister” is a kind of “has sibling” role).

DL-based knowledge bases generally consist of a TBox (Terminology box) containing axioms (such as the above-mentioned examples), and an ABox (Assertion box) containing assertions (e.g. Mary is a Mother; Betty is a child of Mary).

The foremost reasoning tasks with DLs are subsumption (classification) and satisfiability checking. Reasoning is based on the open world assumption, basically meaning that the set of given individuals is not assumed to be complete.

2.1 Differences between Frames and Description Logics

Frames and Description Logics both provide means of representing concepts, relations, and instances. There are however a number of significant differences, which need to be taken into account in the process of migration from frames to DL.

Classes versus Concepts. As DL-based reasoning makes it possible to infer subsumption, the resulting taxonomy will be a combination of stated and inferred subsumption (e.g. consider the “Mother” example above). Class frames, in contrast, need to be explicitly defined as subclasses of all applicable superclasses.

[Figure 1: class frames with their slots, listed as slot: domain]

Health Problem
    caused_by: Health Problem

Direct Health Problem (kind of Health Problem)
    system: System
    location: Anatomy
    abnormality: Abnormality
    etiology: Etiology
    syndrome_part: Direct Health Problem

Operative Procedure (kind of Health Problem)
    OP_system: System
    OP_location: Anatomy
    OP_abnormality: Abnormality
    OP_act: Act

Anatomy
    part_of: Anatomy
    part_of_system: System

Act, Abnormality, System, Etiology (no slots shown)

Fig. 1. Domain model of the ontology of DICE. Two types of health problems are distinguished, direct health problems and operative procedures. The domains of the slots are represented in Italics. Various examples of subclasses are shown in Figure 2.


Disjointness and Covering. As opposed to most Frame-based representations, DLs allow one to formally specify that concepts are mutually exclusive (disjoint), by stating that one is subsumed by the complement of the other: Virus ⊑ ¬Bacterium.

This axiom renders any concept defined as both a Virus and a Bacterium unsatisfiable. In addition one could specify that there are no other microorganisms, by:

Microorganism ⊑ Virus ⊔ Bacterium

Slots versus Roles. Without additional constructs, Frame slots and any slot-fillers may be interpreted in various ways. For example, a slot cause with slot-filler “(Virus, Bacterium)” may mean that both virus and bacterium are an actual cause, or both are possible causes (possibly combined), either with or without other possible causes, etc.

Description Logics leave no room for such ambiguity. Role quantification is used to express the required meaning. For example, (Disease ⊓ ∃cause.Virus) uses existential quantification (∃) to denote diseases that have a cause which is a virus. Universal quantification (∀) is used to limit possible role-values. E.g. (Disease ⊓ ∀cause.Virus) denotes diseases of which all causes (if any) are viruses. Combining existential and universal quantification makes it possible to precisely define the semantics of roles.

Slot Facets versus Role Constructors. Semantics of slot facets are often unclear and application-dependent. Examples of such facets are both the refinability and the transitivity facet as described above. In contrast, the semantics of role constructors are explicitly defined, and taken into account by DL reasoners.

3 Migration from Frame-Based to DL-Based Representation

The first step in our method is the translation of a Frame-based representation to a Description Logic-based representation. Because of the loose semantics of frames, assumptions will be made about their semantics. We will focus on disjointness, role quantification and role values, and part-whole reasoning, as these are believed to have the greatest impact on inconsistency detection.

[Figure 2: example class frames]

Meningitis
    Kind of: Brain Disease
    Anatomy: Meninges
    Abnormality: Infection
    Etiology: OR(Virus, Bacterium, Fungus)

Viral Meningitis
    Kind of: Meningitis
    Etiology: OR(Virus)

Meninges
    Kind of: Body Part
    Part of: Brains

Microorganism
    Kind of: Etiology

Fungus
    Kind of: Microorganism

Bacterium
    Kind of: Microorganism
    Aerobe: XOR(true, false)

Virus
    Kind of: Microorganism

Fig. 2. Examples of frame-based class definitions. The “Kind of” slot defines direct superclasses. Slot facets “XOR” and “OR” specify whether instances can be defined with exactly one (XOR), or more than one (OR) value from the slot fillers.


Disjoint Definitions. In order to detect as many potential inconsistencies as possible, maximally stringent definitions were assumed, explicitly stating disjointness of siblings. We have defined all concepts subsumed by Act, Abnormality, System and Etiology as disjoint from each of their siblings. In Figure 2 for example, Virus, Fungus, and Bacterium are defined as disjoint. In this way, we can express meningitis caused solely by a virus as:

ViralMeningitis ⊑ Meningitis ⊓ ∃cause.Virus ⊓ ∀cause.Virus

An attempt to define viral meningitis caused by a bacterium will result in an unsatisfiable concept, as disjointness of Bacterium and Virus is now explicitly stated.
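How stated disjointness turns such a definition into an unsatisfiable one can be sketched for a single role. The check below is deliberately simple and incomplete, and is not what a tableau reasoner such as RACER does internally: it only tests whether an existentially required filler clashes with a universally imposed filler. The function names and the small taxonomy (taken from Figure 2) are illustrative assumptions.

```python
from itertools import combinations

# Told taxonomy and sibling disjointness, following Figure 2.
PARENTS = {
    "Virus": {"Microorganism"},
    "Bacterium": {"Microorganism"},
    "Fungus": {"Microorganism"},
    "Microorganism": {"Etiology"},
    "Etiology": set(),
}
DISJOINT = {frozenset(pair)
            for pair in combinations(["Virus", "Bacterium", "Fungus"], 2)}

def ancestors(concept):
    """The concept itself plus all of its told superclasses."""
    seen, stack = {concept}, [concept]
    while stack:
        for parent in PARENTS.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def disjoint(a, b):
    """True if a and b (or any of their superclasses) are stated disjoint."""
    return any(frozenset((x, y)) in DISJOINT
               for x in ancestors(a) for y in ancestors(b))

def role_clash(some_fillers, all_fillers):
    """Restrictions on one role: each witness required by an existential
    filler must also belong to every universal filler, so a disjoint pair
    among them makes the concept unsatisfiable."""
    for witness in some_fillers:
        required = [witness, *all_fillers]
        if any(disjoint(a, b) for a, b in combinations(required, 2)):
            return True
    return False

# Meningitis with some cause Virus and all causes Virus: no clash.
print(role_clash(["Virus"], ["Virus"]))               # False
# Adding some cause Bacterium clashes with all causes Virus.
print(role_clash(["Virus", "Bacterium"], ["Virus"]))  # True
```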

Role Quantification and Role Values. As discussed earlier, semantics of slot-fillers are unclear, allowing multiple interpretations. The assumptions we have posed on the semantics are shown in Table 1, where we present the frame-based representation and its DL-based counterpart, where the slot “cause” and the fillers are taken as examples. In the case of DICE, also the refinability facet of slots needed to be taken into account. Fillers of regular slots are assumed to represent both existentially and universally quantified roles. Fillers of slots with an OR facet (used in DICE to specify zero, one or more of the values when creating an instance) represent only universal quantification. Fillers of slots with an XOR facet (to specify at most one value) are represented as a number restriction (at-most 1) and a universal quantification. As the assumption of universal quantification is too stringent in numerous cases, a special purpose facet has been added to the slots to explicitly specify whether a slot should be considered to represent universal quantification or not. This facet can be updated during the migration process to override the default assumption.

Part-Whole Relations. Partitive relations play an important role in medical knowledge bases but may demand great expressiveness of Description Logics. This can be overcome by the use of Structure-Entity-Part triplets (SEP), as suggested by (Schulz, Romacker et al. 1998). The motivation for SEP triplets is the avoidance of transitive roles and role chaining, but this comes at the cost of having to define every anatomical component in three ways (as an entity, a part, and a structure). We also found the SEP representation to be very useful for the aim of detecting inconsistencies.

Table 1. Frame-based slot-fillers and their assumed DL-based counterparts.

| Frame-based representation | Assumed DL-based equivalent |
|---|---|
| cause: (Virus, Bacterium) | ∃cause.Virus ⊓ ∃cause.Bacterium ⊓ ∀cause.(Virus ⊔ Bacterium) |
| cause: OR(Virus, Bacterium) | ∀cause.(Virus ⊔ Bacterium) |
| cause: XOR(Virus, Bacterium) | ≤1 cause ⊓ ∀cause.(Virus ⊔ Bacterium) |
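The assumptions of Table 1, together with the facet that overrides universal quantification, could be written down as a small translation function. This is a sketch only: the function name, its parameters, and the textual output format are hypothetical and not part of the DICE tooling.

```python
def slot_to_dl(slot, fillers, facet=None, universal=True):
    """Translate one frame slot into a DL expression string (Table 1 rules).

    facet:     None (regular slot), "OR" or "XOR"
    universal: set to False to drop the universal restriction, mirroring
               the special-purpose facet added during migration
    """
    union = " ⊔ ".join(fillers)
    if len(fillers) > 1:
        union = f"({union})"

    parts = []
    if facet is None:
        # Regular slot: every filler is existentially required.
        parts += [f"∃{slot}.{filler}" for filler in fillers]
    elif facet == "XOR":
        # At most one value may be chosen.
        parts.append(f"≤1 {slot}")
    elif facet != "OR":
        raise ValueError(f"unknown facet: {facet!r}")

    if universal:
        # Values are limited to the listed fillers.
        parts.append(f"∀{slot}.{union}")
    return " ⊓ ".join(parts) if parts else "⊤"

print(slot_to_dl("cause", ["Virus", "Bacterium"]))
print(slot_to_dl("cause", ["Virus", "Bacterium"], facet="OR"))
print(slot_to_dl("cause", ["Virus", "Bacterium"], facet="XOR"))
print(slot_to_dl("location", ["Lungs"], universal=False))
```

The last call shows the effect of tagging a slot as "not universal": only the existential restriction remains.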

4 Detecting Errors

In order to detect errors one needs an automatic classifier. A standard Description Logic classifier such as FaCT (Horrocks, Sattler et al. 2000) or RACER (Haarslev and Möller 2000) can be used to find unsatisfiable concepts in the DL-based knowledge base. Unsatisfiability of a concept however does not necessarily imply incorrect definition of the concept. Generally, there can be three explanations for unsatisfiability:

1. The concept itself is correctly defined but refers to an unsatisfiable concept (e.g. it is a child of an unsatisfiable concept)

2. The concept is correctly defined, but the semantics assumed during migration of that concept or any of its subsumers do not represent the intended semantics (e.g. a role is incorrectly assumed to represent universal quantification)

3. The concept is semantically incorrect (e.g. a kind of hepatitis which is defined as located in the kidneys instead of the liver).

In the first situation, one unsatisfiable concept can cause a large number of unsatisfiable concepts. As finding such a concept is non-trivial, research is ongoing to develop methods to support this (Schlobach and Cornet 2003). One approach to sort out such situations is to start with concepts that are used as role-values for other concepts. For example, in the case of the Intensive Care knowledge base, subsumees of Anatomy, Act, Etiology, and System are such concepts, hence it is expedient to first address unsatisfiable concepts subsumed by those concepts.
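One simple way to order the work, sketched below, is to start with those unsatisfiable concepts that do not themselves refer to another unsatisfiable concept. This is a heuristic for illustration, not the debugging method of Schlobach and Cornet; the function name, the input format, and the example concepts are assumptions.

```python
def root_unsatisfiable(unsat, references):
    """Return the unsatisfiable concepts that refer to no other
    unsatisfiable concept.

    unsat:      set of concept names reported unsatisfiable by a reasoner
    references: dict mapping a concept to the set of concepts used in its
                definition (superclasses and role-values)
    """
    return {c for c in unsat
            if not (references.get(c, set()) - {c}) & unsat}

# Hypothetical reasoner output and definitions.
references = {
    "ViralHepatitis": {"Hepatitis", "Virus"},
    "Hepatitis": {"LiverDisease", "Liver"},
    "Liver": {"Organ"},
}
unsat = {"Liver", "Hepatitis", "ViralHepatitis"}

print(root_unsatisfiable(unsat, references))  # {'Liver'}
```

Here only Liver needs to be inspected first; Hepatitis and ViralHepatitis may become satisfiable once its definition is repaired.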

5 Results

We have applied the method described above to the DICE knowledge base, in order to gain insight into the feasibility of this approach. The DICE knowledge base consists of about 2500 concept frames, with over 3000 filled slots (other than “kind of” slots). We used RACER to process the DL-based representation of the knowledge base and check the consistency of the TBox. As mentioned earlier, assumptions posed on the semantics of the frame-based representation may turn out not to be justified. The facet to overrule default interpretation of role quantification made it possible to iteratively migrate from frames to DL, find unsatisfiable concepts, and determine whether the unsatisfiability stemmed from an incorrect assumption or from a modeling error. In either case, the frame-based representation could be changed accordingly, and a new DL-based representation emerged iteratively.

Below we will make a distinction between unsatisfiability introduced by the migration method, and unsatisfiability caused by modeling errors. As the actual migration process is still ongoing, the results are not yet fully quantified. Moreover, the analysis presented here is specific for the DICE knowledge base, and may differ significantly for other TKBs. It does however provide insight into the possibilities of using our method.

5.1 Unsatisfiable Concepts Caused by the Migration Method

The stringent assumptions put on the frame-based representation resulted in two types of assumption errors: errors caused by incorrect assumption of disjointness, and errors caused by incorrect assumptions on quantification.

Disjointness errors were found in the descendants of etiology. For example, the (false) assumption was made that “addictive drug” and “analgesic” are disjoint, but “Morphine and Opioids” is (correctly) defined as a descendant of both.


This unsatisfiability could be overcome by removing the assumption of disjointness. It needs to be noted that we have not posed disjointness on the descendants of “health problem”. This is motivated by the fact that the axioms defining them should make it possible to distinguish between them, which is not possible for most of the other concepts, such as descendants of etiology, which lack specification of distinguishing properties.

A large number of unsatisfiable health problems were found, which could be explained by the stringent assumptions posed on the quantification of roles. Universal role quantification was frequently falsely assumed. For many cases, this could be explained by the fact that a frame-based representation requires explicit classification. This led to a large number of grouper concepts, such as “lung disease”, which (falsely) assumed the location to be lungs, and nothing else. This led to unsatisfiability of all diseases that were defined as a “lung disease”, but that also involved a location different from lungs. In these cases, the frame-based representation was altered by tagging the relevant slots as “not universal”.

5.2 Unsatisfiable Concepts due to Incorrect Definitions

Various types of modeling errors were found in the process of migration. We will categorize them as: misclassification, false quantification, missed slot-fillers, and incorrect relations.

Misclassification. A small number of misclassifications have been found, i.e. concepts that were misplaced in the taxonomy. This mainly involved concepts that were placed as siblings where one of the concepts should have been subordinate to the other (be its child). Another example of misclassification is illustrated by a concept that was defined as both a health problem and an abnormality, which are disjoint. Instead of being subsumed by abnormality, it should have been related to abnormality by a slot-filler. The most notable case of misclassification was found in the anatomy taxonomy, where a part of the hierarchy was defined incorrectly by switching subsumers and subsumees. This involved the concept “laryngo tracheo bronchitis” which was defined as the subsumer of laryngitis, tracheitis and bronchitis, whereas it should be defined as a subsumee of these three concepts.

False Quantification. A number of incorrect quantifications were found. These mainly involved concepts for which the OR or XOR facet was not (correctly) specified (see Table 1). As this is very specific for the DICE KB, we will not go into further detail. It is however important to realize that correct specification of universal and/or existential quantification is necessary to be able to detect incorrect role values (or slot-fillers in the frame representation).

Missed Slot-Fillers. A number of concepts were found that were lacking slot-fillers. Two typical situations were found, of which examples are given below.

Table 2 shows how the slot-filler “endocrine system” overrides the inherited “nervous system”, instead of being an additional system. Hence, acromegaly should also involve the nervous system, stating system: (endocrine system, nervous system).


In the migration process, it turned out that the “system” role in brain disease should not be defined as universal, as brain diseases can involve other systems.

Table 2. Example of missing slot-filler.

| Frame-based representation | Assumed DL-based equivalent |
|---|---|
| Brain disease; system: nervous system | Brain_disease ⊑ ∃system.nervous_system |
| Acromegaly; kind of: Brain disease; system: endocrine system | Acromegaly ⊑ brain_disease ⊓ ∃system.endocrine_system ⊓ ∀system.endocrine_system |

The other typical case was related to OR slots. Frequently, classes had OR slots without slot-fillers that were defined for some of their subclasses. In such cases, these slot-fillers were added to the OR slot of the superclass.

Incorrect Slots. The anatomy taxonomy revealed a number of concepts for which a part-of relation was accidentally mixed up with a kind-of relation. This is an error that has been found in other systems as well, and for which DL reasoning provides a powerful means for detecting it (Schulz and Hahn 2001).

5.3 Observations from the Case Study

During the process of detecting and resolving errors, a number of issues came to light that require further investigation. We have made only the changes needed to resolve inconsistencies in the original knowledge base. However, studying the definitions indicated that in some cases a more rigorous redefinition would be justified. More attention should also be paid to the computational properties of the resulting TBox.

Groupers and Patterns. As mentioned earlier, a frame-based representation requires classes to be defined as subclasses of all superclasses involved. As DLs make inference possible on subsumption, a better way of modeling would be to define concepts based on their actual properties, without referring to the grouper concepts. For example, hepatitis would be defined as a disease located in the liver instead of as a “liver disease”, as the latter can be inferred from the definition of hepatitis.
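The idea of inferring grouper membership from properties can be sketched with a very restricted form of structural subsumption, in which every concept is a conjunction of existential restrictions. The role names, the filler taxonomy, and the definitions below are illustrative assumptions and not the DICE definitions; a DL classifier handles far more general cases.

```python
# Hypothetical filler taxonomy: child -> parent.
FILLER_PARENT = {"Liver": "Organ", "Lungs": "Organ"}

# Each concept is read as a disease with the listed (role, filler) pairs.
DEFINITIONS = {
    "LiverDisease": {("hasLocation", "Liver")},
    "Hepatitis": {("hasLocation", "Liver"),
                  ("hasAbnormality", "Inflammation")},
    "Pneumonia": {("hasLocation", "Lungs"),
                  ("hasAbnormality", "Inflammation")},
}

def filler_is_a(specific, general):
    """True if `specific` equals `general` or lies below it."""
    while specific is not None:
        if specific == general:
            return True
        specific = FILLER_PARENT.get(specific)
    return False

def subsumes(general, specific):
    """`general` subsumes `specific` if each of its restrictions is met by
    a restriction of `specific` on the same role with a compatible filler."""
    return all(
        any(role2 == role and filler_is_a(filler2, filler)
            for (role2, filler2) in DEFINITIONS[specific])
        for (role, filler) in DEFINITIONS[general]
    )

def inferred_groupers(concept):
    """All other defined concepts that subsume the given concept."""
    return sorted(g for g in DEFINITIONS
                  if g != concept and subsumes(g, concept))

print(inferred_groupers("Hepatitis"))  # ['LiverDisease']
print(inferred_groupers("Pneumonia"))  # []
```

Hepatitis is never declared a kind of LiverDisease here; the relation follows from its location alone.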

Other concepts were found that indicated inconsistent modeling rather than incorrect definition of concepts. For example, both a “part-of” relation and the concepts “body part” and “organ part” are present in the knowledge base. This makes it possible to define a concept by means of either “kind-of organ part” or “part-of organ”. Whereas these definitions are logically equivalent, preferably only one of them should be used throughout modeling a knowledge base. Guidelines or modeling patterns might need to be developed to stimulate standardized modeling.


TBox Properties. The language that was used for the DL-based representation extends ALC with number restrictions, allowing the constructors ⊓, ⊔, ¬, ∃, ∀, and ≥, ≤. As we have represented anatomy using SEP triplets, no role hierarchies or transitive roles were required, keeping the language relatively simple.

As the frame-based representation did not contain any axioms other than frame-definitions, and no cycles, the migration resulted in an unfoldable TBox. This means that all definitions are simple (defining only atomic concepts), unique (only one definition for each atomic concept exists), and acyclic (meaning the definition of a concept has no reference to the definiendum, either directly or indirectly). Reasoning on this type of TBox generally has a lower complexity than reasoning on arbitrary TBoxes with cycles and general concept inclusion axioms (Baader, Calvanese et al. 2003).
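The three conditions can be checked mechanically. The sketch below assumes a hypothetical input format in which each axiom is given as its left-hand side plus the set of concept names referenced on its right-hand side; it is not tied to any particular reasoner or file format.

```python
def is_unfoldable(axioms):
    """Check the simple, unique and acyclic conditions.

    axioms: list of (lhs, referenced) pairs, where lhs is the defined term
            and referenced is the set of concept names on the right-hand side
    """
    names = [lhs for lhs, _ in axioms]
    # Simple: only atomic concept names are defined.
    if not all(isinstance(name, str) for name in names):
        return False
    # Unique: at most one definition per atomic concept.
    if len(names) != len(set(names)):
        return False

    # Acyclic: no definition refers back to its definiendum.
    graph = {lhs: set(refs) for lhs, refs in axioms}
    state = {}  # name -> "open" (on the current path) or "done"

    def visit(name):
        state[name] = "open"
        for ref in graph.get(name, ()):
            if state.get(ref) == "open":
                return False  # back edge: a cycle
            if ref not in state and not visit(ref):
                return False
        state[name] = "done"
        return True

    return all(visit(name) for name in graph if name not in state)

ok = [("ViralMeningitis", {"Meningitis", "Virus"}),
      ("Meningitis", {"BrainDisease", "Meninges"})]
cyclic = [("A", {"B"}), ("B", {"A"})]
print(is_unfoldable(ok), is_unfoldable(cyclic))  # True False
```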


We have devised a method for the semi-automated migration from a frame-based representation to a DL-based representation and demonstrated how it helps to focus on weaknesses of a medical terminological knowledge base in Intensive Care. As this knowledge base is modeled in a way comparable to other medical knowledge bases (for example Clinical Terms Version 3 (Read, Sanderson et al. 1995)), it is expected that the methods described here will prove useful in general. There are however a number of remarks to be made.

It is important to realize that although these methods may support detection of incorrect definitions, it cannot be assumed that definitions in a satisfiable knowledge base are correct. For example, if viral meningitis were defined as hepatitis (instead of meningitis) caused by a virus, this could result in a satisfiable concept, although it is obviously incorrect.

As Description Logics enable automatic subsumption, it can be argued whether or not concepts should be modeled using grouper concepts such as ‘liver diseases’. This is in line with the discussions about compiled versus model-based knowledge. In a Frame-based representation, grouper concepts are necessary in order to assure that a disease is considered a liver disease. Using Description Logics, it seems appropriate to define a disease according to its actual properties (e.g. hasLocation liver) and infer the fact that such a disease is a liver disease. Likewise, a concept that appears in a Frame-based representation as “body part, organ or organ part” would preferably be defined in DL as a disjunction of the constituent concepts.

Application-specific slots or facets, of which the semantics are unclear or non-definitional, cannot be represented using Description Logic. This means that these elements (such as the facets to support post-coordination that allows for the creation of new concepts based on combining existing ones) are lost in the process of migration. Therefore, parts of the functionality provided in the original frame-based representation will have to be realized outside of the DL-based environment. Although this seems to be a drawback at first, it may well turn out to be advantageous as it leads to better understanding of the various aims for which knowledge modeling is being performed.


Admittedly, the DL-representation would include a large number of overly strict assumptions. These are mainly concerned with universal quantification and disjointness. However, the approach provides an automated reasoning tool to identify areas for focusing human attention. Still, a weakness of our approach is that there is no support for tracing or explaining DL-based unsatisfiability. As a consequence, pinpointing and resolving conflicts in definitions is a time-consuming task. Developing explanation facilities is important further work that we plan to address.

References

Baader, F., D. Calvanese, et al. (2003). The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press.

de Keizer, N. F., A. Abu-Hanna, et al. (1999). "Analysis and design of an ontology for intensive care diagnoses." Methods of Information in Medicine 38(2): 102-12.

Haarslev, V. and R. Möller (2000). High Performance Reasoning with Very Large Knowledge Bases. International Workshop in Description Logics 2000 (DL2000), Aachen, Germany.

Horrocks, I., U. Sattler, et al. (2000). "Practical reasoning for very expressive description logics." Logic Journal of the IGPL 8(3): 239-263.

Minsky, M. (1981). A framework for representing knowledge. Mind Design. J. Haugeland, The MIT Press.

Read, J. D., H. F. Sanderson, et al. (1995). "Terming, encoding, and grouping." Medinfo 8 Pt 1: 56-9.

Schlobach, S. and R. Cornet (2003). Non-Standard Reasoning Services for the Debugging of Description Logic Terminologies. To be published in: International Joint Conference on Artificial Intelligence, Acapulco, Mexico.

Schulz, S. and U. Hahn (2001). "Medical knowledge reengineering--converting major portions of the UMLS into a terminological knowledge base." Int J Med Inf 64(2-3): 207-21.

Schulz, S., M. Romacker, et al. (1998). "Part-whole reasoning in medical ontologies revisited--introducing SEP triplets into classification-based description logics." Proc AMIA Symp: 830-4.


Ontology for Task-Based Clinical Guidelines and the Theory of Granular Partitions

Anand Kumar1 and Barry Smith2

1 Laboratory of Medical Informatics, Department of Computer Science University of Pavia, Italy

2 Institute for Formal Ontology and Medical Information Science, University of Leipzig Germany and Department of Philosophy, University at Buffalo

Abstract. The theory of granular partitions (TGP) is a new approach to the understanding of ontologies and other classificatory systems. The paper explores the use of this new theory in the treatment of task-based clinical guidelines as a means for better understanding the relations between different clinical tasks, both within the framework of a single guideline and between related guidelines. We used as our starting point a DAML+OIL-based ontology for the WHO guideline for hypertension management, comparing this with related guidelines and attempting to show that TGP provides a flexible and highly expressive basis for the manipulation of ontologies of a sort which might be useful in providing more adequate Computer Interpretable Guideline Models (CIGMs) in the future.

1 Introduction

1.1 Clinical Practice Guidelines from an Ontological Point of View

Clinical Practice Guidelines (GLs) are ‘systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances.’ [1] Their use in clinical decision-making is intended to improve the outcomes of clinical care. Given that most GLs are free texts or simple flowcharts, there is a growing need to create Computer Interpretable Guideline Models (CIGMs) [2]. For this, however, we require standardized terminologies based on coherent ontologies of clinical activities [3]. The Unified Medical Language System (UMLS) of the National Library of Medicine integrates a number of standard medical terminologies into a single unified knowledge representation system [4,5].

While the UMLS provides its terms with associated Semantic Types, in order to use the latter in CIGMs one needs to incorporate them within some ontological framework. Among the emerging standards in this field, the DARPA Agent Markup Language and Ontology Interface Language (DAML+OIL) is a recent proposal for an ontology representation language suitable for such purposes [6].

1.2 Reference Ontologies and Applications Ontologies

DAML+OIL is an ontology language within the currently dominant paradigm, which views ontologies as applications capable of running in real time and exploiting the reasoning power of one or other variant of Description Logic. There is however a second paradigm in ontology – that of ‘Reference Ontology’ – whose proponents hold that the needs of terminology integration and standardization can be met only through the development of ontological theories marked by a high degree of descriptive adequacy.

One product of the Reference Ontology approach is the theory of granular partitions (TGP). This is designed to yield a framework within which both formal and informal representations of reality at different levels of granularity (for example molecule, cell, and whole-organism granularities) can be incorporated together [7, 8]. Our task here is to compare the results of adding to the pure DAML+OIL framework for analysis of guidelines the supplementary resources of TGP.

2 The UMLS Semantic Network, DAML+OIL and Guidelines

2.1 UMLS Semantic Types for Task-Based Guidelines

Most of the actions referred to in GLs can be mapped into that part of the UMLS terminology that is associated with the Semantic Types Laboratory Procedure, Diagnostic Procedure and Therapeutic or Preventive Procedure. All of these are subtypes of the Semantic Type Health Care Activity.

Other Semantic Types closely associated with Health Care Activity but used less frequently in GLs are: Educational Activity, Governmental or Regulatory Activity and Research Activity, all of which are subtypes of Occupational Activity. An instance of Research Activity, for example, is an instance of Health Care Activity marked in addition by the feature: strength of evidence.

2.2 The Case of Hypertension

The Guidelines for the Management of Hypertension prepared in 1999 by the WHO International Society of Hypertension were used as a basis for our DAML+OIL-based ontology for hypertension GLs [9,10]. The Semantic Types mentioned in the GL text were mapped to the three Semantic Types mentioned above, using operators such as ‘Determination of’ (abbreviated ‘DOF’) to signify the relationships among the Semantic Types in the UMLS Semantic Network. For example, the term ‘Proteinuria’ was assigned the following mapping:

Term – Proteinuria
Semantic Type – Laboratory or Test Result, Disease or Syndrome
Operator – DOF (Determination of)
Term – DOF Proteinuria
New Semantic Type – Laboratory Procedure

According to this analysis, Proteinuria is either a laboratory or test result or a disease or syndrome. Since our GL ontology is restricted to the Semantic Type Health Care Activity we need to find a roundabout way of incorporating Proteinuria and similar terms, and we do this precisely by means of constructions such as: determination of (the presence of) proteinuria.


3 The Theory of Granular Partitions (TGP)

3.1 Background Rationale

When human beings engage in listing, mapping or classifying activities – for example when they seek to classify the domain of clinical activities in terms of UMLS Semantic Types or in terms of Guidelines or CIGMs – then they partition reality into cells of various sorts.

Perhaps the most important feature of TGP is that it recognizes that different partitions may represent cuts through the same reality at different levels, and even cuts through reality which are skew to each other.

Each partition consists of cells and subcells, the latter being nested within the former. Partitions can be hierarchical: they then consist of many layers of cells and subcells (for example in the animal kingdom the layers of species, genus, family, order, phylum, kingdom and so forth). The lowest layer of subcells corresponds to the finest grain of objects recognized by the partition in question.

3.2 The Axioms of TGP

The axioms of TGP can be given in partial and simplified form as follows. (We ignore here those aspects of the theory dealing with mereological structure and with vagueness of projection; for details see [7], [8].) For orientation one can think of the relation between an object and a cell in which it is located as analogous to the relation between an element and its singleton in set theory. The subcell relation is then a restricted version of the set-theoretical subset relation, formulated in such a way that each partition is isomorphic to a tree in the graph-theoretical sense:

A1: Every partition has a unique maximal or root cell in which all other cells are included as subcells.
A2: The subcell relation is reflexive, antisymmetric, and transitive.
A3: Each cell in a partition is connected to the root via a finite chain of immediately succeeding cells.
A4: If two cells within a partition overlap, then one is a subcell of the other.

These axioms relate to a granular partition as a system of cells.
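For a finite set of cells, the axioms can be checked directly. The sketch below assumes that a partition is given as a set of cell names plus a set of immediate (subcell, supercell) links, and that two cells overlap when they share a subcell; both the encoding and the function name are illustrative assumptions. A3 is not tested separately, since in a finite partition every chain is finite and reachability of the root is already covered by A1.

```python
def check_partition(cells, immediate):
    """Check A1, A2 and A4 for a finite system of cells.

    cells:     set of cell names
    immediate: set of (subcell, supercell) pairs of immediate links
    """
    parents = {c: {sup for (sub, sup) in immediate if sub == c}
               for c in cells}

    def supercells(cell):
        """Reflexive-transitive closure of the links, going upward."""
        seen, stack = {cell}, [cell]
        while stack:
            for sup in parents[stack.pop()]:
                if sup not in seen:
                    seen.add(sup)
                    stack.append(sup)
        return seen

    up = {c: supercells(c) for c in cells}
    # A1: exactly one cell has every cell as a subcell.
    roots = {r for r in cells if all(r in up[c] for c in cells)}
    a1 = len(roots) == 1
    # A2: reflexive and transitive by construction; test antisymmetry.
    a2 = all(a == b or not (a in up[b] and b in up[a])
             for a in cells for b in cells)
    # A4: any two cells sharing a subcell c must be nested.
    a4 = all(a in up[b] or b in up[a]
             for c in cells for a in up[c] for b in up[c])
    return {"A1": a1, "A2": a2, "A4": a4}

CELLS = {
    "Health Care Activity", "Laboratory Procedure", "Diagnostic Procedure",
    "Therapeutic or Preventive Procedure",
    "DOF Factors Used for Risk Stratification", "DOF Women.Age >> 65",
}
LINKS = {
    ("Laboratory Procedure", "Health Care Activity"),
    ("Diagnostic Procedure", "Health Care Activity"),
    ("Therapeutic or Preventive Procedure", "Health Care Activity"),
    ("DOF Factors Used for Risk Stratification", "Diagnostic Procedure"),
    ("DOF Women.Age >> 65", "DOF Factors Used for Risk Stratification"),
}
print(check_partition(CELLS, LINKS))
```

Placing one cell under two unrelated supercells, for instance a hypothetical DOF Proteinuria under both Laboratory Procedure and Diagnostic Procedure, would violate A4, because the two supercells would then overlap without being nested.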

4 Use of TGP as a Supplement to DAML+OIL-Based Ontologies

4.1 The Hypothesis

After translating our GL texts annotated in terms of UMLS Semantic Types into DAML+OIL ontologies, the results still need to be supplemented by machinery of the sort provided by TGP if they are to allow the comparison and manipulation of distinct ontologies within a single framework. Our hypothesis is that we can achieve better (more natural, more scalable, and more expressively powerful) results if we supplement the DAML+OIL framework with the resources of TGP.

4.2 Implementation Based on TGP

By A1, a unique maximal cell contains as subcells all the cells present in the partition. In our present example, the domain of the partition is: the totality of activities in accordance with the given clinical guidelines. The maximal cell is then the UMLS Semantic Type Health Care Activity, which covers all the tasks specified in the guidelines. All other cells stand to this maximal cell in a subcell relation which satisfies A2. The immediate subcells of this maximal cell (as in the UMLS Semantic Network) are Laboratory Procedure, Diagnostic Procedure and Therapeutic or Preventive Procedure, which have further subcells depending on the GL text at issue.

By A2, A3 and A4, each distinct cell in a partition is connected to the root via a finite chain of immediately succeeding cells. This generates a nestedness of cells in the form of chains, terminating in the smallest cells, also called the leaves of the tree. For example, in Fig. 2, determination of smoking and determination of women’s age to be greater than 65 are leaves for the total partition.

Fig. 2. Task-based ontological representation of the 1999 WHO International Society of Hypertension Guidelines for the Management of Hypertension

Leading out from Diagnostic Procedure via subcell relations we have cells for: Determination of Forecast of Outcome, Determination of Cardiovascular Risk Factor, Determination of Factors Used in Risk Stratification and finally Determination of Hypertension Classification. This last is a leaf in the granular partition corresponding to the WHO GL for hypertension management.

By transitioning between taxonomical and partonomical partitions we can now represent the summation of subtasks at the same level in the granular hierarchy of a GL task-subtask structure, for example as follows: DOF Family History of Premature Cardiovascular Disease ∪ DOF Hypertension Classification ∪ DOF Men.Age >> 55 ∪ DOF Total Cholesterol >> 6.5 ∪ DOF Women.Age >> 65 together yield the task: DOF Factors Used for Risk Stratification on the next higher granularity level. When the latter is summed on this level with the cells Central Nervous System Examination, Abdominal Examination etc., then this yields the cell Diagnostic Procedure, which is present both in the UMLS Semantic Types and in the GL-specific ontology.

5 Conclusion

We have sketched how the theory of granular partitions can be used in the creation of ontologies for clinical practice guidelines by providing a framework within which we can transition between ontologies of different sorts and at different granular levels. We believe that this will produce a robust and flexible platform for the formulation of intuitive and easily extendible computer-interpretable guideline models.

Acknowledgement

Work on this paper was supported by the Wolfgang Paul Program of the Alexander von Humboldt Foundation.

References

1. Field M, Lohr KN. Attributes of good practice guidelines. In: Field M, Lohr KN, editors. Clinical practice guidelines: directions for a new program. Washington, DC: National Academy Press, 1990: 53-77.

2. Peleg M., Tu S, Bury J, Ciccarese P, Fox J, Greenes RA, Hall R., Johnson PD, Jones N., Kumar A., Miksch S., Quaglini S., Seyfang A., Shortliffe EH, and Stefanelli M. Comparing Computer-Interpretable Guideline Models: A Case-Study Approach. J Am Med Inform Assoc. 2003 Jan-Feb;10(1): 52-68.

3. Nigel S., Michel C, and Jean PB. Which coding system for therapeutic information in evidence-based medicine. Computer Methods and Programs in Biomedicine 2002; 68(1): 73-85.

4. Humphreys BL, Lindberg DA, Schoolman HM, Barnett. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc. 1998 Jan-Feb; 5(1): 1-11.

5. UMLS website http://www.nlm.nih.gov/research/umls/

6. Bechhofer S, Horrocks I, Goble C, Robert S. OilEd: a Reason-able Ontology Editor for the Semantic Web. Proceedings of KI2001, Joint German/Austrian conference on Artificial Intelligence, September 19-21, Vienna: Springer-Verlag LNAI Vol. 2174, 396-408. 2001.

7. Bittner, T. and Smith, B. (2003a). Granular Spatio-Temporal Ontologies, To appear in: Proceedings of the AAAI Spring Symposium on Foundations and Applications of Spatio-Temporal Reasoning (FASTR).

8. Bittner, T. and Smith, B. (2003). A Theory of Granular Partitions. In: Foundations of Geographic Information Science, M. Duckham, M. F. Goodchild and M. F. Worboys (eds.), London: Taylor & Francis, 117–151

9. Kumar A, Ciccarese P, Quaglini S, Stefanelli M, Caffi E, Boiocchi L. Relating UMLS semantic types and task-based ontology to computer-interpretable clinical practice guidelines. Proc MIE 2003.

10. WHO Hypertension Guideline. http://www.bpcr.com/diet/who/


Speech Interfaces for Point-of-Care Guideline Systems

Martin Beveridge, John Fox, and David Milward

Cancer Research UK, London WC2A 3PX * {mb,jf,dm}@acl.icnet.uk

Abstract. A major limiting factor in the acceptability of interactive guideline and decision support systems is the ease of use of the system in the clinic. A way to reduce demands upon users and increase flexibility of the interface is to use natural language dialogues and speech based interfaces. This paper describes a voice-based data capture and decision support system in which knowledge of underlying task structure (a medical guideline) and domain knowledge (disease ontologies and semantic dictionaries) are integrated with dialogue models based on conversational game theory resulting in a flexible and configurable interface.

Introduction

Natural language interfaces are likely to be important in future healthcare systems. However, their development is a greater challenge than applications that have been investigated to date (e.g. route planning, flight booking etc.) which generally require only simple information look-up. Clinical systems may include complex reasoning and workflow management, and dialogue may need to be closely coupled to the underlying clinical context. This paper uses discourse analysis as a basis for combining dialogue techniques with models of clinical tasks (e.g. data-capture and decision-making) and ontological knowledge (e.g. about diseases and symptoms). The approach has three benefits: 1) dialogue is tied to natural clinical tasks, providing guidance and constraints for understanding input and interpreting intentions; 2) dialogues can be generated automatically from domain knowledge, and 3) the dialogue generator can be reconfigured for other domains without reprogramming.

The Structure of Dialogues

The linguistic structure of a discourse obviously includes the sequence of utterances that comprise the discourse, but in addition these utterances are considered to naturally aggregate into discourse segments (analogous to constituents in sentence syntax) [4] and the discourse must be understood in terms of functional aspects of these segments: so-called intentional, informational and attentional aspects. Intentional structure deals with two kinds of relations between the intentions which underpin discourse segments [4], namely: satisfaction-precedence (SP) and dominance (DOM) relations. For example, the intention to obtain a clinical history “satisfaction-precedes” the intention to make a diagnosis, and is partly satisfied by or “dominates” the intention to find out a patient’s age. Discourse can also be described in terms of informational (semantic) relationships between discourse segments [6]. For example, in order to generate the sentence “John will be treated urgently because his condition is life threatening” it is necessary to know about the semantic relation of causality between the notion of “life threatening condition” and “being treated urgently” in order to generate the appropriate linking word “because” (rather than “unless”, “although” etc.). Lastly, discourse can also be described in terms of the way it unfolds over time. In [4] this is represented through the notion of a dynamic attentional state, which describes all the objects, properties and relations that are salient at a particular point in a discourse. The attentional state coordinates the linguistic structure and non-linguistic representations such as intentions and information relations. Recently there has been a growing consensus that all three of the structures described above are required to represent discourse.

* This work was carried out as part of the EU-funded HOMEY project (Home Monitoring through an Intelligent Dialogue System, IST-2001-32434). Many thanks to our partners for all their helpful comments and advice: Engineering (www.eng.it), Reitek (www.reitek.com), CBIM (www.cbim.it), ITC (www.itc.it), and L&C (www.landcglobal.com).

In attempting to describe dialogue (conversational discourse) one approach that has proven valuable is Conversational Game Theory [5]. This represents dialogue in terms of conversational games, a plan-based level associated with intentions, and a structural level consisting of sequences of conversational moves which specify the linguistic structure required to satisfy those intentions. Dialogues are thought of as a series of games each aiming to achieve some sub-goal of the dialogue. The present approach to specifying natural language dialogue builds on a combination of these ideas.

Intentional Structure

Intentions are implicitly captured by the structure of games. For example, a game that is initiated by a query-yn move reflects an underlying intention on the part of the “initiating conversational partner” (ICP) that the “other conversational partner” (OCP) should intend that the ICP know if some state of affairs holds. If the OCP is cooperative then they will adopt this intention and make an appropriate reply-yn move. Dominance and satisfaction-precedence relations can be treated as relations between games. The initiating and response moves currently implemented in our system are shown in the following table. Each participant may also respond with the initiating move of a new sub-game whose intention sub-serves that of the parent game.

Initiating moves:
  Explain: provide information not previously requested
  Instruct: provide instruction
  Query-yn: yes/no query for unknown information
  Query-w: complex (wh-)query for unknown information

Response moves:
  Acknowledge: acknowledge and signal continuation
  Reply-yn: yes/no reply
  Reply-w: reply supplying a value

Information Structure

One of the problems associated with applying information relations to dialogue is determining the appropriate units that such relations should apply to. In text generation, they are applied to successive utterances, but in dialogue they may span more than one utterance and speaker [7]. For example, here is a dialogue fragment from our system when advising on whether a patient should be referred to a specialist:

1. S: Is there any nipple discharge? [Query-yn]
2. U: Yes [Reply-yn]
3. S: Ok… [Acknowledge]
4. S: And is it bloodstained? [Query-yn]
5. U: No [Reply-yn]
6. S: Ok. [Acknowledge]

In this example, the second Query-yn game (4, 5 and 6) elaborates the information provided in the first Query-yn game (1, 2 and 3).
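The grouping of moves into games in this fragment can be sketched as follows. This is an illustrative simplification, not the system's implementation: the `Move` class and `split_into_games` function are hypothetical, and sub-games are not nested here; every initiating move simply opens a new game.

```python
from dataclasses import dataclass
from typing import List

# Move types as listed in the text.
INITIATING = {"Explain", "Instruct", "Query-yn", "Query-w"}
RESPONSE = {"Acknowledge", "Reply-yn", "Reply-w"}


@dataclass
class Move:
    speaker: str  # "S" for system, "U" for user
    kind: str     # one of the move types above
    text: str


def split_into_games(moves: List[Move]) -> List[List[Move]]:
    """Open a new game at each initiating move; attach responses to the open game."""
    games: List[List[Move]] = []
    for move in moves:
        if move.kind in INITIATING:
            games.append([move])
        elif move.kind in RESPONSE and games:
            games[-1].append(move)
        else:
            raise ValueError(f"unexpected move: {move.kind}")
    return games


dialogue = [
    Move("S", "Query-yn", "Is there any nipple discharge?"),
    Move("U", "Reply-yn", "Yes"),
    Move("S", "Acknowledge", "Ok"),
    Move("S", "Query-yn", "And is it bloodstained?"),
    Move("U", "Reply-yn", "No"),
    Move("S", "Acknowledge", "Ok."),
]
games = split_into_games(dialogue)
```

With the six moves above, `games` holds two Query-yn games of three moves each; an elaboration relation would then be recorded between the second game and the first.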

Attentional State

There are likely to be several playable moves at any point in a dialogue, so it is necessary to determine which is the preferred move. This must take into account constraints imposed by the intentional and information structures. For example, if the intentional structure specifies that I1 SP I2 then I2 should not be chosen until I1 is satisfied; if I1 DOM I2 and I1 DOM I3 then I2 and I3 satisfy a common goal so they can be presented in succession, or by aggregating them (e.g. “what are the patient’s age and sex?”). Information relations can also help to preserve dialogue coherence by ensuring that the next move made by the system is as semantically relevant as possible to previous moves. For instance, in the next example the user introduces the topic of nipple discharge and the system chooses its next move so as to continue that topic (rather than pursuing other intentions that it may have in its dialogue plan):

1. S: What is the patient’s age?
2. U: They’re thirty and they have severe nipple discharge
3. S: Ok…
4. S: And is the nipple discharge bloodstained?
5. U: No.
6. S: Ok…
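The selection of a preferred move can be sketched as a filter followed by a preference. The sketch below assumes that intentions are plain strings, that SP relations are given as (earlier, later) pairs, and that each intention may carry a topic label; the function names and the example plan are hypothetical and do not describe the system's actual algorithm.

```python
from typing import Dict, List, Optional, Set, Tuple


def playable(plan: List[str], satisfied: Set[str],
             sp: Set[Tuple[str, str]]) -> List[str]:
    """Unsatisfied intentions whose satisfaction-precedence predecessors are all satisfied."""
    return [
        i for i in plan
        if i not in satisfied
        and all(before in satisfied for before, after in sp if after == i)
    ]


def choose_next(plan: List[str], satisfied: Set[str], sp: Set[Tuple[str, str]],
                topic_of: Dict[str, str],
                current_topic: Optional[str]) -> Optional[str]:
    """Prefer a playable intention on the current topic; otherwise follow plan order."""
    candidates = playable(plan, satisfied, sp)
    if not candidates:
        return None
    on_topic = [
        i for i in candidates
        if current_topic is not None and topic_of.get(i) == current_topic
    ]
    return (on_topic or candidates)[0]


# Hypothetical plan mirroring the example dialogue.
plan = ["ask_age", "ask_sex", "ask_discharge_bloodstained", "decide_referral"]
sp = {
    ("ask_age", "decide_referral"),
    ("ask_sex", "decide_referral"),
    ("ask_discharge_bloodstained", "decide_referral"),
}
topic_of = {"ask_discharge_bloodstained": "nipple discharge"}

# The user has answered the age question and mentioned nipple discharge.
next_move = choose_next(plan, {"ask_age"}, sp, topic_of, "nipple discharge")
```

Here `next_move` is the bloodstained-discharge question rather than the question about sex that comes next in plan order, and the referral decision stays unplayable until all three questions are satisfied.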

Separating Domain Knowledge from Dialogue Knowledge

Among the aims of this work is the ability to exploit existing domain representation schemas in generating dialogue specifications automatically. The system includes domain knowledge of two types: task knowledge and ontological knowledge.

Task Knowledge

Since the domain plan determines the tasks to be carried out (e.g. first take a patient history then make the decision), it provides a basis for deriving the intentional structure of a dialogue about that domain. In fact, the plan imposes certain obligations on the dialogue system in order that the process can be completed successfully, and the dialogue system must interact with the user to meet those obligations. This approach is consistent with the suggestion that dialogue structure is largely determined by task structure [4].

Intentional relations can be derived from relations between tasks in the domain plan. For example, preconditions of tasks within the plan can be considered to give rise to satisfaction-precedence relations in the intentional structure. Hence if task T2 has preconditions such that it cannot be started until task T1 has completed (e.g. you must take patient details before making a referral decision) then the relation I1 SP I2 can be inferred between the associated intentions I1 and I2. Dominance relations can similarly be inferred from decomposition relations between tasks, e.g. if T1 is decomposed into T2 and T3 then the relations I1 DOM I2 and I1 DOM I3 can be inferred.
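The two inference rules above can be written down directly. The following sketch assumes a simplified plan in which each task lists the tasks it waits for and the tasks it decomposes into; the `Task` class is hypothetical and is not the PROforma task model.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Task:
    name: str
    preconditions: List[str] = field(default_factory=list)  # tasks that must complete first
    subtasks: List[str] = field(default_factory=list)        # tasks this one decomposes into


def derive_relations(tasks: List[Task]) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    """Return (SP, DOM) as sets of pairs between the intentions associated with tasks."""
    sp = {(pre, task.name) for task in tasks for pre in task.preconditions}
    dom = {(task.name, sub) for task in tasks for sub in task.subtasks}
    return sp, dom


# Hypothetical fragment of a referral plan.
tasks = [
    Task("take_patient_details", subtasks=["ask_age", "ask_sex"]),
    Task("ask_age"),
    Task("ask_sex"),
    Task("make_referral_decision", preconditions=["take_patient_details"]),
]
sp, dom = derive_relations(tasks)
```

For this plan, `sp` contains the single pair stating that taking patient details satisfaction-precedes the referral decision, and `dom` contains the two pairs stating that taking patient details dominates asking for age and sex.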

Ontological Knowledge

Task-specific knowledge must be augmented with a conceptual model that describes general domain knowledge e.g. ‘breast cancer is-a cancer’, ‘nipple discharge is-symptom-of breast cancer’ and so forth. In our work the domain ontology forms the basis for deriving information relations between dialogue games. A fragment of the ontology (taken from [2]) is shown in Figure 1.

[Figure 1: graph whose nodes are labelled skin change, distortion, skin distortion, skin, body part, pathological process, change process and material object, connected by IsA and Has-Systemic-Medium links.]

Fig. 1. A fragment of the domain ontology

Use of Information in a Cancer Guideline

In other work our group has developed a system for advising doctors on whether patients require urgent referral for suspected cancer [1]. The system is accessed through a standard web browser and generates web pages for collecting patient data and reporting on results (see www.infermed.com/era).

For the present project we wish to have a voice-based mode for entering data into this system. The task knowledge component of this voice-based system is currently implemented in the PROforma task representation language [1] using the Tallis toolset (www.openclinical.org/kpc). The domain ontology is implemented using an Ontology Browser developed by Language & Computing n.v. (L&C) [2]. A dialogue engine uses the task descriptions provided by these components to create a high-level dialogue specification (HLDS) that describes the games to be played to complete current tasks. The HLDS is in turn used to create a sequence of moves that can be made by either participant at the current point in the dialogue. The result of this process is encoded as a VoiceXML document, which is then interpreted by a voice browser which controls automatic speech recognition (ASR) and text-to-speech (TTS) components. The voice browser and ASR are provided by Istituto Trentino di Cultura (ITC) [3] and are integrated, along with the Actor multilingual TTS produced by Loquendo (www.loquendo.com), into an interactive voice response (IVR) platform provided by Reitek S.p.A. The IVR platform typically handles telephony control, audio recording etc., so that the dialogue system can be accessed over the phone.
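To give a feel for the last step of this pipeline, the sketch below renders a single Query-yn move as a minimal VoiceXML form. It is an assumption-laden illustration: the paper does not show its generated documents or state a VoiceXML version, so VoiceXML 2.0 with its built-in `boolean` field type is assumed here, and the function name and field name are hypothetical.

```python
import xml.etree.ElementTree as ET


def query_yn_to_vxml(field_name: str, prompt_text: str) -> str:
    """Render one yes/no query as a VoiceXML form with a single boolean field."""
    vxml = ET.Element(
        "vxml",
        {"version": "2.0", "xmlns": "http://www.w3.org/2001/vxml"},
    )
    form = ET.SubElement(vxml, "form")
    # type="boolean" asks the voice browser to use its built-in yes/no grammar.
    fld = ET.SubElement(form, "field", {"name": field_name, "type": "boolean"})
    ET.SubElement(fld, "prompt").text = prompt_text
    return ET.tostring(vxml, encoding="unicode")


document = query_yn_to_vxml("nipple_discharge", "Is there any nipple discharge?")
```

A real dialogue engine would also need to emit handlers for no-input and no-match events and to return the recognized value to the task layer; those parts are omitted here.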

Conclusions

An approach to building spoken dialogue systems that treats the dialogue model as having distinct high-level and low-level representations has been described. This uses current voice-based standards which are widely employed in commercial systems for the low-level elements (e.g. VoiceXML) whilst also expressing high-level notions of intention, information and attention which are required for flexible “conversational” dialogue. The high-level dialogue representation can be automatically derived from the domain knowledge (task and ontological knowledge), reducing the need to author dialogues by hand, and providing reconfigurability. A complete demonstrator has been implemented for the domain of breast cancer and is currently being evaluated.

References

1. Bury, J., Humber, M., and Fox, J. (2001). Integrating Decision Support with Electronic Referrals. In: R. Rogers, R. Haux and V. Patel (Eds.), Medinfo 2001. IOS Press, Amsterdam.

2. Ceusters, W., Beveridge, M. A., Milward, D., and Falavigna, D. (2002). Specification for Semantic Dictionary Integration, Deliverable D9, EU HOMEY Project, IST-2001-32434.

3. Falavigna, D. and Gretter, R. (1999). Flexible Mixed Initiative Dialogue over the Telephone Network. Proc. of ASRU’00, 12th – 15th December, Colorado.

4. Grosz, B., and Sidner, C. (1986). Attention, Intention and the Structure of Discourse. Computational Linguistics 12(3):175-204.

5. Kowtko, J. C. and Isard, S. D. (1993). Conversational Games Within Dialogue, Research Paper 31, Human Communication Research Centre, Edinburgh.

6. Mann, W. D., and Thompson S. A. (1988). Rhetorical Structure Theory: Towards a functional theory of text organization. Text, 8(3):243-281.

7. Stent A. (2000). Rhetorical Structure in Dialog. Proc. 2nd International Natural Language Generation Conference (INLG'2000).


Text Categorization prior to Indexing for the CISMEF Health Catalogue

Alexandrina Rogozan1, Aurélie Néveol1,2, and Stefan J. Darmoni1,2

1 PSI Laboratory - FRE 2645 CNRS - INSA de Rouen, BP8 avenue de l'Université 76801 Saint-Etienne-du-Rouvray Cedex, France {alexandrina.rogozan,aurelie.neveol}@insa-rouen.fr

2 CISMeF and L@stics - Rouen University Hospital and Rouen Medical School, 1 rue de Germont, 76031 Rouen, France {stefan.darmoni}@chu-rouen.fr http://www.chu-rouen.fr/cismef

Abstract. This paper is positioned within the development of an automated indexing system for the CISMeF quality controlled health gateway. For disambiguation purposes, we wish to perform text categorization prior to indexing. Hence, a global approach contrasting with the classical analytical methods based on the analysis of keyword counts extracted from the text is necessary. The use of statistical compression models enables us to proceed without keyword extraction at this stage. Preliminary results show that although this method is not as precise as others in terms of resource categorization, it can significantly benefit indexing.

1 Introduction

The Internet has become a very rich source of information in numerous fields, including health. The CISMeF project (French acronym of Catalogue and Index of Medical On-Line Resources) was initiated in 1995 in order to meet the users’ need to find precisely what they are looking for among the numerous health documents available online. As a Quality Controlled Health Gateway [1], CISMeF describes and indexes the most important resources of institutional health information in French. It currently contains more than 12,000 resources, and it is updated manually with 50 new resources each week. Indexing is a decisive step for the efficiency of information retrieval within the CISMeF catalogue, and it is also one of the most time consuming tasks for the librarians, demanding high-level documentary skills.

Our research work aims to develop an automatic indexing system that would broaden the CISMeF catalogue coverage while ensuring good indexing quality and achieving high precision and recall rates for information retrieval within CISMeF. For a better approach of automatic indexing, we wish to perform text categorization as a preliminary task.

In fact, knowledge of the medical specialty of a resource, which we also call its context or category, will play a doubly important role in the indexing phase: 1. It will help lexical disambiguation. Pouliquen [2] explains how a lack of such disambiguation leads to systematic indexing errors. For example, several occurrences of the term lutte in a resource could be related to either of the MeSH terms Wrestling or Prevention & Control; if the context is Sports Medicine, it is highly likely that the appropriate MeSH term is Wrestling. 2. It will give more weight to the context-related keywords, therefore bringing out the gist of the resource content.

After reviewing the existing methods of text categorization in section 2, a set of medical contexts based on the CISMeF terminology is defined in section 3. Then, a text categorization methodology based on compression models is presented, ongoing experiments are detailed, and their contribution to text categorization is discussed in section 4.

2 Global vs Analytical Methods for Text Categorization

Early work of Wiener et al. [3] shows that neural networks and logistic regression are appropriate approaches for topic spotting in documents. Among recent statistical approaches for text categorization, the Support Vector Machines (SVM) are emerging as they provide higher precision than four other learning algorithms, including naïve bayes, bayes nets and decision trees in an experiment conducted by Dumais et al. [4]. However, SVM performances in multi-class problems are limited in terms of speed and algorithm complexity.

Other strategies combine statistical and linguistic approaches. For instance, Wilcox et al. [5] use data mining and natural language processing tools to extract a pertinent representation of documents, and statistical methods, viz. rule generation, Bayesian classifiers, and information retrieval, for their categorization. Wilcox’s results confirm that using explicit domain knowledge, when available, is the best methodology, as it achieves the best results. Indeed, in recent work we implemented a rule-based algorithm using the semantic properties of the CISMeF terminology for categorization purposes, and obtained 80% precision and 93% recall [6].

However, these categorization techniques, as well as other analytical techniques reviewed by Kosala [7], involve a preliminary representation of documents (e.g. a bag of words). Extracting the significant words is clearly redundant with the indexing process, and our goal is to identify the context prior to keyword extraction and indexing. This constraint leads us to choose a global approach. Teahan and Harper [8] show that statistical compression models, and in particular PPM (Prediction by Partial Match) models, have performances comparable to those of SVM for text categorization, while using a global approach. Therefore we have decided to adapt them to health resource categorization within CISMeF, after defining a set of medical contexts based on the CISMeF terminology.

3 Medical Context Set Based on CISMEF Terminology

In order to identify to which context(s) a given resource belongs, i.e. which medical specialty(ies) it deals with, we need to define a set of medical specialties that would be both complete and relevant for indexing purposes.

The CISMeF team indexes health resources using a French version¹ of the MeSH (Medical Subject Headings), which is the National Library of Medicine’s thesaurus. The MeSH 2003 contains approximately 22,000 hierarchically arranged keywords and 84 qualifiers that can be coordinated to the keywords, in order to refer to particular aspects of a subject.

The CISMeF terminology (described by Soualmia et al. [9]) encapsulates the MeSH. A list of synonyms, a hierarchy of resource types and a set of 85 metaterms representing medical specialities were introduced in the terminology in order to enhance information retrieval within the catalogue, and to create an overall vision of the terms related to each speciality [10]. In fact, metaterms materialise links that exist between keywords even though these links do not appear in the MeSH hierarchy. Moreover, the CISMeF terminology creates semantic links between each metaterm and the related keywords, qualifiers, and resource types. Metaterms have a coverage of 73% (as of March 2003) on MeSH keywords used in CISMeF. Therefore, it is quite relevant to use the set of medical contexts defined by the metaterms.

We now have to build appropriate compression models for health resource categorization within the CISMeF catalogue.

4 Compression Models for Text Categorization

4.1 General Principle

The key idea behind using the PPM compression scheme (see [8] and [11] for more details) is to model the probability distribution of symbols within the context provided by all previous symbols in a specific type of text, viz. texts that deal with the medical specialty at hand. The PPM algorithm uses a Markov chain approximation and assumes a fixed order of context. Each model is thus able to predict the next symbol in a text that conforms to the model with a higher probability than in any other type of text. In terms of compression, this means that once a compression model is trained on texts dealing with a given specialty, it will be able to compress similar texts better than texts with another probability distribution, i.e. dealing with a different subject.
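The principle can be illustrated with a much simpler stand-in for PPM. The sketch below is not PPM: it is a fixed-order character Markov model with add-one smoothing, with no escape mechanism and no blending of orders, and its class name, the toy corpora and the alphabet size of 256 are assumptions made for illustration. It estimates how many bits per character a model trained on one specialty would need to encode a new text, and assigns the text to the specialty whose model needs the fewest.

```python
import math
from collections import Counter, defaultdict
from typing import Dict


class OrderKModel:
    """Fixed-order character model with add-one smoothing (a stand-in, not PPM)."""

    def __init__(self, order: int, alphabet_size: int = 256) -> None:
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts: Dict[str, Counter] = defaultdict(Counter)

    def train(self, text: str) -> None:
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            self.counts[padded[i - self.order:i]][padded[i]] += 1

    def bits_per_char(self, text: str) -> float:
        """Average code length the model would assign to the text (lower is better)."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            seen = self.counts.get(padded[i - self.order:i], Counter())
            p = (seen[padded[i]] + 1) / (sum(seen.values()) + self.alphabet_size)
            bits -= math.log2(p)
        return bits / max(len(text), 1)


def categorize(text: str, models: Dict[str, OrderKModel]) -> str:
    """Pick the specialty whose model encodes the text most compactly."""
    return min(models, key=lambda name: models[name].bits_per_char(text))


# Toy training corpora, invented for this example.
corpora = {
    "Cardiology": "the heart pumps blood through arteries and veins. " * 20,
    "Sports Medicine": "the wrestler trains hard for the wrestling match. " * 20,
}
models: Dict[str, OrderKModel] = {}
for name, corpus in corpora.items():
    model = OrderKModel(order=2)
    model.train(corpus)
    models[name] = model

label = categorize("the wrestling match", models)
```

Because the whole text is scored character by character, no keywords need to be extracted beforehand, which is the property the global approach relies on.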

4.2 Learning Compression Models

Therefore, for each specialty, models of different orders are built on a training set, which is a representative sample of resources from the CISMeF catalogue and of terms (keywords, synonyms and qualifiers) from the CISMeF terminology. The validation set is used as a positive corpus for the model it belongs to, and also as part of a negative corpus for all the other models. Parameter optimization, and in particular the choice of the optimal order to be used by each compression model, is carried out on the validation data, so as to maximize the difference in compression ratios between positive and negative corpora. The model thus selected can be evaluated on the test set, whose resources have been tagged with a categorization algorithm based on CISMeF manual indexing [6].

¹ Translation provided by Institut National de la Santé Et de la Recherche Médicale at http://dicdoc.kb.inserm.fr:2010/basismesh/mesh.html

4.3 Experimental Results

Preliminary experiments were conducted on health resources that cover the four contexts most represented in the CISMeF catalogue. The maximum compression ratio difference was achieved with order 4 models. The result on the test corpus was 55% precision with small training and validation corpora of ten documents for each specialty.

Further experiments will concern a finer analysis of the compression ratios obtained with different models on the test corpus, which will allow the medical categories to be ranked by relevance for a resource. For each model, if the resource compression ratio exceeds a decision threshold, we will assume that the resource deals with the respective category. The decision thresholds are to be established on the validation corpus. A test resource can thus be in zero, one or more than one category. From these values, we will evaluate performance with the precision and recall measures, but also, for better evaluation and comparison purposes, with the F-measure and possibly the precision/recall break-even point [4, 8].
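The planned decision rule and evaluation can be sketched as follows. The sketch assumes that the compression ratio is defined so that a larger value means better compression, that thresholds are given per category, and that precision and recall are micro-averaged over all (resource, category) assignments; the function names and the example values are hypothetical, and the paper does not specify the averaging scheme.

```python
from typing import Dict, List, Set, Tuple


def assign_categories(ratios: Dict[str, float],
                      thresholds: Dict[str, float]) -> Set[str]:
    """Every category whose compression ratio exceeds its decision threshold."""
    return {cat for cat, ratio in ratios.items() if ratio > thresholds[cat]}


def precision_recall_f(predicted: List[Set[str]],
                       gold: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over all resources."""
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented values: one resource scored against three category models.
thresholds = {"Cardiology": 3.0, "Oncology": 3.0, "Neurology": 3.0}
ratios = {"Cardiology": 3.4, "Oncology": 3.1, "Neurology": 2.2}
assigned = assign_categories(ratios, thresholds)

# Invented predictions and reference tags for three resources.
predicted = [{"Cardiology"}, {"Cardiology", "Oncology"}, set()]
gold = [{"Cardiology"}, {"Oncology"}, {"Neurology"}]
precision, recall, f_measure = precision_recall_f(predicted, gold)
```

The third resource in the example receives no category at all, which the threshold rule allows; it lowers recall without affecting precision.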

4.4 Discussion

The performance of the context identification method we propose depends on how relevant the compression models are, and therefore on the quality of the training corpora. Hence, the training corpora should be non-overlapping for different models, but they should also contain discriminative resources, so as to maximize the distance, measured by compression ratio difference, between contexts.

Comparison of final results (experiments conducted on all specialties) with Teahan et al.’s results [8] will reveal whether compression models can deal with such fine granularity in topics, as we are aiming at about a hundred categories within the medical domain whereas [8] tested texts belonging to 10 general subject categories.

5 Conclusion and Perspective

Research for an automated resource indexing procedure in the CISMeF catalogue has led us to tackle health resource categorization as a preliminary task to indexing. The compression method we described corresponds to a global approach that enables us to perform text categorization prior to indexing, contrary to the usual categorization techniques.

The preliminary results that we have obtained from experimentation with the methods presented in this paper are quite promising, and encourage us to consider further experimentation. Future testing will be performed on the complete set of specialties (metaterms), with larger training and validation corpora.

An automatic indexing procedure will be set up after these experiments have been carried out, and the ranking of medical contexts obtained from the classification will be used to weight semantically linked keywords.

Acknowledgments

We would like to thank the librarians of the CISMeF team at Rouen University Hospital (Magaly Douyère, Saida Ouazir, Josette Piot and Benoît Thirion), who developed the CISMeF terminology, and kindly put it at our disposal for research purposes.

References

1. Koch, T.: Quality-controlled subject gateways: definitions, typologies, empirical overview. In: Subject gateways, Special issue of "Online Information Review", Vol. 24:1, (2000), 24-34

2. Pouliquen B.: Indexation de document médicaux par extraction de concepts, et ses utilisation, PhD thesis (2002)

3. Wiener, W., Pedersen J., Weigend A.: A neural network approach to topic spotting, in Proc. of the Symposium on Document Analysis and Information Retrieval, (1995) 317-332

4. Dumais S., Osuna, E., Platt, J., Schölkopf, B.: Using SVMs for text categorization, in IEEE Intelligent Systems Magazine, Trends and Controversies, Marti Hearst, ed., 13(4) (1998) 18-28

5. Wilcox, A., Hripcsak G.: Classification Algorithms Applied to Narrative Reports, Proc of Symp. in AMIA (1999)

6. Néveol, A., Soualmia, L.S., Rogozan, A., Douyère, M., Darmoni, S.J.: Utilisation des propriétés sémantiques de la terminologie CISMeF pour la catégorisation de ressources de santé, à paraître dans Actes des Journées Francophones d'Informatique Médicale (2003)

7. Kosala, R., Blockeel, H.: Web Mining Research: A Survey, in ACM SIGKDD, Vol. 2, Issue 1, (2000) 1-15

8. Teahan W., Harper D.: Using compression based language models for text categorization, in J. Callan, B. Croft and J. Lafferty, eds., Workshop on Language Modelling and Information Retrieval, (2001) 83-88

9. Soualmia, L.F., Thirion B., Leroy J.P., Douyère M., Darmoni S.J.: Modélisation et représentation des connaissances dans un catalogue de santé, dans les Actes des Journées Francophones d'Ingénierie des Connaissances 2002, (2002) 139-149

10. Darmoni, S.J., Leroy, J.P., Baudic, F., Douyère M., Piot, J., Thirion, B.: CISMeF: a structured health resource guide, in Methods of Information in Medicine, (2000) 39(1):30-35

11. Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching, in IEEE Transactions on Communications, (1984) 32(4):396-402


Bodily Systems and the Modular Structure of the Human Body

Barry Smith1,2, Igor Papakin1, and Katherine Munn1

1 Institute for Formal Ontology and Medical Information Science Faculty of Medicine, University of Leipzig, Leipzig, Germany

2 Department of Philosophy, University at Buffalo, Buffalo, NY

Abstract. Medical science conceives the human body as a system comprised of many subsystems at a variety of levels. At the highest level are bodily systems proper, such as the endocrine system, which are central to our understanding of human anatomy, and play a key role in diagnosis and in dynamic modeling as well as in medical pedagogy and computer visualization. But there is no explicit definition of what a bodily system is; such informality is acceptable in documentation created for human beings, but falls short of what is needed for computer representations. Our analysis is intended as a first step towards filling this gap.

1 Bodily Systems and Medical Ontology

The Institute for Formal Ontology and Medical Information Science in Leipzig is constructing a reference ontology for the medical domain. It is designed not as a computer application in its own right but as a framework of axioms and definitions relating to such general concepts as: organism, tissue, disease, therapy. Here we focus on the concept bodily system, which we believe will serve as a central factor in a robust ontology of the human organism.

The division of the body into its major bodily systems is by no means unproblematic. The National Library of Medicine¹ lists eight body systems, including the urogenital system; Wolf-Heidegger’s Atlas of Human Anatomy² lists only seven systems, etc. Standard sources often divide systems into three groups: supportive systems (the integumentary and musculoskeletal systems), exchange systems (the digestive, respiratory, circulatory, and urogenital systems), and regulatory systems (the nervous, endocrine, and immune systems). However, there are elements of exchange systems (for example parts of the liver and pancreas) which also play a role in regulatory systems, and the three regulatory systems themselves effect their functions of regulation via a certain sort of substance-exchange.

Medical textbooks rest on informal explications of the general concepts of ‘system’ and ‘function’ which concern us here. While such informality is acceptable in documentation created for human beings, who can draw on their tacit knowledge of the entities involved, medical information systems require precise and explicit definitions of the relevant terms. The analysis presented here is intended as a first step towards providing a framework for such definitions.

Bodily Systems and the Modular Structure of the Human Body 87

2 Body Systems, Elements, and Their Functions

The first step in making sense of standard rosters of bodily systems is to recognize that we can divide the living human body (referred to in what follows as the ‘body’) into components of specific kinds, which we shall call elements. ‘Element’ can be understood as a generalization of concepts such as organ and cell.

Elements are distinguished from other parts of the body by their causal relative isolation from their surroundings and by the functional role they have in the workings of the body as a whole. Elements are elements only for larger systems to which they belong. Some elements of the digestive system are the stomach, the serous membrane, the layers of smooth muscular tissue, and so forth. Corresponding functions include: providing the stage for the digesting process by mixing the bolus with hydrochloric acid and pepsin, external coverage of the stomach and its constriction.

Elements can be distinguished on a number of different levels of granularity [3]. Granularity is a means of representing the hierarchy among elements of bodily systems; such a hierarchy is arranged according to the functions of the respective elements. Functions located at lower levels interact in complex ways to enable functions at higher levels. At all but the highest level, each function of a system is a sub-function for an umbrella system at a higher level. The kidney’s function is to excrete urine, which it does via a composite process consisting of smaller interrelated processes that occur on lower levels of granularity: the excretion of urea and creatinine, absorption of necessary ions, and excretion of redundant ions and water.

The dividing of functions into sub-functions mirrors the dividing of systems into elements: what is a system at a lower level may be an element at a higher one. The heart itself is a system consisting of the myocardium, endocardium, pericardium, etc. The latter have their own specific functions and comprise their own elements (e.g., different types of cells). Each cell is a system that consists of elements such as nucleus, mitochondria, and endoplasmic reticulum, which are in turn systems in their own right with their own specific functions.

Element and function have a parallel relationship in that an element’s place in the body’s granular hierarchy is determined by its function. However, every organ in the body performs a multiplicity of functions – which is why it is crucial to distinguish between a body part and a system element. A body part belongs to the body by virtue of its physical attachment; an element belongs to a system by virtue of the function it performs for the system. The kidney is a part of the body whether the body is dead or alive, functioning or not, but it will only be an element of the urinary system while the system is capable of functioning, i.e., as long as the body is alive.

It is the complex organization of the body’s granular hierarchy according to the functioning of systems and subsystems that enables the body to regulate its own state and structure. Its ability to perform this regulation depends on the very specific type of organization which is present already in single-celled organisms, where we find processes of metabolism, waste excretion, DNA replication, and structural support performed by corresponding elements arranged in modular fashion into systems. Without the performance of such functions the body will die. It is this fact which yields the principle for the division of the body into its major systems.

The body’s systems and elements were developed by evolutionary processes. Our bodily systems exist as they do because the bodies of our immediate predecessors had similar systems, which performed functions that proved conducive to their survival.

88 Barry Smith, Igor Papakin, and Katherine Munn

Those functions of a given element that enable the whole organism to survive, and thus reproduce successive generations of the same type of element, are called the proper functions of this element [4].

We can now state our definition of ‘element’: X is an element of Y if and only if: (i) X is a proper part of Y and Y exists on a higher level of granularity than X; (ii) X is causally relatively isolated from the surrounding parts of Y; (iii) X has a proper function which contributes to the proper function of Y; (iv) X is maximal, in the sense that X is not a proper part of any item on the same level of granularity satisfying conditions (i) to (iii).

Only those entities that have proper functions inside your body are elements of your body according to this definition. Thus a virus may take on a functional role in your body, directing the cell to construct certain proteins that the virus needs for reproduction, but it is neither an element nor a part of your body, because the directions given by the virus interfere with your body’s proper functioning [5].

The proper functioning of your parents’ hearts ensured that their circulatory systems worked; combined with the proper functioning of their other bodily systems, this enabled them to produce you and your heart. Thus the proper functions on each granular level need to be lined up in a complex branching structure so that they can be executed to support the survival of the whole organism. To understand the nature of this ordering of bodily functions is to understand the role of the body’s major systems in the organization of the body as a whole.

3 Criticality and Bodily Systems

The body is full of redundancy, so any of its elements can cease to function for certain periods, yet the body will still survive as a living organism. But some functions are critical: if they are not executed, the body dies. We can define critical function as follows: F is a critical function for a system Y if and only if: (i) An element of Y has F as its proper function; (ii) F is performed by element X of Y and no other element of Y performs F; (iii) the continuing to function of system Y is causally dependent on the continued performing of F by X.

An element’s function may become critical in special circumstances, including cases of disease. Each kidney has a non-critical function in the body’s normal state, but it becomes critical if the contralateral kidney is damaged or removed. But your kidneys taken together do perform a critical function by the terms of our definition.

Understanding criticality will help us understand how elements relate to the whole systems of which they form a part. All critical functions performed by elements of the body’s hierarchical organization at lower levels are contributions to the performance of critical functions by larger systems on a higher level of granularity. Eventually we reach some maximal level where we deal with critical functions that belong to elements of the body that contribute to the functioning of no larger body part except the body as a whole. The elements on this maximal level are precisely the body’s major systems. We can then define: X is a bodily system for organism Y if and only if: (i) X has a critical function for Y; (ii) X is not a proper part of any other system that has a critical function for Y.
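The definition of a bodily system can be read as a search for the maximal parts of the organism that have a critical function. The following is a minimal sketch of that reading, assuming a toy part hierarchy in which criticality is given as a flag; the class `Element`, the function `bodily_systems` and the example anatomy are hypothetical and are not part of the ontology described here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Element:
    """A node in a toy part hierarchy (names and flags are illustrative)."""
    name: str
    has_critical_function: bool = False
    parts: list[Element] = field(default_factory=list)


def bodily_systems(organism: Element) -> list[str]:
    """Return the maximal proper parts of `organism` with a critical function.

    Condition (i): the part has a critical function for the organism.
    Condition (ii): it is not a proper part of another such part, so the
    search stops descending as soon as a critical part is found.
    """
    found: list[str] = []

    def visit(node: Element) -> None:
        for part in node.parts:
            if part.has_critical_function:
                found.append(part.name)  # maximal: do not look inside it
            else:
                visit(part)  # keep searching below non-critical parts

    visit(organism)
    return found


# Hypothetical example: the heart is critical but lies inside a critical system.
heart = Element("heart", has_critical_function=True)
circulatory = Element("circulatory system", True, [heart])
hand = Element("hand", False, [Element("finger")])
body = Element("body", False, [circulatory, hand])
systems = bodily_systems(body)
print(systems)
```

In this sketch the heart is not reported, because it is a proper part of a larger part that already has a critical function, which is how condition (ii) is meant to exclude it.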

Bodily systems are in other words the largest elements of the human body that have critical functions. Just as some functions belong to a level of granularity that is immediately below that of the bodily system to which they contribute, so bodily systems belong to the level of granularity immediately below that of the whole organism. The body’s systems, in spite of their relative causal isolation, are still massively causally interconnected. If one system ceases to function then so, by virtue of the ensuing death of the whole organism, do the others. But this interdependence is sequential in the sense that a pathologist can almost always establish which system was responsible for causing the organism’s life processes to cease. In order to understand this sequentiality we need to keep careful track of the levels of granularity of elements (systems and subsystems) with which we deal. There then emerges a proportional relationship between granularity and criticality: the fewer umbrella systems you have to count upward from a function before you reach the critical function of a whole bodily system, the more critical the function is – and the higher its granular level. Breaking a finger will not kill you, but losing your liver or heart will.

Evolutionary development has tended to follow a principle of economy (or co-option): given elements can evolve to have functions for two or more systems at different times, and are thus sites for the systems’ functional and structural overlapping. The male urethra provides a pathway for both urine and sperm. It thus has the function of allowing urine to exit the bladder and sperm to be ejaculated; the former is critical for the urinary system, the latter for the reproductive system.

The body is full of overlap: every element has functions for multiple systems. The functions of system elements can be classified on a scale of degree of criticality. A function has a low degree of criticality if the system still achieves its function when the element that performs it is disabled. The circulatory system still functions if some arterial branch is occluded by a thrombus and no longer supplies certain regions with blood, for collateral arteries will provide the needed blood flow. So this particular arterial vessel has a low degree of criticality to the circulatory system.

This kind of redundancy contributes to the body’s modular structure. The lower the granularity, the fewer examples we find of criticality, and the greater the redundancy. Thus where the immune system is executing its proper function, the mutation of one single cell does not cause cancer. For this we need the presence of the same mutant gene in a multiplicity of cells within a single tissue.

4 Conclusion: Demarcating the Body into Bodily Systems

Our approach suggests how to explain why the standard rosters of bodily systems, while they contain many commonalities, still differ among themselves. Some textbooks of anatomy include both bones and joints in the skeletal system, whereas the Nomina [6] and the Terminologia Anatomica [7] represent bones and muscles as two separate systems [8]. As we saw, there is a sequentiality to the interdependence of bodily systems: if one system ceases to function, the others will follow, in a certain order. If two putatively distinct systems always cease to function simultaneously – such as the pulmonary and systemic components of the circulatory system – then they are not systems in their own right, but elements of the same system.

Note that the demarcation lines among bodily systems are to a degree a matter of fiat; they are boundaries inserted by human beings for the purposes of constructing predictively powerful causal theories. In this respect bodily systems are comparable to the body’s extremities. We say that the human body has arms and legs as parts. But there is no bona fide boundary (no physical discontinuity) constituting the border between your arm and the rest of your body. Here, too, there are fiat boundaries by which we (cognitively) divide the body into parts [9].

Our analysis comes close to yielding the core roster of bodily systems that standard medical textbooks share. It helps us understand why there is no standard opinion on how to classify the reproductive system within such rosters. Some accounts tack it onto the urinary system and refer to one composite ‘urogenital system’; some refer to a ‘genital system’. We see this as evidence that our analysis can shed light not only on what is broadly shared by standard rosters of the body’s systems but also on how these rosters differ.

Acknowledgements

This work was supported by the Alexander von Humboldt Foundation under the auspices of its Wolfgang Paul Program. Our thanks go also to Anand Kumar for helpful comments.

References

1. World Health Organization training course on National Library of Medicine classification, http://www.emro.who.int/HIS/VHSL/Doc/NLM.pdf
2. Köpf-Maier P, ed., Wolf-Heidegger’s Atlas of human anatomy, Vol. 1, 5th Edition, Berlin, 2000.
3. Bittner T, Smith B, A theory of granular partitions; Foundations of geographic information science, Duckham M, Goodchild MF, Worboys MF, eds., London: Taylor & Francis, 2003, 117–151.
4. Millikan RG. Language, thought, and other biological categories. Cambridge, MA: MIT Press, 1984.
5. Donnelly M. On holes and parts: The spatial structures of the human body. IFOMIS Reports 03/03, Leipzig, Germany, 2003.
6. Nomina anatomica, 4th ed., Amsterdam: Excerpta Medica, 1977.
7. Terminologia anatomica: International Anatomical Terminology, Federative Committee on Anatomical Terminology (FCAT), Stuttgart: Thieme, 1998.
8. Rosse C. Terminologia anatomica; considered from the perspective of next-generation knowledge sources, Structural Informatics Group, http://sig.biostr.washington.edu/publications/online/CRTAnat.pdf
9. Smith B. Fiat objects, Topoi, 20: 2, September 2001, 131-148.


Multi-agent Approach for Image Processing: A Case Study for MRI Human Brain Scans Interpretation

Nathalie Richard1,2, Michel Dojat1,2, and Catherine Garbay2

1 Institut National de la Santé et de la Recherche Médicale, U594 – Neuroimagerie Fonctionnelle et Métabolique, CHU - Pavillon B, BP 217, 38043 Grenoble Cedex 9, France
{nrichard,mdojat}@ujf-grenoble.fr

2 Laboratoire TIMC-IMAG, Institut Bonniot, Faculté de Médecine, Domaine de la Merci, 38706 La Tronche Cedex, France
[email protected]

Abstract. Image interpretation consists in finding a correspondence between radiometric information and symbolic labelling with respect to specific spatial constraints. To cope with the difficulty of image interpretation, several information processing steps are required to gradually extract information from the image grey levels and to introduce symbolic information. In this paper, we evaluate the use of situated cooperative agents as a framework for managing such steps. Dedicated agent behaviours are dynamically adapted as a function of their position in the image, topographic relationships and the radiometric information available. Acquired knowledge is diffused to acquaintances, and incremental refinement of the interpretation is obtained through focalisation and coordination of the agents’ tasks. Based on several experiments on real images, we demonstrate the potential interest of multi-agent systems for the interpretation of MRI brain scans.

1 Modelling and Interpretation Processes

Automatic interpretation of Magnetic Resonance Imaging (MRI) brain scans could greatly help clinicians and neuroscientists in decision making. Due to various image artefacts and in spite of several research efforts, this presently remains a challenging application. Based on several experiments, we demonstrate in this paper the potential interest of situated cooperative agents as a framework to manage the information processing steps, essentially modelling and interpretation via fusion mechanisms, required in this context.

1.1 Context

Three tissue classes exist inside the brain: grey matter (GM), white matter (WM) and cerebro-spinal fluid (CSF), distributed over several anatomical structures, such as the cortical ribbon and central structures for GM, ventricles and sulci for CSF, and myelin sheath for WM. 3D MRI brain scans are huge images (≈10Mb for one 3D image), whose interpretation consists either in tissue interpretation or in the identification of anatomical structures. Radiometric knowledge, i.e. knowledge about the tissue intensity distribution and about image acquisition artefacts, must be inserted for tissue interpretation, and anatomical knowledge, i.e. knowledge about the geometry and localization of these structures, has to be added for structure interpretation. To perform tissue interpretation properly, three kinds of acquisition artefacts are generally taken into account: 1) white noise over the image volume, which leads to an overlapping of tissue intensity distributions, 2) a partial volume effect due to the sampling grid of the MRI signal, which leads to mixtures of tissues inside given voxels, and 3) a bias field, due to intensity inhomogeneities in the radio frequency field, which introduces variations in tissue intensity distribution over the image volume. Most of the methods proposed in the literature perform a radiometry-based tissue interpretation. We focus our paper on this issue.

1.2 Estimation and Classification via Fusion Mechanisms

In image processing, decision making occurs at the voxel level since each voxel has to be labelled. Such a labelling is based on so-called "models" that characterize tissue intensity distributions, and on interactions between neighbouring voxel labels to respect tissue homogeneity. Such models have to be learned from sufficient data and expressed in a common framework for information fusion at the voxel level. For MRI brain scan interpretation, model computation is hampered by: 1) the presence of noise, 2) the heterogeneity of classes and 3) their large overlap. To cope with these difficulties, the modelling process, via estimation and classification, should be refined through an iterative procedure.

Several estimation techniques have been proposed in the literature. Most of them use Bayesian classification algorithms where tissue intensity distributions are modelled as Gaussian curves whose parameters, mean value and standard deviation, have to be estimated, generally via an iterative EM (Expectation-Maximization) approach [3, 5, 7, 8]. The prior probability required in Bayesian classifiers is based on the relative frequency of each class in the volume [3, 6] or on the introduction of spatial knowledge by means of a digital brain atlas [1, 7]. In this context, Markov Random Fields (MRFs) are often used to model interactions between neighbouring voxel labels and define the a priori probability of tissue via a regularization term [7, 8]. The parameters of the MRF model may be given a priori [5] or also estimated iteratively during the EM process [7]. Image artefact modelling can also be introduced, mainly for bias field correction [1, 5, 7].
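To make the estimation step concrete, the following is a minimal sketch of expectation-maximization for a one-dimensional Gaussian mixture over grey levels, where the class weights play the role of priors based on relative class frequency. It is an illustration on synthetic data, not the implementation used in any of the cited methods; the function names, initial values and the three synthetic classes are assumptions, and no MRF or bias field term is included.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian at grey level x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_gaussian_mixture(values, means, stds, weights, n_iter=30):
    """Fit a 1D Gaussian mixture to grey levels by EM (illustrative only)."""
    means, stds, weights = list(means), list(stds), list(weights)
    n_classes = len(means)
    for _ in range(n_iter):
        # E step: posterior probability of each class for each grey level.
        posteriors = []
        for x in values:
            joint = [weights[c] * gaussian_pdf(x, means[c], stds[c])
                     for c in range(n_classes)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            posteriors.append([p / total for p in joint])
        # M step: re-estimate the parameters from the posteriors.
        for c in range(n_classes):
            mass = sum(p[c] for p in posteriors)
            if mass < 1e-12:
                continue  # empty class: keep its previous parameters
            means[c] = sum(p[c] * x for p, x in zip(posteriors, values)) / mass
            variance = sum(p[c] * (x - means[c]) ** 2
                           for p, x in zip(posteriors, values)) / mass
            stds[c] = max(math.sqrt(variance), 1e-3)  # avoid a collapsed class
            weights[c] = mass / len(values)
    return means, stds, weights


# Synthetic grey levels for three hypothetical classes (not real MRI data).
random.seed(0)
true_means = (50.0, 100.0, 150.0)
grey_levels = [random.gauss(m, 5.0) for m in true_means for _ in range(300)]
fitted_means, fitted_stds, fitted_weights = em_gaussian_mixture(
    grey_levels, means=[40.0, 90.0, 160.0], stds=[10.0] * 3, weights=[1 / 3] * 3)
```

On real scans the class distributions overlap far more than in this synthetic example, which is one reason the text argues for refining the estimates iteratively and locally.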

Estimation and classification should be continuously refined during the incremental image interpretation process. Most of the strategies proposed to date to control such a process are iterative optimal approaches that use MRFs for modelling the neighbouring labelling topology, Gaussian modelling for tissue intensity distribution and Bayesian fusion for information combination [5, 7, 8]. Other approaches are rather based on incremental improvement of the interpretation [6].

For robust decision making, it is essential to proceed incrementally, through successive and interleaved model estimation, classification, focalisation and fusion. Figure 1 exemplifies the information flow during modelling and interpretation processes and the central role of maps, which are implicitly used in MRF approaches, and explicitly represented in the approach we develop below. Maps are matrices organizing the image information according to its spatial coordinates, in order to keep track of topographical information. They constitute a naturally distributed information repository, where focalisation mechanisms can elegantly take place. Maps represent explicitly the various types and levels of information that are gradually computed, exploited and fused along the entire interpretation process. We advocate in this paper that the use of a multi-agent approach is a powerful way to manage such explicit maps.

[Figure 1: flow diagram. Modelling process (model level): model parameters estimation yields, for one tissue, a tissue intensity distribution model and a neighbouring labels interaction model. Interpretation process (voxel level): model-based classification turns the low level maps (grey levels map) and the decision (labelling) maps into model based tissue probability maps (one per tissue and per model), namely a grey level based probability map and a neighbouring labelling based probability map; information fusion by tissue combines them into the final tissue probability maps (one per tissue); information fusion between tissues for labelling decision combines these with the other tissues' final probability maps into the decision maps.]

Fig. 1. Modelling and interpretation processes. Starting from radiometric information (low level maps) models are instantiated and gradually refined by means of successive and interleaved steps of estimation, classification and fusion of probability maps to lead to the final labelling decision.

2 Our Situated and Incremental Multi-agent Approach

Adopting a situated and incremental approach consists in introducing accurate focalisations of treatments and adequate coordination and cooperation mechanisms. We chose an approach based on incremental improvement of the interpretation, the voxels being labelled from the most reliable to the most uncertain ones, and the adequacy of model instantiation being reinforced all along the interpretation process.

The control strategy drives the evolution of the interpretation process (through the graph of Figure 1) and the spatial exploration of the image (through the image volume). Focalising treatments means choosing, at a given step of the interpretation process: 1) a goal to be reached (objects to identify), 2) a region of interest to be processed (a set of voxels at a given location on the maps), and 3) a method that achieves the treatment (chosen among the four previously introduced modelling and fusion mechanisms). Organizing treatments means choosing how treatments should be distributed and coordinated for image interpretation, i.e. when and according to which criteria a treatment should be launched, and how treatments should cooperate to improve the global process.

2.1 Focalisation of Treatments

The interpretation process proceeds in a situated way, i.e. with evolving goals, inside distributed regions of interest, and using dedicated mechanisms. For instance, a simple region growing technique is used for voxels inside a tissue, whereas sophisticated confrontation mechanisms are used for voxels located at tissue borders.

Focalisation on Goals

Presently, in the system we have implemented, situated treatments are dedicated to the local interpretation of brain images in three tissues, WM, GM, and CSF. Decision maps are gradually introduced. Firstly, an initial skull-stripping map is built to differentiate brain tissues from the rest of the image. Then, decision maps for each tissue are extracted. A next step would consist in identifying anatomical structures by differentiation of the tissue decision maps.

Focalisation on Distributed Regions of Interest

To take into account the non-uniformity of the intensity distribution over the whole image, mainly due to the bias field, the interpretation proceeds on volume partitions. Local radiometric models are introduced; they are instantiated during local tissue distribution estimation steps and used during local labelling steps. Because estimation is performed locally, the resulting models are prone to errors and some are missing. To cope with this difficulty, local models distributed over the volume are confronted with models interpolated from the neighbourhood, to maintain the global consistency of the distributed interpretation process and reinforce the robustness of the models.
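The confrontation of a local model with one interpolated from the neighbourhood can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the interpolation or the decision rule, so the sketch averages the tissue mean over the six face-adjacent partitions and replaces a local value that is missing or deviates by more than a tolerance. The names `interpolated_mean`, `confront` and the `tolerance` value are hypothetical.

```python
# Offsets to the six face-adjacent partitions (an assumed neighbourhood).
NEIGHBOUR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                     (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def interpolated_mean(models, cube):
    """Average the tissue mean over neighbouring partitions that have a model.

    `models` maps a partition index (i, j, k) to the estimated mean grey
    level of one tissue, or to None when local estimation failed.
    """
    i, j, k = cube
    available = []
    for di, dj, dk in NEIGHBOUR_OFFSETS:
        value = models.get((i + di, j + dj, k + dk))
        if value is not None:
            available.append(value)
    return sum(available) / len(available) if available else None


def confront(models, cube, tolerance=15.0):
    """Return the mean to use for `cube` after comparison with its neighbours."""
    local = models.get(cube)
    control = interpolated_mean(models, cube)
    if control is None:
        return local  # nothing to compare against yet
    if local is None or abs(local - control) > tolerance:
        return control  # missing or implausible local model: use the control
    return local


# Hypothetical tissue means for four partitions; one local model is missing.
tissue_means = {(0, 0, 0): 100.0, (2, 0, 0): 110.0,
                (1, 1, 0): 105.0, (1, 0, 0): None}
print(confront(tissue_means, (1, 0, 0)))
```

A fuller version would treat standard deviations and class weights in the same way and would weight neighbours by their reliability.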

Focalisation for Selection of Dedicated Mechanisms

To take into account the noise and the partial volume effect, which induce errors in the instantiation of the radiometric models, two phases are distinguished in the local interpretation process: 1) during the initial phase, based on strict labelling constraints, an under-segmentation of the image is produced: no labelling decision is taken for the most difficult voxels situated at the frontier between tissues, and 2) during the final phase, the radiometric models are first refined and then the remaining voxels are labelled. Each phase is composed of a radiometric model estimation step and of a voxel evaluation and labelling step.

The initial phase. Radiometric model estimation is initialised with a k-means algorithm and refined with a Bayesian EM algorithm. Five tissue classes are estimated, one for each tissue or tissue mixture (CSF, GM, WM, CSF-GM and GM-WM). The prior probability is based on the relative frequency of each class in the volume and estimated during the E step of the algorithm. The obtained model is then confronted with a control model interpolated from the neighbourhood. Classification of voxels into pure tissue classes proceeds with a region growing process, following three steps: 1) region growing constraints are defined from the Gaussian models in order to label only the most reliable voxels of each tissue (to obtain an under-segmentation), 2) seeds to start the region growing are selected using strict criteria (“rooting mechanism”) or transmitted from neighbouring regions (“region growing propagation mechanism”), and 3) voxels are labelled as a function of their grey level and of the labelling of the neighbouring voxels.
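The region growing step of the initial phase can be illustrated with a small sketch. It assumes a 2D slice, 4-connectivity, and an acceptance band of the tissue mean plus or minus `k` standard deviations; these choices, and the names `region_grow` and `slice_2d`, are assumptions made for the example, since the text does not give the exact constraints.

```python
from collections import deque


def region_grow(image, seeds, mean, std, k=1.0):
    """Label voxels connected to the seeds whose grey level is reliable.

    A voxel is accepted only if its grey level lies within mean +/- k * std,
    so ambiguous voxels (e.g. at tissue borders) are left unlabelled.
    """
    rows, cols = len(image), len(image[0])
    low, high = mean - k * std, mean + k * std
    labelled = {s for s in seeds if low <= image[s[0]][s[1]] <= high}
    queue = deque(labelled)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in labelled
                    and low <= image[nr][nc] <= high):
                labelled.add((nr, nc))
                queue.append((nr, nc))
    return labelled


# Hypothetical grey levels: one tissue around 100, another around 60, and
# one intermediate voxel (80) that mimics a partial volume effect.
slice_2d = [
    [100, 102, 60],
    [98, 80, 58],
    [101, 99, 100],
]
region = region_grow(slice_2d, seeds=[(0, 0)], mean=100.0, std=3.0)
print(sorted(region))
```

The intermediate voxel stays unlabelled, which is the under-segmentation the initial phase aims for; such voxels are left to the final phase.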

The final phase. Voxels left unlabelled during the initial phase are treated during the final phase in order to obtain a complete image interpretation. Radiometric model estimation is initialised with the previously under-segmented image and refined with a Bayesian EM classification algorithm. Voxel classification is done competitively between tissues, from the most reliable labelling to the most uncertain one. It relies on a more sophisticated model than that used in the initial phase and concerns only voxels at the tissue frontiers, which are more difficult to label. Partial volume labels may be introduced.

2.2 Organization of Treatments

Parallel interpretation processes are launched in each volume partition in a coordinated and cooperative way.

Coordination and Information Diffusion

Treatments have to be coordinated inside a given volume partition or between neighbouring partitions, as a function of the available and incrementally extracted knowledge. Local model estimation must be reinforced using estimations produced in the neighbourhood. This confrontation can only be achieved when the information from the neighbourhood is available. When a model is modified, the corresponding information is propagated to neighbouring regions for new confrontations. The labelling process performed during the initial phase requires information about the location of seeds to launch the region growing mechanism. A time-consuming rooting process may be used to select seeds using local radiometric and topological criteria. It can be advantageously replaced by mechanisms of “region growing propagation from cube to cube”. When a local region growing process reaches the frontier of its cube, it transmits the voxel candidates to the corresponding neighbouring process (which is launched if necessary). The switch from one step of the local interpretation process to the next is launched autonomously in each cube, as a function of criteria relative to the information available in the cube and in the neighbouring cubes. To launch the final estimation step, a large enough local under-segmentation has to be available, which depends on the advancement of the labelling process in the neighbouring partitions. Similarly, to launch the labelling steps, local models first have to be computed in the volume partition, and then a robust enough model interpolation from the neighbourhood has to be available to verify the model.


Cooperation between Distributed Treatments

Three kinds of cooperation defined in [2] are used in this context: 1) integrative cooperation: model estimation, model checking using the neighbourhood, and data analysis steps are interleaved, 2) augmentative cooperation: interpretation is a spatially distributed process, and 3) confrontational cooperation: pieces of information produced in the same region or in neighbouring regions are confronted (via fusion and interpolation mechanisms).

2.3 A Multi-agent Architecture

To implement the mechanisms previously described, we introduce situated and cooperative agents as a programming paradigm. The system is composed of agents running under control of a system manager, whose role is to manage their creation, destruction, activation and deactivation. Each agent is in turn provided with several behaviours running under control of a behaviours manager. The agents are organized into groups running under control of a group manager which ensures their proper coordination. Briefly (details about the implementation can be found in [4]), three types of agents coexist in the system: global and local control agents and tissue dedicated agents. The role of the global control agent is to partition the data volume into adjacent territories, and then assign to each territory one local control agent. The role of local control agents is to create tissue dedicated agents, to estimate model parameters and to confront tissue models for labelling decision.

The role of tissue dedicated agents is to execute tasks distributed by tissue type: tissue model interpolation from the neighbourhood and voxel labelling using a region growing process.

The agents have to be coordinated at several levels: 1) inside the cube the local control agent and the three tissue dedicated agents alternate the firing of their behaviours, 2) tissue dedicated agents from neighbouring cubes interact during their model control behaviour and their region growing behaviour, and 3) agent behaviour selection also depends on the global progress of the interpretation. Behaviour switching is decided either autonomously by agents when they have achieved their current behaviour and when the required information is available, or triggered by group coordination mechanisms. Agents are organized into groups depending on their type and on the treatments they currently process. Four groups of local control agents and three groups of tissue dedicated agents (one group for each step to be processed by each kind of agent) coexist in the system. The agents share a common information zone, organized according to the tissue types and spatial relations, storing global and local statistical information. Qualitative information maps are introduced to efficiently gather, retrieve and easily add complementary information.

3 Evaluation

To evaluate our system, we acquired three-dimensional anatomical brain scans (T1-weighted, 3D flash sequence, voxel size = 1 mm³, matrix size = 181 × 184 × 157) at 1.5 T on a clinical MR scanner (Philips ACS II). Such images are shown in Figures 2 and 3.

Image partitioning for local model estimation and classification: Figure 2 shows the high variability of tissue characteristics depending on the position in the image and illustrates the importance of local model adaptation. The anatomical volume was partitioned following a grid of 15 × 15 × 15 voxel cubes. Six agents were considered per cube: one local control agent and five tissue dedicated agents, i.e. three agents dedicated to pure tissue (WM, GM and CSF) labelling and two agents dedicated to mixture labelling (WM-GM and CSF-GM). At the end, 686 local control agents and 3430 tissue dedicated agents were launched (segmentation in 3.5 min on a PC486, 256M RAM, 800MHz). In each cube, a local histogram was computed on a 20 × 20 × 20 voxel region centred on the middle of the cube. For two selected cubes (drawn in white (bottom) and black (top) in Fig. 2a), local histograms were computed (Fig. 2c). As indicated in Fig. 2c, the GM intensity distribution of the upper cube was equal to the WM intensity distribution of the lower cube. Nevertheless, thanks to the local adaptation, the global result is satisfactory, as indicated in Fig. 2b.
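The partitioning described above can be sketched as follows, assuming a regular grid laid over the full image matrix and a histogram region clipped at the volume borders. Both helper functions are hypothetical, and the clipping rule is an assumption, since the text does not say how border cubes are handled.

```python
import math


def cube_origins(shape, cube_size=15):
    """Origins of the cubes of a regular grid laid over a volume of `shape`."""
    counts = [math.ceil(n / cube_size) for n in shape]
    return [(i * cube_size, j * cube_size, k * cube_size)
            for i in range(counts[0])
            for j in range(counts[1])
            for k in range(counts[2])]


def histogram_window(origin, shape, cube_size=15, window=20):
    """Per-axis (start, stop) bounds of the histogram region of one cube.

    The region is `window` voxels wide, centred on the middle of the cube,
    and clipped to the volume (the clipping rule is an assumption).
    """
    bounds = []
    for start_of_cube, axis_length in zip(origin, shape):
        centre = start_of_cube + cube_size / 2
        start = max(0, int(centre - window / 2))
        stop = min(axis_length, int(centre + window / 2))
        bounds.append((start, stop))
    return bounds


matrix_size = (181, 184, 157)  # matrix size reported in the text
origins = cube_origins(matrix_size)
print(len(origins))  # number of cubes in a full grid over the matrix
print(histogram_window((15, 15, 15), matrix_size))
```

A full grid over the reported matrix gives 13 × 13 × 11 = 1859 cubes, more than the 686 local control agents reported; the text does not say how cubes were selected, so the sketch does not try to reproduce that number. The reported agent counts are consistent with each other, since 686 cubes with five tissue dedicated agents each give 3430.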

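[Figure 2: a. partitioning grid (columns A to I, rows 1 to 9) on an MR slice; b. final segmentation; c. histograms (voxels against grey levels) for cubes F-2 and B-8 and for the global volume, with the B-8 WM peak and the F-2 GM peak marked.]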

Fig. 2. a. The partitioning grid is placed on one MR anatomical slice. The local histograms corresponding to two cubes located at B8 (in red) and F2 (in blue) are shown in c. The global histogram over the entire volume is plotted (in black). The final segmentation is indicated in b.

Gradual interpretation refinement via fusion mechanisms: Partitioning can lead to some difficulties in model estimation. Two cases are emphasized. In case 1, due to the reduced size of the voxel population, the model estimation fails for some tissues. In case 2, the presence of different anatomical structures composed of the same tissue hampers the model estimation. Cooperation between neighbouring agents and progressive interpretation refinement are the solutions we propose. They are illustrated in Figures 3 and 4. In the anatomical part displayed in Fig. 3a, several tissues are shown: WM, CSF inside sulci, GM in the cortex and GM in the putamen, a central nucleus whose grey level is intermediate between those of the cortical GM and of the WM. Figures 3d to 3g show, at several interpretation steps, starting in d and ending in g, the histograms and the estimated gaussian models corresponding to the grid in Fig. 3a. Cube D4 and cube C3 are illustrative of case 1 and case 2 respectively. The evolution of the estimation of their histograms is zoomed in Figure 4. In Fig. 3d, initial gaussian models were estimated. Some were missing due to the absence of some tissue (see D4 in Fig. 4d) or to the existence of a new intensity distribution (putamen in C3, see Fig. 4d) between the distributions of cortical GM and WM. In C3 this led to a misinterpretation: the putamen peak was interpreted as a GM peak and the cortical GM peak was interpreted as a CSF peak. During the following interpretation step (Fig. 3e), these gaussian models were checked, corrected and/or computed by interpolation from the neighbouring models. Missing models were computed (see D4 in Fig. 4e).


[Figure 3: panels a, b and c show the partitioning grid (columns A to D, rows 1 to 5) with regions marked WM, GM cortex, GM putamen and CSF sulci, and with the labels WM, GM, CSF and GM/WM partial volume; panels d, e, f and g show the local histograms on the same grid.]

Fig. 3. a. MR anatomical image and the partitioning grid. b. Under-segmented image obtained at the end of the initial phase. c. Final segmentation image. d-g. Local histograms and estimated gaussian models during the incremental interpolation process, starting with d (initial gaussian estimation) and ending with g (re-evaluation in the final phase).

False models were corrected (the putamen is a small structure, thus the GM model computed corresponds to cortical GM, see C3 in Fig. 4e). These models were used to compute an under-segmented image (see Fig. 3b). Note that at this step, the putamen stayed unlabelled because its intensity was intermediate between the cortical GM and WM models. This under-segmentation was used to re-estimate the gaussian models (Fig. 3f). Some models were refined in this way (see WM in D4, and GM in C3 in Fig. 4f). Once again (see Fig. 3g), the resulting models were checked, missing ones were computed (see CSF in D4 and C3 in Fig. 4g) and the models were used to label the remaining voxels (see final segmentation in Fig. 3c). Additional labels corresponding to WM-GM and CSF-GM partial volume effects were added in the final labelling phase. Most of the voxels belonging to the putamen structure were labelled as WM-GM partial volume voxels.
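The step in which missing models are computed from neighbouring cubes can be sketched as below. The text only says that models are obtained "by interpolation from the neighbouring models"; plain averaging of the neighbours' gaussian parameters is an assumption made for this illustration, and the function and variable names are hypothetical.

```python
from statistics import mean

TISSUES = ("CSF", "GM", "WM")


def interpolate_missing(models, neighbours):
    """Fill the missing tissue models of one cube from its neighbours.

    models: dict tissue -> (mean, std), or None when estimation failed.
    neighbours: list of such dicts, one per adjacent cube.
    Averaging is an assumed interpolation rule, not the authors' method.
    """
    filled = dict(models)
    for tissue in TISSUES:
        if filled.get(tissue) is not None:
            continue  # a model already exists for this tissue
        known = [n[tissue] for n in neighbours if n.get(tissue) is not None]
        if known:
            filled[tissue] = (
                mean(m for m, _ in known),
                mean(s for _, s in known),
            )
    return filled
```

A cube such as D4, where only part of the tissue models could be estimated initially, would receive its remaining models from whichever neighbours have them; a tissue with no estimate in any neighbour stays missing until a later refinement step.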


[Figure 4: histograms (voxels against grey levels) at steps d, e, f and g for cube C-3, with the CSF, GM and WM models and the putamen intensity peak marked, and for cube D-4, with the GM and WM models marked and "No model computed" where a model is missing.]

Fig. 4. Zooms for cubes C3 and D4 of the local histograms and estimated gaussian mixtures during the incremental interpolation process (d-g).

4 Discussion and Perspectives

In this paper, we propose a framework based on situated and cooperative agents for the management of information processing steps required for image interpretation. The application to MRI brain scan interpretation has been reported.

Several generic principles have driven the design of our framework. Each agent is rooted in a three-dimensional space, situated at a given position in the environment, with a given goal. It works locally, diffuses its partial results to its acquaintances (for instance agents dedicated to the same tissue in neighbouring regions), shares results via specific maps and coordinates its actions with other agents to reach a global common goal. On various realistic brain phantoms, we obtained results (about 84% true positives) comparable to those of other optimal methods, which rely on MRF models and include a bias field correction map, with a lower computational burden (less than 5 min to segment a complete volume) (see [4]). The present evaluation was performed on real MRI scans at 1.5 T.

Our strategy for MRI brain scan interpretation is based on the partition of the image volume and on the introduction of local modelling mechanisms (similarly to [3, 5]) to take the bias field into account without introducing an explicit bias field map. As shown by the results in Figure 2, this allows the tissue intensity distributions to be estimated at different locations in the image, despite large intensity variations inside the same tissue. Our local approach implies the use of mechanisms for information diffusion, as confirmed by the results shown in Figures 3 and 4. Missing or non-optimal tissue models are defined or refined in this way. Because of gradual refinement, the quality of the estimator is not critical. The fusion of several qualitative maps gathering gradually acquired knowledge clearly improves the final decision.

A bias field map estimation may advantageously be inserted in our system to correct the residual intra-partition bias field. The results could also be refined by inserting anatomical knowledge. For this purpose, new low level maps can be computed using, for instance, mathematical morphology operators and interpreted using a particular model to obtain a specific structure probability map. Symbolic knowledge on structure geometry and location can also be introduced to compute rough structure probability maps from previously obtained decision maps concerning other structures in spatial relationship with the structure to detect. A model of the object to detect, for instance sulci, obtained with an active shape model can be inserted and deformed to fit the specificity of a new object. Knowledge derived from an atlas can be introduced, where structures are identified on a reference grey level map that can be deformed to fit the grey level map to be interpreted. The framework we report is open: the previously cited models and maps can be inserted to improve the radiometric-based approach we have described.

To conclude, based on our experiments with phantoms [4] and realistic MRI brain scans, situated and cooperative agents appear to be an interesting framework for combining the several information processing steps that are required for image interpretation.

References

1. Ashburner, J., Friston, K.: Multimodal image coregistration and partitioning - a unified framework. NeuroImage 6 (1997) 209-17.

2. Germond, L., Dojat, M., Taylor, C., Garbay, C.: A cooperative framework for segmentation of MRI brain scans. Artif. Intell. in Med. 20 (2000) 277-94.

3. Joshi, M., Cui, J., Doolittle, K., Joshi, S., Van Essen, D., Wang, L., Miller, M.I.: Brain segmentation and the generation of cortical surfaces. NeuroImage 9 (1999) 461-476.

4. Richard, N., Dojat, M., Garbay, C.: Situated Cooperative Agents: a Powerful Paradigm for MRI Brain Scans Segmentation. In: Van Harmelen, F. (ed.): ECAI 2002. Proceedings of the European Conference on Artificial Intelligence (21-26 July 2002, Lyon, Fr). IOS Press, Amsterdam (2002) 33-37.

5. Shattuck, D.W., Sandor-Leahy, S.R., Schaper, K.A., Rottenberg, D.A., Leahy, R.M.: Magnetic resonance image tissue classification using a partial volume model. NeuroImage 13 (2001) 856-876.

6. Teo, P.C., Sapiro, G., Wandell, B.A.: Creating connected representations of cortical gray matter for functional MRI visualization. IEEE Trans. Med. Imag. 16 (1997) 852-863.

7. Van Leemput, K., Maes, F., Vandermeulen, D., Suetens, P.: Automated model-based tissue classification of MR images of the brain. IEEE Trans. Med. Imag. 18 (1999) 897-908.

8. Zhang, Y., Brady, M., Smith, S.: Segmentation of Brain MR images through a hidden Markov random field model and the expectation-maximisation algorithm. IEEE Trans. Med. Imag. 20 (2001) 45-57.


Qualitative Simulation of Shock States in a Virtual Patient

Altion Simo1 and Marc Cavazza2

1 Virtual Systems Laboratory, University of Gifu, 1-1 Yanagido, Gifu-shi, Gifu, 501-1193, Japan
[email protected]

2 School of Computing and Mathematics, University of Teesside, TS1 3BA Middlesbrough, United Kingdom
[email protected]

Abstract. In this paper, we describe the use of qualitative simulation to simulate shock states in a virtual patient. The system integrates AI techniques with a realistic visual simulation of the patient in a 3D environment representing an ER room. We have adapted qualitative process theory to the representation of physiological processes in order to be able to generate appropriate pathophysiological models. We describe how a subset of cardiac physiology can be modelled using qualitative process theory and discuss knowledge representation issues. We then present results obtained by the system and the benefits that can be derived from the use of a virtual patient in terms of training. Finally, we explore the problem of integrating multiple pathophysiological models for various aetiologies of shock states.

1 Introduction

In this paper, we describe the use of qualitative simulation for modelling shock states in a virtual patient. Our interest in developing a virtual patient for clinical medicine is to provide a realistic tutoring environment, as well as to create an “ideal” interface to medical knowledge-based systems that would be able to visualise clinical situations. The use of a 3D environment not only provides a realistic setting for the training process but, from a diagnostic perspective, creates a situation in which the user has to actively search for visual cues (clinical signs). The objective of the visualisation of clinical situations through a virtual patient is thus to elicit an appropriate diagnostic and therapeutic process in the trainee. While there exists a substantial amount of research aiming at developing virtual patients for surgery, very little work has been dedicated so far to the use of virtual patients in clinical medicine. The main work on 3D virtual patients outside virtual surgery has been that of Badler et al. [2], who have described the use of an autonomous virtual human to simulate battlefield casualties in military simulations, still in the field of trauma rather than clinical medicine.

The paper is organised as follows. After a brief reminder of the relations between qualitative simulation and “deep knowledge” and an overview of our system architecture, we describe the qualitative simulation method we have used to model shock states and discuss knowledge representation issues. We then present an extended example from the system, including the visualisation of symptoms on the virtual patient. We conclude by discussing the integration of multi-system models around this qualitative model, in order to achieve more complex simulations.

2 From “Deep Knowledge” to Qualitative Simulation

Sticklen and Chandrasekaran [14] introduced the concept of deep knowledge, which consists in embedding medical knowledge in first principles, rather than explicitly encoding the relations between dysfunctions and signs through production rules. Though the notion of reasoning from first principles or model-based diagnosis is common to many AI applications, in the area of Medicine, deep knowledge generally corresponds to a pathophysiological model, i.e. an identified source of knowledge. This is why deep knowledge representations have mostly been produced in areas of medicine where clear pathophysiological descriptions are available, for instance acid-base regulation, respiratory physiology and intensive care, and in particular cardiac physiology [3] [4] [5] [9] [11] [13]. A related aspect of deep knowledge is the use of pathophysiological models for qualitative simulation. The underlying idea is to bring life to the explanatory models used in textbooks that contain causal explanations based on pathophysiological diagrams making use of variables such as “x”, which are precisely the definition of qualitative variables, though these schemata long pre-existed qualitative simulation. Cardiology, especially blood pressure regulation, has been a major area of application for qualitative modelling, probably because of the abundance of pathophysiological descriptions and their relevance to diagnosis and treatment. Long [13] formalised cardiac physiology using a causal network, and Kuipers adapted his QSIM approach to cardiac dynamics [11]. Widman [15] proposed another qualitative model of the circulatory system, using semi-quantitative variables, and encapsulating causal relationships between certain variables into higher-level processes (Starling’s law, Laplace’s law), though in a non-systematic fashion.

Fig. 1. System Architecture


Bylander described a model of the cardiac pump centered on physical processes of ejection and transmission, with separate modes for systole and diastole. Escaffre [7] used de Kleer’s confluence theory to implement a qualitative simulation of the long-term regulation of blood pressure. Cavazza [5] proposed an early implementation of qualitative process theory for the cardio-vascular system. It can thus be noted that, from confluence equations to the QSIM approach, most qualitative simulation theories have been adapted to cardiac physiology.

3 System Overview and Architecture

Our virtual patient system integrates a visualisation engine and a qualitative simulation system (Figure 1). The visualisation component is based on a state-of-the-art 3D game engine, Unreal Tournament™ [12]. It supports high quality graphics, as well as the animation of virtual characters, used for the patient. Besides, it includes an excellent development environment that supports the authoring of animations for virtual humans’ behaviour, as well as various mechanisms (dynamic link libraries and socket-based inter-process communication) for integrating external software, such as the qualitative simulation module. Our software architecture is based on UDP socket communication between the 3D graphics engine and the qualitative simulation module, which has been developed in Allegro Common Lisp™. The system generates a complete pathophysiological simulation from initial alterations corresponding to the pathological situation to be simulated. The set of parameters obtained is interpreted and displayed as clinical signs (e.g. pallor, enlarged jugular veins, etc.), as data on the monitoring devices (HR, MAP), or as results from complementary explorations (e.g. central venous line, Swan Ganz catheter). All these visual elements can be updated throughout the simulation to reflect a deterioration or an improvement of the patient’s situation. The visual appearance of the patient is based on dynamic textures that can reflect a relevant range of shock situations (pallor, “warm shock”, cyanosis, etc.).
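The link between the simulation module and the graphics engine can be illustrated with a short Python sketch. It is not the system's code, which is written in Allegro Common Lisp and talks to Unreal Tournament: the JSON message format, the integer coding of qualitative levels (0 meaning normal), the sign names, the thresholds, and the host and port are all assumptions introduced for illustration.

```python
import json
import socket


def clinical_signs(params):
    """Map qualitative parameter levels to display cues.

    The rules below are hypothetical examples of such a mapping.
    """
    signs = []
    if params.get("MAP", 0) < 0 and params.get("SVR", 0) > 0:
        signs.append("pallor")  # low pressure with vasoconstriction
    if params.get("CVP", 0) > 0:
        signs.append("jugular_distension")  # raised central venous pressure
    return signs


def encode_update(params):
    """Serialise one simulation state as a datagram payload (assumed format)."""
    message = {
        "monitor": {k: params[k] for k in ("HR", "MAP") if k in params},
        "signs": clinical_signs(params),
    }
    return json.dumps(message, sort_keys=True).encode("utf-8")


def send_update(params, host="127.0.0.1", port=7777):
    """Send the payload over UDP; host and port are placeholders."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_update(params), (host, port))
```

UDP is connectionless, so each update is a self-contained datagram; sending the complete set of displayed values every time means a lost packet is simply superseded by the next update.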

4 Implementing Qualitative Processes in Cardiac Physiology

Qualitative Process Theory (QPT) [8] was introduced by Forbus as one of the main techniques of Qualitative Simulation. It is centered on the identification of physical processes, within which the causal influence between variables is encapsulated. This approach is much closer to the description of physical phenomena themselves. QPT has been successful in modelling complex mechanical devices and has a real potential for modelling physiological systems as well. Due to the complexity of physiological systems, it is very difficult to derive a consistent set of confluence equations for such systems, which makes traditional approaches difficult to use.

Research in qualitative simulation of the cardio-vascular system has been motivated, on one hand, by the limitation of rule-based expert systems for the representation of medical knowledge and, on the other hand, by the difficulties encountered in the use of traditional modelling approaches, e.g. based on differential equations. A detailed discussion of the latter aspect is given in [5]. To summarise it, we can say that qualitative methods have the advantage of integrating various levels of description and of being able to generate explanations on the system behaviour. Numerical simulations mostly behave as “black boxes” and in some instances are faced with difficulties due to a lack of global convergence of the set of differential equations.

Fig. 2. Physiological Processes and Parameter Mapping to Patients’ Status

Because physiological knowledge tends to be expressed through processes encapsulating physiological laws [1], the use of a process-based representation also facilitates knowledge elicitation.

[Figure 2 panels: pressure-volume (P-V) loop with ESV, EDV, ESP and EDP marked; stroke volume plotted against end-diastolic volume; the relation MAP = CO × SVR; the four macro-processes Filling, Ejection, Arterial and Venous.]

Qualitative Process Theory relies on processes representing the main transformations, encapsulating qualitative variables linked through influence equations. We have defined some 20 processes corresponding to various physiological mechanisms, such as the determinants of ventricular filling (e.g., ventricular venous return, relaxation and passive elastance) or ventricular ejection (effects of inotropism, pre-load, after-load, etc.) as well as various compensatory mechanisms (e.g. baroreceptors). These processes are encapsulated into four macro-processes: ventricular filling, ventricular ejection, arterial system behaviour, venous system behaviour. In the course of the simulation, these four macro-processes are activated in turn, in a way that reflects the cardiac cycle (the “P-V” curve on Figure 2 represents the cardiac contraction cycle). The variables we defined are directly adapted from actual qualitative variables used in the description of cardio-circulatory pathophysiology. Hence they can take any of nine values, from “ “ to “ ”. Influence equations formalise the relations between variables in terms of their variation. For instance, an influence equation such as I+(inotropism, SV) indicates that stroke volume increases with inotropism, while I-(after-load, SV) indicates that it decreases when after-load increases. Influence equations are generally assumed to be linear, considering that they apply to a small set of qualitative values. However, we had to adapt the traditional notion of influence equation to the context of physiological laws, where influences are more complex.
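A minimal sketch of linear influence equations over a nine-value scale follows. The integer coding of the nine levels (from -4 to +4, with 0 as normal) is an assumption of this example, as are the function names and the way several influences are summed; the text only states that there are nine values and that influences are generally linear.

```python
LOW, HIGH = -4, 4  # nine ordered levels; this integer coding is assumed


def clamp(level):
    """Keep a level inside the nine-value scale."""
    return max(LOW, min(HIGH, level))


def influence(sign, source_change, coefficient=1):
    """Change induced on the target by a change of the source variable.

    sign is +1 for an I+ equation and -1 for an I- equation.
    """
    return sign * coefficient * source_change


def apply_influences(value, changes):
    """Add the net effect of several influences to a target variable."""
    return clamp(value + sum(changes))


# I+(inotropism, SV) with inotropism down two levels, and
# I-(after-load, SV) with after-load up one level.
sv = apply_influences(0, [influence(+1, -2), influence(-1, +1)])
```

Here both influences lower the stroke volume, so `sv` ends three levels below normal. The `coefficient` argument is where a non-unit or dynamically adjusted weight would enter.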

One modification consists in including coefficients for the influence relation, which can be modified dynamically to take into account that influence relations can change under different circumstances. For instance the influence of after-load on stroke volume is more important when inotropism is low. The coefficient used in the influence equation I-(after-load, SV) will dynamically reflect that. A single physiological law may be represented by more than one influence equation. One such example is Frank-Starling’s law, which describes the relation between ventricular ejection and “pre-load” (the level of ventricular filling prior to contraction, corresponding to the End-Diastolic Volume), this relation depending on cardiac contractility (inotropism) as well. The qualitative translation gives two levels of influence depending on the segment of the curve (Figure 3).

Fig. 3. Converting Starling’s Law into Influence Equations

This can be represented by maintaining two separate influence equations, I1+(pre-load, SV) and I2+(pre-load, SV), one for each segment of the Frank-Starling curve and each with its own influence coefficient; the equation to be used is determined by a threshold value of the pre-load. The transition point between these two influence equations, i.e. the pre-load value beyond which the increase of SV is less significant, is dynamically computed at each cycle as a function of the inotropic state (the computation is itself a qualitative influence). As a result, if we consider the determinants of cardiac ejection (whose output is represented by stroke volume), two out of the three influence equations I+(pre-load, SV), I+(Inotropism, SV), I-(After-load, SV) actually use extensions to the original theory to take into account the complex relations between determinants, a phenomenon that is difficult to capture with e.g., standard causal networks.
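The two-segment treatment of the Frank-Starling curve can be sketched as follows. The linear form of the transition point and the two coefficient values are hypothetical; the text states only that the two equations have different coefficients and that the transition point is computed from the inotropic state.

```python
def transition_point(inotropism):
    """Pre-load level above which the curve flattens.

    Assumed linear dependence on inotropism (0 means normal): a weaker
    heart reaches the flat segment at a lower pre-load.
    """
    return 1 + inotropism


def starling_sv_change(pre_load, pre_load_change, inotropism,
                       steep=1.0, flat=0.25):
    """Change in stroke volume caused by a change in pre-load.

    I1+ (coefficient `steep`) applies below the transition point and
    I2+ (coefficient `flat`) at or above it. Both values are assumptions.
    """
    if pre_load < transition_point(inotropism):
        return steep * pre_load_change
    return flat * pre_load_change
```

With normal inotropism, a two-level rise in pre-load from the normal level yields the full two-level rise in stroke volume; with inotropism two levels below normal, the same rise falls on the flat segment and yields only a fraction of it.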

Overall, our model includes 25 primitive parameters, which account for the main physiological variables: Stroke Volume, End-Systolic pressure and volume, End-Diastolic pressure and volume, pulmonary capillary pressure, systemic vascular resistance, inotropism, etc. Each process operates on average on 3 parameters and contains several influence equations. A few more parameters account for system properties that have not been modelled at a finer level of granularity, such as “left atrium function”, or “ventricular geometry” (the latter being part of the computation of the after-load). In addition, there are variables internal to the system, through which it is possible to integrate the effects of several influences throughout the cardiac cycles (for instance, the variation in stroke volume).

Temporal aspects are also easier to represent in a process-based approach than with confluence or constraint equations. They are “implicitly” part of the cycle through which the processes are invoked, though this does not address the problem of time-scales. For instance, if we consider the example of a blood loss, the first process affected is venous return, which impacts on ventricular filling, then ejection and finally the arterial system, causing a fall in MAP and triggering short-term mechanisms for maintaining arterial pressure (baroreceptors).
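The way temporal order follows from the invocation cycle can be illustrated with a toy scheduler. Everything inside the four process functions is a deliberately crude assumption: qualitative deviations from normal are coded as integers and simply added, which stands in for the real influence equations, and the baroreceptor response is reduced to a one-level increase.

```python
def new_state(**changes):
    """All parameters start at 0 (normal); keyword arguments set alterations."""
    names = ("blood_volume", "venous_tone", "venous_return", "pre_load",
             "inotropism", "SV", "HR", "CO", "SVR", "MAP")
    state = dict.fromkeys(names, 0)
    state.update(changes)
    return state


def venous(state):
    state["venous_return"] = state["blood_volume"] + state["venous_tone"]


def filling(state):
    state["pre_load"] = state["venous_return"]


def ejection(state):
    state["SV"] = state["pre_load"] + state["inotropism"]


def arterial(state):
    # Sums of deviations stand in for the products CO and MAP (assumption).
    state["CO"] = state["SV"] + state["HR"]
    state["MAP"] = state["CO"] + state["SVR"]
    if state["MAP"] < 0:  # baroreceptor response to a fall in MAP
        for name in ("HR", "SVR", "venous_tone"):
            state[name] += 1


PROCESSES = {"venous": venous, "filling": filling,
             "ejection": ejection, "arterial": arterial}
CYCLE = ["venous", "filling", "ejection", "arterial"]


def run_cycle(state, start="venous"):
    """Invoke the four macro-processes once, beginning at `start`."""
    i = CYCLE.index(start)
    order = CYCLE[i:] + CYCLE[:i]
    for name in order:
        PROCESSES[name](state)
    return order
```

Starting a cycle at the venous system with a reduced blood volume reproduces the order of events given for blood loss; starting at ejection with reduced inotropism gives the order ejection, arterial, venous, filling.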

As the virtual patient is developed for educational and training purposes, simulating the effects of therapeutics should be an important part of it. In the context of cardiac shock, a variety of medical treatments is made available: beta-agonists (such as dobutamine), alpha-agonists (norepinephrine or high-dose dopamine), arterial vasodilators, venous vasodilators, mixed vasodilators, adrenaline, fluid infusion, etc.

Each drug targets one or more primitive parameters of the simulation accounting for its effects and side effects. For instance, beta-agonists increase inotropism but also heart rate. In addition, these effects are dose-dependent and several doses are available: low, moderate, high. This is useful for combined treatments (e.g., beta-agonists + vasodilators) and also for exploratory treatment (e.g. careful volume expansion). Some target effects are shown in Table 1. The effects of medical treatment are simulated by modifying the corresponding target variables on the pathophysiological model obtained from the first simulation and running the qualitative simulation again until a new steady-state is obtained, which corresponds to the effects of the therapy in that context. Multiple therapies (e.g. beta-agonists with vasodilators) can be selected.

Table 1. Target Parameters of Some Common Treatments

Treatment        Target Parameter
IV Fluids        Blood Volume
β-agonist        Inotropism, HR
Vasodilator      SVR (↓)
Norepinephrine   SVR (↑)
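The treatment mechanism described above (alter the target parameters, then re-run until a steady state is reached) can be sketched as follows. The drug keys mirror Table 1, but the integer dose steps, the direction signs as coded here, and the generic `step` function passed to the loop are assumptions of this illustration.

```python
TARGETS = {
    "iv_fluids": {"blood_volume": +1},
    "beta_agonist": {"inotropism": +1, "HR": +1},
    "vasodilator": {"SVR": -1},
    "norepinephrine": {"SVR": +1},
}
DOSES = {"low": 1, "moderate": 2, "high": 3}  # assumed step sizes


def apply_treatments(state, prescriptions):
    """Return a copy of state with every target parameter shifted.

    prescriptions is a list of (drug, dose) pairs, so combined
    therapies are expressed by listing several pairs.
    """
    new = dict(state)
    for drug, dose in prescriptions:
        for param, direction in TARGETS[drug].items():
            new[param] = new.get(param, 0) + direction * DOSES[dose]
    return new


def run_until_steady(state, step, max_cycles=50):
    """Apply step (one simulation cycle) until the state stops changing."""
    for _ in range(max_cycles):
        following = step(state)
        if following == state:
            return state
        state = following
    raise RuntimeError("no steady state reached")
```

A call such as `apply_treatments(shock_state, [("beta_agonist", "moderate"), ("vasodilator", "low")])`, where `shock_state` is a hypothetical steady state from a first simulation, would give the starting point for the second run.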

5 Results

The qualitative model was developed using data from the literature validated through interviews with intensive care and emergency medicine specialists. In the first instance, physiological knowledge was formalised by describing appropriate processes in the framework of qualitative process theory. For instance, processes related to ejection were derived from physiological laws such as the Frank-Starling law. A first basic model providing early results was discussed with specialists of cardiac pathophysiology, who suggested including additional knowledge (for instance on the role of the left ventricle’s geometry or the nature of cardiac relaxation). Knowledge on traditional pathophysiological syndromes was not explicitly included in the model (which is a generative model). This made it possible to use the traditional textbook descriptions of these syndromes (cardiogenic shock, hypovolemic shock, peripheral shock) to test the system, by comparing the values obtained from the simulation with those traditionally described for the associated syndromes (values for cardiac output, mean arterial pressure, capillary pulmonary pressure, etc.).

The initial validation procedure involved the generation of the main shock syndromes by altering corresponding primitive variables: inotropism for cardiogenic shock, SVR and arterial properties for anaphylactic and toxic shocks, blood volume for haemorrhagic/hypovolemic shock. The results provided by the system in terms of the main physiological parameters (MAP, HR, CO, SV, EDV, EDP, etc.) are consistent with the traditional description of these syndromes in the literature, though they are all generated through a complex cycle of simulation. In addition to the main shock syndromes, the system is able to simulate a range of cases of short-term adaptation of the cardiovascular system.

Fig. 4. Treatment of Cardiogenic Shock

Examples include adaptation to an increase in after-load, to the increase of intrathoracic pressure due to artificial ventilation, to acute tachycardia, to atrial fibrillation, etc. These have provided additional validation for the model and can be used to generate more training cases, alone or in combination with acute heart failure.

We give a detailed example of the simulation of a cardiogenic shock. The simulation of cardiogenic shock is triggered by primitively decreasing the value of the inotropism parameter, which corresponds to the intensity of cardiac contraction. The first process activated is the ejection process (as the site of the inotropism parameter). The main output variable for this process is the Stroke Volume. The primitive decrease in inotropism causes a decrease in SV of the same order of magnitude. As a consequence, the End-Systolic Volume increases (ESV is updated by a process computing the “mechanical” aspects of the ventricle). The next active process is the arterial system, where values of cardiac output and mean arterial pressure are computed using the classical equations. In addition, the baroreceptor process reacts to a fall of MAP, triggering an immediate increase in HR, SVR and Venous Tone (VT). After the arterial system, the venous system macro-process is triggered, which contains several processes computing venous return. The most important influence here is that increased VT (in response to the activation of baroreceptors) increases ventricular venous return. The last process of the cycle is ventricular filling. This one is simplified in our model, which is mostly a Left Ventricle model, where the role of the right ventricle is part of a coarser model (not modelling specifically right ventricle contraction and pulmonary circulation). Here, ventricular filling is moderately increased by increased venous return due to increased venous tone: I+(VT, pre-load). More importantly, this process integrates the variations in ventricular volume. The (previous) increase in ESV added to the slight increase in venous return causes the End-Diastolic Volume (EDV) to increase. This example is an important illustration of the integration of effects throughout the cardiac cycle, which enables the integration of multiple dependencies as well as taking into account some dynamic aspects. The second cycle of simulation activates the ejection process again under the new conditions that result from short-term adaptive mechanisms, in particular the variation in pre-load. However, the increase in pre-load fails to improve the stroke volume significantly, as the influence I+(pre-load, SV) depends on the level of inotropism, which is primitively decreased.

Fig. 5. Distension of Jugular Veins as a Sign of Increased CVP

Hence the qualitative value of SV remains low. Then the Arterial System process is triggered again, and the calculations take into account the updated HR and SVR values (as modified by the compensatory mechanisms). The increase in SVR fails to restore MAP for severe alterations of inotropism, just as the increase in HR fails to restore CO. The set of physiological parameters can be mapped onto the patient representation and the monitoring devices: the patient is pale and sweating (low perfusion, vasoconstriction and sympathetic response), his consciousness is modified and the monitoring devices show a low MAP and high HR. The effects of therapeutics are simulated in a similar fashion. From the steady-state obtained by simulating cardiogenic shock above, the system is run again after taking into account the modifications introduced by the therapeutic selected. The effects of the correct therapeutic, a beta-agonist, are shown in Figure 4. It restores inotropism, ejection and a MAP closer to normal but still low. Heart rate remains high, both because of the acute context and as a side effect of the drug itself. End-diastolic pressure and PCap decrease. Arterial vasodilators improve ejection by decreasing the after-load. The effect of a variation of after-load is greater on a failing heart. This increases cardiac output but, because SVR is decreased, it fails to restore MAP. Isolated fluid expansion initially increases ventricular venous return but, due to a low stroke volume, only contributes to a dangerous elevation of filling pressures and PCap, while cardiac output remains low. This triggers various clinical signs in the patient, such as changes in respiratory rate (pulmonary edema) and, under certain circumstances, distension of the jugular veins (Figure 5).
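The “classical equations” used in the arterial system process are not spelled out in the text. Assuming they are the standard haemodynamic relations (the last of which also appears in Figure 2), they can be written as follows, with SV the stroke volume, EDV and ESV the end-diastolic and end-systolic volumes, HR the heart rate, CO the cardiac output and SVR the systemic vascular resistance.

```latex
\begin{align}
  \mathrm{SV}  &= \mathrm{EDV} - \mathrm{ESV} \\
  \mathrm{CO}  &= \mathrm{HR} \times \mathrm{SV} \\
  \mathrm{MAP} &= \mathrm{CO} \times \mathrm{SVR}
\end{align}
```

Read qualitatively, the first relation gives the step of the example in which SV falls while EDV is initially unchanged, so that ESV must rise. The other two show why compensation can fail: a rise in HR restores CO only if it outweighs the fall in SV, and a rise in SVR restores MAP only if it outweighs the fall in CO.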

6 Towards Multi-system Integration

A major challenge in the development of more comprehensive virtual patients is the integration of several physiological sub-systems. Difficulties arise from a wide variety in the granularity of knowledge, the nature of causal representations and the representational philosophies themselves.

In this section, we first discuss the integration problem by describing how a collection of models all related to the cardio-vascular system, but different in their nature and their granularity, could be integrated. We then discuss integration of models of the primary causes of non-cardiogenic shocks. We have described a qualitative model of cardio-circulatory physiology mainly oriented towards short-term regulation of arterial pressure. Many other models have been described for other aspects of cardiac physiology, such as cardiac arrhythmias, myocardial perfusion, peripheral circulation, blood clotting, etc. Related models have been developed for the long-term regulation of arterial pressure, which include mostly the renal system (see [7] for a qualitative model). A comprehensive model of cardiac physiology should integrate these different sub-systems. The integration involves two different aspects, which are the shared physiological parameters and their updating in both directions (i.e., from both models) and the respective timescales for the models. For instance, myocardial perfusion models can modify variables such as inotropism and ventricular relaxation, while their input is affected by heart rate or after-load (via ventricular wall tension). Models of peripheral circulation can be integrated in pathophysiological models of shock to take into account the “positive feedback” phenomena in the progression of shock states. Models of cardiac arrhythmias are more likely to provide input to the main model, which could simulate the haemodynamic consequences of the onset of arrhythmias and tachycardias. Finally, models of long-term arterial pressure regulation can be incorporated into models of chronic heart failure: this suggests interesting research directions for the dual use of our cardiac model for acute and chronic heart failure.

Pancreatitis is an acute disease that can evolve into a severe shock state. The pathophysiology of acute pancreatitis is not fully described and includes processes at different levels, from molecular processes to local inflammatory processes to general manifestations (which include shock). It can thus be challenging to integrate this model. However, while the remainder of the pancreatitis pathophysiological model would be based on a causal network describing the causes of the inflammatory process, the factors interfering with cardiac dynamics can be identified as targets for integration with the cardiac model. These factors are i) liquid sequestration, ii) release of vasodilator substances, such as bradykinin, and iii) release of cardio-toxic substances. The target qualitative variables can thus be circulating blood volume, SVR and physical properties of the arterial system, and inotropism. This approach could be a generic one for non-cardiogenic causes of shock states: septic shock, toxic shock, anaphylactic shock, etc. In all these shock states, a coarse-grained causal model would describe the primary pathology, interfaced to the QS cardiac model. In these models, different approaches to qualitative simulation could be used, such as simpler causal models. One tentative conclusion at this stage is that integration of the coarsest models would take place around the finest description model, whose behaviour would also control the simulation as a whole.

7 Conclusions

The need to generate clinical situations from first principles, which justifies the development of physiological models, also provides more realistic models that are easier to interface with the appearance and behaviour of virtual humans. In this context, the development of a virtual patient can be seen as an integration of a visual model and a physiological model, which is also a realistic model of the “internal behaviour” of the patient. As a result, a higher level of integration can be achieved with this approach than in systems in which the virtual human is mainly an interface to traditional knowledge-based systems. A realistic simulation should render the atmosphere and tension created by the critical nature of the situation. In that sense, the visual representation recreates some of the emotional tension of realistic situations. This is achieved through the reconstruction of a realistic ER, the patient’s behaviour and even the intervention of autonomous agents, such as ER nurses, who would react to the evolution of the patient’s situation and in that sense are also part of the simulation. Though the overall system is still under development, its target use will be for computer-aided training of medical students taking a first course in cardiac physiology or emergency medicine (related to treatment selection). It would provide a complete system to generate realistic emergency situations in which to assess the student’s diagnostic and therapeutic knowledge, reproducing familiar computer game settings.

References

1. Baan, J., Arntzenius, A.C., Yellin, E.L. (Eds.) Cardiac Dynamics. Martinus Nijhoff, The Hague, 1980.

2. Badler, N., Webber, B., Clarke, J., Chi, D., Hollick, M., Foster, N., Kokkevis, E., Ogunyemi, O., Metaxas, D., Kaye, J. and Bindiganavale, R. MediSim: Simulated medical corpsmen and casualties for medical forces planning and training, National Forum on Military Telemedicine, IEEE, 1996, pp. 21-28.

3. Bratko, I., Mozetic, I., and Lavrac, N. KARDIO: a Study in Deep and Qualitative Knowledge for Expert Systems. MIT Press, 1989.

4. Bylander, T., Smith, J.W. and Svirbley, J.R., Qualitative Representation of Behaviour in the Medical Domain. Computers and Biomedical Research, 21, pp. 367-380, 1988.

5. Cavazza, M. Simulation Qualitative en Physiologie Cardiaque, in Proceedings of AFCET/RFIA’91 (Lyon, France), 1991 (in French).

6. Coiera, E.W., Monitoring Diseases with Empirical and Model Generated Histories. Artificial Intelligence in Medicine, 2, pp. 135-147, 1990.

7. Escaffre, D. Qualitative Reasoning on Physiological Systems: The Example of the Blood Pressure Regulation. In: I. DeLotto and M. Stefanelli (Eds.), Artificial Intelligence in Medicine, Elsevier Science Publishers.

8. Forbus, K.D. Qualitative Process Theory. Artificial Intelligence, 24, 1-3, pp. 85-168, 1984.

9. Julen, N., Siregar, P., Sinteff, J.-P. and Le Beux, P., A Qualitative Model for Computer-Assisted Instruction in Cardiology. Proceedings AMIA 98, pp. 443-447.

10. Kuipers, B. Commonsense Reasoning about Causality: Deriving Behaviour from Structure. Artificial Intelligence, 24, 1-3, pp. 168-204, 1984.

11. Kuipers, B. Qualitative Simulation in Medical Physiology: A Progress Report. Technical Report, MIT/LCS/TM-280, 1985.

12. Lewis, M., Jacobson, J., Communications of the ACM, 45, 1, January 2002. Special issue on Games Engines in Scientific Research.

13. Long, W.J., Naimi, S., Criscitiello, M.G., Pauker, S.G., Kurzrok, S. and Szolovits, P. Reasoning about therapy from a physiological model, in Proceedings of MEDINFO’86 (Washington DC).

14. Sticklen, J. and Chandrasekaran, B., Integrating Classification-based Compiled Level Reasoning with Function-based Deep Level Reasoning, Applied Artificial Intelligence, 3, 2-3, pp. 275-304, 1989.

15. Widman, L.E. Expert System Reasoning About Dynamic Systems by Semi-quantitative Simulation. Computer Methods and Programs in Biomedicine, 29, pp. 95-113.

3D Segmentation of MR Brain Images into White Matter, Gray Matter and Cerebro-Spinal Fluid by Means of Evidence Theory

Anne-Sophie Capelle1, Olivier Colot2, and Christine Fernandez-Maloigne1

1 Laboratoire Signal Image et Communication (SIC), UMR CNRS 6615
Universite de Poitiers - Bat. SP2MI, Bd Marie et Pierre Curie - B.P. 30179
86962 Futuroscope-Chasseneuil Cedex, France
{capelle,fernandez}@sic.sp2mi.univ-poitiers.fr

2 Laboratoire d’Automatique I3D - FRE CNRS 2497
Universite des Sciences et Technologies de Lille - Bat. P2, Cite Scientifique
59655 Villeneuve d’Ascq Cedex, France
[email protected]

Abstract. We propose an original scheme for the 3D segmentation of multi-echo MR brain images into white matter, gray matter and cerebro-spinal fluid. To take into account the complementarity, redundancy and possible conflicts among the different echoes, a fusion process based on Evidence theory is used. This theory, well suited to imprecise and uncertain data, provides powerful fusion tools. The originality of our method is to include a regularization process by means of Dempster’s combination. Adding neighborhood information increases the available knowledge, which makes the segmentation more confident, accurate and efficient. The method is applied to simulated multi-echo data and compared with a method based on Markov Random Field theory. The results are very encouraging and show that Evidence theory is well suited to this kind of problem.

1 Introduction

Magnetic Resonance (MR) imaging provides excellent differentiation and visual resolution of brain tissue types in vivo. MR image segmentation methods are numerous and include single- or multi-echo approaches [1]. The simultaneous analysis of different echoes provides abundant, redundant, complementary and sometimes conflicting information. Segmenting MR images into white matter (WM), gray matter (GM) and cerebro-spinal fluid (CSF) by a fusion process therefore seems quite natural. Thus, we propose to use Evidence theory, which is well suited to treating such uncertain and imprecise data.

Introduced by Dempster [2] and formalized by Shafer [3], Evidence theory is based on the construction of belief functions.

Our segmentation scheme is based on the modeling of the data by an evidential model. In order to take into account the relationship between neighbors, we propose to include spatial regularization through Dempster’s combination rule. This provides a more accurate modeling of the data and increases the confidence about the class membership. In section 2, we briefly introduce the Evidence theory background. In section 3, we present our segmentation scheme. In section 4, the results obtained on simulated data are described and discussed.

2 Evidence Theory Background

Let $\Theta = \{H_1, \ldots, H_N\}$ be a frame of discernment composed of the $N$ exhaustive and exclusive hypotheses $H_i$ of the classification problem. We denote by $2^\Theta$ the power set of $2^N$ propositions defined on $\Theta$. Within the context of Evidence theory, a piece of evidence brought by an information source on a proposition $A$ (a singleton or a compound hypothesis of $2^\Theta$) is modeled by the belief structure $m$, called Basic Belief Assignment (bba), defined by $m : 2^\Theta \to [0, 1]$ and verifying:

$$m(\emptyset) = 0, \quad \text{and} \quad \sum_{A \subseteq \Theta} m(A) = 1 . \tag{1}$$

Two dual functions called credibility (Bel) and plausibility (Pl) are derived from $m$. $\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B)$ can be interpreted as the total amount of belief committed to the proposition $A$. $\mathrm{Pl}(A) = \sum_{A \cap B \neq \emptyset} m(B)$ quantifies the maximum amount of belief potentially assigned to $A$. When a source is considered imprecise or not completely reliable, the confidence in this source can be discounted by a factor $\alpha$, and a derived belief structure $m_\alpha$ can be defined by:

$$m_\alpha(A) = \alpha\, m(A) \quad \forall A \subseteq \Theta,\ A \neq \Theta \tag{2}$$
$$m_\alpha(\Theta) = 1 - \alpha + \alpha\, m(\Theta) . \tag{3}$$
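As an illustration of eqs. (1) to (3), the following minimal Python sketch stores a bba as a dictionary from propositions to masses and derives Bel, Pl and a discounted bba from it. The frame {WM, GM, CSF}, the function names and the numerical masses are assumptions chosen for the example; they are not taken from the authors' implementation.

```python
from typing import Dict, FrozenSet

# A bba maps each focal proposition (a subset of the frame) to its mass.
BBA = Dict[FrozenSet[str], float]

# Assumed frame of discernment for the example.
FRAME: FrozenSet[str] = frozenset({"WM", "GM", "CSF"})


def belief(m: BBA, a: FrozenSet[str]) -> float:
    """Bel(A): total mass of the non-empty propositions B included in A."""
    return sum(mass for b, mass in m.items() if b and b <= a)


def plausibility(m: BBA, a: FrozenSet[str]) -> float:
    """Pl(A): total mass of the propositions B that intersect A."""
    return sum(mass for b, mass in m.items() if b & a)


def discount(m: BBA, alpha: float, frame: FrozenSet[str] = FRAME) -> BBA:
    """Eqs. (2)-(3): scale each mass by alpha and give the rest to the frame."""
    out = {a: alpha * mass for a, mass in m.items() if a != frame}
    out[frame] = 1.0 - alpha + alpha * m.get(frame, 0.0)
    return out


# Hypothetical bba for one voxel (illustrative numbers only).
m = {frozenset({"WM"}): 0.6, frozenset({"WM", "GM"}): 0.3, FRAME: 0.1}
wm = frozenset({"WM"})

print("Bel(WM) =", belief(m, wm))
print("Pl(WM)  =", plausibility(m, wm))
print("discounted by 0.5:", discount(m, 0.5))
```

In this example Bel({WM}) only counts the mass committed exactly to WM, whereas Pl({WM}) also counts every mass that does not contradict WM, so Bel never exceeds Pl. Discounting keeps the masses summing to one while shifting belief towards total ignorance.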

Let $\{m_1, \ldots, m_P\}$ denote $P$ bbas associated with $P$ independent information sources $S_1, \ldots, S_P$. Dempster’s combination rule, $m_\oplus = m_1 \oplus \ldots \oplus m_P$, is the most commonly used operator to aggregate $P$ sources. For two sources $S_1$ and $S_2$, the merged bba $m_\oplus$ is given by:

$$\forall A \subseteq \Theta \quad m_\oplus(A) = \frac{1}{1 - k} \sum_{B \cap C = A} m_1(B)\, m_2(C), \tag{4}$$

where $k = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$. The normalization term $k$ ($0 \le k \le 1$) can be interpreted as a measure of the conflict between the sources. Although Dempster’s rule has been justified theoretically [4], it is still criticized [5] because incoherent and counter-intuitive behaviours appear with high conflict ($k \approx 1$). In most applications, a decision has to be taken, generally in favour of a simple hypothesis (singleton). The most common decision rules consist in maximizing the credibility or the plausibility. Within the context of the Transferable Belief Model [6], Smets proposes to maximize the pignistic probability distribution.
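The combination rule of eq. (4) and a pignistic decision can be sketched as follows. This is an illustrative sketch only: the pignistic transform used here is the standard one, which shares each mass equally among the singletons of its focal set, and the two example bbas are made-up values.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

BBA = Dict[FrozenSet[str], float]


def dempster(m1: BBA, m2: BBA) -> Tuple[BBA, float]:
    """Combine two bbas with eq. (4); return the merged bba and the conflict k."""
    raw: BBA = {}
    k = 0.0
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            raw[inter] = raw.get(inter, 0.0) + mb * mc
        else:
            k += mb * mc  # mass falling on the empty set measures the conflict
    if k >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: v / (1.0 - k) for a, v in raw.items()}, k


def pignistic(m: BBA) -> Dict[str, float]:
    """Share each mass equally among the hypotheses of its focal set."""
    betp: Dict[str, float] = {}
    for a, mass in m.items():
        for h in a:
            betp[h] = betp.get(h, 0.0) + mass / len(a)
    return betp


# Two hypothetical, partially conflicting sources (illustrative numbers only).
theta = frozenset({"WM", "GM", "CSF"})
m1 = {frozenset({"WM"}): 0.6, theta: 0.4}
m2 = {frozenset({"GM"}): 0.5, theta: 0.5}

merged, k = dempster(m1, m2)
betp = pignistic(merged)
decision = max(betp, key=betp.get)
print("conflict k =", k)
print("decision   =", decision)
```

With these inputs, the only conflicting pair is ({WM}, {GM}), so k is the product of their masses, and the decision goes to the hypothesis with the largest pignistic probability.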

3 Segmentation Scheme

3.1 Evidential Modeling by Appriou’s Model

Appriou proposes two models, verifying some axiomatic requirements and based on the estimation of the likelihood $L(H_i|X)$ [7]. For each pattern vector $X$, we consider independently the $N$ simple hypotheses $H_i$ to construct and evaluate $N$ bbas $\{m_1, \ldots, m_N\}$. Following Appriou’s recommendations [8], we use the model defined by:

$$\begin{cases} m_i(\{H_i\}) = 0 \\ m_i(\{\overline{H_i}\}) = \alpha_i\, \{1 - R\, L(H_i|X)\}, \quad \forall i \in [1, \ldots, N] \\ m_i(\Theta) = 1 - \alpha_i\, \{1 - R\, L(H_i|X)\} , \end{cases} \tag{5}$$

where $\overline{H_i}$ denotes the complement of $H_i$ in $\Theta$, $R$, constrained by $R \in [0, (\max_{i \in [1,N],\, X} \{L(H_i|X)\})^{-1}]$, is a normalization factor and $\alpha_i$ is a reliability factor. The final bba $m$ is obtained by Dempster’s combination of the $N$ initial bbas: $m = m_1 \oplus \ldots \oplus m_N$. In our application using $p$ echoes, $X$ is a $p$-vector formed by $p$ gray levels.
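The construction of eq. (5) followed by the combination of the N bbas can be sketched in Python as below. The sketch assumes that the second line of eq. (5) assigns its mass to the complement of H_i in the frame, and it uses made-up likelihood values with R = 1 (which is admissible only under the assumption that no likelihood exceeds 1). All function names are hypothetical.

```python
from functools import reduce
from itertools import product
from typing import Dict, FrozenSet, List

BBA = Dict[FrozenSet[str], float]


def appriou_bbas(likelihoods: Dict[str, float], alpha: float, r: float) -> List[BBA]:
    """Build one bba per hypothesis H_i from L(H_i|X), following eq. (5)."""
    frame = frozenset(likelihoods)
    bbas: List[BBA] = []
    for h, lik in likelihoods.items():
        against = alpha * (1.0 - r * lik)  # mass given to "not H_i"
        bba: BBA = {frame: 1.0 - against}  # the rest goes to the whole frame
        if against > 0.0:
            bba[frame - {h}] = against
        bbas.append(bba)
    return bbas


def dempster(m1: BBA, m2: BBA) -> BBA:
    """Dempster's combination of two bbas (eq. 4)."""
    raw: BBA = {}
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        if b & c:
            raw[b & c] = raw.get(b & c, 0.0) + mb * mc
    total = sum(raw.values())  # equals 1 - k
    if total <= 0.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: v / total for a, v in raw.items()}


# Hypothetical likelihoods of one voxel for the three tissue classes.
likelihoods = {"WM": 0.8, "GM": 0.3, "CSF": 0.1}
bbas = appriou_bbas(likelihoods, alpha=0.95, r=1.0)
m = reduce(dempster, bbas)  # m = m_1 (+) ... (+) m_N

for focal, mass in sorted(m.items(), key=lambda item: -item[1]):
    print(sorted(focal), round(mass, 4))
```

Each individual bba never supports its own hypothesis directly; it only expresses how strongly the data speak against it. Support for a singleton such as {WM} emerges from the combination, when the bbas built for the other hypotheses agree in rejecting them.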

3.2 Introduction of Spatial Information

The originality of our method is to include, in addition to the initial data modeling by Appriou’s model, a spatial regularization by means of Evidence theory. The idea is to increase the global knowledge by integrating local knowledge. Thus, each pattern reinforces the knowledge about its neighbors. In particular, if a corrupted pattern is present, the neighborhood knowledge decreases its belief, acting as a denoising process. Near boundaries, some ambiguities are also resolved.

Let $X$ be the pattern to classify and $\partial(X) = \{X_1, \ldots, X_k\}$ its $k$ spatial neighbors. We denote by $m$ and $m^{\partial(X)} = \{m^{X_1}, \ldots, m^{X_k}\}$ the bbas respectively associated with $X$ and $\partial(X)$. We introduce the spatial information through a weighted Dempster’s combination. Thus, we define the new bba associated with $X$ by:

$$m' = m \oplus m^{X_1}_{\gamma_1} \oplus \ldots \oplus m^{X_k}_{\gamma_k}, \tag{6}$$

where $m^{X_i}_{\gamma_i}$ is the bba $m^{X_i}$ discounted by $\gamma_i$ as in eqs. (2) and (3). The discounting factors $\gamma_i$, for $i \in [1, k]$, depend on the distance between $X$ and its neighbor $X_i$ and are defined by $\gamma_i = \exp\{-d(X, X_i)\}$. Thanks to discounting, the nearest neighbors, i.e. the most reliable, have more influence than the farthest ones, i.e. the least reliable.
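A minimal Python sketch of the weighted combination of eq. (6) is given below. It assumes that d is the Euclidean distance between voxel centres expressed in voxel units, which the text does not specify, and it uses made-up bbas: a central voxel leaning towards GM surrounded by neighbors leaning towards WM.

```python
import math
from functools import reduce
from itertools import product
from typing import Dict, FrozenSet, Iterable, Tuple

BBA = Dict[FrozenSet[str], float]
FRAME: FrozenSet[str] = frozenset({"WM", "GM", "CSF"})


def discount(m: BBA, gamma: float, frame: FrozenSet[str] = FRAME) -> BBA:
    """Eqs. (2)-(3) with discounting factor gamma."""
    out = {a: gamma * mass for a, mass in m.items() if a != frame}
    out[frame] = 1.0 - gamma + gamma * m.get(frame, 0.0)
    return out


def dempster(m1: BBA, m2: BBA) -> BBA:
    """Dempster's combination of two bbas (eq. 4)."""
    raw: BBA = {}
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        if b & c:
            raw[b & c] = raw.get(b & c, 0.0) + mb * mc
    total = sum(raw.values())  # equals 1 - k
    if total <= 0.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: v / total for a, v in raw.items()}


def regularize(m_center: BBA, neighbors: Iterable[Tuple[BBA, float]]) -> BBA:
    """Eq. (6): combine a voxel's bba with its distance-discounted neighbor bbas."""
    discounted = (discount(m_x, math.exp(-d)) for m_x, d in neighbors)
    return reduce(dempster, discounted, m_center)


# Hypothetical noisy voxel and neighborhood (illustrative numbers only).
center = {frozenset({"GM"}): 0.6, FRAME: 0.4}
wm_like = {frozenset({"WM"}): 0.8, FRAME: 0.2}
neighbors = [(wm_like, 1.0)] * 6 + [(wm_like, math.sqrt(2.0))] * 4

result = regularize(center, neighbors)
print({tuple(sorted(a)): round(v, 3) for a, v in result.items()})
```

Because each neighbor is discounted before being combined, a single neighbor can only shift part of the central voxel's belief. Many agreeing neighbors are needed to overturn it, which is the denoising behaviour described above.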

4 Experiments and Results

To evaluate the reliability and efficiency of our method, we segment volumes taken from the BrainWeb¹ phantom database. Several noise levels (3%, 5% and 7%) and intensity non-uniformity levels (20% and 40%) are used. Each time, the multi-echo data volume is composed of the set of (T1, T2 and PD) echoes (P = 3). Each of them is composed of 181 slices of 217 × 181 voxels of 1 × 1 × 1 mm.

Two versions of the evidential segmentation scheme are evaluated. The first, called EV1, does not include neighborhood relationships. The second, EV2, incorporates the information of the 26-connected neighbors. For both EV1 and EV2,

¹ http://www.bic.mni.mcgill.ca/brainweb/


Table 1. Segmentation results with simulated Brainweb data

κ                 n=3%            n=5%            n=7%
Method  Tissue    20%     40%     20%     40%     20%     40%
EV1     WM        0.92    0.93    0.90    0.70    0.83    0.67
EV1     GM        0.86    0.86    0.88    0.66    0.82    0.62
EV1     CSF       0.83    0.83    0.89    0.92    0.91    0.90
EV2     WM        0.93    0.93    0.92    0.71    0.87    0.69
EV2     GM        0.87    0.87    0.89    0.67    0.86    0.65
EV2     CSF       0.84    0.83    0.83    0.91    0.91    0.91
[10]    WM        0.88    0.87    0.87    0.86    0.86    0.81
[10]    GM        0.89    0.89    0.87    0.87    0.81    0.82
[10]    CSF       na      na      na      na      na      na

Fig. 1. (a) Original T1 image with 7% noise level and 20% intensity non-uniformity. (b) EV1’s misclassified WM voxels. (c) EV2’s misclassified WM voxels. The well-classified WM is shown in gray; misclassified WM voxels are overlaid in bright color.

model’s parameters are estimated by the EM algorithm [9]; the reliability factor α of eq. 5 is equal to 0.95; decisions are taken by maximizing the pignistic probability. The non-brain structures were removed thanks to the ground truth provided by BrainWeb. Our method is compared to the multi-echo method of [10], which interleaves classification with estimation of model parameters and which includes inhomogeneity correction and contextual information based on Markov Random Fields. For each method, the Jaccard similarity, denoted κ and defined by TP/(TP + FP + FN), where TP, FP and FN are respectively the true-positive, false-positive and false-negative counts, is computed against the ground truth (an ideal segmentation corresponds to κ = 1). The results are summarized in Table 1. Comparing EV1 and EV2 results, we see that κ increases whatever the noise level and the intensity non-uniformity level. With EV1, misclassifications appear inside regions and on the frontiers between regions (Fig. 1-b). They are mostly due to noise and to fuzzy transitions between two anatomical structures. With EV2, numerous misclassified patterns are eliminated (Fig. 1-c). Note that the improvement in the segmentation results increases with the noise level; the spatial combination works as a denoising process. Moreover, segmentation improvements are obtained while preserving the finest structures thanks to the discounting of the belief functions. Without the discounting, the regions are too smoothed and details are lost.
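The Jaccard similarity κ = TP/(TP + FP + FN) used for the evaluation can be computed per tissue class as in the following sketch. The label sequences are toy values standing in for a segmented volume and its ground truth; the function name is hypothetical.

```python
from typing import Sequence


def jaccard(predicted: Sequence[str], truth: Sequence[str], label: str) -> float:
    """Jaccard similarity TP / (TP + FP + FN) for one tissue class."""
    tp = fp = fn = 0
    for p, t in zip(predicted, truth):
        if p == label and t == label:
            tp += 1
        elif p == label:
            fp += 1  # voxel wrongly assigned to the class
        elif t == label:
            fn += 1  # voxel of the class that was missed
    denom = tp + fp + fn
    if denom == 0:
        raise ValueError(f"label {label!r} is absent from both segmentations")
    return tp / denom


# Toy flattened segmentations (illustrative only).
predicted = ["WM", "WM", "GM", "CSF", "GM", "WM"]
truth = ["WM", "GM", "GM", "CSF", "WM", "WM"]

for tissue in ("WM", "GM", "CSF"):
    print(tissue, jaccard(predicted, truth, tissue))
```

True negatives do not enter the formula, so the measure is not inflated by the large number of voxels that belong to neither the predicted nor the true region of a class.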

When intensity non-uniformity is lower than 40%, the evidential approach is more accurate and efficient than the one presented in [10]. However, this tendency is reversed as the intensity non-uniformity increases, because the model’s parameters, which were estimated globally, fit the local variations of intensity poorly. One solution is to include a pre-processing step to correct the bias, as done in [10]. Another is to re-estimate the model’s parameters locally.

5 Conclusion

We propose a 3D evidential segmentation scheme for the detection of WM, GM and CSF in normal MR brain images. Well suited to modeling imprecise and uncertain data, Evidence theory provides powerful fusion tools. The originality of the method is to incorporate the spatial dependencies between neighbors by a weighted Dempster’s combination. Results obtained on simulated data show that the spatial fusion process increases the performance of the segmentation. In noisy regions, it behaves as a denoising process. Near the frontiers, it resolves some ambiguities. Comparisons with a method based on parametric estimation, bias field correction and Markov Random Field theory show the pertinence and the accuracy of the evidential approach and spatial fusion process. However, the segmentation process is not robust to high intensity non-uniformity. An adaptive version of the evidential segmentation process is currently under study.

References

1. Bezdek, J., Hall, L., Clarke, L.: Review of MR image segmentation techniques using pattern recognition. Medical Physics 20 (1993) 1033–1048

2. Dempster, A.: Upper and lower probabilities induced by multivalued mapping. Annals of Mathematical Statistics 38 (1967) 325–339

3. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press (1976) Princeton, New Jersey.

4. Dubois, D., Prade, H.: On the unicity of Dempster rule of combination. International Journal of Intelligent System (1996) 133–142

5. Zadeh, L.A.: On the validity of Dempster’s rule of Combination of Evidence. University of California, Berkeley (1979) ERL Memo M79/24.

6. Smets, P., Kennes, R.: The transferable belief model. Artificial Intelligence 66 (1994) 191–234

7. Appriou, A.: Probabilites et incertitudes en fusion de donnees multi-senseurs. Revue Scientifique et Technique de la Defense 11 (1991) 27–40

8. Vannoorenberghe, P., Denoeux, T.: Likelihood-based vs Distance-based Evidential Classifiers. In: FUZZ-IEEE’2001, Melbourne, Australia (2001)

9. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39 (1977) 1–38

10. Leemput, K.V., Maes, F., Vandermeulen, D., Suetens, P.: Automated model-based tissue classification of MR images of the brain. Technical report, Katholieke Universiteit Leuven (1999)

A Knowledge-Based System for the Diagnosis of Alzheimer’s Disease

Sebastian Oehm1, Thomas Siessmeier2, Hans-Georg Buchholz2, Peter Bartenstein2, and Thomas Uthmann1

1 Dep. of Mathematics and Computer Science, Johannes Gutenberg University, Mainz, Germany

2 Dep. of Nuclear Medicine, Johannes Gutenberg University, Mainz, Germany

Abstract. Therapies to slow down the progression of Alzheimer’s disease are most effective when applied in its initial stages. Therefore it is important to develop methods to diagnose the disease as early as possible. It is also desirable to establish standards which can be used generally by physicians who may not be experts in diagnosis of the disease. One possible method to obtain an early diagnosis is the evaluation of the glucose metabolism of the brain. In this paper we present a prototype of an expert system that automatically diagnoses Alzheimer’s disease on the basis of positron emission tomography images displaying the metabolic activity in the brain.

1 Introduction

Alzheimer’s disease (AD) is the most common form of dementia in elderly people (6% of people older than 65, 47% of people older than 85 years). Due to the increase of life expectancy the number of people suffering from this disease will grow in the future. A cure is not yet possible, but if the disease is detected at an initial stage, progression of the death of nerve cells can be slowed down by special therapies [5]. This leads to the need to standardise and facilitate early diagnostic investigation.

2 Diagnosis of Alzheimer’s Disease

Alzheimer’s disease is associated with neuronal degeneration mainly in the cerebral cortex of the brain [1]. This results in a reduction of metabolism in the affected cells and finally leads to their death. In the course of the disease the overall loss of neuronal functionality in the brain shows a pathognomonic pattern which can be observed in positron emission tomography (PET) image sets acquired by using fluorine-18-fluorodeoxyglucose (18-FDG) as a radio-labelled glucose analogue. These images reflect the relative metabolic activity in the different regions of the brain and can thus be used to diagnose AD [3, 7]. The diagnosis is facilitated by using three-dimensional stereotactic surface projections (3D-SSPs) [2, 7]. Clinical investigations cannot be used to distinguish AD from other dementias in its initial stages.


118 Sebastian Oehm et al.

L

1

023

4

56

78

9

R

1

023

4

56

78

9

L R

9 9

0 02 23 35 5

L R

10 1011 11

L 1312

R 1312

Fig. 1. ROIs defined for the diagnosis of AD. The views are (from left to right) leftlateral, right lateral, superior, posterior, left mesial and right mesial. Anatomical de-scription of relevant ROIs: 0: frontal cortex; 3,13: primary sensorimotor cortex; 4:temporal cortex, anterior; 5,12: parietal cortex, superior; 6: parietal cortex, inferior; 7:temporal cortex, posterior; 8,11: cerebellum; 9,10: occipital cortex.

3 Data Preparation

78 sets of PET images were used for this investigation [9]. 37 belong to subjects who certainly do not have AD, 41 to patients with probable AD according to NINCDS criteria [6]. According to the Mini-Mental Status Examination (MMSE), the stage of their disease was either mild (22-18 pt) or moderate (17-10 pt). 10 image sets were selected from each group. Images not showing any alteration in glucose metabolism and thus considered healthy were selected from the former group, images showing the pattern typical of subjects suffering from AD from the latter. While developing the model, these 20 image sets constituted the reference data, whereas all other sets served as test data.

3D-SSPs were calculated for all sets of PET images. Then a pattern of regions of interest (ROIs) was defined for every view of the 3D-SSP (see Fig. 1) using a stereotactic atlas of the human brain [10]. These ROIs can be used for the evaluation of every data set, since the 3D-SSP views are anatomically standardized. However, there is a known variation in the location of the primary sensorimotor cortex [10] even in these views. Therefore an algorithm searches for its most likely position [9]. Finally all views in a 3D-SSP image set are normalised according to the average metabolic activity in the cerebellum extracted from the posterior view (for a discussion of this choice see [9]). Data normalisation increases the sensitivity for detection of regional abnormalities in metabolism [7].

4 Modelling the Diagnosis

The proposed rule-based system uses the classic criteria established by Holman and Devous [4] for the diagnosis of AD. In addition, newer scientific findings like the early involvement of the posterior cingulate/mesial parietal cortex [8] are integrated into the model. More details on the deduced rules can be found in [9]. To evaluate the alteration in metabolism, the difference between the average metabolic activity in a particular ROI and in the primary sensorimotor cortex is assessed. The metabolic rate in the latter region is affected in only a few cases of AD. Hence it is well suited as a basis of comparison.

Since the extracted rules are based on the amount of variation in metabolic activity, two thresholds were defined for each ROI, using the reference data set and together with experts from nuclear medicine, to classify this variation into the following three categories:

Not affected: The amount of variation does not support the diagnosis of AD.
Weak: The degree of change is high enough for AD to be possible.
Strong: The variation in metabolic activity clearly points towards the disease.
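The two-threshold classification can be illustrated with the short Python sketch below. The threshold values and activity figures are hypothetical placeholders (the actual per-ROI thresholds were set with the experts and are given in [9]), and the variation is taken here, as an assumption, to be the normalised activity of the primary sensorimotor cortex minus that of the ROI.

```python
def reduction(roi_mean: float, sensorimotor_mean: float) -> float:
    """Difference between the reference region and the ROI (positive = reduced)."""
    return sensorimotor_mean - roi_mean


def classify_variation(value: float, weak_threshold: float, strong_threshold: float) -> str:
    """Map a variation onto the three categories using two thresholds."""
    if weak_threshold > strong_threshold:
        raise ValueError("weak_threshold must not exceed strong_threshold")
    if value >= strong_threshold:
        return "strong"
    if value >= weak_threshold:
        return "weak"
    return "not affected"


# Hypothetical thresholds for one ROI and three example activity values.
WEAK, STRONG = 0.10, 0.20
for roi_activity in (0.95, 0.85, 0.70):
    r = reduction(roi_activity, sensorimotor_mean=1.0)
    print(roi_activity, classify_variation(r, WEAK, STRONG))
```

Keeping the thresholds as per-ROI parameters, rather than constants in the rules, matches the description above, in which each region receives its own pair of values.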

In AD a reduction of metabolism can always be found in both hemispheres of any relevant region, even though maybe not to the same extent. To take this into account in the model, both hemispheres of the most important regions (parietal and temporal cortex) are evaluated together as a single region under one rule, using a special classification to combine the values in both hemispheres [9].

4.1 Definition of Rules

The most important rule in the diagnosis of AD is that if metabolic activity in the parietal cortex relative to the primary sensorimotor cortex is reduced significantly then the subject under investigation most likely suffers from dementia. Using the classification introduced above, this rule can be expressed as follows:

IF parietal superior strong
THEN Alzheimer’s disease.

Likewise other rules were extracted and expressed in terms of the model.

If none of these rules holds true, the expert system decides that AD cannot be diagnosed. If, according to these rules, AD is detected, the system checks for Lewy-body dementia and shows a warning if necessary. This dementia shows a similar pathognomonic pattern and thus is not easy to distinguish from AD.

Additionally, rules were integrated for providing warnings in case uncommon variations were encountered, e.g. the following (applied for all regions):

IF difference between left and right hemisphere > threshold
THEN caution: difference between hemispheres considerable.
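The two rules quoted in this section can be written as the following Python sketch. Only these two rules are encoded; the remaining rules, the combination of both hemispheres into one category and the Lewy-body check are described in [9] and are not reproduced. The region names, activity values and the asymmetry threshold are hypothetical.

```python
from typing import Dict, List, Tuple


def diagnose(
    categories: Dict[str, str],
    left: Dict[str, float],
    right: Dict[str, float],
    asymmetry_threshold: float,
) -> Tuple[str, List[str]]:
    """Apply the two quoted rules and return (diagnosis, warnings)."""
    # Rule: IF parietal superior strong THEN Alzheimer's disease.
    if categories.get("parietal superior") == "strong":
        diagnosis = "Alzheimer's disease"
    else:
        diagnosis = "AD cannot be diagnosed"
    # Warning rule, applied to every region measured in both hemispheres.
    warnings = [
        f"caution: difference between hemispheres considerable ({roi})"
        for roi in sorted(left.keys() & right.keys())
        if abs(left[roi] - right[roi]) > asymmetry_threshold
    ]
    return diagnosis, warnings


# Hypothetical inputs: combined categories and per-hemisphere activities.
categories = {"parietal superior": "strong", "temporal posterior": "weak"}
left = {"parietal superior": 0.70, "temporal posterior": 0.90}
right = {"parietal superior": 0.85, "temporal posterior": 0.88}

diagnosis, warnings = diagnose(categories, left, right, asymmetry_threshold=0.10)
print(diagnosis)
for w in warnings:
    print(w)
```

Returning the warnings separately from the diagnosis mirrors the text, where a warning qualifies a result without changing it.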

4.2 Certainty of the Diagnosis

Because the rule-based model does not provide information about the certainty of the resulting diagnosis, a second, score-based model is incorporated. This model rests upon the observation that the stronger the alteration in metabolic activity, the more likely the subject under investigation suffers from AD. For every region evaluated by any rule a score is calculated. If the examined variation suggests AD the score returned is positive, otherwise negative. The stronger the evidence for or against the disease, the higher the absolute value of the score, until a maximum is reached. For the sake of simplicity it is assumed that this correlation is linear. To account for the fact that a positive deviation might have to be rated more strongly in favour of the disease than a negative deviation against it, or vice versa, the model permits separate definition of either correlation (see Fig. 2). Finally these scores are added up to an overall score. According to the construction of this model, a positive overall score indicates AD, whereas a negative score testifies that no dementia of this kind can be detected. See [9] for more details.
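One possible reading of this scoring model is sketched below in Python: the score is zero at the threshold value, grows linearly to a maximum as activity falls to the minimal value, falls linearly to a (negative) minimum as activity rises to the maximal value, and saturates beyond both ends. The parameter names follow the labels of Fig. 2, but the exact shape, the units and all numerical values are assumptions.

```python
from typing import Iterable


def roi_score(
    activity: float,
    minimal: float,
    threshold: float,
    maximal: float,
    score_max: float,
    score_min: float,
) -> float:
    """Piecewise-linear, saturating score for one ROI (score_min is negative)."""
    if not minimal < threshold < maximal:
        raise ValueError("expected minimal < threshold < maximal")
    if activity <= threshold:
        # Decrease side: 0 at the threshold, score_max at or below the minimal value.
        fraction = min((threshold - activity) / (threshold - minimal), 1.0)
        return fraction * score_max
    # Increase side: 0 at the threshold, score_min at or above the maximal value.
    fraction = min((activity - threshold) / (maximal - threshold), 1.0)
    return fraction * score_min


def overall_score(scores: Iterable[float]) -> float:
    """Sum of the ROI scores; a positive total is read as indicating AD."""
    return sum(scores)


# Hypothetical parameters (activity expressed as a percentage deviation).
params = dict(minimal=-30, threshold=-10, maximal=10, score_max=20, score_min=-10)
scores = [roi_score(a, **params) for a in (-20, -50, 0)]
print(scores, overall_score(scores))
```

The separate `score_max` and `score_min` parameters reflect the statement that evidence for and against the disease may be weighted differently.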

Fig. 2. Scoring model for alterations of metabolic activity in a ROI. [Figure: axes “activity” (minimal value, threshold value, maximal value) and “score” (min, 0, max); regions labelled “relative decrease of activity” and “relative increase of activity”.]

4.3 Explaining the Diagnosis

Every time a rule is evaluated, a comment on the decision made is presented. If, for example, the above rule processing the parietal superior region is true for both hemispheres, the comment would be:
“The high reduction in metabolic activity in the parietal cortex leads to the assumption that the patient suffers from Alzheimer’s disease.”
Thus the decision process is made transparent and comprehensible for the user.

5 Tests and Results

For evaluating the proposed system, experts from nuclear medicine sorted the test data set by visual diagnosis into three classes as follows:

class 1: Data sets that certainly do not belong to Alzheimer patients (data sets of healthy people as well as of people with other forms of dementia).
class 2: Data sets of patients suffering from AD with a high degree of certainty.
class 3: Data sets that can undoubtedly be classified as AD.

For testing pro or contra AD, class 1 (27 items) is evaluated versus classes 2 (16 items) and 3 (15 items); for the test of certainty, class 2 is evaluated versus class 3.

Testing Pro or Contra AD. All data sets of class 1 were classified correctly by the score-based model. The rule-based model by mistake classified one set as AD, but with a warning. Thus we consider it to be classified correctly. All data sets of classes 2 and 3 were rated correctly by the rule-based model, but two of them got negative scores. These two have to be regarded as classified incorrectly. Hence the sensitivity and specificity are 100% and 96% respectively.
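For reference, sensitivity and specificity can be recomputed from the counts given in this section, as in the Python sketch below. The mapping of counts to models is an interpretation, not something the text states: the reported 100% and 96% match the rule-based model taken alone when the one class 1 set flagged as AD is counted as a false positive, while the score-based counts (two AD sets with negative scores) would give different figures.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of AD data sets (classes 2 and 3) recognised as AD."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-AD data sets (class 1) recognised as non-AD."""
    return tn / (tn + fp)


N_AD = 16 + 15   # classes 2 and 3
N_NON_AD = 27    # class 1

# Rule-based model alone: all AD sets found, one class 1 set classified as AD.
rule_based = (sensitivity(N_AD, 0), specificity(N_NON_AD - 1, 1))

# Score-based model alone: two AD sets with negative scores, class 1 all correct.
score_based = (sensitivity(N_AD - 2, 2), specificity(N_NON_AD, 0))

print("rule-based : sensitivity %.1f%%, specificity %.1f%%"
      % (100 * rule_based[0], 100 * rule_based[1]))
print("score-based: sensitivity %.1f%%, specificity %.1f%%"
      % (100 * score_based[0], 100 * score_based[1]))
```

Stating which model and which counting convention a figure refers to avoids ambiguity when, as here, the two models disagree on a few data sets.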

Testing the Certainty of the Diagnosis. Comparing the achieved scores reveals that the average score of 90.5 in class 3 is much higher than in class 2, where it is 62.5. On the other hand, the scores obtained overlap to a great extent. The ranges are 25 to 98 in class 2 and 58 to 110 in class 3. Thus there is only a weak, but still observable, correlation.


Results. In the case that both models find the same diagnosis, the overall score can be used as a rough measure of certainty of the diagnosis. If the score contradicts the rule-based diagnosis, a more detailed investigation of the data set is advisable. Here the warnings provided can act as a starting point.

6 Conclusion

The knowledge-based system for the automated diagnosis of AD shows very good performance in terms of sensitivity and specificity. Since the pattern of abnormality is already the same in preclinical stages of the disease, this automated system has great potential in assisting the physician to diagnose AD in very early stages of the disease. Nevertheless larger sets of reference and test data would be desirable to validate the models, particularly the choice of threshold values. For facilitating future enhancements it would be advantageous to reimplement the models, developing them into a genuine expert system.

References

1. Behl, C., Sagara, Y.: Mechanism of amyloid beta protein induced neuronal cell death: current concepts and future perspectives. J. Neural. Transm. Suppl. 49 (1997) 125–134

2. Burdette, J.H., Minoshima, S., Borght, T.V., Tran, D.D., Kuhl, D.E.: Alzheimer disease: improved visual interpretation of PET images by using three-dimensional stereotaxic surface projections. Radiology 198 (1996) 837–843

3. Heiss, W.D., Szelies, B., Kessler, J., Herholz, K.: Abnormalities of energy metabolism in Alzheimer’s disease studies with PET. Ann. N. Y. Acad. Sci. 640 (1991) 65–71

4. Holman, B.L., Devous, M.D.: Functional brain SPECT: the emergence of a powerful clinical method. J. Nucl. Med. 33 (1992) 1888–1904

5. Mayeux, R., Sano, M.: Drug therapy: treatment of Alzheimer’s disease. N. Engl. J. Med. 341 (1999) 1670–1679

6. McKhann, G., Drachman, D., Folstein, M., Katzman, R., Price, D., Stadlan, E.M.: Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology 34 (1984) 939–944

7. Minoshima, S., Frey, K.A., Koeppe, R.A., Foster, N.L., Kuhl, D.E.: A diagnostic approach in Alzheimer’s disease using three-dimensional stereotactic surface projections of fluorine-18-FDG PET. J. Nucl. Med. 36 (1995) 1238–1248

8. Minoshima, S., Giordani, B., Berent, S., Frey, K.A., Foster, N.L., Kuhl, D.E.: Metabolic reduction in the posterior cingulate cortex in very early Alzheimer’s disease. Ann. Neurol. 42 (1997) 85–94

9. Oehm, S.: Entwurf und Implementation eines wissensbasierten Systems zur Diagnose der Alzheimer-Demenz. Diploma Thesis (unpublished), Dep. of Mathematics and Computer Science, Johannes Gutenberg University, Mainz, Germany (2002)

10. Talairach, J., Tournoux, P.: Co-Planar Stereotaxic Atlas of the Human Brain. Thieme Medical Publishers, New York (1988)


DEGEL: A Hybrid, Multiple-Ontology Framework for Specification and Retrieval of Clinical Guidelines

Yuval Shahar, Ohad Young, Erez Shalom, Alon Mayaffit, Robert Moskovitch, Alon Hessing, and Maya Galperin

Medical Informatics Research Center Department of Information Systems Engineering Ben Gurion University, Beer Sheva, Israel 84105

{yshahar,ohadyn,erezsh,mayafit,robertmo,hessing,gmaya}@bgumail.bgu.ac.il http://medinfo.ise.bgu.ac.il/medlab/

Abstract. Clinical Guidelines are a major tool in improving the quality of medical care. However, most guidelines are in free text, not machine comprehensible, and are not easily accessible to clinicians at the point of care. We introduce a Web-based, modular, distributed architecture, the Digital Electronic Guideline Library (DeGeL), which facilitates gradual conversion of clinical guidelines from text to a formal representation in a chosen guideline ontology. The architecture supports guideline classification, semantic markup, context-sensitive search, browsing, run-time application, and retrospective quality assessment. The DeGeL hybrid meta-ontology includes elements common to all guideline ontologies, such as semantic classification and domain knowledge. The hybrid meta-ontology also includes three guideline-content representation formats: free text, semi-structured text, and a formal representation. These formats support increasingly sophisticated computational tasks. All tools are designed to operate on all representations. We demonstrated the feasibility of the architecture and the tools for the Asbru and GEM guideline ontologies.

1 Introduction

Clinical guidelines (or care plans) are a powerful method for improving the quality of medical care [1] while reducing its escalating costs. Several of the major tasks involved in guideline-based care would benefit from automated support: specification and maintenance of clinical guidelines; search, retrieval, browsing, and visualization of relevant guidelines; examination of the eligibility of one or more patients for a given guideline, or of the applicability of one or more guidelines to a given patient; runtime application of guidelines; and retrospective assessment of the quality of the application of the guidelines.

Most clinical guidelines, however, are text-based and inaccessible to care providers, who need to match them to their patients and to apply them at the point of care. Similar considerations apply to the task of assessing retrospectively the quality of clinical-guideline application. Thus, there is an urgent need to facilitate automated guideline specification, dissemination, application, and quality assessment.


1.1 Automated Support to Clinical Guideline-Based Care

During the past 20 years, there have been several efforts to support complex guideline-based care over time in an automated fashion. Examples include ONCOCIN [2], T-HELPER [3], DILEMMA [4], EON [5], Asgaard [6], PROforma [7], the guideline interchange format (GLIF) [8], the European PRESTIGE project, the British Prodigy project [9], and the ActiveGuidelines model [10]. A recent framework, GEM [11], enables structuring of a text document containing a clinical guideline as an extensible markup language (XML) document, using a well-defined XML structure, although it is not based on an underlying computational model.

1.2 The Asgaard Project and the Asbru Language

In the Asgaard project [6], the first author and his colleagues designed an expressive guideline-representation language, Asbru. An Asbru specification includes conditions, such as eligibility criteria; control structures for the guideline’s body (e.g., sequential, concurrent, and periodic combinations of actions or sub-guidelines); preferences (utility functions); expected effects; and process and outcome intentions. A feature unique to Asbru is the use of explicit intentions, represented as temporal patterns, which supports intelligent quality assessment by representing the designer’s intentions regarding care-provider actions and patient outcomes.

We will use the Asbru ontology to demonstrate the various aspects of the current architecture. It is the default ontology we are currently using for the conversion process. We are also creating several Asbru-specific tools for runtime guideline application and for retrospective quality assessment of guideline-based care.

2 The Conversion Problem and the Hybrid-Representation Model

The existence of automated architectures for guideline representation makes the question “How will the large mass of free-text guidelines be converted to a formal machine-readable language?” a most pertinent one. The core of the problem is that expert physicians cannot (and need not) program in guideline-specification languages, while programmers and knowledge engineers do not understand the clinical semantics of the guidelines. In addition, text-based representations are useful for search and retrieval of relevant guidelines, while formal representations are essential for creating machine-readable, executable code. Thus, our guiding principle is that expert physicians should transform free-text guidelines into semi-structured, semantically meaningful representations, while knowledge engineers should convert semi-structured guidelines to a formal, executable language.

To gradually convert clinical guidelines to machine-comprehensible representations, we have developed a hybrid, multifaceted representation; an accompanying distributed architecture, the Digital Electronic Guideline Library (DeGeL); and a set of Web-based software tools. The tools move a set of clinical guidelines gracefully from text-based representations, through structured text labeled by the knowledge roles of a target ontology chosen by the editor, to fully formal, executable representations (Figure 1).


[Figure 1 diagram. Box labels: Free-text guideline; Semantic markup (semi-structuring); Adding a machine-readable formalization; Web-based guideline library; Context-sensitive search, retrieval, and visualization; Eligibility & applicability determination; Guideline runtime application; Retrospective quality assessment]

Fig. 1. The incremental conversion process in the DeGeL architecture. Input free-text guidelines are loaded into a markup editor; expert physicians index and mark up (structure) portions of the guidelines by semantic labels from a chosen target ontology. Knowledge engineers use an ontology-specific tool to add executable expressions in the formal syntax of that ontology.

Underlying the tools is the guiding principle mentioned above: Expert physicians use the tools to classify the guidelines along multiple semantic axes, and to semantically mark up existing text-based guidelines (i.e., to label portions of the text by the semantic labels of the target ontology), resulting in an XML document. Knowledge engineers convert the marked-up text into a machine-executable representation of the target ontology, using an ontology-dedicated tool. Different parts of a guideline might exist at different levels of specification (e.g., eligibility conditions might also include executable expressions, supporting automated eligibility determination). All formats co-exist in a structure defined by the hybrid meta-ontology (Figure 2).

Fig. 2. The hybrid guideline meta-ontology in DeGeL. (1) a pointer to one or more source ontologies of sources used by the guideline document, (2) a pointer to the semi-structured target ontology of the guideline document (e.g., Asbru, GEM), (3) a pointer to a formal version of the target ontology, and (4) several knowledge roles, independent of the target ontology, that characterize the document (e.g., domain knowledge, semantic indices, documentation).


3 The Hybrid Meta-ontology

To support the specification of a guideline in one or more different ontologies, the DeGeL architecture includes a hybrid guideline meta-ontology (see Figure 2); it distinguishes sources from guideline documents.

Uploading a guideline into the DeGeL library (e.g., a document published by a professional society) creates a source. A source can be named, searched, and retrieved, and is annotated using a source ontology documenting the source’s details (e.g., authors, date). However, a source cannot be indexed or applied to a patient.

A guideline document is a more complex structure, which can be indexed, retrieved, modified, and applied. A guideline document includes one or more sources; additional knowledge roles that are independent of the target ontology, such as documentation, classification, and the domain knowledge necessary for guideline application; and the semi-structured and fully structured (machine-comprehensible) representations of the guideline using the selected target ontology. The semi-structured representation corresponds roughly to the top-level and intermediate concepts of the target ontology. For Asbru, we included key entities such as conditions and intentions, but left out the low-level content. For example, temporal queries to the patient record are specified as semi-structured queries that are then fully formalized by the knowledge engineer.
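The distinction between a source and a guideline document can be illustrated with a small data model. The following Python sketch is only an illustration under our own assumptions: the class names, field names, and the example expression `age >= 18` are hypothetical and are not taken from the DeGeL implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Source:
    """An uploaded free-text guideline (hypothetical model)."""
    name: str
    text: str
    details: Dict[str, str] = field(default_factory=dict)  # e.g. authors, date


@dataclass
class KnowledgeRole:
    """One knowledge role of the target ontology, held at up to two levels."""
    semi_structured: str = ""      # marked-up text from the expert physician
    formal: Optional[str] = None   # expression added later by the knowledge engineer


@dataclass
class GuidelineDocument:
    """A guideline document: sources, semantic indices, and knowledge roles."""
    title: str
    target_ontology: str           # e.g. "Asbru" or "GEM"
    sources: List[Source] = field(default_factory=list)
    semantic_indices: Dict[str, List[str]] = field(default_factory=dict)
    roles: Dict[str, KnowledgeRole] = field(default_factory=dict)

    def formalized_roles(self) -> List[str]:
        """Names of the roles that already carry a formal expression."""
        return sorted(n for n, r in self.roles.items() if r.formal is not None)


doc = GuidelineDocument(title="Example guideline", target_ontology="Asbru")
doc.sources.append(Source("society-document", "free text ...", {"date": "2003"}))
doc.roles["filter-condition"] = KnowledgeRole("adult patient", formal="age >= 18")
doc.roles["intentions"] = KnowledgeRole("keep blood pressure controlled")
print(doc.formalized_roles())  # ['filter-condition']
```

In this sketch, different roles of the same document sit at different levels of specification, which mirrors the co-existence of formats described above.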

4 Hybrid Design-Time and Runtime Tools

Several DeGeL tools are used mostly to specify and retrieve guidelines, irrespective of a particular patient. Other tools are used mostly at runtime and require automated or manual access to patient data. All of the tools were designed to support the various formats implied by a hybrid representation.

4.1 The Uruz Tool: Semantic Markup of Guidelines

The Uruz Web-based guideline markup tool (Figure 3) enables medical experts to create new guideline documents. A source guideline is uploaded into DeGeL and can then be used by Uruz to create a new guideline document, marked up by the semantic labels of one of the target ontologies available in DeGeL. Uruz can also be used to create a guideline document de novo (i.e., without using any source) by writing directly into the knowledge roles of a selected target ontology. We are developing an Asbru-dedicated tool to add the formal-specification level.

Figures 3 and 4 show the Uruz semantic-markup interface. The user browses the source guideline in one window, and a knowledge role from the target ontology in the other window. She labels the source content (text, tables, or figures) by dragging it into the knowledge-role frame. Note that the editor can modify the contents or add new content. This makes implicit knowledge more explicit, further facilitating the task of the knowledge engineer who fully formalizes the guideline.


Fig. 3. The Uruz Web-based guideline markup tool. The tool’s basic interface is uniform across all guideline ontologies. The target ontology selected by the medical expert, in this case, Asbru, is displayed in the upper left tree; the guideline source is opened in the upper right frame. The expert physician highlights a portion of the source text (including tables or figures) and drags it for further modification into the bottom frame labeled by a semantic role chosen from the target ontology (here, filter condition). Note that contents can be aggregated from different locations in the source. The bottom left textbox, Element Comments, stores remarks on the currently selected knowledge role, thus supporting collaboration among guideline editors.

A more complex module embedded in Uruz, the plan-body wizard (PBW), is used for defining the guideline’s control structure (see Figure 4). It is the only module specific to the Asbru ontology, although such modules can be defined for other ontologies. The PBW enables a user to decompose the actions embodied in the guideline into atomic actions and other sub-guidelines, and to define the control structure relating them (e.g., sequential, parallel, repeated application). The PBW, used by medical experts, significantly facilitates the final formal specification by the knowledge engineer.

When a knowledge engineer needs to add a formal, executable expression to a knowledge role, she uses one of the ontology-specific Uruz modules (we are developing one specific to Asbru), which delves deeper into the syntax of the target ontology. For example, in our hybrid Asbru, conditions can include temporal patterns in an expressive time-oriented query language used by all of the application modules. To be truly sharable, guidelines need to be represented in a standardized fashion. Thus, Uruz enables the user to embed in the guideline document terms originating from standard vocabularies, such as ICD-9-CM for diagnosis codes, CPT-4 for procedure codes, and LOINC-3 for observations and laboratory tests. In each case, the user selects a term when needed, through a uniform, hierarchical search interface to our Web-based vocabulary server.


Fig. 4. The Asbru plan-body wizard (PBW) module. On the left, the guideline’s structure tree is displayed and updated dynamically as the user decomposes the guideline. On the upper right, the user is prompted with wizard-like questions to further specify the selected control structure. In the bottom right, the text of the source, current, or parent guidelines is displayed

4.2 The IndexiGuide Tool: Semantic Classification of Guidelines

To facilitate guideline retrieval, the medical expert indexes the guideline document by one or more intermediate or leaf nodes within one or more external (indexing) semantic-axis trees, using the IndexiGuide tool (Figure 5). Currently, the semantic axes are: (1) symptoms and signs (e.g., hypertension), (2) diagnostic findings (e.g., blood-cell counts), (3) disorders (e.g., endocrine disorders, neoplasms), (4) treatments (e.g., antibiotic therapy, surgery), (5) body systems and regions (e.g., pituitary gland), (6) guideline types (e.g., screening, prevention, management), and (7) guideline specialties (e.g., internal medicine). Semantic axes are typically headers of standardized vocabularies such as MeSH, ICD-9 or CPT.

4.3 The Vaidurya Tool: Context-Sensitive Search and Retrieval of Guidelines

The Vaidurya hybrid guideline search and retrieval tool exploits the existence of the free-text source, the semantic indices, and the marked semi-structured-text. Figure 6 shows the Vaidurya query interface. The user, performing a search, selects one or more concepts from one or more external (indexing) semantic axes, or scopes, to limit the overall search. The tool also enables the user to query marked-up guidelines for the existence of terms within the internal context of one or more target-ontology’s knowledge roles (e.g., pregnancy within the filter condition).


Fig. 5. The IndexiGuide semantic-classification tool. Domain experts index the guideline by one or more intermediate or leaf nodes (right frame) within one or more semantic axes (left frame), such as Disorders, Treatments, or Symptoms and Signs

Fig. 6. The Vaidurya Web-based, context-sensitive, guideline search and retrieval tool. The user defines the relevant search scope by indicating one or more nodes within the semantic axes (upper left and right frames). The search can be further refined by specifying terms to be found within the source text, and even (after selecting a target ontology such as Asbru), within the context of one or more particular knowledge roles of that ontology (middle right frame)

For external scopes, the default constraint is a conjunction (i.e., AND) of all selected axes (e.g., both a Cancer diagnosis and a Chemotherapy therapy) but a disjunction (i.e., OR) of concepts within each axis. For internal contexts, the default semantics are to search for a disjunction of the key words within each context, as well as among contexts (i.e., either Diabetes within the Filter Conditions context or Hypertension within the Effects context). The search results are browsed, both as a set and at each individual-guideline level, using a specialized guideline-visualization tool.
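The default search semantics can be made concrete with a short sketch. The Python code below is a minimal illustration, not the Vaidurya implementation: the function names, the dictionary layout of a guideline, and the use of case-insensitive substring matching for key words are all assumptions made for the example.

```python
from typing import Dict, List, Set


def matches_scope(indices: Dict[str, Set[str]], scope: Dict[str, Set[str]]) -> bool:
    """External scopes: AND across selected axes, OR among concepts in an axis."""
    return all(indices.get(axis, set()) & concepts
               for axis, concepts in scope.items())


def matches_context(roles: Dict[str, str], context: Dict[str, List[str]]) -> bool:
    """Internal contexts: OR among key words in a context and OR among contexts."""
    return any(keyword.lower() in roles.get(role, "").lower()
               for role, keywords in context.items()
               for keyword in keywords)


def search(guidelines, scope, context):
    """Return titles of guidelines that satisfy the scope and, if given, the context."""
    return [g["title"] for g in guidelines
            if matches_scope(g["indices"], scope)
            and (not context or matches_context(g["roles"], context))]


# Hypothetical library content used only for illustration.
guidelines = [
    {"title": "G1",
     "indices": {"disorders": {"cancer"}, "treatments": {"chemotherapy"}},
     "roles": {"filter condition": "not during pregnancy"}},
    {"title": "G2",
     "indices": {"disorders": {"cancer"}, "treatments": {"radiotherapy"}},
     "roles": {"effects": "may cause hypertension"}},
    {"title": "G3",
     "indices": {"disorders": {"diabetes"}, "treatments": {"chemotherapy"}},
     "roles": {}},
]

scope = {"disorders": {"cancer"}, "treatments": {"chemotherapy", "surgery"}}
print(search(guidelines, scope, {}))  # ['G1']

context = {"filter condition": ["diabetes"], "effects": ["hypertension"]}
print(search(guidelines, {"disorders": {"cancer"}}, context))  # ['G2']
```

In the first query G2 is excluded because none of its treatments matches, even though its disorder does; in the second, a single matching key word in any context suffices.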

4.4 The VisiGuide Tool: Browsing and Visualization of Guidelines

The VisiGuide browsing and visualization tool (Figure 7) enables users to browse a set of guidelines returned by the Vaidurya search engine and visualize their structure. It is linked to the DeGeL applications, allowing the user to return one or more selected guidelines for use within the Uruz markup tool or the IndexiGuide semantic classifier. VisiGuide makes no assumptions regarding the guideline’s ontology, although it can have extensions for specific ontologies (e.g., the Asbru plan-body).

VisiGuide organizes guidelines along the semantic axes in which they were found, distinguishing between axes that were requested in the query (e.g., disorders = breast carcinoma and therapy mode = chemotherapy) and axes that were not requested but that were originally used to classify a retrieved guideline (e.g., therapy mode = radiotherapy). Axes that were requested in the query but in which no guideline was found are highlighted (differently) as well.
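One possible way to compute these three groups is sketched below in Python. This is a hypothetical illustration only: the function name, the data layout, and the choice to group at the level of (axis, concept) pairs are our assumptions and do not describe the actual VisiGuide code.

```python
from typing import Dict, List, Set, Tuple

AxisConcept = Tuple[str, str]  # (axis name, concept)


def organize(results: Dict[str, Dict[str, Set[str]]],
             query: Dict[str, Set[str]]):
    """Group retrieved guidelines by requested and non-requested axis concepts."""
    requested: Dict[AxisConcept, List[str]] = {}
    unrequested: Dict[AxisConcept, List[str]] = {}
    for title, indices in results.items():
        for axis, concepts in indices.items():
            for concept in sorted(concepts):
                wanted = concept in query.get(axis, set())
                bucket = requested if wanted else unrequested
                bucket.setdefault((axis, concept), []).append(title)
    # Requested axis concepts under which no retrieved guideline was found.
    empty = [(axis, concept)
             for axis, concepts in query.items()
             for concept in sorted(concepts)
             if (axis, concept) not in requested]
    return requested, unrequested, empty


results = {"G1": {"disorders": {"breast carcinoma"},
                  "therapy mode": {"chemotherapy", "radiotherapy"}}}
query = {"disorders": {"breast carcinoma"},
         "therapy mode": {"chemotherapy"},
         "body systems": {"breast"}}
requested, unrequested, empty = organize(results, query)
print(unrequested)  # {('therapy mode', 'radiotherapy'): ['G1']}
print(empty)        # [('body systems', 'breast')]
```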

Fig. 7. An example of the VisiGuide interface in the multiple-ontology mode. In this mode, multiple guidelines, typically retrieved by the Vaidurya search engine, are displayed within the various semantic axes indexing them (left frame); the contents of knowledge roles relevant to the user are displayed and compared as a table (right frame). In the single-guideline mode, a guideline’s contents can be more deeply examined. The “Return Results” button returns the selected guidelines to the requesting application (e.g., to the Uruz markup tool).

In the multiple-guideline mode, a table listing the content of desired semi-structured knowledge roles, for all retrieved guidelines or for all guidelines that are indexed by a certain semantic axis, can be created on the fly by simply indicating the knowledge roles of interest in the target ontology by which the guidelines were marked up (semi-structured), thus enabling quick comparison of several guidelines. Several default views exist, such as for eligibility determination or quality assessment. In the single-guideline mode, a listing of the content of each of the knowledge roles or any combination can be requested, thus supporting actual application or quality assessment.

5 Discussion and a Preview of Future Work

Hybrid representations of clinical guidelines include any combination of free-text, semi-structured text, and machine-comprehensible formats in a chosen target guideline ontology. They cater for the different capabilities of expert physicians, who are expected to have only limited knowledge of the semantics of the chosen ontology, and knowledge engineers, who are expected to have full semantic and syntactic knowledge of the chosen ontology. By incrementally converting free-text guidelines to semi-structured and then formal specifications, we are gradually enhancing the sophistication of the automated services that the guideline’s representation can support: from full-text search, through context-sensitive search and visualization (sensitive to specific knowledge roles of the target ontology), to fully automated application and quality assessment. At the same time, the semi-structured view provides independent value: Search precision has been shown to improve significantly when the text is marked up [12], while displaying documents along a predefined meaningful ontology is highly preferred by users [13]. (We intend to add in the future a capability of searching within formal expressions in the case of the Asbru target ontology.) Furthermore, the tools we are developing for runtime application and quality assessment can exploit that intermediate representation level. Indeed, only a semi-structured representation is useful when no electronic patient record is available, and the attending physician or quality-assessment nurse is acting as the mediator to the patient record.

To control the use of the DeGeL tools, we have developed a detailed guideline authorization and authentication model, organized by medical-specialty groups, and distinguishing among multiple levels of access and permissions (e.g., read, write, modify, classify) for different representation formats. The default authorization includes no editing permissions whatsoever, but only search of the DeGeL contents.
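As a rough illustration of such a model, the Python sketch below keys permissions by user, medical-specialty group, and representation format, and falls back to search-only access. The class, the permission flags, and the user and group names are hypothetical and serve only to show the default-deny idea; they do not reproduce the DeGeL authorization code.

```python
from enum import Flag, auto
from typing import Dict, Tuple


class Permission(Flag):
    SEARCH = auto()
    READ = auto()
    WRITE = auto()
    MODIFY = auto()
    CLASSIFY = auto()


# Default authorization: search only, no editing permissions.
DEFAULT = Permission.SEARCH


class AccessModel:
    """Permissions per (user, specialty group, representation format)."""

    def __init__(self) -> None:
        self._grants: Dict[Tuple[str, str, str], Permission] = {}

    def grant(self, user: str, group: str, fmt: str, perms: Permission) -> None:
        key = (user, group, fmt)
        self._grants[key] = self._grants.get(key, DEFAULT) | perms

    def allowed(self, user: str, group: str, fmt: str, perm: Permission) -> bool:
        return bool(self._grants.get((user, group, fmt), DEFAULT) & perm)


model = AccessModel()
model.grant("dr-a", "oncology", "semi-structured", Permission.READ | Permission.WRITE)
print(model.allowed("dr-a", "oncology", "semi-structured", Permission.WRITE))  # True
print(model.allowed("dr-a", "oncology", "formal", Permission.WRITE))           # False
print(model.allowed("guest", "oncology", "formal", Permission.SEARCH))         # True
```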

We are also developing the Spock runtime-application module, which is specific to Asbru and currently focuses mainly on the semi-structured representation, and the Asbru-specific QualiGuide retrospective quality-assessment tool. Both tools use our architecture for intelligent interpretation and browsing of patient data, thus adding, besides the link to the clinical knowledge, a link to the patient’s data.

Preliminary assessment of the tools by our clinical colleagues is highly encouraging, and formal evaluations are under way. We have already experimented with the GEM and Asbru guideline-representation ontologies, and have shown the feasibility of marking up, searching, and displaying guidelines in both ontologies. We intend to add other ontologies such as GLIF.


Acknowledgments

This research was supported in part by NIH award No. LM-06806. We thank Samson Tu and Mor Peleg for useful discussions regarding the need for supporting the use of multiple guideline ontologies. Drs. Richard Shiffman and Bryant Karras assisted us in using their GEM ontology. Drs. Mary Goldstein, Susana Martins, Lawrence Basso, Herbert Kaizer, Aneel Advani, and Eitan Lunenfeld were extremely helpful in assessing the various DeGeL tools.

References

1. Grimshaw, J.M. and Russel, I.T. (1993). Effect of clinical guidelines on medical practice: A systematic review of rigorous evaluations. Lancet, 342: 1317–1322.

2. Tu, S.W., Kahn, M.G., Musen, M.A., Ferguson, J.C., Shortliffe, E.H., and Fagan, L.M. (1989). Episodic Skeletal-plan refinement on temporal data. Communications of ACM 32: 1439–1455.

3. Musen M. A., Carlson R. W., Fagan L. M., and Deresinski S. C. (1992). T-HELPER: Automated Support for Community-Based Clinical Research. Proceedings of the Sixteenth Annual Symposium on Computer Applications in Medical Care, Washington, D.C., 719-723.

4. Herbert, S.I., Gordon, C.J., Jackson-Smale, A., and Renaud Salis, J-L. (1995). Protocols for clinical care. Computer Methods and Programs in Biomedicine 48: 21–26.

5. Musen, M.A., Tu, S.W., Das, A.K., and Shahar, Y. (1996). EON: A component-based approach to automation of protocol-directed therapy. Journal of the American Medical Informatics Association 3(6): 367–388.

6. Shahar, Y., Miksch, S., and Johnson, P. (1998). The Asgaard project: A task-specific framework for the application and critiquing of time-oriented clinical guidelines. Artificial Intelligence in Medicine (14): 29-51.

7. Fox, J., Johns, N., and Rahmanzadeh, A. (1998). Disseminating medical Knowledge: the PROforma approach. Artificial Intelligence in Medicine, 14: 157-181.

8. Peleg M, Boxwala A. A., Omolola O., Zeng Q., Tu, S.W, Lacson R., Bernstam, E., Ash, N., Mork, P., Ohno-Machado, L., Shortliffe, E.H., and Greenes, R.A. (2000). GLIF3: The Evolution of a Guideline Representation Format In Overhage M.J., ed., Proceedings of the 2000 AMIA Annual Symposium (Los Angeles, CA, 2000), Hanley & Belfus, Philadelphia.

9. Johnson PD, Tu SW, Booth N, Sugden B, and Purves IN (2000). Using scenarios in chronic disease management guidelines for primary care. In Overhage M.J., Ed., Proceedings of the 2000 AMIA Annual Symposium (Los Angeles, CA, 2000), Hanley & Belfus, Philadelphia.

10. Tang PC and Young CY (2000). ActiveGuidelines: Integrating Web-Based Guidelines with Computer-Based Patient Records. In Overhage M.J., Ed., Proceedings of the 2000 AMIA Annual Symposium (Los Angeles, CA, 2000), Hanley & Belfus, Philadelphia.

11. Shiffman RN, Karras BT, Agrawal A, Chen R, Marenco L, Nath S. (2000). GEM: a proposal for a more comprehensive guideline document model using XML. Journal of the American Medical Informatics Association 7(5): 488-498.

12. Purcell, G. P., Rennels, G. D., and Shortliffe, E. H. (1997). Development and Evaluation of a Context-Based Document Representation for Searching the Medical Literature. International Journal of Digital Libraries 1:288-296.

13. W. Pratt. Dynamic Organization of Search Results Using the UMLS. 1997 AMIA Annual Fall Symposium, Nashville, TN, 480-484. 1997

Experiences in the Formalisation and Verification of Medical Protocols

Mar Marcos¹, Michael Balser², Annette ten Teije³, Frank van Harmelen³, and Christoph Duelli²

¹ Universitat Jaume I, Dept. of Computer Engineering and Science, Campus de Riu Sec, 12071 Castellón, Spain

² Universität Augsburg, Lehrstuhl Softwaretechnik und Programmiersprachen, 86135 Augsburg, Germany

³ Vrije Universiteit Amsterdam, Dept. of Artificial Intelligence, De Boelelaan 1081a, 1081HV Amsterdam, Netherlands

Abstract. Medical practice protocols or guidelines are statements to assist practitioners and patient decisions about appropriate health care for specific circumstances. In order to reach their potential benefits, protocols must fulfill strong quality requirements. Medical bodies worldwide have made efforts in this direction, mostly using informal methods such as peer review of protocols. We are concerned with a different approach, namely the quality improvement of medical protocols by formal methods. In this paper we report on our experiences in the formalisation and verification of a real-world medical protocol. We have fully formalised a medical protocol in a two-stage formalisation process. Then, we have used a theorem prover to confirm whether the protocol formalisation complies with certain protocol properties. As a result, we have shown that formal verification can be used to analyse, and eventually improve, medical protocols.

1 Introduction

Medical practice protocols or guidelines¹ are “systematically developed statements to assist practitioners and patient decisions about appropriate health care for specific circumstances” [1]. They contain more or less precise recommendations about the diagnosis tests or the interventions to perform, or about other aspects of clinical practice. These recommendations are based on the best empirical evidence available at the moment. Among the potential benefits of protocols, we can highlight the improvement of health-care outcomes [2]. In fact, it has been shown that adherence to protocols may reduce the costs of care by up to 25% [3]. In order to reach their potential benefits, protocols must fulfill strong quality requirements. Medical bodies worldwide have made efforts in this direction, e.g. elaborating appraisal documents that take into account a variety of aspects, of both protocols and their development process (see e.g. [4]). However, these initiatives are not sufficient since they rely on informal methods and notations.

We are concerned with a different approach, namely the quality improvement of medical protocols through formal methods. Currently, protocols are described using a combination of different formats, e.g. text, flow diagrams and tables. The idea of our work is to translate these descriptions into a more formal language, with the aim of analysing different protocol properties. In addition to the advantages of this kind of formal verification, making these descriptions more formal can serve to expose problematic parts in the protocols.

¹ In this paper we use the terms guideline and protocol interchangeably. However, the term protocol is in general used for a more specific version of a guideline.

In this paper we report on our experiences in the formalisation and verification of a medical protocol for the management of jaundice in newborn babies. This work is part of the IST Protocure² project, a recently concluded project that assessed the application of formal methods for protocol quality improvement.

The formalisation of medical protocols can be tackled at different degrees. Since we aim at a formal verification, we have chosen the logic of a theorem prover, KIV [5], as target formalism. Prior to the KIV formalisation step, we have carried out a modelling step using a specific-purpose knowledge representation language for medical protocols, Asbru [6]. This gradual formalisation strategy has made the formalisation task feasible, which in turn has enabled us to use theorem proving.

The structure of this paper roughly follows the process that the protocol has undergone: Asbru modelling, KIV formalisation, and KIV verification. First, section 2 introduces the jaundice protocol. Then section 3 describes the Asbru language and the model of the protocol in this language. The next step has been the translation of the Asbru protocol into the formal notation of KIV. Section 4 describes this step, and section 5 presents the results of the subsequent verification step. Finally, section 6 concludes the paper.

2 The Jaundice Protocol

Jaundice, or hyperbilirubinemia, is a common disease in newborn babies which is caused by elevated bilirubin levels in blood. Under certain circumstances, high bilirubin levels may have detrimental neurological effects and thus must be treated. In many cases jaundice disappears without treatment but sometimes phototherapy is needed to reduce the levels of total serum bilirubin (TSB), which indicates the presence and severity of jaundice. In a few cases, however, jaundice is a sign of a severe disease.

The jaundice protocol of the American Academy of Pediatrics [7] is intended for the management of the disease in healthy term³ newborn babies. The guideline is a 10-page document which contains knowledge in various notations: the main text; a list of factors to be considered when assessing a jaundiced newborn; two tables, one for the management of healthy term newborns and another for the treatment options for jaundiced breast-fed ones; and a flowchart describing the steps in the protocol.

The protocol consists of an evaluation (or diagnosis) part and a treatment part, to be performed in sequence. During the application of the protocol, as soon as the possibility of a more serious disease is uncovered, the recommendation is to exit without any further action.

3 Modelling the Jaundice Protocol in Asbru

In the first stage of protocol formalisation we have used a specific-purpose knowledge representation language. Different languages have been proposed to represent medical protocols and their specific features (see [8]). Most of them consider protocols as a composition of actions to be performed and conditions to control these actions [9]. However, although the trend is changing lately, many of the protocol representation languages in the literature are not formal enough. For instance, they often incorporate many free-text elements which do not have clear semantics. Exceptions to this are PROforma [10] and Asbru [6]. In this work we have used Asbru, mainly because it is more precise in the description of a variety of medical aspects.

² http://www.protocure.org/
³ Defined as 37 completed weeks of gestation.

3.1 Asbru: A Knowledge Representation Language for Protocols

The main aspects of Asbru are: (i) in Asbru a medical protocol is considered as a plan skeleton with sub-plans in the sense of AI planning, (ii) it is possible to specify the intentions of a plan in addition to the actions of a plan, (iii) it is possible to specify a variety of control structures within a plan, and (iv) it provides a rich language to specify time annotations. Below we give a short description of the main constructs of the Asbru language (see [6] for more details).

A medical protocol is considered in Asbru as a hierarchical plan. The main components of a plan are intentions, conditions, effects, and plan-body. Furthermore, a plan can have arguments and has the possibility to return a value. Next we briefly discuss some of these components.

Intentions are the high-level goals of a plan. Intentions can be expressed in terms of achieving, maintaining or avoiding a certain state or action. Such states or actions can be intermediate or final (overall). For example, the label “achieve intermediate-state” means that sometime during the execution of the plan, a certain state must be achieved. In total there are twelve possible forms of intention: [achieve/maintain/avoid] [intermediate/overall]-[state/action].
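The count of twelve follows from combining three verbs, two scopes, and two targets (3 × 2 × 2). A short Python enumeration makes this explicit; the variable names are ours, and the label format simply follows the pattern given above.

```python
from itertools import product

VERBS = ("achieve", "maintain", "avoid")
SCOPES = ("intermediate", "overall")
TARGETS = ("state", "action")

# Every combination of verb, scope, and target yields one intention form.
INTENTION_FORMS = [f"{verb} {scope}-{target}"
                   for verb, scope, target in product(VERBS, SCOPES, TARGETS)]

print(len(INTENTION_FORMS))  # 12
print(INTENTION_FORMS[0])    # achieve intermediate-state
```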

A variety of conditions can be associated with a plan, which define different aspects of its execution. The most important types of conditions are the following:

– filter conditions, which must be true before the plan can be started.
– abort conditions, which define when a started plan must be aborted.
– complete conditions, which define when a started plan can complete successfully.
– activate conditions, with possible values “manual” or “automatic”. If the activate mode is manual, the user is asked for confirmation before the plan is started.
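The interplay of these conditions can be sketched as a small state-transition function. The Python code below is a simplified illustration under our own assumptions: the state names ("considered", "rejected", "activated", "aborted", "completed") and the rule that an abort condition takes precedence over a complete condition are hypothetical choices for the example, not the Asbru semantics as defined in [6].

```python
def step(state: str, *, filter_ok: bool = False, abort: bool = False,
         complete: bool = False, activate: str = "automatic",
         confirmed: bool = False) -> str:
    """Advance a plan by one step, given the current truth of its conditions."""
    if state == "considered":
        if not filter_ok:
            return "rejected"        # filter condition must hold before starting
        if activate == "manual" and not confirmed:
            return "considered"      # wait for the user's confirmation
        return "activated"
    if state == "activated":
        if abort:
            return "aborted"         # assumption: abort is checked first
        if complete:
            return "completed"
    return state                     # no condition applies; state is unchanged


print(step("considered", filter_ok=True, activate="manual"))  # considered
print(step("considered", filter_ok=True, activate="manual", confirmed=True))  # activated
print(step("activated", complete=True))  # completed
```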

The plan-body contains the actions and/or sub-plans to be executed as part of the plan. The main forms of plan-body are the following:

– user-performed: an action to be performed by the user, which requires user interaction and thus is not further modelled.
– single step: an action which can be either an activation of a sub-plan, an assignment of a variable, a request for an input value or an if-then-else statement.
– subplans: a set of steps to be performed in a given order. The possibilities are: in sequence (sequentially), in parallel (parallel), in any possible sequential order (any-order), and in any possible order, sequential or not (unordered).
– cyclical plan: a repetition of actions over time periods.


In the case of subplans, it is necessary to specify a waiting strategy, which describes the plans that must be completed so that the parent plan can be considered successfully completed. It is possible to specify e.g. whether all the subplans should be executed (“wait-for ALL”) or not (e.g. “wait-for ONE”, or “wait-for” some specific plan).
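The effect of a waiting strategy can be sketched as a simple predicate over the set of completed sub-plans. The following Python fragment is only an illustration under our own assumptions: the function name, its parameters, and the use of a list of names for the "specific plan" case are hypothetical and not part of Asbru.

```python
def parent_can_complete(completed, subplans, wait_for="ALL"):
    """Decide whether the parent plan may count as successfully completed.

    completed -- names of the sub-plans that have completed so far
    subplans  -- names of all sub-plans in the plan-body
    wait_for  -- "ALL", "ONE", or a list of specific plan names (assumed encoding)
    """
    completed = set(completed)
    if wait_for == "ALL":
        return set(subplans) <= completed
    if wait_for == "ONE":
        return bool(set(subplans) & completed)
    # Otherwise: wait for the specific plans that were listed.
    return set(wait_for) <= completed
```

For example, with sub-plans A and B of which only A has completed, "wait-for ONE" is satisfied while "wait-for ALL" is not.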

Time annotations can be associated to different Asbru elements (e.g. intentions and conditions). A time annotation specifies (1) in which interval things must start, (2) in which interval they must end, (3) their minimal and maximal duration, and (4) a reference time-point. The general scheme for a time annotation is ([EarliestStarting, LatestStarting] [EarliestFinishing, LatestFinishing] [MinDuration, MaxDuration] REFERENCE). Any of these elements can be left undefined, allowing for uncertainty in the specification of time annotations.
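One way to picture this scheme is as a record of six optional bounds plus a reference point. The sketch below is a hedged illustration: the class and method names are ours, undefined elements are modelled as None, and we assume that the starting and finishing bounds are offsets from the reference time-point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeAnnotation:
    """Six optional bounds of a time annotation; None means 'undefined'."""
    earliest_start: Optional[float] = None
    latest_start: Optional[float] = None
    earliest_finish: Optional[float] = None
    latest_finish: Optional[float] = None
    min_duration: Optional[float] = None
    max_duration: Optional[float] = None

    def admits(self, start, finish, reference=0.0):
        """True if the interval [start, finish] satisfies every defined bound."""
        def within(value, low, high):
            return (low is None or low <= value) and (high is None or value <= high)

        return (within(start - reference, self.earliest_start, self.latest_start)
                and within(finish - reference, self.earliest_finish, self.latest_finish)
                and within(finish - start, self.min_duration, self.max_duration))


# An annotation of the shape ([4h, -] [-, 6h] [-, -] REFERENCE), in hours.
window = TimeAnnotation(earliest_start=4, latest_finish=6)
```

Under these assumptions, window.admits(4.5, 5.5) holds, while an interval starting at 3 h or finishing at 7 h is rejected.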

3.2 Asbru Model of Jaundice Protocol

Like the original document, the Asbru model of jaundice protocol has as main components a diagnostics part and a treatment part. It is made up of about 40 plans and has a length of 16 pages in a simplified Asbru notation. Figure 1 shows the overall structure of the protocol as a hierarchy of plans.

The treatment phase, on which we focus here, consists of two parallel parts, namely the actual treatments and a cyclical plan asking for the input of new age and TSB values every 12 to 24 hours. Regarding the treatments (label (-) in figure 1), either the regular ones (“Regular-treatments”) or an exchange transfusion (“Exchange-transfusion”) can take place depending on the bilirubin level. The “Regular-treatments” plan contains the main treatment procedure. It consists of two parts to be performed in any possible order (unordered): the study of feeding alternatives and the different therapies (see label (*)). The plans in group (*) can be tried in any order, one at a time.

Figure 2 shows the “Phototherapy-intensive” plan, which describes one of the therapies. Its plan-body simply contains a sub-plan activation pointing to a user-performed action. One of its intentions is attaining normal (or “observation”) bilirubin levels. It also contains different conditions, e.g. one of the abort conditions specifies that the plan should abort as soon as it fails to reduce the bilirubin levels in 4 hours.

4 Formalising the Jaundice Protocol in KIV

In the second stage of the formalisation process we have used the KIV verification tool [5]. KIV is an interactive theorem prover with strong proof support for higher-order logic and elaborate heuristics for automation. Currently, special proof support for temporal logic and parallel programs is being added. In contrast to fully automatic verification tools, the use of the interactive KIV tool allows for the verification of large and complex systems, as has been shown by its application to a number of real-world systems (distributed systems, control systems, etc.).

4.1 KIV

KIV supports the entire software development process, i.e. the specification, the implementation and the verification of software systems. Next we briefly describe the relevant aspects of KIV for Asbru specification and verification needs.


Fig. 1. Overview of the jaundice protocol in Asbru. The main entry point of the protocol is the “Diagnostics-and-treatment-hyperbilirubinemia” plan; the three “Check-for-...” plans are Asbru artifacts to model a continuous monitoring of TSB level and two check-ups at temporally specified intervals. The plan “Diagnostics-and-treatment-hyperbilirubinemia” is divided into a diagnostics and a treatment subplan, to be executed sequentially.

For specification, three aspects are important: specifications can be structured, and both functional and operational system aspects can be described. A specification is broken down into smaller and more tractable components using structuring operations such as union and enrichment, that can be used to combine more simple specifications. For functional aspects, algebraic specifications are used to specify abstract data types.

Complex operational behaviour can be specified using parallel programs. Programs in KIV can contain assignments (v := τ), conditionals (if ϕpl then ψ1 else ψ2), loops (while ϕpl do ψ), local variables (var v = τ in ψ), nondeterministic choices (choose ϕ or ψ), interleaving (ϕ || ψ) and synchronisation points (await ϕpl). For a better support of Asbru, additional basic constructs have been implemented: interrupts (break ψ if ϕpl), for modelling different plan conditions; and synchronous parallel execution (ϕ ||s ψ), as well as any-order execution (ϕ ||a ψ), for a more direct translation of plan-bodies. With the help of these constructs, the main features of Asbru can be directly translated. Others still need to be encoded using additional program variables.

    plan Phototherapy-intensive
      intentions
        achieve overall-state: (bilirubin = observation)
        maintain intermediate-state:
          (and (TSB-decrease = yes) in ([4h, -] [-, 6h] [-, -] SELF)
               (TSB-change ≥ 1) in ([4h, -] [-, 6h] [-, -] SELF) )
      conditions
        filter-precondition:
          (or (bilirubin = phototherapy-intensive) in NOW
              normal-phototherapy-failure)
        abort-condition:
          (or (and (bilirubin ≠ phototherapy-intensive) in NOW
                   (not normal-phototherapy-failure)) /* and */
              intensive-phototherapy-failure:
                (and (bilirubin = phototherapy-intensive) in NOW
                     (or (TSB-decrease = no) in ([4h, -] [-, -] [-, -] SELF)
                         . . . ) /* or */
                ) /* and */
          ) /* abort condition */
      plan-body
        Prescribe-intensive-phototherapy

Fig. 2. “Phototherapy-intensive” plan.

Concerning the verification, we use a variant of Interval Temporal Logic (ITL) [11] to formulate properties. This logic is first-order and allows finite and infinite intervals. Here we restrict ourselves to the temporal operators always (□ ϕ), eventually (◇ ϕ), next (○ ϕ), and laststep, which is true only in the last step of an interval. Single transitions are expressed as first-order relations between unprimed and primed variables, where the latter represent the value of the variable in the next state. For example, the formula v = 0 ∧ (□ v′ = v + 1) → ◇ v = n states that, if variable v is initially 0, and the value v′ in the next state is always incremented by one, then eventually the variable will be equal to an arbitrary natural number n. Finally, the proof technique for verifying parallel programs in KIV is symbolic execution with induction.
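To make the reading of the example formula concrete, the sketch below evaluates it on a finite, non-empty prefix of a run, represented as a Python list of values of v. This is only a testing aid under our own simplifications (function names are ours); it is not the proof technique used in KIV, which works by symbolic execution with induction and also covers infinite intervals.

```python
def always(trace, transition):
    """'Always' on a finite trace: the relation holds for every state and its successor."""
    return all(transition(cur, nxt) for cur, nxt in zip(trace, trace[1:]))


def eventually(trace, predicate):
    """'Eventually' on a finite trace: some state satisfies the predicate."""
    return any(predicate(state) for state in trace)


def example_holds(trace, n):
    """Evaluate  v = 0 and (always v' = v + 1) implies (eventually v = n)."""
    premise = trace[0] == 0 and always(trace, lambda v, v_next: v_next == v + 1)
    return (not premise) or eventually(trace, lambda v: v == n)
```

On the prefix 0, 1, ..., 9 the formula holds for n = 7. On the shorter prefix 0, 1, 2 the check fails for n = 7, which only shows that the prefix is too short to witness the eventuality, not that the formula is invalid.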

4.2 KIV Formalisation of Jaundice Protocol

In order to formally analyse Asbru plans in a first attempt, we have translated them into parallel programs. The translation of the Asbru model into KIV has been done in a structure-preserving way, by mapping each Asbru plan into a KIV specification containing a parallel program. Thus, the structure of the jaundice protocol in KIV roughly mirrors the Asbru model in figure 1. This is one of the key ideas of our work, because it gives the possibility to obtain some feedback from the formalisation and verification phases in terms of the Asbru model, and to exploit this structure during proof attempts. Table 1 shows some of the patterns that we used in the translation of Asbru plans.

In many cases the KIV translation closely follows the structure of the original Asbru plan, except for small details. Other translations, however, needed additional encodings to represent the Asbru elements not directly supported by KIV. The example in figure 3, corresponding to the plan “Phototherapy-intensive”, serves to illustrate the kind of translations that we have obtained. This translation includes an await construct to model the filter preconditions of the plan, as well as an interrupt (break) to monitor the conditions under which the plan should abort. The KIV plan also shows the way in which time annotations can be encoded, with the help of additional variables holding the time at which a plan has been activated (such as phototherapy-intensive-activated).

Table 1. Translation patterns of some Asbru constructs into KIV.

    Asbru                               KIV
    ----------------------------------  ------------------------------------------
    filter precondition ϕ NOW body      await ϕ; body
    filter precondition ϕ body          if ϕ then body
    complete condition ϕ body           break body if ϕ
    abort condition ϕ body              break body if ϕ
    <<name>> (plan activation)          <<name>>#(...) (procedure call)
    do type=sequentially P1,... Pn      P1;... Pn
    do type=any-order P1,... Pn         P1 ||a ... Pn
    wait-for Pi body                    break body if some expression on Pi-state

    Phototherapy-intensive#(var phototherapy-normal-prescription-activated, patient-data,
                                time, phototherapy-intensive-activated)
    begin
      await   get-bilirubin(patient-data.tsb) = phototherapy-intensive
            ∨ get-bilirubin(patient-data.tsb) = phototherapy-normal
              ∧ phototherapy-normal-prescription-activated ≠ ⊥
              ∧ 4 ≤ time - phototherapy-normal-prescription-activated.value
              ∧ ¬ get-decrease(patient-data.tsb);
      phototherapy-intensive-activated := mk-value(time);
      break
        prescribe-intensive-phototherapy#(; time)
      if   get-bilirubin(patient-data.tsb) ≠ phototherapy-intensive
         ∨ get-bilirubin(patient-data.tsb) = phototherapy-intensive
           ∧ ( 4 ≤ time - phototherapy-intensive-activated.value
               ∧ ¬ get-decrease(patient-data.tsb)
             ∨ . . . )
    end

Fig. 3. KIV translation of “Phototherapy-intensive” plan.

5 Verifying the Jaundice Protocol in KIV

After the formalisation of the jaundice protocol, we have worked on the verification of several protocol properties using the KIV system. Protocol properties are expressed in the previously introduced variant of ITL. For instance, Phototherapy-intensive#(. . .) ∧ (□ time′′ = time′ + 1) → ◇ laststep states that, if the program Phototherapy-intensive is executed, then execution will eventually reach the last step, i.e. it terminates. However, the plan only terminates under the additional assumption that time is incremented by one in each step.

As part of the Protocure project, we identified a number of protocol properties which were deemed important from a medical point of view. We distinguished properties at the implementation level, dependent on the Asbru language, from properties at the conceptual level, which are protocol-dependent. Properties at the conceptual level reflect the verification needs in a practical application and hence have been the target of our formal verification. Among them, the correctness of plan intentions and the properties making reference to indicators were deemed of particular interest. The correctness of intentions aims at ensuring that the intentions of a plan follow from its body. As for the properties about indicators, the goal is to verify that the protocol results in actions that comply with certain quality criteria defined either in the protocol itself or by external sources. In our case, we have exploited the indicators for jaundice treatment defined by the MAJIC (Making Advances against Jaundice in Infant Care) Committee [12].

5.1 Verification of one Intention of “Phototherapy-Intensive” Plan

One of the intentions of the plan “Phototherapy-intensive” is the reduction of the TSB levels by at least 1 mg/dl within 4 to 6 hours. We can view this as a property that the plan should satisfy, i.e. while executing the plan, there should be such a decrease. This property was initially translated to the following ITL formula:

    □ laststep
      ∨ pti-state ≠ activated ∧ pti-state′ = activated
        → /* time annotation */
              4 ≤ time − pti-acttime.value
            ∧ time − pti-acttime.value ≤ 6
          → /* property */
              get-decrease(pd.tsb, pti-acttime.value)
            ∧ get-change(pd.tsb, pti-acttime.value) ≥ 1
          unless pti-state′ ≠ activated

Informally, this formula says that, if “Phototherapy-intensive” is activated, when 4 to 6 hours have elapsed, there should be a decrease in TSB levels greater than or equal to 1.
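The informal reading can be checked against recorded data with a few lines of Python. This is a sketch under our own assumptions: the sample representation and function name are hypothetical, the decrease is taken as the difference between the TSB value at activation and the current value, and the samples are assumed to cover only the period in which the plan is active (which stands in for the "unless" clause).

```python
def intention_holds(samples, act_time, act_tsb):
    """Check the informal reading of the intention on recorded samples.

    samples  -- list of (time_in_hours, tsb_in_mg_per_dl) while the plan is active
    act_time -- time at which the plan was activated
    act_tsb  -- TSB value at activation
    """
    for time, tsb in samples:
        elapsed = time - act_time
        if 4 <= elapsed <= 6:          # time annotation
            if act_tsb - tsb < 1:      # property: decrease of at least 1
                return False
    return True
```

A run with samples (2, 18.0), (4, 16.5), (6, 16.0) after activation at time 0 with TSB 18.0 satisfies the check, whereas a single sample (5, 17.5) violates it.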

The property was successfully proved in KIV, in a fully automatic proof. However, the proof was successful only after a number of proof attempts. These attempts uncovered errors in the formulation of the intention, which led to an improved formal semantics for Asbru intentions. This property was somewhat “easy” to prove, because the intention is enforced by one of the abort conditions of the plan. In the next section we present a more difficult proof, which required identifying the conditions under which the property should hold. As we will see, these conditions describe the most usual cases of newborn jaundice.

5.2 Verification of MAJIC Indicator #7

The MAJIC indicators that appear in [12] have been refined by the same organisation into different medical review criteria for the evaluation and treatment of jaundice. This includes a set of 11 criteria which jaundice protocols must comply with. Among these criteria, we selected indicator #7, which is stated as follows:

    INCLUSIONS  If any phototherapy initiated
    CRITERIA    No more than one serum bilirubin level drawn after phototherapy is discontinued

The rationale of this indicator is beyond the scope of this work. It was translated into the following temporal formula:

    □ laststep
      ∨ pd.under-phototherapy ∧ ¬ pd′.under-phototherapy
        ∧ pd.tsb = TSB₀
        → □ pd.tsb = TSB₁ ∧ TSB₁ ≠ TSB₀ → □ pd.tsb = TSB₁

Informally, when phototherapy is discontinued, if another TSB value is measured, then there will be no more TSB measurements (TSB history will stay the same).
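The informal statement can also be tested on a recorded run. The sketch below is an illustration under our own assumptions: each step of the run is a pair (under_phototherapy, tsb), the function name is hypothetical, and, as in the formula, a new measurement is only visible as a change of the TSB value.

```python
def complies_with_indicator_7(states):
    """states -- list of (under_phototherapy, tsb) pairs, one per step.

    After phototherapy is discontinued, the TSB value may change at most
    once (one new measurement); a second change violates the indicator.
    """
    for i in range(len(states) - 1):
        under_now, tsb0 = states[i]
        under_next, _ = states[i + 1]
        if under_now and not under_next:      # phototherapy discontinued
            changes = 0
            previous = tsb0
            for _, tsb in states[i + 1:]:
                if tsb != previous:
                    changes += 1
                    previous = tsb
            if changes > 1:
                return False
    return True
```

A run in which TSB goes from 15 to 13 after discontinuation and then stays at 13 complies; a run in which it changes again, to 12, does not.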

Several attempts were made to prove that “Regular-treatments” complies with this indicator, which uncovered some problems in the formalisation of the protocol. This insight was used to enhance the translation patterns for Asbru plans. Finally, it was proved that the property does not hold. A counterexample was found, which consists in applying phototherapy once and then doing observation for more than 24 hours, allowing “Treatment-hyperbilirubinemia” to measure TSB twice. The analysis of proof attempts served to identify the assumptions under which the indicator should be satisfied: (1) If phototherapy is discontinued, bilirubin levels are normal (or “observation”); (2) Observation plan will run for less than 24 hours; and (3) After phototherapy and observation, bilirubin levels will still be “observation”. These assumptions were given to medical experts for review, who concluded that they capture the most usual cases, i.e. for most newborns the assumptions hold and the protocol satisfies the indicator. Thus the assumptions explicitly define the cases in which the indicator is satisfied. They could be used to improve the original protocol, e.g. to document the cases in which the indicator might not be satisfied.

6 Conclusions

In this paper we have shown that it is possible to formalise a significant piece of medical knowledge to such an extent that it can be used as the basis for formal verification, and that this verification is indeed possible. We have fully formalised a real-world medical protocol in a two-stage formalisation process. Then, we have used a theorem prover to systematically analyse whether the formalisation complies with certain (medically relevant) protocol properties.

The most important contribution of our effort is showing that it is possible to formally analyse medical protocols. If a protocol is developed with certain goals in mind (e.g. intentions or indicators), verification can serve to check whether the protocol actually complies with them. Even if this is not the case, the verification attempts can be of help in obtaining counterexamples and/or assumptions, which can be eventually used to improve the original document.


Obviously, this achievement comes at a price: a significant amount of work has been necessary for such an effort. Although we are not in a position to make strong quantitative statements, the formalisation and verification exercise reported in this paper has taken over a person-year to complete. However, this has been our first attempt in the direction of verifying medical protocols with mathematical rigour. We expect that the necessary effort should decrease in the future, e.g. with a more direct KIV support for Asbru protocols. Furthermore, we would argue that the improvement of the quality of medical practice protocols is worth additional effort.

Acknowledgements

This work has been supported by the European Commission, under contract number IST-2001-33049–Protocure. We want to thank all Protocure members, without whom this work would not have been possible.

References

1. Field, M., Lohr, K., eds.: Clinical Practice Guidelines: Directions for a New Program. National Academy Press, Washington D.C., USA (1992)

2. Woolf, S., Grol, R., Hutchinson, A., Eccles, M., Grimshaw, J.: Potential benefits, limitations, and harms of clinical guidelines. British Medical Journal 318 (1999) 527–530

3. Clayton, P., Hripsak, G.: Decision support in healthcare. Int. J. of Biomedical Computing 39 (1995) 59–66

4. AGREE Collaboration: Appraisal of Guidelines for Research & Evaluation (AGREE) Instrument (2001) Obtained in http://www.agreecollaboration.org/.

5. Balser, M., Reif, W., Schellhorn, G., Stenzel, K., Thums, A.: Formal system development with KIV. In Maibaum, T., ed.: Fundamental Approaches to Software Engineering. Number 1783 in LNCS, Springer (2000)

6. Shahar, Y., Miksch, S., Johnson, P.: The Asgaard project: a task-specific framework for the application and critiquing of time-oriented clinical guidelines. AI in Medicine 14 (1998) 29–51

7. AAP: American Academy of Pediatrics. Practice parameter: management of hyperbilirubinemia in the healthy term newborn. Pediatrics 94 (1994) 558–565

8. Elkin, P., Peleg, M., Lacson, R., Bernstam, E., Tu, S., Boxwala, A., Greenes, R., Shortliffe, E.: Toward Standardization of Electronic Guidelines. MD Computing 17 (2000) 39–44

9. Miksch, S.: Plan Management in the Medical Domain. AI Communications 12 (1999) 209–235

10. Fox, J., Johns, N., Lyons, C., Rahmanzadeh, A., Thomson, R., Wilson, P.: PROforma: a general technology for clinical decision support systems. Computer Methods and Programs in Biomedicine 54 (1997) 59–67

11. Moszkowski, B.: A temporal logic for multilevel reasoning about hardware. IEEE Computer 18 (1985) 10–19

12. MAJIC: MAJIC Steering Committee Meets. MAJIC Newsletter 1 (1998)


Enhancing Conventional Web Content with Intelligent Knowledge Processing

Rory Steele and John Fox

Cancer Research UK, Advanced Computation Laboratory, Lincoln’s Inn Field, London, England, WC2A 3PX

{rory.steele,john.fox}@cancer.org.uk

Abstract. The Internet has revolutionized the way knowledge can be accessed and presented. However, the explosion of web content that has followed is now producing major difficulties for effective selection and retrieval of information that is relevant for the task in hand. In disseminating clinical guidelines and other knowledge sources in healthcare, for example, it may be desirable to provide a presentation of current knowledge about best practice that is limited to material appropriate to the current patient context. A promising solution to this problem is to augment conventional guideline documents with decision-making and other “intelligent” services tailored to specific needs at the point of care. In this paper we describe how BMJ’s Clinical Evidence, a well-known medical reference on the web, was enhanced with patient data acquisition and decision support services implemented in PROforma.

1 Introduction

Health professionals find themselves under increasing pressure from constantly escalating workloads and the growing expectations and demands of patients and managers, with inevitably less time being spent maintaining their personal knowledge bases. Furthermore, the available scientific knowledge that forms the evidence base of everyday clinical practice far exceeds a clinician’s capacity to absorb it and apply it effectively¹. It is widely believed that technologies such as decision support and knowledge management systems have considerable potential to support effective dissemination of up-to-date knowledge to clinicians, bringing relevant information to the right place at the right time, and applying it safely and efficiently (see www.openclinical.org).

The web is becoming a vital tool for the management of medical knowledge. Developments in hypertext content have made it possible to rapidly build and publish major repositories of reference information, clinical guidelines and so on. Techniques for automatically generating web content from pre-existing relational- or XML-databases, with insertion of links between related sections, are also well established. The result is an increasing availability of specialist “knowledge resources” with unprecedented coverage and accessibility.

1 “Medicine is a humanly impossible task” - Alan Rector


Despite these developments and the impressive performance of modern search engines, the new web publishing techniques remain problematic. Users who are looking for reference documents or seeking answers to questions typically divide their time between navigating across web links and reading the material that seems relevant to their requirements. Although search engines home in quickly on relevant web pages, the end result of a typical search is still a large collection of documents to review in order to find the answers to the original question. A clinician who needs to answer a specific question quickly, or compare treatment options for a particular patient, can still be overwhelmed by material. Healthcare professionals would greatly benefit from search processes that could filter content in a way that focused the presentation on the specific task in hand at the point of care.

An alternative technique, developed within the AI community, has been used to control the delivery of content to a user. Some expert systems have used a set of rules to automatically direct browsing, whilst also providing explanations to the user as to why specific content has been delivered. [1, 2] Unfortunately, translating knowledge from documents into an expert system’s knowledge base has proved difficult. Rada has described a scheme called ‘expertext’ to combine the strengths of both expert systems and hypertext, [3, 4] with some groups experiencing limited success with this approach. [5]

This paper addresses the provision of expert-system-like decision support facilities for healthcare professionals, integrated with conventional web content. The aim is to provide patient-specific decision support based on large repositories of web content. The two critical challenges we consider are the need to reduce impractical demands for detailed knowledge engineering, by automatically adapting existing XML content to specific use cases in an expertext-like manner, and the capability to focus presentation of the content in light of the specific clinical needs.

2 Problem Description

Paper and electronic journals and textbooks are the traditional sources of medical knowledge. Journals normally provide detailed and focused information specific to certain disease-areas, while textbooks typically provide comprehensive information covering aetiology, physiology, diagnosis and treatment. Both formats are normally prepared with quiet study in mind rather than rapid access to patient-specific information. To address the need for rapid clinical reference, new print formats have appeared such as the pocket-sized Oxford Handbook of Clinical Medicine. Despite their popularity, such manuals are inevitably lacking in detail and medical publishers continue to look for alternative solutions. One of the most interesting new formats to appear is Clinical Evidence (C.E.), a biannual digest of clinical research developed by the publishers of the British Medical Journal. [6]

The basic concept of C.E. is to provide a structured, standardized database of reference information built around (a) major areas of clinical practice, (b) questions that commonly arise about alternative treatments and other interventions in those areas, and (c) the proven benefits - and potential harms - that are associated with different interventions. Each question is associated with a certain amount of text, typically in the region of half-a-dozen pages of close print. The user will then need to read the text in order to extract, retain and then correctly apply the evidence provided to make a clinical decision. A web version is also available, from which it is possible to drill down from the evidence summaries into other web resources, notably the PubMed repository of research reports.

Despite the popularity of C.E., the staff of BMJ Publishing recognize some important limitations. In its paper form, it is a weighty and unwieldy volume and in its web form, users must navigate up and down a hierarchical tree structure in order to get to the sections they require. More importantly, while the publication provides a uniquely compact review of many areas of modern medical practice, there is still a great deal of information to read and digest. There would still be great value in “filtering and focusing” the content into a form that was directly relevant to specific clinical settings and questions.

Cancer Research UK was asked to carry out an experiment to investigate new ways in which this problem might be addressed. The C.E. knowledge base was supplied to us in the form of a set of XML documents [7] or sections, each containing a text segment about a particular topic (Figure 1). Each topic consists of a set of questions and references. Questions have a set of associated options, with each option describing the benefits, harms and any further comments associated with that option in relation to the posited question.

Fig. 1. Hierarchical breakdown of Clinical Evidence document structure

3 Knowledge Authoring

The technology used in this experimental system was the Tallis guideline authoring and web-publishing system (www.openclinical.org/kpc). Tallis uses the PROforma process modeling language that was designed to support the specification and execution of task-based processes such as clinical guidelines. PROforma provides an expressive, compositional language based on a small ontology of generic tasks (Fox and Das, 2000):


• Decisions – any choice, such as a choice between competing diagnoses or treatments

• Actions – a simple external act, such as a message action or display of a web page

• Enquiries – an external request for information, such as a clinical data entry form

• Plans – any number of the above tasks, possibly including sub-plans

PROforma tasks can be composed into networks, representing processes that are to be carried out over time (such as guidelines, protocols or care pathways). The task specification is a declarative representation that can be interpreted by a suitable engine that enacts the tasks (e.g. acquiring data, evaluating decisions, and controlling the flow of task execution). Task enactment can be influenced by a number of control constructs:

• Scheduling-constraints which specify any tasks that must be completed before a task can be considered for execution (e.g. collect data before making a decision)

• Trigger events which can activate a task independently of any scheduling constraints on it (e.g. user initiates a care pathway)

• Preconditions that specify any logical circumstances that must hold for a task to be processed by the engine (e.g. a task that should only be enacted if a particular decision has already been made).
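The interplay of these three constructs can be sketched as a single readiness test. The Python fragment below is our own illustrative reading, not the PROforma engine: the dictionary keys, the function name and the rule that a triggered task must still satisfy its precondition are assumptions.

```python
def may_be_enacted(task, completed, triggered, facts):
    """Sketch of when an engine might consider a task for enactment.

    task      -- dict with optional keys "after" (names of tasks that must
                 complete first) and "precondition" (function of facts)
    completed -- set of names of tasks already completed
    triggered -- True if a trigger event has activated the task
    facts     -- dict of data known so far
    """
    scheduled = set(task.get("after", ())) <= completed
    precondition = task.get("precondition", lambda facts: True)
    # Assumption: a trigger bypasses scheduling constraints, but the
    # precondition must still hold.
    return (scheduled or triggered) and precondition(facts)
```

For instance, a decision task scheduled after a data collection task becomes eligible either when that task has completed or when the user triggers it directly, provided its precondition holds in both cases.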

A decision task also contains a set of decision options or candidates. Candidates are associated with “argument rules” (which might represent reasons for particular treatments, for example) and “commitment” rules, which take or recommend particular candidates based on collections of arguments. Arguments may also be given weightings to indicate that some arguments are “stronger” than others.

4 Integrating a Task Model into Clinical Evidence

The first step in integrating decision support into C.E. was to define a PROforma task structure for the C.E. document structure. This was facilitated by the hierarchical organization of the C.E. document, since all nodes in the C.E. tree structure could map simply to a corresponding PROforma task.

As remarked earlier, PROforma applications can contain plans that can contain sub-plans and other tasks. Figure 2 shows how C.E. is modeled as a single plan (represented as a rounded rectangle), which contains sub-plans that are used as containers for C.E. sections, such as “cardiovascular disease section”. These sub-plans contain the C.E. content dealing with C.E. topics, such as “Acute Myocardial Infarction”. In the C.E. structure topics contain a number of C.E. questions, which are also container plans. The tree structure in Figure 2 also contains many squares; each of these represents a PROforma action that contains all the instructions required to display a segment of C.E. text as a web page (including any links to other pages).

Each question within C.E. would map to a decision support plan, consisting of an enquiry task, to obtain patient data, and a decision task to evaluate arguments for and against different options. The candidates of the decision task were automatically derived from the options associated with the question. The content within the benefits, harms and comment sections on an option were used to create the requisite arguments for a candidate (arguments described within the benefits section “support” a candidate, whilst arguments described within the harms section “oppose” it).

Unfortunately, the C.E. text for each option (i.e. the benefits, harms and comments) was not sufficiently well structured to allow a simple mapping to the arguments of a candidate. This process was carried out manually by creating PROforma rules within the Tallis authoring system. [9] The weight of each argument was determined by the strength of the clinical trial data the argument referenced and its statistical significance.

Fig. 2. PROforma model based on the content of Clinical Evidence detailed in Figure 1

To illustrate this manual process, consider the question in Figure 1: “Which treatments improve outcomes in acute myocardial infarction?”. This question has a set of options, one of which is Angiotensin converting enzyme inhibitors (ACE inhibitors). Associated with this option are the following fragments of text:

… The overview (4 large RCTs, 98,496 people irrespective of clinical heart failure or left ventricular dysfunction, within 36 h of the onset of symptoms of AMI) compared ACE inhibitors versus placebo. [33] It found that ACE inhibitors significantly reduced mortality after 30 days (7.1% with ACE inhibitors v 7.6% with placebo; RR 0.93, 95% CI 0.89 to 0.98; NNT 200)… The largest benefits of ACE inhibitors in people with AMI are seen when treatment is started within 24 hours…

This would map to the creation of the candidate ‘ACE inhibitors’ for the decision task. This would contain a set of arguments (where ‘ami_onset’ is an integer variable representing the time of infarction onset):

• Argument 1
  PROforma condition: ami_onset =< 24
  PROforma weight: +2
  PROforma caption: Treatment of patients with angiotensin converting enzyme inhibitors within 24 hours of the onset of infarction significantly reduced mortality after 30 days [33]

• Argument 2
  PROforma condition: (ami_onset > 24) AND (ami_onset =< 36)
  PROforma weight: +1
  PROforma caption: Treatment of patients with angiotensin converting enzyme inhibitors within 36 hours of the onset of infarction significantly reduced mortality after 30 days [33]
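The two arguments above can be read as condition/weight pairs that are evaluated against the patient data. The following Python sketch is an illustration only, not PROforma: the data structure and function name are ours, the captions are abbreviated, and summing the weights of the applicable arguments is an assumed aggregation scheme.

```python
# The two arguments above, as (condition, weight, caption) triples.
ACE_INHIBITOR_ARGUMENTS = [
    (lambda ami_onset: ami_onset <= 24, +2, "ACE inhibitors within 24 h [33]"),
    (lambda ami_onset: 24 < ami_onset <= 36, +1, "ACE inhibitors within 36 h [33]"),
]


def net_support(arguments, ami_onset):
    """Sum the weights of all arguments whose condition holds for this patient."""
    return sum(weight for condition, weight, _ in arguments if condition(ami_onset))
```

With these definitions, a patient seen 12 hours after onset obtains a net support of 2 for the candidate, a patient seen after 30 hours obtains 1, and after 48 hours neither argument applies.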

5 Application Architecture

An XSLT [10] document was developed to transform the C.E. XML documents into PROforma XML documents, using the mappings specified in the section above. An XSLT document was also developed to transform the C.E. content into the necessary web pages to provide a façade for the PROforma action, enquiry and decision tasks viewed during guideline enactment (Figure 3). Navigational instructions to control enactment, such as task confirmations and triggers, were inserted into the web pages as hyperlinks. Enquiry tasks within the decision support plan were also made available as hyperlinks. When activated, an HTML form is provided to query the user for data. This data can then be used by the engine to evaluate which is the most appropriate candidate. Candidates of a decision task can also link to an HTML-encoded breakdown of the relevant arguments, with hyperlinks back to the clinical trials that provided the evidence on which the arguments were based.
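The question-to-plan mapping can be pictured with a small stand-in written in Python rather than XSLT. Everything specific in this sketch is hypothetical: the element and attribute names of the sample fragment are invented (the actual C.E. schema is not reproduced here), and the output is a plain dictionary rather than PROforma XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical C.E.-like fragment; element and attribute names are invented.
SAMPLE = """
<question title="Which treatments improve outcomes in acute myocardial infarction?">
  <option name="ACE inhibitors"><benefits/><harms/><comment/></option>
  <option name="Nitrates"><benefits/><harms/><comment/></option>
</question>
"""


def question_to_plan(xml_text):
    """Map one question element to a plan with an enquiry and a decision task."""
    question = ET.fromstring(xml_text.strip())
    # Each option of the question becomes a candidate of the decision task.
    candidates = [option.get("name") for option in question.findall("option")]
    return {
        "plan": question.get("title"),
        "tasks": [
            {"type": "enquiry", "name": "patient_data"},
            {"type": "decision", "candidates": candidates},
        ],
    }
```

Calling question_to_plan(SAMPLE) yields a plan whose decision task lists one candidate per option, mirroring the automatic derivation of candidates described earlier; the arguments themselves would still have to be authored by hand.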

Fig. 3. System architecture overview and integration of the PROforma engine with the XSLT generated content

The PROforma documents and web pages were deployed within a J2EE servlet-container, where they could be enacted via a standard web-browser. In a typical session, the user starts up the PROforma engine, which leads to the activation of an action task that presents all the available sections. The user could then browse to a specific topic, by triggering another action task detailing the questions specific for that topic. For each question, the user may browse through the individual options or activate the decision support facilities, provided as a hyperlink within that web page (Figure 4).

Invoking decision support initiates an enquiry task. This displays an HTML form for the user to provide information about the specific patient. On completion of this step the decision task is initiated, using the data collected by the enquiry to evaluate the arguments for and against the different options (candidates).

The final task is to construct a report showing the options in order of preference based on an assessment of the overall strength of the arguments. The user can review the arguments for each option. Figure 5 shows a typical report for the decision support service in the application. The top panel shows the 7 decision options for the question “Which treatments improve outcome in acute myocardial infarction?”. The top two ticked options (Nitrates and Blockers) are recommended, while the bottom two crossed options are recommended against. The user, who may accept the system recommendation or select another option, makes the final choice. Here the user has requested further details of one of the equivocal options (indicated by ‘?’) and the arguments are shown below, one argument “for” and one “against”. The user may also request further justification for an argument, which is provided by linking through to the original research study report located on the PubMed web site.

This example application can be accessed for demonstration purposes at http://www.openclinical.org/BMJDemo/demo.html.

Fig. 4. HTML façade for the Acute Myocardial Infarction Action - showing the decision support hyperlink

6 Discussion and Future Work

6.1 Functionality Benefits

The use of decision tasks allows the user to be directed to the most appropriate content for that current session. Normally, a user would be presented with a web page with a set of hyperlinks to further content. To determine if the information pointed to via these links is relevant, the user is first required to navigate them all, read all the content and then make a decision based on the digested content. With decision support facilities, the doctor merely enters current patient details and the candidate decision options are assessed based upon this data. Such facilities avoid a great deal of unnecessary and time-consuming work by generating only the hyperlinks that are relevant within the particular clinical situation and by ensuring that clinical decision-making takes all current information and evidence into account.

Fig. 5. HTML façade for the Acute Myocardial Infarction Decision – showing the support for the Angiotensin converting enzyme inhibitors candidate

6.2 Usability Benefits

The network structure of a PROforma guideline, and its decomposition into constituent tasks, provides a practical clinical context for information retrieval and navigation. Each content level within the C.E. document maps to a plan within the enactable guideline. Each of these plans then contains further plans (sublevels of C.E. content) and an action task that returns the rendered HTML, containing the necessary links for further navigation. This also provides a context for users with respect to their previous navigational choices. The current set of active tasks can be retrieved at any moment, in effect providing a dynamic set of bookmarks for the current session and reducing time spent browsing back and forth in the complete C.E. document. This could be valuable where time is short, allowing the busy doctor to avoid time-consuming and redundant navigation steps.

6.3 Future Work

Currently, all HTML content is pre-generated before the guideline is enacted. A future line of enquiry is to generate content dynamically, by providing runtime-processing facilities to tasks within PROforma. Such facilities could include the XSLT generation of the relevant web pages, generation of emails and the querying and/or update of external patient records.

Another line of enquiry currently being investigated is to provide a more fine-grained description of the content within a C.E. option. Medical ontologies and semantic web initiatives are promising candidates for providing the required high-level descriptions of such content. [11, 12] The automated approach of using XSLT, currently used in the guideline construction, could then be extended to the generation of arguments within the PROforma guideline.

The final line of development is to carry out usability testing and clinical evaluation of this approach to decision support. Earlier PROforma applications have been evaluated in collaboration with volunteer doctors in clinical settings. For example, the CAPSULE project found that an “argument-based decision support” system led to a significant improvement in the quality of a doctor’s prescribing decisions (in relation to making a better choice of medication, or a cheaper but equally effective one). [13] The RAGs genetic risk assessment system was evaluated in a clinical simulation, which found that decision support technology made sense of the “guideline chaos” in primary care. [14, 15] We aim to carry out a two-stage evaluation of the present technology with paper patients to (a) establish whether decision support of this kind has a beneficial effect on clinical decisions and (b) investigate issues of usability and acceptability at the point of care.

7 Conclusion

The explosion of medical knowledge on the web is producing problems for the practical retrieval of relevant information at the point of care. A promising solution to this problem is to augment conventional guideline documents with decision-making and other “intelligent” services, tailored to a patient’s specific circumstances. In this paper we have demonstrated how this can be achieved, using BMJ’s Clinical Evidence as the knowledge base and PROforma as the formal guideline representation. The creation of the integrated XML content was partly automatic, with scope for increasing the automated component in such applications, particularly where the target document is well structured. This combination of ordinary documents and formalized knowledge offers a number of potential benefits for improved functionality and usability of guidelines. The present paper has concentrated on the technical aspects of our approach; studies of actual benefits are in progress.

Acknowledgements

We would like to thank Dr Jon Fistein and staff of BMJ Publishing for their encouragement in this project and their help in integrating PROforma with Clinical Evidence. We would also like to thank Richard Thomson, Michael Humber, David Sutton and Ali Rahmanzadeh for their help and assistance in use of PROforma and related technologies.


References

1. Shortliffe, E., Scott, A., Bischoff, M., Campbell, A., van Melle, W., Jacobs, C.: ONCOCIN. Expert system for oncology protocol management. Proceedings International Joint Conference Artificial Intelligence. (1981) 876-881

2. Timpka, T.: LIMEDS. Knowledge-based decision support for General Practitioners: an integrated design. Proceedings Tenth Annual Symposium on Computer Applications in Medical Care. (1986) 394-402

3. Rada, R.: Hypertext: from text to expertext, McGraw-Hill, Inc. New York, (1992)

4. Rada, R., Barlow, J.: Expert systems and hypertext. The Knowledge Engineering Review. 3, 4, (1988) 285-301

5. Fox, J., Glowinski, A., Gordon, C., Hajnal, S., O’Neil, M.: Logic engineering for knowledge engineering: design and implementation of the Oxford System of Medicine. Artificial Intelligence in Medicine. 2, 6, (1990) 323-339

6. BMJ Publishing Group. See http://www.evidence.org

7. World Wide Web Consortium. eXtensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/2000/REC-xml-20001006

8. Fox, J., Johns, N., Lyons, C., Rahmanzadeh, A., Thomson, R., Wilson, P.: PROforma - a general technology for clinical decision support systems. Computer Methods and Programs in Biomedicine. 54 (1997) 59-67

9. Advanced Computation Laboratory. See http://acl.icnet.uk/lab/tallis.html

10. World Wide Web Consortium. eXtensible Stylesheet Language Transformations (XSLT), W3C Recommendation. See http://www.w3.org/TR/xslt

11. For a short review of medical ontologies see http://www.openclinical.org/emr.html

12. World Wide Web Consortium. Semantic Web Activity. See http://www.w3.org/2001/sw

13. Walton, R.T., Gierl, C., Yudkin, P., Mistry, H., Vessey, M.P., Fox, J.: Evaluation of computer support for prescribing (CAPSULE) using simulated cases. British Medical Journal. 315 (1997) 791-795

14. Coulson, A.S., Glasspool, D.W., Fox, J., Emery J.: RAGs: A novel approach to computerised genetic risk assessment and decision support from pedigrees. Methods of Information in Medicine. 40 (2001) 315-322

15. Emery, J., Walton, R., Coulson, A.S., Glasspool, D.W., Ziebland, S., Fox, J.: Computer support for recording and interpreting family histories of breast and ovarian cancer in primary care (RAGs) - qualitative evaluation with simulated patients. British Medical Journal. 319 (1999) 32-36


Linking Clinical Guidelines with Formal Representations

Peter Votruba1, Silvia Miksch1, and Robert Kosara2

1 Vienna University of Technology, Inst. of Software Technology & Interactive Systems
Favoritenstraße 9-11/188, A-1040 Vienna, Austria
{peter,silvia}@asgaard.tuwien.ac.at
www.asgaard.tuwien.ac.at

2 VRVis Research Center for Virtual Reality and Visualization
TechGate Vienna, Donau-City-Strasse 1, A-1220 Vienna, Austria
[email protected]
www.VRVis.at/vis/

Abstract. Clinical guidelines have been used in the medical domain for some time now, primarily to reduce proneness to errors during the treatment of specific diseases. Recently, physicians have gained special software that supports them in decision-making based on computerized protocols and guidelines. Using such tools, physicians sometimes want to know why the computer recommends a particular treatment method. To comprehend the suggestions, a connection between the original guideline and its computerized representation is needed. This paper introduces a tool designed to provide that connection, the Guideline Markup Tool (GMT). This tool enables the protocol designer to create links between the original guideline and its formal representation.

1 Introduction

Clinical guidelines have been introduced to standardize treatment methods of physicians. Clinical guidelines are systematically developed directions to assist the medical practitioner in making decisions about appropriate healthcare for specific conditions. Guidelines are intended to define each step of a treatment for specific diseases to reduce proneness to errors. Conventional guidelines are written as plain text documents, sometimes including tables or flow charts for better illustration of important facts. Often these documents contain ambiguities or even contradictions, which reduce their usefulness.

Several approaches have been pursued with the aim of improving the usefulness of guidelines by modelling them in a machine-readable form using a guideline modelling language. In many cases, the main goal is to systematically validate guidelines. In addition, software systems have been developed that support the computerization and application of guidelines.


The Asgaard project¹ [6] outlines some useful task-specific problem-solving methods to support both designers and users of clinical guidelines and protocols. The key part of this project is the guideline representation language Asbru [3,5].

The Guideline Markup Tool (GMT) has been developed within the Asgaard project to support the translation of a free-text guideline into its Asbru representation.

1.1 Related Work

Several tools for acquiring guidelines have been proposed. AsbruView [1], a knowledge acquisition and visualization tool, was developed within the Asgaard project to facilitate the creation, editing and visualization of Asbru files. To be suitable for physicians, AsbruView uses graphical metaphors, such as a running track with a finishing flag, to represent Asbru plans.

GEM Cutter [4] is a tool that was developed to support the transformation of a guideline into the GEM format. It shows the original guideline document together with the corresponding GEM document, similar to our Guideline Markup Tool, and makes it possible to copy text from the guideline to the GEM document.

There are two tools for translating guidelines into PROforma [2] - both make heavy use of the same graphical symbols representing the four task types in PROforma. AREZZO is designed to be used on client-side only, whereas TALLIS [7] supports publishing of PROforma guidelines over the World Wide Web.

2 The Guideline Markup Tool

None of the existing tools supports (i) the linking of informal and formal representations of guidelines, to increase the structuring and understanding of guidelines in both representations and to trace back flaws and errors, and (ii) the use of design patterns to ease the authoring of guidelines in a formal representation. This leads to the two main features of the Guideline Markup Tool (GMT) [8].

Firstly, GMT allows the definition of links between the original guideline and the Asbru representation, which lets the user find out where a certain value in the Asbru notation comes from. If someone wants to know the origin of a specific value in the Asbru XML file, the GMT can be used to jump to the corresponding point in the HTML file where the value is defined, and vice versa.
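
A bidirectional link of this kind can be sketched as two lookup tables kept in step. The following Python fragment is a minimal illustration under stated assumptions, not the GMT's actual data model: `LinkIndex`, its method names and the idea of addressing endpoints by an HTML anchor and an XML path are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class LinkIndex:
    """Bidirectional lookup between guideline anchors and Asbru element paths."""

    def __init__(self) -> None:
        self._html_to_xml: Dict[str, Set[str]] = defaultdict(set)
        self._xml_to_html: Dict[str, Set[str]] = defaultdict(set)

    def add_link(self, html_anchor: str, xml_path: str) -> None:
        """Record one link; both directions are updated together."""
        self._html_to_xml[html_anchor].add(xml_path)
        self._xml_to_html[xml_path].add(html_anchor)

    def xml_targets(self, html_anchor: str) -> List[str]:
        """Asbru elements that were derived from a passage of the guideline."""
        return sorted(self._html_to_xml.get(html_anchor, set()))

    def html_sources(self, xml_path: str) -> List[str]:
        """Guideline passages in which an Asbru element's value is defined."""
        return sorted(self._xml_to_html.get(xml_path, set()))
```

Keeping both tables inside one `add_link` call means a link can never be followed in one direction but not the other.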

The second main feature of the GMT is the usage of macros. A macro combines several XML elements (in other words, Asbru elements) that are usually used together. Thus, using macros allows creating and extending Asbru XML files more easily through the usage of common design patterns. Such design patterns capture frequently used behaviours found in guidelines.
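
The idea of a macro as a reusable group of elements with fillable attribute values can be sketched as a parameterised XML template. This Python fragment is an illustration only: the template format, the function names and the element and attribute names in `SIMPLE_PLAN_MACRO` are hypothetical and are not taken from the GMT macros file or the Asbru DTD.

```python
import xml.etree.ElementTree as ET
from typing import Dict
from xml.sax.saxutils import escape

# Hypothetical macro: element and attribute names are illustrative only.
SIMPLE_PLAN_MACRO = (
    '<plan name="{name}">'
    '<conditions><complete-condition value="{done_when}"/></conditions>'
    '</plan>'
)


def expand_macro(template: str, values: Dict[str, str]) -> ET.Element:
    """Fill the template's placeholders (XML-escaped) and parse the result."""
    safe = {key: escape(val, {'"': "&quot;"}) for key, val in values.items()}
    return ET.fromstring(template.format(**safe))


def insert_macro(target: ET.Element, template: str,
                 values: Dict[str, str]) -> ET.Element:
    """Append the expanded macro as the last child of the selected element."""
    node = expand_macro(template, values)
    target.append(node)
    return node
```

Expanding a template in one step inserts several elements that belong together, which is the convenience the macros are meant to provide.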

Through these two features, GMT is able to support the following tasks:

1 Asgaard Project website: http://www.asgaard.tuwien.ac.at/


• Authoring and Augmenting Guidelines. We want to be able to take a new guideline in plain text and create an Asbru version of it, and to add links to the corresponding parts of a guideline to an already existing Asbru file.

• Understanding Asbru Guidelines. For an Asbru guideline, we want to be able to see where values in the different parts of the Asbru code come from, and how parts of the original text were translated into Asbru. This is important not just for knowledge engineers, but also for physicians wanting to get an understanding of the language Asbru.

• Structuring Asbru. The GMT provides a structured list of Asbru elements – the macros – which must be organized in a way that best supports the authoring of plans. This list will also provide a good starting point for teaching material and possible subsets of the language for special purposes.

2.1 Features

According to the requirements presented above, the user interface is designed to show the contents of the HTML file (original guideline), the XML file (Asbru representation) and the macros file together in one window. Therefore, the GMT window is divided into three main parts – Fig. 1 shows a screenshot of the GMT with loaded HTML, XML and macros files.

Fig. 1. Screenshot of the GMT window


The upper left part of the window (component #1 in Fig. 1) shows the contents of the HTML file. The XML part consists of a hierarchical view of the XML file (component #2a) and a detail view of the current XML node (component #2b). The macros part contains a view of the macros structure (component #3a) and a preview of the currently selected macro (component #3b).

• Inserting a macro/link. To insert a macro (or a link, which is a special kind of macro), the target XML element in the XML view and a proper macro in the structure view have to be selected. After clicking on the insert-macro button, an input dialog appears where the attribute values can be entered.

• Activate a link. If links have been defined during the translation of a guideline, they can be used to comprehend the connections between the original guideline and its Asbru representation (see Fig. 2).

• Link visualization. A useful add-on is the possibility to visualize the spread of links in an Asbru file, where each element in the XML view is coloured differently – all link elements get a green background, the elements that belong to a link are coloured blue, and the other elements are grey. When links are inserted into an existing Asbru file, this feature provides a good overview of all unlinked parts.
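
The colouring rule of the link visualization can be sketched as a single tree traversal. The fragment below is an assumption-laden illustration, not the GMT's code: it supposes that link elements carry a hypothetical tag name `link` and that an element "belongs to a link" when it is nested inside one.

```python
import xml.etree.ElementTree as ET
from typing import Dict

LINK_TAG = "link"  # hypothetical tag name for link elements


def colour_elements(root: ET.Element) -> Dict[ET.Element, str]:
    """Map every element to green (link), blue (inside a link) or grey."""
    colours: Dict[ET.Element, str] = {}

    def visit(node: ET.Element, inside_link: bool) -> None:
        if node.tag == LINK_TAG:
            colours[node] = "green"
            inside = True
        elif inside_link:
            colours[node] = "blue"
            inside = True
        else:
            colours[node] = "grey"
            inside = False
        for child in node:
            visit(child, inside)

    visit(root, False)
    return colours
```

The elements mapped to grey are exactly the unlinked parts that the overview is meant to expose.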

Fig. 2. Link activation. When a link endpoint is clicked in the HTML view, its counterparts are highlighted in the XML view

3 Evaluation and Conclusions

In this paper, a new guideline tool, called Guideline Markup Tool (GMT), is introduced. It supports knowledge engineers in translating clinical guidelines into their Asbru representation. It does this by providing macros to facilitate assembling Asbru guidelines. However, the main feature of the GMT is the ability to create and maintain links between a guideline HTML file and its representing Asbru XML file. The knowledge engineer should always define links during the translation task. If the resulting Asbru XML file is used as an input of another Asgaard tool, it may happen that someone wants to know the reason for the choice of a particular Asbru element or the origin of a specific attribute value. The GMT can be used to answer such questions or to retrace errors.

To be consistent, links also work in the other direction, which allows easier comprehension of the translation process and thereby facilitates learning of the quite complex language Asbru. Thus, the GMT can be used to find out how a particular passage in the text of the original guideline has been modelled in Asbru.

We performed a small, qualitative study on the usability of the GMT [8]. We chose eight knowledge engineers, who were familiar with the Asbru language.

The evaluation procedure consisted of three phases: (i) a questionnaire assessing the computer skills of the participants; (ii) an exploration session with the GMT, where the participants examined the functionality of the GMT; (iii) a questionnaire assessing the overall impression and the three views in particular.

The second and third phases confirmed that the three views (HTML, XML, Structure View) are very appropriate for authoring clinical guidelines and for translating such guidelines into a formal notation, like Asbru. The linking features in both directions facilitated structuring the guidelines’ text, the retrieval of knowledge parts, and the retracing of possible flaws and errors. One drawback of the GMT was that everybody was expecting a fully functional editor for XML code in the XML View, which was out of scope. In summary, the participants rated the GMT as a very powerful and useful tool, which supports the implementation of clinical guidelines.

Acknowledgments

We wish to thank Mar Marcos and Marije Geldof for their valuable suggestions. Furthermore, we would like to thank Katharina Kaiser, Georg Duftschmied, Andreas Seyfang, Christian Popow, Monika Lanzenberger, Wolfgang Aigner, Peter Messner and Klaus Hammermüller for participating in the evaluation.

This tool is part of the Asgaard Project, which is supported by “Fonds zur Förderung der wissenschaftlichen Forschung” (Austrian Science Fund), grant P12797-INF.

References

1. Kosara, R.; Miksch, S.: Metaphors of Movement - A Visualization and User Interface for Time-Oriented, Skeletal Plans. In: Artificial Intelligence in Medicine, Special Issue: Information Visualization in Medicine, pp. 111-131, 22(2) (2001)

2. Bury, J.; Fox, J. and Sutton, D.: The PROforma Guideline Specification Language: Progress and Prospects. In: Proceedings of the First European Workshop on Computer-Based Support for Clinical Guidelines and Protocols (EGWLP) 2000, Volume 83 of Studies in Health Technology and Informatics, pp. 12–29. IOS Press. (2000)

3. Miksch, S.; Shahar, Y.; Johnson, P.: Asbru: A Task-Specific, Intention-Based, and Time-Oriented Language for Representing Skeletal Plans. In: Motta, E.; Harmelen, F. v.; Pierret-Golbreich, C.; Filby, I.; Wijngaards, N. (eds.), 7th Workshop on Knowledge Engineering: Methods & Languages (KEML-97), Milton Keynes, UK (1997)


4. Polvani, K.-A.; Agrawal, A.; Karras, B.; Deshpande, A.; Shiffman, R.: GEM Cutter Manual. Yale Center for Medical Informatics (2000)

5. Seyfang, A.; Kosara, R.; Miksch, S.: Asbru’s reference manual, Asbru version 7.3. Technical Report Asgaard-TR-2002-3, Vienna University of Technology (2002)

6. Shahar, Y.; Miksch, S.; Johnson, P.: The Asgaard Project: A Task-Specific Framework for the Application and Critiquing of Time-Oriented Clinical Guidelines. In: Artificial Intelligence in Medicine, 14, pp. 29-51 (1998)

7. Steele, R. and Fox, J.: Tallis PROforma Primer. Advanced Computation Laboratory, Cancer Research, UK. (2002)

8. Votruba, P.: Structured Knowledge Acquisition for Asbru. Master’s Thesis, Vienna University of Technology (2003)


Computerised Advice on Drug Dosage Decisions in Childhood Leukaemia: A Method and a Safety Strategy

Chris Hurt1, John Fox1, Jonathan Bury2, and Vaskar Saha3

1 Advanced Computation Laboratory, Cancer Research UK
44 Lincoln’s Inn Fields, London WC2A 3PX, UK
{chris.hurt,john.fox}@cancer.org.uk

2 Academic Unit of Pathology, University of Sheffield Medical School
Sheffield S10 2UL, UK
[email protected]

3 Children’s Cancer Group, Cancer Research UK
Dept of Paediatric Haematology and Oncology
Barts and the London School of Medicine and Dentistry
Queen Mary University of London
Stepney Way, London E1 1BB, UK
[email protected]

Abstract. Currently over 95% of children who are diagnosed with Acute Lymphoblastic Leukaemia in the UK are enrolled into Medical Research Council trials. The trial protocol specifies that following initial treatment there is a 2-3 year maintenance period during which drug dosage decisions are made weekly according to a set of pre-defined rules. These rules are complex, and there is a significant frequency of error in clinical practice, which can lead to patient harm. We have built a web-based decision support system (called LISA) to address this problem. The dose alteration rules from the MRC protocol were formalised in the PROforma guideline modeling language as a state transition problem, and dose adjustment recommendations are delivered in the clinical setting by a PROforma enactment engine. The design and implementation of the decision support module, the safety issues raised and the strategy adopted for resolving them are discussed. System safety is very likely to become a major professional challenge for the medical AI community and it can be addressed, in this case, with relatively straightforward techniques.

1 Introduction

Acute Lymphoblastic Leukaemia (ALL) is the commonest paediatric malignancy. In the UK, over 95% of the 320 children diagnosed with ALL each year are enrolled into Medical Research Council trials, and their treatment is defined by the research protocol for the trial [1]. Treatment of the disease can be viewed as having three phases – the Induction of clinical remission, the Consolidation of this remission, and subsequent Maintenance Therapy. This paper is concerned with the management of the last phase.

The mainstay of treatment during this period is the regular administration of two oral chemotherapy agents, 6-mercaptopurine (MP) which is given daily, and methotrexate (MTX) which is given weekly. There is great individual variation in response to these drugs, and dosages have to be continually adjusted to avoid inducing episodes of severe marrow suppression. The rules for dosage adjustments defined in the protocol are moderately complex and their application requires knowledge not only of a child’s most recent blood count but also of blood counts and chemotherapy dosages during the preceding twelve weeks. This poses particular challenges as care is typically organized according to a ‘Hub and Spoke Model’ in which a regional Treatment Centre collaborates with a network of local ‘shared care’ units. One recent single institution study [2] has found that at least 7.4% of dosage decisions made by clinicians are inconsistent with the protocol. The true figure is probably much higher because many decisions are based on incomplete data – a feature of shared care with paper records. It is important to get the dosage adjustment correct because both undertreatment and overtreatment have potentially fatal consequences.

The LISA system has been designed to address the decision making and information sharing problems summarised above. LISA’s main components are a centralised Oracle database that holds all patient information (drug schedules, blood and toxicity results, dosages prescribed etc.) and a web-based decision support application that provides advice about dosage adjustment. The complete system is described elsewhere [2]; the present paper focuses on the decision support module.

2 Modelling the Decision in PROforma

The 97/01 trial protocol [1] is, like many research protocols, subject to modification by trial managers as new evidence of patient response and other clinical effects emerges. Such protocol changes are likely to require modifications to treatment regimes, in this case to the drug dosage rules. The formal knowledge that describes the dosage rules in LISA is held in a decision-support module that is loosely coupled to the other software components. The clean separation between the decision component and other parts of the system has enabled us to incorporate changes to the dosage rules during the lifetime of the project without needing to re-engineer other components of the LISA system.

The dosage adjustment rules are defined in Appendix B of the trial protocol [1]. We have modeled these rules in PROforma [3], a declarative, process modeling, logic language for specifying decisions, plans and other tasks that are natural components of clinical protocols, guidelines and care pathways. Application development is supported by an extensive set of development tools created by Cancer Research UK (see http://www.openclinical.org/gmmintro.html). Part of this toolset is a Java engine that enacts any PROforma process model. The dosage adjustment rules have been modeled using three standard PROforma tasks: a plan containing a data enquiry followed by a dosage adjustment decision. PROforma supports a hybrid qualitative and quantitative decision procedure based on “argumentation” [4].

Study of the protocol rules shows that at any time during maintenance a patient will be in exactly one of eight “states”, each of which represents a drug dosage combination of MP and MTX. The single PROforma decision controls a state transition, moving the patient from the existing state to one of eight possible new states (decision candidates): 0% MP and 0% MTX (“omit oral chemotherapy”), 50% MP and 50% MTX, 75% MP and 75% MTX, 100% MP and 100% MTX, 125% MP and 100% MTX, 125% MP and 125% MTX, 150% MP and 125% MTX and 150% MP and 150% MTX. Since it must be possible to let clinicians use their own judgement and prescribe a non-protocol combination, a 9th option is also included: Other (“non protocol combination of %s”). Further scrutiny of the protocol rules shows that the decision is to be based on five inputs: current state, current platelet and absolute neutrophil count of the blood result on which the decision is being based, number of weeks that the patient has been at the current state and number of weeks that the patient has tolerated treatment.

Each candidate is associated with a number of arguments. Each argument has a logical condition based on the input data. Once the five patient data have been acquired the PROforma decision engine evaluates the logical condition associated with each argument and then “aggregates” those arguments that are logically true in order to arrive at an overall measure of support for each candidate. If the measure is above a specified value then the candidate is “recommended”. Each candidate is allocated a priority level so that if more than one is recommended, the one with the highest priority is given precedence. A caption can be associated with each argument so that the reasons for a recommendation can be expressed in English by the enactment engine.
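
The procedure just described (evaluate conditions, aggregate the true arguments, apply a threshold, break ties by priority, report captions) can be sketched as follows. This is a hedged illustration and not the PROforma engine: the class and function names are hypothetical, summation of weights stands in for the engine's aggregation, and no conditions from the trial protocol are encoded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

PatientData = Dict[str, float]


@dataclass
class Argument:
    condition: Callable[[PatientData], bool]  # logical condition on the inputs
    weight: int
    caption: str  # reason reported to the user when the condition is true


@dataclass
class Candidate:
    name: str
    priority: int  # higher value takes precedence among recommended candidates
    arguments: List[Argument] = field(default_factory=list)


def support(candidate: Candidate, data: PatientData) -> Tuple[int, List[str]]:
    """Aggregate the true arguments: summed weight (an assumption) and captions."""
    true_args = [a for a in candidate.arguments if a.condition(data)]
    return sum(a.weight for a in true_args), [a.caption for a in true_args]


def recommend(candidates: List[Candidate], data: PatientData,
              threshold: int) -> Optional[Tuple[str, List[str]]]:
    """Return the highest-priority candidate whose support exceeds the threshold."""
    recommended = []
    for candidate in candidates:
        score, reasons = support(candidate, data)
        if score > threshold:
            recommended.append((candidate.priority, candidate.name, reasons))
    if not recommended:
        return None
    _, name, reasons = max(recommended, key=lambda item: item[0])
    return name, reasons
```

Returning the captions alongside the chosen candidate mirrors the requirement that the reasons for a recommendation can be shown to the clinician.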

3 Safety Strategy

The PROforma task enactment engine has a substantial code base - several thousand lines of Java. The potential for software faults is considerable even for programs considerably smaller than this and it is well known that even the most rigorous testing procedures cannot guarantee that faults are eradicated and will not be manifested in clinical use. For these reasons we decided that we must adopt an explicit strategy for managing safety. First we must ensure that software faults are limited by good design and a rigorous testing strategy, but secondly, we also need to ensure that the consequences of any residual faults are explicitly managed.

Fox [5] and Fox and Das [6] discuss a number of approaches to explicit management of hazards in clinical decision support and expert systems. Some of these techniques are concerned with “soft” aspects of design, such as systematic analysis of the possible scenarios in which faults might occur and the potential consequences, and ensuring that the system is open to scrutiny by the clinical user in order to identify situations in which system-generated advice may be inappropriate. To address these issues we carried out an informal Hazards and Operability Analysis (HAZOP) [7] to shed light on the possible dangers that can arise in this domain.

Our conclusion from this “soft” activity was that the principal source of hazard is simply in giving incorrect advice. This could be caused by erroneous inputs to the decision support module or erroneous code in the decision support module. We then applied “hard” techniques of safety management. To minimise the chance of erroneous inputs we used standard software engineering methods of data validation. To minimise the impact of possible errors in the code base we adopted a method known in the software safety community as N-version programming [8]. The idea behind N-version programming is that more than one version of the system is included that has the same function but is designed according to different principles (and preferably implemented by different programmers). The hope is that the N systems will have N different failure modes. In LISA, a second, “redundant” decision support component is included in the system that is intended to have the same functionality as the PROforma decision process, but is implemented in a completely different way - a special-purpose, much smaller, conventional, if…then…else Java method. The two solutions are run in parallel and their results compared before advice is given (figure 1).

Fig. 1. Schematic structure of LISA decision support module
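The comparison step of figure 1 can be illustrated with a short sketch. It is written in Python for brevity (LISA itself is implemented in Java), and every name, rule and threshold below is hypothetical: the two toy versions stand in for the PROforma solution and the hard-coded guardian, and the cut-off values are invented for illustration and are not taken from the ALL97 protocol.

```python
from typing import Callable, NamedTuple, Optional


class Inputs(NamedTuple):
    platelets: float       # current platelet count
    anc: float             # current absolute neutrophil count
    state: str             # current dosage state
    weeks_at_state: int
    weeks_tolerated: int


class Advice(NamedTuple):
    ok: bool
    recommendation: Optional[str]
    message: str


def guarded_advice(primary: Callable[[Inputs], str],
                   guardian: Callable[[Inputs], str],
                   inputs: Inputs) -> Advice:
    """Run both versions and release advice only if they agree."""
    first = primary(inputs)
    second = guardian(inputs)
    if first == second:
        return Advice(True, first, "both versions agree")
    return Advice(False, None,
                  f"decision support failed: {first!r} != {second!r}")


# Hypothetical version 1: table-driven rules (made-up thresholds).
def table_version(i: Inputs) -> str:
    rules = [
        (lambda x: x.platelets < 50 or x.anc < 0.5, "stop"),
        (lambda x: x.platelets < 100 or x.anc < 1.0, "reduce"),
        (lambda x: True, "maintain"),
    ]
    return next(action for test, action in rules if test(i))


# Hypothetical version 2: plain if/else, written independently.
def if_else_version(i: Inputs) -> str:
    if i.platelets < 50 or i.anc < 0.5:
        return "stop"
    elif i.platelets < 100 or i.anc < 1.0:
        return "reduce"
    else:
        return "maintain"
```

When the two versions disagree, `guarded_advice` returns no recommendation at all, which mirrors the "decision support fails with warning" branch of the figure rather than trying to guess which version is right.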

This whole 2-version decision support module has also undergone a comprehensive test procedure to establish reliability. We have used another standard technique from software engineering, the method of equivalence partitioning [9]. In this method we split the inputs into a number of classes such that we expect the program to behave in the same way for all members of each class. We fed values from the middle of these classes, and from their boundaries, into the decision support module of figure 1 to look for disagreements between the two solutions. However, equivalence class testing is not exhaustive and Java can behave non-deterministically (e.g. in searching HashTables), so we have incorporated both solutions into our application, with the hard-coded method “monitoring” the result of the decision in the PROforma solution. For user testing, we constructed a matrix of all valid dosage state transitions; this was used to build user test scripts to ensure that all transitions were verified.
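As a rough illustration of equivalence partitioning applied to a 2-version module, the sketch below generates one probe value at each class boundary and one at the middle of each class, then reports inputs on which two versions disagree. The function names, the input range and the thresholds are assumptions made for the example; `version_b` contains a deliberately seeded boundary fault.

```python
from typing import Callable, List, Sequence, Tuple


def probes(boundaries: Sequence[float], lo: float, hi: float) -> List[float]:
    """Return every class boundary plus the midpoint of every class in [lo, hi]."""
    edges = [lo, *sorted(boundaries), hi]
    mids = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    return sorted(set(edges + mids))


def disagreements(v1: Callable[[float], str],
                  v2: Callable[[float], str],
                  values: Sequence[float]) -> List[Tuple[float, str, str]]:
    """List the probe values on which the two versions give different answers."""
    return [(x, v1(x), v2(x)) for x in values if v1(x) != v2(x)]


# Hypothetical single-input rule on platelet count (made-up thresholds).
def version_a(platelets: float) -> str:
    if platelets < 50:
        return "stop"
    if platelets < 100:
        return "reduce"
    return "maintain"


# Same rule with a seeded off-by-one fault at the first boundary.
def version_b(platelets: float) -> str:
    if platelets <= 50:
        return "stop"
    if platelets < 100:
        return "reduce"
    return "maintain"


if __name__ == "__main__":
    values = probes([50, 100], lo=0, hi=400)
    print(values)
    print(disagreements(version_a, version_b, values))
```

Only the boundary probe at 50 exposes the seeded fault; the mid-class probes agree, which is why the boundaries are tested as well as the class middles.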

4 Discussion

The famous report To Err Is Human from the US Institute of Medicine [10] highlighted the incidence of medical error and associated mortality and morbidity in routine clinical practice, and it is now widely held that computer systems will in the future play an important role in improving healthcare and reducing error (notably IOM, 2001 [11]). However, if a computer system is to be used to guide decisions about treatment, such as the use of cytotoxics in the LISA system, this entails a duty of care on those who design and implement such safety-critical software.

[Figure 1 content: DECISION SUPPORT MODULE. INPUT: current Plts, current ANC, current state, number of weeks at current state, number of weeks tolerated. The input is passed to both the PROforma solution (rules and engine) and the hard-coded guardian (if-else Java method). Test: is the recommendation the same? YES: OUTPUT single recommendation and reasons. NO: OUTPUT decision support fails with warning.]


It has been argued ([6], [12]) that, given the current state of the art in software engineering, it is effectively impossible to guarantee that software will be fault-free except in the simplest applications. Furthermore, in complex environments, such as clinical environments, where much can happen that software designers will be unable to anticipate, the specification itself may not cover all possible hazardous circumstances (see http://www.openclinical.org/qualitysafetyethics.html for a detailed discussion of this subject).

Like many practical clinical systems, the domain covered by LISA’s decision support module is quite simple by comparison with some AI technologies, but the project has provided a valuable opportunity to investigate research issues concerning patient safety. LISA is currently being clinically evaluated and is on course to enter routine clinical practice in about 22 hospitals during 2003. It will incorporate the decision support module discussed here.

References

1. Medical Research Council. UK Acute Lymphoblastic Leukaemia Trial ALL97. http://www.icnet.uk/trials/children/mrcall97_mar00.pdf (2000)

2. Bury, J., Hurt, C. et al. LISA: A Clinical Information and Decision Support System for Collaborative Care in Childhood Acute Lymphoblastic Leukaemia. Proceedings of AMIA Symposium, San Antonio, Texas (2002)

3. Sutton, D R, Fox, J. The Syntax and Semantics of PROforma. In press: JAMIA, 2003

4. Krause, P, Clark, D. Representing uncertain knowledge: An Artificial Intelligence Approach. Intellect books (1993)

5. Fox, J. On the soundness and safety of Expert Systems. Artificial Intelligence in Medicine, 5: 159-179 (1993)

6. Fox, J, Das, S. Safe & Sound. MIT Press (2000)

7. Redmill, F, Chudleigh, M, Catmur, J. System Safety: HAZOP and Software HAZOP. John Wiley & Sons (1999)

8. Avizienis, A, Chen, L. On the implementation of N-version programming for software fault tolerance during execution. Proceedings of IEEE COMPSAC 77, pages 149-155 (1977)

9. Beizer, B. Software Testing Techniques 2nd edn. New York: Van Nostrand Reinhold (1990)

10. Kohn, L T, Corrigan, J M, Donaldson, M S (Editors). To Err Is Human: Building a Safer Health System. Committee on Quality of Health Care in America, Institute of Medicine ISBN 0309068371 (2000)

11. Crossing the Quality Chasm: A New Health System for the 21st Century. Report by the US Institute of Medicine of the National Academies, 1 March (2001)

12. Leveson, N. Safeware: Systems Safety and Computers. Reading, Mass. Addison-Wesley (1995)


The NewGuide Project: Guidelines, Information Sharing and Learning from Exceptions

Paolo Ciccarese1, Ezio Caffi2, Lorenzo Boiocchi1, Assaf Halevy3, Silvana Quaglini1, Anand Kumar1, and Mario Stefanelli1

1 Dipartimento di Informatica e Sistemistica, University of Pavia, Italy [email protected]

2 Consorzio di Bioingegneria e Informatica medica, Pavia, Italy

3 Ness-ISI Ltd., Beer Sheva, Israel

Abstract. Among the widely agreed benefits of computerising a guideline, compared with the traditional text format, are disambiguation, the possibility of viewing the guideline at different levels of detail, and the possibility of generating patient-tailored suggestions. Nevertheless, connecting guidelines with patient records is still a challenging problem, as is their effective integration into the clinical workflow. In this paper, we describe the evolution of our environment for representing and running guidelines. The main new features concern the choice of a commercial product as the middle layer with the electronic patient record, the consequent possibility of gathering information from different legacy systems, and the extension of this "virtual medical record" to the storage of process data. This last feature allows managing exceptions, i.e. decisions that do not comply with guidelines.

1 Introduction

In past years, we developed a tool for implementing clinical practice guidelines (GLs) [1]. Other research groups have also put effort into this field, and a comprehensive paper comparing different tools has recently been published [2]. Compared with past research in this area, our focus is now shifting towards a different representation of data, information, and knowledge, according to their source and degree of generality/specificity. A patient goes through many healthcare organisations, and may be enrolled in more than one GL. Each organisation has its own legacy system, and the granularity of its information is, in general, different from that required by the GL. Moreover, while it is quite normal to store a patient’s clinical data (in more or less detail), process data are rarely recorded (workflow technology is not widely adopted in the healthcare setting). We argue that implementing a GL without access to process data allows only a partial exploitation of the GL's potential for improving care delivery. As a matter of fact, to evaluate physicians’ behavior, and in particular their compliance with the GLs, it is necessary to know when, why and how a certain task has been performed: the raw datum in the Electronic Patient Record (EPR), representing the result of the task, is often not sufficient. It is clear that both contexts, medical and organisational, must be taken into account. The NewGuide Project allows the integration of these two fields while maintaining their own specificity: it puts together the experience gained on Medical Knowledge Formalization [1, 2] and Workflow Management Systems (WfMS) [3]. The scenario that we envisage is the following: the NewGuide inference engine suggests the action to be performed; if the physician accepts the suggestion, control of the action is transferred to a WfMS; the latter, according to the action type, facilitates its execution by looking for operators with the appropriate roles, by notifying them through the most suitable communication systems, etc. It also stores the details of the action's execution. The best performance is achieved when the WfMS is integrated with the EPR, because as soon as the WfMS "finds" the operator who will perform the action, it will provide that operator, when necessary, with the correct electronic form to be filled in.

This paper focuses on the medical knowledge, but with particular attention to the influence that process data may have on a GL inference engine. Section 2 briefly illustrates the graphical formalism, Section 3 shows the solution adopted for the VMR, and Section 4 is about the VMR extension to process data.

2 The NewGuide Representation Formalism

Concerning the GL representation, our approach is still flow-chart like, with a strong connection to Petri Nets, which are a good theoretical basis for process management [4]. Given the complexity of health care profiles, we use a multi-level representation in which a sublevel is the detail of a concept expressed at the higher level. The health care process is therefore composed of a sequence of blocks, on different levels, each of them addressing a medical task or a flow management function. For the specification of both the rules associated with arcs and the criteria for defining abstractions, we have implemented an object-oriented language that can also manage qualitative and temporal abstractions [6].

As mentioned above, any schema built with NewGuide can be translated into a Petri Net (see the analogies in Fig. 1). Maruster et al. [7] showed that Petri Nets allow “process mining”, i.e. discovering a workflow model starting from the workflow logs. Having a Petri Net-compliant GL representation allows the learned process to be compared with the theoretical one.

Fig. 1. A page of the stroke GL: a diagnostic strategy where the choice is among different image-based examinations. Analogies with Petri Nets are shown


3 Computerised Guidelines and Legacy Systems

On top of the formalism illustrated above, we built an inference engine. We must consider that patients are usually treated in different settings, with different information systems. In general these systems are not shared between different health care professionals, and it is difficult to retrieve information of different kinds and nature. Moreover, the same organisation may implement different GLs. Thus, the simple creation of a GL-oriented middle layer between the GL engine and the EPR is not sufficient. A more general level is needed. The international medical informatics community has been tackling this problem for many years. The idea of a middle layer, called “virtual patient record” (VPR) or “virtual medical record” (VMR), has also been discussed by different authors [8]. The Decision Support Technical Committee of HL7 is currently working on its VMR specification. We are also working on a similar concept, and we believe that this middle layer must carry not only the representation of the patient's clinical data but also all the information related to the decisional processes along the health care pathway.

The Medical Case Study - We considered the stroke patients admitted to the Stroke Unit of the "IRCCS C. Mondino" Hospital (from here on, SU). Two of the major risk factors for stroke are hypertension and hypercholesterolemia. Thus, it is not unusual for patients admitted to the SU to be already enrolled in the outpatient departments devoted to those chronic pathological conditions. These two departments belong to another hospital, the "IRCCS Policlinico San Matteo" (from here on, SM), with a different information system. SU also implements a GL for the management of the acute/subacute stroke phases, while SM implements a GL for hypertension treatment. Since stroke is a very severe condition, and patients (often unable to provide information) need to be treated in a very short time, it is essential to retrieve all the possible information from whatever data source. Thus, in the acute stroke phase, SU physicians will benefit from receiving the patient's data from the SM database. Moreover, in the post-acute phase, SM physicians will benefit from receiving the stroke history from the SU, in order to assess the best strategy for secondary prevention.

The Adopted Solution: dbMotion¹ - Since we do not want to modify the legacy systems, we need to collect the information in different formats and then reorganize it for homogeneous management at the GL engine level. This target can be reached with a database connectivity product. We chose the dbMotion platform, a commercial technology for planning, establishing, operating and managing an Internet-based Virtual medical information Community. The dbMotion platform enables on-line collection of medical data components from decentralized databases and transfer of the data to authorized users according to parameters pre-defined by user profiles. Most of its infrastructure is transparent to the end-user, who retrieves the information on his workstation by means of an online viewer or, as in our case, of a data server. In fact, the system administrators are the ones who configure dbMotion according to the arising needs and requirements of the organization. For our purposes, dbMotion is the way to map all the different data sources into a single structure. It is based on a VMR called the International Clinical Information Schema (ICIS), which is object-oriented and HL7 compliant [5].

1 DbMotion is a product of Ness-ISI Ltd. Beer Sheva, Israel (www.dbMotion.com)


NewGuide and dbMotion VMR - The ICIS provides the objects for the representation of patients, healthcare structures, observations, diagnoses and drug therapies. These objects are useful not only for defining the information entity but also for sharing it with the external world, through the use of well-accepted medical terminology systems. In fact the tool requires a preliminary mapping of the observations into LOINC terminology [9] and the ICD9-CM disease classification, and then it guarantees a homogeneous management of the different EPR data.

For example, Figure 2 shows the object representing the body temperature value, as it is returned by dbMotion.
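A client of the data server might read such an object along the following lines. This is only a sketch: the attribute names are those of Figure 2, but the element is closed here as `/>` so that it parses as stand-alone XML, the day/month/year reading of the date is an assumption, and `read_observation` is a hypothetical helper, not part of dbMotion.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The object of Figure 2, with straight quotes and a self-closing tag
# (the closing form is an assumption made so the sample is well-formed).
SAMPLE = (
    '<ObservationObject PatientId="34533332" PatientSystemId="00534532" '
    'ProcedureId="C4589385" ObservationCode="8309-7" '
    'ObservationType="C4844893" ObservationClass="BDYTMP.ATOM" '
    'ObsName="BODY TEMPERATURE" ObservationDate="12/02/2003" '
    'ObservationUnits="°C" ObservationValue="36"/>'
)


def read_observation(xml_text: str) -> dict:
    """Extract the fields a GL engine would typically need from one object."""
    attrs = ET.fromstring(xml_text).attrib
    return {
        "patient": attrs["PatientId"],
        "code": attrs["ObservationCode"],
        "name": attrs["ObsName"],
        "value": float(attrs["ObservationValue"]),
        "units": attrs["ObservationUnits"],
        # Assumed to be day/month/year; the source does not state the format.
        "date": datetime.strptime(attrs["ObservationDate"], "%d/%m/%Y").date(),
    }


if __name__ == "__main__":
    print(read_observation(SAMPLE))
```

Because every source system is mapped to the same object layout, a single reader of this kind would serve the SU and SM data alike.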

4 VMR Extensions to Process Data

To trace and support the effective usage of a GL we need some extensions to the VMR. They allow storing information about “task substitution”, “task abortion” and so on. These data objects are designed to contain a motivation provided by the user. The motivations concern four levels of compliance/non compliance:

Compliance (Level 0): the GL flow is completely preserved, as well as the original intention or meaning.

Weak Non Compliance (Level 1): identifies the substitution of an action with another one that is similar in terms of medical goals or findings (for example, substitution of an MRI with a CT or with an X-ray); the flow suggested by the GL is preserved.

Definite Non Compliance (Level 2): identifies a suspension, abortion or delay; the flow of actions suggested by the GL is modified in terms of medical intention and paths, but no new actions are added.

Strong Non Compliance (Level 3): identifies the rejection of a GL suggestion or the insertion of new actions. This is a modification of both flow and intention.

In order to better understand why the “non compliance management” is necessary for the GL inference engine to run appropriately, we consider a simple flow, shown in Figure 3. We can have different exceptions:

Level 1: the Observation requested by the GL is not available, and the physician substitutes it with a similar one that can provide the same information but, for example, with a different confidence (e.g. an MRI substituted by a CT scan). The process will continue but, at the rule-based decision block, the system will advise the user that the confidence of the Value is lower than the one expected by the GL.

Level 2 or 3: the Observation cannot be done (e.g. for lack of resources) or it is rejected (e.g. the physician does not agree with the GL). The system will continue through the flow until the rule-based decision; at this point it will automatically convert the decision block into a “non rule-based” decision in which it is up to the user to choose the next task, if any. Alternative options are to go directly to the end of the decisional block, or to leave the GL.

Fig. 3. A GL flow that can be transformed as a consequence of a non compliance (see text)

<ObservationObject
    PatientId="34533332"
    PatientSystemId="00534532"
    ProcedureId="C4589385"
    ObservationCode="8309-7"
    ObservationType="C4844893"
    ObservationClass="BDYTMP.ATOM"
    ObsName="BODY TEMPERATURE"
    ObservationDate="12/02/2003"
    ObservationUnits="°C"
    ObservationValue="36" >

Fig. 2. The dbMotion object “Temperature”
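The effect of each compliance level on the rule-based decision block can be summarised in a few lines. The sketch below is one possible reading of the behaviour described in the text; the enum, the function name and the returned strings are all illustrative assumptions rather than NewGuide's actual data model.

```python
from enum import IntEnum
from typing import Optional, Tuple


class Compliance(IntEnum):
    COMPLIANT = 0   # flow and intention preserved
    WEAK = 1        # action substituted by a similar one, flow preserved
    DEFINITE = 2    # suspension, abortion or delay
    STRONG = 3      # suggestion rejected or new actions inserted


def decision_mode(level: Compliance) -> Tuple[str, Optional[str]]:
    """Return how the next decision block is handled, plus an optional notice."""
    if level == Compliance.COMPLIANT:
        return ("rule-based", None)
    if level == Compliance.WEAK:
        # The rule still fires, but on a value of lower confidence.
        return ("rule-based", "confidence of the value is lower than expected by the GL")
    # Levels 2 and 3: the required observation is missing or was rejected,
    # so the rule cannot be evaluated and the choice is handed to the user.
    return ("non rule-based",
            "user chooses the next task, skips to the end of the block, or leaves the GL")


if __name__ == "__main__":
    for level in Compliance:
        print(level.name, decision_mode(level))
```

Storing the level together with the user's motivation is what would later allow the exceptions to be analysed, for instance by counting how often a given block is converted to a non rule-based decision.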

5 Conclusion

NewGuide is intended to support the physician throughout the management of the patient. It is particularly difficult to draw a clear line between "pure medical actions" and "organisational actions", i.e. those actions that could be managed by a workflow management system. We are trying to establish such a distinction by implementing a communication layer able to propagate the effect of the exceptions that can arise during the GL-based process.

Acknowledgments

This work has been partially funded by the FIRB project, Italian Ministry of University and Research. We thank Prof. E. Marchesi, Dr. G. Micieli and their staff for their contribution on the medical side.

References

1. Quaglini S, Stefanelli M, Cavallini A, Micieli G, Fassino C, Mossa C. Guideline-based Careflow Systems. Artificial Intelligence in Medicine 2000; 20(1): 5-22

2. Peleg M, Tu S, Ciccarese P, Kumar A, Quaglini S, Stefanelli M et al. Comparing models of decision and action for guideline-based decision support: a case-study approach. JAMIA 2003; 10(1): 52-68

3. Panzarasa S, Maddè S, Quaglini S, Pistarini C, Stefanelli M. Evidence-based careflow management systems. Journal of Biomedical Informatics 2002; vol. 35, 123-139

4. van der Aalst W, van Hee K. Workflow Management. Models, Methods, and Systems. The MIT Press 2002

5. Jenders RA, Sujansky W, Broverman CA, Chadwick M. Towards improved knowledge sharing: assessment of the HL7 Reference Information Model to support medical logic module queries. Proc AMIA Annu Fall Symp 1997, 308-12

6. Bellazzi R, Larizza C, Lanzola G. An http-based server for temporal abstractions. IDAMAP '99 working notes, pp. 52-62

7. Maruster L, van der Aalst W, Weijters A, van den Bosch A, Daelemans W. Automated Discovery of Workflow Models from Hospital Data. Proceedings of ECAI 2002; 32-36

8. Johnson P, Tu S, Musen MA, Purves I. A Virtual Medical Record for Guideline-Based Decision Support. AMIA Annual Symposium, Washington, DC, 2001

9. LOINC Users Guide release 1.0N, 02/04/2000.


Managing Theoretical Single-Disease Guideline Recommendations for Actual Multiple-Disease Patients

Gersende Georg, Brigitte Séroussi, and Jacques Bouaud

Mission Recherche en Sciences et Technologies de l'Information Médicale, STIM, DPA / DSI / AP-HP, Paris, France

{gge,bs,jb}@biomath.jussieu.fr

Abstract. Situations managed by clinical practice guidelines (CPGs) usually correspond to general descriptions of theoretical patients that suffer from only one disease in addition to the specific pathology CPGs focus on. When knowledge bases are built, this lack of decision support for complex multiple-disease patients is usually transferred to computer-based systems. Starting from a GEM-encoded instance of CPGs, we developed a module that automatically generates IF-THEN-WITH decision rules. A two-stage unification process has been implemented. All the rules whose IF-part partially matches a patient clinical profile are triggered. A synthesis of the triggered rules is then performed to eliminate redundancies and incoherence. All remaining, possibly competing, recommendations are finally displayed to physicians, leaving them the control and the responsibility of handling the controversy and thus the opportunity to make informed decisions.

1 Introduction

Clinical practice guidelines (CPGs) are originally textual documents. Usually structured as a set of clinical situations, they provide, for each case, evidence-based therapeutic recommendations. However, these clinical situations usually correspond to patients that suffer from only one disease in addition to the specific pathology CPGs focus on. For instance, in the case of Canadian guidelines on the management of hypertension (HT) [1], recommended drug therapies are provided for patients with HT and diabetes, with HT and ischemic heart disease, with HT and systolic dysfunction, etc. However, there is no explicit therapeutic decision support for patients suffering from HT and diabetes and ischemic heart disease and systolic dysfunction.

This is not a difficulty for a clinician who reads the textual guidelines while looking for the best treatment for this kind of complex polypathological patient. He can indeed interpret the guidelines and either organize possibly contradictory evidence-based recommendations to choose the most suitable therapy for his/her patient, or combine different recommendations and propose the corresponding combination of drugs. However, when CPGs are provided through decision support systems (DSSs), the incompleteness and ambiguities of the original guideline documents are usually transferred to the DSSs’ knowledge bases during the formalization step, so he may not be satisfied by the recommendations displayed for these same complex clinical cases, in which patients suffer from numerous diseases.

Starting with the textual document of the Canadian recommendations for the management of hypertension [1], we used GEM [2] to structure and organize the guideline content. In previous work [3], we presented an interpretative framework to disambiguate the narrative guideline and build the corresponding GEM-encoded instance. In this paper, we propose a solution to deal with the incompleteness of the set of clinical situations managed by the guidelines and provide the user physician with recommended therapeutic options for any patient.

2 Background

Choosing the best therapy is usually formalized as a classification problem. However, guideline knowledge relies on deterministic reasoning, while clinical practice often requires reasoning with incomplete and imprecise information. The management of imprecision and uncertainty has often been modeled using fuzzy logic [4]. An element in a fuzzy set has a partial membership in it, rather than all-or-none membership as in a conventional set. The degree of membership is described by a membership function. Using fuzzy inferencing to weight qualitative values, Liu et al. [5] implemented the computerization of CPGs for lumbar puncture.

In this paper, we propose a pragmatic solution to the problem of choosing the best therapy for any given patient suffering from hypertension. Recommended drug therapies are provided for any complex patient suffering from numerous disorders. Though not always evidence-based, these therapies are elaborated from the synthesis of multiple disease-specific but evidence-based recommendations triggered by the partial matching of patient data and rule preconditions. Following a documentary paradigm of medical decision making, our approach aims at providing physicians with the set of all potentially relevant recommendations, leaving them the responsibility of a contextual interpretation to synthesize the best patient-specific therapy.

3 Material

GEM is a document model based on an XML DTD [2] that organizes the heterogeneous knowledge contained in CPGs in a multi-level hierarchy of more than 100 discrete elements structured in nine major branches. The knowledge components section represents the recommendation’s logic. We only used conditional recommendations, which apply under specific circumstances. They are composed of different sub-elements, among which only a few are actually used (decision.variable, action, recommendation.strength).

We worked on the 1999 Canadian recommendations for the management of hypertension [1], chosen as the knowledge resource in the ASTI project [6] (a French project that aims at designing a guideline-based DSS to improve general practitioners' compliance with best therapeutic practices). This guideline document is well structured in chapters that correspond to specific clinical situations. Within each chapter, an ordered sequence of therapeutic recommendations is proposed. However, the translation from the text to a formalized knowledge base is complex because of the incompleteness (i.e., no recommendation for polypathological patient conditions) and the ambiguities (i.e., the terms used are imprecise or not defined) of the original document.

4 Method

In order to automatically derive production rules from the GEM-encoded instance of the Canadian recommendations for the management of hypertension, we first slightly extended the original GEM DTD to standardize the generation of IF- and THEN-parts. Then, under the syntactic constraints of the new GEM DTD, (i) we created a normalized instance of the Canadian CPGs, (ii) we developed a module able to automatically derive a rule base from the instance, (iii) we built a classic forward chaining inference engine to exploit the rule base and (iv) we proposed an algebra to resolve conflicts and synthesize the possibly contradictory therapeutic recommendations proposed by the system for any given patient.

4.1 Creation of the GEM-Encoded Instance

In Canadian CPGs, a clinical situation is described by some criteria, e.g., age, risk factors and associated diseases, denoted as a set, C, of patient specific parameters {Ci}. The therapeutic history, denoted T={Tj}, is also characterized by some elements (lines of therapy, levels of association, etc.). This information has been marked up as attribute ids of the corresponding value of decision.variable elements.

A given guideline-based clinical situation [C ∧ T] is associated with a set D={Dk} of recommended drug therapies. These proposed treatments have been marked up as attribute ids of the corresponding value of action elements. The grade of each recommendation is labeled as the recommendation.strength according to the guideline information (A, B, C, or D).

4.2 Automatic Rule Base Derivation

Because recommendations are implicitly ordered by priority, we defined an additional attribute, the “character”, to distinguish between: (i) “dominant” recommendations, denoted D_Rec, established for hypertensive patients suffering from a specific disease (diabetes, etc.), and which have priority over other therapeutic options; (ii) “neutral” recommendations, denoted N_Rec, that follow the recommendations established for uncomplicated hypertension (peripheral vascular disease, etc.); (iii) “recessive” recommendations, denoted R_Rec, that follow the recommendations established for concurrent diseases or risk factors (cerebrovascular disease, etc.), but with a minor impact.

We defined a second additional attribute, the “sign”, to distinguish positive recommendations (sign=“+”), which advocate a given therapeutic class, from negative recommendations (sign=“-”), which advocate, on the contrary, avoiding a therapeutic class.

The construction of the rule base relies on the identification and the extraction of decision.variable, action, and recommendation.strength elements from the GEM-encoded instance. We used the SAX parser (Simple API for XML) [7]. A rule $R_i$ is finally formalized as an IF-THEN-WITH rule such as:

$R_i$ : IF $[C \wedge T]_i$ THEN $\{D_k\}_i$ WITH $[\text{strength} \wedge \text{character} \wedge \text{sign}]_{k,i}$
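The derivation step can be sketched with Python's standard SAX interface. The fragment below is a simplified, hypothetical stand-in for a GEM-encoded instance: only the three element names decision.variable, action and recommendation.strength come from the text, while the wrapper elements, the placement of the "character" and "sign" attributes and the clinical content are assumptions made for illustration.

```python
import xml.sax

# Hypothetical, heavily simplified fragment (not the real GEM DTD).
GEM_FRAGMENT = """<knowledge.components>
  <conditional character="dominant" sign="+">
    <decision.variable id="hypertension"/>
    <decision.variable id="diabetes"/>
    <action id="ACE_inhibitor"/>
    <recommendation.strength>A</recommendation.strength>
  </conditional>
</knowledge.components>"""


class RuleHandler(xml.sax.ContentHandler):
    """Collect one IF-THEN-WITH rule per <conditional> element."""

    def __init__(self):
        super().__init__()
        self.rules = []
        self._rule = None
        self._text = None   # buffer for recommendation.strength text

    def startElement(self, name, attrs):
        if name == "conditional":
            self._rule = {"if": [], "then": [],
                          "with": {"strength": None,
                                   "character": attrs.get("character"),
                                   "sign": attrs.get("sign")}}
        elif self._rule is not None and name == "decision.variable":
            self._rule["if"].append(attrs["id"])
        elif self._rule is not None and name == "action":
            self._rule["then"].append(attrs["id"])
        elif self._rule is not None and name == "recommendation.strength":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "recommendation.strength" and self._rule is not None:
            self._rule["with"]["strength"] = "".join(self._text).strip()
            self._text = None
        elif name == "conditional":
            self.rules.append(self._rule)
            self._rule = None


def derive_rules(xml_text: str) -> list:
    handler = RuleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.rules


if __name__ == "__main__":
    print(derive_rules(GEM_FRAGMENT))
```

An event-driven parser suits this task because each rule can be emitted as soon as its enclosing element closes, without holding the whole guideline in memory.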

4.3 Inference Engine General Principles

We have developed a simple inference engine implementing a forward chaining mechanism to handle the previously built rule base. A strict unification stage is processed first. When there is at least one rule $R_i$ whose IF-part strictly matches the patient parameters, i.e., $\exists R_i$ such that $[C \wedge T]_{patient} = [C \wedge T]_i$, then $R_i$ is triggered, leading to the recommendation of the set $\{D_k\}_i$ of drug therapies. When no rule is triggered, the set D of recommended drug therapies is empty. A relaxed unification algorithm is then processed that triggers the rules $R_i$ whose IF-part includes diseases that are present in the set $C_{patient}$ of patient clinical parameters and are considered by the guidelines as relevant to recommend specific therapies (diabetes, ischemic heart disease, etc.), i.e., $\exists R_i$ such that $C_{patient} \cap C_i \cap \{\text{CPG diseases}\} \neq \emptyset$.
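The two unification stages translate almost directly into set operations. The following is a minimal sketch under stated assumptions: the `Rule` structure, the function name and the sample rules are hypothetical, and the drugs named are placeholders rather than the content of the Canadian guideline.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Rule(NamedTuple):
    c: FrozenSet[str]          # clinical criteria C_i
    t: FrozenSet[str]          # therapeutic history T_i
    drugs: Tuple[str, ...]     # recommended therapies {D_k}_i


def trigger(rules: List[Rule],
            c_patient: FrozenSet[str],
            t_patient: FrozenSet[str],
            cpg_diseases: FrozenSet[str]) -> List[Rule]:
    """Strict unification first; fall back to relaxed unification if it fails."""
    strict = [r for r in rules if r.c == c_patient and r.t == t_patient]
    if strict:
        return strict
    # Relaxed stage: the rule shares at least one guideline-relevant
    # disease with the patient (non-empty three-way intersection).
    return [r for r in rules if c_patient & r.c & cpg_diseases]


# Illustrative rule base (placeholder drugs).
R1 = Rule(frozenset({"HT", "diabetes"}), frozenset({"first_line"}), ("drug_A",))
R2 = Rule(frozenset({"HT", "ischemic_heart_disease"}), frozenset({"first_line"}), ("drug_B",))
CPG_DISEASES = frozenset({"diabetes", "ischemic_heart_disease", "systolic_dysfunction"})
```

For a patient with HT, diabetes and ischemic heart disease no rule matches strictly, so the relaxed stage triggers both `R1` and `R2`; HT itself is left out of `CPG_DISEASES` here so that it does not trigger every rule.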

When numerous rules have been activated, two operations are applied to summarize the set of recommended drug therapies:

(i) Fusion of recommendations to eliminate redundancies, i.e., when two or more rules (say R1 and R2) having identical character and sign lead to recommendations D1 and D2 of the same drug, the recommendations are merged.

(ii) Deletion of recommendations to eliminate incoherence, i.e., when two or more rules (say R1 and R2) having identical character but opposite signs lead to recommendations D1 and D2 of the same drug, the contradictory recommendations are eliminated.

Once the fusion and deletion steps are performed, there may still be more than one recommendation to be considered. The last filter to be applied is based on the character of the different recommended therapies. A simple intuitive algebra has been defined:

• N_Rec + R_Rec = N_Rec;
• D_Rec + R_Rec = D_Rec;
• D_Rec + N_Rec = D_Rec.


In conclusion, (i) if there is at least one dominant recommendation in the set of selected recommendations, neutral and recessive recommendations are eliminated and all the remaining dominant recommendations are finally displayed, allowing the user to choose how to combine the different drugs; (ii) if there is no dominant recommendation, the basic recommendation is applied.
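The fusion, deletion and character-filter steps can be put together in a short sketch. It is one reading of the text, not the authors' implementation: recommendation strength is omitted so that identical recommendations merge trivially, the case without any dominant recommendation is resolved by applying the algebra literally (neutral wins over recessive), and all names and sample drugs are hypothetical.

```python
from typing import List, NamedTuple


class Rec(NamedTuple):
    drug: str
    character: str   # "D" (dominant), "N" (neutral) or "R" (recessive)
    sign: str        # "+" recommend, "-" avoid


RANK = {"D": 2, "N": 1, "R": 0}


def synthesize(recs: List[Rec]) -> List[Rec]:
    # Fusion: identical (drug, character, sign) triples are merged,
    # keeping the first occurrence and the original order.
    fused = list(dict.fromkeys(recs))

    # Deletion: same drug and character but opposite signs cancel out.
    present = set(fused)

    def opposite(r: Rec) -> Rec:
        return Rec(r.drug, r.character, "-" if r.sign == "+" else "+")

    kept = [r for r in fused if opposite(r) not in present]
    if not kept:
        return []

    # Character filter: only the highest-ranking character survives
    # (D + N = D, D + R = D, N + R = N).
    top = max(RANK[r.character] for r in kept)
    return [r for r in kept if RANK[r.character] == top]
```

Note that the function deliberately stops once only recommendations of the top character remain: if several dominant recommendations survive, they are all returned so that the physician, not the system, decides how to combine them.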

5 Conclusion

We compared the GEM-based system and the ASTI system on a sample of 10 real patient cases, reduced to 8 because 2 cases were not handled by ASTI. Of the 8 analyzed cases, the therapies recommended by the two approaches were identical in 3 (37.5%); in the 5 remaining cases they were compatible in 3 (60%) and different in 2 (40%). When compared with the therapeutic decision of a domain expert for the same cases, the GEM-based approach always led to more relevant recommendations.

Since CPGs do not provide totally relevant recommendations for patients suffering from numerous disorders, we developed a system able to select partially relevant recommendations for these complex cases. However, instead of elaborating arbitrary processes to synthesize the set of the remaining "dominant" recommendations, and following the documentary paradigm of medical decision making, we propose a pragmatic approach that allows for a physician-controlled interpretation of context. The system displays the filtered set of patient-specific, partially relevant recommendations, but the physician keeps the responsibility of handling possibly contradictory recommendations on the basis of his own way of weighting patient parameters or strengths of evidence.

References

1. Feldman RD, Campbell N, Larochelle P, Bolli P, Burgess ED, Carruthers SG, et al. Re-commandations de 1999 pour le traitement de l'hypertension artérielle au Canada. CMAJ 1999;161(12):SF1-25.URL: http://www.cma.ca/cmaj/vol–161/issue–12/hypertension/hyper-f.htm

2. Shiffman RN, Karras BT, Agrawal A, Chen R, Marenco L, Nath S. GEM: a proposal for a more comprehensive guideline document model using XML. J Am Med Inform Assoc 2000;7(5):488-98.

3. Georg G, Séroussi B, Bouaud J. Interpretative framework of chronic disease management to guide textual guideline GEM-encoding. In: Baud R, Fieschi M, Le Beux P, Ruch P (eds). Proceedings of MIE 2003. IOS Press 2003: pp. 531-6.

4. Zadeh LA. Fuzzy sets. Information and Control 1965;8(3):338-353.

5. Liu JCS, Shiffman RN. Operationalization of clinical practice guidelines using fuzzy logic. J Am Med Inform Assoc 1997;4(4):283-87.

6. Séroussi B, Bouaud J, Dréau H, Falcoff H, Riou C, Joubert M, Simon C, Simon G, Venot A. ASTI: A guideline-based drug-ordering system for primary care. In: Patel VL, Rogers R, Haux R (eds). Medinfo 2001;84(1):528-32.

7. URL: http://www.megginson.com/SAX/

Informal and Formal Medical Guidelines: Bridging the Gap*

Marije Geldof1, Annette ten Teije1, Frank van Harmelen1, Mar Marcos2, and Peter Votruba3

1 Vrije Universiteit Amsterdam, Dept. of Artificial Intelligence {annette,frank.van.harmelen}@cs.vu.nl

2 Universitat Jaume I, Dept. of Computer [email protected]

3 Institute of Software Technology and Interactive Systems

Abstract. The role of medical guidelines is becoming more and more important in the medical field. Within the Protocure project it has been shown that the quality of medical guidelines can be improved by formalisation. However, formalisation turns out to be a very time-consuming task, resulting in a formal guideline that is much more complex than the original version, and the relation with this original guideline is often unclear. This paper presents a case study in which the relation between the informal medical guideline and its formal counterpart is investigated. This has been used to determine the gaps between the formal and informal guidelines and the cause of the size explosion of the formal guidelines.

1 Introduction

Medical practice guidelines are “systematically developed statements to assist practitioners and patient decisions about appropriate health care for specific circumstances” [2]. They contain more or less precise recommendations about the medical tests or interventions to perform, or about other aspects of clinical practice. These guidelines are used by a wide variety of medical professionals: medical specialists, family doctors, hospital nurses.

The interest in medical guidelines has resulted in the development of a number of special-purpose knowledge representation languages intended for modelling guidelines [3, 7, 9]. They provide the opportunity to formalise informal guidelines into more formal objects. However, formalisation of a guideline turns out to be a very time-consuming task, resulting in a formal guideline that is much more complex than the original version. Even more importantly, the relation between the informal (original) and formal guideline is not always clear: which parts of the informal guideline correspond to which parts of the formal model, are all parts of the informal guideline covered in the formal model, etc.

* This work has been partially supported by the European Commission’s IST program, under contract number IST-2001-33049-Protocure. Part of this work (e.g., GMT) was done within the Asgaard Project, which is supported by “Fonds zur Förderung der wissenschaftlichen Forschung” (Austrian Science Fund), grant P12797-INF.



This paper presents the results of an analysis (more fully reported in [5]) of two informal medical guidelines and their formalised counterparts. For this analysis the relation between the informal guideline and formal guideline was made explicit. The analysis focused, among other things, on: (1) whether everything in the original guideline that should have been modelled has in practice been modelled; (2) whether elements in the formal guideline are explicitly stated, implicitly stated or completely missing in the original guideline; (3) why formal guidelines are so much bigger in size than their informal counterparts.

The contribution of this analysis is the categorisation of the gaps between informal and formal versions of the guidelines, a clarification of the size explosion and, last but not least, the explicit representation of the relation between two selected informal guidelines and their formal counterparts. The latter resulted, among other things, in the visualisation of anomalies already found during the formalisation.

The structure of the paper is as follows. Section 2 describes our case study. Section 3 discusses the gaps between the informal and formal representations of guidelines and our observations in the process of making the relation between informal and formal models explicit. Section 4 indicates the causes of increased complexity in the formal models. Finally, Section 5 concludes and discusses some open issues and future work.

2 The Case Study

This study has been carried out within the Protocure project (www.protocure.org), a European project which aims to evaluate the use of formal methods for quality improvement of medical guidelines. The guidelines selected and formalised in Asbru [9] within the Protocure project have been used as a starting point. The relations between the original and formal guideline have been defined with the Guideline Markup Tool [10].

The Selected Guidelines. The guidelines that have been used in this study are the American Academy of Pediatrics practice guideline for the Management of Hyperbilirubinemia in the Healthy Term Newborn [1] and the Dutch College of General Practitioners (NHG) standard for Type 2 Diabetes Mellitus [8].

Asbru: A Representation Language for Medical Guidelines. Asbru is a plan representation language for representing clinical guidelines as time-oriented, skeletal plans [9], that is, plans that can be instantiated for every patient [4].

In Asbru a clinical guideline consists of a name, a set of arguments, including a time annotation (representing the temporal scope of the plan), and five elementary components: preferences, intentions, conditions, effects and a plan body, which describes the actions to be executed. The plan name is compulsory and all the other components are optional. Each plan may contain an arbitrary number of subplans within its plan body, which may themselves be decomposed into sub-subplans. So a plan can include several potentially decomposable sequential, concurrent or cyclical plans.
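The plan structure just described can be sketched as a small recursive data type. This is only an illustration in Python, not Asbru syntax: the class `Plan`, its field names, the `ordering` field and the plan names in the example are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Plan:
    """Hypothetical stand-in for an Asbru plan (not real Asbru syntax)."""
    name: str                                   # the only compulsory component
    time_annotation: Optional[str] = None       # temporal scope of the plan
    preferences: List[str] = field(default_factory=list)
    intentions: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)
    effects: List[str] = field(default_factory=list)
    ordering: str = "sequential"                # assumed: "sequential", "concurrent" or "cyclical"
    subplans: List["Plan"] = field(default_factory=list)  # the plan body

    def depth(self) -> int:
        """Number of nesting levels, counting this plan as level 1."""
        return 1 + max((p.depth() for p in self.subplans), default=0)

    def count(self) -> int:
        """Total number of plans in this hierarchy, including this one."""
        return 1 + sum(p.count() for p in self.subplans)


# Invented plan names, used only to exercise the structure.
check = Plan("measure-value")
treat = Plan("treat", ordering="cyclical", subplans=[Plan("adjust-dose")])
root = Plan("manage-patient", subplans=[check, treat])
print(root.depth(), root.count())
```

Making every component except `name` optional mirrors the statement that only the plan name is compulsory; the recursion in `subplans` is what lets a small informal text grow into a deep formal hierarchy.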

Guideline Markup Tool. GMT is an editor that helps in translating guidelines from free text into Asbru [10]. One of the functionalities of this tool that has been used in this case study is to define links between an original guideline (in the form of natural-language text with tables and diagrams) and its Asbru model. To define a link the user selects a piece of the original guideline and a related piece of the formal guideline and inserts a link, which connects the two pieces. With this functionality all the relations between an original guideline and its formal model have been defined.

3 Linking: The Relation between a Formal and Informal Guideline

Within this study we gain insight into the relation between the original and formalised guideline by defining, with the Guideline Markup Tool, a link for each relationship between the two versions. These links between the original guideline and its formal model can serve different purposes: (i) to give insight into the relation between the original guideline and its formalisation; (ii) to enable discussion with domain experts; (iii) to reveal whether everything in the original guideline that should have been modelled in the formal guideline really has been modelled; (iv) to make changes in the formal model much easier when the original guideline is updated, since the place where the adjustment should be made can easily be found by following the link pointing there; (v) to help the modeller during the formalisation process.

Types of Links. The links that have been defined within this study can roughly be distinguished in two ways. A link can be characterised as explicit or implicit. Furthermore, the level (high or low) at which a link is defined can differ (see [5]). Below, we discuss the explicit and implicit links in more detail.

Explicit links are links that show a very direct, obvious relation. For example, the original diabetes guideline speaks about “fat metabolism problems” and the formal diabetes guideline uses the condition “fat-metabolism = true”. Implicit links, on the other hand, are much less apparent. They do relate two parts of the original and formalised guideline that belong together, but the correspondence is not exact.

Several reasons for implicit links can be identified. For example, domain experts may have clarified terms that are vague in the original guideline. This results in a detailed statement in the formalisation, which is related to a vaguer statement in the original guideline. For example, the original diabetes guideline speaks about “older age”, which, on advice from domain experts, has been translated as “age > 60”.

Another reason for an implicit link can be the need for a certain medical parameter. To be able to use this parameter, its value first needs to be obtained. Original guidelines mostly take it for granted that this value needs to be obtained and do not mention it explicitly. In the formal guideline, on the other hand, an explicit statement to obtain this value is needed.

Third, common knowledge in the original guideline can lead to a different model in the formalisation. For example, the diabetes guideline prescribes checking the blood pressure. It can be considered a common fact that, for checking blood pressure, both lower and higher blood pressure need to be measured. In this case the original guideline speaks about “blood pressure”, while the formal guideline speaks about “higher blood pressure” and “lower blood pressure”, which results in two implicit links.

Finally, sometimes related aspects in the original guideline are put together in a “superplan”, which is subdivided into subplans that represent these different aspects. Besides explicit links to these subplans, an implicit link from the “superplan” to the collection of related aspects is also desirable.

The distinction between explicit and implicit links shows that some of the relations between the original guideline and its formal guideline are very obvious, but others are much more indirect for various reasons.

Analysis of Defined Links. During the formalisation process of the two guidelines, different anomalies have been identified and documented [6]. Some of these anomalies concerned information that was missing from the original guideline. The linking process makes these pieces of missing information visually apparent, because they are parts of the formal guideline that remain unrelated to any part of the informal guideline.

One of the most surprising results was that new anomalies were uncovered. Some had not been identified during the formalisation process and others had even been introduced during the formalisation process.

Furthermore, the links have made visible those parts of the original guideline that have not been translated into the formal guideline. These links give insight into the choices of the modeller of the formal guideline.

Not only are there parts of the original guideline that remain unlinked, there are even more parts of the formal model that remain unlinked, because there is no direct relation with the original guideline. Mostly this is caused by information that is not explicit in the original guideline but thought necessary in the formal guideline. These unlinked parts show that the formal guidelines are much more extensive than the original guideline, which in this study is considered as extra complexity. The next section deals with all the different aspects causing this extra complexity that have been identified on the basis of the defined links.
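The idea that unlinked parts become visible once all links are defined can be illustrated with a minimal sketch. It does not reproduce the Guideline Markup Tool or its data format: the tuple layout, the function `unlinked` and the entries "patient education" and "ask-value" are assumptions introduced for illustration, while the linked pairs reuse the examples quoted earlier in this section.

```python
from typing import List, Set, Tuple

# (informal part, formal part, kind of link); layout assumed for this sketch.
Link = Tuple[str, str, str]


def unlinked(parts: List[str], links: List[Link], side: int) -> List[str]:
    """Return the parts that no link touches; side 0 = informal, 1 = formal."""
    linked: Set[str] = {link[side] for link in links}
    return [p for p in parts if p not in linked]


informal = ["older age", "blood pressure", "patient education"]
formal = ["age > 60", "higher blood pressure", "lower blood pressure", "ask-value"]
links = [
    ("older age", "age > 60", "implicit"),
    ("blood pressure", "higher blood pressure", "implicit"),
    ("blood pressure", "lower blood pressure", "implicit"),
]

print(unlinked(informal, links, 0))  # informal parts that were not modelled
print(unlinked(formal, links, 1))    # formal parts with no informal source
```

In this reading, the first list corresponds to modelling choices (parts left out of the formal model) and the second to the extra complexity discussed in the next section.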

4 Where Does the Complexity Come from?

Formal guidelines turn out to be much more extensive than their original versions. Of the two guidelines used in this study, the jaundice guideline originally consisted of 8 pages and its formalisation in an intermediate representation form is 40 pages long. The diabetes model is as long as 56 pages, while its original covered only 4 pages.
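As a quick check on these figures, the growth factor of each guideline can be computed directly from the page counts above. Page counts are only a crude proxy for complexity, and the variable names are chosen for this sketch.

```python
# (original pages, formal pages), as reported in the text.
pages = {"jaundice": (8, 40), "diabetes": (4, 56)}

# Growth factor = formal size divided by original size.
growth = {name: formal / original for name, (original, formal) in pages.items()}
print(growth)  # {'jaundice': 5.0, 'diabetes': 14.0}
```

So the jaundice model is five times the size of its source, and the diabetes model fourteen times.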

We distinguish three main causes of this additional complexity: additional information, domain specification and nearly identical plans. In this section we focus on the first cause, additional information, which is illustrated with six concrete reasons. During the formalisation of medical guidelines, additional information can be necessary for a proper formalisation. The additional information can appear in different forms:

1. Background knowledge. First of all, medical guidelines in general assume certain background knowledge to be common knowledge for medical practitioners.

2. Missing information about conditions. Conditions control the sequence of proposed actions in the guideline. Sometimes a condition is implicitly derived from the original guideline or derived from additional information that has been gained from domain experts.

3. The intentions of plans. When actions are performed with respect to medical diagnosis or treatment, it is often useful to realise why this action is being performed at all. This can be expressed by defining the intentions of a plan. In most cases intentions are not explicitly stated in the original guideline, but considered to be known by the medical practitioner.

4. Missing information about the repetition of actions. In a cyclical plan the time interval at which the plan should be repeated, the so-called retry delay, has to be specified. Sometimes this retry delay has to be obtained from a medical expert.

5. An important aspect of medical guidelines is how all the different steps and actions within the guideline are managed. Should the specified plans and actions be performed in parallel, in sequence, etc.? To be able to represent this kind of control aspect, a formal representation language should provide control structures to define how the plans of a guideline should be executed. Asbru contains many different kinds of such control structures. Often this control information is not explicitly stated in the original guideline. It can be either derived from the original guideline or obtained from domain experts.

6. A plan can be user-performed, which means this plan is executed through some action by the user. Mostly it is apparent which actions should be executed by the user, so this is not explicitly stated in the original guideline.

We give some numbers to illustrate how often the complexity aspects above appear in the two selected guidelines. Conditions: 14 times in Jaundice, 24 times in Diabetes; Intentions: 18 times in Jaundice, 17 times in Diabetes; Retry delays: once in Jaundice, 5 times in Diabetes; Control structures: 28 times in Jaundice, 50 times in Diabetes; User-performed plans: 19 times in Jaundice, 20 times in Diabetes.

All the aspects of increasing complexity mentioned above appear in Asbru, but will also show up in any other formal representation language.

5 Conclusions

We have presented an analysis of the relationship between an informal medical guideline and its formal counterpart. It turned out that different sorts of relationships could be identified. Links can be either explicit or implicit and they can appear at a high or low level. Furthermore, some of the anomalies that had already been found during formalisation were nicely visualised and, surprisingly, new anomalies were also found.

Not all parts of the original guidelines and of the formal guidelines could be related to their counterparts, though. Some parts of the original guideline remained unlinked, but an even bigger part of the formal guideline remained unlinked. The occurrences of the latter indicate the causes of the size explosion of formal guidelines. All these causes of increased complexity have been categorised.

Challenges for the future lie in developing medical guidelines hand in hand with their formal counterparts, assisted by the definition of the links between them. The modelling choices are then explicitly represented, and formal and informal guidelines are no longer separate objects.

References

1. AAP. American Academy of Pediatrics, Provisional Committee for Quality Improvement and Subcommittee on Hyperbilirubinemia. Practice parameter: management of hyperbilirubinemia in the healthy term newborn. Pediatrics, 94:558–565, 1994.

2. M. Field and K. Lohr. Clinical Practice Guidelines: Directions for a New Program. Institute of Medicine, Washington D.C., National Academy Press, 1990.

3. J. Fox, N. Johns, C. Lyons, A. Rahmanzadeh, R. Thomson, and P. Wilson. PROforma: a general technology for clinical decision support systems. Computer Methods and Programs in Biomedicine, 54:59–67, 1997.

4. P. Friedland and Y. Iwasaki. The concept and implementation of skeletal plans. Journal of Automated Reasoning, 1(2):161–208, 1985.

5. M. Geldof. The formalisation of medical protocols: easier said than done. Master’s thesis, Vrije Universiteit Amsterdam, 2003.

6. M. Marcos, H. Roomans, A. ten Teije, and F. van Harmelen. Improving medical protocols through formalization: a case study. In Proc. of the 6th Int. Conf. on Integrated Design and Process Technology (IDPT-02), 2002.

7. L. Ohno-Machado, J. Gennari, S. Murphy, N. Jain, S. Tu, D. Oliver, E. Pattison-Gordon, R. Greenes, E. Shortliffe, and G. Octo Barnett. Guideline Interchange Format: a model for representing guidelines. J. of the American Medical Informatics Ass., 5(4):357–372, 1998.

8. G. Rutten, S. Verhoeven, R. Heine, W. de Grauw, P. Cromme, K. Reenders, E. van Ballegooie, and T. Wiersma. NHG-Standaard Diabetes Mellitus Type 2 (eerste herziening). Huisarts en Wetenschap, 42(2):67–84, 1999. First revision.

9. Y. Shahar, S. Miksch, and P. Johnson. The Asgaard project: a task-specific framework for the application and critiquing of time-oriented clinical guidelines. AIM, 14:29–51, 1998.

10. P. Votruba. Structured knowledge acquisition for Asbru. Master’s thesis, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Vienna, Austria, 2003.

Rhetorical Coding of Health Promotion Dialogues

Floriana Grasso

Department of Computer Science, University of Liverpool, [email protected]

Abstract. Health promotion is a complex activity that requires both explanation and persuasion skills. This paper proposes a three-layered model of dialogue coding, based on a rhetorical argumentation model, and a behavioural model of change. The model was applied to the analysis of a corpus of 40 e-mail dialogue exchanges on healthy nutrition advice. Examples of analysis are given.

1 Introduction

Although the benefits of a balanced diet are nowadays common knowledge, research has shown that the promotion of healthy nutrition has to face stereotypes and prejudices [1, 2]. It has become clear that mere informational and educational skills cannot be enough in these scenarios, for when people are not ready to accept advice, confrontation and argumentation are very likely to take place.

In our research, we seek to build advice-giving systems that embed rhetorical argumentation skills, with the hope of providing more effective advice. In other papers, we have presented an architecture for an advice-giving system based on rhetorical argumentation [3], and a variation of our conversational health promotion model, based on WWW interactions [4]. In the present paper, we are not concerned with computational issues, nor with the insights gained from the health promotion point of view, which are described in the above-mentioned works. Instead, we want to focus on the health promotion dialogues themselves, with the aim of capturing their fundamental peculiarities. A systematic analysis of dialogues, or of human-produced texts in general, is useful for building computational systems, both because it produces training material for system developers, and because it provides benchmarks for the evaluation of these systems’ output. We propose a three-layered coding scheme for the analysis of persuasive dialogues, which is grounded both on a behavioural model and on the classical philosophy of argument. We apply the scheme to the analysis of a corpus of e-mail dialogues, which was collected to inform the above-mentioned computational systems. Before entering into the details of the coding scheme, we use the next section to briefly discuss these two theoretical grounds.



2 Health Promotion: Argumentative Models of Change

The Stages of Change Model [5], a widely accepted theoretical model that explains how people modify their behaviour, suggests that individuals progress through very distinct stages of change on their way to changing habits. From a first precontemplation stage, when people see no problem with their current behaviour, a contemplation stage marks the moment when people come to understand their problem and start thinking about solving it, though with no immediate plans. In a following preparation stage, people plan to take action in the immediate future, and have already made some small changes in this direction. The action stage identifies people who are in the process of actively making behaviour changes, until a maintenance stage is reached, where the new behaviour is continued on a regular basis. In each of the stages, an advisor can use various strategies to foster movement to the next stage. These strategies are mainly information based for the more advanced stages of change, but it has become clear that the first passage, from the precontemplation to the contemplation stage, cannot be based only on the provision of new information [6]. It has been argued [7, 8] that greater efficacy can be obtained, in these cases, by appealing to techniques coming from fields like classical argumentation.
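The progression through stages can be written down as a small ordered type. This is a sketch under simplifying assumptions: it follows the strictly forward progression described above, and the names `Stage`, `next_stage` and `needs_argumentation` are invented for the illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five stages in the order given in the text."""
    PRECONTEMPLATION = 1
    CONTEMPLATION = 2
    PREPARATION = 3
    ACTION = 4
    MAINTENANCE = 5


def next_stage(stage: Stage) -> Stage:
    """Stage an advisor tries to foster next; maintenance is the last one."""
    return Stage(min(stage + 1, Stage.MAINTENANCE))


def needs_argumentation(stage: Stage) -> bool:
    """Per the text, new information alone does not suffice for the first passage."""
    return stage is Stage.PRECONTEMPLATION


for s in Stage:
    print(s.name, "->", next_stage(s).name, "| argumentation:", needs_argumentation(s))
```

An advice-giving system could use such a flag to decide when to switch from information-based strategies to argumentative ones, which is the case the rest of the paper is concerned with.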

The classical philosophy of argument is, not surprisingly, an attractive source of insights for researchers in dialogue processing. Many have been inspired by Toulmin’s model of argument structure [9], but for a long time this model has been almost the only point of contact between the two fields. Recently, however, a series of events has given the two communities the chance to meet (e.g., [10, 11]), and works from new collaborations have started to ripen. We are among those who believe that the philosophy of argument has great potential for exploitation, and in particular we ground our work on the New Rhetoric (NR) [12], a seminal theory of argumentation. Dealing with discursive techniques, the NR theory not only classifies what premises are appropriate in argumentation, but is especially concerned with how these premises are presented to the audience, for the exposition of the argument is sometimes more important than its validity. The NR is, in fact, a collection of ways, or schemata, for arranging premises and conclusions that are successfully used by people in ordinary discourse. The main objective of a schema is to exploit associations among concepts, either known or new to the audience, in order to pass the audience’s acceptance (positive or negative) from one concept to another. On the basis of this theory, we have developed a framework for rhetorical argumentation [13], which we have applied to the generation of health promotion dialogues [3].

2.1 A Corpus of Argumentative Dialogues

It is one thing to take inspiration from theoretical works, well established as they might be; it is quite another to verify whether the theoretical insights are confirmed in experimental practice. To pursue the latter objective, we conducted an experiment from which we obtained a corpus of “argumentative” dialogues on the subject of healthy nutrition advice. The corpus consists of a collection of 46 e-mail dialogues of varying length, with an average of 11 messages per dialogue and the longest dialogue consisting of 45 exchanges. These were all two-party exchanges, where the investigator played the “health promoter” role, in a dialogue with a second party. These interlocutors were recruited from subscribers to a mailing list with interests in nutrition. Excerpts of some of these dialogues are shown in Table 1, where the health promoter’s turns are labelled HP, and those of the different interlocutors are labelled A, B, and C. The HP messages were generated semi-automatically, as the experiment was conducted as an evaluation of a computational argumentation system, as described in [3]. The generation was based on a preliminary study in which the investigators played the opposite role of “advisees” in dialogues with real nutritionists.

Table 1. Extracts from the corpus of e-mail dialogues

Dialogue 1
HP: Do you like cooking?
A: Not especially. [...] Cooking feels to me like a lot of effort for something (ie. eating) that’s over quite quickly. Also, I often feel tired at the end of a day’s work and don’t want to spend too much time in the kitchen.
HP: You do not cook just because you have to eat! Cooking can be a very relaxing and interesting activity, better than watching TV!
A: I know you’re right but that still doesn’t make it easy to do!

Dialogue 2
HP: Have you ever considered having some fruit for breakfast or as a snack?
B: I should do that, yes. I’ll have to go and buy some....
HP: Don’t you have to go and buy chocolate as well?
B: I didn’t mean it would take an extra effort to buy fruit on my trips to the supermarket. However [...] it’s much easier to get hold of unhealthy snack food near work than it is to get fruit.

Dialogue 3
C: I do enjoy fruit, but it is not as appealing as say a biccie, and vegetables in summer aren’t really the go, I would eat vegies in the winter time at least 4 times a week.
HP: Maybe pears and apples are not that appealing, but what about satsumas, or a cup of strawberries, or a bunch of grapes? Nice and refreshing, especially in summer!
C: Yummy, I think if someone was to give me a plate of cut up fruit like below then I would definitely eat it, it is always more appealing when it is all done for you.

3 A Three-Layered Coding Scheme

It has been recognised by many (see, e.g., [14]) that the Speech Act theory [15] hypothesis that each utterance can be associated with one single goal is not satisfactory, as the same speech act can serve many purposes. A dialogue coding scheme should therefore have a more complex view of how the two partners contribute to the dialogue, in terms of their hierarchy of goals. We propose a three-layered structure to represent dialogues, where each utterance, or dialogue move, can be seen from three perspectives:

1. a meta-goal level, that is the ultimate goal of the dialogue for a partner, the reason why the dialogue has been initiated;

2. the rhetorical level, that is what kind of rhetorical goals/strategies a portion of the dialogue shows;

3. the move level, that is the dialogue moves that have actually been used to convey the above goals.

We elaborate on the three levels in what follows.

3.1 Meta-dialogue Moves

Meta-dialogue moves identify the higher-order goals of the dialogue. Typically there will be one high-order goal per dialogue, although this is not prescriptive. In our case, these goals are characterised by specific strategies associated with the Stages of Change model. We want to explicitly mark the following meta-moves:

– From the advisor agent point of view:

1. exploratory moves: the portion of dialogue turns used to establish in which stage of change the advisee is;

2. action moves: the portion of the dialogue in which the advisor applies one of the strategies to encourage the advisee to move a stage further;

– From the advisee agent point of view:

3. declarative moves: the portion of the dialogue in which the advisee gives information on the stage he is in;

4. responsive moves: the portion of the dialogue in which the advisee accepts, or shows resistance to, the state change.

Information useful to characterise the above moves is, in our case, grounded on various literature on behavioural research (for example, [16]).

The meta-moves are designed to apply to general argumentative dialogues, not necessarily on health promotion, where the “stage of change” can be substituted by any “position” the “opponent” might be in, and the strategies can vary from domain to domain. In a general situation, also, the dialogue is not necessarily asymmetric: the two participants may both have a meta-goal to achieve, and strategies to pursue it. Therefore, both participants can in turn be in either of the two roles (“attack” and “defend”) and play each of the four meta-moves, according to the situation.

3.2 Rhetorical Moves

By rhetorical moves we mean the moves which are specifically used for the argumentation, according to specific classical argumentation techniques, and in order to satisfy the higher-level goal, or goals. For defining these moves, we base ourselves, as we anticipated, on the New Rhetoric model of argumentation. The theory lists 30+ techniques, or schemata, of argumentation, with examples of their application. In [13] we give a formalisation of each of these techniques in the form of a schema. The schema not only identifies the way in which the technique is applied, but also defines a series of applicability constraints that will make the argument stronger, or more effective. We define a schema as a 6-tuple RS = 〈N, C, Oc, Ac, Rc, Sc〉 where:

– N is the name of the schema,

– C is the claim the schema supports,

– Oc are the ontological constraints the schema is based on, that is which relations should exist in the domain ontology among the concepts used in the schema for the schema to be applicable;

– Ac are the acceptability constraints, that is which beliefs the party the argument is addressed to has to possess for the schema to be effective;

– Rc are the relevance constraints, that is which beliefs the party the argument is addressed to has to have “in focus”, that is which beliefs are relevant to the argument;

– Sc are the sufficiency constraints, that is which premises should exist in order for the argument not to be weak, or attackable. This is perhaps the most elusive of the constraints, and it will vary from schema to schema.

An appropriate application of a schema in a dialogue would assume that the arguer has either supposed, or has verified, all the constraints. For instance, the pragmatic argument is one of the techniques proposed in the NR which aims at evaluating an act or an event in terms of its positive or negative consequences, either observed, foreseen or even purely hypothetical. An example of pragmatic argument in our scenario could be: since eating apples helps slimming, and being slim is important to you, then you should eat apples. The instantiated definition of the pragmatic schema will be:

Claim: “Apples should be eaten”;

Ontological constraints: “Apples are edible”; “Apples help slimming”;

Acceptability constraints: “the addressee believes being slim is a good thing”;

Relevance constraints: “the addressee is aware, or should be made aware, that apples help slimming”. If this already holds, the argument can be put forward in a short form: you should eat apples, for you would like to slim.

Sufficiency constraints: an effort should be made to show that the action is necessary and sufficient for the consequence to happen, for instance by supporting statistics. Also, it might be shown that eating apples has no negative consequences, especially from a perspective similar to the one being taken (e.g. if eating apples helps slimming but, say, causes skin redness, this would not be acceptable to a person who is concerned about their looks).
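The 6-tuple and its instantiation for the pragmatic argument can be sketched as a record with a simple constraint check. This is not the formalisation given in [13]: treating each constraint as a plain string and testing it by set membership is a strong simplification, and the class `RhetoricalSchema`, the method `unmet` and the wording of the sufficiency premise are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class RhetoricalSchema:
    """RS = <N, C, Oc, Ac, Rc, Sc>, with constraints simplified to strings."""
    name: str                      # N
    claim: str                     # C
    ontological: FrozenSet[str]    # Oc: relations required in the domain ontology
    acceptability: FrozenSet[str]  # Ac: beliefs the addressee must hold
    relevance: FrozenSet[str]      # Rc: beliefs the addressee must have in focus
    sufficiency: FrozenSet[str]    # Sc: premises that keep the argument from being weak

    def unmet(self, ontology: Iterable[str], beliefs: Iterable[str],
              focus: Iterable[str], premises: Iterable[str]) -> Dict[str, List[str]]:
        """Return the constraints that do not hold, grouped by kind."""
        checks = {
            "ontological": (self.ontological, ontology),
            "acceptability": (self.acceptability, beliefs),
            "relevance": (self.relevance, focus),
            "sufficiency": (self.sufficiency, premises),
        }
        return {kind: sorted(required - set(available))
                for kind, (required, available) in checks.items()
                if required - set(available)}


pragmatic = RhetoricalSchema(
    name="pragmatic argument",
    claim="Apples should be eaten",
    ontological=frozenset({"apples are edible", "apples help slimming"}),
    acceptability=frozenset({"being slim is a good thing"}),
    relevance=frozenset({"apples help slimming"}),
    sufficiency=frozenset({"statistics show apples help slimming"}),  # hypothetical wording
)

# The addressee values being slim but has not yet been told about apples.
gaps = pragmatic.unmet(
    ontology={"apples are edible", "apples help slimming"},
    beliefs={"being slim is a good thing"},
    focus=set(),
    premises=set(),
)
print(gaps)
```

Here the unmet relevance constraint suggests the long form of the argument is needed (the addressee must first be made aware that apples help slimming), and the unmet sufficiency constraint suggests adding supporting evidence.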

3.3 Dialogue Moves

At the lowest level, the dialogue structure comprises single dialogue moves, typically extracted from a pre-established set of basic moves. There is debate as to how many and which types of moves should be needed, but we favour a parsimonious approach, in the spirit that the effort of distinguishing between moves of different types is only justified by a corresponding relief in the effort of understanding the intention of the move. We distinguish among four main move types, some of them comprising a number of subtypes:

1. Assertions: including all general assertive utterances an agent can perform, that is all the moves in which the agent makes a point. An assertion can be one of:
   (a) Claims, that is statements which, although they might address a previous point in the conversation, are put forward with the aim of making a new point.
   (b) Backings, that is statements that support one’s own claims.
   (c) Acknowledgements, that is statements which agree with or accept another agent’s point, or “re-state” one’s own point.
   (d) Replies, that is statements which reply to questions previously posed.
   (e) Disputations, that is statements which explicitly disagree with previously made points.
   (f) Withdrawals, that is statements which explicitly deny one’s own previously made points.

2. Queries: including one agent’s queries to which the interlocutor is supposed to reply with an assertion. These comprise:
   (a) Open Questions: these are requests for information, on items of which the querying agent does not suppose previous knowledge. Note that open questions do not refer to any previous move in the dialogue.
   (b) Requests for Argument: these are requests for an argument in support of the claim expressed in the content of the move. Again, this request is made with respect to one generic claim, and does not refer to any previous move in the dialogue, as opposed to what happens with:
   (c) Challenges of Claim: these are requests to provide an argument supporting a claim previously made by another agent.

3. YN queries: we have included in a class of their own closed questions, that is questions whose only reply can be either Yes (True) or No (False).

4. YN: similarly, a separate class identifies answers to a YN-question.
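The move taxonomy above lends itself to a simple machine-readable encoding. The following Python sketch is one possible representation, not the format used by the author: the names `MOVE_TYPES`, `Move` and `refers_to` are hypothetical. It records the four main types with their subtypes and enforces the only referencing rules stated in the list, namely that Open Questions and Requests for Argument do not refer to a previous move, whereas a Challenge of Claim must.

```python
from dataclasses import dataclass
from typing import Optional

# Main move types and their subtypes, as listed in the text.
MOVE_TYPES = {
    "Assertion": ["Claim", "Backing", "Acknowledgement",
                  "Reply", "Disputation", "Withdrawal"],
    "Query": ["Open Question", "Request for Argument", "Challenge of Claim"],
    "YN Query": [],
    "YN": [],
}
NO_BACK_REFERENCE = {"Open Question", "Request for Argument"}


@dataclass
class Move:
    number: int                      # position of the move in the dialogue
    agent: int                       # 1 or 2
    main_type: str
    subtype: Optional[str] = None
    refers_to: Optional[int] = None  # number of the referenced preceding move

    def __post_init__(self):
        if self.main_type not in MOVE_TYPES:
            raise ValueError(f"unknown move type: {self.main_type}")
        subtypes = MOVE_TYPES[self.main_type]
        if subtypes and self.subtype not in subtypes:
            raise ValueError(f"{self.main_type} needs one of {subtypes}")
        if not subtypes and self.subtype is not None:
            raise ValueError(f"{self.main_type} has no subtypes")
        if self.subtype in NO_BACK_REFERENCE and self.refers_to is not None:
            raise ValueError(f"{self.subtype} does not refer to a previous move")
        if self.subtype == "Challenge of Claim" and self.refers_to is None:
            raise ValueError("a Challenge of Claim must reference a previous claim")
        if self.refers_to is not None and self.refers_to >= self.number:
            raise ValueError("a move can only reference a preceding move")


opening = Move(number=1, agent=1, main_type="YN Query")
backing = Move(number=3, agent=2, main_type="Assertion",
               subtype="Backing", refers_to=2)
```

Keeping the subtypes in a plain dictionary reflects the parsimonious spirit of the scheme: adding a move type is a one-line change, and no behaviour is attached to a type beyond its referencing rule.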

4 Analysing Argumentative Dialogues

The annotation of a dialogue according to a code scheme is aimed at reconstructing the structure of the dialogue, and of the participant agents’ goals. Our three-layered analysis can be done on a form as shown in Table 2. The middle part of the form (Dialogue) lists the numbered dialogue moves, as broken down by the annotator into “basic units”. A basic unit can be an entire speaker’s turn, or a sentence in a turn, or smaller parts of a sentence, according to the analyst’s judgement. The moves of the two agents are annotated separately, in the left and right sections of the table.

Assuming a bottom-up approach to the analysis, starting from the move level, up to the meta-level, the analyst will annotate, in the two columns headed Move, the specific dialogue move used, according to the list in Sect. 3.3. For each move, the move type is indicated, as well as, if appropriate, the number of the referenced preceding move. In a second phase, the analyst will look for occurrences of rhetorical schemata. Instances of the speaker assuming, or specifically testing, or actively meeting the constraints of a schema, as explained in Sect. 3.2, are identified, as well as the putting forward of the schema’s claim. The two columns headed Schema will be filled in at this stage, with the specific mention of the constraint satisfaction process. Note that a schema can span several dialogue turns. The final phase, in a bottom-up approach, looks for manifestations of the two agents’ higher order moves, as explained in Sect. 3.1. In our health promotion dialogues, the meta-moves will be labelled with the stage of change that is currently being acted upon.

Table 2 shows an example of annotation, applied to Dialogue 1 of Table 1. For the sake of brevity, the example does not show how the application of the argumentative schema has been checked against the constraints. In Move 1, a Yes-No question, Agent 1 makes an Exploratory meta-move to establish the stage of change of the other agent with respect to the habit of preparing one’s food, rather than buying pre-packed meals. Agent 2 makes an extensive Declarative meta-move, by replying to the question (Move 2) and providing extra information in support. Move 3 is a Backing for the dislike of cooking, which is done with a schema of argument by Ends and Means: the end (quick eating) does not justify the effort to put in the means (cooking). Another argument is supplied, in the Claim of Move 4, which appeals to the Motives behind the behaviour: time to relax is precious and should not be wasted cooking. Agent 1 triggers the Action meta-goal and starts providing an argument with the Claim in Move 5, and its Backing in Move 6. The argumentative technique used is Dissociation, in order to change the opponent’s perspective: this breaks the link between cooking and eating, and replaces it with a link between cooking and relaxing, something that the opponent has just claimed to value. Agent 2 adopts a Responsive meta-goal, which shows that the stage change is not yet accepted. An argument “From the Easy” is used, which is meant to value that which is easy and possible, versus that which is difficult. The Acknowledgement in Move 7 serves again to change the perspective under consideration: the “interestingness” of the activity is recognised, but it is valued less than its unpleasantness, or difficulty to do.

5 Related Research

The use of rhetorical notions to annotate text goes back to the seminal work of Mann and Thompson, in their “Rhetorical Structure Theory” (RST) [17]. Although not specifically designed for dialogues, the theory has had various elaborations and variations to make it applicable to a great variety of texts, including dialogues. However, the theory has strong shortcomings in the representation of argumentation texts [18]. An example of application of RST to dialogue coding is the one in [19], where argumentative spoken dialogues are annotated. However, no higher level goals are considered, and the “argument schemas” are in fact the basic RST constructs, which, as mentioned before, do have various problems with argumentative discourse. In [20] an annotation theory is proposed for argumentative texts. The authors concentrate on research papers, whose argumentative nature is captured in sections like the description of background research, or the comparison with other works. They address, therefore, a monological situation, rather than a dialogical one. An important dialogue coding scheme is the one described in [21], which identifies, like ours, three levels of dialogue structure: (i) conversational moves are the basic units of the dialogue structure; (ii) conversational games are sequences of moves fulfilling a specific goal; games can be embedded (for instance for clarification subdialogues), and are typically identified with the name of the first move of the game; finally, (iii) transactions are sequences of games which satisfy a given, higher level purpose. The levels, however, do not account for specific rhetorical strategies, nor for how one level serves the purposes of its predecessor. Another example of multi-layered analysis is the one in [22]. In this work, three classes of moves are defined, which are meant to identify the phenomena of (i) forward direction, that is sentences which are said in order for something to happen, (ii) backward direction, that is sentences that are directed to the past of the dialogue, e.g. acknowledgements or agreements, and (iii) form and content of the utterances. The work does not, however, concentrate on dialogue meta-goals, nor, once again, does it capture the rhetorical organisation of the dialogue.

Table 2. An annotated dialogue fragment


Perhaps the best known example of the use of argumentation techniques in medical environments is due to Fox & Das [23]. The work presented in this paper, conversely, is not concerned with argumentative reasoning, but focuses on the analysis of naturally occurring argumentative dialogues in health promotion environments.

6 Conclusions and Further Works

We have presented a coding scheme to annotate argumentative dialogues in the domain of health promotion. Our approach, to the best of our knowledge, is the only one which combines the need to capture precisely the argumentative nature of the dialogues, by appealing to a classical theory from the philosophy of argument, with the way in which the argument serves the higher level goals of the participants, by appealing to a well established behavioural model.

An important test for any annotation scheme is its reliability, that is the fact that it can be applied and used by people other than the developers, and that the same analysts will give similar analyses over time [24]. This is the single most important step that is still in progress: so far the analyses have been done by the developers, although sometimes in consultation with colleagues. A trial study is at the moment being conducted, with a set of annotators, trained on the coding scheme, but not necessarily familiar with either the domain or the theories behind the scheme. The complete set of statistics from the study is not available at the time of writing, but informal considerations seem encouraging in supporting the assumption that the three layers do indeed capture the behaviour of argumentative dialogues.

References

1. Fries, E., Croyle, R.: Stereotypes Associated with a Low-Fat Diet and their Relevance to Nutrition Education. Journal of the American Dietetic Association 93 (1993) 551–555

2. Sadalla, E., Burroughs, J.: Profiles in Eating: Sexy Vegetarians and Other Diet-Based Social Stereotypes. Psychology Today 15 (1981) 51–57

3. Grasso, F., Cawsey, A., Jones, R.: Dialectical Argumentation to Solve Conflicts in Advice Giving: a case study in the promotion of healthy nutrition. International Journal of Human-Computer Studies 53 (2000) 1077–1115

4. Cawsey, A., Grasso, F., Jones, R.: A Conversational Model for Health Promotion on the World Wide Web. [25] 379–388

5. Prochaska, J., Clemente, C.D.: Stages of Change in the Modification of Problem Behavior. In Hersen, M., Eisler, R., Miller, P., eds.: Progress in Behavior Modification. Volume 28. Sycamore Publishing Company, Sycamore, IL (1992)

6. Prochaska, J.: Strong and Weak Principles for Progressing from Precontemplation to Action on the Basis of Twelve Problem Behaviors. Health Psychology 13 (1994)

7. Cawsey, A., Grasso, F.: Goals and Attitude Change in Generation: a Case Study in Health Education. In Jokinen, K., Maybury, M., Zock, M., Zukerman, I., eds.: Proceedings of the ECAI-96 Workshop on: Gaps and Bridges: New directions in Planning and NLG. (1996) 19–23


8. Reiter, E., Robertson, R., Osman, L.: Types of Knowledge Required to Personalize Smoking Cessation Letters. [25] 389–399

9. Toulmin, S.: The Uses of Argument. Cambridge University Press (1958)

10. Reed, C., Norman, T., eds.: Symposium on Argument and Computation: position papers. http://www.csd.abdn.ac.uk/~tnorman/sac/ (2000)

11. Carenini, G., Grasso, F., Reed, C., eds.: Proceedings of the ECAI 2002 workshop on Computational Models of Natural Argument. (2002)

12. Perelman, C., Olbrechts-Tyteca, L.: The New Rhetoric: a treatise on argumentation. University of Notre Dame Press, Notre Dame, Indiana (1969)

13. Grasso, F.: Towards a framework for rhetorical argumentation. In Bos, J., Foster, M., Matheson, C., eds.: EDILOG’02: Proceedings of the 6th workshop on the semantics and pragmatics of dialogue, Edinburgh (2002) 53–60

14. Cohen, P., Levesque, H.: Rational Interaction as the Basis for Communication. In Cohen, P., Morgan, J., Pollack, M., eds.: Intentions in Communication. MIT Press, Cambridge, MA (1990) 221–255

15. Searle, J.: Speech Acts: An essay in the philosophy of language. Cambridge University Press, Cambridge (1969)

16. Barrie, K.: Motivational Counselling in Groups. In Davidson, R., Stephem, R., MacEwan, I., eds.: Counselling Problem Drinkers. Tavistock/Routledge, London (1991)

17. Mann, W., Thompson, S.: Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text 8 (1988) 243–281

18. Reed, C., Long, D.: Generating the structure of argument. In: Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL’98). (1998) 1091–1097

19. Stent, A., Allen, J.: Annotating Argumentation Acts in Spoken Dialog. Technical Report 740, The University of Rochester, Computer Science Department (2000) (TRAINS Technical Note 00-1).

20. Teufel, S., Carletta, J., Moens, M.: An Annotation Scheme for Discourse-Level Argumentation in Research Articles. In: Proceedings of EACL. (1999)

21. Carletta, J., Isard, A., Isard, S., Kowtko, J., Doherty Sneddon, G., Anderson, A.: The Reliability of a Dialogue Structure Coding Scheme. Computational Linguistics 23 (1997) 13–31

22. Core, M., Allen, J.: Coding dialogs with the DAMSL annotation scheme. In Traum, D., ed.: AAAI Fall Symposium on Communicative Action in Humans and Machines. (1997)

23. Fox, J., Das, S.: Safe and Sound: Artificial Intelligence in Hazardous Applications. AAAI Press / The MIT Press (2000)

24. Carletta, J.: Assessing Agreement on Classification Tasks: the kappa Statistic. Computational Linguistics 22 (1996) 249–254

25. Horn, W., Shahar, Y., Lindberg, G., Andreassen, S., Wyatt, J., eds.: Artificial Intelligence in Medicine. Proceedings of the Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making, AIMDM’99. Volume 1620 of LNAI., Springer-Verlag (1999)

Learning Derived Words from Medical Corpora

Pierre Zweigenbaum and Natalia Grabar

Mission de recherche en Sciences et Technologies de l’Information Médicale
STIM/DPA/DSI, Assistance Publique – Hôpitaux de Paris & ERM 202 INSERM

{pz,ngr}@biomath.jussieu.fr
http://www.biomath.jussieu.fr/~{pz,ngr}/

Abstract. Morphological knowledge (inflection, derivation, compounds) is useful for medical language processing. Some is available for medical English in the UMLS Specialist Lexicon, but not for the French language. Large corpora of medical texts can nowadays be obtained from the Web. We propose here a method, based on the cooccurrence of formally similar words, which takes advantage of such a corpus to learn morphological knowledge for French medical words. The relations obtained before filtering have an average precision of 75.6% after 5,000 word pairs. Detailed examination of the results obtained on a sample of 376 French SNOMED anatomy nouns shows that 91–94% of the proposed derived adjectives are correct, that 36% of the nouns receive a correct adjective, and that this method can add 41% more derived adjectives than SNOMED already specifies. We discuss these results and propose directions for improvement.

1 Introduction

Lexical knowledge, in particular morphological knowledge, is a necessary component of medical language processing systems. It has been used for coding assistance [1] and for automated indexing and information retrieval [2–4]. For instance, searching for asthmatic child should (also) return documents mentioning children with asthma. Morphologically related words can be linked through inflection (e.g., cell / cells), derivation (e.g., cell / cellular) and compounding (e.g., hepatocellular). Inflectional and derivational knowledge for medical English is included in the Specialist Lexicon distributed with the Unified Medical Language System (UMLS) [5]. Following a German medical lexicon [6], a project¹ has just started [7] to pool and unify lexical resources for medical French [8, 9] and complement them with new resources. Automated processes can facilitate this work by learning candidate morphologically related word pairs from selected sources. In previous work [10, 11], we showed how such word pairs can be extracted from structured terminologies with little or no a priori linguistic knowledge. An advantage of terminologies is that they provide a high density of specialized vocabulary, along with numerous, explicitly marked, semantic relations: synonymous terms, hierarchically-related (is-a) concepts, cross-references between concepts. These relations were instrumental in our previous method for identifying links between derived words, inflected word forms, etc. However, the necessarily limited size of available terminologies bounds the vocabulary and morphological variation that they display; and the normative character of their terms may hide actual, differing word usage and therefore other morphological variants.

¹ Project UMLF, ACI #02C0163, French Ministry for Research, National Network for Health Technologies, 2002–2004.

To complement terminologies, we thus decided to explore learning the same kind of morphological resources from another source: corpora. It is nowadays increasingly easy, mainly thanks to the Web, to collect ever larger text corpora. If one is able to control criteria for text selection, one can both keep within the medical domain and at the same time represent a large diversity of document types, therefore favouring vocabulary extent and variation. The issue is then to design a method that can exhibit morphological relations between words in the collected corpus.

Much work has been conducted lately to learn morphological relations [10–16]. Xu and Croft’s [13] method is particularly interesting, for it works on a corpus with no a priori terminology or rules. Corpus words are first processed by an algorithmic, aggressive stemmer [17], which reduces them to their stem (e.g., banking to bank). Two words which are stemmed to the same reduced form and which cooccur significantly more than chance (as tested through a measure akin to the mutual information statistic) are considered as belonging to the same morphological equivalence class.

We have no equivalent of Porter’s [17] stemmer for French, and we are specifically interested in derived medical words for project UMLF. We therefore propose here an adaptation of Xu and Croft’s algorithm to our problem, with a focus on identifying derived words (Sect. 2). We implemented this algorithm [18] and applied it to a medical corpus, taking as a gold standard the derived adjectives provided in the French SNOMED Microglossary for Pathology [19] (Sect. 3). We discuss the results and propose directions for future work (Sect. 4).

2 Material and Methods

2.1 Building a Medical Corpus from the Web

We collected a medical corpus from the Web, building on the CISMeF directory of French-language medical Web sites (www.chu-rouen.fr/cismef [20]). CISMeF’s Web sites (11,000 in 2002) satisfy quality criteria and are indexed with MeSH terms, which makes it a first-rate tool to build a medical corpus. We collected all the Web pages cataloged under the concept Signs and Symptoms (C23) or one of its descendants. To cope with addresses that point to ‘frames’ or that are simple tables of contents, we also collected pages located one link further below each initial page. We converted all HTML pages into raw text, then filtered out the lines written in languages other than French by adapting the method and reusing the data published in [21]. We then tagged each word in the corpus with its part-of-speech (noun, adjective, etc.) with TreeTagger [22] coupled with the French lemmatizer FLEMM [23]: the lemmatizer determines each word’s uninflected form. The resulting corpus contains 4,627 documents and 5,204,901 word tokens (180,000 unique word forms or 142,000 unique uninflected forms). We kept its ‘content words’ (noun, adjective, verb, adverb), i.e., 2,055,419 tokens. Many of them are noisy; we deleted leftover unbreakable spaces, then suppressed the words still containing non-alphanumeric characters other than hyphen, leaving 2,041,627 tokens.
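The last cleaning steps of this pipeline (keeping content words, deleting leftover unbreakable spaces, dropping words with characters other than letters, digits and hyphens) can be sketched in Python as below. This is only an illustration under stated assumptions: the original processing was done with Perl and shell scripts, the function name `clean_tokens` is hypothetical, and the tag labels in `CONTENT_TAGS` are simplified placeholders rather than the actual TreeTagger tagset.

```python
# Simplified, hypothetical part-of-speech labels for 'content words'.
CONTENT_TAGS = {"NOUN", "ADJ", "VERB", "ADV"}


def clean_tokens(tagged_tokens):
    """Keep content words made only of alphanumeric characters and hyphens.

    tagged_tokens is an iterable of (lemma, tag) pairs.
    """
    kept = []
    for word, tag in tagged_tokens:
        if tag not in CONTENT_TAGS:
            continue
        # Delete leftover unbreakable spaces (U+00A0).
        word = word.replace("\u00a0", "")
        # Suppress words that still contain other non-alphanumeric characters.
        if word and all(ch.isalnum() or ch == "-" for ch in word):
            kept.append((word, tag))
    return kept


sample = [
    ("hypoglycémie", "NOUN"),
    ("le", "DET"),
    ("chien-guide", "NOUN"),
    ("mal\u00a0", "ADJ"),
    ("a/b", "NOUN"),
]
cleaned = clean_tokens(sample)
```

Python's `str.isalnum` accepts accented letters, so French word forms such as hypoglycémie pass the test unchanged.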


2.2 Learning Morphological Relations from a Corpus

Grabar and Zweigenbaum’s [11] base principle consists in (i) finding words whose forms are similar and which (ii) bear semantic links. Xu and Croft’s method [13] proceeds the same way and refines the first criterion with an existing stemmer [17]. In a structured terminology [11], we instantiated this principle by looking for (i) words which share the same initial substring and which (ii) are found in terms that are linked by semantic relations. In a corpus, semantic links will rely on the notion of thematic continuity: the topic of a discourse does not change with every sentence. This continuity generally shows through lexical links: the words that are used to talk about a given theme are often semantically related (e.g., hospital, doctor, surgery), which sometimes takes the form of a morphological relation (surgery, surgical). As a consequence, morphologically related words are often found within a short distance.

Xu and Croft’s method approximates this notion of thematic continuity through a sliding, N-word window. They compute the morphological similarity of two words with Porter’s stemmer. We reuse their method [18], replacing this stemmer with an even more aggressive stemmer: reducing each word to its first c characters (c = 4 in our experiments). To sum up, we collect words which share the same initial substring of length greater than or equal to c (e.g., muscle, musculaire) and which are ‘often’ found within the same N-word window. The latter criterion is implemented through a statistical association measure, the log-likelihood ratio [24]: the ratio $\lambda = L(H_1)/L(H_2)$ of the probability of observing the number of cooccurrences of word $w_2$ with word $w_1$ under hypothesis $H_1$, where the words are independent, over the probability of observing their number of cooccurrences under hypothesis $H_2$, where the words are dependent ($-2 \log \lambda$ is computed). It is computed as follows. Let $c_1$ be the number of occurrences of word $w_1$, $c_2$ the number of windows where word $w_2$ occurs, $c_{12}$ the number of windows where words $w_1$ and $w_2$ cooccur, and $N$ the size of the corpus; elementary probabilities are estimated as:

\[ p = \frac{c_2}{N}; \quad p_1 = \frac{c_{12}}{c_1}; \quad p_2 = \frac{c_2 - c_{12}}{N - c_1}. \]

Assuming a binomial distribution (probability of a series of $k$ successes among $n$ draws, each with probability $p$),

\[ b(k, n, p) = \binom{n}{k}\, p^k (1-p)^{n-k}, \]

the probability of observation according to $H_1$ (independence) is

\[ L(H_1) = b(c_{12}, c_1, p)\; b(c_2 - c_{12}, N - c_1, p) \]

and the probability of observation according to $H_2$ (dependence) is

\[ L(H_2) = b(c_{12}, c_1, p_1)\; b(c_2 - c_{12}, N - c_1, p_2). \]
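As a worked illustration, the score $-2 \log \lambda$ can be computed from the four counts as in the Python sketch below. It assumes the standard formulation of this likelihood ratio test, in which $p = c_2/N$ and the second binomial counts $c_2 - c_{12}$ successes among $N - c_1$ draws. The function names are hypothetical, and the authors' own implementation (in Perl) is not reproduced here. Logarithms are used throughout to avoid numerical underflow, and the binomial coefficients are dropped because they are identical in $L(H_1)$ and $L(H_2)$ and cancel in the ratio.

```python
import math


def _xlogy(x: float, y: float) -> float:
    """Return x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def _log_b(k: int, n: int, p: float) -> float:
    """Log of p^k (1 - p)^(n - k), i.e. b(k, n, p) without its coefficient."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def log_likelihood_ratio(c1: int, c2: int, c12: int, n: int) -> float:
    """Return -2 log(lambda) for pivot word w1 and cooccurrent w2."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_h1 = _log_b(c12, c1, p) + _log_b(c2 - c12, n - c1, p)    # independence
    log_h2 = _log_b(c12, c1, p1) + _log_b(c2 - c12, n - c1, p2)  # dependence
    return -2.0 * (log_h1 - log_h2)


# Counts chosen so that w2 is exactly as frequent near w1 as elsewhere.
independent = log_likelihood_ratio(c1=100, c2=100, c12=1, n=10000)
# Same marginal counts, but w2 appears in half of the windows around w1.
associated = log_likelihood_ratio(c1=100, c2=100, c12=50, n=10000)
```

With these illustrative counts, `independent` is zero (the three estimated probabilities coincide) while `associated` is large, which is the behaviour expected of an association score.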

Xu and Croft apply an association threshold below which cooccurring pairs are discarded. We consider instead that this association criterion must be taken as one factor among others for ordering potentially morphologically-related pairs. Other factors are described below. Let us note that this association measure is asymmetric since it depends differently on each word’s own frequency. For instance, chances are higher to observe noun canal (481 occurrences) in the neighbourhood of adjective canalaire (65 occ.) than the reverse. We keep the highest association score of the two directions.
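The collection of candidate pairs described in this section can be sketched as follows. This is a simplified Python illustration, not the authors' code: the function name `prefix_cooccurrences` and its parameter names are hypothetical. For every pivot occurrence, it looks at the words within a half-window on each side and records those sharing the pivot's first c characters. The resulting counts are keyed by ordered pair (pivot, cooccurrent), which makes the asymmetry mentioned above visible.

```python
from collections import Counter


def prefix_cooccurrences(tokens, half_window=100, c=4):
    """Count windows centred on w1 in which a word w2 with the same
    first c characters occurs.

    Returns (pivot_counts, pair_counts): occurrences of each word, and
    the number of windows per ordered pair (w1, w2).
    """
    pivot_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w1 in enumerate(tokens):
        if len(w1) < c:
            continue  # too short to share a c-character initial substring
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        # Distinct cooccurrents of this pivot occurrence.
        seen = {
            tokens[j]
            for j in range(lo, hi)
            if j != i and tokens[j] != w1 and tokens[j][:c] == w1[:c]
        }
        for w2 in seen:
            pair_counts[(w1, w2)] += 1
    return pivot_counts, pair_counts


tokens = ["muscle", "x", "musculaire", "y", "muscle"]
pivots, pairs = prefix_cooccurrences(tokens, half_window=2, c=4)
```

In this toy sequence the pair (muscle, musculaire) is counted twice, once per occurrence of the pivot muscle, whereas (musculaire, muscle) is counted once. Scoring both directions and keeping the higher score, as stated above, would then be a simple `max` over the two ordered pairs.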

2.3 Additional Criteria for Filtering Derived Words

Project UMLF specifically needs derivational knowledge. The present work focuses on noun-adjective pairs. We select them through a series of tests that embody specific properties of derived words:


1. No ‘regressive’ derivation: derivation adds a suffix. We test this condition with the difference in length of the two suffixes (or words) involved; to keep some flexibility, we accept a derived word if it contains up to one character less than its base word (e.g., articulation / articulaire, sacrum / sacré).

2. Discard compounds: neoclassical compounds combine morphemes, generally of Greek or Latin origin, which are longer on average than derivational suffixes. We consider as suspicious (and filter out) a ‘derived word’ whose length is more than 5 characters longer than its base word (e.g., bronche / bronchopneumonique).

3. Frequency of the ‘rule’: the same morphological operator generally applies to more than one word. We test the number of different ‘bases’ (maximal initial common substrings) on which the operator is observed to apply. As an approximation, we do not take into account allomorphic affixes (see, e.g., [12]). For instance, ‘rule’ -en / -inal is observed twice, and -e / -aire 72 times in our data.

Criteria on word form (1 and 2) filter out some candidate cooccurring word pairs; the rule frequency criterion (3) applies afterwards: if several adjectives are proposed for the same noun, the one with the most frequently applied rule is kept. Association strength is then taken into account to break ties between two pairs produced by rules of identical frequency: tendon / tendineux (tendinous, freq = 1, association = 86) rather than tendon / tendinite (tendinitis, freq = 1, association = 11). Finally, the pairs produced by a very low frequency rule are kept only if their association strength is high enough. We set experimentally an association threshold of 50 for pairs produced by rules of frequency one. This discards nombril / nombreux (umbilicus / numerous) (association = 22.01), but keeps cortex / cortical (association = 173.07).
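The selection procedure above can be summarised by the following Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the rule frequency is computed here over the candidate list passed to the function, whereas in the paper it is computed over the whole data set.

```python
from collections import defaultdict


def split_rule(noun, adjective):
    """Split a pair into (base, suf1, suf2), where base is the maximal
    common initial substring of the two words."""
    i = 0
    while i < min(len(noun), len(adjective)) and noun[i] == adjective[i]:
        i += 1
    return noun[:i], noun[i:], adjective[i:]


def passes_form_criteria(noun, adjective):
    """Criteria 1 and 2: at most one character shorter than the noun,
    and at most five characters longer."""
    diff = len(adjective) - len(noun)
    return -1 <= diff <= 5


def select_adjectives(candidates, low_freq_threshold=50.0):
    """candidates: list of (noun, adjective, association) triples.
    Returns a dict mapping each retained noun to its selected adjective."""
    kept = [c for c in candidates if passes_form_criteria(c[0], c[1])]

    # Criterion 3: number of distinct bases on which each rule applies.
    bases_by_rule = defaultdict(set)
    for noun, adjective, _ in kept:
        base, suf1, suf2 = split_rule(noun, adjective)
        bases_by_rule[(suf1, suf2)].add(base)

    # Per noun, prefer the most frequent rule, then the highest association.
    best = {}
    for noun, adjective, association in kept:
        _, suf1, suf2 = split_rule(noun, adjective)
        key = (len(bases_by_rule[(suf1, suf2)]), association)
        if noun not in best or key > best[noun][0]:
            best[noun] = (key, adjective)

    # Pairs from rules of frequency one need a high enough association.
    return {
        noun: adjective
        for noun, ((freq, association), adjective) in best.items()
        if freq > 1 or association >= low_freq_threshold
    }


candidates = [
    ("tendon", "tendineux", 86.0),
    ("tendon", "tendinite", 11.0),
    ("nombril", "nombreux", 22.01),
    ("cortex", "cortical", 173.07),
    ("bronche", "bronchopneumonique", 100.0),  # association value invented
]
selected = select_adjectives(candidates)
```

On these examples, taken from the text except for the invented association score of the bronche pair, the sketch keeps tendon / tendineux and cortex / cortical, discards nombril / nombreux through the threshold, and filters out bronchopneumonique as a compound.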

2.4 Experiments and Evaluation

We performed two evaluations. A general evaluation examines whether the cooccurring word pairs obtained from the corpus (Sect. 2.2) before filtering are correct, i.e., whether the two words of a pair are actually morphologically related by inflection, derivation or compounding operations, with a common main morpheme. For this purpose, we reviewed all the word pairs produced and computed the resulting precision: the ratio of correct pairs over all pairs. Ranking the pairs by decreasing association, we plotted the cumulated precision of pairs against their rank. We also computed the local precision over slices of 200 pairs. This yields a second plot of local precision against rank.
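The two precision curves can be computed from the reviewed, ranked list of pairs as in this minimal Python sketch. The function names are hypothetical, and the input is assumed to be a list of booleans, one per pair in decreasing order of association, stating whether the reviewer judged the pair correct.

```python
def cumulated_precision(judgements):
    """Precision over the first r pairs, for every rank r."""
    curve = []
    correct = 0
    for rank, is_correct in enumerate(judgements, start=1):
        correct += is_correct
        curve.append(correct / rank)
    return curve


def local_precision(judgements, slice_size=200):
    """Precision within consecutive slices of slice_size pairs."""
    curve = []
    for start in range(0, len(judgements), slice_size):
        chunk = judgements[start:start + slice_size]
        curve.append(sum(chunk) / len(chunk))
    return curve


# Toy ranking of four pairs, with slices of two pairs for readability.
judged = [True, True, False, True]
cumulated = cumulated_precision(judged)
local = local_precision(judged, slice_size=2)
```

Cumulated precision smooths the curve and answers "how good are the first r pairs overall", whereas local precision shows how quality degrades at a given depth in the ranking.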

A more focused evaluation examines whether derived adjectives (Sect. 2.3) can be found for a given sample of nouns. This sample was built with the French SNOMED Microglossary for Pathology [19]. We tagged all its terms with their parts-of-speech and lemmatized them [11]. A set of anatomy nouns was compiled by selecting all terms from the T (Topography) axis which only consist of one word, tagged as noun, possibly followed by ‘, SAI’ (French abbreviation for not otherwise specified). This eliminates nouns which do not correspond per se to body parts or organs, such as arête (du nez) (tip (of nose)). 376 nouns were collected; those that start with letter a follow: abdomen, acromion, acétabulum, adventice, adénohypophyse, aine, aisselle, amnios, amygdale, anthélix, anus, aorte, aponévrose, apophyse, appendice, arachnoïde, articulation, artère, artériole, aréole, astragale, astrocyte, atlas, avorton, axis, axone. We want to measure which proportion of these nouns obtain a correct derived adjective (recall) and whether additional adjectives, not specified in SNOMED, can also be proposed, thus providing a method to extend the variants currently provided in this nomenclature. SNOMED does provide, for some of its terms, adjectival equivalents (term class 05); for instance, code T-00250 has a preferred term épithélium, SAI, a synonym term cellule épithéliale, SAI and an adjectival equivalent épithélial. We collected all terms from the T axis which only consisted of one word, tagged as adjective. A list of 170 adjectives was obtained. When such an adjective had the same code as one of the above nouns (e.g., the épithélium / épithélial example), it was associated with it. When several nouns or adjectives existed for the same code, we eliminated extraneous associations; for instance, code T-55000 is expressed by nouns gorge and pharynx and adjectives pharyngien and pharyngé. Among the four possible associations, we only kept pharynx / pharyngé and pharynx / pharyngien (and gorge was left with no known derived adjective). Among the 376 nouns, 161 are initially associated in SNOMED with an adjective; 148 associations are kept according to our stricter scheme (two nouns have two derived adjectives, hence 146 different nouns). This constitutes the gold standard for our second evaluation.

The above methods were applied to the CISMeF corpus (Sect. 2.1). The general evaluation of morphological cooccurrents was performed with a (half-)window size of 150 words (a half-window size of N words corresponds to a maximal distance of N words between pivot word w1, which is the center of the window, and a cooccurrent w2). The focused evaluation on anatomical nouns and adjectives was done with the pairs obtained with a 100-word window. Processing was done with Perl, Unix shell scripts and (for convenience) a relational database (PostgreSQL).

3 Results

3.1 General Evaluation of All Cooccurrents: Precision

With a 150-word window, 48,003 associations are found. Figure 1b plots cumulated precision and local precision against rank, associated pairs being ordered by decreasing association score (log likelihood). For instance, among the first 5,000 cooccurrents found, 3,778 were correct (cumulated precision = 75.6%); locally, the 200 pairs found from rank 4,801 to 5,000 had a 71.5% precision, and the 200 preceding pairs 63.0%. The plot shows that precision decreases with rank, and that the farthest ranks have very low local precision (below 20% for the last 6,000 pairs). This confirms that the association score is useful: it pushes the less probably correct word pairs towards the farther ranks. To check whether this score fares better than simple cooccurrence counts, we also sorted word pairs by decreasing number of cooccurrences, frequency of pivot word and frequency of cooccurrent (Fig. 1a). The difference is clear: many incorrect word pairs obtain good ranks, whereas many correct ones are badly ranked. Nevertheless, even with the log likelihood score, a sizeable number of correct pairs is still found at the farthest ranks (e.g., 728 correct pairs in the last 6,000 pairs). It would be a waste to discard them all: the additional filtering aims at finding relevant pairs among those whose association score is low.

Error analysis showed accent omissions (hypoglycémie / hypoglycemie), spelling errors (travaille / travalle), segmentation errors (glued words: maladie / maladiede), prefixes (very numerous: trans, télé, hyper, hypo, iso, méso, etc.), hyphen compounds (chien / chien-guide, aldostérone / aldostérone-synthase), and words of various languages (English, Spanish, German) which were not correctly filtered. Incidentally, some of these words get correctly paired (Spanish nuevo / nueva, infeccione / infectada, English child / children), which illustrates the fact that the association-based method is basically language-independent. Lemmatisation errors were also found with Latin words.

[Figure 1: two plots of precision (0 to 1) against rank (10,000 to 40,000), each showing cumulated precision and local precision (by 200 pairs). (a) Sorted by decreasing cooccurrence count. (b) Sorted by decreasing log likelihood ratio.]

Fig. 1. Cumulated and local precision of cooccurring pairs, plotted against rank.

Table 1. Propositions of denominal adjectives for anatomy nouns (letter a). # cooc = number of windows where both words cooccur; m.c.i.s. = maximal common initial substring; suf1 = final noun substring; suf2 = final adjective substring; f = frequency of rule -suf1/-suf2.

Noun          Adjective         # cooc  loglike  m.c.i.s.  suf1  suf2    f
abdomen       abdominal            101   584.21  abdom     en    inal    2
amygdale      amygdalien             8   100.24  amygdal   e     ien     24
aorte         aortique             170  1314.74  aort      e     ique    131
apophyse      apophysaire++          3    39.66  apophys   e     aire    72
appendice     appendiculaire++      19   225.24  appendic  e     ulaire  5
articulation  articulaire          216  1406.34  articula  tion  ire     13
artériole     artériolaire+         15    99.99  artériol  e     aire    72
aréole        aréolaire+             2    27.55  aréol     e     aire    72
astrocyte     astrocytaire           2    28.60  astrocyt  e     aire    72
axone         axonal+                8    93.21  axon      e     al      42

3.2 Focused Evaluation of 376 SNOMED Anatomy Nouns: Recall and Additions

Table 1 shows, as an illustration, the proposed derived adjectives for the 26 anatomy nouns starting with letter a: 10 receive an adjective, all of which are considered correct. 5 of these are explicitly associated to these nouns in the Microglossary, 3 occur elsewhere in the Microglossary (+), and 2 do not occur there at all (++). Table 2 shows the recall and precision of derived adjectives for all 376 anatomy nouns in our test set. 150 noun-adjective associations were selected for these nouns by the criteria exposed in


Table 2. Precision and recall of denominal adjectives (anatomy).

Candidate                     #nouns  #proposed  #correct  precision  recall
Microglossary                 376     161        161       100%       43%
Selected corpus cooccurrents  376     150        137       91%        36%
Union                         376     222        222       94%        59%

Table 3. Relative contribution of method to SNOMED-provided derived adjectives.

Source         Only in SNOMED  Found by corpus   New from corpus  Errors
Microglossary  72 = 49%        76 = 51% recall
Corpus                         76 = 51% recall   61 = 41% added   13 = 91% precision

Sect. 2.3. After review, 13 errors were encountered, i.e., a precision of 91% and an absolute recall of 36%. As a comparison point, the Microglossary specifies derived adjectives for 161 nouns, i.e., an absolute recall of 43%, not dramatically higher. Interestingly, the corpus method brings 61 new derived adjectives, so that when added to the initial, SNOMED-provided noun-adjective pairs, it increases their combined recall to 59%. Among these 61 additional adjectives, 38 occur elsewhere in the Microglossary, but 23 (38%) do not: apophysaire, appendiculaire, cardial, cotyloïdien, cristallinien, diaphysaire, hippocampique, intimal, jambier, lysosomal, macrophagique, mastocytaire, myométrial, métatarsien, néphronique, olécrânien, paramétrial, plasmatique, rhinopharyngé, réticulocytaire, tympanique, uretéral, éosinophilique. In summary, we can organize the contribution of this method with respect to what is explicitly provided by the French SNOMED Microglossary as shown in Table 3.
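The percentages quoted in this paragraph follow directly from the counts. The short sketch below redoes the arithmetic; the variable names are ours, and the counts are those given in the text.

```python
# Counts taken from the text.
nouns = 376            # anatomy nouns in the test set
proposed = 150         # noun-adjective associations selected from the corpus
errors = 13            # associations judged incorrect after review
microglossary = 161    # nouns with a derived adjective in the Microglossary
new_from_corpus = 61   # correct adjectives not already provided for these nouns

correct = proposed - errors                      # 137 correct corpus proposals
corpus_precision = correct / proposed            # about 0.91
corpus_recall = correct / nouns                  # about 0.36
microglossary_recall = microglossary / nouns     # about 0.43
combined_recall = (microglossary + new_from_corpus) / nouns  # about 0.59

for name, value in [
    ("corpus precision", corpus_precision),
    ("corpus recall", corpus_recall),
    ("Microglossary recall", microglossary_recall),
    ("combined recall", combined_recall),
]:
    print(f"{name}: {value:.0%}")
```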

3.3 Analysis of Errors and Silence (Anatomy Nouns)

The 13 erroneously paired adjectives are distributed as follows. 4 are actual denominal adjectives, but not the expected relational adjective: média / médiatique (médial), sang / sanglant (sanguin), figure / figuré (facial), embryon / embryonné (embryonnaire). Facial is built with a ‘suppletive’ base (face or facies instead of figure), which cannot be detected by our method. If these four adjectives are counted as correct, precision increases to 94% and recall to 38%. 4 neoclassical compounds passed our heuristic filter: monocyte / monocytogene, hépatocyte / hépatique, iléon / iléorectal, érythroblaste / érythrocytaire. 5 erroneous pairs come from words incorrectly tagged as adjectives: non-words (côlon / côlonb, muscle / musclaire) or actual French words (cornée / corné, glande / gland, main / maint).

We also studied the causes of silence (Table 4) for the 26 anatomy nouns starting with letter a, 10 of which received a correct derived adjective. Some of these nouns did not occur in the corpus (or were not tagged as nouns); in total, among the 376 nouns examined, 71 (19%) could not be found in the corpus, and could not therefore receive a derived adjective by our method. Some of the nouns, although present, had no identifiable derived adjective in the corpus. The constraint of the four initial characters


Table 4. Causes of silence for 16 out of 26 anatomy nouns starting with letter a. When a case is taken into account in a row, it is dropped from consideration in further rows.

Diagnosis                          Number  %   Examples
Noun not in corpus                 4       25  adénohypophyse, amnios
Adjective absent or unknown        5       31  acromion, aisselle, avorton
Common initial substring < 4       3       19  artère / artériel, anus / anal
Noun and adjective do not cooccur  3       19  aponévrose / aponévrotique
Suppletive base                    1       6   aine / inguinal
Total                              16      99

eliminated word pairs which would have been strongly associated (for instance, artère / artériel). Finally, some related nouns and adjectives were both present in the corpus, but not together within a 100-word window.
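The m.c.i.s., suf1 and suf2 columns of Table 1, and the effect of the four-character constraint, can be illustrated as follows. This is a minimal sketch under our own assumptions: the helper names `split_pair` and `passes_prefix_constraint` are hypothetical, and strings are normalized to NFC so that accented letters count as single characters.

```python
import unicodedata
from typing import Tuple


def split_pair(noun: str, adjective: str) -> Tuple[str, str, str]:
    """Return (m.c.i.s., suf1, suf2) for a noun / adjective pair."""
    noun = unicodedata.normalize("NFC", noun)
    adjective = unicodedata.normalize("NFC", adjective)
    length = 0
    for a, b in zip(noun, adjective):
        if a != b:
            break
        length += 1
    return noun[:length], noun[length:], adjective[length:]


def passes_prefix_constraint(noun: str, adjective: str, min_len: int = 4) -> bool:
    """True when the common initial substring has at least min_len characters."""
    return len(split_pair(noun, adjective)[0]) >= min_len


print(split_pair("abdomen", "abdominal"))             # ('abdom', 'en', 'inal')
print(split_pair("articulation", "articulaire"))      # ('articula', 'tion', 'ire')
print(passes_prefix_constraint("artère", "artériel"))  # False: only 'art' is shared
```

The last call shows why artère / artériel is rejected: è and é differ, so the common initial substring stops after three characters.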

4 Discussion and Perspectives

The method proposed collects a large number of morphologically related word pairs, a large proportion of which is correct (with a 150-word window, 75.6% for the first 5,000 pairs, and still 43.9% for the total 48,000 pairs). A significant part of the errors can be filtered by taking into account additional criteria (Sect. 2.3) for derived word pairs. These criteria were applied to a sample of SNOMED anatomy nouns and identified their derived adjectives with a precision of 91–94%. The absolute recall obtained is modest (36%); however, a significant number (+41%) of additional adjectives were also proposed, which can help to extend the terminological coverage of this nomenclature.

We have seen that a number of source nouns were not found in our corpus (19% in our sample of anatomy nouns). This limitation can be overcome in several ways. First, the corpus can be extended: it is presently made of only 4,627 documents which, though carefully selected through CISMeF, only represent a fraction of the domain. Clinical documents, such as patient discharge summaries, will also help to diversify the available registers. Second, complementary methods can be applied, linking source nouns and corpus adjectives by rules induced from the corpus or from a terminology (as in our previous work [10, 11]) or from both [12]. For instance, a number of target noun+adjective word pairs were present in our corpus, but not together within a 100-content-word window. Following [10, 11], such pairs, e.g., aponévrose (11 occurrences) / aponévrotique (4 occurrences), might be detected by applying rule -se / -tique (observed on 35 distinct cooccurring noun-adjective pairs) to all nouns and adjectives seen in the corpus. As shown in a number of previous studies (e.g., [25] for collecting proper nouns), combining several classifiers may help boost recall with only a moderate loss in precision. The four-character constraint also blocks pairs of words which would have been strongly associated (artère / artériel). We must study the amount of noise that would be added by going down to three characters, but also look for methods (e.g., ‘cognates’) that take into account non-initial, non-contiguous strings of letters to match words (e.g., poumon / pulmonaire). It must also be noted that some of the nouns may have no associated relational adjective (e.g., avorton).
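Applying an induced suffix rule to all nouns and adjectives of the corpus, as suggested above for -se / -tique, could look like the sketch below. It is an illustration only, not the procedure of [10, 11]: the function `apply_rule` and the small word lists are hypothetical.

```python
from typing import Iterable, List, Tuple


def apply_rule(nouns: Iterable[str], adjectives: Iterable[str],
               suf1: str, suf2: str) -> List[Tuple[str, str]]:
    """Pair each noun ending in suf1 with the adjective obtained by
    replacing suf1 with suf2, when that adjective was seen in the corpus."""
    seen_adjectives = set(adjectives)
    pairs = []
    for noun in nouns:
        if suf1 and not noun.endswith(suf1):
            continue
        candidate = noun[:len(noun) - len(suf1)] + suf2
        if candidate in seen_adjectives:
            pairs.append((noun, candidate))
    return pairs


# Hypothetical corpus vocabularies; only the first pair appears in the text.
corpus_nouns = ["aponévrose", "artère", "sclérose"]
corpus_adjectives = ["aponévrotique", "artériel", "abdominal"]
print(apply_rule(corpus_nouns, corpus_adjectives, "se", "tique"))
# [('aponévrose', 'aponévrotique')]
```

Because the rule needs no cooccurrence, it can recover pairs whose members never appear in the same window, at the cost of whatever noise the rule itself introduces.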


Precision is high (91–94%) for the selected derived noun-adjective pairs; it is lower on average before selection (see Fig. 1). Progressive addition of knowledge to the system is a possible direction for reducing noise. For instance, providing a blacklist of prefixes (trans, hyper, iso, etc.) would block a number of undesired connections (see Sect. 3.1). This would be in line with other morphological analysis methods based on rules and exceptions [5, 23]. Automated methods for collecting morphological knowledge, such as the present one or [10, 11], can also help bootstrap or complement manual methods such as [8] (see, e.g., [16]), which generally yield better overall results.

Besides improving the method itself, larger tests must now be performed on other word samples: other axes of SNOMED, nouns from other medical terminologies, more generally the nouns found in the processed corpora, and other types of derivation (noun-verb, verb-adjective, etc.). The same method is in principle applicable to a wide range of languages, provided a corpus, tagger and lemmatizer are available.

We believe the present method can constitute one of a series of components for helping human lexicon or vocabulary editors to collect more quickly and easily a larger amount of medical language data. It will be used as such in the UMLF project.

Acknowledgements

We thank Dr. R.A. Côté for having kindly sent us a pre-commercial copy of the French version of the SNOMED International Microglossary for Pathology, and the CISMeF team of Rouen University Hospital for their Catalog and Index of French-language Medical Web Sites, which also constitutes a precious resource for medical language processing. F. Hadouche implemented the cooccurrence processing program.

References

1. Lovis, C., Baud, R., Michel, P.A., Scherrer, J.R.: A semi-automatic ICD encoder. J Am Med Inform Assoc 3 (1996) 937–937

2. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. J Am Med Inform Assoc 8 (2001)

3. Hahn, U., Honeck, M., Piotrowski, M., Schulz, S.: Subword segmentation: Leveling out morphological variations for medical document retrieval. J Am Med Inform Assoc 8 (2001) 229–233

4. Zweigenbaum, P., Darmoni, S.J., Grabar, N.: The contribution of morphological knowledge to French MeSH mapping for information retrieval. J Am Med Inform Assoc 8 (2001) 796–800

5. McCray, A.T., Srinivasan, S., Browne, A.C.: Lexical methods for managing variation in biomedical terminologies. In: Proc 18th Annu Symp Comput Appl Med Care, Washington, Mc Graw Hill (1994) 235–239

6. Weske-Heck, G., Zaiß, A., Zabel, M., Schulz, S., Giere, W., Schopen, M., Klar, R.: The German Specialist Lexicon. J Am Med Inform Assoc 8 (2002)

7. Zweigenbaum, P., Baud, R., Burgun, A., Namer, F., Jarrousse, E., Grabar, N., Ruch, P., Le Duff, F., Thirion, B., Darmoni, S.: Towards a unified medical lexicon for French. In Baud, R., Fieschi, M., Le Beux, P., Ruch, P., eds.: Proceedings Medical Informatics Europe, Amsterdam, IOS Press (2003) 415–420

8. Lovis, C., Michel, P.A., Baud, R., Scherrer, J.R.: Word segmentation processing: a way to exponentially extend medical dictionaries. In Greenes, R.A., Peterson, H.E., Protti, D.J., eds.: Proc 8th World Congress on Medical Informatics. (1995) 28–32

9. Zweigenbaum, P.: Resources for the medical domain: medical terminologies, lexicons and corpora. ELRA Newsletter 6 (2001) 8–11

10. Zweigenbaum, P., Grabar, N.: Automatic acquisition of morphological knowledge for medical language processing. In Horn, W., Shahar, Y., Lindberg, G., Andreassen, S., Wyatt, J., eds.: Artificial Intelligence in Medicine. Lecture Notes in Artificial Intelligence. Springer-Verlag (1999) 416–420

11. Grabar, N., Zweigenbaum, P.: Automatic acquisition of domain-specific morphological resources from thesauri. In: Proceedings of RIAO 2000: Content-Based Multimedia Information Access, Paris, France, C.I.D. (2000) 765–784

12. Jacquemin, C.: Guessing morphology from terms and corpora. In: Proc 20th ACM SIGIR, Philadelphia, PA (1997) 156–167

13. Xu, J., Croft, B.W.: Corpus-based stemming using co-occurrence of word variants. ACM Transactions on Information Systems 16 (1998) 61–81

14. Gaussier, E.: Unsupervised learning of derivational morphology from inflectional lexicons. In Kehler, A., Stolcke, A., eds.: ACL workshop on Unsupervised Methods in Natural Language Learning, College Park, Md. (1999)

15. Daille, B.: Identification des adjectifs relationnels en corpus. In Amsili, P., ed.: Proceedings of TALN 1999 (Traitement automatique des langues naturelles), Cargèse, ATALA (1999) 105–114

16. Hathout, N., Namer, F., Dal, G.: An experimental constructional database: the MorTAL project. In Boucher, P., ed.: Many morphologies. Cascadilla Press, Somerville, MA (2002) 178–209

17. Porter, M.F.: An algorithm for suffix stripping. Program 14 (1980) 130–137

18. Hadouche, F.: Acquisition de resources morphologiques à partir de corpus. DESS d’ingénierie multilingue, Institut National des Langues et Civilisations Orientales, Paris (2002)

19. Côté, R.A.: Répertoire d’anatomopathologie de la SNOMED internationale, v3.4. Université de Sherbrooke, Sherbrooke, Québec. (1996)

20. Darmoni, S.J., Leroy, J.P., Thirion, B., Baudic, F., Douyere, M., Piot, J.: CISMeF: a structured health resource guide. Methods Inf Med 39 (2000) 30–35

21. Grefenstette, G., Nioche, J.: Estimation of English and non-English language use on the WWW. In: Proceedings of RIAO 2000: Content-Based Multimedia Information Access, Paris, France, C.I.D. (2000) 237–246

22. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK (1994) 44–49

23. Namer, F.: FLEMM : un analyseur flexionnel du français à base de règles. Traitement Automatique des Langues 41 (2000) 523–547

24. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA (1999)

25. Bodenreider, O., Zweigenbaum, P.: Identifying proper names in parallel medical terminologies. In Hasman, A., Blobel, B., Dudeck, J., Engelbrecht, R., Gell, G., Prokosh, H.U., eds.: Medical Infobahn for Europe: Proceedings of MIE2000 and GMDS2000, Amsterdam, IOS Press (2000) 443–447

Learning-Free Text Categorization

Patrick Ruch, Robert Baud, and Antoine Geissbuhler

University Hospital of Geneva, Medical Informatics Division1205 Geneva, Switzerland

[email protected]

Abstract. In this paper, we report on the fusion of simple retrieval strategies with thesaural resources in order to perform large-scale text categorization tasks. Unlike most related systems, which rely on training data in order to infer text-to-concept relationships, our approach can be applied with any controlled vocabulary and does not use any training data. The first classification module uses a traditional vector-space retrieval engine, which has been fine-tuned for the task, while the second classifier is based on regular variations of the concept list. For evaluation purposes, the system uses a sample of MedLine and the Medical Subject Headings (MeSH) terminology as collection of concepts. Preliminary results show that the performance of the hybrid system is significantly improved compared to each single system. For top returned concepts, the system reaches a performance comparable to that of machine learning systems, while genericity and scalability issues are clearly in favor of the learning-free approach. We draw conclusions on the importance of hybrid strategies combining data-poor classifiers and knowledge-based terminological resources for general text mapping tasks.

1 Introduction

Typical concept mapping applications use a set of key-words as concepts to be selected into a glossary. However, key-word assignment is only the most popular application of such systems, and the task can also be seen as a named-entity (NE) recognition task if we consider entities that can be listed1. Computer-based concept mapping technologies include:

– retrieval based on word-matching, which attributes concepts to text based on shared words between the text and the concepts;

– empirical learning of text-concept associations from a training set of texts and their associated concepts.

1 As it is the case in molecular biology with gene, protein and tissue entities [1]



Retrieval is often presented as the weakest method [2]; however, there are several application areas where training data are clearly missing2.

1.1 Biomedical Domain

To our knowledge, the largest class set ever used by text classification systems is about $2 \cdot 10^4$, and such systems were applied to the biomedical domain, based on the Medical Subject Heading (MeSH) ontology. Although such a class set is already large for typical categorization tasks, terminological resources in health sciences, as well as documentation needs, require tools able to process much larger sets of concepts3.

1.2 Concept Mapping as a Learning-Free Classification Task

General text classification has been widely studied and has led to an impressive number of papers (see [4] for a recent survey of the domain). A non-exhaustive list of machine learning approaches to text categorization includes naive Bayes [5], k-nearest neighbors [4], SVM [6], boosting [7], and rule-learning algorithms [8]. However, most of these studies apply text classification to a small set of classes (usually a few hundred, as in the paradigmatic Reuters collection [9]). In comparison, our system is designed to handle large class sets: retrieval systems can be applied to a virtually infinite set of concepts, and $10^5$ to $10^6$ is still a modest range. For the sake of evaluation, the class set ranges from about 20,000 (if only unique canonical MeSH terms are taken into account) up to 140,000 (if synonyms are considered in addition to their canonical class).

1.3 MeSH Mapping and MedLine

2 We must note that even if we assumed that large and representative training data will one day be available for any possible domain, current machine learning systems would still have to face major scalability problems. The problem is twofold: it concerns both the ability of the system to work with large concept sets, and its ability to learn and generalize regularities for rare events: Larkey and Croft [3] show how the frequency of the concept in the collection is a major parameter for learning-based text categorization tools.

3 Thus, the May 2002 release of the Unified Medical Language System (UMLS-2002AB) contained 871,584 different concepts and 2.1 million terms. In molecular biology, the SWISS-PROT Release 40.28 (September 2002) has 114033 entries, and most entries have synonyms, while the TrEMBL Release 21.12 (September 2002) has 684666 entries.

4 It must be observed that MedLine’s annotation is done on the basis of the complete article, while in our experiments only the abstract is considered.

Figure 1 provides an example of a citation4 in MedLine (authors, title, institution, and publication types are omitted; major MeSH terms are indicated with *; subheadings are removed and the semicolon is used as separator) and its corresponding MeSH terms. Most text categorization studies working with MedLine collections neglect two important aspects of MedLine’s annotation with MeSH terms that will be considered in the present study:

a. availability of thesauri: the MeSH is provided with an important thesaurus (120,020 synonyms), whose impact will be assessed in our study;

b. comprehensiveness: the MeSH follows a hierarchical structure, but if we consider only unique strings, there are 19 632 terms; unlike related results (discussed in section 4.1), our system is applied with the full MeSH.

The production of exopolysaccharides (EPSs) by a mucoid clinical isolate of Burkholderia cepacia involved in infections in cystic fibrosis patients, was studied. Depending on the growth conditions, this strain was able to produce two different EPS, namely PS-I and PS-II, either alone or together. PS-I is composed of equimolar amounts of glucose and galactose with pyruvate as substituent, and was produced on all media tested. PS-II is constituted of rhamnose, mannose, galactose, glucose and glucuronic acid in the ratio 1:1:3:1:1, with acetate as substituent, and was produced on either complex or minimal media with high-salt concentrations (0.3 or 0.5 M NaCl). Although this behavior is strain-specific, and not cepacia-specific, the stimulation of production of PS-II in conditions that mimic those encountered by B. cepacia in the respiratory track of cystic fibrosis patients, suggests a putative role of this EPS in a pathologic context.

Burkholderia cepacia*; Carbohydrate Conformation; Carbohydrate Sequence; Comparative Study; Culture Media*; Cystic Fibrosis*; Glucose; Glycerol; Human; Molecular Sequence Data; Onions; Phenotype; Polysaccharides, Bacterial*; Temperature

Fig. 1. Citation of MedLine with MeSH terms provided by professional indexers.

The remainder of this paper is organized as follows: the next section presents the collection and metrics used, as well as the basic classifiers. Then, we describe and evaluate our basic classifiers, before presenting and testing how these classifiers can be merged. The performance of the combined mapping system is compared to related studies. Finally, we conclude and suggest some future work.

2 Evaluation

Following [3], and as is usual with retrieval systems, the core measure for the evaluation is the 11-point average precision. We also provide the total number of relevant terms returned by the system on the complete run. The top precision (interpolated Precision at Recall = 0) is also given. In order to provide a minimal assessment of the system, we apply it to the Cystic Fibrosis5 (CF) collection [10]. The CF collection is a collection of 1239 MedLine citations. For each citation, we used the content of the abstract field as input to the system. Using other fields, such as the title or the publication’s source, might have provided interesting additional evidence for classification, but we decided to work only with the abstract in order to minimize the number of variables to be controlled. The average number of concepts per abstract in the collection is 12.3, and the following measures were computed considering the top-15 terms returned (TR).

5 Available on Marti Hearst’s pages at http://www.sims.berkeley.edu/ hearst/irbook/
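For readers unfamiliar with the measure, the 11-point average precision and the interpolated precision at recall 0 can be computed for one abstract as sketched below. This is the textbook definition, not the authors' evaluation script; the function name and the toy term lists are hypothetical, and averaging over all abstracts of the collection is assumed to be done by the caller.

```python
from typing import Iterable, List, Tuple


def eleven_point_average_precision(ranked: List[str],
                                   relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (11-point average precision, precision at recall 0).

    Interpolated precision at recall level r is the highest precision
    observed at any rank whose recall is at least r.
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant term is required")

    points = []  # (recall, precision) after each returned term
    hits = 0
    for rank, term in enumerate(ranked, start=1):
        if term in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))

    interpolated = []
    for i in range(11):
        level = i / 10
        reachable = [p for r, p in points if r >= level]
        interpolated.append(max(reachable, default=0.0))
    return sum(interpolated) / 11, interpolated[0]


# Toy run: 4 terms returned, 3 relevant terms, one of which is never returned.
average, top = eleven_point_average_precision(
    ["Cystic Fibrosis", "Human", "Glucose", "Onions"],
    {"Cystic Fibrosis", "Glucose", "Temperature"},
)
```

In this toy run the interpolated precision is 1.0 for recall levels 0 to 0.3, 2/3 for levels 0.4 to 0.6, and 0 beyond, which gives an average of 6/11.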

3 Method

One of the most comprehensive studies of MeSH classification based on simple word-matching has been carried out at the National Library of Medicine and has led to the development of the MetaMap tool. For developing MetaMap, different methods and combinations of methods were compared [11], including retrieval strategies (based on INQUERY distance metrics), syntactic and statistical phrase chunking, and MeSH cooccurrences. Unfortunately the system has been evaluated on the UMLS collection, which is not publicly available. We use the UMLS distribution of the MetaMap system with the MeSH as concept list and with default settings in order to obtain a blackbox baseline measure for comparison with our systems. Table 1 shows the results of MetaMap, together with the two basic classifiers, which are described in the next section. We see that MetaMap outperforms each classifier on the complete Cystic Fibrosis collection.

Table 1. Results for MetaMap, RegEx, and tf.idf (VS) classifiers with several weighting schemas. For the VS engine, tf.idf parameters are provided: the first triplet indicates the weighting applied to the “document collection”, i.e. the concepts, while the second is for the “query collection”, i.e. the abstracts. The total of relevant terms is 15193.

System or parameters  Relevant retrieved  Top precision  11pt Average precision
MetaMap               4075                .7425          .1790
RegEx                 3986                .7128          .1601
tf.idf (VS)
  lnc.atn             3838                .7733          .1421
  anc.atn             3813                .7733          .1418
  ltc.atn             3788                .7198          .1341
  ltc.lnn             2946                .7074          .111

3.1 Basic Classifiers

Two main modules constitute the skeleton of our system: the regular expression component, and the vector space component. The former component uses tokens as indexing units and can take advantage of the thesaurus, while the latter uses stems (Porter-like). Each of the basic classifiers uses known approaches to document retrieval. The first tool is based on a regular expression pattern matcher. Although such an approach is less used in modern information retrieval systems6, it is expected to perform well when applied to very short documents such as key-words: MeSH terms do not contain more than 5 words. The second classifier is based on a SMART-like vector-space engine [13]. This second tool is expected to provide high recall, in contrast with the regular expression-based tool, which should favor precision.

6 With a notable exception, the GLIMPSE system [12].

Regular Expressions and MeSH Thesaurus. The regular expression (RegEx) pattern matcher is applied to the canonic list of MeSH terms (19 936 concepts) augmented with its thesaurus (the total includes 139 956 terms). In this system, text normalization is mainly processed by removing punctuation or by the MeSH terminological resources when the thesaurus is used. Indeed, the MeSH thesaurus provides a large set of “synonyms”, which are mapped to a unique MeSH representative in the canonic collection. Rather than synonyms only, this set gathers morpho-syntactic variants (mainly for plural forms), noun phrase reformulations, strict synonyms, and a last class of related terms, which mixes up generic terms, specific terms, and some other kinds of less obvious semantic relations: for example, Inhibition is mapped to Inhibition (Psychology). The manually crafted transition network of the pattern-matcher is very simple, as it allows some insertions or deletions within a MeSH term, and ranks the proposed candidate terms based on these basic edit operations following a completion principle: the more tokens of a term are matched, the more relevant the term is. The system hashes the abstract into 5-token phrases and moves the window through the abstract. The same type of operations is allowed at the token level, so that the system is able to handle minor string variations, as for instance between diarrhea and diarrhoea. Unexpectedly, Table 1 shows that the single RegEx system performs better than any single tf.idf7 (term frequency-inverse document frequency) system, so that surprisingly, the thesaurus-powered pattern-matcher provides better results than the basic VS engine for MeSH mapping.
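The windowed matching described above can be approximated as follows. This is a rough sketch of the idea, not the transition network used in the system: the function names are hypothetical, `difflib.SequenceMatcher` stands in for the token-level edit operations, and the similarity threshold of 0.9 is an assumption chosen so that diarrhea and diarrhoea match.

```python
import re
from difflib import SequenceMatcher
from typing import List


def tokens_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Exact match, or a near match tolerating minor spelling variation."""
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


def window_score(term_tokens: List[str], window: List[str]) -> float:
    """Fraction of the term's tokens found, in order, inside the window.

    Extra tokens in the window play the role of insertions; term tokens
    that are not found play the role of deletions.
    """
    matched = 0
    position = 0
    for token in term_tokens:
        for i in range(position, len(window)):
            if tokens_match(token, window[i]):
                matched += 1
                position = i + 1
                break
    return matched / len(term_tokens)


def best_score(term: str, text: str, window_size: int = 5) -> float:
    """Slide a window of window_size tokens through the text and keep the
    best completion score for the given concept term."""
    term_tokens = term.lower().split()
    tokens = re.findall(r"\w+", text.lower())
    last_start = max(1, len(tokens) - window_size + 1)
    windows = [tokens[i:i + window_size] for i in range(last_start)]
    return max(window_score(term_tokens, w) for w in windows)


print(best_score("cystic fibrosis", "infections in cystic fibrosis patients"))
print(best_score("chronic diarrhoea", "patients with chronic severe diarrhea were studied"))
```

Ranking candidate terms by this score follows the completion principle stated in the text: a term whose tokens are all found in one window scores 1.0, a half-matched term 0.5.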

Vector Space System. The vector space (VS) module is based on a general IR engine8 with tf.idf weighting schema. In this study, it uses stems (Porter-like, with minor modifications) as indexing terms, and a stop word list. While stemming can be an important parameter, whose impact is sometimes a matter of discussion [15], we did not notice any significant differences between the use of tokens and the use of stems, while the index’s size is larger (8755 vs. 5972 entries) when tokens are chosen as indexing units. The graceful behavior of stemming is probably due to the fact that tokens of the biomedical vocabulary are usually longer than in regular English, so that word conflation creates only few confusing stems. However, we noticed that a significant set of semantically related stems should have been conflated in the same indexing unit: for example, the morpheme immun is found in 48 different stems, and using a morpheme-based word conflation system could have improved the system. Finally, let us note that MeSH terms contain 1 to 5 words, so that we could have used phrases (as in [16] and [17]); however, we believe that part of the improvement that could have been brought by using phrases is probably achieved by the RegEx module.

7 We use the (de facto) SMART standard representation in order to express these different parameters, cf. [14] for a detailed presentation. For each triplet provided in Table 1, the first letter refers to the term frequency, the second refers to the inverse document frequency and the third letter refers to a normalization factor.

8 Available on the first author’s homepage.

A large part of this study was dedicated to tuning the VS engine, and tf.idf weighting parameters were systematically evaluated. The conclusion is that cosine normalization was especially effective for our task. This is not surprising, considering the fact that cosine normalization performs well when all documents are short, as is the case of MeSH terms9. Thus, in Table 1, the top four weighting functions use cosine as normalization factor. We also observed that the idf factor, which was calculated on the MeSH collection, performed well; this means that the canonical MeSH vocabulary is large enough to effectively underweight non-content words (such as disease and syndrome). Calculating the idf factor on a large collection of abstracts could have been investigated, but such a solution might have made the system more collection-dependent.
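The SMART triplets used in the tables (for example ltc for concepts and lnn for abstracts) can be read as in the sketch below. It covers only the letters that appear in this paper, with their usual SMART meaning: n, l and a for the term frequency, n and t for the inverse document frequency, n and c for normalization. The function names, the default document frequency of 1 for unseen terms, and the toy data are our assumptions, not details of the authors' engine.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def smart_weights(tokens: List[str], scheme: str,
                  df: Optional[Dict[str, int]] = None,
                  n_docs: int = 1) -> Dict[str, float]:
    """Weight the tokens of one text with a three-letter SMART scheme."""
    tf_code, idf_code, norm_code = scheme
    counts = Counter(tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())
    df = df or {}

    weights = {}
    for term, tf in counts.items():
        if tf_code == "n":        # natural term frequency
            w = float(tf)
        elif tf_code == "l":      # logarithmic term frequency
            w = 1.0 + math.log(tf)
        elif tf_code == "a":      # augmented, with alpha = beta = 0.5
            w = 0.5 + 0.5 * tf / max_tf
        else:
            raise ValueError(f"unsupported tf code: {tf_code}")
        if idf_code == "t":       # inverse document frequency
            w *= math.log(n_docs / df.get(term, 1))  # df of 1 assumed if unseen
        elif idf_code != "n":
            raise ValueError(f"unsupported idf code: {idf_code}")
        weights[term] = w

    if norm_code == "c":          # cosine normalization
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
    elif norm_code != "n":
        raise ValueError(f"unsupported normalization code: {norm_code}")
    return weights


def rsv(concept: Dict[str, float], abstract: Dict[str, float]) -> float:
    """Retrieval status value: inner product of the two weight vectors."""
    return sum(w * abstract.get(t, 0.0) for t, w in concept.items())


concept = smart_weights(["cystic", "fibrosis"], "lnc")
abstract = smart_weights(["cystic", "fibrosis", "patients", "cystic"], "lnn")
score = rsv(concept, abstract)
```

With cosine normalization on the concept side, a short concept such as a two-word MeSH term is not penalized relative to longer ones, which is consistent with the observation made in the text.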

4 Results

The hybrid system combines the regular expression classifier with the vector-space classifier. Unlike [3], we do not merge our classifiers by linear combination, because the RegEx module does not return a scoring consistent with the vector space system. Therefore the combination does not use the RegEx’s score, and instead it uses the list returned by the vector space module as a reference list (RL), while the list returned by the regular expression module is used as boosting list (BL), which serves to improve the ranking of terms listed in RL. A third factor takes into account the length of terms: both the length in characters (L1) and the byte size (L2, with L2 > 3) of terms are computed, so that long and/or multi-word terms appearing in both lists are favored over short and/or single word terms. We assume that the reference list has exhaustive coverage, and we do not set any threshold on it. For each term t listed in the RL, the combined Retrieval Status Value (RSV) is:

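$$
\mathrm{RSV}_{Hybrid}(t) =
\begin{cases}
\mathrm{RSV}_{VS}(t) \cdot \ln\bigl(L_1(t) \cdot L_2(t) \cdot k\bigr) & \text{if } t \in BL, \\
\mathrm{RSV}_{VS}(t) & \text{otherwise.}
\end{cases}
\qquad (1)
$$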
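Equation (1) can be sketched in a few lines. This is one possible reading, not the authors' code: the constant k is not specified in the text, so the default of 1.0 is a placeholder; L1 is taken as the number of characters and L2 as the UTF-8 byte size; and the condition L2 > 3 is interpreted here as a requirement for boosting. All names and scores in the example are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple


def hybrid_rsv(reference: Dict[str, float], boosting: Iterable[str],
               k: float = 1.0) -> List[Tuple[str, float]]:
    """Re-rank the vector-space list (RL) using the RegEx list (BL).

    reference maps each term of the reference list to its vector-space RSV;
    boosting is the list of terms returned by the RegEx module.
    k is a placeholder constant: its value is not given in the text.
    """
    boosted_terms = set(boosting)
    scores = {}
    for term, rsv_vs in reference.items():
        l1 = len(term)                       # length in characters
        l2 = len(term.encode("utf-8"))       # byte size (assumed UTF-8)
        if term in boosted_terms and l2 > 3:
            scores[term] = rsv_vs * math.log(l1 * l2 * k)
        else:
            scores[term] = rsv_vs
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical vector-space scores for three terms of the reference list.
rl = {"Cystic Fibrosis": 0.40, "Glucose": 0.50, "Human": 0.45}
bl = ["Cystic Fibrosis"]
ranking = hybrid_rsv(rl, bl)
```

In this toy case the long, multi-word term found by both modules moves from last to first position, while the scores of terms absent from the boosting list are left unchanged.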

Table 2 shows that the optimal tf.idf parameters for the basic VS classifier (lnc.atn) do not provide the optimal combination with RegEx. The optimal combination is obtained with ltc.lnn settings10. We also observe that the atn.ntn weighting schema maximizes the top candidate (i.e. Precision at Recall = 0) measure, but for a general purpose system, we prefer to maximize average precision, since this is the only measure that summarizes the performance of the full ordering of concepts. However, in the context of a fully automatic system (for example for CLIR purposes), the top-ranked concepts (1 or 2) are clearly of major importance, therefore we also provide this measure.

9 As for more advanced schemas, we tested the combination of RegEx with pivoted normalization and it did not outperform the combination RegEx + ltc.lnn.

10 For the augmented term frequency factor (noted a), which is defined by the function α + β × (tf / max(tf)), the value of the parameters is α = β = 0.5.

Table 2. Combining VS with RegEx.

Weighting function (concepts.abstracts)  Relevant retrieved  Top Precision  Average Precision
Hybrids: tf.idf (VS) + RegEx
  ltc.lnn                                4308                .8884          .1818
  lnc.lnn                                4301                .8784          .1813
  anc.ntn                                4184                .8746          .1806
  anc.ntn                                4184                .8669          .1795
  atn.ntn                                3763                .9143          .1794

4.1 Related Results

While several works have concentrated on applying machine learning methods to text categorization, it is often difficult to compare and synthesize the large quantity of results provided in these studies. One of the main reasons is probably that there is no strict definition of the task, which we believe must be seen as a subtask11 rather than a task in itself. Indeed, apart from the central classification problem and the common textual material, which are shared by all these subtasks, there are few common points between them. The gap is well exemplified if we consider on the one side TC applied to sentence extraction, as is usual in automatic summarization, and on the other side concept mapping: while the former works with a handful of classes (up to a dozen in [18] or [19]), the latter uses virtually infinite sets of classes. Between the two extremes, a continuous span of text classification experiments can be identified, the most studied of which (which can also be seen as the paradigmatic ones) lie in the middle, from some hundreds up to some thousands of classes.

OHSUMED. As for classification with the MeSH and using MedLine records, the OHSUMED collection has often been used. Like the CF collection, the OHSUMED collection contains a list of MedLine abstracts, so that both collections are equally representative of MedLine. To the best of our knowledge only two studies have used the entire set of 14,000 MeSH categories [20] [21] used in OHSUMED, and no one ever used the complete 20000-item MeSH terminology that we used, therefore comparison is difficult. The main reason for this is that many TC methods cannot process such large sets. Yang [20], Lewis et al. [22], and Lam and Ho [21] have published results using the subset of categories from the “Heart Diseases” sub-tree (HD-119, so-called because it uses only 119 concepts). In [23], 42 categories of the HD sub-tree were excluded, because these categories had a training set frequency less than 15. Yang [20] reduces the collection to only those documents that are positive examples of the categories of the HD-119. The final profile of the test collection is very different, as the number of terms per abstract is 1.4. Joachims [6] has also published results for the OHSUMED collection using SVM. His work uses the first 20,000 documents of the year 1991 divided into two subsets of 10,000 documents each, which are respectively used for training and testing. He reports very impressive results but his TC task is very different: he assumes that if a category in the MeSH tree is assigned then its more general category in the hierarchy is also present, so that he uses only the high level disease categories. This simplifies the task considerably and may partially explain the good results obtained in these experiments.

11 However, document filtering as in TREC-9 is a real task.

Nevertheless, we still attempt to provide some elements for comparing our system with previous studies. The most similar experiment was probably conducted by [24] (noted YC in the following). The authors use a classifier based on singular value decomposition (LLSF) for text categorization. They use the International Classification of Diseases (ICD) as concept list, and full-text diagnoses as instances to be classified. ICD, like the MeSH, contains a large number of categories (about 12,000), and is also provided with a large thesaurus. Both collections are lexically related: most of the 6,000 diseases listed in the MeSH subtree for diseases have an equivalent in ICD codes, so that ICD can be seen as a more specific partition of the MeSH categories restricted to the “disease” subtree. Assigning ICD codes and assigning MeSH terms are therefore quite similar tasks, which supports a possible comparison. Unfortunately, only a comparison with Precision at Recall=0 is available in their study. We also indicate the results obtained by the SMART system as reported in [25] (noted YY in the following). Even if she works with about 4,000 MeSH terms only, this result is useful in order to provide a common baseline measure.

Table 3. Comparison: our hybrid system vs. learning systems. Weighting schemas are given for the VS system.

Method/Collection/Paper    Av. Prec.        Prec at Rec=0
SMART/OHSU/YY              .15 (0)          .61 (0)
LLSF/ICD                   -                .840 (+37.7)
ltc.lnn/CFC                .1818 (+20.0)    .8884 (+45.0)
atn.ntn/CFC                -                .9143 (+49.9)

Comparison measures are reported in Table 3. For top precision, we observe that our hybrid system (+45.0% for ltc.lnn and +49.9% for atn.ntn) is more effective than LLSF (+37.7%). Regarding average precision, our method outperforms SMART by 20%.

Finally, these results are the opposite of what is concluded in [20]: simple word-based strategies behave gracefully as concept granularity grows, i.e., the more concepts there are in a collection, the more effective retrieval strategies will be. We can assume that retrieval approaches perform well when categories are numerous, not only because training becomes a major issue for learning systems¹², but because the high granularity may help the retrieval system to cover every dimension of the conceptual space. In contrast, learning systems are able to infer and cluster categories (generic or specific concepts) that are not explicitly present in the source document, so high granularity does not really help them.

¹² Again, this problem is avoided in studies conducted with learning systems by filtering out concepts with low frequencies.

5 Conclusion and Future Work

Concept mapping can be seen both as an alternative to the scalability issues of learning methods and as a complementary module (as IR systems often are), likely to provide a solution when training data are insufficient. In an intermediate position, concept mapping can be seen as an optional module, as it provides a strategy to classify along those classes that are absent or under-represented in the training data.

We have reported on the development and evaluation of a MeSH mapping tool. The system combines a pattern matcher, based on regular expressions of MeSH terms, and a vector space retrieval engine that uses stems as indexing terms, a traditional tf.idf weighting schema, and cosine as normalization factor. The hybrid system showed results similar to or better than machine learning tools for the top returned candidate terms, while the scalability of our data-poor (if not data-independent) approach is also an advantage compared to data-driven systems. The system provides a new baseline for text categorization systems, improving average precision by 20% in comparison to standard retrieval engines (SMART). Finally, combining learning and learning-free systems could be beneficial in order to design general broad-coverage concept mapping systems.
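As an illustration of the hybrid design summarized above, the following minimal Python sketch ranks a small list of candidate terms against a text using tf.idf weights and cosine normalization, and promotes terms found by a regular-expression match. It is not the system evaluated in this paper: the tokenizer stands in for a stemmer, the smoothed idf formula and the +1 bonus for an exact match are assumptions, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphabetic tokens; a real system would apply a stemmer here.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(terms):
    # One tf.idf vector per controlled-vocabulary term.
    bags = [Counter(tokenize(t)) for t in terms]
    df = Counter(w for bag in bags for w in bag)
    idf = {w: math.log(len(bags) / df[w]) + 1.0 for w in df}  # assumed smoothing
    return idf, [{w: tf * idf[w] for w, tf in bag.items()} for bag in bags]


def cosine(u, v):
    dot = sum(weight * v[w] for w, weight in u.items() if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def exact_match(term, text):
    # Pattern matcher: the words of the term, in order, separated by whitespace.
    words = tokenize(term)
    if not words:
        return False
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return re.search(pattern, text.lower()) is not None


def rank_terms(text, terms):
    # Assumed combination rule: cosine score plus a bonus of 1 for an exact match.
    idf, vectors = tfidf_vectors(terms)
    query = {w: tf * idf[w] for w, tf in Counter(tokenize(text)).items() if w in idf}
    scored = []
    for term, vec in zip(terms, vectors):
        score = cosine(query, vec) + (1.0 if exact_match(term, text) else 0.0)
        if score > 0:
            scored.append((score, term))
    return [term for score, term in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Calling `rank_terms` on an abstract and a term list returns the terms with a non-zero score, best first; terms sharing no word with the text are dropped.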

Acknowledgements

The study has been partially sponsored by the European Union (IST Grant 2001-33260, see www.wrapin.org) and the Swiss National Foundation (Grant 3200-065228).

References

1. Shatkay, H., Edwards, S., Wilbur, W., Boguski, M.: Genes, themes and microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell Syst Mol Biol 8 (2000) 317–28

2. Yang, Y.: Sampling strategies and learning efficiency in text categorization. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access. (1996)

3. Larkey, L., Croft, W.: Combining classifiers in text categorization. In: SIGIR, ACM Press, New York, US (1996) 289–297

4. Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval 1 (1999) 67–88

5. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification (1998)

6. Joachims, T.: Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning (1999)

7. Schapire, R., Singer, Y.: BoosTexter: A boosting-based system for text categorization. Machine Learning 39 (2000) 135–168

8. Apte, C., Damerau, F., Weiss, S.: Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS) 12 (1994) 233–251

9. Hayes, P., Weinstein, S.: A system for content-based indexing of a database of news stories. Proceedings of the Second Annual Conference on Innovative Applications of Intelligence (1990)

10. Shaw, W., Wood, J., Wood, R., Tibbo, H.: The cystic fibrosis database: Content and research opportunities. LSIR 13 (1991) 347–366

11. Aronson, A., Bodenreider, O., Chang, H., Humphrey, S., Mork, J., Nelson, S., Rindflesch, T., Wilbur, W.: The indexing initiative. A report to the Board of Scientific Counselors of the Lister Hill National Center for Biomedical Communications. Technical report, NLM (1999)

12. Manber, U., Wu, S.: GLIMPSE: A tool to search through entire file systems. In: Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, CA, USA (1994) 23–32

13. Ruch, P.: Using contextual spelling correction to improve retrieval effectiveness in degraded text collections. COLING 2002 (2002)

14. Singhal, A., Buckley, C., Mitra, M.: Pivoted document length normalization. ACM-SIGIR (1996) 21–29

15. Hull, D.: Stemming algorithms: A case study for detailed evaluation. Journal of the American Society of Information Science 47 (1996) 70–84

16. Tan, C., Wang, Y., Lee, C.: The use of bigrams to enhance text categorization. Information Processing and Management 38 (2002) 529–546

17. Aronson, A.: Exploiting a large thesaurus for information retrieval. Proceedings of RIAO (1994) 197–216

18. Kupiec, J., Pedersen, J., Chen, F.: A trainable document summarizer. In: Research and Development in Information Retrieval. (1995) 68–73

19. McKeown, K., Barzilay, R., Evans, D., Hatzivassiloglou, V., Schiffman, B., Teufel, S.: Columbia multi-document summarization: Approach and evaluation. In: Proceedings of the Workshop on Text Summarization, ACM SIGIR Conference 2001, (DARPA/NIST, Document Understanding Conference)

20. Yang, Y.: An evaluation of statistical approaches to medline indexing. AMIA (1996) 358–362

21. Lam, W., Ho, C.: Using a generalized instance set for automatic text categorization. In: SIGIR. (1998) 81–89

22. Lewis, D.: Evaluating and Optimizing Autonomous Text Classification Systems. In: SIGIR, ACM Press (1995) 246–254

23. Lewis, D., Schapire, R., Callan, J., Papka, R.: Training algorithms for linear text classifiers. In: SIGIR. (1996) 298–303

24. Yang, Y., Chute, C.: A linear least squares fit mapping method for information retrieval from natural language texts. COLING (1992) 447–453

25. Yang, Y.: Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In Croft, W., van Rijsbergen, C., eds.: SIGIR, ACM/Springer (1994) 13–22


Knowledge-Based Query Expansion over a Medical Terminology Oriented Ontology on the Web

Linda Fatima Soualmia1,2, Catherine Barry3, and Stefan J. Darmoni1,2

1 CISMeF & L@STICS, Rouen University Hospital, 76031 Rouen, France {lina.soualmia,stefan.darmoni}@chu-rouen.fr

2 PSI FRE CNRS 2645, INSA-Rouen, 76131 Mont-Saint Aignan, France
3 LaRIA, Picardie University, 80000 Amiens, France

[email protected]

Abstract. This paper deals with the problem of information retrieval on the Web and presents the CISMeF project (acronym of Catalogue and Index of French-speaking Medical Sites). Information retrieval in the CISMeF catalogue is done with a terminology that is similar to an ontology of the medical domain and a set of metadata. This allows us to place the project at an overlap between the present Web, which is informal, and the forthcoming Semantic Web. We also describe ongoing work, which consists of applying three knowledge-based methods in order to enhance information retrieval.

1 Introduction

Nowadays the challenge is intelligent information retrieval on the Web. The Semantic Web [1] is an infrastructure that has to be built. It aims at creating a web where information semantics are represented in a form that can be understood by humans as well as machines, in order to enable computers and people to work in cooperation. One of its advantages is to bring sufficient information on the resources, by adding annotations in the form of metadata and by describing their content formally and meaningfully according to an ontology. Ontologies are considered to be powerful tools to remove ambiguity by providing a controlled vocabulary of terms and some specification of their meaning, and are very useful for interoperability and for browsing and searching. Metadata describe Web information resources, enhancing information retrieval.

In this paper we present the CISMeF¹ project [2] (acronym of Catalogue and Index of French-speaking Medical Sites), developed since 1995. The objective of CISMeF is to help health professionals, as well as students and the general public, during their search for electronic health information. The CISMeF catalogue describes and indexes a large number of health information resources (n=11,504). A resource can be a Web site, Web pages, documents, reports and teaching material: any support that may contain health information. The resources are selected according to strict criteria by the librarian team and are indexed according to a methodology. The resources indexed in the CISMeF catalogue are described according to a terminology that is similar to an ontology of the medical domain, and a set of metadata elements. This structure enables us to place the project at an overlap between the present informal Web, mainly composed of HTML pages, and the forthcoming Semantic Web. We also describe in this paper ongoing work that consists of applying three knowledge-based methods (natural language processing, knowledge discovery in databases and reasoning on ontologies) to enhance information retrieval in CISMeF.

¹ http://www.chu-rouen.fr/cismef/

2 Towards a Medical Semantic Web

Metadata is data about data; specifically, in the context of the Web, it is data that describes Web resources. When properly implemented, metadata can enhance information retrieval. In CISMeF several sets of metadata elements are used. The indexed resources are described with the Dublin Core (DC) element set [3] (e.g. author, date). DC is not a complete solution: it cannot be used to describe the quality or location of a resource. To fill these gaps, CISMeF uses its own elements to extend the DC standard. Eight elements are specific to CISMeF [2] (e.g. institution, target public). Two additional fields are used in the resources intended for health professionals: an indication of evidence-based medicine and the method used to determine it. In the teaching resources, eleven elements of the IEEE 1484 LOM (Learning Object Metadata) “Educational” category are added. The metadata format was HTML in 1995. Since December 2002, the format used is RDF, a Semantic Web language, within the ongoing MedCIRCLE project [4], developed to qualify health information quality.

The catalogue resources are indexed according to the CISMeF terminology, which is based on the MeSH [5] concepts and their French translation. We have not used the UMLS [6] because there is no available French translation. Approximately 22,000 keywords (e.g. hepatitis) and 84 qualifiers (e.g. complications) compose the MeSH thesaurus in its 2003 version. These concepts are organized into hierarchies going from the most general at the top to the most specific at the bottom. The qualifiers, also organized into hierarchies, specify which particular aspect of a keyword is addressed. Keywords and qualifiers that are dispersed in several trees but are semantically related are gathered in CISMeF under metaterms (n=66), which concern medical specialties. In addition, a hierarchy of resource types (n=127) describes the nature of the resource (e.g. clinical guidelines). The metaterms and resource types enhance information retrieval in the catalogue: for example, a search for “guidelines in cardiology”, where cardiology is a metaterm and guidelines is a resource type, is not possible using the MeSH thesaurus alone.

The CISMeF terminology has the same structure as a terminological ontology [7]. The vocabulary describes major terms of the medical domain and is well known by the librarians and the health professionals. Each concept has a preferred term to express it in natural language, a set of properties, a natural language definition that differentiates it from the concepts it subsumes and those that subsume it, a set of synonyms and a set of rules and constraints.

3 Enhancing Information Retrieval

Queries submitted to the search engine seldom match the terms of the ontology. We have extracted and analyzed 1,552,776 queries from the HTTP server log, together with their associated number of answers (between 08/15/02 and 02/06/03). 892,591 queries (58.62%) were submitted via the free text search interface [2] and 365,688 of these (40.97%) had no answer.

3.1 Natural Language Processing

We apply here a morphological analysis of the queries. Preliminary work [8] showed that using morphological knowledge enhances information retrieval. The proposed algorithm consists in correcting the user query by eliminating stop words (the, and, when) and replacing each word of the query by a disjunction of all the terms of its morphological family. The morphological family of a term is composed of its inflexions {accident, accidents} and derivations {probability, probabilistic}. If the user query is “interaction between the drugs” it will be replaced by the MeSH term “drug interactions”. There is not yet an available French medical lexicon, such as the Specialist Lexicon of the UMLS, so we have used a terminological resource, Lexique [9], that is not related to the medical domain. Nevertheless, it allowed us to obtain 31,016 derived terms that match exactly 1,292 CISMeF terms.
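The query correction described above can be sketched in a few lines of Python. This is an illustrative reading only, not the CISMeF implementation: the stop-word list (extended here with "between"), the toy morphological families and the order-free matching rule are all assumptions, and the function names are hypothetical.

```python
# Toy resources, assumed for illustration only.
STOP_WORDS = {"the", "and", "when", "between"}
FAMILIES = [
    {"accident", "accidents"},
    {"probability", "probabilistic"},
    {"drug", "drugs"},
    {"interaction", "interactions"},
]


def family_of(word):
    # Morphological family of a word; a word with no known family stands alone.
    for family in FAMILIES:
        if word in family:
            return family
    return {word}


def expand_query(query):
    # Drop stop words, then replace each word by the disjunction of its family.
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    return [sorted(family_of(w)) for w in words]


def matches(term, expanded):
    # A term matches when its words and the disjunctions cover each other,
    # in any order (assumed matching rule).
    words = term.lower().split()
    return (
        len(words) == len(expanded)
        and all(any(w in clause for w in words) for clause in expanded)
        and all(any(w in clause for clause in expanded) for w in words)
    )
```

With these toy resources, the query "interaction between the drugs" expands to two disjunctions, which the term "drug interactions" satisfies.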

Table 1. Structure of the terms used for indexing the resources.

Number of words    Keywords    Qualifiers    Resource Types    Terms
1                  1 437       55            28                1 520
2 to 7             2 516       24            99                2 639
TOTAL              3 953       79            127               4 159

Table 2. Matching the vocabulary.

                    Keywords    Qualifiers    Resource Types    Terms
Nb terms matched    1 207       55            28                1 292
1 word matching     83.99%      100%          100%              85%
Semi-matching       78.57%      79.74%        39.37%            77.59%
Total matching      30.53%      69.62%        22.04%            31.06%

The analysis of the other terms, composed of 2 or more words, showed that 1,935 terms (1,899 keywords; 8 qualifiers; 22 resource types) are semi-matched. A term is semi-matched if at least one of the words that compose it is matched. In addition to morphological knowledge, semantic knowledge is necessary: for example, heart and cardiac are semantically related. A set of synonyms has been created with the collaboration of several patient associations, and we are currently analyzing the user queries to complete it.
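The semi-matching criterion defined above is simple enough to state as code. The sketch below is only an illustration of the definition; the function names and the sample vocabulary are hypothetical.

```python
def is_semi_matched(term, matched_words):
    # True when at least one word of the term is among the matched words.
    return any(word in matched_words for word in term.lower().split())


def semi_match_rate(terms, matched_words):
    # Proportion of the given terms that are semi-matched.
    return sum(is_semi_matched(t, matched_words) for t in terms) / len(terms)
```

Note that, under this definition, a term whose words are all matched is also counted as semi-matched.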

3.2 Knowledge Discovery in Database

We want to discover “new” knowledge from the CISMeF database in order to exploit it in the information retrieval process. We apply a data mining technique called association rules to extract interesting, previously unknown associations from the database. A Boolean association rule AR is expressed as:

$AR : i_1 \wedge i_2 \wedge \dots \wedge i_j \Rightarrow i_{j+1} \wedge \dots \wedge i_n$ (1)

This formula states that if an object has the items $\{i_1, i_2, \dots, i_j\}$ it also tends to have the items $\{i_{j+1}, \dots, i_n\}$. The support of an AR represents its utility: it corresponds to the proportion of objects that contain both the rule antecedent and the rule consequent. The confidence of an AR represents its precision: it corresponds to the proportion of objects that contain the rule consequent among those containing the antecedent. The extraction context is the triplet C = (O, I, R), where O is the set of objects, I the set of all the items and R a binary relation between O and I. The objects are the annotations used to describe the indexed resources. The relation R represents the indexing relation between an object and an item. We first consider two cases for the items: I1={Keywords} and I2={(Keywords/Qualifiers)}. An itemset is frequent in the context C if its support is higher than the minimal threshold initially fixed. We use the A-Close algorithm [10], which deduces bases for association rules. We have tested our algorithm on a few examples. The first step of the algorithm allowed us to find, for example, the following rules: Hepatitis C ⇒ AIDS with support=14 for I1, and AIDS/prevention and control ⇒ condom with support=6 for I2. The second step is to extract all the other association rules and to apply them in the information retrieval process through query expansion.
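The two measures defined above can be made concrete with a short Python sketch. It computes support and confidence as proportions over a toy extraction context; it does not implement A-Close, the sample annotations are invented for illustration, and the function names are hypothetical.

```python
def support(objects, itemset):
    # Proportion of objects that contain every item of the itemset.
    return sum(itemset <= obj for obj in objects) / len(objects)


def rule_support(objects, antecedent, consequent):
    # Support of a rule: objects containing both antecedent and consequent.
    return support(objects, antecedent | consequent)


def confidence(objects, antecedent, consequent):
    # Among objects containing the antecedent, proportion containing the consequent.
    covered = [obj for obj in objects if antecedent <= obj]
    if not covered:
        return 0.0
    return sum(consequent <= obj for obj in covered) / len(covered)


# Hypothetical annotations: each object is the set of keywords indexing a resource.
OBJECTS = [
    {"hepatitis c", "aids"},
    {"hepatitis c", "aids", "liver cirrhosis"},
    {"hepatitis c"},
    {"aids", "condom"},
]
```

On this toy context, the rule {hepatitis c} ⇒ {aids} holds in two of the four objects and in two of the three objects that contain the antecedent.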

3.3 Reasoning on Ontologies

In order to complete the CISMeF ontology with more refined links between concepts, we have decided to exploit the UMLS Semantic Network, which is composed of medical concepts and semantic relations between concepts. These relations take the form Complications (Hepatitis, Liver Cirrhosis), denoting that the concept Hepatitis is related to the concept Liver Cirrhosis by the relation Complications. The relations correspond to the MeSH qualifiers and the concepts correspond to the MeSH keywords. These relations will not be used to annotate the resources; instead, they will be converted into inference rules, thereby enriching the ontology with further links between concepts. In our example, the only information available from the ontology is that the concepts Hepatitis and Liver Cirrhosis are subsumed by the concept Liver Diseases. In order to enable content reasoning over the resources, we will convert a part of the CISMeF ontology into RDF Schema by transforming keywords and resource types into concepts and the qualifiers into roles (or relations). The resources will be transformed into RDF according to the CISMeF RDF Schema. RDFS does not include reasoning mechanisms such as those included in Description Logic systems but, unlike RDFS, the query languages for the other ontology standards are still under development. Writing inference rules with RDFS is possible with TRIPLE [11], which has been developed for knowledge-based intelligent information retrieval. It makes it possible to carry out complex reasoning on RDF resources that represent the concepts’ instances. In our case, for example, if a resource R is an instance of Hepatitis/Complications and the user is searching for resources related to Liver Cirrhosis, the system would infer that the resource R is also an answer to the query. We will use the TRIPLE query engine to carry out higher-level queries over the CISMeF catalogue.
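The inference described in the Hepatitis example can be illustrated with a small Python sketch. It is written in plain Python rather than in TRIPLE or RDF, the data structures and names are hypothetical, and the sample index is invented; it only shows the shape of the rule "a resource indexed with keyword/qualifier answers a query on any concept related to that keyword by that qualifier".

```python
# Semantic relations, as (relation, source concept, target concept).
RELATIONS = {("Complications", "Hepatitis", "Liver Cirrhosis")}

# Hypothetical index: resource -> set of (keyword, qualifier) annotations.
INDEX = {
    "R": {("Hepatitis", "Complications")},
    "S": {("Hepatitis", "Diagnosis")},
    "T": {("Liver Cirrhosis", None)},
}


def answers(concept, index, relations):
    # Resources indexed directly with the concept, plus resources whose
    # keyword/qualifier pair is linked to the concept by a semantic relation.
    found = set()
    for resource, annotations in index.items():
        for keyword, qualifier in annotations:
            if keyword == concept or (qualifier, keyword, concept) in relations:
                found.add(resource)
    return found
```

Here a query on Liver Cirrhosis returns T (direct indexing) and R (inferred through Complications), but not S.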



We have discussed in this paper the problems of information retrieval on the Web. We have presented particular aspects of the CISMeF project. We have also proposed different methods to enhance information retrieval. Natural language processing is used to build a morphological knowledge base. Data mining enables the discovery of association rules between concepts. Finally, reasoning on ontologies will offer a higher level for the ontology (consistency and coherence checking, exploitation of the Semantic Network of the UMLS) and for information retrieval. To our knowledge, no existing work has combined these methods in order to enhance information retrieval. The next step of this study is to evaluate the contribution of each method separately and jointly: we will apply automatic and interactive query expansion to the users’ queries. The full-scale evaluation will allow us to derive a process that, according to the type of query, applies the methods in a particular order.

References

1. Berners-Lee, T., Hendler, J. and Lassila, O. (2001). The Semantic Web. Scientific American, p.35-43.

2. Darmoni, SJ., Thirion, B., Leroy, JP. et al. (2001). A Search Tool based on ‘Encapsulated’ MeSH Thesaurus to Retrieve Quality Health Resources on the Internet. Medical Informatics & the Internet in Medicine, 26 (3) :165-178.

3. Baker, T. (2000) A Grammar of Dublin Core. Digital-Library Magazine, vol 6 n°10.

4. Mayer, MA., Darmoni, SJ., Fiene, M. et al. (2003) MedCIRCLE - Modeling a Collaboration for Internet Rating, Certification, Labeling and Evaluation of Health Information on the Semantic World-Wide-Web. Medical Informatics Europe, p.667-672.

5. Nelson, SJ., Johnson, WD., and Humphreys, BL. (2001) Relationships in Medical Subject Headings. Bean and Green (eds). Kluwer Academic Publishers, 171-184.

6. Lindberg, DAB, Humphreys, BL and McCray, AT. (1993) The Unified Medical Language System. Methods of Information in Medicine, 32 (4):281-291.

7. Sowa, JF. (2000) Ontology, Metadata and Semiotics. Lecture Notes in AI #1867, Springer Verlag, p.55-81.

8. Zweigenbaum, P., Darmoni, SJ. and Grabar, N. (2001) The Contribution of Morphological Knowledge to French MeSH Mapping for Information Retrieval. JAMIA 8:796-800.

9. New, B., Pallier, C., Ferrand, L. and Matos R. (2001) Une Base de Données Lexicales du Français Contemporain sur Internet: LEXIQUE, L'Année Psychologique, 447-462.

10. Pasquier, N., Bastide, Y., Taouil, R. and Lakhal, L. (1999) Efficient Mining of Association Rules Using Closed Itemset Lattices. Information Systems, 24(1):25-46.

11. Sintek, M. and Decker, S. (2001) TRIPLE- An RDF Query, Inference and Transformation Language. Proceedings of Deductive Databases and Knowledge Management Workshop.

Linking Rules to Terminologies and Applications in Medical Planning

Sanjay Modgil

Biomedical Informatics
Eastman Institute for Oral Health Care Sciences
University College London
256, Gray’s Inn Road, London WC1X 8LD

[email protected]

Abstract. In this paper we describe the compilation of conjunctive bodies of a restricted class of Horn rules into Description Logic updates on terminologies. We motivate and illustrate application of this work in a medical planning context, by showing how updates to a medical terminology can be computed from the bodies of partially evaluated safety rules for reasoning about a designed plan. In this way, a new action can be included in the designed plan, while the terminology can maintain incomplete information about the action.

1 Background and Introduction

Medical Artificial Intelligence applications have long made use of logic rule based languages. More recently, Description Logics have been especially designed to encode rich hierarchical medical knowledge in the form of terminologies (e.g., [7]). This paper contributes to research (e.g., [3]) advocating the benefits of hybrid rule based/description logic reasoning, by describing a novel treatment of the interaction between Horn rules and terminologies. We briefly describe compilation of the conjunctive bodies of a restricted class of Horn rules into description logic expressions for updating terminologies. These updates exploit the ability of description logics to maintain incomplete information about individuals. We demonstrate application and benefits of this work in a medical planning context. The requirement to maintain and reason with incompletely described actions arises because of the inevitable incompleteness of medical knowledge bases, and because of the benefits of modelling planned actions in terms of their intentions or goals [8]. We illustrate with previous work on decision support tools for clinical trial design [5]. These include a medical plan authoring tool linked to a “safety advisor” [4]. The latter contains logic program rules for reasoning about plan safety. A Prolog meta-interpreter enables the user to (partially) unfold / evaluate a natural language representation of a rule so as to suggest updates to a plan being designed using the plan authoring tool. For example, consider the query ? add_plan(NewAct, Action, Effect) on the rule and domain knowledge:

add_plan(NewAct, Action, Effect) ← plan(Action), causes(Action, Cause), effect(Cause, Effect1), hazard(Effect1), action(NewAct), effect(NewAct, Effect2), ameliorate(Effect2, Cause, Effect1)    (1)
ameliorate(AE, Cause, Effect) ← counters(AE, Effect)    (2)
ameliorate(AE, Cause, Effect) ← prevent(AE, Cause)    (3)
hazard(dehydration)    (4)
effect(vomiting, dehydration)    (5)
causes(cisplatin, vomiting)    (6)

Presented with a natural language translation of (1), the user selects goal atoms from the body of (1), to be resolved against user selected clauses in the knowledge base. Each successive selection of a goal atom and matching clause leads to unfolding of the rule, i.e., replacement of the goal atom by the body of the clause. Resolving on (2), (4)-(6), and assuming a plan action involving administration of the drug cisplatin, rule (1) can be partially unfolded to:

add_plan(NewAct, cisplatin, dehydration) ← action(NewAct), effect(NewAct, Effect2), counters(Effect2, dehydration)    (7)
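The unfolding step described above can be sketched in Python as follows. This is a minimal illustration, not the Prolog meta-interpreter of the safety advisor: atoms are flat tuples, variables are strings starting with an upper-case letter, and all function names are hypothetical. One further assumption is made explicit in the code: the third head argument of rule (1) is written Effect1, so that the unfolded head carries dehydration as in (7).

```python
def is_var(term):
    return term[:1].isupper()


def walk(term, subst):
    # Follow variable bindings until a constant or an unbound variable is reached.
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(goal, head):
    # Unify two flat atoms; clause variables are bound to goal terms when possible.
    if goal[0] != head[0] or len(goal) != len(head):
        return None
    subst = {}
    for x, y in zip(goal[1:], head[1:]):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(y):
            subst[y] = x
        elif is_var(x):
            subst[x] = y
        else:
            return None
    return subst


def apply(atom, subst):
    return (atom[0],) + tuple(walk(t, subst) for t in atom[1:])


def unfold(rule, index, clause, tag="_1"):
    # Replace the goal atom at `index` by the body of `clause` (renamed apart).
    head, body = rule
    rename = lambda a: (a[0],) + tuple(t + tag if is_var(t) else t for t in a[1:])
    c_head, c_body = rename(clause[0]), [rename(a) for a in clause[1]]
    subst = unify(body[index], c_head)
    if subst is None:
        return None
    new_body = body[:index] + c_body + body[index + 1:]
    return apply(head, subst), [apply(a, subst) for a in new_body]


# Rule (1); head written with Effect1 (assumption, see lead-in).
RULE_1 = (("add_plan", "NewAct", "Action", "Effect1"),
          [("plan", "Action"), ("causes", "Action", "Cause"),
           ("effect", "Cause", "Effect1"), ("hazard", "Effect1"),
           ("action", "NewAct"), ("effect", "NewAct", "Effect2"),
           ("ameliorate", "Effect2", "Cause", "Effect1")])
CLAUSES = [(("plan", "cisplatin"), []),                  # assumed plan action
           (("causes", "cisplatin", "vomiting"), []),    # (6)
           (("effect", "vomiting", "dehydration"), []),  # (5)
           (("hazard", "dehydration"), []),              # (4)
           (("ameliorate", "AE", "Cause", "Effect"),
            [("counters", "AE", "Effect")])]             # (2)


def derive_rule_7():
    rule = RULE_1
    for clause in CLAUSES:
        # Select the first body atom with the same predicate as the clause head.
        index = next(i for i, a in enumerate(rule[1]) if a[0] == clause[0][0])
        rule = unfold(rule, index, clause)
    return rule
```

Under these assumptions, `derive_rule_7()` returns a head and body corresponding to rule (7); stopping the loop earlier yields the intermediate, less unfolded rules.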

The user may now elect to cease unfolding so that a suggested new action is described in terms of its intentions (or the user may be forced to cease unfolding due to lack of knowledge regarding actions that counter dehydration): the body of (7) is used to generate the text “execute an action that has an effect which counters dehydration”, which is added to (updates) the textual protocol describing the plan. This incomplete characterisation of the planned action allows for flexibility in its detailed specification at plan execution time, at which point the specifics of a suggested action can be checked for compliance with the intentions. However, (7) does not constitute a corresponding declarative representation of an action with the desired properties that can be reasoned with (e.g., ordered temporally) as part of the symbolic representation of the plan being designed.

In the following section we show how the body of (7) can be used to compute such a symbolic representation in a hybrid knowledge base consisting of the above safety rules linked to a Description Logic encoding of medical domain knowledge.

2 Defining the Knowledge Base and Computation of Rule Based Updates

A knowledge base is a tuple (R,T,A), where R is a set of non-recursive Horn rules $H(Y) \leftarrow B_1(X_1), \dots, B_n(X_n)$, where $Y, X_1, \dots, X_n$ are tuples of variables or constants, and any variable in $Y$ must appear in some $X_i$. A is a set of ground facts, and T an acyclic terminology in a language that is a superset of the Description Logic ERIB. A Description Logic is a subset of first order logic, based on unary relations (concepts/classes) interpreted as sets of objects, and roles interpreted as binary relations on objects. A Description Logic language is composed of symbols taken from the set of Concept Names (denoted here by the letters A, B), Role Names (P, Q) and Individual Names (a, b, c). In addition, a language includes a number of constructors (that vary from logic to logic). These permit the formation of concept expressions, denoted by C, D, and role expressions, denoted by R. In this work, a Description Logic terminology (TBox) contains concept definitions, which are statements of the form $A \doteq C$, and concept inclusions of the form $A \sqsubseteq C$. The semantics of the TBox is given via interpretations. An interpretation $I$ contains a non-empty domain $D^I$ and a function $\cdot^I$ that maps every concept $A$ to a subset $A^I$ of $D^I$, every role $P$ to a subset $P^I$ of $D^I \times D^I$, and every individual $a$ to an element $a^I$ of $D^I$. Below we list the constructors used in ERIB to define complex concepts inductively. Equations describing the extensions of the constructors are also given.

$C, D \rightarrow A \mid$    (primitive concept)    $A^I \subseteq D^I$
$\top \mid \bot \mid$    (top, bottom)    $\top^I = D^I$, $\bot^I = \emptyset$
$C \wedge D \mid$    (concept conjunction)    $(C \wedge D)^I = C^I \cap D^I$
$\exists R.C \mid$    (existential quantification)    $(\exists R.C)^I = \{a \in D^I \mid \exists b : R^I(a, b) \wedge b \in C^I\}$
$R : b$    (fills)    $(R : b)^I = \{a \in D^I \mid (a, b^I) \in R^I\}$
$R \rightarrow P \mid$    (primitive role)    $P^I \subseteq D^I \times D^I$
$R_1 \wedge \dots \wedge R_m \mid$    (role conjunction)    $(R_1 \wedge \dots \wedge R_m)^I = R_1^I \cap \dots \cap R_m^I$
$R^{-1}$    (inverse role)    $(R^{-1})^I = \{(a, b) \in D^I \times D^I \mid R^I(b, a)\}$

In [6] we formalise compilation of the body B of a computable Horn rule r = H ← B into ERIB concept expressions. The binding graph G(B) is defined to be a set of directed labelled edges $(\alpha, \beta, \{r_1, \dots, r_n\})$, such that $\alpha$ and/or $\beta$ is a variable in B, and for $i = 1 \dots n$, $r_i(\alpha, \beta)$ is a predicate in B. r is computable if all non-ground predicates in B are unary or binary, and all cycles in G(B) include a constant. The binding graph for the computable rule:

h(X) ← i(X), m(Z, Y), p(X, Y), q(X, a), r(a, Y), s(a, W), t(V, W)    (8)

is shown in fig. 1a (note constant a in the cycle). In [6] we formalise transformation of a binding graph to enable compilation of ERIB expressions. This involves reversal of edges $(\alpha, \beta, \{r_1, \dots, r_n\})$ to $(\beta, \alpha, \{r_1^{-1}, \dots, r_n^{-1}\})$, and partitioning of graphs into subgraphs. The resultant subgraphs have the following properties: 1) no variable is the successor of more than one edge; 2) no constant is the predecessor of an edge; 3) no variable is common to any two subgraphs; 4) no subgraph has more than one root node. The italicised ERIB expressions in fig. 1c are compiled from the transformed subgraphs in fig. 1b. The updates to (R,T,A), computed from r = H ← B, are then defined as follows: 1) ground predicates $B_i(a)$ ($a$ a tuple of constants) in B are included in A; 2) for each unary non-ground predicate $B_i(X)$ in B such that X does not appear in a binary predicate: $B_i(a_i)$ is included in A, where $a_i$ is a fresh constant; 3) for each ERIB expression $d_i$ compiled from G(B): the concept definition $c_i \doteq d_i$ is included in T, where $c_i$ is a fresh concept name, and the assertion $c_i(a_i)$ is included in A, where $a_i$ is a fresh constant. The updates computed for rule (8) are shown in fig. 1c. Note that in [6] we define a skolemised first order translation of ERIB concept definitions $c \doteq d$ to $c \leftrightarrow skol(fol(d))$, where the first order translation $fol(d)$ is straightforwardly given by the semantics of the ERIB operators listed above, and skolemisation replaces every existentially quantified variable by a fresh skolem constant. In [6] we then prove “correctness” of the update procedure by proving that, given updates T′ and A′ defined on the basis of r = H ← B, a ground instance H(a) of the head of r is entailed by $r \cup A' \cup \{c \leftrightarrow skol(fol(d)) \mid c \doteq d \in T'\}$.
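To make the notions of binding graph and computable rule more tangible, the following Python sketch builds the labelled edges for a rule body and tests the two computability conditions. It is an illustrative reading and not the formalisation of [6]: cycles are interpreted in the underlying undirected graph (as suggested by the cycle through the constant a in rule (8)), edges between the same pair of variables in either direction are merged, and all names are hypothetical.

```python
from collections import defaultdict


def is_var(term):
    # Convention assumed here: variables start with an upper-case letter.
    return term[:1].isupper()


def binding_graph(body):
    # Map (alpha, beta) -> set of predicate names, for binary atoms
    # in which alpha and/or beta is a variable.
    edges = defaultdict(set)
    for pred, *args in body:
        if len(args) == 2 and any(is_var(a) for a in args):
            edges[tuple(args)].add(pred)
    return dict(edges)


def is_computable(body):
    # Condition 1: every non-ground predicate is unary or binary.
    for pred, *args in body:
        if any(is_var(a) for a in args) and len(args) not in (1, 2):
            return False
    # Condition 2: every cycle includes a constant, i.e. the edges that link
    # two variables form a forest when read as undirected (assumed reading).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = {frozenset(e) for e in binding_graph(body) if all(is_var(t) for t in e)}
    for pair in pairs:
        if len(pair) == 1:  # self-loop such as p(X, X)
            return False
        a, b = (find(v) for v in pair)
        if a == b:          # this edge would close a variable-only cycle
            return False
        parent[a] = b
    return True


# Body of rule (8).
RULE_8 = [("i", "X"), ("m", "Z", "Y"), ("p", "X", "Y"), ("q", "X", "a"),
          ("r", "a", "Y"), ("s", "a", "W"), ("t", "V", "W")]
```

For rule (8), the only cycle (X, a, Y) passes through the constant a, so the body is accepted; a body such as p(X, Y), q(Y, Z), r(Z, X) is rejected because its cycle contains variables only.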


Fig. 1. a) Binding graph for body of rule (8); b) the binding graph transformed into two subgraphs; c) updates computed on the basis of the body of rule (8)

Referring now to the example in section 1, assume a knowledge base (R,T,A), where R denotes the set of safety rules, T a terminology of medical domain knowledge about actions, effects, hazards, etc., and A a set of assertions (facts) about the domain and about a specific plan being designed. On the basis of rule (7), the computed updates to T and A are counter_dehy_act ≐ action ∧ ∃effect.counter : dehydration and counter_dehy_act(a1), i.e., the plan is updated with a “place-holder” action a1 which belongs to a concept describing those individuals that are actions that have an (unspecified) effect which counters dehydration. A basic Description Logic reasoning service is classification [1], which, for some concept C, determines those concepts subsumed by and subsuming C. For example, counter_dehy_act is a subclass of (subsumed by) action. The intentions of the “place-holder” action a1 are encoded in the new concept definition, so that at plan execution time, a user suggested specific action can be checked for compliance with the encoded intentions. This involves checking whether the specific action suggested is a member of (instance of) the concept counter_dehy_act (instance checking [1]). Furthermore, at plan design time, the action a1 can be reasoned with as part of the symbolic plan (e.g., ordered temporally) in the same way as other concrete actions (e.g., cisplatin). Also, properties of a1 can be reasoned about, via hybrid reasoning of the type described in [3], in which a sound and complete decidable reasoning procedure determines whether (R,T,A) ⊨ q(a), where q is a concept or role, or an ordinary predicate that appears in R but not in T. In particular, hybrid reasoning might determine that add_plan(a1, cisplatin, dehydration) follows from the updated knowledge base¹, indicating that an action that counters dehydration is already included in the plan. Indeed, to demonstrate proof of concept, we have simulated hybrid reasoning of the above type, by translating T to a set of definite program rules T∗, and extending the natural language interface and interactive unfolding facility to T∗ ∪ R ∪ A, so that (7) can be fully unfolded on the updated knowledge base.

¹ Note that in [3], T is encoded in a Description Logic that does not include the operators fills and inverse role. However the authors of [3] have indicated (in private communications) that their reasoning procedure can be extended (straightforwardly) for fills, and (with difficulty) for inverse roles.

3 Conclusions

We have shown that Description Logic expressions can be used to model medical actions in terms of their intentions. Such actions are “inferred” as updates, as a result of rule based reasoning about the safety of a plan being designed. One can continue to reason fully with such incompletely specified actions during plan design. Later detailed specification of these actions can be checked for compliance with the intentions. Other works (e.g., [8]) model intentions, although none provide decision support for deriving intentions. Our work also contributes to existing work on compilation of conjunctions into concept descriptions [2], by extending the scope of conjunctions considered to those that contain constants, and whose binding graphs need not define trees. On a more general note, recent works formalising medical terminologies (e.g., [7]), and the traditional use of rule based reasoning in medical applications, suggest the importance of research into hybrid Rule based/Description Logic medical systems. In this paper we have shown that the ability of Description Logics to maintain incomplete information about individuals can be exploited in such systems. An immediate future research goal is further development of the proof of concept implementation described in the previous section. In particular, we aim to link the safety rules to a large scale medical terminology [7].

References

1. F. M. Donini, M. Lenzerini, D. Nardi and A. Schaerf, Reasoning in Description Logics. In: G. Brewka, ed., Principles of Knowledge Representation, CSLI Publications, Stanford, California, 191-236, 1996.

2. F. Goasdoué and M. Rousset, Compilation and Approximation of Conjunctive Queries by Concept Descriptions. In: Proceedings of the 15th European Conference on Artificial Intelligence, (ECAI 2002).

3. A. Y. Levy and M. Rousset, Combining Horn Rules and Description Logics in CARIN. In: Artificial Intelligence 104 (1-2), 165-209, 1998.

4. S. Modgil and P. Hammond, Generating Symbolic and Natural Language Partial Solutions for Inclusion in Medical Plans. In: Proc. 8th Conf. on Artificial Intelligence in Medicine in Europe, (LNAI 2101, Springer-Verlag), 239-248, 2001.

5. S. Modgil and P. Hammond, Decision Support Tools for Clinical Trial Design. In: Artificial Intelligence in Medicine, 27(2), 181-200, 2003.

6. S. Modgil, Rule Based Computation of Updates to Terminologies. Submitted for publication in: 2003 International Workshop on Description Logics, Rome, Italy, September 5-7, 2003 (http://www.eastman.ucl.ac.uk/ dmi/Papers/index.html)

7. A. Rector et al., The GRAIL concept modelling language for representing medical terminology. In: Artificial Intelligence in Medicine, (9), 139-171, 1997.

8. Y. Shahar, S. Miksch, P. Johnson, The Asgaard project: a task-specific framework for the application and critiquing of time-oriented clinical guidelines. In: Artificial Intelligence in Medicine, 14(1-2), 29-51, 1998.

Classification of Ovarian Tumors Using Bayesian Least Squares Support Vector Machines

Chuan Lu1, Tony Van-Gestel1, Johan A.K. Suykens1, Sabine Van-Huffel1, Dirk Timmerman2, and Ignace Vergote2

1 Dept. of Electrical Engineering, Katholieke Universiteit Leuven, 3001 Leuven, Belgium
{chuan.lu,tony.vangestel,Johan.Suykens,Sabine.VanHuffel}@esat.kuleuven.ac.be

2 Dept. of Obstetrics and Gynecology, University Hospitals Leuven, 3000 Leuven, Belgium
{dirk.timmerman,ignace.vergote}@uz.kuleuven.ac.be

Abstract. The aim of this study is to develop Bayesian Least Squares Support Vector Machine (LS-SVM) classifiers for preoperative discrimination between benign and malignant ovarian tumors. We describe how to perform (hyper)parameter estimation and input variable selection for LS-SVMs within the evidence framework. The issue of computing the posterior class probability for risk minimization decision making is addressed. The performance of the LS-SVM models with linear and RBF kernels has been evaluated and compared with Bayesian multi-layer perceptrons (MLPs) and linear discriminant analysis.

1 Introduction

Ovarian masses are a very common problem in gynecology. The difficulties in early detection of ovarian malignancy result in the highest mortality rate among gynecologic cancers. An accurate discrimination between benign and malignant tumors before operation is critical to obtain the most effective treatment and best advice, and will influence the outcome for the patient and the medical costs. Several attempts have been made to automate the classification process, such as the risk of malignancy index (RMI), logistic regression, neural networks, and Bayesian belief networks [1][2][3]. In this paper, we focus on the development of Bayesian Least Squares Support Vector Machines (LS-SVMs) to preoperatively predict the malignancy of ovarian tumors.

Support Vector Machines (SVMs) [5] have become a state-of-the-art technique for pattern recognition. The basic idea of the nonlinear SVM classifier and related kernel techniques is to map an n-dimensional input vector $x \in \mathbb{R}^n$ into a high $n_f$-dimensional feature space by the mapping $\varphi(\cdot) : \mathbb{R}^n \to \mathbb{R}^{n_f}$, $x \mapsto \varphi(x)$. A linear classifier is then constructed in this feature space. These kernel-based algorithms have attractive features such as good generalization performance, the existence of a unique solution, and a strong theoretical background, i.e., statistical learning theory [5], supporting their good empirical results. Here a least squares version of SVM [6][7] is considered, in which the training is expressed in terms of solving a set of linear equations in the dual space instead of quadratic programming as for the standard SVM case. Also remarkable is that LS-SVM is closely related to Gaussian processes and kernel Fisher discriminant analysis [10].

The need to apply Bayesian methods to LS-SVMs for this task is twofold. One is to tune the regularization and possible kernel parameters automatically to their near-optimal values; the other is to judge the uncertainty in predictions, which is critical in a medical environment. A unified theoretical treatment of learning in feedforward neural networks has been provided by MacKay’s Bayesian evidence framework [9][8]. Recently this Bayesian framework was also applied to LS-SVMs, and a numerical implementation was derived. This approach has been applied to several benchmark problems, achieving test set results similar to those of Gaussian processes and SVMs [10].

After a brief review of the LS-SVM classifier and the Bayesian evidence framework, we will show the scheme for input variable selection and the way to compute the posterior class probabilities for minimum risk decision making. The test set performance of the models is assessed via Receiver Operating Characteristic (ROC) curve analysis.

2 Data

The data set includes information on 525 patients who were referred to a single ultrasonographer at University Hospitals Leuven, Belgium, between 1994 and 1999. These patients had a persistent extrauterine pelvic mass, which was subsequently surgically removed. The study is designed mainly for preoperative differentiation between benign and malignant adnexal masses [1]. Patients without preoperative results of serum CA 125 levels are excluded from this analysis. The gold standard for discrimination of the tumors was the result of histological examination. Among the available 425 cases, 291 patients (68.5%) had benign tumors, whereas 134 (31.5%) had malignant tumors.

The measurements and observations were acquired before operation, including: age and menopausal status of the patients, serum CA 125 levels from the blood test, the ultrasonographic morphologic findings about the mass, color Doppler imaging and blood flow indexing, etc. [1][4]. The data set contains 27 variables after preprocessing (e.g. color score was transformed into three dummy variables, CA 125 serum level was rescaled by taking its logarithm). Table 1 lists the most important variables that were considered.

Fig. 1 shows the biplot generated by the first two principal components of the data set, visualizing the correlation between the variables, and the relations between the variables and classes. In particular, a small angle between two variables such as (Age, Meno) points out that those variables are highly correlated; the observations of malignant tumors (indicated by ‘+’) have relatively high values for variables Sol, Age, Meno, Asc, L_CA125, Colsc4, Pap, Irreg, etc., but relatively low values for the variables Colsc2, Smooth, Un, Mul, etc. The biplot reveals that many variables are correlated, implying the need for variable selection. On the other hand, quite a lot of overlap between the two classes can also be observed, suggesting that classical linear techniques might not be enough to capture the underlying structure of the data, and a nonlinear classifier might give better results than a linear classifier.

Table 1. Descriptive statistics of ovarian tumor data

Group          Variable (Symbol)             Benign       Malignant
Demographic    Age (Age)                     45.6±15.2    56.9±14.6
               Postmenopausal (Meno)         31.0 %       66.0 %
Serum marker   CA 125 (log) (L_CA125)        3.0±1.2      5.2±1.5
CDI            Normal blood flow (Colsc3)    15.8 %       35.8 %
               Strong blood flow (Colsc4)    4.5 %        20.3 %
Morphologic    Abdominal fluid (Asc)         32.7 %       67.3 %
               Bilateral mass (Bilat)        13.3 %       39.1 %
               Solid tumor (Sol)             8.3 %        37.6 %
               Irregular wall (Irreg)        33.8 %       73.2 %
               Papillations (Pap)            13.0 %       53.2 %
               Acoustic shadows (Shadows)    12.2 %       5.7 %

Note: for continuous variables, mean±SD in case of a benign and malignant tumor respectively are reported; for binary variables, the occurrences (%) of the corresponding features are reported.

Fig. 1. Biplot of ovarian tumor data (‘×’: benign, ‘+’: malignant), projected on the first two principal components

3 Methods

3.1 Least Squares SVMs for Classification

The LS-SVM classifier $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$ is inferred from the data $D = \{(x_i, y_i)\}_{i=1}^{N}$ with binary targets $y_i = \pm 1$ (+1: malignant, −1: benign) by minimizing the following cost function:

$$\min_{w,b,e} J(w, e) = \mu E_W + \zeta E_D = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^{N} e_i^2 \qquad (1)$$

subject to the equality constraints $y_i[w^T \varphi(x_i) + b] = 1 - e_i$, $i = 1, \ldots, N$. The regularization and sum of squares error terms are defined as $E_W = \frac{1}{2} w^T w$ and $E_D = \frac{1}{2} \sum_{i=1}^{N} e_i^2$ respectively. The tradeoff between the training error and regularization is determined by the ratio $\gamma = \zeta/\mu$. This optimization problem can be transformed and solved through a linear system in the dual space [6][7]:

$$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \qquad (2)$$

with $Y = [y_1 \cdots y_N]^T$, $\alpha = [\alpha_1 \cdots \alpha_N]^T$, $e = [e_1 \cdots e_N]^T$, $1_v = [1 \cdots 1]^T$, and $I_N$ the $N \times N$ identity matrix. Mercer’s theorem is applied to the matrix $\Omega$ with $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$, where $K(\cdot, \cdot)$ is a chosen positive definite kernel that satisfies the Mercer condition. The most common kernels include a linear kernel $K(x_i, x_j) = x_i^T x_j$ and an RBF kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2/\sigma^2)$. The LS-SVM classifier is then constructed in the dual space as:

$$y(x) = \mathrm{sign}\left[\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\right]. \qquad (3)$$
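The dual formulation can be made concrete with a short NumPy sketch that assembles and solves the linear system (2) and then evaluates the classifier (3) with an RBF kernel. The function names, the toy data, and the values chosen for γ and σ² are assumptions for illustration only; this is not the authors' implementation and it omits the Bayesian inference of the hyperparameters.

```python
import numpy as np


def rbf_kernel(A, B, sigma2):
    """K(x, z) = exp(-||x - z||^2 / sigma2), the RBF form given in the text."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / sigma2)


def lssvm_train(X, y, gamma, sigma2):
    """Solve the dual linear system (2); returns the bias b and the vector alpha."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma2)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y                                  # first row:    [0, Y^T]
    A[1:, 0] = y                                  # first column: [0; Y]
    A[1:, 1:] = omega + np.eye(n) / gamma         # Omega + I_N / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))     # right-hand side [0; 1_v]
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]


def lssvm_predict(X_new, X, y, b, alpha, sigma2):
    """Classifier (3): sign(sum_i alpha_i y_i K(x, x_i) + b)."""
    return np.sign(rbf_kernel(X_new, X, sigma2) @ (alpha * y) + b)


# Hypothetical toy data: two well-separated clusters in the plane.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
b_toy, alpha_toy = lssvm_train(X_toy, y_toy, gamma=10.0, sigma2=4.0)
pred_toy = lssvm_predict(np.array([[0.5, 0.5], [3.5, 3.5]]),
                         X_toy, y_toy, b_toy, alpha_toy, sigma2=4.0)
```

Training therefore reduces to one call to a dense linear solver, in contrast to the quadratic program of the standard SVM; the first row of the system enforces the constraint that the entries of α weighted by the labels sum to zero.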

3.2 Bayesian Inference

In [10] the application of the evidence framework to LS-SVMs originated from the feature space formulation, whereas analytic expressions are obtained in the dual space on the three levels of Bayesian inference. For the computational details, the interested reader is referred to [10] and [7].

The Bayesian evidence approach first finds the maximum a posteriori estimates of the model parameters $w_{MP}$ and $b_{MP}$, using conventional LS-SVM training methods, i.e. by solving the linear set of equations in (2) in the dual space in order to optimize (1). Then the distribution over the parameters is approximated using information available at this maximum. The hyperparameters μ and ζ are determined by maximizing the posterior probability of the parameters, which can be estimated using the Gaussian probability at $w_{MP}$, $b_{MP}$.

Different models can be compared by examining their posterior $p(H_j|D)$. Assuming a uniform prior $p(H_j)$ over all models, the models can be ranked by the model evidence $p(D|H_j)$, which can again be evaluated using a Gaussian approximation. The kernel parameters, e.g. the bandwidth parameter σ of the RBF kernel, are chosen from a set of candidates by maximizing the model evidence.

3.3 Model Comparison and Input Variable Selection

Statistical interpretation is also available for the comparison between two models in the Bayesian framework. The Bayes factor $B_{10}$ for model $H_1$ against $H_0$ from data D is defined as $B_{10} = p(D|H_1)/p(D|H_0)$. Under the assumption of equal model priors, the Bayes factor can be seen as a measure of the evidence given by the data in favor of a model compared to a competing one. When the Bayes factor is greater than 1, the data favor $H_1$ over $H_0$; otherwise, the reverse is true. The rules of thumb for interpreting $2 \log B_{10}$ include: the evidence for $H_1$ is very weak if $0 \le 2 \log B_{10} \le 2.2$, and the evidence for $H_1$ is decisive if $2 \log B_{10} > 10$, etc., as also shown in Fig. 2 [12].
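As a small illustration of how such a scale might be applied, the sketch below computes 2 log B10 from two log evidences and maps it to a verbal category. The use of natural logarithms, the function names, and the intermediate thresholds (taken here from the labels of Fig. 2: below 2 very weak, 2 to 5 positive, 5 to 10 strong, above 10 decisive) are assumptions; the text itself gives 2.2 as the upper bound of the very weak range.

```python
def two_log_bayes_factor(log_evidence_1, log_evidence_0):
    """2 log B10 from the (natural) log evidences of H1 and H0."""
    return 2.0 * (log_evidence_1 - log_evidence_0)


def interpret(two_log_b10, very_weak=2.0, positive=5.0, strong=10.0):
    """Verbal category for the evidence in favour of H1 (thresholds assumed)."""
    if two_log_b10 < 0.0:
        return "favours H0"
    if two_log_b10 < very_weak:
        return "very weak"
    if two_log_b10 < positive:
        return "positive"
    if two_log_b10 <= strong:
        return "strong"
    return "decisive"
```

For instance, hypothetical log evidences of −100 for H1 and −106 for H0 give 2 log B10 = 12, which falls in the decisive range.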

Therefore, given a certain type of kernel for the model, we propose to select the input variables according to the model evidence $p(D|H_j)$. The heuristic search strategy for variable selection can be, e.g., backward elimination, forward selection, or stepwise selection. Here we concentrate on the forward selection (greedy search) method.

The procedure starts from zero variables, and each time chooses the variable which gives the greatest increase in the current model evidence. The selection is stopped when the addition of any remaining variable can no longer increase the model evidence.
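The greedy procedure can be sketched in a few lines of Python. The callable `evidence` is hypothetical: it stands in for whatever routine returns the (log) model evidence of an LS-SVM built on a given subset of variables. The toy scoring function and its numbers are made up purely to exercise the loop, and the sketch always accepts the first variable because no evidence is defined for the empty model.

```python
def forward_select(candidates, evidence):
    """Greedy forward selection driven by a model-evidence score.

    `evidence` is a hypothetical callable mapping a tuple of variable names
    to the log evidence of the model built on those variables.
    """
    selected = []
    remaining = list(candidates)
    best = float("-inf")
    while remaining:
        scored = [(evidence(tuple(selected + [v])), v) for v in remaining]
        top_score, top_var = max(scored, key=lambda item: item[0])
        if top_score <= best:
            break                     # no remaining variable raises the evidence
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected, best


# Made-up scoring function: additive gains minus a per-variable penalty.
TOY_GAINS = {"a": 5.0, "b": 3.0, "c": 2.0, "noise": 0.2}


def toy_evidence(subset):
    return sum(TOY_GAINS[v] for v in subset) - 0.5 * len(subset)


toy_selected, toy_best = forward_select(TOY_GAINS, toy_evidence)
```

With the toy scores, the variables are added in order of decreasing gain and the search stops before `noise`, whose gain does not outweigh the penalty.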

3.4 Computing Posterior Class Probability

For a given test case, the conditional class probabilities $p(x|y = \pm 1, D, \mu, \zeta, H)$ can be computed using the two normal probability densities of $w^T \varphi(x)$ for the two classes at the most probable value $w_{MP}^T \varphi(x)$ [10][7]. The mean of each distribution is defined as the class center of the output (in the training set), and the variance comes from both the target noise and the uncertainty in the parameter w. By applying Bayes’ rule the posterior class probabilities of the LS-SVM classifier can be obtained:

$$p(y|x, D, \mu, \zeta, H) = \frac{p(y)\, p(x|y, D, \mu, \zeta, H)}{\sum_{y'=\pm 1} p(y')\, p(x|y', D, \mu, \zeta, H)}, \qquad (4)$$

where p(y) corresponds to the prior class probability. The posterior probability could also be used to make minimum risk decisions in case of different error costs. Let $c_{-}^{+}$ and $c_{+}^{-}$ denote the cost of misclassifying a case from class ‘−’ and ‘+’ respectively. One obtains the minimum risk decision rule by formally replacing the prior p(y) in (4) with the adjusted class prior, e.g.

$$P'(y=1) = \frac{P(y=1)\, c_{+}^{-}}{P(y=1)\, c_{+}^{-} + P(y=-1)\, c_{-}^{+}}.$$
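A minimal Python sketch of Bayes' rule (4) and of the cost-adjusted prior follows. It assumes that the class means and variances of the latent output are already available and passes them in as plain numbers, whereas in the Bayesian LS-SVM they follow from the training outputs and the parameter uncertainty; the function names and the numerical values are illustrative assumptions.

```python
import math


def normal_pdf(z, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def adjusted_prior(p_pos, cost_miss_pos, cost_miss_neg):
    """Cost-adjusted prior P'(y=+1) used in the minimum risk decision rule."""
    weighted_pos = p_pos * cost_miss_pos
    return weighted_pos / (weighted_pos + (1.0 - p_pos) * cost_miss_neg)


def posterior_pos(z, prior_pos, mean_pos, var_pos, mean_neg, var_neg):
    """Bayes' rule (4) for y=+1 with normal class-conditional densities of z."""
    joint_pos = prior_pos * normal_pdf(z, mean_pos, var_pos)
    joint_neg = (1.0 - prior_pos) * normal_pdf(z, mean_neg, var_neg)
    return joint_pos / (joint_pos + joint_neg)
```

As a worked example with assumed costs: a prior of 1/3 for the positive class, with misclassifying a positive case taken to be four times as costly as misclassifying a negative one, gives an adjusted prior of (4/3)/(4/3 + 2/3) = 2/3.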

4 Experiments and Results

In these experiments, the data set is split according to the time scale: the data from the first 265 treated patients (collected between 1994 and 1997) are taken as the training set, and the remaining 160 cases (collected between 1997 and 1999) are used as the test set. The proportion of malignant tumors is about 1/3 in both the training set and the test set. All the input data have been normalized using the mean and variance estimated from the training data. Several competitive models are built and evaluated using the same variables selected by the proposed forward procedure. Besides LS-SVM models with linear and RBF kernels, the other competitive models considered include a linear discriminant analysis (LDA) classifier, and a Bayesian MLP classifier as the counterpart of SVMs in neural network modelling.

4.1 Selecting Predictive Input Variables

Selecting the most predictive input variables is critical to effective model development, since it not only helps to understand the disease, but also potentially decreases the measurement cost for the future. Here we adapt the forward selection, which tries to maximize the evidence of the LS-SVM classifiers with either linear or RBF kernels. In order to stabilize the selection, the three variables with the smallest univariate model evidence are first removed. Then the selection starts from the remaining 24 candidate variables.

Fig. 2 shows the evolution of the model evidence during the input selection using RBF kernels. The variable added to the model at each selection step and the corresponding Bayes factor have been depicted. The Bayes factor for the univariate model is obtained by comparing it to a model with only a random variable; the other Bayes factors are obtained by comparing the current model to the previously selected model. Ten variables were selected by the LS-SVM with RBF kernels, and they were used to build all the competitive models in the following experiments. Linear kernels have also been tried, but resulted in a smaller evidence and an inferior model performance.

Compared to the variables selected by a stepwise logistic regression based on the whole dataset (which should be over-optimistic) [2], the newly identified subset based only on the 265 training data includes 2 more variables. However, it still gives a comparable performance on the test set.

4.2 Model Fitting and Prediction

The model fitting procedure for LS-SVM classifiers has two stages. The first is the construction of the standard LS-SVM model within the evidence framework. Sparseness can be imposed on LS-SVMs at this stage in order to improve the generalization ability, by iteratively pruning, e.g., the ‘easy’ cases which have negative support values in α. At the second stage, all the available training data will be used to compute the output probability, indicating the posterior probability for a tumor to be malignant.

Fig. 2. Evolution of the model evidence during the forward input selection for LS-SVM with RBF kernels. [Plot: evidence and 2 log B10 against H0 versus the number of input variables (1 to 11); variables labelled on the plot: L_CA125, Pap, Sol, Col3, Bilat, Meno, Asc, Shadows, Col4, Irreg; Bayes factor scale: > 10 decisive, 5 to 10 strong, 2 to 5 positive, < 2 very weak.]

For MLP models, we use MacKay’s Bayesian MLP classifier [9][8], which is limited to one hidden layer with two hidden neurons, with a hyperbolic tangent activation function for the hidden layer and a sigmoidal logistic activation function for the output layer. Other models with various numbers of hidden neurons were also tried, but are not reported here due to their smaller evidence and inferior performance on the test set. Because of the existence of multiple local minima, the MLP classifier was trained 10 times with different initializations of the weights, and the one with the highest evidence was chosen.

In risk minimization decision making, different error costs are considered in order to reduce the expected loss. Since misclassification of a malignant tumor is very serious, the adjusted prior for the malignant class in the following experiments is intuitively set to 2/3, higher than that of the benign class, 1/3. The same adjusted class priors have been used in the computation of the posterior output for all the compared models.

4.3 Model Evaluation

Fig. 3. ROC curves from different models on the test set. [Plot: sensitivity (0.5 to 1) versus 1 − specificity (0 to 0.5); curves: RMI, LDA, MLP, LS-SVM_lin, LS-SVM_rbf.]

The model performance is assessed by ROC analysis. Unlike the classification accuracy, ROC is independent of class distributions or error costs, and has been widely used in the biomedical field. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for the different cutoff values of a diagnostic test. Here the sensitivity and specificity are the correct classification rates for the malignant and benign class, respectively. The area under the ROC curve (AUC) can be statistically interpreted as the probability that the classifier assigns a higher output to a randomly chosen malignant case than to a randomly chosen benign case [11].
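The quantities used in the ROC analysis can be computed directly from classifier outputs, as in the following Python sketch. The function names and the toy scores are assumptions; the AUC is evaluated through its pairwise ranking interpretation (ties counted as one half), and a case is taken as predicted malignant when its score is at or above the cutoff.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Correct classification rates for the malignant (+1) and benign (-1) class."""
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    true_neg = sum(1 for s, y in zip(scores, labels) if y == -1 and s < cutoff)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return true_pos / n_pos, true_neg / n_neg


def auc(scores, labels):
    """Fraction of (malignant, benign) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up posterior probabilities for three malignant and three benign cases.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, -1, -1, -1]
toy_auc = auc(toy_scores, toy_labels)
toy_sens, toy_spec = sensitivity_specificity(toy_scores, toy_labels, cutoff=0.5)
```

In the toy data 8 of the 9 malignant-benign pairs are ordered correctly, so the AUC is 8/9, while at a cutoff of 0.5 both sensitivity and specificity equal 2/3.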

Fig. 3 and Table 2 report the performance of different models on the test set; the performance of RMI, a widely used score system (calculated as the product of the CA 125 level, a morphologic score, and a score for the menopausal status), is listed as a reference. All the competitive models perform much better than RMI, among which the LS-SVM model with RBF kernels achieves the best performance. The performance of the Bayesian MLP is comparable to that of the Bayesian LS-SVM with RBF kernels.

We also check the ability of our classifiers to reject uncertain test cases which need further examination by a human expert. The discrepancy between the posterior probability and the cutoff value reflects the uncertainty of the prediction: the smaller the discrepancy, the larger the uncertainty. The performance of the models has been reevaluated after rejecting a certain number of the most ‘uncertain’ test cases, and the RBF LS-SVM model keeps giving the best results. Table 3 shows how the rejection of the uncertain cases can improve the performance of the RBF LS-SVM classifier.
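The rejection step can be sketched as follows in Python: cases are ranked by the distance between their posterior probability and the cutoff, and the closest fraction is set aside. The function name, the rounding of the number of rejected cases, and the toy probabilities are assumptions made for this illustration.

```python
def reject_uncertain(probs, cutoff=0.5, reject_fraction=0.05):
    """Indices of the cases kept after rejecting the most uncertain ones.

    Uncertainty is measured by |p - cutoff|: the smaller the discrepancy,
    the more uncertain the prediction.
    """
    n_reject = int(round(reject_fraction * len(probs)))
    by_uncertainty = sorted(range(len(probs)), key=lambda i: abs(probs[i] - cutoff))
    rejected = set(by_uncertainty[:n_reject])
    return [i for i in range(len(probs)) if i not in rejected]


# Made-up posterior probabilities; the two closest to 0.5 are rejected.
toy_probs = [0.95, 0.52, 0.10, 0.45, 0.80]
toy_kept = reject_uncertain(toy_probs, cutoff=0.5, reject_fraction=0.4)
```

The remaining cases can then be scored with the usual accuracy, sensitivity, specificity and AUC measures, while the rejected ones are referred to a human expert.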


Table 2. Comparison of the model performance on the test set

Model Type (AUC)       Cutoff value   Accuracy (%)   Sensitivity (%)   Specificity (%)
RMI (0.8733)           100            78.13          74.07             80.19
                       75             76.88          81.48             74.53
LDA (0.9034)           0.5            84.38          75.93             88.68
                       0.4            83.13          75.93             86.79
                       0.3            81.87          77.78             83.96
MLP (0.9174)           0.5            82.50          77.78             84.91
                       0.4            83.13          81.48             83.96
                       0.3            81.87          83.33             81.13
LS-SVM_Lin (0.9141)    0.5            82.50          77.78             84.91
                       0.4            81.25          77.78             83.02
                       0.3            81.88          83.33             81.13
LS-SVM_RBF (0.9184)    0.5            84.38          77.78             87.74
                       0.4            83.13          81.48             83.96
                       0.3            84.38          85.19             83.96

Table 3. Classification performance with rejection

Model Type     Reject          AUC      Acc (%)   Sens (%)   Spec (%)
LS-SVM_RBF     5% (8/160)      0.9343   87.50     82.61      89.80
               10% (16/160)    0.9420   88.97     83.72      91.40

Note: the cutoff probability level is set to 0.5.

5 Conclusions

In this paper, we have discussed the application of Bayesian LS-SVM classifiers to predict the malignancy of ovarian tumors. Within the evidence framework, the hyperparameter tuning, input variable selection and computation of posterior class probability can be conducted in a unified way. Our results demonstrate that the LS-SVM models have the potential to obtain a reliable preoperative distinction between benign and malignant ovarian tumors, and to assist clinicians in making a correct diagnosis.

This work is part of the International Ovarian Tumor Analysis (IOTA) project, which is a multi-center study on the preoperative characterization of ovarian tumors based on artificial intelligence models [4]. Future work includes the application of our models to the multi-center data on a larger scale, and possibly a further subclassification of the tumors.

Acknowledgments

S. Van Huffel is a full professor with KU Leuven, Belgium. J.A.K. Suykens is a postdoctoral researcher with FWO Flanders and a professor with KU Leuven. T. Van Gestel is a postdoctoral researcher with FWO Flanders. C. Lu is supported by a K.U.Leuven doctoral fellowship. This research is also supported by the projects of Belgian Federal Government IUAP IV-02 and IUAP V-22, of the Research Council KUL MEFISTO-666 and IDO/99/03, and the FWO projects G.0407.02 and G.0269.02.

References

1. D. Timmerman, H. Verrelst, T.H. Bourne, B. De Moor, W.P. Collins, I. Vergote and J. Vandewalle, “Artificial neural network models for the preoperative discrimination between malignant and benign adnexal masses,” Ultrasound Obstet Gynecol, vol. 13, pp. 17-25, 1999.

2. C. Lu, J. De Brabanter, S. Van Huffel, I. Vergote, D. Timmerman, “Using artificial neural networks to predict malignancy of ovarian tumors,” in Proc. 23rd Annu. Int. Conf. of the IEEE Engineering in Medicine and Biology Society, Istanbul, Turkey, October 25-28, 2001, Paper 4.2.2-6.

3. P. Antal, H. Verrelst, D. Timmerman, Y. Moreau, S. Van Huffel, B. De Moor, I. Vergote, “Bayesian networks in ovarian cancer diagnosis: potentials and limitations,” in Proceeding of the 13th IEEE Symposium on Computer-Based Medical Systems (CBMS 2000), Houston, TX, 2000, 103-109.

4. D. Timmerman, L. Valentin, T.H. Bourne, W.P. Collins, H. Verrelst, I. Vergote, “Terms, Definitions and Measurements to describe the ultrasonographic features of adnexal tumors: a consensus opinion from the international ovarian tumor analysis (IOTA) group,” Ultrasound Obstet Gynecol, vol. 16, pp. 500-505, 2000.

5. V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

6. J.A.K. Suykens, J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293-300, 1999.

7. J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines. Singapore: World Scientific, 2002.

8. C.M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, 1995).

9. D.J.C. MacKay, “The evidence framework applied to classification networks,” Neural Computation, vol. 4, no. 5, pp. 698-741, 1992.

10. T. Van Gestel, J.A.K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, J. Vandewalle, “A Bayesian framework for Least Squares Support Vector Machine classifiers, Gaussian processes and kernel Fisher discriminant analysis,” Neural Computation, vol. 15, no. 5, pp. 1115-1148, 2002.

11. J.A. Hanley, B. McNeil, “The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve,” Radiology, vol. 143, pp. 29-36, 1982.

12. H. Jeffreys, Theory of Probability. New York: Oxford University Press, 1961.

Attribute Interactions in Medical Data Analysis

Aleks Jakulin1, Ivan Bratko1,2, Dragica Smrke3, Janez Demsar1, and Blaz Zupan1,2,4

1 Faculty of Computer and Information Science, University of Ljubljana, Trzaska 25, Ljubljana, Slovenia
2 J. Stefan Institute, Jamova 39, Ljubljana, Slovenia
3 Dept. of Traumatology, University Clinical Center, Ljubljana, Slovenia
4 Dept. of Human and Mol. Genetics, Baylor College of Medicine, Houston, USA

Abstract. There is much empirical evidence about the success of naive Bayesian classification (NBC) in medical applications of attribute-based machine learning. NBC assumes conditional independence between attributes. In classification, such classifiers sum up the pieces of class-related evidence from individual attributes, independently of other attributes. The performance, however, deteriorates significantly when the “interactions” between attributes become critical. We propose an approach to handling attribute interactions within the framework of “voting” classifiers, such as NBC. We propose an operational test for detecting interactions in learning data and a procedure that takes the detected interactions into account while learning. This approach induces a structuring of the domain of attributes; it may lead to improved classifier performance and may provide useful novel information for the domain expert when interpreting the results of learning. We report on its application in data analysis and model construction for the prediction of clinical outcome in hip arthroplasty.

1 Introduction

The most common form of machine learning is attribute-based supervised learning. Given a set of instances, each of them described by the values of the attributes and the class, we learn a model with which we predict the class of a previously unseen instance. In this paper we consider such a classification problem when both the attributes and class are nominal. That is, the domains of the attributes and the class are discrete and unordered.

Naive Bayesian classification (NBC) is a popular machine learning method that assumes that the attributes are conditionally independent. Experience shows that the NBC approach in medical applications is effective and gives relatively good classification accuracy in comparison with other, more elaborate learning methods, even if the assumption is not always correct. In fact, the relative strength of the approach comes precisely from the simplifying assumption of conditional attribute independence.

If the independence assumption is made, an attribute’s contribution of evidence about the class is determined independently of other attributes. Such evidence estimation is more robust than one that assumes attribute dependence. This increase in robustness is particularly important when the data is scarce, which is a common problem in medical applications. The evidence from individual attributes can be estimated from larger data samples, whereas handling attribute dependence leads to fragmentation of the available data and consequently to unreliable estimates of evidence. Consequently, more sophisticated methods (which do not assume independence) often perform worse than the simple NBC.

When attribute dependencies become critical, ignoring dependencies maylead to inferior performance. Methods like NBC that look at just one attributeat a time are called “myopic” in machine learning. Such methods compute ev-idence about the class separately for each attribute (independently from otherattributes), and then simply “sum up” all these pieces of evidence. This “vot-ing” does not have to be an actual arithmetic sum (for example, it can be aproduct, which is a sum of logarithms, as in NBC). The important point is thatthe aggregation of pieces of evidence coming from individual attributes does notdepend on the relations among the attributes. We will refer to such methods as“voting methods”; they employ “voting classifiers.”

A well-known example where the myopia of voting methods results in complete failure is the concept of exclusive OR: C = XOR(X, Y), where C is a Boolean class, and X and Y are Boolean attributes. Myopically looking at attribute X alone provides no evidence about the value of C. The reason is that the relation between X and C critically depends on Y. For Y = 0, C = X; for Y = 1, C ≠ X. Similarly, Y alone fails. However, X and Y together perfectly determine C. We say that there is a positive interaction between X and Y with respect to C. In the case of positive interaction between X and Y with respect to class C, the evidence from X and Y jointly about C is greater than the sum of the evidence from X alone and the evidence from Y alone.

The opposite may also happen, namely that the evidence from X and Y jointly is worth less than the sum of the individual pieces of evidence. In such cases we say that there is a negative interaction between X and Y w.r.t. C. A simple example is when attribute Y is (essentially) a duplicate of X. For example, the length of the diagonal of a square duplicates the side of the square. As with positive interactions, voting classifiers are also confused by negative interactions.

In this paper we propose an approach to handling attribute interactions within the framework of voting classifiers, such as the naive Bayesian classifier. We propose an operational test for detecting positive and negative interactions in learning data, and a procedure for “resolving” the detected interactions when learning a voting classifier. The key in resolving interactions is that the interacting pairs of attributes are treated jointly, giving rise to new attributes, which is similar to the idea of structured induction [1–3]. This approach induces an automatic structuring of the domain of attributes. In addition to improved classifier performance, it is hoped that such domain structuring also provides useful novel information for the domain expert when interpreting the results of learning.

We apply our proposed approach to the medical problem of predicting the success of hip arthroplasty in terms of Harris hip score (HHS; [4]). We also compare the automatically induced attribute structure based on interaction analysis with the structure proposed by a medical expert for the same domain [5].

2 Attribute Interactions

Let us first define the concept of an interaction between attributes formally. Let there be a learning problem with the class C and attributes X1, X2, . . .. Under conditions of noise or incomplete information, the attributes need not determine the class value perfectly. Instead, they provide some “degree of evidence” for or against particular class values. For example, given an attribute-value vector, the degrees of evidence for all possible class values may be a probability distribution over the class values given the attribute values.

Let the evidence function f(C, X1, X2, . . . , Xk) define a “chosen” true degree of evidence for class C in the domain. The task of machine learning is to induce an approximation to function f from learning data. In this sense, f is the target concept for learning. In classification, f (or its approximation) would be used as follows: if for given attribute values x1, x2, . . . , xk we have f(c1, x1, x2, . . . , xk) > f(c2, x1, x2, . . . , xk), then the class c1 is more likely than c2.

We define the presence, or absence, of interactions among the attributes as follows. If the evidence function can be written as a (“voting”) sum:

$$f(C, X_1, X_2, \ldots, X_k) = g\left(\sum_{i=1}^{k} g_i(C, X_i)\right) \qquad (1)$$

for some functions g and g1, g2, . . . , gk, then there is no interaction between the attributes. Equation (1) requires that the joint evidence of all the attributes can be reduced to the sum of the pieces of evidence gi(C, Xi) from individual attributes.
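As an illustration of Equation (1), the naive Bayesian classifier can be written in this voting form. The derivation below is a sketch under the assumption that the evidence function f is taken to be the unnormalized class posterior; the particular split of the prior log P(c) evenly across the k terms is one convenient choice, not the only one.

```latex
\begin{align*}
P(c \mid x_1, \ldots, x_k)
  &\propto P(c) \prod_{i=1}^{k} P(x_i \mid c) \\
  &= \exp\!\left( \sum_{i=1}^{k}
       \Big[ \log P(x_i \mid c) + \tfrac{1}{k} \log P(c) \Big] \right),
\end{align*}
% so Equation (1) holds with
%   g(s) = \exp(s), \qquad
%   g_i(c, x_i) = \log P(x_i \mid c) + \tfrac{1}{k} \log P(c).
```

Each term g_i depends on a single attribute only, which is exactly why the aggregation cannot reflect relations among attributes.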

If, on the other hand, no such functions g, g1, g2, . . . , gk exist for which (1) holds, then there are interactions among the attributes. The strength of interactions IS can be defined as

$$IS := f(C, X_1, X_2, \ldots, X_k) - g\left(\sum_{i} g_i(C, X_i)\right).$$

IS greater than some positive threshold would indicate a positive interaction, and IS less than some negative threshold would indicate a negative interaction. Positive interactions indicate that a holistic view of the attributes unveils new evidence. Negative interactions are caused by multiple attributes providing the same evidence, which should get counted only once.

We will not refine this definition to make it applicable in a practical learning setting. Instead, we propose a heuristic test for detecting positive and negative interactions in the data, in the spirit of the above principled definition of interactions. Interaction gain is based on the well-known idea of information gain. Information gain of a single attribute X with the class C, also known as mutual information between X and C, is defined as:

$$\mathrm{Gain}_C(X) = I(X; C) = \sum_{x \in D_X} \sum_{c \in D_C} P(x, c) \log \frac{P(x, c)}{P(x)\,P(c)}. \qquad (2)$$
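Equation (2) can be estimated directly from counts. The following is a minimal sketch, assuming relative frequencies as probability estimates and base-2 logarithms (the text does not state the base); the function name and the toy data are illustrative only.

```python
from collections import Counter
from math import log2


def information_gain(xs, cs):
    """Estimate I(X; C) in bits from paired samples of X and C."""
    n = len(xs)
    joint = Counter(zip(xs, cs))  # counts of (x, c) pairs
    px = Counter(xs)              # counts of x values
    pc = Counter(cs)              # counts of class values
    # Only observed (x, c) pairs contribute; unobserved pairs add zero.
    return sum(
        (k / n) * log2((k / n) / ((px[x] / n) * (pc[c] / n)))
        for (x, c), k in joint.items()
    )


# Toy data: X determines C exactly, Z is unrelated to C.
c = [0, 0, 1, 1]
x = [0, 0, 1, 1]
z = [0, 1, 0, 1]
gain_x = information_gain(x, c)  # one full bit
gain_z = information_gain(z, c)  # zero
```

With so few instances the estimates are of course unreliable; the sketch only shows the arithmetic.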

Information gain can be regarded as a measure of the strength of a 2-way interaction between an attribute X and the class C. In this spirit, we can generalize it to 3-way interactions by introducing the interaction gain [6] or interaction information [7]:

$$I(X; Y; C) := I(XY; C) - I(X; C) - I(Y; C) = I(X; Y \mid C) - I(X; Y). \qquad (3)$$

We have joined the attributes X and Y into their Cartesian product XY. Interaction gain can be understood as the difference between the actual decrease in entropy achieved by the joint attribute XY and the expected decrease in entropy with the assumption of independence between attributes X and Y. The higher the interaction gain, the more information was gained by joining the attributes in the Cartesian product, in comparison with the information gained myopically from individual attributes. When the interaction gain is negative, both X and Y carry the same evidence, which could consequently be counted twice.
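The two kinds of interaction can be checked numerically with Equation (3). The sketch below assumes frequency-based probability estimates and base-2 logarithms; the helper names are illustrative. It evaluates the XOR concept (positive interaction) and a duplicated attribute (negative interaction).

```python
from collections import Counter
from math import log2


def mutual_information(xs, cs):
    """Estimate I(X; C) in bits from paired samples."""
    n = len(xs)
    joint, px, pc = Counter(zip(xs, cs)), Counter(xs), Counter(cs)
    return sum((k / n) * log2(k * n / (px[x] * pc[c]))
               for (x, c), k in joint.items())


def interaction_gain(xs, ys, cs):
    """I(X; Y; C) = I(XY; C) - I(X; C) - I(Y; C), as in Equation (3)."""
    xy = list(zip(xs, ys))  # Cartesian-product attribute XY
    return (mutual_information(xy, cs)
            - mutual_information(xs, cs)
            - mutual_information(ys, cs))


# Positive interaction: C = XOR(X, Y); neither attribute helps alone.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
c_xor = [a ^ b for a, b in zip(x, y)]
xor_gain = interaction_gain(x, y, c_xor)

# Negative interaction: Y duplicates X, and C equals X.
dup_gain = interaction_gain(x, x, x)
```

For XOR the joint attribute carries one bit while each attribute alone carries none, so the gain is +1 bit; for the duplicate, one bit of evidence is counted twice, so the gain is -1 bit.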

We have also expressed interaction gain through conditional mutual information I(X; Y |C), which has recently been used for learning tree-augmented naive Bayes classifiers [8]. It is easy to see that conditional mutual information, unlike interaction gain, is unable to distinguish dependence given the context I(X; Y |C) from dependence regardless of the context I(X; Y). With conditional mutual information, it is impossible to distinguish negative from positive interactions. Furthermore, trees can only represent a subset of possible attribute dependencies in a domain.

3 Interaction Analysis in a Hip Arthroplasty Domain

We have studied attribute interactions and the effect they have on the performance of the naive Bayesian classifier in the domain of predicting the patient’s long-term clinical status after hip arthroplasty. This problem domain was chosen for two main reasons. First, the construction of a good predictive model for the hip endoprosthesis domain may provide the physician with a tool to better plan the treatment after the operation; in this respect, discovery of interesting attribute interactions is beneficial. Second, in our previous study [5] the participating physician defined an attribute taxonomy for this domain in order to construct a required concept hierarchy for the decision support model: this provided grounds for comparison with the taxonomy discovered by observing attribute interactions from the data.

3.1 The Data

The data we have considered was gathered at the Department of Traumatology of the University Clinical Center in Ljubljana from January 1988 to December 1996. For each of the 112 patients, 28 attributes were observed at the time of or immediately after the operation. All attributes are nominal and most, but not all, are binary (e.g., presence or absence of a complication). The patient’s long-term clinical status was assessed in terms of Harris hip score [4] at least 18 months after the operation. Harris hip score gives an overall assessment of the patient’s condition and is evaluated by a physician who considers, for example, the patient’s ability to walk and climb stairs, the patient’s overall mobility and activity, presence of pain, etc. The numerical Harris hip score, on a scale from 0 to 100, was discretized into three classes: bad (up to 70, 43 patients), good (between 70 and 90, 34 patients) and excellent (above 90, 35 patients).

3.2 Interaction Gain Analysis

We first analyzed the hip arthroplasty data to determine the interaction gain (3) between pairs of attributes. Results of this analysis are presented in Fig. 1, which, for clarity of presentation, shows only the most positive (I(X; Y; C) ≥ 0.039) and the most negative interactions (I(X; Y; C) < −0.007).

The domain expert first examined the graph with positive interactions; they surprised her (she would not have immediately thought of them if asked to name them), but she could justify all of them well. For instance, according to her own knowledge or knowledge obtained from the literature, a specific (bipolar) type of endoprosthesis and a short duration of operation significantly increase the chances of a good outcome. Presence of neurological disease is a high risk factor only in the presence of other complications during operation. It was harder for her to understand the concept of negative interactions, but she could confirm that the attributes related in this graph are indeed, as expected, correlated with one another. In general, she found the graph with positive interactions more revealing and interesting.

Fig. 1. Graphs displaying the distinctly positive (the two subgraphs on the left), and negative (the graph on the right) interactions. Each edge is labeled with the value of I(X; Y; C) for the pair of connected attributes.


3.3 Induction of Attribute Structure

To further investigate interactions in our domain, we used the hierarchical clustering method ‘agnes’ [9]. Pairs of attributes that interact strongly with the class, either positively or negatively, should appear close to one another, while those which do not interact should be placed further apart. They do not interact if they are conditionally independent, which also happens when one of the attributes is irrelevant. The dissimilarity function, which we express as a matrix D, was obtained with the following formula:

$$D(A, B) = \begin{cases} \left| I(A; B; C)^{-1} \right| & \text{if } |I(A; B; C)| > 0.001, \\ 1000 & \text{otherwise.} \end{cases} \qquad (4)$$
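The dissimilarity of Equation (4) is straightforward to compute once pairwise interaction gains are available. The sketch below is illustrative: the function names, the zero diagonal, the treatment of missing pairs as non-interacting, and the example gain value are all assumptions of this sketch rather than details given in the text.

```python
def dissimilarity(gain, threshold=0.001, far=1000.0):
    """Equation (4): inverse magnitude of the gain, or a large constant."""
    if abs(gain) > threshold:
        return abs(1.0 / gain)
    return far


def dissimilarity_matrix(names, gains):
    """Build a symmetric matrix D from a dict {(a, b): interaction gain}.

    Pairs absent from `gains` are treated as non-interacting (gain 0.0)
    and the diagonal is set to 0.0; both are assumptions of this sketch.
    """
    def lookup(a, b):
        return gains.get((a, b), gains.get((b, a), 0.0))

    return [[0.0 if a == b else dissimilarity(lookup(a, b)) for b in names]
            for a in names]


names = ["luxation", "injury operation time", "diabetes"]
gains = {("luxation", "injury operation time"): 0.05}  # hypothetical value
D = dissimilarity_matrix(names, gains)
```

Because the magnitude is inverted, strongly interacting pairs (positive or negative) end up close together, while weakly interacting pairs are pushed to the fixed distance of 1000. A matrix of this form can then be passed to any agglomerative clustering routine that accepts precomputed dissimilarities.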

Fig. 2. An attribute interaction dendrogram (left) illustrates which attributes interact, positively or negatively, while the expert-defined concept structure (right) was reproduced from [5].

In Fig. 2, we compared the attribute interaction dendrogram with an expert-defined concept structure (attribute taxonomy) that was used as a skeleton for the decision support model in our previous study [5]. While there are some similarities (like the close relation between the abilities to stand and to walk), the two hierarchies are mostly dissimilar. The domain expert appears to have defined her structure on the basis of medical (anatomical, physiological) taxonomy; this does not seem to correspond to attribute interactions, as defined in this text.


4 Construction of Classification Models

While naive Bayesian classifiers cannot exploit the information hidden in a positive interaction [10, 11], the attributes in negative interactions tend to confuse their predictions [12]. The effects of negative interactions have not been studied extensively, but they provide an explanation for the benefits of feature selection procedures, which are one way of eliminating this problem.

By resolving interactions, we refer to a procedure where the interacting pairs of attributes are treated jointly, giving rise to new attributes which are added to the data set. The best subset of attributes is then found using a feature subset selection technique, and later used for construction of a target prediction model. For feature subset selection, we used a greedy heuristic, driven by the myopic information gain (2): only the n attributes with the highest information gain were selected. For resolution of interactions we also used a greedy heuristic, guided by the interaction gain (3): we introduced the Cartesian product attributes only for the N attribute pairs with the highest interaction gain.
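The two greedy heuristics can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the dict-of-columns data layout, the frequency-based estimates, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, cs):
    """Estimate I(X; C) in bits from paired samples."""
    n = len(xs)
    joint, px, pc = Counter(zip(xs, cs)), Counter(xs), Counter(cs)
    return sum((k / n) * log2(k * n / (px[x] * pc[c]))
               for (x, c), k in joint.items())


def resolve_and_select(data, cls, n_pairs, n_selected):
    """Add Cartesian-product attributes for the n_pairs pairs with the
    highest interaction gain, then keep the n_selected attributes
    (original or constructed) with the highest information gain."""
    def gain(a, b):
        joint = list(zip(data[a], data[b]))
        return (mutual_information(joint, cls)
                - mutual_information(data[a], cls)
                - mutual_information(data[b], cls))

    pairs = sorted(combinations(sorted(data), 2),
                   key=lambda p: gain(*p), reverse=True)
    extended = dict(data)  # original attributes are kept
    for a, b in pairs[:n_pairs]:
        extended[f"{a} + {b}"] = list(zip(data[a], data[b]))
    ranked = sorted(extended,
                    key=lambda name: mutual_information(extended[name], cls),
                    reverse=True)
    return ranked[:n_selected]


# Toy data: the class is XOR of x and y; z is weakly informative.
data = {"x": [0, 0, 1, 1], "y": [0, 1, 0, 1], "z": [0, 0, 0, 1]}
cls = [0, 1, 1, 0]
selected = resolve_and_select(data, cls, n_pairs=1, n_selected=2)
```

In this toy case the constructed attribute "x + y" ranks first, ahead of z, while x and y alone carry no information about the class and are dropped.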

In our experimental evaluation, interaction gain scores were obtained by considering the complete data set; new attributes were then created and added to the data set. In the second phase, the naive Bayesian classifier was built using the altered data set and evaluated at different sizes of the selected feature subset. The ordering of the attributes for feature subset selection using information gain and modeling using the subset were both performed on the learning data set, but evaluated on the test set. The evaluation was performed using the leave-one-out scheme: for the data set containing l instances, we performed l iterations, j = 1, 2, . . . , l, in which all instances except the j-th were used for training, and the resulting predictive model was tested on the j-th instance. We report average performance statistics over all l iterations. All the experiments were performed with the Orange toolkit [13].
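The leave-one-out scheme itself can be sketched generically. The code below is not the Orange implementation; `fit` and `predict` are hypothetical callables supplied by the caller, and a majority-class learner stands in for the naive Bayesian classifier purely to keep the example short.

```python
from collections import Counter


def leave_one_out(instances, labels, fit, predict):
    """Return the leave-one-out error rate of a learner.

    In iteration j, all instances except the j-th are used for training
    and the resulting model is tested on the j-th instance.
    """
    errors = 0
    for j in range(len(instances)):
        train_x = instances[:j] + instances[j + 1:]
        train_y = labels[:j] + labels[j + 1:]
        model = fit(train_x, train_y)
        if predict(model, instances[j]) != labels[j]:
            errors += 1
    return errors / len(instances)


# Placeholder learner: always predicts the majority class of the training set.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, instance):
    return model


xs = [[0], [1], [0], [1]]
ys = ["a", "a", "a", "b"]
loo_error = leave_one_out(xs, ys, fit_majority, predict_majority)
```

Any step that depends on the class labels, such as ranking attributes by information gain, belongs inside `fit` so that it only sees the training portion of each iteration.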

To measure the performance of classification models we have used two error measures. Error rate is the proportion of test cases where the classifier predicted the wrong class, i.e., the class for which the classifier predicted the highest probability was not the true class of the test case. The second error measure, Brier score, was originally used to assess the quality of weather forecasting models [14], and has recently gained attention in medicine [15]. It is better suited for evaluating probabilistic classifiers because it measures the deviations of the predicted outcome probabilities from the actual outcomes. As such, it is more sensitive than the error rate, yet conceptually very similar to it. A learning method should attempt to minimize the error rate and the Brier score.
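Both measures can be computed from predicted class probability distributions. The sketch below assumes the multi-class form of the Brier score that sums squared differences over classes and averages over test cases; the text does not state which normalization was used (some variants additionally divide by the number of classes), so the absolute values here need not match the reported ones. The example predictions are invented.

```python
def error_rate(probs, truths):
    """Proportion of cases where the most probable class is not the true one."""
    wrong = sum(1 for p, t in zip(probs, truths) if max(p, key=p.get) != t)
    return wrong / len(truths)


def brier_score(probs, truths):
    """Mean over cases of the squared distance between the predicted
    distribution and the 0/1 indicator vector of the true class."""
    total = 0.0
    for p, t in zip(probs, truths):
        total += sum((p_c - (1.0 if c == t else 0.0)) ** 2
                     for c, p_c in p.items())
    return total / len(truths)


# Two invented predictions over the three Harris hip score classes.
probs = [
    {"bad": 0.6, "good": 0.3, "excellent": 0.1},
    {"bad": 0.5, "good": 0.4, "excellent": 0.1},
]
truths = ["bad", "good"]
err = error_rate(probs, truths)
brier = brier_score(probs, truths)
```

The second case is counted as a plain error by the error rate, whereas the Brier score also registers that the true class received a fairly high probability of 0.4.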

We have assessed how the inclusion of different numbers of newly constructed and original attributes affects the prediction performance. Figure 3 illustrates the search space for our domain, where the number n of attributes selected is plotted on the horizontal and the number N of interactions resolved on the vertical axis. The best choice of n and N can be determined with a wrapper mechanism for model selection. We can observe several phenomena: increasing the number of attributes in the feature subset does not increase the error rate as much as it hurts the precision of probability estimates, as measured by the Brier score.


[Two contour-plot panels: Brier score and error rate.]

Fig. 3. Dependence of the Brier score and error rate on the feature subset size, n (horizontal axis), and on the number of interactions resolved, N (vertical axis). Emphasized are the areas of the best predictive accuracy, where the Brier score is less than 0.2 and the error rate is less than 0.45.

[Two panels: Brier score and error rate plotted against feature subset size, each with curves labeled “Original” and “Best 4 Int.”]

Fig. 4. Average Brier score and error rate as computed by leave-one-out and their dependence on the number of attributes used in the model for N = 4 (solid line) and N = 0 (dashed). For all measurements, the standard error is shown.

Furthermore, there are diminishing returns to resolving an increasing number of interactions, as illustrated in the contour diagrams in Fig. 3. Unnecessary interactions merely burden the feature subset selection mechanisms with additional negative interactions. Figure 4 presents the results in terms of Brier score and error rate with four resolved interactions.

There are several islands of improved predictive accuracy, but the best appears to be the area with approximately 4 resolved interactions and 4 selected attributes. Classification accuracy reaches its peak of 60% at the same number of attributes used. This accuracy improves upon the accuracy of 56% obtained in our previous study, where manually crafted features as proposed by domain experts were used in the naive Bayesian classifier [5]. Both are a substantial improvement over models constructed from the original set of features, where the accuracy of NBC with the original 28 attributes is 45%, and does not rise beyond 54% even with the use of feature subset selection. The results in Table 1 show that three of the four constructed attributes were chosen in building the model. The table provides the set of important interactions in the data, where an important increase in predictive accuracy can be seen as an assessment of the interaction importance itself, given the data.

We have compared the results obtained with the greedy method with global search-based feature subset selection as implemented in B-Course [16]. The model without interactions achieved classification accuracy of 59% with 7 selected attributes. If the 10 interactions with the highest interaction gain were added, the model achieved classification accuracy of 62% with a model consisting of 8 attributes. B-Course’s model included all the features from Table 1, in addition to two of the original attributes and two interactions.

Table 1. Average information gain for attributes for the case N = 4, n = 4. The resolved interactions are emphasized (marked with *).

Information Gain   Attribute
0.118              luxation + injury operation time *
0.116              diabetes + neurological disease *
0.109              hospitalization duration + diabetes *
0.094              pulmonary disease

5 Summary and Conclusions

We have defined interactions as deviations from the conditional independence assumption between attributes. Positive interactions imply that the conditional dependence of attributes given the class exceeds their mutual dependence; new evidence is unveiled if the positively interacting attributes are treated jointly. Negative interactions indicate that the mutual dependence of attributes is greater than their conditional dependence; we should not account for the same evidence more than once. We have introduced interaction gain as a heuristic estimate of the interaction magnitude and type for 3-way interactions between a pair of attributes and the class.

We have proposed a method for analysis and management of attribute interactions in prognostic modeling. In an experimental evaluation on the hip arthroplasty domain, we have obtained a number of promising and unexpected results. Promising were those based on performance evaluation: resolution of positive interactions yielded attributes that could improve the performance of the predictive model built by the naive Bayesian classification method. Promising but also unexpected were the interactions themselves: we have observed that the pairs of interacting attributes proposed by our algorithm and induced from the data were quite different from those obtained from the expert-designed attribute taxonomy. Although the new attributes proposed by experts can constitute a valuable part of the background knowledge, and may significantly improve the performance of predictive models (see [5]), other important attribute combinations may be overlooked. The algorithms described in this paper may help the domain experts to reveal them and, if found meaningful, include them in their knowledge base.

References

1. Shapiro, A.D.: Structured induction in expert systems. Turing Institute Press in association with Addison-Wesley Publishing Company (1987)

2. Michie, D.: Problem decomposition and the learning of skills. In Lavrac, N., Wrobel, S., eds.: Machine Learning: ECML-95. Notes in Artificial Intelligence 912. Springer-Verlag (1995) 17–31

3. Zupan, B., Bohanec, M., Demsar, J., Bratko, I.: Learning by discovering concept hierarchies. Artificial Intelligence 109 (1999) 211–42

4. Harris, W.H.: Traumatic arthritis of the hip after dislocation and acetabular fractures: Treatment by mold arthroplasty: end result study using a new method of result evaluation. J Bone Joint Surg 51-A (1969) 737–55

5. Zupan, B., Demsar, J., Smrke, D., Bozikov, K., Stankovski, V., Bratko, I., Beck, J.R.: Predicting patient’s long term clinical status after hip arthroplasty using hierarchical decision modeling and data mining. Methods of Information in Medicine 40 (2001) 25–31

6. Jakulin, A.: Attribute interactions in machine learning. Master’s thesis, University of Ljubljana, Faculty of Computer and Information Science (2003)

7. McGill, W.J.: Multivariate information transmission. Psychometrika 19 (1954) 97–116

8. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Machine Learning 29 (1997) 131–163

9. Struyf, A., Hubert, M., Rousseeuw, P.J.: Integrating robust clustering techniques in S-PLUS. Computational Statistics and Data Analysis 26 (1997) 17–37

10. Kononenko, I.: Semi-naive Bayesian classifier. In Kodratoff, Y., ed.: European Working Session on Learning - EWSL91. Volume 482 of LNAI, Springer Verlag (1991)

11. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (1997) 103–130

12. Rish, I., Hellerstein, J., Jayram, T.: An analysis of data characteristics that affect naive Bayes performance. Technical Report RC21993, IBM (2001)

13. Demsar, J., Zupan, B.: Orange: a data mining framework. http://magix.fri.uni-lj.si/orange (2002)

14. Brier, G.W.: Verification of forecasts expressed in terms of probability. Weather Rev 78 (1950) 1–3

15. Margolis, D.J., Halpern, A.C., Rebbeck, T., et al.: Validation of a melanoma prognostic model. Arch Dermatol. 134 (1998) 1597–1601

16. Myllymaki, P., Silander, T., Tirri, H., Uronen, P.: B-Course: A web-based tool for Bayesian and causal data analysis. International Journal on Artificial Intelligence Tools 11 (2002) 369–387

Combining Supervised and Unsupervised Methods to Support Early Diagnosis of Hepatocellular Carcinoma

Federica Ciocchetta1, Rossana Dell’Anna1, Francesca Demichelis1, Amar Paul Dhillon2, Alberto Quaglia2, and Andrea Sboner1

1 ITC-irst, Via Sommarive 18, 38050 Povo (TN), Italy
{ciocchetta,dellanna,michelis,sboner}@itc.it
2 Royal Free and University College Medical School, Rowland Hill Street, Hampstead, London NW3 2PF, UK
{a.dhillon,aquaglia}@rfc.ucl.ac.uk

Abstract. The early diagnosis of Hepatocellular Carcinoma (HCC) is extremely important for effective treatment, and improvements in diagnosis are indispensable, particularly concerning the differentiation between “early” HCC and non-neoplastic nodules. In this paper, we reconsider the results obtained previously with a supervised system and compare them with the results of an unsupervised method, in order to gain deeper knowledge of uncertain lesions. This analysis agreed with the predictions on dysplastic nodules (DNs) obtained by the supervised system, providing pathologists with reliable information to support their diagnostic process.

1 Introduction

Hepatocellular carcinoma (HCC) is one of the main causes of cancer death, because HCC is often diagnosed at a late stage, when effective treatment is extremely difficult. Early diagnosis is particularly hard, mainly because the specific histopathological and morphological criteria are uncertain and inadequate. Therefore, efforts have to be made to accurately identify “early” or “small” lesions to enable effective treatment. In this context, machine learning methods can give useful support to knowledge discovery, for instance to help the diagnosis of critical lesions.

In a previous investigation [2], a significant feature subset was identified from a set of 11 clinical features (as they appear in published work [7],[1]) by applying two feature selection algorithms, and a classifier system was built to reclassify the so-called dysplastic nodules, i.e. clinically uncertain cases. In this paper an unsupervised approach is combined with the previous one, so as to give more robust support to pathologists and provide results that are easier to interpret.


2 Materials and Methods

This section describes the data and the methods used: classification algorithms are provided by WEKA [4], while cluster analysis is implemented in R [9].

2.1 The Data and the Classification of DNs

Two hundred and twelve liver nodules were retrieved from the Liver Tumour Database of the Royal Free Liver Pathology Unit. These nodules had been isolated in cirrhotic livers removed from 68 patients who received liver transplantation and were assigned by two expert pathologists to one of two classes: Hepatocellular Carcinoma (HCC) or Macro-regenerative Nodules (MRN). The diagnosis of some nodules remained uncertain and they were considered as Dysplastic Nodules (DN), i.e. borderline lesions. As a result of this diagnostic process, the nodules were divided into 106 HCCs, 74 MRNs and 32 DNs.

In a previous study [2], two feature selection algorithms extracted a meaningful subset of 4 histological features (Reticulin Loss, Capillarization, Cellular Atypia and Nodule Heterogeneity) from the 11 clinical ones. Moreover, we tried to predict the nature of dysplastic nodules by using a combination of 5 classifiers to assign each of these lesions to the HCC or MRN class. We built each classifier on the data set composed of HCC-MRN instances by using 10-fold cross-validation on the 180 certain lesions. Only the 4 selected features were involved. Afterwards, we obtained the prediction for each DN by combining the outputs of these classifiers into one single prediction. In detail, after defining the learning set Σ (HCCs-MRNs) and the set Γ to be predicted (DNs), we implemented five classifiers {Mi}, i = 1, . . . , 5, such that ∀xk ∈ Γ and ∀i ∈ [1, 5] we could write Mi(xk) = yik, where yik ∈ {0, 1} (0 is HCC, 1 is MRN). At this point a function

$$f(x_k) = \sum_{i=1}^{5} y_{ik} \in [0, 5]$$

was defined and the final prediction yk was set as:

$$y_k = \begin{cases} \text{HCC} & \text{if } f(x_k) \le 1; \\ \text{MRN} & \text{if } f(x_k) \ge 4; \\ \text{uncertain} & \text{otherwise.} \end{cases} \qquad (1)$$
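The combination rule of Equation (1) amounts to counting MRN votes. A minimal sketch follows; the function name is illustrative, and the five votes are assumed to be already available as 0/1 outputs of the individual classifiers.

```python
def combine_votes(votes):
    """Combine five 0/1 classifier outputs (0 = HCC, 1 = MRN) as in Eq. (1)."""
    if len(votes) != 5 or any(v not in (0, 1) for v in votes):
        raise ValueError("expected exactly five votes, each 0 or 1")
    f = sum(votes)  # number of classifiers voting MRN
    if f <= 1:
        return "HCC"
    if f >= 4:
        return "MRN"
    return "uncertain"


hcc_case = combine_votes([0, 0, 0, 0, 1])        # four of five say HCC
mrn_case = combine_votes([1, 1, 1, 1, 0])        # four of five say MRN
uncertain_case = combine_votes([0, 0, 1, 1, 1])  # 3 to 2 split
```

A nodule is therefore assigned to a class only when at least four of the five classifiers agree; 3-2 splits in either direction are reported as uncertain.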

2.2 Cluster Analysis

Cluster analysis is a set of unsupervised techniques: the data are separated into natural groups according to their similarities and no information about classes is required. In this paper, a fuzzy clustering method is chosen, as it provides the membership function uiv, i.e. a “measure of confidence” describing to what extent each instance i belongs to a certain cluster v [8]. Data are assigned to the cluster with the greatest membership.

Two graphical layouts can be used to display the outputs of cluster analysis: Clusplot and Silhouette plot. In the former the data are points in a two-dimensional graph (relative to the first two principal components) and the clusters are ellipses. The latter is based on the silhouette value s(i), defined as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (2)$$

where a(i) is the average dissimilarity of i to all the other instances of its best cluster A, and b(i) is the average dissimilarity of i to the instances of its second-best cluster B. The average of all s(i) gives a quality index for the partition, the so-called overall average silhouette width (a.s.v.). The higher this coefficient, the more robust the division into clusters.
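The silhouette value can be computed from a dissimilarity matrix and a crisp cluster assignment. The sketch below assumes that each instance has already been assigned to the cluster with the greatest membership, takes the second-best cluster to be the other cluster with the smallest average dissimilarity, and returns 0 for an instance alone in its cluster; the function name and the one-dimensional toy data are illustrative.

```python
def silhouette(i, labels, dist):
    """Silhouette value s(i) = (b - a) / max(a, b) for instance i.

    labels[j] is the cluster of instance j; dist[i][j] is a dissimilarity.
    """
    own = labels[i]
    same = [dist[i][j] for j in range(len(labels))
            if j != i and labels[j] == own]
    if not same:
        return 0.0  # singleton cluster: convention assumed in this sketch
    a = sum(same) / len(same)
    # b: smallest average dissimilarity to any other cluster.
    b = min(
        sum(dist[i][j] for j in range(len(labels)) if labels[j] == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


# Toy data: four points on a line, forming two well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
dist = [[abs(p - q) for q in points] for p in points]
s0 = silhouette(0, labels, dist)
avg_width = sum(silhouette(i, labels, dist) for i in range(4)) / 4
```

Here the first point lies at distance 1 from its cluster mate and at an average distance of 10.5 from the other cluster, giving s = 9.5 / 10.5; values near 1 indicate a well-placed instance, values near 0 or below an ambiguous one.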

3 Results

3.1 Classification of Dysplastic Nodules

In a previous work [2], we classified the 32 DN cases using the method described in Section 2. Summarizing the results, 27 DNs (85%) were assigned to one of the two classes (HCC and MRN, respectively 8 and 19 cases) and only 5 were scored as uncertain. It is important to point out that this method gives only a hint about the possible class, the real diagnosis actually being unknown.

3.2 Cluster Analysis

Cluster analysis is applied to the whole data set to seek confirmation of our previous classification and to see how DNs are merged into groups. We consider two clusters. Figure 1 shows the related clusplot: 104/106 HCCs are in the first cluster, all MRNs (74/74) belong to the second one, while 17/32 DNs are in the first and 15/32 in the second cluster.

Further observations can be drawn from the silhouette plot: the a.s.v. is 0.69 and both clusters have an average s.v. greater than 0.50. Furthermore, we observed how DNs are assigned to the two clusters. The results are summarized in Table 1. It is important to note that all 8 DNs previously classified as HCCs belong to the first cluster and 5 of these have silhouette values (s.v.) greater than 0.40 and high memberships (≥ 0.7), while 15 out of 19 DNs previously classified as MRNs are in the second cluster. It is worth noting that among these 15 nodules 9 have very high s.v. (≥ 0.89) and memberships (≥ 0.95). The 5 uncertain nodules are instead all placed in the first cluster, but both their silhouette value and membership, except for one case, are very low (0.13 and 0.58 respectively).

[Two panels: (a) Clusplot of total data; (b) Silhouette plot. Panel (b) reports n = 212 and 2 clusters, with an average silhouette width of 0.69; cluster 1: 122 instances, average silhouette width 0.58; cluster 2: 90 instances, average silhouette width 0.84.]

Fig. 1. Cluster analysis

Table 1. The division of DNs, identified by a numeric index, in two clusters

Cluster    DN index            Class predicted  Silhouette value  Membership
Cluster 1  3, 15, 24, 27       HCC              ≥ 0.43            ≥ 0.71
           19                  HCC              0.41              0.70
           5                   uncertain        0.32              0.65
           20                  HCC              0.29              0.64
           23                  HCC              0.28              0.64
           26                  HCC              0.22              0.62
           1, 2, 30, 22        uncertain        0.13              0.58
           18                  MRN              0.13              0.58
           28, 29, 31          MRN              -0.09             0.50
Cluster 2  14, 21, 25, 32      MRN              ≤ 0.36            < 0.60
           4, 10               MRN              0.46              0.65
           11, 12, 13          MRN              0.89              0.95
           6, 7, 8, 9, 16, 17  MRN              0.92              0.99

To summarize, a remarkable result of this study is that most DNs are clustered according to our previous classification [2]. This confirms similarities between cases of a class and DNs classified in the same class, supporting the idea that the DN classification is reasonable.

4 Discussion

Diagnosis of early hepatocellular carcinoma is still a challenging task. Few papers have proposed the use of computerised systems to help pathologists. Di Giacomo et al. [3] approached this task by using logic reasoning in terms of conjunctive normal form systems; they did not provide any validation of their outcomes. Poon et al. [6] compared classification tree and neural networks for the identification of serological liver marker profiles, reporting similar diagnostic performances when differentiating the two classes. To our knowledge, no scientific paper addresses the classification problem from a histopathological viewpoint in terms of computerised systems.

In this paper machine learning methods were applied to support the early diagnosis of HCC and to get insight into the critical diagnosis problem of dysplastic lesions. In the medical field it is well known that, besides accuracy, the development of a computerised decision support system should take into account the interpretability of the model [5]. For this purpose, both supervised and unsupervised methods have been employed to extract more knowledge about the nature of DNs, whose malignancy is uncertain. In our previous study, feature selection algorithms reduced the feature space dimensionality and, moreover, a combination of five supervised algorithms built on the MRN-HCC data set was employed to obtain more reliable predictions on the DN data.

In the present study an unsupervised technique has been applied to investigate relations among data. The combination of the two techniques on the same dataset, together with the concordance of their results, allows us to provide pathologists with information that is as reliable as possible. Nevertheless, further biomedical investigations and analyses of uncertain cases are needed to confirm our results.

Acknowledgments

The authors thank Arjun Dhillon and Andrew Godfrey of the Royal Free and University College Medical School for their activity in the collection of data.

References

1. Anonymous. Terminology of nodular hepatocellular lesions. International Working Party. Hepatology, 22:983–993, 1995.

2. F. Ciocchetta, R. Dell’Anna, F. Demichelis, A. Sboner, A. P. Dhillon, A. Dhillon, A. Godfrey, and A. Quaglia. Knowledge discovery to support hepatocellular carcinoma early diagnosis. In International Joint Conference on Neural Network - Special Session: Knowledge Discovery, and Image and Signal Processing in Medicine, 2003.

3. P. D. Giacomo, G. Felici, R. Maceratini, and K. Truemper. Application of a new logic domain method for the diagnosis of hepatocellular carcinoma. In Proceedings of MEDINFO 2001. Amsterdam: IOS Press, 2001.

4. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

5. N. Lavrac. Selected techniques for data mining in medicine. Artif Intell Med,16(1):3–23, 1999.

6. T. C. W. Poon, A. T.-C. Chan, B. Zee, S. K.-W. Ho, T. S.-K. Mok, T. W.-T. Leung, and P. J. Johnson. Application of classification tree and neural network algorithms to the identification of serological liver marker profiles for the diagnosis of hepatocellular carcinoma. Oncology, 61:275–283, 2001.

7. A. Quaglia, S. Bhattacharjya, and A. P. Dhillon. Limitations of the histopathological diagnosis and prognostic assessment of hepatocellular carcinoma. Histopathology, 38:167–174, 2001.

8. A. Struyf, M. Hubert, and P. Rousseeuw. Clustering in an object-oriented environment. Journal of Statistical Software, 1(4), 1996.

9. W. Venables, D. M. Smith, and R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 1.6.1, 2002. http://www.r-project.org (last access Feb 28, 2003).

Analysis of Gene Expression Data by the Logic Minimization Approach

Dragan Gamberger1 and Nada Lavrac2

1 Rudjer Boskovic Institute, Bijenicka 54, 10000 Zagreb, [email protected]

2 Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, [email protected]

Abstract. This paper presents an application of machine learning algorithms based on inductive learning by logic minimization to the analysis of gene expression data. The characteristic properties of these data are a very large number of attributes (genes) and a relatively small number of examples (samples). Approaches to gene set reduction and to the detection of important disease markers are described. The results obtained on two well-known, publicly available gene expression classification problems are presented.

1 Introduction

This work addresses two well-known gene expression molecular cancer classification problems: the first problem is the distinction between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) described in [2], and the second problem is the multi-class cancer diagnosis for 14 different cancer types described in [5]. For the first problem, a training set with 38 samples (27 of type ALL and 11 of type AML) and a test set with 34 samples (20 of type ALL and 14 of type AML) are available. Every sample is described with expression values for 7129 genes. The second problem has 144 samples in the training set and 54 samples in the test set. Every sample is described with expression values for 16063 genes, where the first 7129 genes are the same as in the leukemia problem. Training and test data sets, together with the description files, can be downloaded from http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi .

In contrast to the work presented in [2, 5], in this work we have used only signal specificity (presence call with values A as absent, P as present, and M as medium), while the real-valued signal intensities have been ignored. Additionally, the M value was used as the ‘do not know’ value, so that the difference between two samples is accepted as significant only if, for some gene, one sample has the A value while the other sample has the P value. This is done with the intention of obtaining results which are not very sensitive to the quality and reproducibility of signal specificity measurements.
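The comparison rule described above can be sketched in a few lines. This is a minimal illustration, assuming each sample is a list of presence calls in the same gene order; the function names and the two sample vectors are hypothetical.

```python
def significant_difference(call_a, call_b):
    """True only for an A/P pair; M ('do not know') never counts as a difference."""
    return {call_a, call_b} == {"A", "P"}


def count_significant_differences(sample_1, sample_2):
    """Number of genes on which two samples differ significantly."""
    return sum(significant_difference(x, y) for x, y in zip(sample_1, sample_2))


# Hypothetical presence calls for five genes in two samples.
s1 = ["A", "P", "M", "A", "P"]
s2 = ["P", "P", "A", "M", "A"]
n_diff = count_significant_differences(s1, s2)
```

Only the first and last genes contribute here; the two pairs involving an M call are ignored, which is what makes the comparison tolerant of borderline measurements.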

The aim of this work is to demonstrate the applicability of the inductive learning by logic minimization (ILLM) approach to gene expression data analysis. The first goal is to reduce the complexity of the problem by detecting and eliminating irrelevant genes, while the second goal is to apply subgroup discovery methodology for the detection of disease markers which have good classification properties and which are interesting for expert interpretation, enabling a better understanding of gene functionality.

2 Detection and Elimination of Irrelevant Genes

The detection and elimination of irrelevant attributes (genes) from gene expression data sets is very important, as these problems are complex because of the large number of genes. The concept of relevancy has been introduced in [3].

Definition 1 (Irrelevant genes). A gene (attribute) X is irrelevant if there exists another gene (attribute) Y with the property that if, in any rule, X is substituted by Y, the rule will cover at least the same subset of examples. Otherwise, the attribute is relevant. A rule covers a target (non-target) class example if its body is true (false) for the example.

The consequence of this definition is that by using only the set of relevant attributes instead of the set of all available attributes (relevant and irrelevant), the quality of induced rules, in terms of rule coverage, will not decrease. The concept of logic minimization is defined for two-class problems. For multi-class problems there must always be a transformation to a set of two-class problems. The most common is the transformation of an n-class problem to a set of n two-class problems by the one-versus-all (OVA) approach. This methodology is also used in this work.
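The OVA transformation can be sketched as follows. This is a minimal illustration, assuming class labels are given as a list of strings; the function name and the four example labels are hypothetical.

```python
def one_versus_all(labels):
    """Turn an n-class labelling into n two-class labellings.

    Returns a dict mapping each class to a list of booleans,
    True where the sample belongs to that (target) class.
    """
    classes = sorted(set(labels))
    return {c: [label == c for label in labels] for c in classes}


# Hypothetical labels for four samples drawn from three cancer types.
labels = ["leukemia", "CNS", "lymphoma", "leukemia"]
problems = one_versus_all(labels)
```

Each entry of `problems` is one two-class problem in which the named class is the positive (target) class and all remaining samples form the negative class.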

The process of irrelevant gene detection is performed in three steps. It starts by detecting attributes which are unconditionally irrelevant. The second step is the detection and elimination of gene copies. The third and most important step takes into account the classification of samples. Every sample belongs either to the positive (target) or the negative (non-target) class. A gene may be important for class discrimination in only two ways: if it has many A values for the positive class and at the same time many P values for the negative class (this is called AP relevance), or if it has many P values for the positive class and many A values for the negative class (PA relevance). A gene is irrelevant if both its AP and PA relevances are such that there are other, more relevant genes that better distinguish between the positive and the negative class examples.
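One possible reading of the third step is sketched below. It is an assumption, not the authors' implementation: a gene is treated as irrelevant when, for both the AP and the PA direction, some other gene covers at least the same examples in the sense of Definition 1. The function names and the three-gene data set are hypothetical, and gene copies are assumed to have been removed beforehand, since two identical genes would otherwise eliminate each other.

```python
def coverage(calls, is_positive, pos_call, neg_call):
    """Indices of examples covered by the literal `gene == pos_call`.

    A positive example is covered when the gene has `pos_call`;
    a negative example is covered when the gene has `neg_call`
    (an M call covers nothing).
    """
    return {
        i
        for i, (call, pos) in enumerate(zip(calls, is_positive))
        if (pos and call == pos_call) or (not pos and call == neg_call)
    }


def is_irrelevant(gene, genes, is_positive):
    """genes: dict mapping gene name -> list of calls, one per sample."""
    for pos_call, neg_call in (("A", "P"), ("P", "A")):  # AP, then PA relevance
        own = coverage(genes[gene], is_positive, pos_call, neg_call)
        dominated = any(
            own <= coverage(calls, is_positive, pos_call, neg_call)
            for name, calls in genes.items()
            if name != gene
        )
        if not dominated:
            return False
    return True


# Hypothetical data: two positive samples followed by two negative samples.
is_positive = [True, True, False, False]
genes = {
    "g1": ["A", "A", "P", "P"],  # perfect AP marker
    "g2": ["A", "M", "P", "M"],  # weaker AP marker, dominated by g1
    "g3": ["P", "P", "A", "M"],  # the only gene with PA relevance
}
```

In this toy example only `g2` is discarded: everything it distinguishes is also distinguished by `g1`, whereas `g1` and `g3` each cover examples that no other gene covers.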

For the leukemia problem, the final number of relevant genes is only 1040. This means that more than 85% of the genes can be eliminated as irrelevant, significantly reducing the complexity of any data analysis process that may follow. For the multi-class cancer problem, the relevancy approach was tested on 14 different two-class OVA problems. Here the approach was less effective: the reduced attribute sets consisted of between 4385 and 10449 genes, with a mean value of 8095 genes. This still represents a reduction of about 50% in problem complexity.
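The reduction percentages follow directly from the gene counts given in the text, as the short calculation below shows. The function name is hypothetical; the numbers are those reported above.

```python
def reduction(total_genes, relevant_genes):
    """Fraction of genes eliminated as irrelevant."""
    return 1.0 - relevant_genes / total_genes


leukemia = reduction(7129, 1040)            # roughly 0.854
multi_class_mean = reduction(16063, 8095)   # roughly 0.496
multi_class_best = reduction(16063, 4385)   # smallest reduced attribute set
multi_class_worst = reduction(16063, 10449) # largest reduced attribute set
```

The leukemia figure confirms the "more than 85%" statement, and the mean multi-class figure the "about 50%" statement; the individual OVA problems range from roughly 35% to roughly 73% reduction.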


3 Detection of Disease Markers

In [5], very good classification results have been reported for the multi-class gene expression domain. The classifications are based on a support vector machine approach, which has the ability to incorporate the activity of many genes in the decision procedure. It was shown that the prediction quality increases when the number of genes used in the decision process increases. The genes with maximal weight in the classification process can be selected as target class markers. The problem is that the classification accuracy based on markers selected in this way is typically not very good. For the detection of a single marker or a very small set of genes as a successful marker, the approach based on boundary emerging patterns presented in [4] seems more appropriate. The methodology can find simple class markers with a very good prediction quality. If several genes have the same property, they are included in the so-called plateau space. All markers in this space are undoubtedly interesting for biomedical analysis.

The methodology for subgroup discovery [1] enables the detection of simple decision rules, similar to those based on boundary emerging patterns, but it has the property that inherent domain noise can be better tolerated because the induced rules are not required to exclude all the non-target class samples. In this way, more general target class descriptions can be found which better reflect actual biomedical relationships. Additionally, the SD approach enabled us to use signal specificity values A and P instead of numerical signal intensity values. It is much more difficult to find good rules in this case, when signal intensity cut-off points cannot be selected and when the M region is used as the ‘do not know’ value. In return, this approach favours the reliability and robustness of the induced knowledge, whose expert analysis may lead to the detection of important relations.

Subgroup discovery experiments start from the reduced data sets from which irrelevant genes have been eliminated by the approach presented in Section 2. For the first problem two rules were induced for each of the classes ALL and AML (Table 1). In the rules, genes are coded by their corresponding row number in the data sets. Table 3 presents complete gene names for the gene codes used. In order to better test the quality of the rules we have extracted leukemia samples from the multi-class problem, both from its training and test data sets, and used them as an additional test set. The sensitivity and specificity results are reported in the last column of Table 1.

The same procedure was repeated for the second problem, so that for every cancer type one rule was induced from the corresponding OVA subproblem. Table 2 presents the results. The results obtained on the test sets are typically poor, even though all results on the training sets are good. The rules are listed only for those cancers for which the precision obtained on the test set is better than 50%. It is important to notice that such precision was obtained in three out of four cases for cancers that have more than 8 samples in the training set. The conclusion is that the subgroup discovery algorithm SD is not appropriate for problems with a small number of examples, but that for larger training sets it can enable effective detection of important markers.


Table 1. Rules induced for the leukemia problem for classes ALL and AML with sensitivity and specificity values for the leukemia training and test set, as well as for the leukemia samples from the multi-class problem.

rule                          training set     test set         multi-class
                              sens.   spec.    sens.   spec.    sens.   spec.

for the ALL class
(G02288=A) AND (G00461=A)     26/27   11/11    19/20   13/14    8/20    10/10
(G01926=A) AND (G02394=A)     24/27   11/11    14/20   12/14    16/20   9/10

for the AML class
(G05039=P) AND (G03252=P)     11/11   27/27    9/14    18/20    7/10    19/20
(G00760=P) AND (G00048=P)     11/11   25/27    7/14    15/20    9/10    16/20

Table 2. Results obtained for 14 cancer types. Rules are given only for cancer types for which the precision on the test set is better than 50%.

cancer         training set       test set                     rule
               sens.    spec.     sens.   spec.   precision

breast         5/8      136/136   0/4     49/50   0%
prostate       6/8      136/136   0/6     45/48   0%
lung           6/8      136/136   1/4     49/50   50%
colorectal     6/8      136/136   0/4     46/50   0%
lymphoma       12/16    128/128   2/6     47/48   67%          (G04260=A)(G12187=A)(G15807=P)
bladder        6/8      136/136   1/3     50/51   50%
melanoma       7/8      136/136   2/2     51/52   67%          (G03244=A)(G06364=A)(G09629=A)
uterus adeno   7/8      136/136   0/2     44/52   0%
leukemia       19/24    120/120   6/6     47/48   86%          (G04743=A)(G09771=P)
renal          7/8      136/136   2/3     48/51   40%
pancreas       5/8      132/136   1/3     49/51   33%
ovary          6/8      136/136   1/4     45/50   17%
mesothelioma   6/8      136/136   0/3     50/51   0%
CNS            12/16    128/128   3/4     50/50   100%         (G05440=P)(G14622=P)

Especially encouraging are the results for leukemia, because induction from 24 target class samples resulted in a rule with only two genes which has 100% sensitivity and 86% precision on the test set, and for CNS, where 16 target class examples enabled induction of a rule with two genes that has 75% sensitivity and 100% precision on the test set.
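The precision column of Table 2 can be recomputed from the sensitivity and specificity counts, as the following sketch shows. It assumes the usual definition, precision = TP / (TP + FP), with the false positives obtained as the negatives that the rule fails to exclude; the function name is hypothetical.

```python
from fractions import Fraction


def precision(sens, spec):
    """Precision from the counts reported in Table 2.

    sens = (true positives, number of positive test samples)
    spec = (true negatives, number of negative test samples)
    """
    tp, _ = sens
    tn, negatives = spec
    fp = negatives - tn
    if tp + fp == 0:
        return None  # the rule fired on no test sample at all
    return Fraction(tp, tp + fp)


leukemia = precision((6, 6), (47, 48))  # 6 true and 1 false positive
cns = precision((3, 4), (50, 50))       # 3 true positives, no false positive
renal = precision((2, 3), (48, 51))     # 2 true and 3 false positives
```

The three values are 6/7, 1 and 2/5, which round to the 86%, 100% and 40% listed in the table.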

Conclusions

The results obtained on gene expression data demonstrate that the logic minimization methodology is interesting as a preprocessing tool which can enable the detection and elimination of irrelevant genes. The effectiveness of the approach decreases as the number of samples increases, but the reduction obtained, in the range of 50% to 85% for real-world problems, seems attractive. The results obtained by the subgroup discovery algorithm are promising because they are simple and can be accepted as real disease markers. The expert analysis of detected gene activities, especially at the level of the combination of activities of different genes, may lead to a better understanding of the disease itself.


Table 3. Features and descriptions of gene codes used in the rules.

gene code   feature                     gene description
G02288      M84526_at                   DF D component of complement (adipsin)
G00461      D49950_at                   Liver mRNA for interferon-gamma inducing factor
G01926      M31166_at                   PTX3 Pentaxin-related, rapidly induced by IL-1 beta
G02394      M95678_at                   PLCB2 Phospholipase C, beta 2
G05039      Y12670_at                   LEPR Leptin receptor
G03252      U46499_at                   GLUTATHIONE S-TRANSFERASE, MICROSOMAL
G00760      D88422_at                   CYSTATIN A
G00048      AFFX-HUMTFRR/M11507_5_at    AFFX-HUMTFRR/M11507_5_at (endogenous control)
G04743      X86809_at                   Major astrocytic phosphoprotein PEA-15
G09771      L37747_s_at                 LAMIN B1
G05440      U96136_at                   Delta-catenin mRNA, partial cds
G14622      RC_AA448280_at              EST: zw83h05.s1 Soares testis NHT Homo sapiens cDNA clone 782841 3’, mRNA sequence.

References

1. Gamberger, D. & Lavrac, N. (2002) Expert-guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research 17: 501–527.

2. Golub, T.R. et al. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531–537.

3. Lavrac, N., Gamberger, D. & Turney, P. (1997) A relevancy filter for constructive induction. IEEE Intelligent Systems & Their Applications 13: 50–56.

4. Li, J. & Wong, L. (2002) Geography of differences between two classes of data. In Proc. of 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD2002), 325–337, Springer.

5. Ramaswamy, S. et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. In Proc. Natl. Acad. Sci. USA, 98(26): 15149–15154.


A Journey through Clinical Applications of Multimethod Decision Trees

Petra Povalej1, Mitja Lenič1, Milojka Molan Štiglic3, Maja Skerbinjek Kavalar3, Jernej Završnik2, and Peter Kokol1

1 University of Maribor, Faculty of Electrical Engineering and Computer Science, Laboratory for System Design, SI-2000 Maribor, Slovenia

{Petra.Povalej,Mitja.Lenic,Kokol}@uni-mb.si

2 Adolf Drolc Health Centre, SI-2000 Maribor, Slovenia

3 Maribor Teaching Hospital, Department of Paediatric Surgery, SI-2000 Maribor, Slovenia

Abstract. We present a journey through successful applications of the multimethod approach to the induction of decision trees in knowledge extraction, discovery of new knowledge and early diagnosis, using the cases of asthma, cardiovascular problems and mitral valve prolapse. The results show that the multimethod approach is a powerful and promising technique that enables the confirmation of existing medical knowledge and, more interestingly, also enables the induction of new facts and hypotheses, which can reveal new interesting patterns and possibly improve the existing medical knowledge.

1 Introduction

Recently intelligent systems have proved to be a very useful and successful paradigm, which has often been used for decision support, data mining and knowledge discovery in various fields, especially in medicine. Intelligent systems are able to confirm existing hypotheses or generate new ones; in the longer term they thus support the advancement of medicine as a scientific discipline and, as a consequence, the development of new treatment, prediction and diagnostic methods. Intelligent decision support tools can process huge amounts of data available from previously solved cases, and they can suggest the probable diagnosis or extract knowledge based on the values of several attributes. Clearly, not all approaches are appropriate for this kind of task (black-box approaches, for example, are not), because clinical experts need to evaluate and validate the decision-making process induced by those tools. On the other hand, the classifiers induced by the computerized tools, once evaluated by a clinical expert, can be an important source of new knowledge on how to make a diagnosis based on the available attributes. In order to achieve this goal, the intelligent system should be easy to understand and straightforward. For that reason we decided to use only methods based on decision trees [1], since they provide a very important feature: the possibility of explaining the decisions in a way understandable by humans. We developed a new multimethod approach for decision tree induction, which in general combines various approaches with genetic algorithms.

In this paper we will focus on intelligent systems for knowledge extraction.


2 Multimethod Approach

Historically, different approaches to knowledge extraction have evolved [2], such as symbolic approaches and computational learning theory. Among them we can find many classical approaches, such as decision trees, rules, rough sets, case-based reasoning, neural networks, support vector machines, different fuzzy methodologies and ensemble methods [3], but they all have their advantages and limitations. Evolutionary approaches (EA) are also a good alternative, because they are not inherently limited to local solutions [4]. Recently, taking into account the limitations of classical approaches, many researchers have focused their research on hybrid approaches, following the assumption that only the synergetic combination of single models can unleash their full power [5].

Current studies show that the selection of an appropriate method for data analysis can be crucial for success. Therefore, for a given problem, different methods should be tried to increase the quality of the extracted knowledge. Following the previous paragraph, a logical step would also be to combine different methods into one more complex methodology in order to overcome the limitations of a single method. We noticed that almost all attempts to combine different methods use a loose coupling approach, where the methods work almost independently of each other. Therefore a lot of “luck” and trying out of many different combinations are needed to unify them into a “team”. Thus we decided to design a new approach that enables tight coupling of single methods. This new approach is called the multimethod approach [6]. As opposed to conventional hybrids, our idea is to dynamically combine and apply different methods in an order that is not predefined, so as to solve a single problem or a decomposition of that problem.

The multimethod approach introduces the idea of a population of different intelligent systems, individuals that can produce multiple comparably good solutions, which are incrementally improved using the EA approach. In order to enable knowledge sharing between different methods, support for transformation between the individual methods is provided. The initial population of intelligent systems is generated using different methods. In each generation, different operations appropriate for the individual knowledge representation are applied to improve existing intelligent systems and also to create new ones. This enables incremental refinement of the extracted knowledge, with different views on a given problem. For example, different induction methods, such as different purity measures, can simply be combined within a single decision tree. As long as the knowledge representation is the same, a combination of different methods is not a big obstacle. The main problem is how to combine methods that use different knowledge representations (for example neural networks and decision trees). In such cases we provide two alternatives: (1) to convert one knowledge representation into another, using already known methods, or (2) to combine both knowledge representations into a single intelligent system.

The first alternative requires implementation of the knowledge conversion (for example conversion of a neural network into a decision tree). Such conversions are not perfect and some of the knowledge is normally lost, but conversions can produce a different view of the presented problem that can lead to better results.


The second alternative requires some cut-points where knowledge representations can be merged. In a decision tree, internal nodes or decision leaves represent such cut-points (Fig. 1), i.e. a condition can be replaced by another intelligent system (for example a support vector machine, SVM). We call such trees hybrid decision trees.

Fig. 1. An example of a hybrid decision tree induced by the multimethod approach. Each node is induced with an appropriate method (GA – genetic algorithm, ID3, Gini, SVM, neural network, etc.)
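The cut-point idea behind hybrid decision trees can be sketched as a data structure in which the test at an internal node is any callable, so that a simple threshold and an embedded model are interchangeable. This is a minimal illustration and not the authors' implementation; the class names, the attribute names and the `linear_model` stand-in for an SVM are all hypothetical.

```python
class Leaf:
    """Decision leaf holding a class label."""

    def __init__(self, label):
        self.label = label

    def classify(self, sample):
        return self.label


class Node:
    """Internal node whose test is any callable returning True or False."""

    def __init__(self, test, if_true, if_false):
        self.test = test
        self.if_true = if_true
        self.if_false = if_false

    def classify(self, sample):
        branch = self.if_true if self.test(sample) else self.if_false
        return branch.classify(sample)


def threshold(attribute, cut):
    """Conventional univariate test, as produced by ID3- or Gini-style induction."""
    return lambda sample: sample[attribute] > cut


def linear_model(weights, bias):
    """Stand-in for an embedded learner such as an SVM (hypothetical weights)."""
    return lambda sample: sum(w * sample[k] for k, w in weights.items()) + bias > 0


tree = Node(
    threshold("age", 4.7),
    Node(linear_model({"x": 1.0, "y": -2.0}, 0.5), Leaf("positive"), Leaf("negative")),
    Leaf("negative"),
)
```

Because every node exposes the same `classify` interface, a condition induced by one method can be swapped for a model induced by another without changing the rest of the tree.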

3 Clinical Application of Multimethod Decision Trees

In this section we present the very successful application of the multimethod approach to three real-world medical problems: the asthma, cardiovascular and mitral valve prolapse (MVP) databases. Our intentions were twofold: first, to confirm existing knowledge about the selected medical problems and second, and certainly more interesting, to find some qualitatively new knowledge. In the case studies below we used the multimethod approach to induce hybrid decision trees. The results were then discussed and evaluated by clinical domain experts.

Asthma Database. The aim of the study was to establish the most important risk factors for the diagnosis of asthma in very young children, who are unable to perform pulmonary function tests, which are the gold standard in older children. The study included 106 children aged 2-8 years. Clinical data and data gathered from blood examinations were collected and used for intelligent analysis based on multimethod decision tree induction.

Results. The medical analysis of the decision tree shows strong agreement with existing medical knowledge. The main findings of our intelligent analysis from the medical point of view are:

− IgE mite, total IgE and EOZ are the most important indicators for diagnosing asthma in children older than 4.7 years.

− An increased absolute number of EOZ (>439.3) in children older than 4.7 years with a low level of IgE mite (<=1.2) and total IgE (<=563.3 units) is a positive indicator of asthma. On the other hand, a lower number of EOZ indicates a positive diagnosis of asthma in children older than 6.7 years and a negative diagnosis in younger children.


Cardiovascular Database. Cardiovascular disease is a prime cause of death in people under the age of 24; therefore early and accurate identification of cardiovascular problems in paediatric patients is of vital importance. This study used the paediatric records of 100 young patients from Maribor Hospital, containing general data (age, sex, etc.), health status (family history, previous illnesses), general cardiovascular data (blood pressure, pulse, chest pain, etc.) and more specialized cardiovascular data (cardiac history, ultrasound and ECG examinations). Each patient was diagnosed with one of the following: innocent heart murmur, congenital heart disease, or palpitations with chest pain.

Results. The results revealed many medically acceptable logical rules:

− The occurrence of other congenital malformations is related to an increased risk of associated congenital heart disease.

− Recurrent tonsillitis can lead to damage of the heart valves or myocardium.

− Convulsions in children and a possible severe aortic valve disease are related.

− Febrile illness in children and a possible innocent heart murmur due to tachycardia are connected.

Some rules in the induced decision tree also represent possible new knowledge:

− A close relation between a history of operations under general anaesthesia in childhood and the appearance of different heart arrhythmias was discovered. It is well known that such arrhythmias can appear during surgery. The system, however, has discovered that surgery can also be an important cause of arrhythmias occurring later in a child’s life.

MVP Database. Mitral valve prolapse is one of the most prevalent cardiac conditions, which may affect from five to ten percent of the population, and is one of the most controversial ones. Using the Monte Carlo sampling method, 900 children and adolescents were selected to represent the whole population under eighteen years of age. They were routinely called for echocardiography regardless of prior findings. Of these, 631 passed an examination of their health state in the form of a carefully prepared protocol made specifically for the MVP syndrome. The protocol consisted of 103 parameters that can possibly indicate the presence of MVP. The three decision classes were: 5% “prolapse”, 6% “silent prolapse”, and 89% “no prolapse”.

Results. Again some well-known medical diagnostic pathways were shown, but surprisingly the most informative attribute for diagnosing silent prolapse in the decision tree was the systolic index (SI) value. A possible medical explanation lies in the SI values, which are directly correlated with systolic and diastolic diameters. Systolic and diastolic parameters represent indirect functional values of the heart muscle. In a heart with a prolapsed mitral valve the end systolic pressure could be slightly higher than in normal hearts, since the mitral valve descends into the left atrium. We can reasonably speculate that, because of this unusual movement of the leaflets of the prolapsed mitral valve, the SI is actually higher in the normal hearts.



As presented above, all decision trees induced with our multimethod approach were evaluated favourably by clinical experts. The results obtained equip physicians with a powerful technique to a) confirm their existing knowledge about some medical problems and b) search for new facts, which may reveal new interesting patterns and possibly improve the existing medical knowledge.

Table 1. A comparison of multimethod approach with other conventional and evolutionary approaches for decision tree induction

Data set / Method          Asthma           Cardiovascular   MVP
                           total   class    total   class    total   class
Genetic                    81.57   80.53    87.87   90.30    92.31   65.83
Multimethod                86.84   87.53    87.87   90.30    93.84   84.31
Greedy ID3                 76.31   75.21    63.63   61.90    88.46   73.39
Boost Greedy ID3           84.21   83.47    75.75   81.41    91.54   70.94
Greedy Chi square          76.31   75.21    63.28   61.90    90.00   82.91
Boost Greedy Chi square    50.00   49.72    81.81   82.51    90.77   57.21
Greedy Gini                76.31   75.21    63.63   61.90    88.46   77.87
Boost Greedy Gini          57.89   57.98    51.51   53.70    87.69   64.15
Greedy J measure           76.31   75.21    63.63   61.90    83.08   50.84
Boost Greedy J measure     47.36   46.77    54.54   59.28    89.23   53.08

total: total accuracy on the test set (%); class: average class accuracy on the test set (%)

In order to evaluate the quality of the multimethod approach, its performance in classifying new, unseen cases was compared with that of other well-known approaches (Table 1). In terms of total accuracy and average class accuracy on unseen cases, our multimethod approach outperformed all other approaches on each database, even on the cardiovascular database when we also consider the size of the induced decision trees. An explanation for the better performance can be found in the concept of the multimethod approach, which tries to combine the benefits of classical and evolutionary approaches and reduces the search space with the help of conventional methods.
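The difference between the two measures reported in Table 1 matters for imbalanced data such as the MVP database. The sketch below assumes that average class accuracy means the unweighted mean of the per-class accuracies, which the text does not state explicitly; the function names and the trivial classifier are hypothetical, while the class proportions are those given for the MVP database.

```python
from collections import Counter


def total_accuracy(y_true, y_pred):
    """Share of all cases that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_class_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies (assumed definition)."""
    per_class_total = Counter(y_true)
    per_class_hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(per_class_hits[c] / n for c, n in per_class_total.items()) / len(
        per_class_total
    )


# Class proportions of the MVP database and a classifier that always
# predicts the majority class.
y_true = ["no prolapse"] * 89 + ["silent prolapse"] * 6 + ["prolapse"] * 5
y_pred = ["no prolapse"] * 100

total = total_accuracy(y_true, y_pred)
class_avg = average_class_accuracy(y_true, y_pred)
```

The trivial classifier reaches a total accuracy of 0.89 but an average class accuracy of only one third, which is why a high total accuracy on MVP says little unless the class average is high as well.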

References

1. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publ., (1993)

2. Thrun, S., Pratt, L.: Learning to Learn. Kluwer Academic Publishers (1998)

3. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, Springer-Verlag, (2000), 1-15

4. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley, Reading MA (1989)

5. Iglesias, C.J.: The Role of Hybrid Systems in Intelligent Data Management: The Case of Fuzzy/neural Hybrids, Control Engineering Practice, Vol. 4, No. 6 (1996) 839-845

6. Lenič, M., Kokol, P.: Combining classifiers with multimethod approach. In Soft computing systems: design, management and applications, IOS Press, (2002) 374-383

Detailing Test Characteristics for Probabilistic Networks

Danielle Sent and Linda C. van der Gaag

Institute of Information and Computing Sciences, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, the Netherlands

{danielle,linda}@cs.uu.nl

Abstract. In the medical domain, establishing a diagnosis typically amounts to reasoning about the unobservable truth, based upon a set of indirect observations from diagnostic tests. A diagnostic test may not be perfectly reliable, however. To avoid misdiagnosis, therefore, the reliability characteristics of the test should be taken into account upon reasoning. In this paper, we address the issue of modelling such characteristics in a probabilistic network. We argue that the standard reliability characteristics that are generally available from the literature have to be further detailed, for example by experts, before they can be included in a network. We illustrate this and related modelling issues by means of a real-life probabilistic network in oncology.

1 Introduction

Establishing a diagnosis for a patient in essence amounts to constructing a hypothesis about the disease the patient is suffering from. To this end, typically a number of diagnostic tests are performed. To establish the presence or absence of lung cancer, for example, an X-ray of the patient’s chest is made. Diagnostic tests, however, generally do not unambiguously reveal the true condition of a patient. An X-ray, for example, can be difficult to interpret: a physician may easily overlook a small tumour and state a negative result, or state a positive result based upon a phantom image. To avoid misdiagnosis, therefore, the reliability characteristics of the various diagnostic tests employed should be taken into consideration upon constructing a diagnostic hypothesis [1, 2]. Specialised concepts have been designed for expressing the uncertainties in a test’s results. The sensitivity of a diagnostic test is defined as the probability that a positive result is found in a patient who actually has the disease tested for; the test’s specificity is the probability of finding a negative result in a patient without the disease. The medical literature lists these characteristics for most diagnostic tests [3].

In the medical domain, knowledge-based systems are being developed for a wide range of diagnostic applications. These systems increasingly build upon a probabilistic network for their knowledge representation. A probabilistic network is a representation of a joint probability distribution on a set of statistical variables, consisting of a graphical structure and an associated set of probabilities [4]. Since a probabilistic network uniquely defines a probability distribution, it allows for computing any probability of interest over its variables. Diagnostic reasoning with a probabilistic network, for example, amounts to entering the symptoms, signs, and test results for a patient into the network and computing the posterior probabilities for the various possible diseases.

For correct diagnostic reasoning, a probabilistic network has to capture the reliability characteristics of the various diagnostic tests that are represented in its graphical structure. In this paper, we address the issue of modelling these characteristics. We argue that a probabilistic network has to distinguish explicitly between variables that represent test results on the one hand and variables that model the unobservable truth on the other hand. Failing to explicitly distinguish between these two types of variable would amount to assuming that test results unambiguously reflect the underlying truth and, hence, that diagnostic tests have a sensitivity and specificity both equal to 1.00 [5]. We further argue that the concepts of sensitivity and specificity often have to be further detailed before they can be included in a probabilistic network: while the standard concepts are defined for binary variables, a network may include non-binary variables for which more detailed reliability characteristics are required. In addition, the network may require these characteristics for several different subgroups of patients. We illustrate these as well as various closely related modelling issues by means of a real-life probabilistic network in oncology.

The paper is organised as follows. In Section 2, we provide some preliminaries on probabilistic networks in general and describe the oesophagus network more specifically. Section 3 introduces the standard concepts of sensitivity and specificity, and describes how these concepts can be further detailed for inclusion in a probabilistic network. Section 4 discusses the closely related concept of predictive value of a diagnostic test and shows how it can be computed from a probabilistic network. The paper ends with our conclusions in Section 5.

2 Preliminaries

A probabilistic network is a concise representation of a joint probability distribution Pr on a set of statistical variables. It consists of a directed acyclic graph and an associated set of numerical parameters. Each node in the graph represents a variable that can take a value from among a set of mutually exclusive, discrete values. Variables are indicated by capital letters, for example A; the possible values of A are denoted a1, . . . , an, n ≥ 1. The arcs in the graph represent direct influential relationships between the variables. More formally, the set of arcs captures probabilistic independence: two variables are said to be independent given the available observations if every chain between the two variables contains an observed variable with at least one emanating arc, or a variable with two incoming arcs such that neither the variable itself nor any of its descendants in the graph have been observed.

Associated with the graph of the network are numerical parameters that describe the strengths of the relationships between the variables. With each variable A are associated conditional probabilities p(A | π(A)) that describe the joint influence of the various values of the variable’s parents π(A) on the probabilities of the values of A itself. The graphical structure and its associated probabilities serve to uniquely define a joint probability distribution. A probabilistic network thus provides for computing any probability of interest over its variables. In the sequel, we explicitly distinguish between computed probabilities, for example Pr(A), and the parameter probabilities p(A | π(A)).

Fig. 1. The oesophagus network

In this paper, we use a real-life probabilistic network to illustrate various modelling issues. The oesophagus network provides for the staging of oesophageal cancer and has been developed with the help of two experts in gastrointestinal oncology from the Netherlands Cancer Institute, Antoni van Leeuwenhoekhuis. It models important properties of an oesophageal tumour, such as its length and macroscopic shape, and describes the pathophysiological processes that influence its growth and metastasis. The main diagnostic variable of the network is the variable Stage that captures the degree to which the cancer has progressed. The oesophagus network currently includes 42 variables, for which almost 1000 probabilities have been specified. Of these 42 variables, 25 serve to model the results of the various diagnostic tests in use. Fig. 1 depicts the graphical structure of the network and the prior probabilities per variable. For further details of the oesophagus network we refer the reader to [6].


3 Test Characteristics

In the medical domain, establishing a diagnosis for a patient typically involves performing a number of diagnostic tests. The results of these tests serve to reveal, to at least some extent, the underlying truth about the disease the patient is suffering from. As we have argued before, the reliability characteristics of these tests should be taken into account upon reasoning about the various possible diseases. A probabilistic network developed for a diagnostic application, therefore, should explicitly capture these characteristics.

3.1 The Standard Concepts of Sensitivity and Specificity

The standard concepts of sensitivity and specificity have been designed to capture the uncertainty in the results of a diagnostic test and thereby constitute the test’s reliability characteristics. The concepts are defined in terms of two binary variables. We use the variable D to represent the presence, indicated by the value d, or absence, indicated by $\bar{d}$, of a specific disease. The variable T describes the result of an associated diagnostic test, where the value t models a positive test result, suggesting presence of the disease, and $\bar{t}$ models a negative result, suggesting absence of the disease. The sensitivity of the test now is defined as the probability that a positive test result is found in a patient who actually has the disease. More formally, it is defined as Pr(t | d). The specificity of the test is the probability Pr($\bar{t}$ | $\bar{d}$) that the test yields a negative result for a patient without the disease. The sensitivity and specificity of a diagnostic test are generally established in one or more clinical studies [1]. The medical literature lists the sensitivities and specificities for most diagnostic tests [3].

3.2 Sensitivity and Specificity in a Probabilistic Network

To provide for modelling the reliability characteristics of diagnostic tests, a probabilistic network has to distinguish between variables that represent test results on the one hand and variables that represent the unobservable truth on the other hand. Only by such an explicit distinction can the relationship between a test’s results and the underlying truth be represented. As an example of the two types of variable, Fig. 2(a) depicts a fragment of the oesophagus network. The variable Metas-liver describes the true absence or presence of liver metastases in a patient; the variables Lapa-liver and CT-liver represent the results from a laparoscopic examination and from a CT-scan of the patient’s liver, respectively. As the true absence or presence of liver metastases directly influences the results of the two diagnostic tests, the relationships between the variables are directed from Metas-liver to the test variables. More formally, the represented relationships capture the assumption that the results of the two diagnostic tests are conditionally independent given the true value of Metas-liver. We would like to note that this assumption of conditional independence is commonly made in medical decision making [7]. To conclude our example, the strength of the relationship between the variables Metas-liver and CT-liver is captured by the four parameter probabilities p(CT-liver = yes | Metas-liver = yes), p(CT-liver = yes | Metas-liver = no), p(CT-liver = no | Metas-liver = yes), and p(CT-liver = no | Metas-liver = no). We observe that the first and the last of these probabilities coincide with the sensitivity and specificity of the CT-scan; the other two probabilities are the complements of these characteristics. The table from Fig. 2(b) now shows the probabilities that have been specified for the oesophagus network: the CT-scan is stated to have a sensitivity of 0.90 and a specificity of 0.95.

(a) Graph fragment: arcs from Metas-liver to Lapa-liver and from Metas-liver to CT-liver.

(b)
                   Metas-liver
                   yes     no
    CT-liver  yes  0.90    0.05
              no   0.10    0.95

Fig. 2. A fragment of the oesophagus network and some associated probabilities

3.3 More Detailed Sensitivity and Specificity

The standard concepts of sensitivity and specificity are typically defined for binary variables only. The restriction to binary variables has its origin in the use of these concepts. Physicians use the results from diagnostic tests, on seeing a patient, to decide whether or not to treat, or to decide whether or not to order another series of tests. These decisions in essence are binary and have to be taken within a relatively short time span. For taking such decisions, more detailed characteristics appear to be confusing rather than helpful [1]. There is no mathematical reason, however, for the restriction to binary variables.

While practical for human decision making, the use of just binary variables may be an oversimplification of reality in view of computer-supported decision making. In a probabilistic network, for example, domain knowledge is typically represented in more detail, where statistical variables, whether modelling test results or the underlying truth, may have as many values as needed to accommodate the domain’s intricacies. For modelling reliability characteristics that involve at least one non-binary variable, the standard concepts of sensitivity and specificity no longer suffice. We have to further detail these concepts to provide for more than two values. Because the sensitivity and specificity of a diagnostic test are well-defined in terms of probability theory, we can build upon this mathematical foundation for their refinement.

We start by addressing a non-binary disease variable D with values d1, . . . , dn, n > 2, and a binary test variable T with the values t and $\bar{t}$. As an example, we consider from the oesophagus network the variable Invasion-organs, modelling the true depth of invasion of an oesophageal tumour into organs adjacent to the oesophagus. The variable can adopt one of five values: none, trachea, mediastinum, diaphragm and heart. In order to establish the depth of invasion beyond the oesophageal wall, typically a number of diagnostic tests are performed. The purpose of a bronchoscopy, to this end, is to reveal invasion of the primary tumour into the trachea. The values of the variable Bronchoscopy are yes, indicating that invasion into the trachea is visible, and no. The strength of the relationship between the two variables is captured by the parameter probabilities p(Bronchoscopy | Invasion-organs). The probabilities that have been specified for the oesophagus network are shown in Table 1.

Table 1. The parameter probabilities for the variable Bronchoscopy

                        Invasion-organs
    Bronchoscopy   none   trachea   mediastinum   diaphragm   heart
    yes            0.04   0.92      0.10          0.01        0.25
    no             0.96   0.08      0.90          0.99        0.75

We observe that the sensitivity of the bronchoscopy to tumour invasion into the trachea again coincides with one of the parameter probabilities: the sensitivity equals p(Bronchoscopy = yes | Invasion-organs = trachea) = 0.92. The specificity Pr(Bronchoscopy = no | Invasion-organs ≠ trachea), however, is not captured by a single parameter. The four probabilities conditioning Bronchoscopy = no on the values none, mediastinum, diaphragm and heart for Invasion-organs can nevertheless be looked upon as more detailed specificities, from which the standard overall specificity can be reconstructed.

For a non-binary disease variable D with values d1, . . . , dn, n > 2, and a binary test variable T with the values t, suggesting the presence of disease di, and $\bar{t}$, suggesting absence of di, we have for the overall specificity of the test that

$$
\begin{aligned}
\Pr(\bar{t} \mid \bar{d}_i)
&= \Pr\Bigl(\bar{t} \Bigm| \bigvee_{\substack{j=1,\dots,n \\ j \neq i}} d_j\Bigr)
 = \frac{\Pr\Bigl(\bar{t} \wedge \bigl(\bigvee_{j=1,\dots,n,\, j \neq i} d_j\bigr)\Bigr)}{\Pr\Bigl(\bigvee_{j=1,\dots,n,\, j \neq i} d_j\Bigr)} \\
&= \frac{\Pr\Bigl(\bigvee_{j=1,\dots,n,\, j \neq i} (\bar{t} \wedge d_j)\Bigr)}{\Pr\Bigl(\bigvee_{j=1,\dots,n,\, j \neq i} d_j\Bigr)}
 = \frac{\sum_{j=1,\dots,n,\, j \neq i} \Pr(\bar{t} \wedge d_j)}{\sum_{j=1,\dots,n,\, j \neq i} \Pr(d_j)} \\
&= \frac{\sum_{j=1,\dots,n,\, j \neq i} p(\bar{t} \mid d_j) \cdot \Pr(d_j)}{\sum_{j=1,\dots,n,\, j \neq i} \Pr(d_j)}
\end{aligned}
\tag{1}
$$

The probabilities p($\bar{t}$ | dj), j = 1, . . . , n, j ≠ i, in the formula coincide with parameter probabilities for the variable T. Note that these probabilities convey the same type of information as the test’s specificity, albeit in more detail. Further note that the more detailed specificities p($\bar{t}$ | dj) do not suffice for computing the standard overall specificity of the test: in addition, the prior probabilities Pr(dj) for the disease variable D are required. These prior probabilities, however, are readily computed from the network.

To return to our example, we are interested in the overall specificity of the bronchoscopy, Pr(Bronchoscopy = no | Invasion-organs ≠ trachea). From the network, we find the prior probabilities Pr(none) = 0.9193, Pr(mediastinum) = 0.0234, Pr(diaphragm) = 0.0202, and Pr(heart) = 0.0073. We thus find


$$
\frac{0.96 \cdot 0.9193 + 0.90 \cdot 0.0234 + 0.99 \cdot 0.0202 + 0.75 \cdot 0.0073}{0.9193 + 0.0234 + 0.0202 + 0.0073} \approx 0.96
\tag{2}
$$

for the specificity of the bronchoscopy.

The previous observations can be extended to apply to non-binary test variables as well as non-binary disease variables. We consider a non-binary disease variable D with values d1, . . . , dn, n > 2, as before and a non-binary test variable T with values t1, . . . , tn, n > 2; for ease of exposition, we assume that there is a one-to-one relation between a disease di and a test result ti. An example from the oesophagus network is the variable Shape, modelling the macroscopic shape of a primary oesophageal tumour, and the variable Gastro-shape, modelling the shape visible on a gastroscopic image. The parameter probabilities p(ti | di) for the variable T can now be looked upon as the more detailed sensitivities of the test to the various different diseases. We take the overall sensitivity to be the weighted sum

$$
\sum_{i=1,\dots,n} p(t_i \mid d_i) \cdot \Pr(d_i)
\tag{3}
$$

of these more detailed sensitivities. Note that for computing the overall sensitivity once again the prior probabilities Pr(di) for the disease variable are required. The more detailed specificities Pr($\bar{t}_i$ | $\bar{d}_i$) are computed from

$$
\Pr(\bar{t}_i \mid \bar{d}_i) =
\frac{\sum_{j=1,\dots,n,\, j \neq i} \; \sum_{k=1,\dots,m,\, k \neq i} p(t_k \mid d_j) \cdot \Pr(d_j)}{\sum_{j=1,\dots,n,\, j \neq i} \Pr(d_j)}
\tag{4}
$$

building upon the observations mentioned before. For the overall specificity of the test, we once again take the weighted sum

$$
\sum_{i=1,\dots,n} \Pr(\bar{t}_i \mid \bar{d}_i) \cdot \Pr(d_i)
\tag{5}
$$

of these more detailed specificities. Since the values of the disease variable are mutually exclusive and collectively exhaustive, and a similar observation applies to the values of the test variable, we have that the sensitivities and specificities for tests with a one-to-many or a many-to-one relation of their values to the values of the disease variable can be readily computed as well.

To conclude, we would like to note that, while the standard characteristics of diagnostic tests can typically be found in the literature, the more detailed sensitivities and specificities will not be readily available. These characteristics will therefore have to be assessed by experts in the domain of application.
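As a minimal sketch of how formula (1) can be evaluated, the following Python fragment recomputes the overall specificity of the bronchoscopy from the detailed specificities of Table 1 and the prior probabilities quoted in the text. The function name `overall_specificity` and the dictionary layout are illustrative assumptions, not part of the oesophagus network software.

```python
# Detailed specificities p(Bronchoscopy = no | Invasion-organs = v), from Table 1.
P_NO_GIVEN = {
    "none": 0.96,
    "trachea": 0.08,
    "mediastinum": 0.90,
    "diaphragm": 0.99,
    "heart": 0.75,
}

# Prior probabilities Pr(Invasion-organs = v) as quoted in the text.
# The prior for trachea is not needed, because that value is excluded.
PRIOR = {
    "none": 0.9193,
    "mediastinum": 0.0234,
    "diaphragm": 0.0202,
    "heart": 0.0073,
}


def overall_specificity(p_neg_given, prior, target):
    """Formula (1): prior-weighted average of the detailed specificities
    over all disease values other than `target` (hypothetical helper)."""
    others = [v for v in prior if v != target]
    numerator = sum(p_neg_given[v] * prior[v] for v in others)
    denominator = sum(prior[v] for v in others)
    return numerator / denominator


specificity = overall_specificity(P_NO_GIVEN, PRIOR, target="trachea")
print(round(specificity, 2))  # prints 0.96, in agreement with (2)
```

The value none dominates the weighted average because its prior probability is by far the largest, which is why the overall specificity is so close to the detailed specificity of 0.96 for none.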

3.4 Stratified Sensitivity and Specificity

In the domain of cancer of the oesophagus, the reliability characteristics of some of the diagnostic tests depend on the patient’s ability to swallow food. The results of a barium swallow with fluoroscopy to determine the presence or absence of a tracheoesophageal fistula, for example, are less reliable in a patient in whom the passage of food is seriously impaired. To further extend on the idea of including more detailed sensitivities and specificities in a probabilistic network, we now address modelling reliability characteristics that are dependent upon other variables than just the disease and test variables.

Table 2. The parameter probabilities for the variable X-fistula

    Fistula                 yes                             no
    Passage       solid  puree  liquid  none     solid  puree  liquid  none
    X-fistula
      yes         0.88   0.88   0.88    0.85     0      0      0       0
      no          0.10   0.10   0.10    0.05     1.00   1.00   1.00    0.90
      non-det     0.02   0.02   0.02    0.10     0      0      0       0.10

To capture the influence of a third variable on a test’s reliability characteristics, a probabilistic network has to explicitly model the dependence in its graphical structure. In the oesophagus network, for example, the variable X-fistula, modelling the result of the barium swallow, has an incoming arc not just from the variable Fistula but also from the variable Passage. The relationship between the three variables is quantified by the parameter probabilities p(X-fistula | Fistula ∧ Passage). The probabilities that have been specified for the oesophagus network are shown in Table 2. Unlike in the previous examples, the sensitivity of the barium swallow no longer coincides with one of the parameter probabilities. While the probabilities p(X-fistula = yes | Fistula = yes ∧ Passage) for the various values of the variable Passage convey the same type of information as the standard sensitivity, they do so for different groups of patients. We say that the sensitivity is stratified over the variable Passage. The overall sensitivity and specificity of the barium swallow can now be reconstructed from these stratified sensitivities and specificities.

We address a binary disease variable D and a binary test variable T. To indicate the different subgroups of patients, we define a new variable G with values g1, . . . , gm, m > 1. The stratified sensitivities of the test are now given by the parameter probabilities p(t | d ∧ gi), i = 1, . . . , m, for T. The overall sensitivity of the test is taken to be

$$
\sum_{i=1,\dots,m} p(t \mid d \wedge g_i) \cdot \Pr(g_i \mid d)
\tag{6}
$$

Similar observations apply to the stratified specificities of the test and its overall specificity. Moreover, the concepts of stratified sensitivity and specificity can be extended to apply to non-binary disease and test variables as presented in the previous section.
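Formula (6) can be illustrated with the stratified sensitivities of the barium swallow from Table 2. The paper does not report the probabilities Pr(Passage | Fistula = yes), so the weights below are purely hypothetical placeholders chosen for illustration; in practice they would be computed from the network. The names used are likewise assumptions of this sketch.

```python
# Stratified sensitivities p(X-fistula = yes | Fistula = yes, Passage = g), from Table 2.
STRATIFIED_SENSITIVITY = {
    "solid": 0.88,
    "puree": 0.88,
    "liquid": 0.88,
    "none": 0.85,
}

# HYPOTHETICAL weights Pr(Passage = g | Fistula = yes); not taken from the paper.
PASSAGE_GIVEN_FISTULA = {
    "solid": 0.10,
    "puree": 0.20,
    "liquid": 0.40,
    "none": 0.30,
}


def overall_sensitivity(stratified, group_given_disease):
    """Formula (6): sum over groups of p(t | d, g_i) * Pr(g_i | d)."""
    total_weight = sum(group_given_disease.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("group probabilities must sum to 1")
    return sum(stratified[g] * w for g, w in group_given_disease.items())


overall = overall_sensitivity(STRATIFIED_SENSITIVITY, PASSAGE_GIVEN_FISTULA)
print(round(overall, 3))  # 0.871 under the hypothetical weights above
```

Whatever weights are used, the overall sensitivity is a convex combination of the stratified values and therefore lies between 0.85 and 0.88 for this test.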

4 Predictive Value

In the previous section, we have studied the reliability characteristics of diagnostic tests, and focused on modelling these in a probabilistic network. The sensitivity and specificity of a diagnostic test indicate how likely the test is to yield a result that matches the true presence or absence of the disease under study. Once a test result has become available, however, we are no longer interested in the probabilities of finding a positive or a negative result: we are much more interested in the effects of the test result found on the probability distribution over the disease variable. This effect is captured by the concept of predictive value. The standard concept of predictive value is once again defined in terms of a binary disease variable D and a binary test variable T. The predictive value of a positive test result now is the probability Pr(d | t) of the disease indeed being present in a patient with a positive test result. The predictive value of a negative test result is the probability Pr($\bar{d}$ | $\bar{t}$) of the disease being absent in a negatively-tested patient.

While the sensitivity and specificity of a diagnostic test are modelled explicitly in a probabilistic network, the predictive values of its results are not. The predictive value of a positive test result, however, can be computed through the following formula [2]:

$$
\Pr(d \mid t) = \frac{p(t \mid d) \cdot \Pr(d)}{p(t \mid d) \cdot \Pr(d) + \bigl(1 - p(\bar{t} \mid \bar{d})\bigr) \cdot \bigl(1 - \Pr(d)\bigr)}
\tag{7}
$$

The formula expresses the predictive value in terms of the sensitivity and specificity of the test. Since these are explicitly specified in the network, we only need the prior probability of the disease to compute the predictive value. As we have mentioned before, this probability is readily computed from the network. Similar observations hold for the predictive value of a negative test result.
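The computation in formula (7) can be sketched in a few lines of Python. The sensitivity and specificity are those stated for the CT-scan of the liver in Fig. 2(b); the prior probability of liver metastases is not reported in the paper, so the value 0.10 used here is a hypothetical placeholder. The negative predictive value follows from the analogous application of Bayes' rule.

```python
def positive_predictive_value(sensitivity, specificity, prior):
    """Formula (7): Pr(d | t) from sensitivity, specificity and Pr(d)."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(sensitivity, specificity, prior):
    """Analogous Bayes' rule computation of Pr(not d | not t)."""
    true_neg = specificity * (1.0 - prior)
    false_neg = (1.0 - sensitivity) * prior
    return true_neg / (true_neg + false_neg)


SENSITIVITY = 0.90   # p(CT-liver = yes | Metas-liver = yes), Fig. 2(b)
SPECIFICITY = 0.95   # p(CT-liver = no | Metas-liver = no), Fig. 2(b)
PRIOR = 0.10         # HYPOTHETICAL Pr(Metas-liver = yes); not from the paper

ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, PRIOR)
npv = negative_predictive_value(SENSITIVITY, SPECIFICITY, PRIOR)
print(round(ppv, 3), round(npv, 3))
```

Under this assumed prior, a positive scan raises the probability of metastases from 0.10 to about two thirds, which shows how strongly the predictive value depends on the prior even for a fairly reliable test.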

To conclude, we would like to observe that we have modelled a test’s reliability characteristics with respect to the underlying truth and not with respect to the main disease variable. For example, the reliability characteristics of an X-ray of a patient’s chest have been expressed in terms of the presence of secondary tumours in the lungs and not in terms of the stage of the patient’s cancer. As a consequence, the predictive value of a test result pertains to the underlying truth and does not provide directly for establishing the main diagnosis. Since the various different tests are carefully embedded within the context of an overall network, however, the predictive value of a test result serves to indirectly update the probabilities of the main disease variable.

5 Conclusions

We have addressed the issue of modelling reliability characteristics of diagnostic tests in probabilistic networks. In the medical domain, the reliability of a test is generally expressed in terms of its sensitivity and specificity. We have argued that to capture these characteristics, a probabilistic network has to explicitly distinguish between variables that represent test results and variables that represent the underlying truth. Moreover, the standard concepts of sensitivity and specificity have to be further detailed to accommodate non-binary disease and test variables. We have also argued that stratified reliability characteristics are required for different groups of patients, when the result of a test does not just depend on the underlying truth but on a third variable as well. We have shown that the standard concepts can be readily reconstructed from the more detailed, and possibly stratified, sensitivities and specificities.

Once a test result has become available, we are no longer interested in the test’s sensitivity and specificity, but rather in the effects of the result on the probability distribution over the disease variable. We have argued that this predictive value can be computed from the probabilistic network, using the detailed sensitivities and specificities. While in a probabilistic network predictive values pertain to the underlying truth and do not provide directly for establishing the main diagnosis, the context of the overall network provides for indirectly updating the probability distribution over the main disease variable with the test results entered.

Acknowledgements

The research of the two authors was partly supported by the Netherlands Organisation for Scientific Research. We are most grateful to Babs Taal and Berthe Aleman from the Netherlands Cancer Institute, Antoni van Leeuwenhoekhuis, who spent much time and effort in the construction of the oesophagus network. We would also like to thank Cilia Witteman for the many fruitful discussions we had about the cognitive aspects of diagnostic reasoning and human decision making.

References

1. R.H. Fletcher, S.W. Fletcher, and E.H. Wagner. Clinical Epidemiology. The Essentials. Williams & Wilkins, Baltimore, 1996.

2. H.C. Sox, M.A. Blatt, M.C. Higgins, and K.I. Marton. Medical Decision Making. Butterworth Publishers, Boston, 1988.

3. Diagnostisch Kompas. College voor Zorgverzekeringen, Amstelveen, 1999.

4. F.V. Jensen. An Introduction to Bayesian Networks. UCL Press, London, 1996.

5. L.C. van der Gaag, C.L.M. Witteman, S. Renooij, and M. Egmont-Petersen. The effects of disregarding test characteristics in probabilistic networks. In: S. Quaglini, P. Barahona, and S. Andreassen (eds.) Artificial Intelligence in Medicine. LNAI 2101, Springer-Verlag, Berlin, 2001, pp. 188–198.

6. L.C. van der Gaag, S. Renooij, C.L.M. Witteman, B.M.P. Aleman, and B.G. Taal. Probabilities for a probabilistic network: a case study in oesophageal cancer. Artificial Intelligence in Medicine, vol. 25(2), 2002, pp. 123–148.

7. M.C. Weinstein and H.V. Fineberg. Clinical Decision Analysis. W.B. Saunders Company, Philadelphia, 1980.


Bayesian Learning of the Gas Exchange Properties of the Lung for Prediction of Arterial Oxygen Saturation

David Murley1, Stephen Rees1, Bodil Rasmussen2, and Steen Andreassen1

1 Center for Model Based Medical Decision Support Systems Aalborg University, Aalborg, Denmark

2 Department of Anaesthesia, Aalborg Hospital, Aalborg, Denmark [email protected]

Abstract. This paper describes how real-time Bayesian learning of physiological model parameters is used to predict arterial oxygen saturation at the bedside. The efficacy of using these predictions as a decision support tool in a system for estimating gas exchange parameters of the lung (ALPE) was tested retrospectively. For the predictions to offer effective decision support they need to be accurate and safe. These qualities were tested for two patient groups, using two different test strategies for each group. The prediction accuracy when used in combination with the predictions’ safety margin was found to be adequate in all the test cases. Thus the method described can be used as the basis for effective model-based decision support in ALPE.

1 Introduction

Oxygen delivery to the tissues can be reduced by insufficient blood supply to the tissues, or low oxygen concentration in the blood. Low oxygen concentration in the blood is often caused by abnormalities in the exchange of gas between alveoli and lung capillaries. These abnormalities present in various patient groups, including patients residing in the ICU on mechanical ventilation, patients with pulmonary congestion due to left-sided heart failure, and patients following routine major surgery [1-4]. Therefore, measurement of the gas exchange properties of the lungs could be an important part of the monitoring and prevention of hypoxia.

Recently a system has been developed for measurement of the lungs’ gas exchange properties from non-invasive clinical data (the Automatic Lung Parameter Estimator, ALPE) [5]. ALPE automatically captures, records, and displays data from a ventilator, gas analyser and a pulse oximeter. ALPE then processes these data to produce a description of pulmonary gas exchange using a two-parameter model of oxygen transport [6]. These two parameters represent the percentage of blood flowing through non-ventilated regions of the lung (shunt) and the alveoli ventilation-perfusion mismatch (fA2). fA2 is the fraction of ventilation to a lung compartment containing 90% of the non-shunted blood flow. This model has been shown to fit data measured in a variety of patients [7, 8].

Identification of the two parameters describing gas exchange requires measurement, at steady state, of a spread of arterial oxygen saturation (SpO2) values. Producing a sufficient spread of SpO2 values (85-100%) requires a number of step changes in inspired oxygen fraction (FiO2). After each step change in FiO2 several minutes are required until steady state is achieved. This means that a procedure including five step changes in FiO2 usually takes ten to fifteen minutes to perform [5]. Selection of the most appropriate set of FiO2 values is therefore important to minimize the duration of the procedure.

Currently the clinician selects the FiO2 values using their intuition and experience of using ALPE to achieve a sufficient spread of SpO2 values. This selection is difficult when reducing FiO2, where a poor choice of FiO2 may reduce SpO2 more than expected. As a result, FiO2 is decreased cautiously to ensure that SpO2 remains above 85%, which leads to a sub-optimal number of FiO2 steps being taken.

Therefore, there is a need for a system which helps guide the clinician to the next most appropriate FiO2. A system which quickly predicts what a patient’s probable SpO2 will be at any FiO2 could guide the clinician to select the next most appropriate FiO2. Such a system could reduce the number of FiO2 steps taken, and hence reduce the time required to perform an ALPE procedure.

This paper describes the use of Bayesian learning in ALPE for the real-time prediction of SpO2 at any chosen FiO2. This Bayesian version of the ALPE system (ALPE Bayes) is tested retrospectively using clinical data to investigate the accuracy and safety of the SpO2 predictions, and hence test if the system is able to give guidance to the clinician on selecting the next most appropriate FiO2 level.

2 Materials and Method

2.1 Bayesian Learning in ALPE

This section describes the application of Bayesian learning in the ALPE system, that is, the specification of a priori parameter distributions, the update of the joint parameter distribution on data collection using Bayes’ theorem, and the use of the joint parameter distribution to predict the probability distribution of SpO2.

A Priori Distributions. A number of patient groups exist for which measurement of the gas exchange properties of the lung may be important. Different groups can have very different lung function; for example, patients with adult respiratory distress syndrome will typically present differently from those following routine major surgery. Knowledge of inter-group variation is currently represented in ALPE as a priori distributions for gas exchange parameters (shunt and fA2) for five previously studied patient groups [6]. These are normal subjects, patients with heart failure, pre- and post-operative patients, and long-term intensive care patients.

The a priori parameter distributions of shunt and fA2 are assumed to be normally distributed and are divided into a set of n discrete values. Figure 1(a) shows plots of the a priori distributions for the post-operative group of patients. For this group the possible shunt values range from 0% at the normal end of the scale, to 24% at the abnormal end. The possible fA2 values vary from 0.9 at perfectly matched ventilation-perfusion, down to 0.05 at extremely poorly matched ventilation-perfusion.

The two a priori paramete distributions are combined to form a joint a priori distri-bution, p( ), using the calculated covariance for each patient group.

266 David Murley et al.

Fig. 1. Plots of oxygen model’s parameter distributions.

$$p(\Theta) = \sum_{i=1}^{n} \sum_{j=1}^{n} p(\theta_{ij}) = 1 \qquad (1)$$

where

$$\theta_{ij} = \begin{bmatrix} \mathit{shunt}_i \\ \mathit{fA2}_j \end{bmatrix} \qquad (2)$$
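As a rough illustration of equations 1 and 2, the sketch below builds a discretised joint prior over (shunt, fA2) from a bivariate normal and normalises it so that the grid sums to 1. The grid limits follow the ranges quoted for the post-operative group; the means, standard deviations, correlation and grid size are hypothetical values chosen for illustration, not the ALPE priors.

```python
import math

def joint_prior(shunt_grid, fa2_grid, mean, sd, rho):
    """Discretised bivariate normal over (shunt, fA2), normalised to sum to 1."""
    weights = []
    for s in shunt_grid:
        zs = (s - mean[0]) / sd[0]
        row = []
        for f in fa2_grid:
            zf = (f - mean[1]) / sd[1]
            q = (zs * zs - 2.0 * rho * zs * zf + zf * zf) / (1.0 - rho * rho)
            row.append(math.exp(-0.5 * q))   # constant factor cancels on normalisation
        weights.append(row)
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

n = 25                                                      # hypothetical grid size
shunt_grid = [24.0 * i / (n - 1) for i in range(n)]         # 0% .. 24%
fa2_grid = [0.05 + 0.85 * j / (n - 1) for j in range(n)]    # 0.05 .. 0.9
# Hypothetical prior moments and correlation (assumptions, not from the paper).
prior = joint_prior(shunt_grid, fa2_grid, mean=(8.0, 0.7), sd=(5.0, 0.15), rho=-0.3)
print(round(sum(map(sum, prior)), 6))
```

Each entry prior[i][j] plays the role of p(θij) in equation 1.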

Measurement Update. The joint a priori distribution, p(Θ), is initially updated by measuring the ventilation variables (ω1) at the FiO2 level at which the patient presents. The likelihood of the ventilation variables, p(ω1 | Θ), is then calculated using the set of all possible parameter values (Θ), the oxygen transport model, and the measurement error. Bayes’ theorem is then applied to combine the likelihood with the a priori joint parameter distribution to give an a posteriori joint distribution, p(Θ | ω1):

$$p(\Theta \mid \omega_1) = \frac{p(\omega_1 \mid \Theta)\, p(\Theta)}{p(\omega_1)} \qquad (3)$$

The a posteriori joint probability distribution is then marginalized in order to plot the individual a posteriori distributions of shunt and fA2. Equation 4 illustrates how the marginal probability of a single shunt value, shunt1, is calculated.

1(a) Post-operative a priori parameter distributions.

1(b) Parameter distributions after first update.


$$p(\mathit{shunt}_1) = \sum_{j=0}^{n} p(\mathit{shunt}_1, \mathit{fA2}_j) \qquad (4)$$
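The measurement update and the marginalisation can be sketched on a small grid as follows. This is a minimal illustration only: toy_spo2 is a hypothetical stand-in for the oxygen transport model, the flat prior and the Gaussian measurement error with sd_meas = 1 are assumptions, and none of the numbers come from the paper.

```python
import math

def toy_spo2(shunt, fa2, fio2):
    """Hypothetical stand-in for the oxygen transport model (NOT the ALPE model)."""
    return min(100.0, 100.0 - 0.4 * shunt - 30.0 * (1.0 - fa2) * (1.0 - fio2))

def update(prior, shunt_grid, fa2_grid, fio2, measured, sd_meas):
    """Equation 3 on a grid: posterior is proportional to likelihood times prior."""
    post = []
    for i, s in enumerate(shunt_grid):
        row = []
        for j, f in enumerate(fa2_grid):
            resid = measured - toy_spo2(s, f, fio2)
            like = math.exp(-0.5 * (resid / sd_meas) ** 2)
            row.append(like * prior[i][j])
        post.append(row)
    evidence = sum(map(sum, post))            # plays the role of p(omega_1)
    return [[p / evidence for p in row] for row in post]

def marginal_shunt(joint):
    """Equation 4: sum the joint distribution over all fA2 values."""
    return [sum(row) for row in joint]

shunt_grid = [0.0, 6.0, 12.0, 18.0, 24.0]
fa2_grid = [0.05, 0.3, 0.6, 0.9]
flat = 1.0 / (len(shunt_grid) * len(fa2_grid))      # hypothetical flat prior
prior = [[flat] * len(fa2_grid) for _ in shunt_grid]
posterior = update(prior, shunt_grid, fa2_grid, fio2=0.21, measured=90.0, sd_meas=1.0)
print([round(p, 3) for p in marginal_shunt(posterior)])
```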

Figure 1(b) shows the distribution plots for shunt and fA2 following the first update of the a priori distributions (figure 1(a)). Comparing figure 1(b) with 1(a), it can be seen that both distributions of possible parameter values have shifted towards the abnormal end of their scales. On a computer with a Pentium III 850 MHz processor and 128 MB of RAM the measurement update process takes a fraction of a second.

On delivery of the next FiO2 level, and subsequent measurement of data, this update process is repeated using the a posteriori joint parameter probability as the new a priori distribution. Using the a posteriori in this way is a core concept of Bayesian learning [9]. After taking ‘m’ measurements the relationship between the a posteriori and the a priori can be expressed as follows:

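$$p(\Theta \mid \omega_m, \ldots, \omega_2, \omega_1) = \frac{p(\omega_m \mid \Theta)\, p(\Theta \mid \omega_{m-1}, \ldots, \omega_2, \omega_1)}{p(\omega_m)} \qquad (5)$$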
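The recursion in equation 5 can be illustrated with a one-dimensional sketch in which each posterior becomes the prior for the next measurement. The grid, the flat starting prior, the three measurements and the Gaussian likelihood are all hypothetical; the sketch also assumes the measurements are conditionally independent given the parameter, in which case the sequential result matches a single batch update.

```python
import math

def bayes_step(prior, likelihood):
    """One measurement update: the previous posterior acts as the prior."""
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def gaussian_likelihood(grid, measured, sd):
    """Likelihood of one measurement for every candidate value on the grid."""
    return [math.exp(-0.5 * ((measured - g) / sd) ** 2) for g in grid]

grid = [float(v) for v in range(0, 25, 2)]   # hypothetical parameter grid
flat = [1.0 / len(grid)] * len(grid)         # hypothetical flat starting prior
measurements = (10.0, 12.5, 11.0)            # hypothetical data

sequential = flat
for measured in measurements:
    sequential = bayes_step(sequential, gaussian_likelihood(grid, measured, 3.0))

# One batch update with the product of all likelihoods gives the same posterior.
product = [1.0] * len(grid)
for measured in measurements:
    product = [p * l for p, l in zip(product, gaussian_likelihood(grid, measured, 3.0))]
batch = bayes_step(flat, product)
```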

Prediction of Oxygen Saturation. Using the joint distribution, p(Θ), with the oxygen model and measurement error of SpO2, simulations can be performed to calculate the probability of a set of SpO2 measurement values (Ψ) given any inspired oxygen fraction ((FiO2)x).

$$p(\Psi \mid (F_iO_2)_x) = \sum_{\Theta} p(\Psi \mid (F_iO_2)_x, \Theta)\, p(\Theta) \qquad (6)$$

By applying equation 6 across the range of FiO2 values a probability contour map of predicted SpO2 can be plotted (figure 2). Figure 2 shows the probability contours of the SpO2 prediction for a post-operative cardiac patient after one measurement update (point A). The line in the middle of the shaded area represents the predicted most-likely SpO2 at the corresponding end tidal oxygen fraction (FeO2). The shaded areas represent 68% and 95% confidence intervals for the prediction of SpO2. The lower boundary of a confidence interval can be used to indicate a prediction’s safety margin. The safety margin is the percentage certainty that the measured SpO2 will be above a predefined safe SpO2 threshold at a specified FeO2. For an ALPE procedure the safe SpO2 threshold for all patients is 85%. For the case illustrated in figure 2, the lower boundary of the 68% confidence interval intersects the 85% SpO2 line at a FeO2 of approximately 15%. Since 16% of the probability lies below the lower boundary of a 68% interval, at a FeO2 of 15% there is an 84% safety margin in the prediction. Thus, before changing FiO2, the clinician is able to assess the predicted risk of SpO2 dropping below a patient’s threshold level.
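A minimal sketch of how a safety margin could be read off equation 6 is given below. It treats the predictive distribution as a mixture over parameter combinations with Gaussian measurement error. The predicted SpO2 values, their weights and the error standard deviation are hypothetical, and the Gaussian error model is an assumption rather than the ALPE implementation.

```python
import math

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_above(threshold, predicted, weights, sd_meas):
    """P(SpO2 > threshold) under equation 6 with Gaussian measurement error."""
    return sum(w * (1.0 - normal_cdf(threshold, mu, sd_meas))
               for mu, w in zip(predicted, weights))

# Hypothetical model outputs at one FiO2 for four parameter combinations,
# with their (hypothetical) joint probabilities p(Theta).
predicted = [93.0, 90.0, 87.0, 83.0]
weights = [0.4, 0.3, 0.2, 0.1]
margin = prob_above(85.0, predicted, weights, sd_meas=1.0)
print(f"safety margin: {100 * margin:.1f}%")
```

Evaluating prob_above over a range of FiO2 values would give the kind of curve from which a clinician could pick the lowest FiO2 that keeps the margin above a chosen level.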

In addition to the contour plot, when the clinician inputs any FiO2 value, error bars are plotted to highlight the SpO2 distribution at the corresponding FeO2. The error bars are plotted at the same points in the SpO2 probability distribution as the contours. The relative sizes of the error bars are a simplified representation of the relative probability masses in the SpO2 probability distribution. Using a computer with a Pentium III 850 MHz processor and 128 MB of RAM it typically takes a fraction of a second to plot an error bar, and 1 to 2 seconds to plot the SpO2 probability contours.


Fig. 2. Oxygen saturation prediction following one measurement update at point A.

2.2 Retrospective Testing of ALPE Bayes

The aim of this testing was to investigate the viability of the decision support function of ALPE Bayes. This function aids the clinician in selecting the next FiO2 level by predicting the SpO2 distribution at any FiO2. Testing therefore involved assessing the accuracy and safety margin of these SpO2 predictions.

The SpO2 prediction depends on the a priori parameter distributions, and the size of FiO2 steps taken. It was therefore necessary to test the SpO2 predictions for different patient categories with different a priori distributions, and at different FiO2 step sizes.

The SpO2 predictions were tested using data recorded pre- and post-operatively from five patients undergoing cardiac surgery. Pre- and post-operative a priori parameter distributions are defined in ALPE Bayes. By comparing the predictions from these two groups the effect of the a priori parameter distributions on the SpO2 predictions was assessed.

The effect of FiO2 step size on the SpO2 prediction accuracy was tested using two different strategies applied to each of the recorded cases. In strategy 1 the accuracy of SpO2 predictions was investigated when a single large FiO2 step was taken, for example stepping from point A to point C in figure 3. This prediction is referred to as ‘Strat 1-C’ below, i.e. test strategy 1 point C. Moving to point C represents an aggressive step down from the FiO2 at which the patient presented (point A). Moreover, only point A is used to update the joint parameter distribution, p(Θ), before making the prediction for point C.

In test strategy 2 the accuracy of SpO2 predictions at two points was investigated when an intermediate FiO2 step, point B, is included (Strat 2-B). This means that points A and B are used to update the joint distribution, p(Θ), before making the prediction for point C (Strat 2-C). Thus the effect of the additional measurement update on the prediction accuracy at point C is tested.

For all the tested SpO2 predictions the prediction error was calculated as the difference between the most-probable SpO2 prediction and the corresponding measured SpO2 value (prediction - measurement). The prediction errors were then categorized according to patient group (pre- or post-operative) and test strategy prediction point (Strat 1-C, Strat 2-B, or Strat 2-C). Then, for each category, the sample mean prediction error and the corresponding 95% confidence interval for the population mean prediction error were calculated. The 95% confidence interval corresponded to a t-distribution with four degrees of freedom. The confidence intervals for the population mean prediction errors were compared between and within the patient groups.
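The interval calculation can be sketched with the five post-operative, strategy 1 prediction errors listed in Table 1. The critical value 2.776 is the standard two-sided 95% value of Student's t with four degrees of freedom; the function name is chosen here for illustration.

```python
import math
import statistics

T_CRIT_4DF = 2.776  # two-sided 95% critical value of Student's t, 4 degrees of freedom

def mean_error_interval(errors, t_crit=T_CRIT_4DF):
    """Sample mean and 95% confidence interval for the population mean error."""
    mean = statistics.mean(errors)
    sem = statistics.stdev(errors) / math.sqrt(len(errors))
    return mean, mean - t_crit * sem, mean + t_crit * sem

# Prediction errors from Table 1 (post-operative group, strategy 1).
errors = [1.55, 1.65, 3.89, 2.37, -0.30]
mean, low, high = mean_error_interval(errors)
print(f"mean {mean:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

With the tabulated (rounded) errors this gives a mean of 1.83 and an interval of about -0.05 to 3.71, which agrees with the reported -0.06% to 3.72% to within rounding.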

Fig. 3. Illustration of measurement points used in test strategies.

The safety margin of each SpO2 prediction was assessed by calculating the difference between the SpO2 on the lower boundary of the 68% confidence interval and the measured SpO2. There is 84% certainty in the prediction that the measured SpO2 will be above the lower boundary of the 68% confidence interval. Therefore, if the SpO2 prediction is to be safely used as a decision support aid, the measured SpO2 should be above the lower boundary of the 68% confidence interval. If this can be shown to be true, then where the lower boundary of the 68% confidence interval is above a patient’s threshold SpO2 (85% in ALPE) it could be used for safely selecting the next FiO2 step.

Ethical approval for studying the five patients undergoing cardiac surgery was obtained from the Ethics Committee of North Jutland and Viborg Counties. Informed written and oral consent was obtained in all cases.

3 Results

3.1 Accuracy of Most-Probable SpO2 Prediction

Pre-operative Group. Figure 4 shows a summary of the prediction error results for the pre-operative group of patients. These patients were not ventilated, so the first measurement point (A) was made at a FiO2 of approximately 21%.

The graph in figure 4 is split into three columns, one for each of the predictions tested. ‘Strat 1 – C’ refers to the test strategy 1 prediction at point C (figure 3). ‘Strat 2 – B’ denotes test strategy 2 point B, and ‘Strat 2 – C’ signifies test strategy 2 point C. The vertical extent of the lines plotted in each of the columns represents the 95% confidence interval for the population mean prediction error. The horizontal bar in the middle of each of the intervals represents the sample mean prediction error.



Fig. 4. Pre-operative group: 95% confidence intervals of the population mean prediction error for the 3 predictions tested.

In test strategy 1, ‘Strat 1 – C’, the average single step decrease in FiO2 of 6.8% gave an average decrease in SpO2 of 5.0%. The 95% confidence interval for the population mean prediction error varies from -2.19% (bottom bar) to 0.88% (top bar). The sample mean prediction error (middle bar) is -0.66% (a pessimistic prediction bias).

In test strategy 2 the prediction accuracy was tested following two relatively small steps in FiO2: A to B, then B to C. The average change in FiO2 from point A to B was 4.5% for a corresponding average change in SpO2 of 2.3%. For the prediction at point B, ‘Strat 2 – B’, the 95% confidence interval of the population mean prediction error varies from -0.62% to 0.88%, and the sample mean prediction error is 0.13%. The average decrease in FiO2 from point B to C was 2.6% and the average decrease in SpO2 was 2.3%. For the prediction at point C, ‘Strat 2 – C’, the 95% confidence interval of the population mean prediction error varies from -1.95% to 0.60%, and the sample mean prediction error is -0.67%; a similar prediction accuracy to ‘Strat 1 – C’.

Post-operative Group. For ‘Strat 1 – C’ (figure 5) the average initial FiO2 value was 39.5% (point A), and the average FiO2 decrease of 21.4% produced an average SpO2 decrease of 7.2%. The 95% confidence interval for the population mean prediction error varies from -0.06% to 3.72%, and the sample mean prediction error is 1.83%. This overly optimistic bias in the prediction accuracy means that for these cases it could be unsafe for ALPE Bayes to give advice based solely on the predicted most-likely SpO2. The prediction’s safety margin must also be considered (section 3.2).

For ‘Strat 2 – B’ the average reduction in FiO2 was 11.1% with a resultant average decrease in SpO2 of 1.5%. The 95% confidence interval of the population mean prediction error varies from -1.95% to 0.60%, and the sample mean prediction error is 0.51%. For ‘Strat 2 – C’ the average reduction in FiO2 was 11.3% with a resultant average decrease in SpO2 of 6.5%. The 95% confidence interval of the population mean prediction error varies from -1.10% to 2.78%, and the sample mean prediction error is 0.84%. So, strategy 2 produced more accurate predictions than strategy 1.

[Fig. 4 plot. Vertical axis: prediction error (pred - meas), -2.50 to 1.00. Horizontal axis: Strat 1 - C, Strat 2 - B, Strat 2 - C.]


Fig. 5. Post-operative group: 95% confidence intervals of the population mean prediction error for the 3 predictions tested.

3.2 Prediction Safety Margin

In test strategy 1, for all predictions, the lower boundary of the prediction’s 68% confidence interval was below the measured SpO2. However, the prediction precision varied, which meant that the lower 68% boundary was not always above the ALPE safety threshold (SpO2 = 85%). This is illustrated by the results for the post-operative group summarised in Table 1. For cases 2 and 3, SpO2 on the lower 68% boundary was 89.03% and 87.61% respectively. So for these two cases ALPE Bayes would have confirmed that it was safe to use strategy 1, while for the other three cases a more conservative strategy, for example strategy 2, would have been suggested.

Table 1. Post-operative, strategy 1, prediction results.

Case number   Measured SpO2   FiO2   Prediction error   SpO2 on lower 68% boundary
1             90              16.2    1.55              76.34
2             92              16.7    1.65              89.03
3             89              20.8    3.89              87.61
4             86              22.9    2.37              44.16
5             90              29.1   -0.30              62.33

In test strategy 2, for all the pre-operative cases the lower boundaries of the prediction’s 68% intervals were below the measured SpO2 value. In four out of the five post-operative cases the lower boundary of the prediction’s 68% interval was below the measured SpO2 value. In the other post-operative case the lower boundary of the 68% interval was higher than the measured SpO2 by an insignificant amount for both predictions, 0.37% for the first (point B) and 0.06% for the second (point C). These results confirm that the lower boundary of the 68% interval could be used as a dependable indicator of the prediction’s safety.
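The strategy 1 results in Table 1 can be checked mechanically. The sketch below copies the measured SpO2 and lower 68% boundary values from the table and applies the 85% ALPE threshold; the function and variable names are chosen here for illustration.

```python
SAFE_SPO2 = 85.0  # ALPE safety threshold (%)

# (case, measured SpO2, SpO2 on lower 68% boundary), copied from Table 1.
table1 = [(1, 90, 76.34), (2, 92, 89.03), (3, 89, 87.61), (4, 86, 44.16), (5, 90, 62.33)]

def classify(rows, threshold=SAFE_SPO2):
    """Return the cases whose lower boundary clears the threshold, and
    whether every measured SpO2 stayed at or above its lower boundary."""
    cleared = [case for case, _, lower in rows if lower > threshold]
    bound_held = all(measured >= lower for _, measured, lower in rows)
    return cleared, bound_held

cleared, bound_held = classify(table1)
print(cleared, bound_held)   # [2, 3] True
```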

[Fig. 5 plot. Vertical axis: prediction error (pred - meas), -2.00 to 4.00. Horizontal axis: Strat 1 - C, Strat 2 - B, Strat 2 - C.]


4 Discussion

ALPE has been designed to use a spread of SpO2 measurements to identify a patient’s pulmonary gas exchange parameters. The spread of SpO2 values is produced by reducing FiO2 in a number of steps. Each step in FiO2 lasts several minutes as the patient’s gas exchange must be in a steady state before a SpO2 measurement is taken. Therefore, selecting the optimum number of FiO2 steps is important to minimize the duration of the ALPE procedure. Selecting the optimum number of FiO2 steps is demanding because it is difficult to safely predict the change in SpO2 a decrease in FiO2 will produce. Therefore, there is a need for a decision support system which can safely predict the change in SpO2 for any decrease in FiO2. The ALPE Bayes system described in this paper has the potential to fulfill this decision support function by predicting the probability distribution of SpO2 at any FiO2. For ALPE Bayes to realize this decision support potential its predictions of SpO2 need to be accurate and safe. The prediction accuracy and safety depend on the a priori parameter distributions used, so the SpO2 predictions were tested for pre- and post-operative patient groups. Moreover, the prediction accuracy and safety are also dependent on the FiO2 step size, so two different test strategies with varying step sizes were used.

The prediction accuracy for the pre-operative group was better than for the post-operative group at all of the prediction points. This was partly due to the greater homogeneity of the pulmonary gas exchange parameters of this group, a factor which is represented in ALPE by the greater precision of this group’s a priori parameter distributions. The precision of the parameter distributions can also be improved by the measurement update process (figure 1). The effect on the prediction accuracy of this improved precision can be seen by comparing the prediction accuracy for the post-operative group (figure 5) at ‘Strat 2 – C’ (0.84%) with that at ‘Strat 1 – C’ (1.83%). This demonstrates that the additional update of the joint parameter distribution in strategy 2 improved the accuracy of the predictions for the post-operative group. However, for the pre-operative group (figure 4) the prediction accuracy at ‘Strat 2 – C’ (0.67%) was about the same as at ‘Strat 1 – C’ (0.66%). This demonstrates that additional measurement updates of the joint parameter distribution may not always be required to improve prediction accuracy.

The other factor affecting the relative prediction accuracies is the size of the decrease in FiO2 and the corresponding change in SpO2 this produces. At a crude level of analysis, the bigger the change in FiO2 the more difficult it is to accurately predict what effect this will have on SpO2. This can be seen when comparing the prediction errors for the two different test strategies, e.g. compare ‘Strat 2 – B’ with ‘Strat 1 – C’ in figure 5. Here the prediction accuracy is improved when approximately half the FiO2 step size is used in test strategy 2. However, comparing ‘Strat 2 – B’ and ‘Strat 2 – C’ in figure 5, the FiO2 step sizes are similar but the prediction accuracies are different. This shows that the prediction accuracy is dependent on the rate of change of SpO2 for a unit change in FiO2, i.e. the gradient of the oxygen saturation curve. For the post-operative group (figure 5) the average gradient of the oxygen saturation curve affecting the prediction at ‘Strat 2 – B’ is less than a quarter of the average gradient affecting the prediction accuracy at ‘Strat 2 – C’. So the accuracy of the SpO2 prediction is worse where the gradient of the oxygen saturation curve is higher.

In the tests of ALPE Bayes’ prediction safety the majority of the lower boundaries of the 68% confidence intervals of the SpO2 prediction were found to be below the measured SpO2. Although there were exceptions to this for one post-operative case, these exceptions were found to be negligible. Therefore, the point where the lower boundary of the 68% confidence interval intersects a SpO2 safety threshold could be used to inform the setting of FiO2.

The results have shown that ALPE Bayes can adequately predict the SpO2 at a new FiO2. Moreover, the lower boundary of a prediction’s 68% confidence interval can be used to indicate the lowest potential SpO2 value. Thus the predictions of ALPE Bayes could be used to reduce the number of FiO2 steps taken during an ALPE procedure, and thus reduce the duration of the procedure.

Acknowledgements

This work was partially supported by grants awarded by the Danish Heart Foundation, the Danish Research Academy and by the IT-committee under the Danish Technical Research Council.

References

1. Melot C. Contribution of multiple inert gas elimination techniques to pulmonary medicine. 5. Ventilation perfusion relationships in acute respiratory failure. Thorax 1994; 49: 1251-1258.

2. Johnson RL. Gas exchange efficiency in congestive heart failure. Circulation 2000; 101: 2774-2776.

3. Smith DC, Crul JF. Early postoperative hypoxia during transport. Br J Anaesth 1988; 61: 625-627.

4. Rosenberg J, Dirkes H, Kehlet H. Late postoperative episodic oxygen desaturation and heart rate variations following major abdominal surgery. Br J Anaesth 1989; 63: 651-654.

5. Rees SE, Kjærgaard S, Thorgaard P, Malczynski J, Toft E, Andreassen S. The Automatic Lung Parameter Estimator (ALPE) system: Non-invasive estimation of pulmonary gas exchange parameters in 10-15 minutes. J Clin Monit Comput 2002; 17: 43-52.

6. Rees SE, Kjærgaard S, Andreassen S. Mathematical Modelling of Pulmonary Gas Exchange. In: Modelling Methodology for Physiology and Medicine. Eds. E.R. Carson and C. Cobelli, Academic Press, 2001, pp 253-277.

7. Roe PG, Galdelrab R, Sapsford DJ, Jones JG. Intra-operative gas exchange and post-operative hypoxaemia. Eur J Anaesthesiol 1997; 14: 203-210.

8. Kjærgaard S, Rees SE, Nielsen JA, Freundlich M, Thorgaard P, Andreassen S. Changes in postoperative pulmonary function following gynaecological surgery. Acta Anaesthesiol Scand 2001; 3:n 349-356.

9. Duda RO, Hart PE. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.


Hierarchical Dirichlét Learning – Filling in the Thin Spots in a Database

Steen Andreassen1, Brian Kristensen2, Alina Zalounina1, Leonard Leibovici3, Uwe Frank4, and Henrik C. Schønheyder5

1 Center for Model-based Medical Decision Support, Aalborg University Fredrik Bajers Vej 7 C1, DK-9220 Aalborg East, Denmark

2 Department of Clinical Microbiology, Aarhus University Hospital Nørrebrogade 44, DK-8000 Århus C, Denmark

3 Department of Medicine, Beilinson Campus, Rabin Medical Center 49100 Petah-Tiqva, Israel

4 Freiburg University Hospital, Institut für Umweltmedizin und Krankenhaushygiene Hugstetter Straße 55, D-79106 Freiburg, Germany

5 Department of Clinical Microbiology, Aalborg Hospital P.O. Box 365, DK - 9100 Aalborg, Denmark

Abstract. Estimation of probabilities by classical maximum likelihood estimators can give unreliable results when the number of cases is small. A Bayesian approach, where prior probabilities with Dirichlet distributions are used to temper the estimates, can reduce the variance enough to make the estimates useful. This is demonstrated by using this approach to estimate mortalities of severe infections from different sites: lungs, skin, urinary tract, etc. The prior probabilities are provided in a hierarchical way, i.e. by deriving them from the same database, but without distinguishing between different sites of infection.

1 Introduction

The estimation of probabilities from databases can be quite tricky. Even when a database with a substantial number of cases is available, the database may contain “thin spots”, i.e. areas where there are not enough cases to answer relevant questions with satisfactory accuracy. We will use a database with 2040 cases of bacteraemia collected in 1992-1994 in the County of Northern Jutland to illustrate this problem. The database contains mortalities for the bacteraemic patients, dependent on whether or not the patients received covering antibiotic therapy and dependent on the site of infection. The analysis of the mortalities is motivated by the need to assess the effects of antibiotic therapy. It is assumed that covering antibiotic therapy reduces the mortality of infections and it is mainly this reduction in mortality that provides the “driving force” for antibiotic therapy. It is therefore important to have reasonably accurate estimates of the reduction in mortality that can be achieved with covering antibiotic therapy, since this therapeutic benefit must be balanced against the cost of the therapy. The cost of therapy consists of the cost of purchasing and administering the antibiotics, the cost of side effects and the ecological cost, i.e. the loss of therapeutic options for future patients due to bacterial resistance promoted by the use of antibiotics. This balancing of the benefit against the cost is an integral part of the clinical decision process [6]. In the Treat project, where a decision support system for antibiotic therapy has been constructed, this balancing is considered a decision theoretic problem, where the size of the therapeutic benefit is directly reflected in the choice of antibiotics [1, 2, 4, 5, 7], thus making the need for credible estimates of mortalities explicit.

The purpose of this paper is to propose a pragmatic way of reducing the problems caused by the lack of a sufficient number of cases. The problem arises because the mortality of an infection depends on the site of the infection, e.g. lung infections have a higher mortality than urinary tract infections. When the cases are divided into subgroups according to site of infection, some of the subgroups become relatively small, and the classical statistical estimators may give estimates that are too uncertain to be useful. In many cases a Bayesian approach, in this case hierarchical Dirichlét learning, may be useful for estimation of probabilities from databases. In each subgroup the parameter of interest, i.e. the reduction in mortality that can be achieved with covering antibiotic therapy, is assigned a prior distribution. These parameters are assumed to be drawn from a distribution that has a mean equal to the mean of all groups. The subgroup parameters thus a priori inherit the mean from the whole group, which is why this approach is called hierarchical [8]. The label Dirichlét is due to the use of the Dirichlét distribution for the specification of the priors.

2 Classical Statistical Estimators

Covering empirical antibiotic therapy is expected to reduce the mortality associated with infections. Table 1 gives the 30-day mortalities for the patients with bacteraemia, dependent on whether or not the patients received covering empirical antibiotic therapy. Empirical therapy is the antibiotic therapy given for the first few days of the infectious episode, until results from the blood culture become available. With non-covering therapy, the mean and standard deviation of the overall mortality across all sites of infection was Cmn = 26.1 ± 1.5%¹. With covering therapy the mortality was reduced to Cmc = 18.5 ± 1.1%. These numbers are maximum likelihood estimates of mortality, calculated as:

Cmn = Mn / Nn = 234 / 896 = 26.1% (1)

and

Cmc = Mc / Nc = 212 / 1144 = 18.5%, (2)

where Mn is the number of observed mortalities amongst the Nn patients receiving non-covering therapy and Mc is the number of observed mortalities amongst the Nc patients receiving covering therapy. The standard deviations were calculated as the standard deviation of the binomial distribution, e.g.:

sd(Cmn) = sqrt(Cmn * (1 - Cmn) / Nn). (3)

¹ We need to distinguish between the true (but unknown) value of the mortality and two different estimators for this mortality. The true value of mortality for the whole population (average population) will be called m. The prefix C will be used to distinguish the classical maximum likelihood estimator Cm, and the prefix D will be used to distinguish the Dirichlét estimator Dm. The superscript j will be used to refer to a specific site of infection, such that m^j, Cm^j and Dm^j refer to the true site-specific mortality, and to the classical and Dirichlét estimators of the site-specific mortality, respectively.

These numbers lend support to the assumed beneficial effects of covering antibiotic therapy. The reduction in mortality is ΔCm = Cmn - Cmc = 7.6 ± 1.9%, where the standard deviation was calculated from the normal distribution approximation to the binomial distribution. This is a statistically significant reduction of mortality (p = 0.00003).
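The overall estimates can be reproduced from the counts in equations 1 and 2. In the sketch below the standard deviation of the difference is taken as the root sum of squares of the two binomial standard deviations. The one-sided normal tail probability is an assumption about how a p-value might be obtained, since the text does not state which test produced p = 0.00003, so it is computed but not compared with that figure.

```python
import math

def ml_mortality(deaths, n):
    """Maximum likelihood mortality and its binomial standard deviation."""
    m = deaths / n
    return m, math.sqrt(m * (1.0 - m) / n)

m_n, sd_n = ml_mortality(234, 896)     # non-covering therapy
m_c, sd_c = ml_mortality(212, 1144)    # covering therapy
delta = m_n - m_c
sd_delta = math.sqrt(sd_n ** 2 + sd_c ** 2)
z = delta / sd_delta
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))   # assumed one-sided normal test
print(f"{100 * m_n:.1f} +/- {100 * sd_n:.1f}, {100 * m_c:.1f} +/- {100 * sd_c:.1f}, "
      f"reduction {100 * delta:.1f} +/- {100 * sd_delta:.1f}")
# prints: 26.1 +/- 1.5, 18.5 +/- 1.1, reduction 7.6 +/- 1.9
```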

Table 1 also gives the mortalities distributed on sites of infection. The estimated mortality and its standard deviation for an infection at site j are calculated for a patient receiving non-covering antibiotic therapy as:

$$\mathit{Cm}^{j}_{n} = M^{j}_{n} / N^{j}_{n}, \qquad \mathrm{sd}(\mathit{Cm}^{j}_{n}) = \sqrt{\mathit{Cm}^{j}_{n}\,(1 - \mathit{Cm}^{j}_{n}) / N^{j}_{n}} \qquad (4a,b)$$

and for a patient receiving covering antibiotic therapy as:

$$\mathit{Cm}^{j}_{c} = M^{j}_{c} / N^{j}_{c}, \qquad \mathrm{sd}(\mathit{Cm}^{j}_{c}) = \sqrt{\mathit{Cm}^{j}_{c}\,(1 - \mathit{Cm}^{j}_{c}) / N^{j}_{c}} \qquad (5a,b)$$

where M^j_n is the number of observed mortalities amongst the N^j_n patients receiving non-covering therapy for an infection at site j and M^j_c is the number of observed mortalities amongst the N^j_c patients receiving covering therapy for an infection at site j. As above, the standard deviations were calculated as the standard deviation of the binomial distribution.

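The reductions in mortality for each site were calculated as:

$$\Delta\mathit{Cm}^{j} = \mathit{Cm}^{j}_{n} - \mathit{Cm}^{j}_{c}, \qquad (6)$$

with the standard deviation:

$$\mathrm{sd}(\Delta\mathit{Cm}^{j}) = \sqrt{\mathrm{sd}(\mathit{Cm}^{j}_{n})^{2} + \mathrm{sd}(\mathit{Cm}^{j}_{c})^{2}}. \qquad (7)$$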

As can be seen from Table 1, the accuracy in the determination of the differences in mortality is quite modest. For three sites (bone, central nervous system and skin) the standard deviations are actually larger than the mean, making it impossible even to state that the therapy has a positive effect on the mortality. On the other hand, the large uncertainty offers some comfort for the site called intravascular devices. If the numbers were taken at face value, covering antibiotic therapy would actually increase mortality by 8.3%, which is hard to believe. There are medical reasons to believe that the mortality of infections caused by intravascular devices is quite small, but antibiotic therapy should definitely not increase mortality. A practical consequence of taking the numbers at face value would be that covering antibiotic therapy should never be given for infections related to intravascular devices.
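The site-level claims can be checked directly from the counts in Table 1. The sketch below evaluates the reduction in mortality and its standard deviation for the four sites discussed above; the function name and dictionary layout are chosen here for illustration.

```python
import math

def reduction(m_noncov, n_noncov, m_cov, n_cov):
    """Site-specific reduction in mortality and its standard deviation."""
    p_n, p_c = m_noncov / n_noncov, m_cov / n_cov
    var = p_n * (1 - p_n) / n_noncov + p_c * (1 - p_c) / n_cov
    return p_n - p_c, math.sqrt(var)

# (deaths, patients) for non-covering then covering therapy, from Table 1.
sites = {
    "Bone": (2, 12, 2, 24),
    "CNS": (1, 5, 10, 65),
    "Skin": (8, 41, 6, 41),
    "Intravasc.": (14, 108, 13, 61),
}
for name, counts in sites.items():
    diff, sd = reduction(*counts)
    print(f"{name:11s} {100 * diff:6.1f} +/- {100 * sd:4.1f}  sd > |mean|: {sd > abs(diff)}")
```

For bone, CNS and skin the standard deviation exceeds the estimated reduction, while for intravascular devices the estimated reduction is negative.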

These considerations reveal that the classical maximum likelihood estimators do not provide acceptable estimates of infection related mortalities.

3 Hierarchical Dirichlét Learning

The quality of a classical estimator is assessed by considering its properties, for example bias and standard deviation, assuming that it is applied repeatedly to several samples drawn from the same distribution. As pointed out in the section on classical estimators, the estimates of the reduction of mortality achieved by covering antibiotic therapy may have a quite high standard deviation for some of the sites of infection, actually to the point where the estimates are not useful. To reduce the standard deviation of the estimates, we will try to apply a Bayesian approach, where prior opinions about the estimators are used to reduce their standard deviation. In the Bayesian view the parameter to be estimated, for example m^j_n, is seen as a stochastic variable for which an a priori probability distribution, p(m^j_n), must be specified. When a sample D, drawn from the distribution, becomes available, the task of the learning process is to update this probability distribution in accordance with Bayes’ theorem, in this case:

Table 1. Maximum likelihood and Dirichlét estimates of mortalities for bacteraemia from nine sites during 1992-1994, 30 days after the onset of the infectious episode. Numbers are given for 9 sites of infection for non-covering and covering empirical antibiotic therapy

                                                  Max. likelihood (%)              Dirichlét adjusted (%)
                                            Non-cov.    Covering                Non-cov.    Covering
Site j         N^j N^j_n M^j_n N^j_c M^j_c   Cm^j_n      Cm^j_c      ΔCm^j       Dm^j_n      Dm^j_c      ΔDm^j
                                            mean    sd  mean    sd  mean    sd  mean    sd  mean    sd  mean    sd
Bone            36    12     2    24     2  16.7  10.8   8.3   5.6   8.3  12.1  25.4   1.6  17.1   1.3   8.3   2.0
Cardiovasc.     56    32     6    24     1  18.8   6.9   4.2   4.1  14.6   8.0  24.8   1.7  16.6   1.1   8.3   2.1
CNS             70     5     1    65    10  20.0  17.9  15.4   4.5   4.6  18.4  25.9   1.5  17.6   1.6   8.3   2.2
Gastroint.     384   186    46   198    38  24.7   3.2  19.2   2.8   5.5   4.2  25.3   1.9  18.9   1.7   6.4   2.5
Intravasc.     169   108    14    61    13  13.0   3.2  21.3   5.2  -8.3   6.2  20.6   1.6  19.3   1.7   1.3   2.4
Primary        519   265    90   254    64  34.0   2.9  25.2   2.7   8.8   4.0  31.1   1.9  22.7   1.8   8.4   2.6
Respiratory    236    63    28   173    36  44.4   6.3  20.8   3.1  23.6   7.0  31.5   2.1  19.8   1.7  11.8   2.7
Skin            82    41     8    41     6  19.5   6.2  14.6   5.5   4.9   8.3  24.7   1.8  17.7   1.5   7.0   2.3
Urinary        488   184    39   304    42  21.2   3.0  13.8   2.0   7.4   3.6  23.4   1.8  15.4   1.4   8.0   2.3
Total         2040   896   234  1144   212  26.1   1.5  18.5   1.1   7.6   1.9  26.1   1.3  18.5   1.0   7.6   1.6

$$p(m^{j}_{n} \mid D) = \frac{p(D \mid m^{j}_{n})\, p(m^{j}_{n})}{p(D)}. \qquad (8)$$

If a Bayesian is asked to provide a single number as an estimate, he may choose to give the expectation of the a posteriori probability distribution,

$$E[m^{j}_{n} \mid D] = \int m^{j}_{n}\, p(m^{j}_{n} \mid D)\, d m^{j}_{n}. \qquad (9)$$

We propose a hierarchical approach, which means that in the absence of evidence to the contrary, we assume that the mortality for a given site is the same as the average for all the sites, i.e. that the prior distributions for the individual sites of infection, p(m^j_n), are derived from the mortalities averaged over all sites of infection, p(mn). We also propose to use a Dirichlét approach, in which an imaginary sample is added to each true sample. For example, to learn the Dirichlét modified mortality for patients receiving non-covering treatment with infections from site j we add a sample of size ν^j_n to the original sample of size N^j_n, such that the total sample size becomes ν^j_n + N^j_n. In the imaginary sample the number of patients dying is μ^j_n, such that the total number of patients dying becomes μ^j_n + M^j_n, and the number of patients surviving is λ^j_n, such that the total number of patients surviving becomes λ^j_n + L^j_n. The hierarchical approach is introduced by assuming that the mortality in the imaginary sample is the same as the mortality averaged over all sites of infection, i.e. it is assumed that:


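$$\mu^{j}_{n} = \nu^{j}_{n}\, m_{n} \qquad (10)$$

This implies that the Bayesian estimator for Dirichlét modified mortality is²:

$$\mathit{Dm}^{j}_{n} = \frac{\mu^{j}_{n} + M^{j}_{n}}{\nu^{j}_{n} + N^{j}_{n}}. \qquad (11)$$

It follows from equations 3 and 10 that the standard deviation of μ^j_n is:

$$\mathrm{sd}(\mu^{j}_{n}) = \nu^{j}_{n} \sqrt{m_{n}\,(1 - m_{n}) / N_{n}} \qquad (12)$$

and similarly from equations 4a, 4b and 3:

$$\mathrm{sd}(M^{j}_{n}) = \sqrt{m^{j}_{n}\,(1 - m^{j}_{n})\, N^{j}_{n}}. \qquad (13)$$

Using the normal distribution approximation to the binomial distribution therefore gives:

$$\mathrm{sd}(\mathit{Dm}^{j}_{n}) = \frac{\sqrt{\mathrm{sd}(\mu^{j}_{n})^{2} + \mathrm{sd}(M^{j}_{n})^{2}}}{\nu^{j}_{n} + N^{j}_{n}} = \frac{\sqrt{(\nu^{j}_{n})^{2}\, m_{n}\,(1 - m_{n}) / N_{n} + m^{j}_{n}\,(1 - m^{j}_{n})\, N^{j}_{n}}}{\nu^{j}_{n} + N^{j}_{n}}. \qquad (14)$$

The corresponding concepts are defined for covering treatment and we thus have:

$$\mathit{Dm}^{j}_{c} = \frac{\mu^{j}_{c} + M^{j}_{c}}{\nu^{j}_{c} + N^{j}_{c}} \qquad (15)$$

and

$$\mathrm{sd}(\mathit{Dm}^{j}_{c}) = \frac{\sqrt{(\nu^{j}_{c})^{2}\, m_{c}\,(1 - m_{c}) / N_{c} + m^{j}_{c}\,(1 - m^{j}_{c})\, N^{j}_{c}}}{\nu^{j}_{c} + N^{j}_{c}}. \qquad (16)$$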

Expressions for the hierarchical Dirichlét estimator and its standard deviation have now been established. If we allow classical estimators to replace the true parameter values in equations 10 through 16, then the only unknown parameters are

j

n and j

c. For now we will arbitrarily set

j

n = j

c = 150, thus allowing equations 10 through 16 to be calculated. Table 1 gives these Dirichlét adjusted mortalities and their standard deviations. The standard deviations are substantially smaller than without the Dirichlét modification, but the price paid for this is of course that all mortalities are now biased towards the average mortalities. This also applies to the reductions in mortality. The Dirichlét adjusted reduction in mortality can be defined as:

$$\Delta Dm^j = Dm_n^j - Dm_c^j \tag{17}$$

with the standard deviation:

$$\mathrm{sd}(\Delta Dm^j) = \sqrt{\mathrm{sd}(Dm_n^j)^2 + \mathrm{sd}(Dm_c^j)^2} . \tag{18}$$

2 We have assumed the sample D in equation 8 to be drawn from a binomial distribution. As proposed in a review paper by Heckerman [3], it is then a good idea to let the a priori distribution $p(m_n^j)$ follow the conjugate prior of the binomial distribution, which is the Beta distribution. Bayes' theorem (equation 8) is then used to obtain the a posteriori distribution $p(m_n^j \mid D)$, and equation 9 is used to calculate its expectation, i.e. the hierarchical Dirichlét estimator $Dm_n^j$. Without going through the algebra, we simply state that, according to the result given by Heckerman [3], the estimator $Dm_n^j$ is given by equation 11.


It can be seen that Dmj

is positive for all sites and that it ranges between 1.3% and 11.8% and with a standard deviation that, except for the site “intravascular devices”, is smaller than the mean value.

4 The Size of the Imaginary Sample

The choice of the hyperparameters determines the extent to which each site of infection inherits the average mortality of all the sites. If the size of an imaginary sample, for example

j

n = j

n + j

n is large, then the mortality for all sites is dominated by the average mortality, and if

j

n is small, possibly zero, then the Dirichlét estimate becomes equal to the maximum likelihood estimate. The choice of

j

n can therefore be considered as a balance, where a too large imaginary sample will introduce too large a bias towards the average, and a too small imaginary sample will cause the estimator to be too noisy.

To formalize this consideration we return to a classical point of view, where we evaluate the estimators based on their performance when applied repeatedly to samples drawn from the given distribution. Usually the bias and variance (or standard deviation) are used to assess the quality of the estimators. Since, as mentioned, it is the reduction in mortality that provides the driving force for antibiotic therapy, we shall focus on the properties of the classical maximum likelihood estimator, Cm

j

(eq. 6), and the Bayesian hierarchical Dirichlét estimator, Dm

j

(eq. 17). The bias of an estimator is defined as the difference between the mean of the esti-

mator and the true value of the parameter. For Cmj

we therefore have:

$$\mathrm{bias}(\Delta Cm^j) = \mathrm{mean}(\Delta Cm^j) - (m_n^j - m_c^j) \tag{19}$$

Since it is well known that Cmj

n and Cmj

c are unbiased estimators of mj

n and mj

c, respec-tively, it follows that Cm

j

is an unbiased estimator, i.e. bias( Cmj

) = 0. The stan-dard deviation of Cm

j

is given by eq. 7. For Dm

j

we have:

$$\begin{aligned}
\mathrm{bias}(\Delta Dm^j) &= \mathrm{mean}(\Delta Dm^j) - (m_n^j - m_c^j) \\
&= \mathrm{mean}(Dm_n^j) - \mathrm{mean}(Dm_c^j) - (m_n^j - m_c^j) .
\end{aligned} \tag{20}$$

From the expressions for Dmj

n (eq. 11) and Dmj

c (eq. 15) the means can be determined. For example for Dm

j

n:

mean(Dmj

n) = mean((j

n + Mj

n) / (j

n + Nj

n))

= (j

n * mn + Nj

n * mj

n) / (j

n + Nj

n) . (21)

It is clear that mean(Dmj

n) � mj

n and that therefore neither Dmj

n nor Dmj

are unbiased estimators. The standard deviation of Dm

j

is given by eq. 18. We are now equipped to perform a simulation study of the bias and standard devia-

tion of the estimators, with the purpose of determining suitable values for j

n and j

c.


The bias and standard deviation of the estimators depend not only on j

n and j

c, but also on 8 other parameters, 4 describing the average population, mn and mc, Nn and Nc,

and 4 describing the site of infection, Nj

n and Nj

c, mj

n and mj

c. Initially we shall choose parameter values for the average population that resemble those in the database: Nn = Nc = 1000, mn = 0.26 and mc = 0.18. The average reduction in mortality attributable to covering antibiotic therapy is thus: m = mn - mc = 0.08.

For the site specific parameters we choose a site of about average size, Nj

n = Nj

c =100, a mortality for covering therapy that is identical to the average mortality for covering therapy , i.e. m

j

c - mc = 0.18, and a mortality for non-covering therapy mj

n = 0.31. The site specific reduction in mortality attributable to covering antibiotic ther-apy is thus: m

j

= mj

n - mj

c = 0.13. We further let the sizes of the imaginary samples for the non-covering and the covering therapy be the same:

j

n = j

c. With this choice of parameters, Fig. 1 shows the mean of the hierarchical Dirichlét

estimator, mean( Dmj

), as a function of j

n (or j

c). Dmj

is unbiased for small values of

j

n , i.e. mean( Dmj

) is equal to mj

. For larger values of j

n , Dmj

is biased and converges towards the average mortality m. For small values of

j

n the standard de-

viation, sd( Dmj ), is equal to the standard deviation of the classical estimator,

sd( Cmj ), but then decreases for larger values of

j

n. The undesirable bias of the estimator is thus compensated for by the decrease in its standard deviation.

Fig. 1. $\Delta m^j$, mean($\Delta Dm^j$), $\Delta m$, sd($\Delta Cm^j$), sigma and sd($\Delta Dm^j$) as a function of the imaginary sample sizes for non-covering and covering therapy, for the parameter values: $N_n = N_c = 1000$, $m_n = 0.26$ and $m_c = 0.18$, $N_n^j = N_c^j = 100$, $m_c^j = 0.18$ and $m_n^j = 0.31$

To find a compromise between these two effects, we introduce:

$$\mathrm{sigma} = \sqrt{\mathrm{bias}(\Delta Dm^j)^2 + \mathrm{sd}(\Delta Dm^j)^2} . \tag{27}$$

This equation expresses the idea that to the decision maker, who is only interested in having a robust estimate of the reduction in mortality, it does not matter, if the error comes from bias or from the noise in the estimate. Sigma has a minimum for

j

n = 127, indicating that for the parameters chosen,

j

n = j

c = 127 may be a good choice. Since the minimum of sigma as a function of

j

n and j

c depends on the 8 parameters mentioned above, it is complicated to make a complete analysis of how all parameters influence this minimum. However, some parameters do not substantially affect the minimum.
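The search for the minimum of sigma can be sketched numerically. The snippet below is only an illustration under stated assumptions: the imaginary sample size is called `a` (a hypothetical name, taken to be equal for non-covering and covering therapy), classical estimators are replaced by the true parameter values of the simulation, and the default arguments are the parameter values quoted for Fig. 1. The means follow eq. 21 and the variances follow eqs. 14 and 16.

```python
import math

def sigma(a, Nn=1000, Nc=1000, mn=0.26, mc=0.18,
          Njn=100, Njc=100, mjn=0.31, mjc=0.18):
    """Combined error (bias and noise) of the Dirichlet estimate of the
    reduction in mortality, for an imaginary sample of size `a`.

    `a` is a hypothetical name for the imaginary sample size, assumed to be
    the same for non-covering and covering therapy.
    """
    # Means of the Dirichlet-modified mortalities (eq. 21 and its covering analogue)
    mean_n = (a * mn + Njn * mjn) / (a + Njn)
    mean_c = (a * mc + Njc * mjc) / (a + Njc)
    # Bias of the estimated reduction in mortality (eq. 20)
    bias = (mean_n - mean_c) - (mjn - mjc)
    # Variances of the Dirichlet-modified mortalities (squares of eqs. 14 and 16)
    var_n = (a ** 2 * mn * (1 - mn) / Nn + mjn * (1 - mjn) * Njn) / (a + Njn) ** 2
    var_c = (a ** 2 * mc * (1 - mc) / Nc + mjc * (1 - mjc) * Njc) / (a + Njc) ** 2
    # sigma combines squared bias and variance (eq. 27 together with eq. 18)
    return math.sqrt(bias ** 2 + var_n + var_c)

# Grid search over integer imaginary sample sizes
best_a = min(range(0, 1001), key=sigma)
print(best_a)  # prints 127
```

With these parameter values the grid search returns 127, in agreement with the minimum reported in the text. Other parameter combinations can be explored by passing different keyword arguments to `sigma`.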

The number of cases for a given site, Nj

n and Nj

c , does not influence the position of the minimum. It stays at

j

n = j

c = 127 for Nj

n and Nj

c ranging from 5 to 1000. This is convenient, since it allows the same choice of

j

n and j

c for all sites, irrespective of the number of cases in the site.

Also the size of the populations, Nn and Nc, does not substantially influence the position of the minimum of sigma, as long as Nn and Nc remain several times larger than m

j

n and mj

c. Letting Nn and Nc range from 300 to 100000 makes the values of j

n and j

n that minimize sigma range from 100 to 144. The position of the minimum is relatively independent of the mortality for covering

therapy, mc, as long as the difference in mortality between covering and non-covering therapy, m = mn - mc = 8%, is constant. For mc ranging from 6% to 40%, the values of

j

n and j

c that minimize sigma, range from 121 to 135.

It also applies that the position of the minimum is reasonably independent of mjc, as

long as the site specific difference in mortality between covering and non-covering therapy, m

j

= mj

n - mj

c = 13%, is constant. For mj

c ranging from 6% to 40%, the val-ues of

j

n and j

c that minimize sigma, range from 74 to 172. In contrast, the position of the minimum is very dependent on the difference between the average reduction in mortality, m, and the site specific reduction in mortality, m

j

. Fig. 2 shows the posi-tion of the minimum as a function of m

j

- m. For mj

- m = 0.10 the minimum is at

j

n = j

c = 35, and for m equal to mj

the values of j

n and j

c that minimize sigma go towards infinity. The appropriate values for

j

n and j

c thus strongly depends on what is considered a typical value for m - m

j

. Some guidance can be taken from Table 1. The standard deviation of the values given in the column labeled m

j

is 0.08. This estimate of the standard deviation is inflated by the presence of the noise in the estimates of the individual values of m

j

, and 0.08 can thus be considered an upper limit for typical intersite variation.

Based on these considerations it can be concluded that if the difference between the reductions of coverage related mortality is in the range of 4-6%, then values of

j

n and

j

c in the range from 100 to 200 seems appropriate, while a larger difference of 8 to 10% will make a choice of

j

n and j

c in the range from 60 to 100 appropriate. As mentioned above, the value chosen for

j

n and j

n in Table 1 is 150.


Fig. 2. The values of j

n and

j

c that minimize sigma as a function of the difference between the

average and the site specific reduction in mortality, mj

- m

5 Discussion

The reduction in mortality that can be achieved by covering antibiotic therapy is the driving force for antibiotic therapy. It is therefore of interest to have accurate estimates of this reduction. Estimates that are too high may lead to "over-treatment", where the costs of the therapy may not be justified by the reduction in mortality that can be achieved, and estimates that are too low may lead to "under-treatment", where the patient may be given inadequate treatment. The reduction of mortality depends on the site of the infection, and it is therefore of interest to have an estimate of the site-specific reduction of mortality. Unfortunately, the subdivision of the cases into groups according to the site of infection reduces the number of cases per subgroup to the extent that the classical estimators become useless for at least some of the groups, because of the large standard deviations of the estimators. In this paper we have explored the use of a hierarchical Dirichlét estimator that reduces the standard deviations of the estimators, but at the expense of biasing them towards the mean of all groups. A function, sigma, was introduced that takes into account both the variance contributed by the bias and the variance contributed by the noise of the estimators.

Suitable values for the parameters of the Dirichlét priors for the mortalities were determined by seeking a minimum for sigma. The analysis showed that the number of imaginary cases to be inherited from the whole population to the individual groups depends almost solely on the difference between the reduction of mortality achieved in the individual group and the reduction of mortality in the whole population. When this difference goes from 4% to 10% the number of cases to be inherited goes from 200 to 60.


Acknowledgements

This work was supported by a grant from the Ph.D. programme at Aalborg Hospital, Denmark, a grant from Scandinavian Society for Antimicrobial Therapy, a grant from the European Commission for the TREAT-project under the IST programme (IST-1999-11459), and a grant from The Danish Technical Research Council (2051-01-0011). The Bacteraemia Register at the Department of Clinical Microbiology was supported by grants from the County of Northern Jutland and Det Obelske Familiefond.

References

1. Andreassen, S., Leibovici, L., Schønheyder, H.C., Kristensen, B., Riekehr, C., Kjær, A.G., Olesen, K.G.: A decision theoretic approach to empirical treatment of bacteraemia originating from the urinary tract. (eds.: Horn, W. et al.). Lecture Notes in Artificial Intelligence, Vol. 1620, Proceedings of Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making (AIMDM '99). Springer-Verlag (1999) 197-206.

2. Andreassen, S., Riekehr, C., Kristensen, B., Schønheyder, H.C., Leibovici, L.: Using probabilistic and decision-theoretic methods in treatment and prognosis modeling. Artif. Intell. Med. 15 (1999) 121-134.

3. Heckerman, D.: A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06. Redmond, U.S.A. (1995).

4. Kristensen, B., Andreassen, S., Leibovici, L., Riekehr, C., Kjær, A.G., Schønheyder, H.C.: Empirical treatment of bacteraemic urinary tract infection. Evaluation of a decision support system. Dan. Med. Bull. 46 (1999) 349-53.

5. Kristensen, B., Larsen, S., Schønheyder, H.C., Leibovici, L., Paul, M., Frank, U., Andreassen, S.: A decision support system (DSS) for antibiotic treatment improves empirical treatment and reduces costs. Proceedings of 41st Interscience Conference on Antimicrobial Agents and Chemotherapy (ICAAC), Chicago, Illinois, USA, December 16-19, 2001, 476 (2001).

6. Leibovici, L., Shraga, I., Andreassen, S.: How do you choose antibiotic treatment? Br. Med. J. 318 (1999) 1614-1616.

7. Leibovici, L., Fishman, M., Schønheyder, H.C., Riekehr, C., Kristensen, B., Shraga, I., Andreassen, S.: A causal probabilistic network for optimal treatment of bacterial infections. IEEE Trans. Knowledge and Data Eng. 12 (2000) 517-528.

8. Spiegelhalter, D.J., Myles, J.P., Jones, D.R., Abrams, K.R.: Bayesian methods in health technology assessment: a review. Health Tech. Assess. 2000, 4 No. 38 (2000).

A Bayesian Neural Network Approach for Sleep Apnea Classification

Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Ana del Rocío Fraga-Iglesias, and Vicente Moret-Bonillo

Department of Computer Science, University of A Coruña
Campus de Elviña s/n, 15071, A Coruña, Spain

{oscarfon,cibertha,ciamparo,civmoret}@udc.es
http://www.dc.fi.udc.es/lidia

Abstract. In this paper a method for sleep apnea classification is proposed. The method is based on a feedforward neural network trained using a Bayesian framework and a cross-entropy error function. The inputs of the neural network are the first level-5-detail coefficients obtained from a discrete wavelet transformation of the samples of the thoracic effort signal corresponding to the apnea. In order to train and validate the presented method, 120 events from 6 different patients were used. The true error rate was estimated using a 10-fold cross validation. The presented results were averaged over 100 different simulations and a multiple comparison procedure was used for model selection. The mean classification accuracy obtained over the test set was 83.78% ± 1.90.

1 Introduction

A sleep apnea is defined as a cessation of airflow during sleep lasting at least 10 seconds. The involuntary periodic repetition of these respiratory pauses constitutes one of the most frequent sleep disorders, named the sleep apnea syndrome (SAS) [1]. Generally speaking, there exist three different sleep apnea patterns:

– Obstructive apneas (OA): This is the most frequent pattern, characterized by the presence of thoracic effort for continuing breathing during the airflow cessation.

– Central apneas (CA): These are characterized by a cessation of both respiratory movements and airflow for at least 10 seconds.

– Mixed apneas (MA): This pattern is a combination of the previous two, defined by a central respiratory pause followed by an obstructive ventilatory effort.

The most effective method for the diagnosis of the SAS is based on nocturnal polysomnography. It consists of a polygraphic recording during sleep of the electrophysiological and pneumological signals. Thus, this method uses the electroencephalogram (EEG), electrocardiogram (ECG), electro-oculogram (EOG), electromyogram (EMG), airflow, thoracic breathing movements and arterial oxygen saturation signals. The overall analysis of these records allows apneic events to be identified and classified and, finally, a diagnosis to be elaborated. In particular, the detection of apneic events is carried out over the airflow signal, while the thoracic breathing movements signal is used for the classification of the apnea into the obstructive, central or mixed group. Figure 1 shows an example of a sleep apnea together with the two signals employed in the detection and classification tasks.

Fig. 1. (a) Airflow and (b) thoracic effort signals.

Normally, patients suffer from more than one type of apneic episode. The predominance of one type over the others determines the classification. There are several treatments of SAS that should be tailored based on, among other questions, the type of sleep apnea exhibited by the patient. Thus, the correct classification of the sleep apnea is extremely important for establishing a proper diagnosis. However, the analysis of the polysomnographic record is a tedious and very time-consuming task. For this reason, several computerized systems have appeared in recent years with the aim of automating both the detection and the classification of sleep apneas [2–5]. Among them, SAMOA [6] is an intelligent system that automatically analyzes the polysomnographic record and provides a diagnosis of the sleep apnea syndrome. In this paper, we describe a new approach to the classification of sleep apneas that will eventually be integrated into this system. This new approach, based on the use of wavelets and Bayesian neural networks, is intended to solve some of the problems detected during the first validation of SAMOA.

2 Development of the Sleep Apnea Classifier

The global scheme of the automatic apnea classifier presented in this paper is composed of three stages, as shown in figure 2. First, a detection stage, already described in [7], receives the thoracic effort signal and returns the location of each apnea. Subsequently, a preprocessing stage receives the raw samples of the detected apnea from the thoracic effort signal and extracts the features that will serve as inputs to the classifier. Finally, the classification stage labels the detected apnea as central, obstructive or mixed. The preprocessing and the classification stages are the core of this article and are described below.

Fig. 2. Structure of the proposed classification method.

Fig. 3. Wavelet decomposition tree.

2.1 Preprocessing Stage

There are two main reasons why the raw samples of the signal cannot be directly used as inputs to the classifier:

– Each apnea has a different duration (from 10 seconds to one minute or even more), thus the number of samples is variable for each event while the number of inputs to the classification system must be fixed.

– Usually, due to the sampling frequency, the number of samples of a sleep apnea varies from 125 to 750. Clearly, this number of inputs is very high, so a preprocessing technique is required to reduce it.

In this work, a discrete wavelet transformation [8] was used as a preprocessing phase to reduce and fix the number of inputs. The wavelet transformation provides a decomposition of a given signal into a set of approximation ($a_i$) and detail ($d_i$) coefficients of level i. The decomposition process can be iterated, with successive approximations being decomposed in turn, so that a signal is broken down into many lower-resolution components (see figure 3). Different types of wavelet functions can be used to obtain this decomposition.

In this work, a set of coefficients (in absolute value) from the transformation is used as the input of the classification system. The samples of the sleep apnea event are processed to obtain a level-1 transformation ($a_1$ and $d_1$ coefficients). Subsequently, each set of $a_i$ coefficients is decomposed into a set of approximation $a_{i+1}$ and detail $d_{i+1}$ coefficients. Experimentally, it was determined that the level-5-detail coefficients, $d_5$, are the set of inputs that obtains the best apnea classification results, using the symlet of level 7 as the wavelet function to make the decomposition. Figure 4 shows an example of each class of apnea and the corresponding first 16 coefficients in absolute value.

Fig. 4. Example of the three types of sleep apnea (obstructive, central and mixed; thoracic effort against samples) and the corresponding level-5-detail coefficients in absolute value.
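The iterated decomposition can be illustrated with a short sketch. Note that this is not the preprocessing used in the paper: for brevity it swaps in the Haar wavelet instead of the symlet, and the function names, the padding rule for odd-length signals, the zero-padding of short events and the synthetic signal are all assumptions made for the example. It shows only how each approximation is split again until the level-5 detail coefficients are reached, and how a fixed number of absolute values is kept as features.

```python
import math

def haar_step(x):
    """One level of the Haar DWT: returns (approximation, detail)."""
    x = list(x)
    if len(x) % 2:                 # assumption: pad odd lengths with the last sample
        x.append(x[-1])
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def level_detail_features(samples, level=5, n_inputs=13):
    """Absolute values of the first n_inputs detail coefficients at `level`."""
    approx = list(samples)
    detail = []
    for _ in range(level):         # a_i is decomposed into a_{i+1} and d_{i+1}
        approx, detail = haar_step(approx)
    feats = [abs(d) for d in detail[:n_inputs]]
    feats += [0.0] * (n_inputs - len(feats))   # assumption: zero-pad short events
    return feats

# Synthetic stand-in for the thoracic effort samples of one apnea (512 samples)
signal = [100 + 50 * math.sin(2 * math.pi * t / 40.0) for t in range(512)]
features = level_detail_features(signal)
```

Whatever the duration of the event, the feature vector has a fixed length, which is the property the classifier requires.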

2.2 The Bayesian Neural Network Classifier

The classifier employed in this work was a feedforward artificial neural network with one hidden layer, as it has been demonstrated that, with an appropriate number of hidden neurons, one hidden layer is enough to approximate any function [9]. The input of the network is formed by the vector $x = (x_1, \ldots, x_I)^T$, where I is the number of input variables. In this case, x is a vector composed of the first I coefficients of the $d_5$ decomposition of the wavelet transformation of the samples of the apnea in the thoracic effort signal. This input of the network was normalized to have zero mean and standard deviation equal to one.

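The output k of the network, $y_k$, is defined by the following equations:

$$\begin{aligned}
y_k &= f^{(2)}(a_k); \quad a_k = \sum_{j=1}^{H} w_{kj}^{(2)} h_j + w_{k0}^{(2)}; \quad k = 1, \ldots, C \\
h_j &= f^{(1)}\left( \sum_{i=1}^{I} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right); \quad j = 1, \ldots, H
\end{aligned} \tag{1}$$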

where C is the number of outputs, H is the number of hidden neurons, the superscript indicates the number of the layer, w are the weights of the neural network, and $f^{(1)}$ and $f^{(2)}$ are, respectively, the activation functions of the neurons of the hidden and output layers. In this work, we have used the hyperbolic tangent function for $f^{(1)}$ and the softmax activation function for $f^{(2)}$, which makes it possible to interpret the outputs as probabilities [10] and is defined as:

$$f^{(2)}(a_k) = \frac{\exp(a_k)}{\sum_{k'=1}^{C} \exp(a_{k'})} . \tag{2}$$
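As an illustration of equations (1) and (2), the forward pass of such a network might be sketched as follows. The function name, the list-of-lists weight layout and the toy weights are assumptions made for the example; the bias terms $w_{j0}^{(1)}$ and $w_{k0}^{(2)}$ are passed as separate vectors `b1` and `b2`.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward pass: tanh hidden layer followed by a softmax output layer."""
    # Hidden activations h_j = tanh(sum_i W1[j][i] * x[i] + b1[j])
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output pre-activations a_k = sum_j W2[k][j] * h[j] + b2[k]
    a = [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)]
    # Softmax; subtracting max(a) avoids overflow and leaves the result unchanged
    m = max(a)
    e = [math.exp(ak - m) for ak in a]
    total = sum(e)
    return [ek / total for ek in e]

# Toy network with I = 2 inputs, H = 3 hidden neurons, C = 3 classes (made-up weights)
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.4], [-0.6, 0.1, 0.9]]
b2 = [0.0, 0.0, 0.0]
y = forward([1.0, -1.0], W1, b1, W2, b2)
```

The outputs are positive and sum to one, which is what allows them to be read as class probabilities.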

To obtain the optimal set of weights of the network that minimizes a cost function, a Bayesian framework approach was used [11, 12]. In this supervised learning method the following error or cost function is employed:

$$E(w) = E_D(w) + \alpha E_W(w), \tag{3}$$

where all the weights of the network are compactly represented by the vector w for simplicity of notation. In equation (3) the $E_D(w)$ term measures the error obtained when the output of the network is compared with the expected output. In this work, the cross-entropy error function has been used, as it is the most suitable for classification problems [13]. Therefore, $E_D(w)$ is defined as

$$E_D(w) = - \sum_{n=1}^{N_D} \sum_{i=1}^{C} t_i^{(n)} \ln\left( y_i(w, x^{(n)}) \right) \tag{4}$$

where $N_D$ is the number of training data, C is the number of classes, $\{(x^{(n)}, t^{(n)})\}$ is the set of training input-output pairs and $t^{(n)}$, the expected output, is given by

$$t_k^{(n)} = \begin{cases} 1 & \text{if } x^{(n)} \in C_k \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $k = 1, \ldots, C$ and $C_k$ is the set of patterns in the class k.

The second term in equation (3), $E_W(w)$, is a regularization function, called weight decay [14], defined as:

$$E_W(w) = \frac{1}{2} \sum_{j=1}^{N_W} w_j^2 \tag{6}$$

where $N_W$ is the number of weights of the neural network and $w_j$ is the j-th component of w. The use of this term helps to avoid overfitting and thus enhances the generalization of the network [13, 14].
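The regularized cost of equations (3), (4) and (6) can be written down directly. The sketch below is a minimal illustration, assuming the network outputs have already been computed; the function names and the toy targets, outputs and weights are made up for the example.

```python
import math

def cross_entropy(targets, outputs):
    """E_D: cross-entropy between 1-of-C targets and network outputs (eq. 4)."""
    return -sum(t * math.log(y)
                for t_vec, y_vec in zip(targets, outputs)
                for t, y in zip(t_vec, y_vec) if t > 0)

def weight_decay(weights):
    """E_W: half the sum of squared weights (eq. 6)."""
    return 0.5 * sum(w * w for w in weights)

def total_error(targets, outputs, weights, alpha):
    """E(w) = E_D(w) + alpha * E_W(w) (eq. 3)."""
    return cross_entropy(targets, outputs) + alpha * weight_decay(weights)

# Two toy patterns with C = 3 classes, and three toy weights
targets = [[1, 0, 0], [0, 0, 1]]
outputs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
weights = [0.5, -1.0, 2.0]
E = total_error(targets, outputs, weights, alpha=0.1)
```

Only the output assigned to the true class of each pattern contributes to $E_D$, while a larger `alpha` penalizes large weights more strongly.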

Finally, the α parameter in equation (3) is called the adaptive regularization hyperparameter and it is updated as in [11]:

$$\alpha_{new} = \frac{N_W - \alpha_{old} \, \mathrm{Trace}(A^{-1})}{2 E_W(w^*)} \tag{7}$$

where A is the Gauss-Newton approximation to the Hessian matrix and $w^*$ is the optimum weight vector. To obtain $w^*$, the UCMINF method [15] was employed, which is a quasi-Newton optimization algorithm with a soft line search and a BFGS (Broyden, Fletcher, Goldfarb, Shanno) update of the inverse Hessian. Once the weights have converged, the hyperparameter α is updated, and this process is repeated until the hyperparameter converges.

3 Experimental Results

To obtain the training set, 6 different recordings from 6 patients were available. The signals were sampled with a frequency of 12.5 Hz. The apneas contained in these recordings were classified by an expert in the field (217 obstructive apneas, 40 central apneas and 82 mixed apneas). To obtain a balanced training set, 120 apneas were selected (40 of each class), thus $N_D = 120$. All the central patterns were used, while the 40 events of each of the other two classes were randomly selected. Due to the small size of the training set, a 10-fold cross validation was used in order to estimate the true error rate of the classifier.

To choose the best network the following model selection method was employed, where M is the number of different models:

1 for m = 1 to M
  1.1 Take the whole data set and generate N different 10-fold cross validation sets in order to have disjoint and different partitions (randomly selected) of the training set. Also, in each simulation, employ different initial conditions of the model (weights of the neural network).
  1.2 Train a model (neural network) with a certain degree of complexity (number of hidden neurons and number of inputs) and obtain N accuracy measures over the test set: $T_m = \{T_{m1}, \ldots, T_{mN}\}$.
2 end
3 Apply a Kruskal-Wallis test [16], a nonparametric version of the classical one-way analysis of variance (ANOVA), to check if there are significant differences among the means of the M models for a level of significance γ.
4 If there are differences among the means, then apply a Multiple Comparison Procedure (MCP) [17] to find the set of models whose errors are not significantly different from that corresponding to the model with the maximum mean accuracy rate. From this set select the simplest model (lowest complexity). In this work, Tukey's honestly significant difference criterion [17] was used as the multiple comparison test.
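Step 1.1 of the procedure above, the generation of N random 10-fold partitions, can be sketched as follows. This is an illustrative sketch only: the function names and the seed are assumptions, and the training of the networks and the statistical tests of steps 1.2 to 4 are not covered.

```python
import random

def ten_fold_partitions(n_samples, n_repeats, k=10, seed=0):
    """Return n_repeats random partitions of range(n_samples) into k disjoint folds."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Fold f takes every k-th shuffled index, so fold sizes differ by at most one
        partitions.append([indices[f::k] for f in range(k)])
    return partitions

def train_test_splits(folds):
    """Yield (train, test) index lists, holding out each fold once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 120 apneas and N = 100 repetitions, as in the experiments described in the text
partitions = ten_fold_partitions(n_samples=120, n_repeats=100)
```

Each repetition yields ten train/test splits of 108 and 12 events, and every event is held out exactly once per repetition.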

Following the previous steps, with N = 100, several neural networks were trained using from 12 to 16 coefficients (inputs of the network) and from 4 to 14 neurons in the hidden layer. Figures 5 and 6 show the obtained results, using a 10-fold cross validation, for the training and test set, respectively.

These figures show the box-whiskers plots for each network's topology. The box corresponds to the interquartile range, the bar represents the median, and the whiskers extend to the minimum and maximum values. Outliers are data with values beyond the ends of the whiskers and they are represented by the plus sign. In these figures, the x-axis represents the classification accuracy and the y-axis is formed by a pair indicating the number of coefficients used as inputs in the network and the number of hidden neurons. In figure 6 it can be seen that the 13-6-3 topology obtains the best median accuracy.

Fig. 5. Box-whiskers plots for the training data using a 10-fold cross validation and 100 different experiments.

Fig. 6. Box-whiskers plots for the test data using a 10-fold cross validation and 100 different experiments.


Fig. 7. Plot for the multiple comparison procedure.

Subsequently, after the 100 experiments for each model, the Kruskal-Wallis test was applied to check whether there are statistically significant differences among the mean test accuracies. The p-value obtained was 0 at the 95% confidence level. Therefore, the initial hypothesis (all means are equal) can clearly be rejected. Afterwards, the multiple comparison procedure was performed to make all pairwise comparisons among the models. Figure 7 graphically represents the comparison against the best topology (13 inputs and 6 hidden neurons). Those topologies whose interval does not overlap the dashed line are significantly different from the best topology and can therefore be discarded. Among the other topologies, whose intervals overlap the dashed line, the simplest must be chosen. Therefore, the 13-4-3 topology was selected as the model to use as sleep apnea classifier. The mean test accuracy obtained for the selected topology (13-4-3) in the 100 10-fold cross validations was 83.78% ± 1.90. Also, the mean accuracy and the confidence interval obtained for each one of the classes were 80.90% ± 2.53 (obstructive), 80.48% ± 3.65 (mixed) and 89.95% ± 2.71 (central). The corresponding confusion matrix is shown in table 1. As can be seen, the maximum number of errors appears between the obstructive and mixed classes and between the central and mixed classes. This is a logical result, since the mixed class is a mixture of the other two classes.

Finally, to the authors' knowledge, only two previous methods have been proposed for sleep apnea classification. The first applies a radial basis function neural network to an integrated detection-classification task [18]. The method described here surpasses their results, although both methods are only partially comparable, because the results in [18] correspond to detection plus classification, and not only to the latter, as in the case of the method presented in this paper. Also, in [7] a neural system for sleep apnea classification based on the raw samples of the signal is proposed. The accuracy obtained over the test set was 75.32%.

Table 1. Confusion matrix showing the number of cases classified in each class for the test set using a 10-fold cross validation and 100 different experiments.

Real class     Neural network's output
               Obstructive     Mixed           Central
Obstructive    32.36 ± 1.01    6.10 ± 0.88     1.54 ± 0.70
Mixed          2.91 ± 0.93     32.19 ± 1.46    4.90 ± 1.07
Central        1.40 ± 0.83     2.62 ± 0.63     35.98 ± 1.08
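The accuracies quoted in the text can be recovered from the mean counts of Table 1, which offers a quick consistency check. The sketch below uses only the mean values of the confusion matrix (the ± terms are ignored); the dictionary layout and function names are choices made for the example.

```python
# Mean counts from Table 1: rows are the real class, columns the network's output
confusion = {
    "obstructive": {"obstructive": 32.36, "mixed": 6.10, "central": 1.54},
    "mixed":       {"obstructive": 2.91,  "mixed": 32.19, "central": 4.90},
    "central":     {"obstructive": 1.40,  "mixed": 2.62,  "central": 35.98},
}

def per_class_accuracy(cm):
    """Percentage of each real class that is assigned to the correct output."""
    return {c: 100.0 * row[c] / sum(row.values()) for c, row in cm.items()}

def overall_accuracy(cm):
    """Percentage of all cases on the diagonal of the confusion matrix."""
    correct = sum(row[c] for c, row in cm.items())
    total = sum(sum(row.values()) for row in cm.values())
    return 100.0 * correct / total

class_acc = per_class_accuracy(confusion)
total_acc = overall_accuracy(confusion)
```

Each row sums to 40 cases, so the diagonal entries give roughly 80.90%, 80.48% and 89.95% for the obstructive, mixed and central classes, and about 83.78% overall, in line with the figures reported in the text.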

4 Conclusion

In this paper, a new method for sleep apnea classification has been proposed. The method is based on a feedforward neural network trained using a Bayesian framework and a regularized cross-entropy function. The input of the neural network is formed by the coefficients of a discrete wavelet decomposition applied to the raw samples of the apnea. The experimental results obtained, using 120 apneas from 6 different patients, have demonstrated the validity of the proposed method.

Acknowledgements

This research has been supported by the Xunta de Galicia (project PGIDT-01PXI10503PR).

References

1. Thorpy, M.J.: Handbook of Sleep Disorders. Marcel Dekker, Inc., New York, Basel (1990)

2. Dagum, P., Galper, A.: Forecasting sleep apnea with dynamic networks. In: Proc. 9th Conf. on Uncertainty in Artificial Intelligence, Morgan Kaufmann (1993) 64–71

3. Rauscher, H., Popp, W., Zwick, H.: Computerized detection of respiratory events during sleep from rapid increases in oxyhemoglobin saturation. Lung (1991) 335–342

4. George, C.F., Millar, T.W., Kryger, M.H.: Identification and quantification of apneas by computer-based analysis of oxygen saturation. Am Rev Respir Dis 137 (1988) 1238–1240

5. Oliveira, J.M., Tome, A.M., Cunha, J.P.: Sleep data integration and analysis - an object oriented approach. Proc. 15th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (1993) 455–456

6. Cabrero-Canosa, M., Castro-Pereiro, M., Graña-Ramos, M., Hernández-Pereira, E., Moret-Bonillo, V., Martín-Egaña, M., Verea-Hernando, H.: An intelligent system for the detection and interpretation of sleep apneas. Expert Systems with Applications (2003) 335–349

7. Hernández-Pereira, E., Carrillo-Rozas, N., Cabrero-Canosa, M., Moret-Bonillo, V.: Detección y clasificación de apneas de sueño mediante wavelets y redes de neuronas artificiales. In Lasaosa, P.L., Gasso, S.O., Gascon, G.M., Cortes, J.P.M., eds.: Proc. XX Congreso Anual de la Sociedad Española de Ingeniería Biomédica. (2002) 199–202

8. Rao, R.M., Bopardikar, A.S.: Wavelet Transformations. Introduction to Theory and Applications. Addison-Wesley, Reading, MA (1998)

9. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2 (1989) 359–366

10. Bridle, J.: Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Fogelman-Soulie, F., Herault, J., eds.: Neurocomputing: Algorithms, Architectures and Applications, New York, Springer-Verlag (1990) 227–236

11. MacKay, D.J.C.: A practical Bayesian framework for backprop networks. Neural Computation 4 (1992) 448–472

12. MacKay, D.J.C.: The evidence framework applied to classification networks. Neural Computation 4 (1992) 720–736

13. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, New York (1995)

14. Hinton, G.E.: Learning translation invariant recognition in massively parallel networks. In Nijman, A.J., de Bakker, J., Treleaven, P.C., eds.: PARLE Conference on Parallel Architectures and Languages Europe, Berlin, Springer (1987) 1–13

15. Nielsen, H.: UCMINF - An algorithm for unconstrained, nonlinear optimization. Technical Report IMM-REP-2000-0019, Department of Mathematical Modelling, Technical University of Denmark (2000)

16. Hollander, M., Wolfe, D.A.: Nonparametric Statistical Methods. John Wiley & Sons (1973)

17. Hsu, J.C.: Multiple Comparisons. Theory and Methods. Chapman & Hall/CRC, Boca Raton, FL (1996)

18. Zemen, T., Clabian, M., Pfutzner, H.: Classification of sleep apnea events by means of radial basis function networks. In: International ICSC/IFAC Symposium on Neural Computation (NC'98). (1998) 351–357

Probabilistic Networks as Probabilistic Forecasters*

Linda C. van der Gaag and Silja Renooij

Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

{linda,silja}@cs.uu.nl

Abstract. To establish its clinical value, a probabilistic network is typically subjected to an evaluation study using real patient data from the field of application. The results of such a study are often summarised in the percentage of correctly predicted outcomes. In this paper, we propose the use of a forecasting score as an alternative way of expressing the clinical value of a network. Such a score takes not just the predicted outcome into consideration but also the associated distribution of uncertainty. We illustrate the use and interpretation of the Brier forecasting score for a real-life probabilistic network in oncology.

1 Introduction

An increasing number of decision-support systems are being designed that aim at supporting the tasks of medical diagnosis and prognostication. More and more of these systems build upon a probabilistic network for capturing and reasoning about the uncertainties involved in these tasks. A probabilistic network is a concise representation of a joint probability distribution and provides for efficiently computing any probability of interest over its variables [1].

To establish the clinical value of a probabilistic network that is developed for a medical field of application, it is typically subjected to an evaluation study using real patient data. Such a study amounts to entering the data available for each patient into the network, computing the most likely diagnosis or prognosis, and comparing this outcome against a given standard of validity. The percentage of correctly predicted outcomes is then taken to convey the clinical value of the network. For example, a percentage correct of 85% is taken to indicate that the network establishes the correct outcome for 85 out of every 100 patients. A percentage correct cannot be taken at face value, however, as it pertains to a specific data collection. Each data collection is likely to include errors, to reflect biases, and to show the effects of random variation. These factors affect the percentage correct for the network under study, yet the percentage does not express the extent to which they do so.

* This research is (partly) supported by the Netherlands Organisation for Scientific Research (NWO).



While for computing a network's percentage correct a single outcome per patient is established, the network in essence does not yield a single, deterministic outcome. Instead, it produces a posterior probability distribution for the outcome variable. Since the percentage correct only considers the most likely outcome, it disregards the uncertainty expressed by the posterior distribution. To incorporate this uncertainty in the assessment of a network's clinical value, we propose the use of a forecasting score from the field of statistical forecasting. We illustrate the use and interpretation of such a score by means of an evaluation study of a real-life probabilistic network in the field of oesophageal cancer.

The paper is organised as follows. In Sect. 2, we briefly describe the oesophagus network and the available patient data. Sect. 3 presents the results from an evaluation study of the network in terms of its percentage correct. Sect. 4 introduces the Brier score as an alternative way of summarising the results from the study. The paper ends with our concluding observations in Sect. 5.

2 The Oesophagus Network and the Patient Data

With the help of two experts in gastrointestinal oncology from the Netherlands Cancer Institute, Antoni van Leeuwenhoekhuis, we constructed a probabilistic network in the field of oesophageal cancer. The network details the characteristics of an oesophageal tumour and captures the pathophysiological processes associated with its growth. The advance of the cancer is summarised in its stage, which can be either I, IIA, IIB, III, IVA, or IVB, in progressive order. The network currently includes 42 statistical variables and almost 1000 (judgmental) probabilities [2], and provides for computing the most likely stage of a patient's cancer based upon his or her symptoms and test results.

For studying the clinical value of the oesophagus network, the medical records of 156 patients diagnosed with oesophageal cancer were available from the Antoni van Leeuwenhoekhuis; these data had not been used in the construction of the network. For each patient, between 6 and 21 different symptoms and test results are available. Also recorded is the stage of the patient's cancer as established by the attending physician. In our evaluation study, we take these stages as the standard of validity against which to compare the outcomes of our network.

3 The Percentage Correct and Its Shortcomings

Using the available patient data, we conducted an evaluation study of the oesophagus network. We entered, for each patient, all symptoms and test results available and computed the most likely stage for the patient's cancer; we then compared this stage against the one mentioned in the patient's medical record. The results are summarised in the table of Fig. 1, on the left. We find that the network establishes the correct stage for 133 of the 156 patients, that is, we find a percentage correct of 85%.
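The percentage correct is simply the share of patients on the diagonal of the left-hand table of Fig. 1. A minimal sketch of that computation follows; the function names are illustrative and not part of the original study.

```python
# Numbers of patients from the left-hand table of Fig. 1
# (rows: stage in the medical record, columns: stage computed by the network).
STAGES = ["I", "IIA", "IIB", "III", "IVA", "IVB"]
COUNTS = [
    [2, 0, 0, 0, 0, 0],
    [0, 37, 0, 1, 0, 0],
    [0, 1, 0, 3, 0, 0],
    [1, 10, 0, 36, 0, 0],
    [0, 0, 0, 4, 35, 0],
    [0, 0, 0, 3, 0, 23],
]


def percentage_correct(counts):
    """Return (correctly staged, total, percentage correct)."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct, total, 100.0 * correct / total


def most_frequent_error(counts, stages):
    """Off-diagonal cell with the largest count: (recorded, computed, count)."""
    cells = [
        (stages[i], stages[j], counts[i][j])
        for i in range(len(counts))
        for j in range(len(counts))
        if i != j
    ]
    return max(cells, key=lambda cell: cell[2])


if __name__ == "__main__":
    correct, total, pct = percentage_correct(COUNTS)
    print(f"{correct} of {total} patients staged correctly ({pct:.1f}%)")
    print("most frequent error:", most_frequent_error(COUNTS, STAGES))
```

The single number summarises the diagonal only; it says nothing about how confident the network was in each of these stagings.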

The numbers of correctly and incorrectly staged patients, as shown in Fig. 1, do not convey any information about the uncertainty in the outcomes computed


             network (numbers of patients)          network (average Brier scores)
data     I     IIA   IIB   III   IVA   IVB      I      IIA    IIB    III    IVA    IVB
I        2     0     0     0     0     0        0.21   –      –      –      –      –
IIA      0     37    0     1     0     0        –      0.28   –      1.52   –      –
IIB      0     1     0     3     0     0        –      1.17   –      0.98   –      –
III      1     10    0     36    0     0        1.40   0.89   –      0.26   –      –
IVA      0     0     0     4     35    0        –      –      –      0.75   0.08   –
IVB      0     0     0     3     0     23       –      –      –      0.87   –      0.06

Fig. 1. Results from the evaluation study: the numbers of correctly and incorrectly staged patients (left) and the average Brier scores (right)

Stage        I        IIA      IIB      III      IVA      IVB
Patient 1    0        0        0.0159   0.0882   0.8245   0.0714
Patient 2    0        0        0.0002   0.3616   0.3498   0.2884
Patient 3    0.0222   0.3753   0.0459   0.3714   0.0916   0.0936

Fig. 2. The posterior distributions over the six possible stages for three patients; the medical records state stage IVA for patient 1 and stage III for patients 2 and 3

from the oesophagus network. We recall that the network yields, for each patient, a posterior probability distribution over the possible stages of his or her cancer; as an example, Fig. 2 shows the probability distributions that are yielded for three real patients. Now, such a computed distribution may clearly point to a single most likely stage. The medical record of patient 1, for example, mentions stage IVA for his cancer. Stage IVA is indeed yielded by the network as the most likely stage; moreover, it is predicted with high probability, indicating that there is little doubt as to the true stage of this patient's cancer. The computed posterior distribution, however, may also reveal considerable uncertainty. The medical record of patient 2, for example, mentions stage III. The network indeed finds III for the most likely stage, but not without considerable doubt: it assigns relatively high probabilities to the stages IVA and IVB as well. For patient 3, the medical record also states stage III, yet the network yields stage IIA. The probability computed for stage III, however, is almost equal to the probability of stage IIA. The percentage correct reported for the network does not express these distributions of uncertainty over the various different stages. For the patients shown in Fig. 2, the network's predictions are classified simply as correct for the first two patients and as incorrect for patient 3.


4 The Forecasting Score

As illustrated in the previous section, the percentage correct as a summary of evaluation results does not take the uncertainties of a network's predictions into account. We feel that for assessing the clinical value of a real-life probabilistic network, not just the most likely outcome but also the posterior distribution over all possible outcomes should be studied. To this end, we observe that probabilistic networks in essence are probabilistic forecasters. For the oesophagus network, for example, the posterior distribution over the six possible stages that is computed for a specific patient can be viewed as a forecast for the true stage of this patient's cancer. An alternative way of establishing the clinical value of a probabilistic network now is to assess its quality as a forecaster.

In the field of statistical forecasting, various scores for expressing the quality of a probabilistic forecaster have been developed, among which the Brier score is the best known [3]. We illustrate the basic idea of this score for our oesophagus network. For each patient $i$, the network yields a forecast that is composed of the posterior probabilities $p_{ij}$ over the stages $j = \mathrm{I}, \ldots, \mathrm{IVB}$. The Brier score $B_i$ of this forecast is defined as

\[ B_i = \sum_{j = \mathrm{I}, \ldots, \mathrm{IVB}} (p_{ij} - s_{ij})^2 \]

where $s_{ij} = 1$ if the medical record of patient $i$ states stage $j$, and $s_{ij} = 0$ otherwise. If the network were to yield the correct stage with certainty, then the associated Brier score would be equal to 0; for an incorrect deterministic forecast, the score would be 2. The Brier score thus ranges between 0 and 2, and the better the forecast, the lower the score.

The Brier scores of the forecasts for the three patients from Fig. 2 are $B_1 = 0.04$, $B_2 = 0.61$, and $B_3 = 0.56$, respectively. These scores reveal that the forecast for patient 1 is of high quality. The forecasts for patients 2 and 3, on the other hand, appear to be of lesser quality. We recall that the forecast for patient 3 is equivocal as a result of two stages being almost equally likely. For patient 2, there is even more uncertainty in the forecast, as there are three almost equally likely stages. These observations are reflected in the associated Brier scores: the score for patient 3 indicates higher quality than the score for patient 2. While, in terms of the numbers of correctly and incorrectly staged patients, the forecast for patient 2 is correct and the forecast for patient 3 is incorrect, the use of the Brier score results in a more balanced quality assessment.
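The three scores can be reproduced directly from the posterior distributions of Fig. 2. The sketch below uses the probabilities as read from that figure; the function and variable names are illustrative only.

```python
STAGES = ["I", "IIA", "IIB", "III", "IVA", "IVB"]


def brier_score(forecast, true_stage):
    """Sum over all stages of (p_ij - s_ij)^2 for one patient."""
    return sum(
        (p - (1.0 if stage == true_stage else 0.0)) ** 2
        for stage, p in zip(STAGES, forecast)
    )


# Posterior distributions from Fig. 2, paired with the stage
# stated in the medical record of each patient.
PATIENTS = {
    1: ([0.0, 0.0, 0.0159, 0.0882, 0.8245, 0.0714], "IVA"),
    2: ([0.0, 0.0, 0.0002, 0.3616, 0.3498, 0.2884], "III"),
    3: ([0.0222, 0.3753, 0.0459, 0.3714, 0.0916, 0.0936], "III"),
}

if __name__ == "__main__":
    for patient, (forecast, recorded) in PATIENTS.items():
        print(f"B{patient} = {brier_score(forecast, recorded):.2f}")
```

A deterministic forecast placing all mass on the recorded stage scores 0 with this function, and one placing all mass on a different stage scores 2, matching the range given for the score.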

Now, to assess the quality of the oesophagus network as a probabilistic forecaster, we once again conducted an evaluation study using the available patient data. We entered, for each patient, all symptoms and test results available and computed the posterior probability distribution over the possible stages of the patient's cancer; we then computed the Brier score of the resulting forecast, given the stage mentioned in the patient's medical record. The table of Fig. 1 summarises, on the right, the averaged Brier scores. The low scores on the diagonal signify that the associated forecasts are of high quality. The higher scores beside the diagonal indicate forecasts of lesser quality. The relatively poor quality of these forecasts may have its origin in uncertainty as to which stage is the true one, as for example for the patients 2 and 3 discussed above. A higher score can also result, however, from a forecast that associates a high probability with an incorrect stage and may thus point to a possible modelling error in the network.

The quality of a real-life probabilistic network can now be expressed in an overall score that averages the scores of the separate forecasts yielded for a given collection of patients. For the oesophagus network, we find an overall Brier score of 0.29 for the available patient data. To interpret this number, we compare it against the overall scores found for three more or less uninformed forecasters. The first of these forecasters does not use any domain knowledge: for each patient, it simply returns a uniform probability distribution over the six possible stages. This forecaster has an overall Brier score of 0.83. The second forecaster yields, for each patient, the prior distribution over the possible stages computed from the network. This forecaster has an overall Brier score of 0.80 and is therefore slightly more informed than the uniform forecaster. The third forecaster, to conclude, yields, for each patient, the prior distribution over the stages recorded in the data collection. This forecaster has an overall Brier score of 0.76, which is slightly lower than the overall score of the second forecaster as a result of its bias towards the data. The much lower Brier score of the oesophagus network now conveys that the network builds upon its knowledge of oesophageal cancer to arrive at relatively good forecasts.
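Two of the three baseline scores can be recomputed from the row totals of the left-hand table of Fig. 1, which give the number of patients per recorded stage. The sketch below does so under that assumption; the names are illustrative. The second forecaster is not reproduced, because the network's prior distribution is not given in the text.

```python
# Patients per recorded stage: row totals of the left-hand table of Fig. 1.
STAGE_COUNTS = {"I": 2, "IIA": 38, "IIB": 4, "III": 47, "IVA": 39, "IVB": 26}


def brier_score(forecast, true_stage):
    """Brier score of one forecast (dict stage -> probability)."""
    return sum(
        (p - (1.0 if stage == true_stage else 0.0)) ** 2
        for stage, p in forecast.items()
    )


def overall_brier(forecast, stage_counts):
    """Average score when the same forecast is issued for every patient."""
    total = sum(stage_counts.values())
    weighted = sum(
        n * brier_score(forecast, stage) for stage, n in stage_counts.items()
    )
    return weighted / total


n_patients = sum(STAGE_COUNTS.values())
uniform = {stage: 1.0 / len(STAGE_COUNTS) for stage in STAGE_COUNTS}
data_prior = {stage: n / n_patients for stage, n in STAGE_COUNTS.items()}

if __name__ == "__main__":
    print(f"uniform forecaster:    {overall_brier(uniform, STAGE_COUNTS):.2f}")
    print(f"data-prior forecaster: {overall_brier(data_prior, STAGE_COUNTS):.2f}")
```

For a forecaster that always returns the empirical stage frequencies $q_j$, the average reduces to $1 - \sum_j q_j^2$, which is why a data collection dominated by a few stages yields a lower baseline than the uniform forecaster.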

5 Conclusions

The clinical value of a probabilistic network that is developed for a medical application is typically established by subjecting it to an evaluation study using real patient data. We argued that the percentage correct that is generally computed from such a study hides the distribution of uncertainties over the possible outcomes and consequently hides the network's doubt as to the true outcome. We suggested the use of a forecasting score to yield a more balanced value assessment for a probabilistic network. We showed that such a score takes not just the most likely outcome but all possible outcomes with their associated uncertainties into consideration and thereby provides useful information in addition to the percentage correct.

References

1. F.V. Jensen (1996). An Introduction to Bayesian Networks. UCL Press, London.

2. L.C. van der Gaag, S. Renooij, C.L.M. Witteman, B.M.P. Aleman, and B.G. Taal (2002). Probabilities for a probabilistic network: A case-study in oesophageal cancer. Artificial Intelligence in Medicine, vol. 25, pp. 123–148.

3. H.A. Panofsky and G.W. Brier (1968). Some Applications of Statistics to Meteorology. The Pennsylvania State University, University Park, Pennsylvania.

Finding and Explaining Optimal Treatments

Concha Bielza1, Juan A. Fernandez del Pozo1, and Peter Lucas2

1 Decision Analysis Group, Technical University of Madrid, Campus de Montegancedo, Boadilla del Monte, 28660 Madrid, Spain
{mcbielza,jafernandez}@fi.upm.es

2 Institute for Computing and Information Sciences, University of Nijmegen, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands
[email protected]

Abstract. Influence diagrams are modern decision-theoretic representations that can be used to model medical decision-making problems. The output of evaluating an influence diagram is a set of decision tables with optimal decision alternatives. For real-life clinical problems the resulting tables can be very large, so that understanding their content is nearly impossible. KBM2L lists are new list-based structures suitable for minimising the memory storage space of these tables, and at the same time searching for a better knowledge organisation. In this paper, we study the application of KBM2L lists for finding the optimal treatments for gastric non-Hodgkin lymphoma.

1 Introduction

An influence diagram (ID) is a modern decision-theoretic formalism, frequently adopted as a basis for the construction of decision-support systems (DSS) [1]. The results of solving, or evaluating, an ID are decision tables containing the decision alternatives with maximum expected utility, for every combination of variables. The evaluation algorithm determines which these variables are.

Medical doctors may find it difficult to accept the recommendations suggested in those tables if they do not understand the reasons behind their content. If the system were able to provide explanations for medical decisions, new insights into the problem might be obtained, and it might also serve for system validation.

Considering that the table sizes are exponential in the number of variables, finding explanations is a hard task, also from a purely computational viewpoint. Turning the huge tables into more compact tables will yield memory savings. If the resulting compact tables offer insight into the original tables, then finding explanations and optimising the storage space of the decision tables are to some extent the same problem.

KBM2L lists (1) were introduced to address this problem [2]. Finding good explanations is investigated by a number of disciplines such as knowledge-based systems and machine learning. Our approach bears some resemblance to techniques for knowledge extraction [3–5], and to the identification of relevant nodes for each decision node in an ID [6]. As explained in detail in [2], at the start of the algorithm there are already correctly classified cases represented in table form, which may be interpreted as representing types of patients, and we try to extract reasons underlying this classification.

(1) Knowledge Based Multidimensional Matrix, transformed into a List.

In order to investigate the usefulness of this method within a medical setting, an ID regarding the treatment of non-Hodgkin lymphoma of the stomach, gastric NHL for short, previously developed by one of the authors in collaboration with expert clinical oncologists, was chosen as an experimental vehicle [7].

2 Using KBM2L Lists for Explanation Purposes

Let the variables or attributes in the set $\{A_0, \ldots, A_n\}$ be totally ordered, and let $D_i = |\delta(A_i)|$ be the cardinality of the domain of $A_i$. A base is a vector whose elements are the attributes in a specific order. A decision table can then be taken as a multidimensional matrix, and mapped to a linear array or list in a way similar to sequential memory allocation in computers [8]. Given a cell of the table with index $c = (c_0, c_1, \ldots, c_n)$, we define $f : \mathbb{N}^{n+1} \to \mathbb{N}$, such that

\[ f(c_0, c_1, \ldots, c_n) = c_0 \prod_{i=1}^{n} D_i + c_1 \prod_{i=2}^{n} D_i + \cdots + c_n = q \qquad (1) \]

where $q$ is the $c$-offset with respect to the first element of the table in a given base; $w_i = \prod_{j=i+1}^{n} D_j = w_{i+1} D_{i+1}$ is called the $i$-th attribute weight, for $i = 0, 1, \ldots, n-1$, and $w_n = 1$.

Given the adjacency relationship provided by (1) among the table cells, we try to shorten the equivalent list. Then, much in the way sparse matrices are managed, the new compact list will only store one index (or equivalently its offset) per set of consecutive cells having the same optimal alternative. We will choose its last index (offset). This last index together with the shared optimal alternative, representing a set of records or cases, is called an item. The shorter resulting list composed of items is called a KBM2L list [2]. An item is denoted by 〈index or offset, alternative|, where the 〈 symbol reflects that the item offsets increase monotonically, and the | symbol reflects knowledge granularity.
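The mapping (1) and the grouping of consecutive cells into items can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the implementation of [2]: the table is a plain dictionary from index tuples to alternatives, and all function names and the toy table are hypothetical.

```python
from itertools import product


def weights(cardinalities):
    """Attribute weights: w_n = 1 and w_i = w_{i+1} * D_{i+1}."""
    w = [1] * len(cardinalities)
    for i in range(len(cardinalities) - 2, -1, -1):
        w[i] = w[i + 1] * cardinalities[i + 1]
    return w


def offset(index, cardinalities):
    """Equation (1): offset q of the cell with index (c_0, ..., c_n)."""
    return sum(c * w for c, w in zip(index, weights(cardinalities)))


def index_of(q, cardinalities):
    """Inverse of offset(): recover the index tuple from an offset."""
    index = []
    for w in weights(cardinalities):
        index.append(q // w)
        q %= w
    return tuple(index)


def to_kbm2l(table, cardinalities):
    """Compress a table into items (last offset, alternative)."""
    items = []
    # product() enumerates the indices in increasing offset order.
    for index in product(*(range(d) for d in cardinalities)):
        alternative = table[index]
        q = offset(index, cardinalities)
        if items and items[-1][1] == alternative:
            items[-1] = (q, alternative)  # extend the current item
        else:
            items.append((q, alternative))  # start a new item
    return items


if __name__ == "__main__":
    cards = (2, 3, 2)  # toy domains with 2, 3 and 2 values
    toy = {
        idx: ("A" if idx[0] == 0 else "B")
        for idx in product(*(range(d) for d in cards))
    }
    print(weights(cards), offset((1, 2, 1), cards))
    print(to_kbm2l(toy, cards))
```

In the toy table the alternative depends only on the first attribute, so the twelve cells collapse into two items.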

The common components of all the cases in an item will be called the fixed part of the index. Components that do not take the same value in all cases form the variable part. Attributes in the fixed part take equal values and somehow explain why the optimal alternative is also equal throughout the item. They can be interpreted as the explanation of the optimal alternative.

The regularity patterns regarding table contents depend on their internal organisation. Thus, the attributes can be arranged in different orders or bases, always maintaining the same information but leading to different KBM2L sizes. [2] provides efficient heuristics to guide the search for a good base, i.e. one that minimises the number of items, in the space of the possible permutations of attributes.
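To see how the base affects the list size, the sketch below counts the items obtained under every permutation of the attributes of a small table and returns the best one. It is an exhaustive search for illustration only and is not the heuristic of [2]; the function names and the toy table are hypothetical.

```python
from itertools import permutations, product


def count_items(table, base, cardinalities, order):
    """Number of KBM2L items when the attributes are arranged as `order`.

    `table` maps index tuples expressed in the original `base` to
    alternatives; `cardinalities` maps attribute names to domain sizes.
    """
    positions = [base.index(name) for name in order]
    items, previous = 0, object()  # sentinel differs from any alternative
    for idx in product(*(range(cardinalities[name]) for name in order)):
        original = [0] * len(base)
        for pos, value in zip(positions, idx):
            original[pos] = value
        alternative = table[tuple(original)]
        if alternative != previous:
            items += 1
            previous = alternative
    return items


def best_base(table, base, cardinalities):
    """Exhaustively pick a permutation with the fewest items."""
    return min(
        permutations(base),
        key=lambda order: count_items(table, base, cardinalities, order),
    )


if __name__ == "__main__":
    base = ("x", "y")
    cards = {"x": 3, "y": 2}
    # Toy table: the alternative depends only on y.
    toy = {(x, y): ("A" if y == 0 else "B") for x in range(3) for y in range(2)}
    for order in permutations(base):
        print(order, count_items(toy, base, cards, order))
    print("best base:", best_base(toy, base, cards))
```

Moving the decisive attribute to the front groups equal alternatives into long runs. Exhaustive search is only feasible for a handful of attributes, since the number of permutations grows factorially.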


3 The Gastric Non-Hodgkin Lymphoma Problem

Primary gastric non-Hodgkin lymphoma is a relatively rare disorder, accounting for about 5% of gastric tumours. This disorder is caused by a chronic infection with the bacterium Helicobacter pylori. The lack of widespread expertise with this disease makes some form of decision support necessary to assist clinicians in the treatment of these patients. Models used in the past only predict the prognosis of the disease, but cannot be used to select treatment.

To overcome these limitations, the last author, with the help of two oncologists, constructed a number of IDs [7]. These models are only meant to be used for patients with histologically confirmed gastric NHL. We have taken the most complex version, with 3 decision nodes, for this study. helicobacter-treatment (ht) corresponds to the decision (No/Yes) to prescribe antibiotics against H. pylori. The second decision concerns carrying out surgery (s), i.e. the total or partial resection of the stomach. Its alternatives are: curative, i.e. complete removal of the locoregional tumour mass, palliative, i.e. incomplete removal, or none. The last decision, ct-rt-schedule (ctrts), concerns the selection of chemotherapy (Chemo), radiotherapy (Radio), chemotherapy followed by iceberg radiotherapy (Ch.Next.Rad), and none. The ID is discussed in detail in [7]. It consists of 17 chance nodes, one value node, 3 decision nodes, and 42 arcs. Attributes required for the reading of the next section are shown in Table 1.

Table 1. Attributes and domains

Attributes                          Domains
general-health-status (ghs)         Poor, Average, Good
clinical-stage (cs)                 I, II1, II2, III, IV
bulky-disease (bd)                  Yes, No
histological-classification (hc)    Low.Grade, High.Grade
helicobacter-pylori (hp)            Absent, Present
clinical-presentation (cp)          None, Hemorrhage, Perforation, Obstruction

This is a realistic clinical model, reflecting the current scientific evidence in the medical literature about this disorder. The model goes beyond common prognostic models based on logistic regression, as it is part of a DSS that can answer many different clinical questions. It can be used both to determine the optimal treatment for individual patients, and to predict prognosis and generate patient-specific risk profiles. However, understanding the treatment advice generated by a DSS for the whole patient population is not so straightforward. Clinicians would benefit from having clear and concise explanations of the results of the system, which would justify these results and help clinicians understand them. In addition, this would yield an alternative way of validating a system.

A preliminary evaluation of the gastric NHL model's accuracy had already been accomplished by means of a double-blind clinical study. The present research, where we use KBM2L lists to better understand the treatment basis of the gastric NHL model, complements this earlier study.


4 Results

Evaluation of the ID yielded three decision tables. Our algorithm was applied to each KBM2L list to improve the initial bases. The ht, s, and ctrts lists required 12, 78, and 168 base changes, run in CPU times of 2.5, 164, and 2238.3 seconds, respectively, moving from 17, 385 and 678 items to 5, 21 and 218 items.

Then, we sequentially chained the three tables associated with the previous lists to produce a single global table with the complete knowledge. As a result, the base is B0 = (bd, hp, ghs, cs, cp, hc) and its KBM2L list has 340 items. Note that in this global table, we have more possible decisions than in the simpler lists related to the individual treatments, i.e. up to 2 · 3 · 4 = 24, although only 13 are obtained. Palliative surgery, for instance, is never applied, nor are some other combinations.

Taking into account the cardinality of each attribute domain, the number of possible combinations is 480. Therefore, the knowledge represented by the 340 items seems to be considerably fragmented, and there is room for optimisation. Again, after 107 base changes, run in 273.5 seconds, we get a shorter list and refine the knowledge about the decisive attributes. The list has 195 items, a sizeable improvement (a reduction of 42.6%). The optimal base is Bfinal = (hp, hc, cp, cs, bd, ghs).

Table 2 presents a portion of the optimal KBM2L list; we will next consider some of the more noteworthy items. The fixed part of each item, i.e. its explanation, is shown in bold face. This list can be read as representing 195 rules indicating the optimal global policy as a function of the key attributes. The further to the left the attribute is, the more important it becomes (a higher weight with respect to the base, see equation (1)).
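The counts quoted above follow from the domain sizes in Table 1 and the numbers of alternatives of the three decisions. A short check, with illustrative variable names:

```python
from math import prod

# Domain sizes of the six attributes listed in Table 1.
DOMAIN_SIZES = {"ghs": 3, "cs": 5, "bd": 2, "hc": 2, "hp": 2, "cp": 4}

# Numbers of alternatives of the decisions ht, s and ctrts.
DECISION_SIZES = {"ht": 2, "s": 3, "ctrts": 4}

combinations = prod(DOMAIN_SIZES.values())        # cells in the global table
possible_policies = prod(DECISION_SIZES.values()) # joint treatment decisions

items_before, items_after = 340, 195
reduction = 100.0 * (items_before - items_after) / items_before

if __name__ == "__main__":
    print(combinations, possible_policies, f"{reduction:.1f}%")
```

With 480 cells, a list of 340 items means that most items cover only one or two cases, which is why the initial base is considered fragmented.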

Table 2. The optimal KBM2L list (#item and item description)

. . .
80   〈HP: Absent, HC: High.Grade, CP: Perforation, CS: IV, BD: No, GHS: Average, (HT: No, S: Curative, CTRTS: Chemo)|
81   〈HP: Absent, HC: High.Grade, CP: Perforation, CS: IV, BD: No, GHS: Good, (HT: No, S: Curative, CTRTS: Ch.Next.Rad)|
. . .
125  〈HP: Present, HC: Low.Grade, CP: Perforation, CS: III, BD: No, GHS: Good, (HT: No, S: Curative, CTRTS: Chemo)|
126  〈HP: Present, HC: Low.Grade, CP: Perforation, CS: IV, BD: Yes, GHS: Good, (HT: Yes, S: None, CTRTS: None)|
127  〈HP: Present, HC: Low.Grade, CP: Perforation, CS: IV, BD: No, GHS: Good, (HT: Yes, S: Curative, CTRTS: Chemo)|
. . .

Consider rule 80, (ht = No, s = Curative, ctrts = Chemo), in comparison to rule 81, (ht = No, s = Curative, ctrts = Ch.Next.Rad). Both rules include only one case since their variable parts are empty. The patient state ghs has values 'Average' and 'Good', respectively, which explains the difference between the two rules. The reason is that if a patient's health status is good, a more aggressive treatment can be selected. The difference between rule 125, (ht = No, s = Curative, ctrts = Chemo), and rule 127, (ht = Yes, s = Curative, ctrts = Chemo), can be explained by noting that the clinical stage (cs) of the disease is different for both items. Clearly the system has decided that the treatment for the most advanced stage of the disease (cs = IV) for the slowly progressing low-grade version of gastric NHL should be more aggressive than for the less advanced stage (cs = III), where both disease stages are essentially incurable. This obviously might be open to debate amongst oncologists.

Clinicians may be interested not only in comparing rules but also in finding out under which circumstances a treatment is applied. Consider, for example, treatment T ≡ (ht = No, s = Curative, ctrts = Ch.Next.Rad), which belongs to 4 items, including the already mentioned item 81. Focusing on this treatment, a new base organisation is sought distinguishing only a pair of possible treatments, T and ¬T. The resultant base is now (ghs, hc, cp, bd, cs, hp), with only 3 items (lengths 456, 4 and 20, resp.). Note that attribute ghs has gained in importance. The cases associated with T are now grouped into one item and the resulting explanation is ghs = Good, hc = High.Grade, cp = Perforation, bd = No.
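The explanation for T can be recovered from the item lengths alone. The sketch below decodes the offsets of the middle item (offsets 456 to 459, since the first item has length 456 and the second length 4) and keeps the attributes whose value is constant. It assumes that the values of each domain are ordered as listed in Table 1; the function names are illustrative.

```python
# Base found for treatment T, with domains in the order given in Table 1
# (the ordering of the values is an assumption of this sketch).
BASE = ["ghs", "hc", "cp", "bd", "cs", "hp"]
DOMAINS = {
    "ghs": ["Poor", "Average", "Good"],
    "hc": ["Low.Grade", "High.Grade"],
    "cp": ["None", "Hemorrhage", "Perforation", "Obstruction"],
    "bd": ["Yes", "No"],
    "cs": ["I", "II1", "II2", "III", "IV"],
    "hp": ["Absent", "Present"],
}


def decode(q):
    """Map an offset to a dict attribute -> value, by inverting equation (1)."""
    case = {}
    for name in reversed(BASE):  # last attribute has weight 1
        q, c = divmod(q, len(DOMAINS[name]))
        case[name] = DOMAINS[name][c]
    return case


def fixed_part(first_offset, last_offset):
    """Attributes taking a single value over all cases of an item."""
    cases = [decode(q) for q in range(first_offset, last_offset + 1)]
    return {
        name: cases[0][name]
        for name in BASE
        if all(case[name] == cases[0][name] for case in cases)
    }


if __name__ == "__main__":
    lengths = [456, 4, 20]  # item lengths reported in the text
    start = lengths[0]
    end = start + lengths[1] - 1
    print(fixed_part(start, end))
```

Under the stated ordering assumption, the fixed part coincides with the explanation given in the text, and the variable part consists of cs (stages III and IV) and hp.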

5 Final Remark

Medical experts involved in the ID construction process may study whether the generated explanations for the optimal treatment agree with their own knowledge, and parts of the ID may then be improved accordingly. Our current research is directed towards performing sensitivity analysis within our framework.

Acknowledgments

Research supported by Ministry of Science and Technology, Project DPI2001-3731.

References

1. Shachter, R.D.: Evaluating Influence Diagrams. Op. Res. 34 6 (1986) 871–882

2. Fernandez del Pozo, J.A., Bielza, C., Gomez, M.: A List-Based Compact Representation for Large Decision Tables Management. EJOR (2003) to appear

3. Duda, R., Hart, P., Stork, D.: Pattern Classification. 2nd edn. Wiley, NY (2001)

4. Kohavi, R.: Bottom-Up Induction of Oblivious Read-Once Decision Graphs. In: Bergadano, F., De Raedt, L. (eds.): Machine Learning: ECML-94. Lecture Notes in Computer Science, Vol. 784. Springer-Verlag, Berlin (1994) 154–169

5. Pawlak, Z.: Rough Set Approach to Knowledge-Based Decision Support. EJOR 99 (1997) 48–57

6. Lauritzen, S., Nilsson, D.: Representing and Solving Decision Problems with Limited Information. Manag. Sci. 47 9 (2001) 1235–1251

7. Lucas, P., Boot, H., Taal, B.: Computer-Based Decision-Support in the Management of Primary Gastric NHL. Met. Inf. Med. 37 (1998) 206–219

8. Knuth, D.E.: The Art of Computer Programming, Vol. 1: Fundamental Algorithms. Addison-Wesley, Reading (1968)

Acquisition of Adaptation Knowledge for Breast Cancer Treatment Decision Support

Jean Lieber1, Mathieu d'Aquin1, Pierre Bey2, Amedeo Napoli1, Maria Rios3, and Catherine Sauvagnac4

1 Orpailleur research group, LORIA (CNRS, INRIA, Nancy Universities), BP 239, 54 506 Vandœuvre-les-Nancy
{lieber,daquin,napoli}@loria.fr

2 Reseau Oncolor, avenue de Bourgogne, 54 511 Vandœuvre-les-Nancy, France, and Institut Curie, 26, rue d'Ulm, 75248 Paris, France
[email protected]

3 Centre Alexis Vautrin, Services d'Oncologie, avenue de Bourgogne, 54 511 Vandœuvre-les-Nancy
[email protected]

4 Laboratoire d'ergonomie du CNAM, 41, rue Gay-Lussac, 75 005 Paris
[email protected]

Abstract. The elaboration of a treatment in cancerology depends on decision protocols. These protocols are often adapted rather than used straightforwardly. This paper deals with the acquisition of the knowledge exploited during protocol adaptations. It shows that this knowledge acquisition process can be based on similarity paths, which are used for representing the matchings between decision problems (e.g., source and target problems within a case-based reasoning process).

1 Introduction

Case-based reasoning (CBR) consists in reusing the solutions of already solved problems in order to solve a new problem [15]. Such reasoning relies on a retrieval phase (selection of a memorised solved problem with its solution) and on an adaptation of the solution of the retrieved problem, in order to solve the new problem. In many CBR systems, the adaptation is based on complex and domain-dependent adaptation knowledge which has to be acquired and modelled.

This paper presents the acquisition and modelling of adaptation knowledge for the system KASIMIR/CBR, whose application domain is breast cancer treatment. Beyond this application, our ambition is to present some elements of an adaptation knowledge acquisition methodology.

Section 2 describes how breast cancer treatment is managed in the Alexis Vautrin hospital (cancer therapy centre). The Kasimir project, the context of this study, is presented in section 3. The principle of the adaptation process is presented in section 4. Section 5 plays a central role in this paper: it presents the adaptation knowledge acquisition and modelling. The discussion of section 6 comments on the contribution of this work. Section 7 concludes the paper.



2 Breast Cancer Treatment in Alexis Vautrin Hospital

In Alexis Vautrin hospital, breast cancer treatment is based on a protocol. The links between the physicians and the protocol are schematically presented in figure 1.

[Figure: diagram relating the experts, the protocol, the physician and the BTDC through the four links ➀ to ➃ listed below.]

➀ Creation and updating of the protocol based on evidence-based medicine principles
➁ Straightforward (“daily”) use of the protocol
➂ Adaptation-based use of the protocol
➃ Protocol evolutions induced by the adaptations performed during meetings of the BTDC

Fig. 1. The protocol and how it is used.

➀ The protocol is created by a pluridisciplinary group of experts in breast cancer, who use the principles of so-called evidence-based medicine [2] for the most frequent situations of patients with breast cancer. This means that these experts exploit published studies about breast cancer. Another task of this group of experts is to periodically update the protocol, taking into account new knowledge in oncology.

➁ The protocol can be considered as a set of rules helping the physicians in their daily practice. It determines the “standard way” to consider and treat the patient, or options when no standard is available.

➂ Unfortunately, the straightforward use of the protocol gives satisfaction in only about 70 % of the cases. The other cases, the “out of protocol cases”, are (a) the cases for which the rules do not provide any answer (or provide incomplete answers) and (b) the cases for which the solutions proposed by the rules raise some difficulties (contraindication, impossibility of completely applying a treatment, etc.). Most of the time, the out of protocol cases are handled by adaptation of the protocol rules. One role of the BTDC (breast therapeutic decision committee) is to perform these adaptations. This committee gathers every week several specialists involved in breast cancer (specialists in medical treatment, surgery, radiotherapy, etc.). The acquisition and modelling of the adaptation knowledge involved during the meetings of the BTDC constitute the subject of this paper.

➃ The adaptations performed during the meetings of the BTDC may be the cause of protocol evolutions [16]. Indeed, if an adaptation is applied systematically for certain types of cases, it should be possible to integrate it into the protocol (modification of a threshold used in a rule, use of new parameters about the patient, replacement of a rule by two more accurate new rules, etc.). This remark has led to a collaboration between specialists of cancer, of ergonomics and of computer science, in order to design a system that helps protocol evolution, based on the examination of the adaptations performed in the BTDC. Note that this vision of protocol evolution is incomplete (other types of evolutions exist) and schematic (the BTDC is not the only entity taking decisions about evolution; in particular, the group of experts also plays a role at that level).

3 Towards a Knowledge Management System of Breast Cancer Treatment Protocol

In order to model the protocol evolution by analysis of the protocol adaptations that have been performed for specific cases, it is necessary to model (1) the protocol and (2) the knowledge on which adaptation is based.

The modelling of the protocol has led to the system KASIMIR/RBR. The protocol can be considered to be represented in this system by a set of rules R = (Prem −→ Cclo), where Prem and Cclo are respectively the premise and the conclusion of the rule R. Prem is a set of conditions for the selection and the application of the rule R. Cclo is the therapeutic solution. The development of KASIMIR/RBR has been done in a generic perspective.

The representation of the adaptations performed during the BTDC sessions is meant to lead to the system KASIMIR/CBR. The general organisation of KASIMIR/CBR has been planned, as described in [8]. This system will perform a CBR task. The cases from the case base are the protocol rules R = (Prem −→ Cclo) (for a discussion about this unusual application of CBR, in which rules are considered as cases, see [8]). Prem represents a generic problem and corresponds to a generic patient. Cclo is a generic therapeutic solution of the problem Prem. KASIMIR/CBR has to suggest a set of possible adaptations of the protocol for a specific target problem.
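As a minimal sketch of this representation (not the KASIMIR implementation), a protocol rule can be modelled as a pair made of a premise and a conclusion, and the case base as a list of such rules. The `Rule` class, the `applicable_rules` helper and the attribute names (`sex`, `tumour_localisation`, `age`) are hypothetical, and premises are simplified here to attribute-value equality tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A protocol rule R = (Prem -> Cclo), also usable as a CBR case."""
    prem: dict   # conditions on the patient (attribute -> required value)
    cclo: tuple  # generic therapeutic solution (treatment components)

    def holds_for(self, patient: dict) -> bool:
        # Prem holds when every condition is met by the patient description.
        return all(patient.get(k) == v for k, v in self.prem.items())


def applicable_rules(rules, patient):
    """Straightforward (rule-based) use of the protocol."""
    return [r for r in rules if r.holds_for(patient)]


# Hypothetical rule and patients, for illustration only.
rule = Rule(
    prem={"sex": "female", "tumour_localisation": "internal or central"},
    cclo=("radiotherapy", "hormonotherapy"),
)
in_protocol = {"sex": "female",
               "tumour_localisation": "internal or central",
               "age": 60}
out_of_protocol = {"sex": "male", "tumour_localisation": "unknown"}
```

A patient for whom `applicable_rules` returns an empty list corresponds, in this simplified view, to an out of protocol case that calls for adaptation.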

In a more distant future, a third system should take into account the adaptation knowledge in order to propose evolutions of the protocol. Since this knowledge changes with time, an adaptation knowledge acquisition methodology should be useful. One of the objectives of this paper is to propose some elements of such a methodology.

The Kasimir project is presented in more detail in [9].

4 Adaptation Principle

Before the description of adaptation knowledge acquisition, the principle of the implementation of adaptation, as it is planned, has to be described. This principle was developed during the design and implementation of the RESYN/CBR system for synthesis planning in organic chemistry [11].

CBR aims at solving problems in an application domain. Let tgt be a problem to be solved (a target problem). Let (srce, Sol(srce)) be a case retrieved from the case base that must be adapted to solve tgt: srce is a problem and Sol(srce) is a solution of srce. Adapting Sol(srce) in order to solve tgt consists in building a solution Sol(tgt) of tgt derived from Sol(srce).

The first adaptation step usually consists in matching srce and tgt, i.e., in pointing out how these problems are similar and how they are dissimilar. In our approach, the matching result is a similarity path, i.e., a sequence of relations

\[ \mathtt{pb}_0 \; r_1 \; \mathtt{pb}_1 \; r_2 \; \mathtt{pb}_2 \; \ldots \; \mathtt{pb}_{q-1} \; r_q \; \mathtt{pb}_q \]

such that:

– The $\mathtt{pb}_i$'s are problems and the $r_i$'s are binary relations between problems;
– $\mathtt{pb}_0 = \mathtt{srce}$ and $\mathtt{pb}_q = \mathtt{tgt}$;
– For each $i \in \{1, 2, \ldots, q\}$, a piece of adaptation knowledge is available for adapting the solution $\mathtt{Sol}(\mathtt{pb}_{i-1})$ of $\mathtt{pb}_{i-1}$ into a solution $\mathtt{Sol}(\mathtt{pb}_i)$ of $\mathtt{pb}_i$.

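The second adaptation step simply consists in “following” the similarity path in the solution space, which involves the adaptation chain: (1) $\mathtt{Sol}(\mathtt{srce}) = \mathtt{Sol}(\mathtt{pb}_0)$ into $\mathtt{Sol}(\mathtt{pb}_1)$, (2) $\mathtt{Sol}(\mathtt{pb}_1)$ into $\mathtt{Sol}(\mathtt{pb}_2)$, ..., (q) $\mathtt{Sol}(\mathtt{pb}_{q-1})$ into $\mathtt{Sol}(\mathtt{pb}_q) = \mathtt{Sol}(\mathtt{tgt})$.

Implementing the adaptation function requires (a) the implementation of a matching procedure that points out a similarity path, and (b) the acquisition and the modelling of adaptation knowledge. This knowledge, as seen above, aims at the design of $\mathtt{Sol}(\mathtt{pb}_i)$ from $\mathtt{Sol}(\mathtt{pb}_{i-1})$, knowing on the one hand $\mathtt{pb}_{i-1}$ and $\mathtt{pb}_i$, and on the other hand the relation $r_i$ relating the two problems. This relation determines the adaptation function $\mathcal{A}_{r_i}$ to be used:

\[ \mathcal{A}_{r_i} : (\mathtt{pb}_{i-1}, \mathtt{Sol}(\mathtt{pb}_{i-1}), \mathtt{pb}_i) \mapsto \mathtt{Sol}(\mathtt{pb}_i) \]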

Thus the adaptation knowledge is composed of ordered pairs $(r_i, \mathcal{A}_{r_i})$ called reformulations [12]. A reformulation $(r, \mathcal{A}_r)$ can be seen as an “adaptation rule”:

    if   pb r pb′                          // pb is related to pb′ by r
    then A_r(pb, Sol(pb), pb′) = Sol(pb′)  // Sol(pb) is adapted into Sol(pb′) by A_r
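The second adaptation step can be sketched as a simple fold over the similarity path. The code below is an illustration under stated assumptions, not the KASIMIR/CBR implementation: problems are plain dictionaries, solutions are tuples, relations are identified by hypothetical string names, and the two toy reformulations (`copy`, `extend`) only show how the adaptation function selected by each relation is applied in turn.

```python
from typing import Callable, Dict, List

Problem = dict
Solution = tuple
# An adaptation function A_r maps (pb, Sol(pb), pb') to Sol(pb').
Adaptation = Callable[[Problem, Solution, Problem], Solution]


def follow_path(problems: List[Problem],
                relations: List[str],
                sol_srce: Solution,
                reformulations: Dict[str, Adaptation]) -> Solution:
    """Adapt Sol(pb_0) into Sol(pb_q) along pb_0 r_1 pb_1 ... r_q pb_q."""
    if len(problems) != len(relations) + 1:
        raise ValueError("a path with q relations needs q + 1 problems")
    sol = sol_srce
    for i, r in enumerate(relations, start=1):
        adapt = reformulations[r]  # A_{r_i}, determined by the relation r_i
        sol = adapt(problems[i - 1], sol, problems[i])
    return sol


# Toy reformulations (hypothetical names and behaviour).
reformulations = {
    "copy": lambda pb, sol, pb2: sol,
    "extend": lambda pb, sol, pb2: sol + (pb2["extra"],),
}
path = [{"id": 0}, {"id": 1}, {"id": 2, "extra": "x"}]
result = follow_path(path, ["copy", "extend"], ("a",), reformulations)
```

Keeping the reformulations in a dictionary keyed by relation mirrors the idea that the relation alone determines which adaptation function is used.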

The problems $\mathtt{pb}_1, \mathtt{pb}_2, \ldots, \mathtt{pb}_{q-1}$ are reified during the matching process. For KASIMIR/CBR, these intermediate problems correspond to virtual patients: they are introduced during the reasoning.

Finally, it must be noticed that an adaptation has a cost indicating that the solution Sol(tgt) of tgt may be worse than the solution Sol(srce) of srce. The precise meaning of this cost depends on the application domain. For KASIMIR/RBR, this cost is characteristic of the risk, taken during adaptation, of a bad treatment choice. A reformulation can be accompanied by information on its cost. In particular, a method for computing a numerical cost evaluating the adaptation is needed; it is used to select, during the retrieval phase, the case that is the least costly to adapt. Furthermore, some qualitative information about this cost may be useful for explaining the reasoning to the user: it makes it possible to highlight the advantages and disadvantages of applying a reformulation. For KASIMIR/CBR, these arguments concern in particular the therapeutic risk associated with a treatment.
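The use of a numerical cost during retrieval can be illustrated as follows. This is only a sketch: the text does not say how the costs of the successive reformulations are combined, so summing them is an assumption, and the relation names, cost values and candidate cases are all hypothetical.

```python
def path_cost(relations, cost_of):
    """Cost of a similarity path; summing step costs is an assumption."""
    return sum(cost_of[r] for r in relations)


def retrieve(candidates, cost_of):
    """Return the (case_name, relations) pair that is the cheapest to adapt."""
    return min(candidates, key=lambda c: path_cost(c[1], cost_of))


# Hypothetical costs attached to reformulations.
cost_of = {"generalisation": 0.0, "ps": 1.0, "cs": 2.5}

# Hypothetical source cases with the relations of their similarity paths.
candidates = [
    ("rule_A", ["generalisation", "ps", "cs"]),
    ("rule_B", ["cs", "cs"]),
]
best = retrieve(candidates, cost_of)
```

With these illustrative values, `rule_A` is retrieved because its path is cheaper to follow than that of `rule_B`.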

The acquisition of reformulations is described in the next section, with an example illustrating the different issues presented above.

5 Study of BTDC Adaptations

This section aims at describing the activity of adaptation knowledge acquisition. The main steps of the adaptation knowledge acquisition are presented in section 5.1. Section 5.2 presents a detailed example. Section 5.3 briefly presents some pieces of adaptation knowledge that have been acquired.


5.1 Adaptation Knowledge Acquisition Sessions

The adaptations of the protocol are performed during the meetings of the BTDC (cf. section 2, ➂). Summaries of these meetings have been written and analysed by a psycho-ergonomist (see [16]). The adaptation knowledge acquisition sessions consisted in the study of these summaries in the presence of experts in cancerology, of a psycho-ergonomist and of computer science specialists. Schematically, such a session can be decomposed into four phases:

phase 1: Presentation of the summary by the psycho-ergonomist, with corrections and clarifications from the experts.

phase 2: Discussion and explanation of the reasoning leading to an adaptation.

phase 3: Re-description of this reasoning by the computer specialists and discussions on the variations of this reasoning.

phase 4: Analysis of the reasoning from the perspective of general adaptation knowledge propositions (this last phase usually takes place after the session).

It must be noticed that the specialist in psycho-ergonomics is also a physician, a fact that facilitates her interactions with the experts and the communication between experts and computer specialists, giving her the status of an interpreter. Previous work on a knowledge-based system for organic synthesis in chemistry has shown the usefulness of such an interpreter [13]. In such work, it is important that the experts have some idea about the modelling. Indeed, in contrast to the “cognitician-expert” approach, where the former monopolises the power related to the computer, it is essential that the expert has some knowledge and awareness of the tools used, of their advantages and limits, especially regarding the knowledge representation formalisms and reasoning types. Thus, during the transfer of expertise, the traditional problems of misunderstanding between computer specialists and experts are attenuated, if not completely suppressed: the former cannot promise to the latter all the things that the latter would expect to have. This knowledge acquisition approach is distributed and honest, in the sense that it considers that the expert is a real associate, who has a role to play in the modelling of the knowledge.

5.2 A Detailed Example

The example presented in this section is an actual example with two modifications. First, the name of the patient has been changed: he has been called Jules. Second, the case has been slightly modified to simplify the description of the corresponding adaptation. (In this context, the term “case” is taken in a medical sense and corresponds to the notion of target problem in CBR.) In fact, this case has been treated in its whole complexity. Furthermore, some pieces of information were omitted because they did not play any direct role in the reasoning.

Jules is a man with a cancer of the left breast. The first characteristic making him an out of protocol case is his sex. Indeed, the vast majority of persons suffering from breast cancer are women, so the protocol, which comes for the main part from statistical studies, has been elaborated for them. The idea is then to do as if Jules were a woman and to reason with this working hypothesis (which may be temporary). Note that the use of expressions like “We do as if...” by the experts points out the possible presence of adaptation knowledge.

Another characteristic of Jules is that the localisation of the tumour in his left breast is unknown. This raises a difficulty since it is important, from the radiotherapist's viewpoint, to know whether the tumour is external, central or internal. More precisely, the most pessimistic assumption (the one that makes the radiotherapy require more precautions) is that the tumour is internal or central. The experts make this assumption. Thus, if they are wrong, the only consequence is that unnecessary precautions will have been taken.

To summarise, two characteristics making Jules an out of protocol case have been successively (and temporarily) suppressed. This can be reformulated by introducing two virtual patients: (1) a virtual patient Julie who is just like Jules but is a woman, (2) a virtual patient Juliette who is just like Julie except for the tumour localisation (the localisation of Julie's tumour is unknown whereas the localisation of Juliette's tumour is internal or central). Juliette corresponds to the protocol, meaning that there is a rule of the protocol R = (Prem −→ Cclo) such that Prem holds for Juliette, denoted by Prem ⇐ Juliette (the conditions Prem are entailed by the description Juliette). Thus the following similarity path relates the protocol to Jules:

Prem ⇐ Juliette ps Julie cs Jules

where ps and cs are relations between problems and where

Jules = [sex = male, tumour localisation = unknown, · · · ]
Julie = [sex = female, tumour localisation = unknown, · · · ]
Juliette = [sex = female, tumour localisation = internal or central, · · · ]

Prem is a generic patient (or a class of patients) for which the treatment Sol(Prem) = Cclo is a radiotherapy taking into account the internal or central position of the tumour and a hormonotherapy using tamoxifen.

When the similarity path is built (from Jules to Prem, reading from right to left), the reverse path in the solution space must be followed, i.e., from the treatment Sol(Prem) of Prem to a treatment Sol(Jules) of Jules, reading from left to right:

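[Diagram: the similarity path in the problem space (Prem, Juliette, Julie, Jules), read from right to left, and below it the corresponding modification path in the solution space (Sol(Prem), Sol(Juliette), Sol(Julie), Sol(Jules)), read from left to right.]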

Reformulation (⇐, A⇐). The treatment Sol(Prem) can be applied to Juliette since Prem ⇐ Juliette. The piece of knowledge reified by the reformulation (⇐, A⇐) can be written: “A treatment designed for a general case can be applied to a specific case of this general case.” (This reformulation is not a new piece of knowledge: it is the basis of the deductive reasoning of KASIMIR/RBR.)


Reformulation (ps, Aps). Juliette is a “pessimistic specialisation” of Julie: she is characterised by the fact that the tumour position, unknown for Julie, has been specified for Juliette and that this position is the one that makes the radiotherapy the most complex (without modifying the other treatments). Therefore, the treatment Sol(Juliette) is transferred without modification to Julie. This reformulation (ps, Aps) models the “Wald pessimistic criterion” [1], which states that decisions must be evaluated on the basis of their worst consequences. The relation ps can be read as “is a pessimistic specialisation of” and Aps is a straightforward copy of the treatment.

Reformulation (cs, Acs). Finally, some questions are raised about the applicability of the treatment Sol(Julie) of Julie to Jules, her male equivalent. These questions deal with the consequences of the change of sex on the applicability of the treatment components. Following the principles developed in [3], we are interested in the dependencies between the descriptor “sex” of the problems and the descriptors “radiotherapy”, “hormonotherapy”, etc., of the solutions. In [3], the dependencies are defined by $\frac{\Delta y}{\Delta x}$, where $\Delta x$ is the variation of a problem descriptor $x$ and $\Delta y$ is the variation of a solution descriptor $y$. For Julie and Jules, we are interested in $\frac{\Delta \mathrm{radiotherapy}}{\Delta \mathrm{sex}}$ and in $\frac{\Delta \mathrm{hormonotherapy}}{\Delta \mathrm{sex}}$. The knowledge given by the experts indicates that these dependencies are null: the radiotherapy and the hormonotherapy recommended for Julie remain recommended for Jules.

The reformulation (cs, Acs) is based on the dependencies $\frac{\Delta \theta}{\Delta \mathrm{sex}}$, where $\theta$ is a particular treatment. The discussion on the variations (cf. phase 3 of section 5.1) makes it possible to specify these dependencies. In this example, we try to establish which treatments are “invariant under the change of sex” and, for the other ones, how they can be adapted. For instance, the hormonotherapy consisting in an ablation of the ovaries is not invariant under the change of sex. This treatment is replaced by a treatment that, for a man, brings similar expected benefits.
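The three reformulations of this example can be chained as in the following sketch. It is an illustration only: the dictionary encoding of patients and treatments, the function names `a_gen`, `a_ps` and `a_cs`, and the table `NOT_SEX_INVARIANT` (with its placeholder substitute) are assumptions introduced here, not the representation used in KASIMIR/CBR.

```python
# Problem descriptions from the example (only the two descriptors discussed).
prem = {"sex": "female", "tumour_localisation": "internal or central"}
juliette = {"sex": "female", "tumour_localisation": "internal or central"}
julie = {"sex": "female", "tumour_localisation": "unknown"}
jules = {"sex": "male", "tumour_localisation": "unknown"}

# Sol(Prem): the treatment recommended by the protocol rule.
sol_prem = {
    "radiotherapy": "adapted to an internal or central tumour",
    "hormonotherapy": "tamoxifen",
}

# Hypothetical table of treatments that are NOT invariant under the change
# of sex, mapped to a placeholder for the substitute chosen by the experts.
NOT_SEX_INVARIANT = {
    "ovary ablation": "substitute with similar expected benefits",
}


def a_gen(pb, sol, pb2):
    """A treatment for a general case applies to its specific cases."""
    return dict(sol)


def a_ps(pb, sol, pb2):
    """Pessimistic specialisation: straightforward copy of the treatment."""
    return dict(sol)


def a_cs(pb, sol, pb2):
    """Change of sex: keep invariant components, substitute the others."""
    return {k: NOT_SEX_INVARIANT.get(v, v) for k, v in sol.items()}


# Modification path in the solution space, read from left to right.
sol_juliette = a_gen(prem, sol_prem, juliette)
sol_julie = a_ps(juliette, sol_juliette, julie)
sol_jules = a_cs(julie, sol_julie, jules)
```

Since both dependencies on the sex are null in this example, `sol_jules` ends up identical to `sol_prem`; a treatment listed in `NOT_SEX_INVARIANT` would instead be replaced by its substitute.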

5.3 Some Other Pieces of Adaptation Knowledge that Have Been Acquired

Studies of adaptations performed during the BTDC sessions, like the one described in the previous section, have led to several reformulations. From one study to another, some reformulations have reappeared, which makes it possible to refine them. Below, two acquired pieces of adaptation knowledge are briefly presented. More details about them, together with the representation needs they involve, can be found in [10], which is the long version of this paper.

Some adaptations are based on the knowledge about the expected benefits and the undesirable effects of a treatment on a patient. Usually, the protocol gives an optimal compromise between these positive and negative effects of a treatment (given the current state of the art in medicine), but this is not always true, e.g., in case of contraindications. For instance, if the patient has blood coagulation troubles, the haemorrhagic risk taken during surgery is an undesirable effect of great importance. In such circumstances, the surgery may be changed in order to lower this risk.


Another adaptation type is linked with the threshold effect. Indeed, when a numerical patient characteristic (e.g., the age) is close to a decision threshold of the protocol, the decision is doubtful (in particular, because of the uncertainty on this threshold): both decisions should be proposed to the user.
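The threshold effect can be sketched with a crisp tolerance zone around the threshold. This is a deliberately simplified stand-in, not the fuzzy hierarchical classification that the paper reports for the first version of KASIMIR/CBR; the function name, the `margin` parameter and the numbers used below are hypothetical.

```python
def decisions_near_threshold(value, threshold, below, above, margin):
    """Return the candidate decisions for a numerical characteristic.

    `margin` (hypothetical) is the half-width of the doubtful zone
    around the decision threshold.
    """
    if abs(value - threshold) <= margin:
        return [below, above]  # doubtful: propose both decisions to the user
    return [below] if value < threshold else [above]


# Illustrative values only (not taken from the protocol).
doubtful = decisions_near_threshold(69, 70, "decision A", "decision B", 2)
clear_low = decisions_near_threshold(50, 70, "decision A", "decision B", 2)
clear_high = decisions_near_threshold(80, 70, "decision A", "decision B", 2)
```

Returning a list rather than a single decision keeps the final choice with the user, which matches the decision-support role described in the text.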

6 Discussion

This section discusses two issues related to this work. First, some elements of an adaptation knowledge acquisition methodology generalised from this study are proposed. Then, some related work is presented. A more detailed discussion is given in [10].

6.1 Elements of an Adaptation Knowledge Acquisition Methodology

Some elements of a methodology for an acquisition process of adaptation knowledge involving experts (and, if possible, an “interpreter”) and the study of specific adaptations are summarised. It must be noticed that these elements of methodology must be evaluated on a larger scale and in other application domains.

The first issue, maybe the most important, is the decomposition of adaptation based on the notions of similarity path and of intermediate problems between the source and target problems, which involves adaptation knowledge expressed by reformulations. The adaptation knowledge acquisition that we describe is based on informal descriptions of adaptation processes performed by experts. For each of these adaptation processes, the steps of knowledge acquisition are as follows:

– Re-description of the adaptation process in several steps by introducing intermediate problems $\mathtt{pb}_1, \mathtt{pb}_2, \ldots, \mathtt{pb}_{q-1}$ and their respective solutions $\mathtt{Sol}(\mathtt{pb}_1), \mathtt{Sol}(\mathtt{pb}_2), \ldots, \mathtt{Sol}(\mathtt{pb}_{q-1})$. Recall that $\mathtt{pb}_0 = \mathtt{srce}$ is the source problem and that $\mathtt{pb}_q = \mathtt{tgt}$ is the target problem. The elicitation of the intermediate problems is often made from right to left, i.e., from $\mathtt{pb}_i$ to $\mathtt{pb}_{i-1}$. For example, when the expert makes a working hypothesis on $\mathtt{pb}_{i-1}$ (“We do as if some conditions of $\mathtt{pb}_{i-1}$ were changed”), it can be expressed by introducing $\mathtt{pb}_i$.

– For each $i \in \{1, 2, \ldots, q\}$, analysis of the adaptation step

\[ (\mathtt{pb}_{i-1}, \mathtt{Sol}(\mathtt{pb}_{i-1}), \mathtt{pb}_i) \mapsto \mathtt{Sol}(\mathtt{pb}_i) \]

This analysis aims at giving a reformulation $(r_i, \mathcal{A}_{r_i})$ which is either a reformulation already belonging to the adaptation knowledge base, or a new one.

The second issue is linked with the problem and solution representations. Indeed, it is useful to represent not only what a solution is but also in what respect it answers well (or not) the problem it is supposed to solve. For KASIMIR/CBR, this is for example the knowledge linked with the expected benefits and the undesirable effects of a treatment.

The third issue concerns the dependencies between problem descriptors $x$ and solution descriptors $y$, as seen above in section 5.2 about the reformulation (cs, Acs). These dependencies can be symbolised by the rates $\frac{\Delta y}{\Delta x}$ and involve questions such as “How does y vary when x varies?” that are useful for questioning the expert.
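The question “How does y vary when x varies?” can be made concrete with a small qualitative sketch. In the paper these dependencies are elicited from experts rather than computed; the code below only illustrates the question being asked, using a purely illustrative `toy_protocol` function and a hypothetical helper `varying_descriptors` that lists the solution descriptors whose value changes when one problem descriptor is modified.

```python
def varying_descriptors(solve, problem, x, new_value):
    """Solution descriptors y whose value changes when descriptor x changes."""
    before = solve(problem)
    after = solve({**problem, x: new_value})
    keys = before.keys() | after.keys()
    return {y for y in keys if before.get(y) != after.get(y)}


def toy_protocol(pb):
    """Purely illustrative stand-in for a problem-to-solution mapping."""
    sol = {"radiotherapy": "standard"}
    if pb["sex"] == "female":
        sol["hormonotherapy"] = "ovary ablation"
    else:
        sol["hormonotherapy"] = "other"
    return sol


changed = varying_descriptors(toy_protocol, {"sex": "female"}, "sex", "male")
unchanged = varying_descriptors(toy_protocol, {"sex": "female"}, "sex", "female")
```

An empty result corresponds to a null dependency, i.e., a solution that is invariant under the considered variation of the problem descriptor.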


6.2 Related Work on Adaptation Knowledge Acquisition and Modelling in CBR

Studies on adaptation knowledge acquisition and modelling seem to be rather rare. In [4], the different knowledge types useful for CBR and, in particular, for the adaptation phase, are described. The different adaptation tasks (add, suppress, substitute, reorganise, etc.) are presented there and discussed at a general level. They are useful as a guide but must be made precise in a given application framework.

In [14], the knowledge about changes in a medical context is represented. This work is very different from ours since the changes of knowledge are at the level of the domain terminology (addition, replacement and suppression of terms, changes in the hierarchy, etc.), whereas our approach concerns therapeutic adaptations and, therefore, changes in the treatment rules.

The papers [5] and [6] describe two approaches to adaptation knowledge acquisition by learning from the case base of a CBR system. These approaches differ from ours in their knowledge acquisition source: a case base for the former and experts for the latter. Nevertheless, the idea of examining the protocol from this point of view, with automatic or interactive tools, seems interesting and thus constitutes possible future work.

7 Conclusion


This paper presents the adaptation knowledge acquisition and modelling for the system KASIMIR/CBR. This system will have to adapt a breast cancer treatment protocol for specific cases not covered by a straightforward use of the protocol. The notions of similarity path, of intermediate problem and of reformulation play an important role in this acquisition and modelling. The similarity paths and the intermediate problems (corresponding to virtual patients) make it possible to decompose the adaptations performed into simpler steps that can be modelled by reformulations involving general adaptation knowledge.

A first piece of future work is to meet the knowledge representation needs raised by the acquired adaptation knowledge. It will also be necessary to instantiate the conceptual model schema, i.e., to establish the knowledge on which the reformulations rely (representation of pessimistic specialisations, equivalence between expected benefits of treatments, treatment variations as a function of sex, etc.). This instantiation work is currently under development and is associated with the implementation of KASIMIR/CBR.

The use of this knowledge in order to perform these adaptations automatically is another piece of future work. The central problem is the elaboration of the similarity path. For the system RESYN/CBR [11], a technique combining hierarchical classification and search in a state space, the so-called smooth classification, has been used. This technique should be reusable for KASIMIR/CBR, but this still requires a precise study. A first version of KASIMIR/CBR, taking into account only the threshold effect thanks to fuzzy hierarchical classification, has already been successfully developed [7].

A last piece of future work consists in studying how the examination of the protocol can help to suggest adaptation knowledge, following the learning approaches described in [6] and [5] and discussed in section 6.2.


References

1. D. Dubois, H. Prade, and R. Sabbadin. Decision-theoretic foundations of qualitative possibility theory. European Journal of Operational Research, 128:459–478, 2001.

2. Evidence-based medicine working-group. Evidence-based medicine. A new approach to teaching the practice of medicine. JAMA, 17:268, 1992.

3. B. Fuchs, J. Lieber, A. Mille, and A. Napoli. An Algorithm for Adaptation in Case-Based Reasoning. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI-2000), Berlin, Germany, pages 45–49, 2000.

4. B. Fuchs and A. Mille. A Knowledge-Level Task Model of Adaptation in Case-Based Reasoning. In K.-D. Althoff, R. Bergmann, and L. K. Branting, editors, Case-Based Reasoning Research and Development – Third International Conference on Case-Based Reasoning (ICCBR-99), LNAI 1650, pages 118–131. Springer, Berlin, 1999.

5. K. Hanney and M. T. Keane. Learning Adaptation Rules From a Case-Base. In I. Smith and B. Faltings, editors, Advances in Case-Based Reasoning – Third European Workshop, EWCBR’96, LNAI 1168, pages 179–192. Springer Verlag, Berlin, 1996.

6. J. Jarmulak, S. Craw, and R. Rowe. Using Case-Base Data to Learn Adaptation Knowledge for Design. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI’01), pages 1011–1016. Morgan Kaufmann, Inc., 2001.

7. J. Lieber. Strong, Fuzzy and Smooth Hierarchical Classification for Case-Based Problem Solving. In F. van Harmelen, editor, Proceedings of the 15th European Conference on Artificial Intelligence (ECAI-02), Lyon, France, pages 81–85. IOS Press, Amsterdam, 2002.

8. J. Lieber and B. Bresson. Case-Based Reasoning for Breast Cancer Treatment Decision Helping. In E. Blanzieri and L. Portinale, editors, Advances in Case-Based Reasoning – Proceedings of the fifth European Workshop on Case-Based Reasoning (EWCBR-2k), LNAI 1898, pages 173–185. Springer, 2000.

9. J. Lieber, M. d’Aquin, P. Bey, B. Bresson, O. Croissant, P. Falzon, A. Lesur, J. Leveque, V. Mollo, A. Napoli, M. Rios, and C. Sauvagnac. The Kasimir Project: Knowledge Management in Cancerology. In Proc. of the 4th International Workshop on Enterprise Networking and Computing in Health Care Industry (HealthComm 2002), pages 125–127, 2002.

10. J. Lieber, M. d’Aquin, P. Bey, A. Napoli, M. Rios, and C. Sauvagnac. Adaptation Knowledge Acquisition, a Study for Breast Cancer Treatment. Research report available on http://www.loria.fr/equipes/orpailleur/, LORIA, January 2003.

11. J. Lieber and A. Napoli. Using Classification in Case-Based Planning. In W. Wahlster, editor, Proceedings of the 12th European Conference on Artificial Intelligence (ECAI’96), Budapest, Hungary, pages 132–136. John Wiley & Sons, Ltd., 1996.

12. E. Melis, J. Lieber, and A. Napoli. Reformulation in Case-Based Reasoning. In B. Smyth and P. Cunningham, editors, Fourth European Workshop on Case-Based Reasoning, EWCBR-98, Lecture Notes in Artificial Intelligence 1488, pages 172–183. Springer, 1998.

13. A. Napoli, C. Laurenco, and R. Ducournau. An object-based representation system for organic synthesis planning. Int. Journal of Human-Computer Studies, 41(1/2):5–32, 1994.

14. D. E. Oliver, Y. Shahar, M. A. Musen, and E. H. Shortliffe. Representation of Change in Controlled Medical Terminologies. Artificial Intelligence in Medicine, 15(1):53–76, 1999.

15. C. K. Riesbeck and R. C. Schank. Inside Case-Based Reasoning. Lawrence Erlbaum Associates, Inc., Hillsdale, New Jersey, 1989.

16. C. Sauvagnac. La construction de connaissances par l’utilisation et la conception de procedures. Contribution au cadre theorique des activites metafonctionnelles. These d’Universite, Conservatoire National des Arts et Metiers, 2000.


Case Based Reasoning for Medical Decision-Support in a Safety Critical Environment

Isabelle Bichindaritz1, Carol Moinpour2, Emin Kansu2, Gary Donaldson2, Nigel Bush2, and Keith M. Sullivan3

1 University of Washington, Tacoma, Washington
2 Fred Hutchinson Cancer Research Center, Seattle, Washington
3 Duke University Medical Center, Durham, North Carolina

Abstract. Case-based reasoning systems applied to safety-critical environments justify specific measures to ensure that the assistance provided is not dangerous to human life. This article presents a case-based reasoning system developed for medical decision-support in a safety-critical environment, the CARE-PARTNER system. Based on the evaluation of the reliability of the system, it stresses the importance of differentiating between reliability and safety, and shows how case-based reasoning can ensure safety. A knowledge-based approach has made it possible to model what types of information are safety-critical, and to adapt the decision-support system results, by introspective reasoning, based on the knowledge type, the user, and the task.

1 Introduction

CARE-PARTNER is a computerized decision-support system [2] on the World-Wide Web (WWW). It is applied to the long-term follow-up (LTFU) of patients who have undergone a stem-cell transplant (SCT) at the Fred Hutchinson Cancer Research Center (FHCRC) in Seattle, after their return to their home community [10]. Home care providers use CARE-PARTNER to place contacts with LTFU on the Internet, and receive decision-support advice from the system in a timely manner for the follow-up of transplant patients. An essential characteristic of CARE-PARTNER is that it proposes to implement evidence-based medical practice [6] by applying clinical guidelines developed by FHCRC for the care of their patients. CARE-PARTNER resorts to a multimodal reasoning framework for the cooperation of case-based reasoning (CBR) and rule-based reasoning [7].

Providing decision-support advice in a medical domain is not without risk. Based upon the study of similar systems, a safety insurance plan has been put in place for the system. Important components of this plan are a procedural level plan, a knowledge level plan, and a software engineering level plan. The increasing reliability of the system, shown in evaluation results, does not prevent the risk that the system may be unsafe because it is not possible to ascertain that it will not make a mistake [8, 9].

The second section summarizes the CARE-PARTNER system. The third section presents an evaluation. The fourth section reviews the set of problems encountered in safety critical systems in medicine. The fifth section explains the process of planning and implementing a safety critical system, taking CARE-PARTNER as an example. The sixth section concludes this article.


2 The CARE-PARTNER System

CARE-PARTNER assists its users in the performance of their clinical tasks by providing decision-support advice based upon proven and validated practice, and thus helps implement evidence-based medicine [6]. CARE-PARTNER proposes decision-support advice for diagnosis, treatment, and follow-up, using a general framework for reasoning from knowledge sources of varied quality, which means that their knowledge is based upon varied evidence, but also that the evidence associated with each piece of knowledge can vary through experience [5] (see Fig. 1).

[Figure 1 diagram. Labels recoverable from the figure: PHYSICIAN INTERFACE, PATIENT RECORD, KNOWLEDGE BASE (cases, pathways, rules), working memory, PROBLEM, target case, SOLUTION; reasoning steps: Interpretation, Search, Conflict set, Reuse, Update, Memorization.]

Fig. 1. Multimodal reasoning framework

It answers online the questions about the care of transplant patients that home care providers used to submit by phone to LTFU nurses, who relayed them to LTFU clinicians before getting back to the home care providers with clinical answers.

Another aspect of the system is that it provides an electronic contact management system to replace the previous phone and paper-based system, with obvious advantages for research and documentation purposes.

CARE-PARTNER decision-support provides the following medical recommendations:

• Interpretation for each laboratory test and procedure result.
• List of steps of laboratory tests and/or procedures for diagnosis assessment.
• List of steps of planning actions for treatment.
• List of differential diagnoses, ranked by likelihood; these diagnoses are often not incompatible, since several diagnoses co-occur to cover all the signs and symptoms exhibited by the patient. These diagnoses are listed as pathways in the system.
• List of pertinent documents hyperlinked to the previous elements.

316 Isabelle Bichindaritz et al.

Fig. 2. Example of a LiverChronicGVHD clinical pathway

The CARE-PARTNER reasoning cycle (see Fig. 1) is a multimodal reasoning cycle for the cooperation of case-based reasoning, rule-based reasoning, and information retrieval. Its reasoning steps are generalizations of the steps defined in these respective methodologies. As in most medical domains, knowledge takes several forms: practice guidelines, practice pathways, practice cases, and medical textbooks. A practice pathway (see Fig. 2) covers the same type of knowledge as a guideline, but is specialized in the management of diagnosis and treatment related to LTFU. The pathways have been created by a group of LTFU experts exclusively for the system. Pathways correspond to prototypical cases, and are represented as cases in the system.

3 CARE-PARTNER Evaluation

A sample evaluation of CARE-PARTNER decision-support performance has been performed by team statisticians, and is provided in Table 1. Following a rigorous statistical evaluation plan, the details of which are beyond the scope of this paper, it shows the rating of the system by two independent expert clinicians according to the criteria Fails to meet standards / Adequate / Meets all standards. On 163 different clinical situations or cases, corresponding to contacts between the system and a clinician about three patients, the system was rated as Meets all standards in 82.2% of cases and as Adequate in 12.3%, for a total of 94.5% of results judged clinically acceptable by the medical experts. Table 1 also shows that the advice provided by the system covers most of the clinicians’ tasks: interpretation of labs and procedure results, diagnosis assessment plan, treatment plan, and pathways information retrieval. Pathways represent prototypical cases retrieved by the system, and correspond to diagnostic categories (see Fig. 2 for an example).

Table 1. CARE-PARTNER CDSS Evaluation Form Inter-Rater Agreement and Summary Ratings for Two Raters over Three Patients

Another part of the evaluation dealt with measuring the progress of the system when solving new contact cases. As noted earlier, case-based reasoning gives the system the ability to learn. This important characteristic of the system has been evaluated on the complete charts of three patients. The performance of the system has significantly improved between patients 1 and 3:

1. First patient (49 contacts, 93.7% Meets all standards/Adequate),
2. Second patient (75 contacts, 88.9% Meets all standards/Adequate), where the proportion of Meets all standards has significantly increased over the Adequate results,
3. Third patient (54 contacts, 98.6% Meets all standards/Adequate).

This evaluation of the system learning capability shows a definite improvement of its clinical problem solving results, stemming from the case-based reasoning main system component. Since the system learns from its failures, cases added to its memory have a validated solution. The ongoing use of the system as a contact management medium permits the system to track the patients' follow-up constantly, and provides the case-based reasoning system with the information required to perform introspective reasoning, improve its reasoning capabilities, and enrich its memory of cases.

                     | Applicable | Percent   | Kappa coeff. | Concordant | Rating of concordant cases
                     | cases (N)  | agreement | of agreement | cases (N)  | Fails to meet standards | Adequate | Meets all standards
Labs                 | 57         | 94.7      | .71          | 54         | 3.7%                    | 3.7%     | 92.6%
Procedures           | 70         | 95.7      | .83          | 67         | 8.9%                    | 3.0%     | 88.1%
Diagnosis            | 79         | 86.1      | .74          | 68         | 16.2%                   | 13.2%    | 70.6%
Treatment            | 77         | 92.2      | .81          | 71         | 9.9%                    | 11.3%    | 78.8%
Pathways             | 53         | 88.6      | .71          | 47         | 8.5%                    | 8.5%     | 83.0%
Overall Appreciation | 178        | 91.6      | .77          | 163        | 5.5%                    | 12.3%    | 82.2%
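The agreement figures of Table 1 can be cross-checked with a few lines of Python. This is only a sketch: it assumes that the percent agreement column is the number of concordant cases divided by the number of applicable cases, which the paper does not state explicitly, and the names `TABLE_1` and `percent_agreement` are illustrative.

```python
# Rows of Table 1: (category, applicable cases, concordant cases,
# percent agreement as reported in the table).
TABLE_1 = [
    ("Labs", 57, 54, 94.7),
    ("Procedures", 70, 67, 95.7),
    ("Diagnosis", 79, 68, 86.1),
    ("Treatment", 77, 71, 92.2),
    ("Pathways", 53, 47, 88.6),
    ("Overall Appreciation", 178, 163, 91.6),
]


def percent_agreement(applicable, concordant):
    """Assumed definition: share of applicable cases rated identically."""
    return 100.0 * concordant / applicable


for name, applicable, concordant, reported in TABLE_1:
    computed = percent_agreement(applicable, concordant)
    print(f"{name:22s} computed {computed:5.1f}   reported {reported:5.1f}")

# Share of concordant cases judged clinically acceptable overall
# (Adequate + Meets all standards), as quoted in the text.
acceptable = round(12.3 + 82.2, 1)
print("Clinically acceptable overall:", acceptable, "%")
```

Under this assumption the recomputed values match the reported ones to within about 0.1 percentage point, and the three patient charts (49 + 75 + 54 contacts) add up to the 178 applicable cases of the overall row.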


The good evaluation results of the system, with 98.6% of results rated at least Adequate for the third patient, indicate that the CARE-PARTNER knowledge-base has reached an excellent state of completeness. Because learning is a constant in a medical environment, where atypical cases continue to occur, and because medical recommendations are frequently updated, the learning ability of the system permits the knowledge-base to evolve and keep up-to-date. Nevertheless, the use of CARE-PARTNER in clinical practice would not be possible without a 100% rate of at least adequate results over a much larger set of cases. Even an extremely low rate of error is not tolerable in a system used in clinical practice. The example and analysis of the Therac-25 system in the next section explain why reliability is not enough to ensure safety, and what steps need to be taken when developing a safety-critical system.

4 Safety-Critical Systems in Medicine

Medical applications of software systems have generated considerable attention in the medical and software development communities, in particular for their potential danger. Even if the number of safety-critical accidents has been low in comparison with the volume of medical software, a few such accidents have raised the level of quality required.

Probably the most famous medical software accident is that of the Therac-25 [9]. The Therac-25 is a computer-controlled radiation therapy machine that massively overdosed six patients, causing several deaths. This has been described as the worst accident in the 35-year history of medical accelerators, and led to the recall of the system.

One of the main lessons learnt from Therac-25 is that safeguards against software errors need to be put in place. Even if, like Therac-25, the medical device or software is used safely in 100% of 100,000 cases, some hazards may happen that need to be prevented.

Extensive analysis has been performed of the causes of the malfunction of this system, and has led to the definition of a set of guidelines for safety-critical systems.

The set of causal factors identified in Therac-25 is the following [9]:

1. Overconfidence in Software from users and software developers.
2. Confusing Reliability with Safety: Therac-25 was extremely reliable, working almost 100,000 times with no fault, but it was still not safe.
3. Lack of Defensive Design, such as error detection and handling features.
4. Failure to Eliminate Root Causes: the system should be built with a control mechanism to safeguard against its own errors.
5. Complacency of people, who have a tendency to trust medical technology because they think that the developers have safety guards in place.
6. Unrealistic Risk Assessment.
7. Inadequate Investigation or Follow-up on Accident Reports.
8. Inadequate Software Engineering Practices.
9. Software Reuse, requiring the same testing and safeguard procedures.
10. Safe versus Friendly User Interfaces, keeping important controls and validation mechanisms.
11. User and Government Oversight and Standards.


Leveson and Turner [9] note that, although this set of factors is long, it shows that a conjunction of software development practices caused different types of failure. Fox [8] emphasizes that decision-support systems in medicine should pay careful attention to safety issues, and not merely to efficacy. Lessons learnt on safety-critical systems in medicine [8, 9] led to the development of a safety insurance plan for CARE-PARTNER.

5 CARE-PARTNER Safety Insurance Plan

CARE-PARTNER’s role is to provide decision-support assistance to home-care providers and to facilitate communication between them and LTFU specialists. Caring for a transplant patient may be cumbersome for the home-care provider. Patients are immuno-compromised, and the fight between graft and host may cause a dangerous immune response. During the first months and years after transplant, patients are very sensitive to infections. Some symptoms specific to post-transplant diseases may be mistaken for those of other classical diseases, and a common disease such as bronchitis may rapidly become fatal to a transplant patient.

Thus this decision-support system is a safety-critical system in medicine. Evaluation results from Table 1 show that the validity of CARE-PARTNER advice is not perfect. From what we have learnt with Therac-25 [8, 9], reliability of the system will not ensure safety. Accordingly, we designed a safety insurance plan for CARE-PARTNER to safe-proof it against accidents. The focus shifted during the testing phase to building safeguards into the system, instead of spending equivalent time reaching a 100% level of reliability in clinical decision-support advice, since this level was expected to be reached anyway through case-based learning.

The CARE-PARTNER safety insurance plan takes into account several types of factors: procedural, software engineering, and knowledge level.

Procedural Level

It is clearly explained to medical users that the decision-support system should not be used in a life-threatening situation. A paging system is available through the LTFU phone answering system. Nevertheless, the system must detect such situations in case the home care provider has misjudged the criticality of a patient's condition, as explained in the knowledge level plan. The system detects such situations very precisely because it uses a controlled vocabulary defined in its ontology.

Another procedure in place is that patients need to agree to be treated in part by the system. They fill in an informed consent form that clearly explains the system results and consequences. Patients may refuse to be followed up by the decision-support system, and prefer to be followed up only by phone and mail contact between LTFU and their home care provider. Patients followed up by the system thus accept the risks implied by its use, and most patients do accept them, but this does not remove the need to ensure safety in the system for ethical reasons. Such a choice is sometimes presented as a trade-off between the benefits and the risks of the system, but this trade-off was not acceptable to the project team.


Fig. 3. System reasoning to filter out safety critical clinical situations at the Knowledge Level

Software Engineering Level

CARE-PARTNER follows the recommendations for safety critical systems listed above. In particular, a comprehensive audit trail is in place, monitoring every action from the user, mostly those generating database access. The case base is part of the contact database, the prototypical cases are part of the knowledge base database, and their access is monitored as well. A fine-grained audit trail logs every access to the system, identifying the type of access, the subject and object of the access, and the role of the user accessing the system. This level of precision in the audit trail is required by current legislation about patient identifiable medical information, and is described in the system security plan.

Software development has been well documented, and followed the Unified Modeling Language (UML) method for object-oriented development. This method provides a more formalized software development methodology than classical methodologies.

Another aspect is testing. Testing has been performed at gradual levels, first on selected patient charts to compare paper-based staff recommendations and system recommendations. A pilot test followed, and then a clinical trial was authorized.

Building safeguards into the system takes place mostly at the knowledge level. After the CBR reuse step (see Fig. 1), which is an adaptation, and before presenting the adapted case to the user, CARE-PARTNER evaluates the case recommendations for safety.

[Fig. 3 flowchart. Boxes recoverable from the figure: Beginning of decision-support advice self-critique, after case solution is found by the system; Safety critical sign or symptom? (Yes / No); Safety critical planning action? (Yes / No); Page LTFU specialist; Provide non safety critical recommendations; End of decision-support advice for that contact.]

The different dimensions of system recommendations present various risk levels:

• Interpretation for each laboratory test or procedure result: these are not monitored for safety, but they are certainly monitored for errors. It is common in medicine for physicians to review labs and procedure results interpretations, given that they depend upon the reference population.

• List of differential diagnoses, ranked by likelihood: this list may be safety critical. For instance, if some of the signs and symptoms that should be covered by a diagnosis are not, this may be dangerous. Thus the system monitors that all critical signs and symptoms have been covered by at least one diagnosis; otherwise an LTFU staff member is paged to handle the case. Moreover some signs and symptoms are tagged as safety critical, such as Temperature level=Very-elevated.

• List of steps of laboratory tests and/or procedures for diagnosis assessment: these are not considered safety critical because, after review by the experts, they were not assessed as life-threatening.

• List of steps of planning actions for treatment: planning actions may be safety critical. Placing a patient on a drug, or removing a drug, may cause a health risk for a patient. So certain planning actions are tagged as safety critical.

• List of pertinent documents hyperlinked to the previous elements, such as guidelines, treatment protocols, or textbook excerpts: linked to previous elements for documentation and explanation purposes, they need not be monitored for safety.

In summary, certain knowledge elements are tagged as critical in the knowledge-base, such as certain signs and symptoms (for example BleedingNOS, or TemperatureElevated when the elevation is high), and certain planning actions (for example StopMedication, or StartMedication for certain types of medication marked as dangerous). When any of this safety critical knowledge shows in an adapted case, an LTFU staff member is paged to review the system advice and take further charge of the case (see Fig. 3).
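The self-critique step summarized above can be sketched in Python. This is an illustration under stated assumptions, not the CARE-PARTNER implementation: the class `AdaptedCase`, the function `self_critique`, the tag sets, and names such as `StartDangerousMedication` and `OrderLiverPanel` are hypothetical, and the coverage check is simplified to require that every reported sign be covered by at least one diagnosis.

```python
from dataclasses import dataclass, field

# Hypothetical tags: inspired by the examples in the text, not taken
# from the actual CARE-PARTNER knowledge base.
CRITICAL_SIGNS = {"BleedingNOS", "TemperatureVeryElevated"}
CRITICAL_ACTIONS = {"StopMedication", "StartDangerousMedication"}


@dataclass
class AdaptedCase:
    signs: set            # signs and symptoms reported in the contact
    diagnoses: dict       # diagnosis name -> set of signs it covers
    actions: list = field(default_factory=list)  # proposed planning actions


def self_critique(case):
    """Return (page_specialist, reasons, releasable_actions)."""
    reasons = []
    # Simplified coverage check: every sign must be explained by a diagnosis.
    covered = set().union(*case.diagnoses.values())
    for sign in sorted(case.signs - covered):
        reasons.append(f"sign not covered by any diagnosis: {sign}")
    # Signs tagged as safety critical always trigger a page.
    for sign in sorted(case.signs & CRITICAL_SIGNS):
        reasons.append(f"safety critical sign: {sign}")
    # Critical planning actions are withheld; the others can be released.
    releasable = []
    for action in case.actions:
        if action in CRITICAL_ACTIONS:
            reasons.append(f"safety critical action: {action}")
        else:
            releasable.append(action)
    return bool(reasons), reasons, releasable


routine = AdaptedCase({"Fatigue"}, {"LiverChronicGVHD": {"Fatigue"}},
                      ["OrderLiverPanel"])
urgent = AdaptedCase({"BleedingNOS"}, {"LiverChronicGVHD": {"BleedingNOS"}},
                     ["StopMedication", "OrderLiverPanel"])
print(self_critique(routine))
print(self_critique(urgent))
```

In this sketch the non-critical actions are still returned when a specialist is paged, which mirrors the behaviour described in the text: the provider receives the non safety critical recommendations immediately, while a human takes charge of the critical part.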

In any case, the system recommendations provided to the home care provider are transmitted to LTFU, who can review the advice given, modify it, and/or contact the home-care provider. Having all information about patient contacts in electronic format is already an advance in LTFU clinical work, and the CARE-PARTNER contact management system is in itself a valuable improvement over paper-based contact management.

In addition, the decision-support system speeds up the response time from LTFU, making it instantaneous in most cases, namely all those that are not safety critical. By comparison, the response to a phone contact takes several days, because the nurse taking phone calls first reports the case to the clinicians during regularly scheduled meetings, then gets back with an answer. Even for safety critical cases, the system provides non-safety critical answers right away to the provider, and pages an LTFU clinician for additional recommendations. Thus the system advocates close cooperation between LTFU nurses and clinicians and home care providers in the best interest of the patient, and still considerably speeds up the response time of LTFU.

6 Conclusion

CARE-PARTNER is a safety critical case-based reasoning system. By implementing a procedural, software engineering, and knowledge level safety insurance plan, CARE-PARTNER has built-in safeguards to prevent errors that may lead to life-threatening accidents. Spending time incorporating these safeguards into the system is more valuable than spending the same time improving the reliability of the system to perfection, because reliability does not ensure safety. Since this problem is very general in medicine, it is very likely that the approach designed for this system could be applied to other safety-critical systems in medicine.

Case-based reasoning has been chosen in this application domain as a method of choice, as it often is in medicine and the life sciences, in particular in new and/or rapidly evolving domains such as stem-cell transplantation. The case-based learning capability of the system is a main advantage for keeping the system up to date with the ongoing changes in clinical knowledge and practice, and it also shortens the knowledge acquisition phase.

Case-based reasoning systems generally do not deal with the safety criticality aspect of the recommendations that they provide, for one of the following reasons:

• Either they assume that their users will take their recommendations as pure advice, and will make their own judgment about whether or not to follow the system recommendations. Taking this argument for granted is dangerous in a safety critical environment (see reasons 5 and 1 in Leveson and Turner [9], Section 4 above). It is very likely that such an expectation will be defeated one day, and may lead, in addition to irremediable personal injury, to discrediting the system and similar systems.

• Or they provide cases to users mostly to support their own decision-making process, thus leaving the reuse step to the user.

Most of the time it is possible to find a non safety critical way of using a system that may be potentially safety critical. For instance, CARE-PARTNER has been extended into an intelligent tutoring system (ITS) as a teaching assistant [3].

Nevertheless, it is often very valuable to provide active and timely decision-support advice in domains that may be safety critical. CARE-PARTNER, for instance, improves patient care along several dimensions. It is an electronic contact management system that records detailed contact information, with its complete clinical context, for care follow-up as well as for clinical research purposes, and is thus quite valuable as a medical knowledge management tool. It also provides just-in-time, evidence-based decision-support recommendations for physicians not specialized in the stem-cell transplant domain. It fosters and monitors the application of evidence-based medicine, which is valued as an advancement in medical scientific practice. Finally, with its safeguards in place, it is capable of discriminating when to resort to a human expert for recommendations, and when to provide these recommendations itself, enabling cooperative work between the system, the home care provider, and the specialist.

Acknowledgments

This work was supported in part by grant R01HS09407 from the Agency for Health Care Policy and Research (AHCPR).


References

1. Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations, and Systems. AI Communications 7(1) (1994) 39-59

2. van Bemmel, J.H., Musen, M.A.: Handbook of Medical Informatics. Springer-Verlag, Berlin Heidelberg New York (1997)

3. Bichindaritz, I., Sullivan, K.M.: Generating Practice Cases for Medical Training from a Knowledge-Based Decision-Support System. In: 6th European Conference on Case Based Reasoning Workshop Proceedings, Workshop on Case-Based Reasoning in Education, Aberdeen, Scotland (2002) 3-14

4. Bichindaritz, I., Conlon, E.: Temporal knowledge representation and organization for case-based reasoning. In: Proceedings TIME-96. IEEE Society Press (1996) 24-29

5. Bichindaritz, I., Sullivan, K.M.: Reasoning from Knowledge Supported by More or Less Evidence in a Computerized Decision Support System for Bone-Marrow Post-Transplant Care. In: AAAI Spring Symposium on Multimodal Reasoning. AAAI Press, Stanford, Calif (1998) 85-90

6. Bichindaritz, I., Kansu, E., Sullivan, K.M.: Case-Based Reasoning in CARE-PARTNER: Gathering Evidence for Evidence-Based Medical Practice. In: European Workshop on Case-Based Reasoning. Lecture Notes in Artificial Intelligence, Vol. 1488. Springer-Verlag (1998) 334-345

7. Bichindaritz, I., Kansu, E., Sullivan, K.M.: Integrating Case-Based Reasoning, Rule-Based Reasoning and Information Retrieval for Medical Problem-Solving. In: AAAI Workshop on CBR Integrations. AAAI Press (1998) 9-16

8. Fox, J.P.: Clinical Decision Support Systems: a Discussion of Quality, Safety and Legal Liability Issues. In: Proceedings AMIA Symp. (2002) 265-9

9. Leveson, N., Turner, C.S.: An Investigation of the Therac-25 Accidents. IEEE Computer 26(7) (1993) 18-41

10. Sullivan, K.M., Siadak, M.F.: Stem Cell Transplantation. In: Johnson, F.E., Virgo, K.S., Edge, S.B., Pellegrini, C.A., Poston, G.J., Schantz, S.P., Tsukamoto, N. (eds): Cancer Patient Follow-Up. Mosby Year-Book Publications, St Louis (1997) 490-501


Constraint Reasoning in Deep Biomedical Models

Jorge Cruz and Pedro Barahona

Centro de Inteligência Artificial, Departamento de Informática Universidade Nova de Lisboa, 2829-516 Caparica, Portugal

{jc,pb}@di.fct.unl.pt

Abstract. Deep biomedical models are often expressed by means of differential equations. Despite their expressive power, it is difficult to reason about them and to make decisions with them, given their non-linearity and the important effects that the uncertainty on data may cause. For this reason traditional numerical simulations may only provide a likelihood of the results obtained. In contrast, we propose in this paper the use of a constraint reasoning framework able to make safe decisions notwithstanding some degree of uncertainty, and illustrate this approach in the diagnosis of diabetes and the tuning of drug design.

1 Introduction

Biomedical models provide a representation of the functioning of living organisms, making it possible to reason about them and eventually to take decisions about their state (diagnosis) or adequate actions (e.g. therapeutic) regarding some intended goals. Parametric differential equations are general and expressive mathematical means to model system dynamics, and are suitable to express the deep modelling of many biomedical systems. Notwithstanding their expressive power, reasoning with such models may be quite difficult, given their complexity. Analytical solutions are available only for the simplest models. Alternative numerical simulations require precise numerical values for the parameters involved, which are usually impossible to gather given the uncertainty on available biomedical data. This may be an important drawback since, given the usual non-linearity of the models, small differences in the input parameters may cause important differences in the output produced.

To overcome this limitation, Monte Carlo methods rely on a large number of simulations, which may be used to estimate the likelihood of the different options under study. However, they cannot provide safe conclusions regarding these options, given the various sources of error that they suffer from, both input precision errors and round-off errors accumulated in the simulations.

In contrast with such methods, constraint reasoning assumes the uncertainty of numerical variables within given bounds (e.g. intervals of real numbers) and propagates such knowledge through a network of constraints on these variables, in order to decrease the underlying uncertainty (i.e. the width of the intervals). To be effective, constraint reasoning methods must rely on advanced safe methods so that uncertainty is sufficiently bounded for safe decisions to be made.

Lack of space prevents a full explanation of the constraint reasoning techniques that we developed to handle differential equations [6,9]. Nevertheless, a brief introduction in section 2 shows the expressive power of the framework developed, and stresses the active use of certain constraints on actual, upper and lower values of the functions involved, and on the time during which, or the area under the curve by which, they exceed a certain threshold. Such constraints can only be used passively, both in alternative constraint reasoning frameworks and in more conventional numerical simulation methods.

This expressive power is illustrated in this paper in two medical applications, regarding the diagnosis of diabetes and the tuning of drug design, presented in sections 3 and 4, respectively. We show in these examples how the active use of constraints of the types above is sufficient to make safe decisions regarding the intended goals. The paper ends with a summary of the main conclusions.

2 Continuous Constraint Satisfaction Problems

Many real world problems can be modelled as Constraint Satisfaction Problems (CSPs) defined by a triple (X,D,C) where X is a set of variables, each with an associated domain of possible values in D, and C is a set of constraints on subsets of the variables [13]. A constraint specifies which values from the domains of its variables are compatible. A solution to the CSP is an assignment of values to all its variables, which satisfies all the constraints.

In continuous CSPs (CCSPs) variable domains are continuous real intervals and constraints are equalities and inequalities [10]. The interval constraints framework [3] combines propagation and search techniques from AI with methods from interval analysis [14] for solving CCSPs. Sound filtering algorithms are used to prune the domains of the CCSP variables with the guarantee that no possible solution is lost.

Partial information expressed by a constraint is used to eliminate incompatible values from the domain of its variables. To enforce local consistency [2, 12], the reduction of a variable domain is propagated to all constraints on that variable, which may further reduce the domains of other variables. The process terminates when a fixed point is attained and the domains cannot be further reduced.
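The propagation loop described above can be illustrated with a small Python sketch. It is not the authors' solver: the two constraints (x + y = 12 and x - y = 2), the initial domains, and all function names are assumptions chosen for illustration, and plain floating-point arithmetic is used, whereas a sound implementation would round interval bounds outwards.

```python
def intersect(a, b):
    """Intersection of two closed intervals given as (low, high) pairs."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("empty domain: the constraints are inconsistent")
    return (lo, hi)


def narrow_sum(x, y, c):
    """Narrow the domains of x and y with respect to x + y = c."""
    x = intersect(x, (c - y[1], c - y[0]))
    y = intersect(y, (c - x[1], c - x[0]))
    return x, y


def narrow_diff(x, y, c):
    """Narrow the domains of x and y with respect to x - y = c."""
    x = intersect(x, (y[0] + c, y[1] + c))
    y = intersect(y, (x[0] - c, x[1] - c))
    return x, y


def propagate(x, y):
    """Apply both narrowing functions until a fixed point is reached."""
    while True:
        previous = (x, y)
        x, y = narrow_sum(x, y, 12.0)
        x, y = narrow_diff(x, y, 2.0)
        if (x, y) == previous:
            return x, y


x_dom, y_dom = propagate((0.0, 10.0), (0.0, 10.0))
print("x in", x_dom, " y in", y_dom)
```

Starting from [0, 10] for both variables, the loop stops at x in [4, 10] and y in [2, 8]. The only actual solution is x = 7, y = 5, so the fixed point keeps values that belong to no solution, while never discarding the solution itself.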

Local consistency is partial, in that it is not sufficient to remove all inconsistent value combinations among the CCSP variables. Stronger higher order consistency requirements may be subsequently imposed establishing more global properties on the variable domains. Several alternative partial consistency criteria have been proposed [4] trying to offer the best trade-off between the computational cost of the enforcing algorithm and the pruning of the CCSP variable domains.

We have been developing global hull consistency, the strongest consistency criterion for pruning the initial CCSP variable domains into a single “box” (where each variable domain is represented by a single real interval) [7, 8]. It narrows the original domains into the smallest box that contains all possible canonical solutions (a canonical solution is the smallest box that can be represented with some specified precision and that cannot be proved inconsistent). Since this is a computationally expensive criterion, an important property of its enforcing algorithm is its anytime nature (partial pruning results are available at any time during the narrowing process).

2.1 Constraint Satisfaction Differential Problems

The behaviour of many systems is naturally modelled by a system of first order Ordinary Differential Equations (ODEs), often parametric. ODEs are equations that involve derivatives w.r.t. a single independent variable, t, usually representing time.

An ODE system may be represented in vector notation as

$$y' = f(y, t)$$

where the vector function f determines, for an instantiation of y and t, the evolution of y within an increment of t, and may be seen as a restriction on the sequence of values that y can take over t. Since it does not fully determine the sequence of values of y (but rather a family of such sequences), initial / boundary conditions are usually provided, giving a complete / partial specification of y at some time point t.

Classical numerical approaches for solving ODE problems compute numerical approximations of the solutions and do not provide guarantees on their accuracy. In contrast, validated [15] and constraint methods [11] do verify the existence of unique solutions and produce guaranteed error bounds for the true trajectory.

In this paper, we use an extension of the interval constraints framework that includes ODE systems as constraints within CCSPs. An ODE system together with additional information is called a Constraint Satisfaction Differential Problem (CSDP). A CCSP that includes constraints defined as CSDPs is called an extended CCSP.

In a CSDP there is a special variable ($x_{ODE}$), whose domain is a set of functions, which is associated with an ODE system $S$ for every $t$ within the interval $T$ through a special constraint, $ODE_{S,T}(x_{ODE})$. Variable $x_{ODE}$ represents those functions that are solutions of $S$ (during $T$) and satisfy all the additional restrictions. The other variables of the CSDP, denoted restriction variables, are all real valued variables used to model a number of constraints of interest in many applications.

A Value restriction, denoted $Value_{j,t}(x)$, associates a variable $x$ with the value of a trajectory component $j$ at a particular time $t$, and can be used to model initial and boundary conditions. A maximum Value restriction, $maximumValue_{j,T}(x)$, associates $x$ with the maximum value of a trajectory component $j$ within a time interval $T$ (minimum restrictions are similar). A time restriction $Time_{j,T,\geq\theta}(x)$ associates $x$ with the time within time period $T$ in which the value of a trajectory component $j$ exceeds a threshold $\theta$. Similarly, the area restriction $Area_{j,T,\geq\theta}(x)$ associates $x$ with the area of a trajectory component $j$, within time period $T$, above threshold $\theta$.
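To make the four kinds of restriction concrete, the following Python sketch evaluates each quantity on a single sampled trajectory component. It is only an illustration under strong assumptions: the paper works with guaranteed enclosures of all trajectories, whereas this code uses one list of (time, value) samples and a midpoint approximation on each segment, and all function names are hypothetical.

```python
def value_at(samples, t):
    """Value restriction: value of the component at a sampled time t."""
    for ts, v in samples:
        if ts == t:
            return v
    raise KeyError(f"time {t} is not a sample point")


def maximum_value(samples):
    """maximumValue restriction: largest sampled value over the period."""
    return max(v for _, v in samples)


def time_above(samples, theta):
    """Time restriction: approximate time spent at or above theta."""
    total = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if (v0 + v1) / 2 >= theta:      # segment midpoint value
            total += t1 - t0
    return total


def area_above(samples, theta):
    """Area restriction: approximate area of the curve above theta."""
    total = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        excess = max((v0 + v1) / 2 - theta, 0.0)
        total += excess * (t1 - t0)
    return total


# A triangular test trajectory: rises from 0 to 4, then falls back to 0.
trajectory = [(0, 0), (1, 2), (2, 4), (3, 2), (4, 0)]
print(value_at(trajectory, 0), maximum_value(trajectory))
print(time_above(trajectory, 2), area_above(trajectory, 2))
```

In the framework described in the text each of these quantities would be an interval-valued restriction variable that can also be used to narrow the trajectory enclosure, which a point-valued computation like this one cannot do.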

The solving procedure for CSDPs that we developed maintains a safe enclosure for the set of possible solutions based on validated methods for solving ODE problems with initial conditions. The improvement of the quality of such enclosure is combined with the enforcement of the ODE restrictions through constraint propagation on a set of narrowing functions associated with the CSDP. Some are responsible for reducing the domain of a restriction variable according to the current trajectory enclosure. Others are responsible for reducing the uncertainty of the trajectory enclosure according to the domain of a restriction variable. Finally there are narrowing functions responsible for reducing the uncertainty of the trajectory enclosure by the successive application of the validated method between consecutive time points.

The full integration of a CSDP within an extended CCSP is accomplished by sharing the restriction variables of the CSDP. The CSDP solving procedure is used as a safe narrowing procedure for reducing the domains of the restriction variables.

3 A Differential Model for Diagnosing Diabetes

Diabetes mellitus prevents the body from metabolising glucose due to an insufficient supply of insulin. A glucose tolerance test (GTT) is frequently used for diagnosing diabetes. The patient ingests a large dose of glucose after an overnight fast and in the subsequent hours, several blood tests are made. From the evolution of the glucose concentration a diagnosis is made by the physicians.

Ackerman et al. [1] proposed a well-known model for the blood glucose regulatory system during a GTT, with the following parametric differential equations:

$$\frac{dg(t)}{dt} = -p_1\,g(t) - p_2\,h(t) \qquad \frac{dh(t)}{dt} = -p_3\,h(t) + p_4\,g(t)$$

where g is the deviation of the glucose blood concentration from its fasting level; h is the deviation of the insulin blood concentration from its fasting level; p1, p2, p3 and p4 are positive, patient dependent, parameters.

Let t=0 be the instant immediately after the absorption of a large dose of glucose, g0, when the deviation of insulin from the fasting level is still negligible. According to the model, the evolution of glucose and insulin blood concentrations is described by the trajectory of the above system of differential equations, with initial values g(0)=g0 and h(0)=0, and depends on the parameter values p1 to p4.

Figure 1 shows the evolution of the glucose concentration for two patients with a glucose fasting level concentration of 110 mg glucose/100 ml blood. Immediately after the ingestion of an initial dose of glucose, the glucose concentration increases to 190 (i.e. g0 = 190-110 = 80). The different trajectories are due to different parameters.

Fig. 1. Evolution of the blood glucose concentration.
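The two trajectories can be approximated numerically. The following is a minimal sketch, assuming a plain fixed-step Runge-Kutta (RK4) integrator and the case A and case B parameter values from the Figure 1 legend; the function name `simulate_gtt` and the step size are illustrative choices, and this floating-point simulation gives no validated enclosure, unlike the method of the paper.

```python
def simulate_gtt(p1, p2, p3, p4, g0, t_end, dt=0.1):
    """Integrate g' = -p1*g - p2*h, h' = -p3*h + p4*g with RK4.

    Starts from g(0) = g0, h(0) = 0 and returns (g, h) at t_end (minutes).
    """
    def f(g, h):
        return -p1 * g - p2 * h, -p3 * h + p4 * g

    g, h = g0, 0.0
    for _ in range(round(t_end / dt)):
        k1g, k1h = f(g, h)
        k2g, k2h = f(g + 0.5 * dt * k1g, h + 0.5 * dt * k1h)
        k3g, k3h = f(g + 0.5 * dt * k2g, h + 0.5 * dt * k2h)
        k4g, k4h = f(g + dt * k3g, h + dt * k3h)
        g += dt * (k1g + 2 * k2g + 2 * k3g + k4g) / 6.0
        h += dt * (k1h + 2 * k2h + 2 * k3h + k4h) / 6.0
    return g, h


# Parameter values taken from the Figure 1 legend.
CASE_A = dict(p1=0.0044, p2=0.04, p3=0.0045, p4=0.03)
CASE_B = dict(p1=0.0044, p2=0.03, p3=0.0045, p4=0.015)

g60_a, _ = simulate_gtt(g0=80.0, t_end=60.0, **CASE_A)
g60_b, _ = simulate_gtt(g0=80.0, t_end=60.0, **CASE_B)
print(f"case A: glucose deviation after 60 min = {g60_a:.1f}")
print(f"case B: glucose deviation after 60 min = {g60_b:.1f}")
```

After one hour the deviation is already negative in case A and still positive in case B, which is the qualitative difference between the two curves.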

The general behaviour of the glucose trajectory (and insulin trajectory as well) oscillates around, and eventually converges to, the fasting concentration level. The natural period T of such trajectory is given (in minutes) by:

$$T = \frac{2\pi}{\sqrt{p_1 p_3 + p_2 p_4}}$$

A criterion used for diagnosing diabetes is based on the natural period T, which is increased in diabetic patients. It is generally accepted that a value for T higher than 4 hours is an indicator of diabetes, otherwise normalcy is concluded. We next show how the extended CCSP framework can be used to support the diagnosis of diabetes, possibly interrupting the sequence of blood tests if a safe decision can be made.
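As a worked illustration of the criterion, the natural period can be evaluated for point values of the parameters. This is a small sketch using the case A and case B values from the Figure 1 legend; the helper names `natural_period` and `classify` are illustrative, and a point evaluation ignores the parameter uncertainty that the CCSP framework is designed to handle.

```python
import math


def natural_period(p1, p2, p3, p4):
    """Natural period T = 2*pi / sqrt(p1*p3 + p2*p4), in minutes."""
    return 2.0 * math.pi / math.sqrt(p1 * p3 + p2 * p4)


def classify(period_minutes, threshold=240.0):
    """Apply the 4-hour criterion to a single (exact) period value."""
    return "diabetic" if period_minutes > threshold else "normal"


t_a = natural_period(0.0044, 0.04, 0.0045, 0.03)   # case A
t_b = natural_period(0.0044, 0.03, 0.0045, 0.015)  # case B
print(f"case A: T = {t_a:.1f} min -> {classify(t_a)}")
print(f"case B: T = {t_b:.1f} min -> {classify(t_b)}")
```

Case A gives a period of about 180 minutes and case B about 290 minutes, so they fall on opposite sides of the 240-minute threshold.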

[Figure 1: plot of glucose (mg/100 ml, axis from 0 to 210) against time (minutes, axis from 0 to 480). In case A (thick line), typical normal values were used: p1=0.0044, p2=0.04, p3=0.0045, p4=0.03. In case B (thin line), parameters p2 and p4 were reduced: p2=0.03, p4=0.015.]


3.1 Representing the Model and Its Constraints with an Extended CCSP

The above decision problem may be modelled by an extended CCSP with two constraints. A first constraint, defined as a CSDP, relates the evolution of the glucose and insulin concentrations with the trajectory values obtained through the blood tests. A second constraint is a simple numerical constraint relating the natural period with the ODE parameters according to its defining expression.

The CSDP constraint is associated with the following ODE system S based on the original system of differential equations but with the parameters included as new components with null derivatives:

$$S \equiv \begin{cases} s_1'(t) = -s_3(t)\,s_1(t) - s_4(t)\,s_2(t) \\ s_2'(t) = -s_5(t)\,s_2(t) + s_6(t)\,s_1(t) \\ s_3'(t) = s_4'(t) = s_5'(t) = s_6'(t) = 0 \end{cases}$$

If n blood tests were made at times t1,…,tn, the constraint is defined by the CSDP Pn, which includes the ODE constraint forcing the trajectories to satisfy the ODE system S between t=0.0 and t=tn, together with Value constraints representing each known trajectory component value. Variables g0, h0, p1, p2, p3 and p4 are the initial values, and variables gt1,…,gtn are the glucose component values at times t1,…,tn.

CSDP Pn=(Xn,Dn,Cn) where:
Xn = <xODE, g0, h0, p1, p2, p3, p4, gt1, …, gtn, T>
Dn = <DODE, Dg0, Dh0, Dp1, Dp2, Dp3, Dp4, Dgt1, …, Dgtn, DT>
Cn = { ODE_{S,[0.0..tn]}(xODE),
       Value_{g,0.0}(g0), Value_{h,0.0}(h0),
       Value_{p1,0.0}(p1), Value_{p2,0.0}(p2), Value_{p3,0.0}(p3), Value_{p4,0.0}(p4),
       Value_{g,t1}(gt1), …, Value_{g,tn}(gtn),
       T = 2π/sqrt(p1p3+p2p4) }

3.2 Using the Extended CCSP for Diagnosing Diabetes

By solving the extended CCSP Pn with the initial variable domains set up to the available information, the natural period T will be safely bounded, and a guaranteed diagnosis can be made if T is clearly above or below the threshold of 240 minutes.

In the following we assume that the acceptable bounds for the parameter values are 50% above/below the typical normal values (p1=0.0044, p2=0.04, p3=0.0045, p4=0.03) and study two different patients, A and B, whose observed values agree with Figure 1.
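Before any blood test is used, the parameter bounds alone already bound T. The sketch below assumes only that T decreases monotonically in each positive parameter, so its extreme values are reached at the corners of the parameter box; it uses ordinary floating-point arithmetic without outward rounding, so it is an illustration rather than a safe enclosure, and `period_bounds` is a hypothetical helper name.

```python
import math

TYPICAL = {"p1": 0.0044, "p2": 0.04, "p3": 0.0045, "p4": 0.03}


def period_bounds(typical, rel=0.5):
    """Bound T = 2*pi/sqrt(p1*p3 + p2*p4) when each p is within +/- rel of typical."""
    low = {k: (1.0 - rel) * v for k, v in typical.items()}
    high = {k: (1.0 + rel) * v for k, v in typical.items()}
    # Largest parameters give the shortest period, smallest give the longest.
    t_min = 2.0 * math.pi / math.sqrt(high["p1"] * high["p3"] + high["p2"] * high["p4"])
    t_max = 2.0 * math.pi / math.sqrt(low["p1"] * low["p3"] + low["p2"] * low["p4"])
    return t_min, t_max


t_min, t_max = period_bounds(TYPICAL)
print(f"T within [{t_min:.1f} .. {t_max:.1f}] minutes from the parameter bounds alone")
```

The resulting range, roughly 120 to 360 minutes, straddles the 240-minute threshold, which is why the observed glucose values are needed to narrow T further.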

The first blood test on patient A, performed 1 hour after the glucose ingestion, indicates a glucose deviation from the fasting level concentration of –29.8. The extended CCSP P1 (with a single blood test) is solved by enforcing global hull consistency on the following initial variable domains:

Dp1=[0.0022..0.0066], Dp2=[0.0200..0.0600], Dp3=[0.0022..0.0068], Dp4=[0.0150..0.0450],
Dg0=[80.0], Dh0=[0.0], Dg60=[–29.85..–29.75], DT=[–∞..+∞]

Table 1 shows results for T obtained after 10, 30 and 60 minutes of execution time (with 0.000001 precision) on a Pentium 4 with 256 MB RAM, running at 1500 MHz.


Table 1. Narrowing results obtained for patient A from the information of the first blood test.

     10 minutes        30 minutes        60 minutes
T    [140.5..233.3]    [149.6..213.9]    [154.9..206.0]

After 10 minutes of CPU time (in fact after 7 minutes), the natural period is proved to be smaller than 240 minutes and a normal diagnosis can be guaranteed with no need of further examinations. When the next blood test was due, 60 minutes later, T was proved to be under 206, much less than the threshold for diagnosing diabetes.

In patient B, the observed glucose deviation at the same first blood examination is 17.9. The initial domains for the variables of P are thus the same as in the previous case, except for the observed glucose value Dg60=[17.95..18.05].

Enforcing global hull consistency on P with such information alone, no safe diagnosis can be attained before the next blood test (1 hour later). After 60 minutes of CPU time, T was proved to be within [236.4..327.9] and both diagnoses, normal or diabetic, are still possible, though diabetes is quite likely. Further information is required, and a second test is performed, indicating a glucose deviation of –38.9. The extended CCSP P2 (two blood tests) is solved with the initial variable domains:

Dg60=[17.95..18.05], Dg120=[–38.95..–38.85], DT=[236.3..328.0]

In less than 20 minutes, T was proved to be above 240. One hour later, when the next examination would be due, T is clearly above such threshold (T∈[245.0..323.8]), and the patient is safely diagnosed as diabetic, requiring no further blood tests.
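The decision rule applied in this section can be stated compactly: a diagnosis is safe only when the whole enclosure of T lies on one side of the threshold. The sketch below applies that rule to the intervals reported above; the function name `diagnose` and the label "undecided" are illustrative.

```python
def diagnose(t_low, t_high, threshold=240.0):
    """Return a diagnosis only if the interval [t_low, t_high] excludes the threshold."""
    if t_high < threshold:
        return "normal"
    if t_low > threshold:
        return "diabetic"
    return "undecided"


# Enclosures of T reported in the text (minutes).
print(diagnose(140.5, 233.3))  # patient A, one test, 10 minutes of CPU time
print(diagnose(236.4, 327.9))  # patient B, one test, 60 minutes of CPU time
print(diagnose(245.0, 323.8))  # patient B, two tests
```

An "undecided" result is the signal that another blood test, or more computation time, is needed.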

4 A Differential Model for Drug Design

Pharmacokinetics studies the time course of drug concentrations in the body, how they move around it and how quickly this movement occurs. Oral drug administration is a widespread method for the delivery of therapeutic drugs to the blood stream. This section is based on the following two-compartment model of the oral ingestion/gastro-intestinal absorption process (see [17] and [18] for details):

$$\frac{dx(t)}{dt} = -p_1\,x(t) + D(t) \qquad \frac{dy(t)}{dt} = p_1\,x(t) - p_2\,y(t)$$

where x is the concentration of the drug in the gastro-intestinal tract; y is the concentration of the drug in the blood stream; D is the drug intake regimen; p1 and p2 are positive parameters.

The model considers two compartments, the gastro-intestinal tract and the blood stream. The drug enters the gastro-intestinal tract according to a drug intake regimen, described as a function of time D(t). It is then absorbed into the blood stream at a rate, p1, proportional to its gastro-intestinal concentration and independently of its blood concentration. The drug is removed from the blood through a metabolic process at a rate, p2, proportional to its concentration there.

The drug intake regimen D(t) depends on several factors related with the production of the drug by the pharmaceutical company. We assume that the drug is taken on a periodic basis (every six hours), providing a unit dosage that is uniformly dissolved into the gastro-intestinal tract during the first half hour. Consequently, for each period of six hours the intake regimen is defined as:

$$D(t) = \begin{cases} 2 & \text{if } 0.0 \le t \le 0.5 \\ 0 & \text{if } 0.5 < t \le 6.0 \end{cases}$$

The effect of the intake regimen on the concentrations of the drug in the blood stream during the administration period is determined by the absorption and metabolic parameters, p1 and p2. Maintaining the above intake regimen, the solution of the ODE system asymptotically converges to a six-hour periodic trajectory called the limit cycle, shown in Figure 2 for specific values of the ODE parameters.

Fig. 2. The periodic limit cycle with p1=1.2 and p2=ln(2)/5.
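The convergence to a limit cycle can be observed by simulating many consecutive dosing periods. The following is a minimal sketch, assuming a fixed-step RK4 integrator whose steps align with the discontinuity of D(t) at t=0.5 and a start from zero concentrations; the names `rk4_step`, `one_period` and `limit_cycle_start` are illustrative, and the computation is an ordinary floating-point simulation rather than a validated enclosure.

```python
import math


def rk4_step(x, y, d, p1, p2, h):
    """One RK4 step of x' = -p1*x + d, y' = p1*x - p2*y with constant intake d."""
    def f(xv, yv):
        return -p1 * xv + d, p1 * xv - p2 * yv

    k1x, k1y = f(x, y)
    k2x, k2y = f(x + 0.5 * h * k1x, y + 0.5 * h * k1y)
    k3x, k3y = f(x + 0.5 * h * k2x, y + 0.5 * h * k2y)
    k4x, k4y = f(x + h * k3x, y + h * k3y)
    return (x + h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0,
            y + h * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0)


def one_period(x, y, p1, p2, steps_per_hour=100):
    """Advance (x, y) over one six-hour period; D = 2 in the first half hour, else 0."""
    h = 1.0 / steps_per_hour
    half_hour = steps_per_hour // 2
    for step in range(6 * steps_per_hour):
        d = 2.0 if step < half_hour else 0.0
        x, y = rk4_step(x, y, d, p1, p2, h)
    return x, y


def limit_cycle_start(p1, p2, periods=60):
    """Iterate the period map from (0, 0) until the transient has died out."""
    x, y = 0.0, 0.0
    for _ in range(periods):
        x, y = one_period(x, y, p1, p2)
    return x, y


P1, P2 = 1.2, math.log(2) / 5
x0, y0 = limit_cycle_start(P1, P2)
x6, y6 = one_period(x0, y0, P1, P2)
print(f"start of cycle: x0 = {x0:.5f}, y0 = {y0:.4f}")
print(f"change after one more period: {abs(x6 - x0):.2e}, {abs(y6 - y0):.2e}")
```

Once the transient has decayed, one more period returns the state to where it started, and the gastro-intestinal concentration at the start of each cycle is very small.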

In designing a drug, it is necessary to adjust the ODE parameters to guarantee that the drug concentrations are effective while causing no side effects. In general, it is sufficient to guarantee some constraints on the concentrations over a limit cycle.

One constraint is to keep the drug concentration in the blood within predefined bounds, namely to prevent its maximum value (the Peak Concentration) from exceeding a threshold associated with a side effect. Another constraint imposes bounds on the area under the curve of the drug blood concentration (known as AUC), guaranteeing that the accumulated dosage is high enough to be effective. Finally, bounding the total time that such concentration remains above or under some threshold is an additional requirement for controlling drug concentration during the limit cycle. Figure 3 shows maximum, minimum, area (≥1.0) and time (≥1.1) for the limit cycle of Figure 2.

Fig. 3. Maximum, minimum, area and time values at the limit cycle (p1=1.2 and p2=ln(2)/5).

We show below how the extended CCSP framework can be used for supporting the drug design process. We will focus on the absorption parameter, p1, which may be adjusted by appropriate time release mechanisms (the metabolic parameter, p2, tends to be characteristic of the drug itself and cannot be easily modified). The tuning of p1 should satisfy the above requirements during the limit cycle, namely: (i) The drug concentration in the blood bounded between 0.8 and 1.5; (ii) Its area under the curve (and above 1.0) bounded between 1.2 and 1.3; (iii) It cannot exceed 1.1 for more than 4 hours.

[Figures 2 and 3: plots of x(t) and y(t) against t over the six-hour limit cycle; Figure 3 marks the maximum, minimum, area (≥1) and time (≥1.1) on y(t).]

4.1 Representing the Model and Its Constraints with an Extended CCSP

The expressive power of the extended CCSP framework allows its use for representing the limit cycle and the different requirements illustrated in Figure 3. Due to the intake regimen definition D(t), the ODE system has a discontinuity at time t=0.5, and is represented by two CSDP constraints, PS1 and PS2, in sequence.

The first, PS1, ranges from the beginning of the limit cycle (t=0.0) to time t=0.5, and the second, PS2, is associated with the remaining trajectory of the limit cycle (until t=6.0). S1 and S2 are the corresponding ODE systems, where p1 and p2 are included as new components with null derivatives and the intake regimen D(t) is a constant:

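$$S_1 \equiv \begin{cases} s_1'(t) = -s_3(t)\,s_1(t) + 2 \\ s_2'(t) = s_3(t)\,s_1(t) - s_4(t)\,s_2(t) \\ s_3'(t) = s_4'(t) = 0 \end{cases} \qquad S_2 \equiv \begin{cases} s_1'(t) = -s_3(t)\,s_1(t) \\ s_2'(t) = s_3(t)\,s_1(t) - s_4(t)\,s_2(t) \\ s_3'(t) = s_4'(t) = 0 \end{cases}$$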

Both CSDP constraints are defined as shown below for PS1 (PS2 is similar). Besides the ODE constraint, Value, Maximum Value, Minimum Value, Area and Time restrictions associate variables with different trajectory properties relevant in this problem. Variables xinit, yinit, p1 and p2 are the initial trajectory values, and xfin and yfin are the final trajectory values of the 1st and 2nd components. Variables ymax and ymin are the maximum and minimum trajectory values of the 2nd component (drug concentration in the blood stream) for this period. Variables ya and yt denote the area above 1.0 and the time above 1.1 of the 2nd component in this same period.

CSDP PS1 = (X1,D1,C1) where:
X1 = <xODE, xinit, yinit, p1, p2, xfin, yfin, ymax, ymin, ya, yt>
D1 = <DODE, Dxinit, Dyinit, Dp1, Dp2, Dxfin, Dyfin, Dymax, Dymin, Dya, Dyt>
C1 = { ODE_{S1,[0.0..0.5]}(xODE),
       Value_{x,0.0}(xinit), Value_{y,0.0}(yinit), Value_{p1,0.0}(p1), Value_{p2,0.0}(p2),
       Value_{x,0.5}(xfin), Value_{y,0.5}(yfin),
       MaximumValue_{y,[0.0..0.5]}(ymax), MinimumValue_{y,[0.0..0.5]}(ymin),
       Area_{y,[0.0..0.5],≥1.0}(ya), Time_{y,[0.0..0.5],≥1.1}(yt) }

The extended CCSP P connects in sequence the two ODE segments by assigning the same variables x05 and y05 to both the final values of PS1 and the initial values of PS2 (parameters p1 and p2 are shared by both constraints). Moreover, the six-hour period is guaranteed by the assignment of the same variables x0 and y0 to both the initial values of PS1 and the final values of PS2. Besides considering all the restriction variables (ymax … yt) of each ODE segment, new variables for the whole trajectory, yarea and ytime, sum the values in each segment.


CCSP P=(X,D,C) where:
X = <x0, y0, p1, p2, x05, y05, ymax1, ymax2, ymin1, ymin2, ya1, ya2, yarea, yt1, yt2, ytime>
D = <Dx0, Dy0, Dp1, Dp2, Dx05, Dy05, Dymax1, Dymax2, Dymin1, Dymin2, Dya1, Dya2, Dyarea, Dyt1, Dyt2, Dytime>
C = { PS1(x0, y0, p1, p2, x05, y05, ymax1, ymin1, ya1, yt1),
      PS2(x05, y05, p1, p2, x0, y0, ymax2, ymin2, ya2, yt2),
      yarea = ya1 + ya2, ytime = yt1 + yt2 }

4.2 Using the Extended CCSP for Parameter Tuning

The tuning of drug design may be supported by solving P with the appropriate set of initial domains for its variables. We will assume p2 to be fixed to a five-hour half-life (Dp2=[ln(2)/5]) and p1 to be adjustable up to about a ten-minute half-life (Dp1=[0..4]). The initial value x0, always very small, is safely bounded in the interval Dx0=[0.0..0.5].

The assumptions about the parameter ranges together with the bounds imposed by the above requirements justify the following initial domains for the variables of P (all the remaining variable domains unbounded):

Dx0= [0.0 .. 0.5], Dy0 = [0.8 .. 1.5], Dp1 = [0.0 .. 4.0], Dp2 = [ ln(2)/5], Dymax1=[0.8 .. 1.5], Dymax2=[0.8 .. 1.5], Dymin1=[0.8 .. 1.5], Dymin2 = [0.8..1.5], Dyarea= [1.2 .. 1.3], Dytime= [0.0 .. 4.0]

Solving the extended CCSP P (enforcing global hull consistency), with a precision of 0.001, narrows the original p1 interval to [1.191..1.543] in less than 3 minutes. Hence, for p1 outside this interval the set of requirements cannot be satisfied.

This may help to adjust p1 but offers no guarantees on specific choices within the obtained interval. For instance, two canonical solutions for p1 [1.191.. 1.192] and [1.542..1.543] contain no real solution, since when solving the problem with a higher precision (0.000001), the domain of p1 is narrowed to [1.209233..1.474630] that does not include the above canonical solutions (obtained with the lower 0.001 precision).

Nevertheless, using CCSP P with different initial domains may produce guaranteed results for particular choices of the p1 parameter values. For example, for p1∈[1.3..1.4] (the manufacturing process prevents p1 from being specified with higher precision), and the following initial domains (the remaining are unbounded):

Dx0=[0.0..0.5], Dy0=[0.8..1.5], Dp1=[1.3..1.4], Dp2=[ln(2)/5]

global hull consistency on P (with 0.001 precision) narrows the following, initially unbounded, domains to:

ymin1∈[0.881..0.891], ymax1∈[1.090..1.102], yarea∈[1.282..1.300],
ymin2∈[0.884..0.894], ymax2∈[1.447..1.462], ytime∈[3.908..3.967].

Notwithstanding the uncertainty, these results do prove that with p1 within [1.3..1.4], all limit cycle requirements are safely guaranteed (the obtained bounds are well within the requirements). Moreover, they offer some insight on the requirements, showing, for instance, the area requirement to be the most critical constraint.
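For a single parameter value inside this range, the same quantities can be estimated by sampling a simulated limit cycle. The sketch below assumes p1=1.35 (the midpoint of [1.3..1.4], chosen only for illustration), a fixed-step RK4 integrator, trapezoidal integration for the area above 1.0 and step counting for the time above 1.1; all function names are hypothetical. Unlike the interval results above, such point estimates carry no guarantee and say nothing about other values of p1.

```python
import math


def rk4_step(x, y, d, p1, p2, h):
    """One RK4 step of x' = -p1*x + d, y' = p1*x - p2*y with constant intake d."""
    def f(xv, yv):
        return -p1 * xv + d, p1 * xv - p2 * yv

    k1x, k1y = f(x, y)
    k2x, k2y = f(x + 0.5 * h * k1x, y + 0.5 * h * k1y)
    k3x, k3y = f(x + 0.5 * h * k2x, y + 0.5 * h * k2y)
    k4x, k4y = f(x + h * k3x, y + h * k3y)
    return (x + h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0,
            y + h * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0)


def limit_cycle_samples(p1, p2, steps_per_hour=200, warmup_periods=60):
    """Return (h, ys): the step size and samples of y over one converged period."""
    h = 1.0 / steps_per_hour
    half_hour = steps_per_hour // 2
    x = y = 0.0
    ys = []
    for _ in range(warmup_periods + 1):   # only the last period is kept
        ys = [y]
        for step in range(6 * steps_per_hour):
            d = 2.0 if step < half_hour else 0.0
            x, y = rk4_step(x, y, d, p1, p2, h)
            ys.append(y)
    return h, ys


def restriction_values(h, ys, area_level=1.0, time_level=1.1):
    """Estimate minimum, maximum, area above area_level and time above time_level."""
    pairs = list(zip(ys, ys[1:]))
    area = sum(0.5 * h * (max(a - area_level, 0.0) + max(b - area_level, 0.0))
               for a, b in pairs)
    time_above = sum(h for a, b in pairs if 0.5 * (a + b) >= time_level)
    return {"ymin": min(ys), "ymax": max(ys), "yarea": area, "ytime": time_above}


def meets_requirements(v):
    """Requirements (i) to (iii) on the blood concentration over the limit cycle."""
    return (0.8 <= v["ymin"] and v["ymax"] <= 1.5
            and 1.2 <= v["yarea"] <= 1.3 and v["ytime"] <= 4.0)


h, ys = limit_cycle_samples(1.35, math.log(2) / 5)
values = restriction_values(h, ys)
print({k: round(v, 3) for k, v in values.items()}, meets_requirements(values))
```

The sampled area comes out close to its upper limit of 1.3, in line with the observation that the area requirement is the most critical one.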

The above bounds were obtained in about 13 minutes. However, much faster results may be obtained if the goal is simply to check whether the requirements are met. Since global hull consistency is enforced by an anytime algorithm, its execution may be interrupted as soon as the requirements are satisfied (10 minutes in this case).


A better approach in this case would be to prove that the CCSP P with the initial domains Dx0=[0.0..0.5], Dy0=[0.8..1.5], Dp1=[1.3..1.4] and Dp2=[ln(2)/5], together with each of the following domains, cannot contain any solution (again, the remaining domains are kept unbounded):

Dymax1=[1.5..+∞]; Dymax2=[1.5..+∞]; Dymin1=[−∞..0.8]; Dymin2=[−∞..0.8];
Dyarea=[1.3..+∞]; Dyarea=[−∞..1.2]; Dytime=[4.0..+∞].

By independently proving that no solutions exist for each of the above problems, which cover all non-satisfying possibilities, it is proved that all the requirements are necessarily satisfied. This was achieved in less than 5 minutes.
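The list of violating domains follows mechanically from the requirement bounds: each bounded side of a requirement yields one subproblem whose domain is the complement on that side. A small sketch of that construction is given below; the `REQUIREMENTS` table and the function `violation_domains` are illustrative, and the step of actually proving each subproblem empty (done in the paper by enforcing global hull consistency) is not reproduced here.

```python
import math

# (lower bound, upper bound) required for each variable; None means no bound on that side.
REQUIREMENTS = {
    "ymax1": (None, 1.5),
    "ymax2": (None, 1.5),
    "ymin1": (0.8, None),
    "ymin2": (0.8, None),
    "yarea": (1.2, 1.3),
    "ytime": (None, 4.0),
}


def violation_domains(requirements):
    """List one (variable, domain) pair per way a single requirement can be violated."""
    problems = []
    for name, (lower, upper) in requirements.items():
        if upper is not None:
            problems.append((name, (upper, math.inf)))
        if lower is not None:
            problems.append((name, (-math.inf, lower)))
    return problems


problems = violation_domains(REQUIREMENTS)
for name, (low, high) in problems:
    print(f"D{name} = [{low} .. {high}]")
```

If every one of these seven subproblems is shown to have no solution, no trajectory with p1 in the chosen range can violate any requirement.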

5 Conclusion

This paper presents a framework to make decisions with deep biomedical models expressed by differential equations, with a constraint reasoning approach. In contrast to Monte Carlo and other stochastic techniques that can only assign likelihoods to the different decision options, and despite the uncertainty of medical information and approximation errors during calculations, the enhanced propagation techniques developed (enforcing global hull consistency) allow safe decisions to be made. Whereas the traditional use of complex differential models for which there are no analytical solutions is currently unsafe, the constraint reasoning framework extends the possibility of practical introduction of this type of model in medical decision making, especially when safe decisions are required.

References

1. Ackerman, E., Gatewood, L., Rosevar, J., Molnar, G.: Blood Glucose Regulation and Diabetes. In: Concepts and Models of Biomathematics, Marcel Dekker (1969) 131-156.

2. Benhamou, F., McAllester, D., Van Hentenryck, P.: CLP(Intervals) revisited. Logic Programming Symposium, MIT Press (1994) 124-131.

3. Cleary, J.G.: Logical Arithmetic. Future Computing Systems 2(2) (1987) 125-149.

4. Collavizza, H., Delobel, F., Rueher, M.: A Note on Partial Consistencies over Continuous Domains. Principles and Practice of Constraint Programming. Springer (1998) 147-161.

5. Cruz, J., Barahona, P., Benhamou, F.: Integrating Deep Biomedical Models into Medical Decision Support Systems: An Interval Constraints Approach. Procs. AI in Medicine and Medical Decision Making. Springer, Aalborg (1999) 185-194.

6. Cruz, J., Barahona, P.: Handling Differential Equations with Constraints for Decision Support. Frontiers of Combining Systems, Springer (2000) 105-120.

7. Cruz, J., Barahona, P.: Global Hull Consistency with Local Search for Continuous Constraint Solving. 10th Portuguese Conference on AI. Springer (2001) 349-362.

8. Cruz, J., Barahona, P.: Maintaining Global Hull Consistency with Local Search for Continuous CSPs. 1st International Workshop on Global Constrained Optimization and Constraint Satisfaction, Sophia-Antipolis, France (2002).

9. Cruz, J.: Constraint Reasoning for Differential Equations. PhD thesis, submitted (2003).

10. Sam-Haroud, D., Faltings, B.V.: Consistency Techniques for Continuous Constraints. Constraints 1(1,2) (1996) 85-118.

11. Janssen, M., Van Hentenryck, P., Deville, Y.: Optimal Pruning in Parametric Differential Equations. Principles and Practice of Constraint Programming. Springer (2001).

12. L’homme, O.: Consistency Techniques for Numeric CSPs. Proc. IJCAI (1993) 232-238.

13. Montanari, U.: Networks of Constraints: Fundamental Properties and Applications to Picture Processing. Information Science 7(2) (1974) 95-132.

14. Moore, R.E.: Interval Analysis. Prentice-Hall, Englewood Cliffs, NJ (1966).

15. Nedialkov, N.S.: Computing Rigorous Bounds on the Solution of an Initial Value Problem for an Ordinary Differential Equation. PhD thesis, Univ. of Toronto, Canada (1999).

16. Shampine, L.F.: Numerical Solution of Ordinary Differential Equations. New York, Chapman and Hall (1994).

17. Spitznagel, E.: Two-Compartment Pharmacokinetic Models. C-ODE-E. Harvey Mudd College, Claremont, CA (1992).

18. Yeargers, E.K., Shonkwiler, R.W., Herod, J.V.: An Introduction to the Mathematics of Biology: with Computer Algebra Models. Birkhäuser, Boston, USA (1996).


Interactive Decision Support for Medical Planning

David W. Glasspool, John Fox, Fortunato D. Castillo, and Victoria E.L. Monaghan

Advanced Computation Laboratory, Cancer Research UK, PO Box 123, London WC2A 3PX, United Kingdom.

{dg,jf,fc,vm}@acl.icnet.uk

Abstract. We describe a decision support system for treatment planning which provides immediate feedback of constraints, interactions and dependencies on and between treatment actions, and the possible outcomes of proposed plans.

1 Introduction

Many computer decision support systems support only a single, isolated decision - for example, what drug to prescribe, or whether to refer a patient to a specialist. Most decisions, however, are made in the context of extended plans of action. This is especially true in medicine where most actions are undertaken as part of a care plan (explicit or implicit) for a patient.

For example, consider the case of a woman diagnosed as carrying a mutation to the gene BRCA1 which predisposes to breast cancer. Such a woman may have a lifetime risk of developing breast cancer approaching 80%. Following identification of the genetic factor, the woman will typically be seen by a genetic counsellor who explains what the impact of this will be, and what options are available for mitigating the risk. A number of options are available, including prophylaxis with drugs such as Tamoxifen, screening (e.g. X-ray mammography) to detect tumours early in development, and prophylactic surgery to remove the ovaries (oophorectomy) or breasts (mastectomy).

There is no “correct” plan of action in this situation; the care plan arrived at by the counsellor and patient will reflect the individual needs and plans of the patient. For example, she may be planning to have children, and will need to avoid oophorectomy and drugs like Tamoxifen until there is no further chance of pregnancy. Other measures such as regular physical checks can control the overall risk level until more drastic action can be taken. Each decision involved in forming the plan is potentially influenced by previous decisions, and can influence later decisions.

Planning imposes a number of additional cognitive demands on the decision maker compared with making an isolated decision. A planner must:

a) Maintain the evolving plan in memory as it develops.

b) Identify which options for action are available at each step in the plan.

c) Decide which of these options should actually be taken at each step in the plan.

d) Keep track of constraints on, and dependencies between, planned actions.

e) Keep track of the effect of the plan as a whole with respect to its goals – for example, the level of reduction in risk over time, or the cost or resources required.


This paper describes a decision support system for medical planning, REACT (Risk, Events, Actions and their Consequences over Time), which is designed to target decision support at these areas of high cognitive load.

Scaife and Rogers [1] point out that good design of computer support for difficult tasks allows the user to offload some aspects of the task onto the user interface in order to reduce the cognitive load on the user. Scaife and Rogers refer to this offloading, which includes use of more mundane techniques such as drawings and diagrams or making notes, as external cognition. The guiding principle of REACT is to provide external cognition to support those aspects of planning which place particularly heavy cognitive demands on the planner. This has been done by providing five specific features of the REACT user interface, as described in the next section, which allow cognitively demanding aspects of planning to be offloaded by the user.

Huys et al. [2] show that thinking through a scenario or plan before carrying it out can significantly change decision making behaviour. REACT therefore aims to encourage the user to explore all options open to them during the formation of a plan, by making exploration of options as easy as possible, and providing copious feedback of likely consequences.

2 The REACT User Interface

REACT provides five distinct types of decision support, aimed at allowing the user to offload the five types of cognitive demand listed above during planning:

1. An interactive chart showing the current plan and the options for actions at each point. This allows the user to offload memory for the plan configuration, and identification of options for action (cognitive demands a and b).

2. Visual feedback of conflicts between planned actions and constraints on actions. This allows the user to externalise calculation of temporal constraints and dependencies as planning progresses (cognitive demand d).

3. Feedback of the predicted effect of the developing plan on outcome measures. This allows users to externalise prediction of the overall consequences of actions and plans (cognitive demand e).

4. Arguments: Symbolic (verbal) decision support applied separately to each planned action. This allows users to externalise formulation of the pros and cons of each planning decision (cognitive demand c).

5. Recommendations: Symbolic decision support applied to the entire plan. This provides overall decision support for plans, aggregating multiple arguments relevant to a particular configuration of planned events to provide advice to the planner (cognitive demand c).

The REACT user interface is divided into three main sections, the planning chart, the outcome measures area, and the argumentation area, as shown in fig. 1.

2.1 The Planning Chart

The planning chart has two functions: to allow the user to offload representation of the current state of the plan, and to offload the identification of possible options for action. The first of these functions is supported by allowing the user to “draw” plan elements with the mouse. Anticipated events and planned actions are arranged against a timeline to form a plan. A vertical line represents a single event (e.g. surgery), and a shaded region represents an extended period (e.g. a course of drug therapy). Plan elements can be rearranged at will. REACT re-calculates the projected effects of alterations while they are being made, giving immediate feedback on the consequences of modifications. To support the second function the planning chart separates different classes of action into different horizontal lanes, determined by the medical domain being worked in. The possibilities for action are thus made implicit.

Fig. 1. The REACT main user interface. A plan is being developed for risk mitigation in the domain of genetic predisposition to breast cancer by manipulating actions in the planning chart (against a timeline marked in years). The outcome measure graph indicates the estimated risk of death due to breast cancer, while arguments for and against the use of Tamoxifen for this patient are reviewed in the argumentation area.

2.2 Feedback of Planning Constraints

Immediate feedback of constraints and dependencies between actions or events in the plan allows the user to externalise representation and calculation of these factors. Events which violate constraints are highlighted visually on the display while they are being manipulated with the mouse.
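The kind of check behind this feedback can be illustrated with a very small example. The sketch below is not REACT's implementation: it assumes plan elements are intervals on the timeline and that a constraint simply forbids two named elements from overlapping (for instance Tamoxifen therapy and a planned pregnancy, following the example in the introduction). The class `PlanElement`, the function names and the ages used are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanElement:
    name: str
    start: float  # position on the timeline, in years
    end: float    # equal to start for a single event


def overlaps(a, b):
    """True if the two elements share at least one point on the timeline."""
    return a.start <= b.end and b.start <= a.end


def violations(plan, exclusions):
    """Return the names of elements that break a 'must not overlap' constraint."""
    by_name = {element.name: element for element in plan}
    flagged = set()
    for first, second in exclusions:
        if first in by_name and second in by_name:
            if overlaps(by_name[first], by_name[second]):
                flagged.update((first, second))
    return flagged


exclusions = [("tamoxifen", "pregnancy")]  # hypothetical domain constraint
plan = [PlanElement("pregnancy", 30.0, 31.0), PlanElement("tamoxifen", 30.5, 35.5)]
revised = [PlanElement("pregnancy", 30.0, 31.0), PlanElement("tamoxifen", 32.0, 37.0)]

print(violations(plan, exclusions))     # both elements would be highlighted
print(violations(revised, exclusions))  # nothing to highlight after moving the therapy
```

Re-running such a check on every mouse movement is what would allow conflicting elements to be highlighted while they are being dragged.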

2.3 Outcome Measures

The third type of support provided by REACT aims to allow the user to offload prediction of the consequences of actions and plans. Any overall consequence of a plan which may be given a numerical value (such as overall risk or reduction in risk to a patient, or overall cost or level of resources required) may be displayed as a graph beneath the planning chart, which is continually updated during interaction with the planning chart.

2.4 Symbolic Decision Support and Uncertainty Communication

The decision support engine in REACT is based on logical argumentation [3, 4, 5]. This approach formalises the idea that decisions are made on the basis of arguments for or against a claim. Arguments are formally defined in [5] as structures of the form:

(Claim, Grounds, Qualifier)

where Claim is the proposition that the argument refers to (for example, "this patient carries a gene predisposing to breast cancer"), Grounds provides the justification for the argument (for example, "three or more first degree relatives of this patient have contracted breast cancer"), and Qualifier indicates the force of the argument: whether it is an argument for or against the action, and how strong the argument is on a numeric or other scale.

REACT maintains a set of argument schemas related to the particular medical domain in which the tool is being used. The REACT decision support system keeps track, at each point in time in the proposed plan, of which arguments are valid. These arguments are used to construct a case for or against the decision to take a particular action, and can hence be used to provide knowledge-based decision support during planning. Arguments are used to provide all of the decision support types. Two types of logical decision support are based on text renderings of arguments, and appear in the REACT argumentation pane. The first of these, simply referred to as “arguments”, aims to allow the offloading of the determination of pros and cons of particular options during planning. This presentation of arguments (shown in figure 1) is intended to collate the pros and cons of each planned action, taking into account the action's context within the plan. There is reason to expect such a presentation to be more easily assimilated than, for example, probability values [6].
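The (Claim, Grounds, Qualifier) structure and the collation of pros and cons can be sketched as a small data structure. This is an illustration only, not the REACT engine: the qualifier is assumed here to be a direction (for or against) plus a numeric strength, the additive `net_support` score is one hypothetical way of aggregating, and the example arguments and strengths are invented for the demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    claim: str             # the proposition or action the argument refers to
    grounds: str           # the justification for the argument
    supports: bool         # qualifier, part 1: for (True) or against (False)
    strength: float = 1.0  # qualifier, part 2: assumed numeric scale


def collate(arguments, claim):
    """Split the arguments about one claim into pros and cons."""
    pros = [a for a in arguments if a.claim == claim and a.supports]
    cons = [a for a in arguments if a.claim == claim and not a.supports]
    return pros, cons


def net_support(arguments, claim):
    """Hypothetical aggregation: total strength of pros minus total strength of cons."""
    pros, cons = collate(arguments, claim)
    return sum(a.strength for a in pros) - sum(a.strength for a in cons)


CLAIM = "start Tamoxifen now"
arguments = [
    Argument(CLAIM, "prophylaxis reduces breast cancer risk", True, 1.0),
    Argument(CLAIM, "patient is planning to have children", False, 2.0),
    Argument("annual screening", "detects tumours early in development", True, 1.0),
]

pros, cons = collate(arguments, CLAIM)
print([a.grounds for a in pros], [a.grounds for a in cons])
print(net_support(arguments, CLAIM))
```

Presenting the grounds of the collated pros and cons as text, rather than only a score, corresponds to the "arguments" type of support described above.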

2.5 Recommendations

The second type of text-based decision support provided by REACT takes the form of recommendations based on collections of arguments. On the basis of the set of arguments which are valid at a particular point in a proposed plan, rules may recommend actions to the user.

3 Implementation

The implementation of REACT shown in figure 1 is a simplified prototype version. More recently a commercial-quality implementation has been produced which includes an extensible argumentation-based decision support system, along with a set of tools including an editor for building argumentation models for new domains. The domain editor comprises two components, a data definitions editor and an argument editor. The data definitions editor allows a collection of data items (or concepts) to be constructed, while the argument editor is used to construct logical arguments connecting these items. Detailed domain specifications for two medical application domains have been developed using this set of tools. The first provides a knowledge base for women with BRCA1 or BRCA2 gene mutations. These women are at greatly increased risk of breast and ovarian cancer. The REACT tool enables them to plan interventions and life events in the future. The second domain provides a knowledge base for people with type II diabetes. These people are at risk of multiple complications of their disease, including heart disease and stroke. The REACT tool enables the patients to plan future management of their disease, providing them with feedback in the form of arguments and graphs showing the risk of the complications over time.

Acknowledgements

This research was supported by awards L127251011 and L328253015 from the UK Economic and Social Research Council and Engineering and Physical Sciences Research Council. The implementation and domain definition work described in section 3 was carried out with funding from Cancer Research UK.

References

1. Scaife M. & Rogers Y. External cognition: How do graphical representations work? International Journal of Human-Computer Studies 45 (1996) 185-213

2. Huys, J. Evers-Kiebooms, G. & d’Ydwalle, G. Decision making in the context of genetic risk: The use of scenarios. Birth Defects: Original Article Series. 28 (1992) 17-20

3. Fox J, Krause P and Ambler S. Arguments, contradictions and practical reasoning. In Neumann B, ed. Proceedings of the 10th European Conference on AI, ECAI92, Vienna, Austria (1992) 623-7

4. Fox J. On the necessity of probability: Reasons to believe and grounds for doubt. In Wright G, Ayton, P, eds., Subjective Probability. Chichester: John Wiley (1994)

5. Fox J and Das S. Safe and Sound: Artificial intelligence in hazardous applications. Cambridge, Mass.: MIT press (2000)

6. Ranyard, R, Crozier, WR, Svenson, O, eds. Decision Making: Cognitive Models and Explanations. Chapters 7, 8 and 9. Routledge: London and New York (1997)

Compliance with the Hyperlipidaemia Consensus: Clinicians versus the Computer

Wouter P. van Rijsinge1, Linda C. van der Gaag1, Frank Visseren2, and Yolanda van der Graaf3

1 Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

2 Department of Internal and Vascular Medicine, University Medical Center Utrecht, Heidelberglaan 100, 3584 CX Utrecht, The Netherlands

3 Department of Clinical Epidemiology, University Medical Center Utrecht, P.O. Box 85.500, 3508 GA Utrecht, The Netherlands

Abstract. The hyperlipidaemia consensus was designed to provide treatment recommendations for patients suffering from increased blood lipoprotein levels. To allow for studying compliance with the consensus, we constructed a classification tree having the same functionality. Given the medical records of 1328 patients, we compared the treatment recommendations from a multidisciplinary team of clinicians against the thus automated consensus. The compliance study revealed various discrepancies between the consensus and the clinicians, some of which could be attributed to possible inaccuracies in the consensus.

1 Introduction

Over the last decades, an increasing number of guidelines have become available to support clinicians in their management of patients. Since the use of such guidelines serves to standardise the delivery of patient care and is generally expected to enhance its quality, researchers are seeking to automate them for online consultation [1]. Guidelines can differ considerably in the way they are presented and in the amount of detail and support they offer. They may, for example, be written in natural language, describing in an informal way how patients should be managed. While such guidelines leave considerable room for judgement for the clinicians who apply them, they often suffer from the incompleteness and ambiguity inherent to natural language. Research efforts are focused on formal analysis of such guidelines, with the aim of enhancing them [2]. Other guidelines may be more rigorously stated, capturing patient management in classification trees or decision diagrams. Such guidelines are less flexible to apply, yet also suffer less from anomalies. Research efforts with respect to these more rigorous guidelines are typically centered on the issue of (non-)compliance [3].

In the Netherlands, various guidelines have been designed that are aimed at general practitioners. One of these is the hyperlipidaemia consensus for treatment recommendation for patients suffering from increased blood lipoprotein levels [4]. This consensus is quite rigorously stated, allowing for relatively easy implementation. To provide for investigating compliance, we constructed a classification tree from the consensus, having the same functionality. Using this tree, we studied a collection of patient data available from a clinical study in vascular disease. For each patient, we computed a treatment recommendation from the consensus and compared it against the advice given by a multidisciplinary team of clinicians. We found that for some specific categories of patients, the clinicians often presented other recommendations than the ones computed from the consensus. We examined the results obtained for these categories of patients in detail and identified various different sources for the discrepancies, some of which could be attributed to possible inaccuracies in the consensus.

The paper is organised as follows. In Sect. 2, we briefly describe the hyperlipidaemia consensus. In Sect. 3, we comment on its implementation as a classification tree and outline the design of our compliance study. In Sect. 4, the results from our study are presented; the results are reviewed in Sect. 5. The paper ends with our concluding observations in Sect. 6.

2 The Hyperlipidaemia Consensus

The hyperlipidaemia consensus is aimed at general practitioners and serves to provide treatment recommendations for patients suffering from increased blood lipoprotein levels [4]. Increased levels of lipoprotein constitute an important risk factor for vascular disease. Since lowering these levels is associated with a substantial decrease in mid-term risk of coronary incidents, early identification and treatment are of primary importance. Hyperlipidaemia is typically confirmed by measurement of the levels of total cholesterol, triglycerides, LDL and HDL in serum. Different types of treatment are available, among which is medication.

The hyperlipidaemia consensus can be looked upon as composed of three parts. In the pre-selection part, two categories of patients are identified. Patients with a total cholesterol level of more than 8 mmol/l, a triglyceride level higher than 4 mmol/l, or an HDL level of less than 0.6 mmol/l, are referred to a specialist in internal and vascular medicine. Also, patients associated with a relatively short life expectancy are identified; these patients will not be managed explicitly. The secondary-prevention part of the consensus pertains to patients with manifest vascular disease. If these patients have neither a total cholesterol level higher than 5 mmol/l nor an LDL level of more than 3.2 mmol/l, there is no indication for treatment. If one of these levels is increased, however, drug treatment, for example with statins, is indicated; if the patient is already on medication, the current treatment regime should be reviewed. The primary-prevention part, to conclude, pertains to patients with diabetes or hypertension. Based upon their total cholesterol/HDL ratio, among other factors, their absolute risk of a coronary incident within the next ten years is assessed, which is then taken to decide whether or not treatment is indicated.

The hyperlipidaemia consensus provides one or more recommendations from among five possible alternatives. Although it may in essence select more than one alternative, for most patients it returns a single recommendation.


3 The Set-up of the Compliance Study

For the purpose of our compliance study, we automated the hyperlipidaemia consensus. Since the consensus could in essence be interpreted as an algorithm for classifying patients into different categories associated with different treatment recommendations, we chose to represent the consensus by a classification tree. The tree currently consists of 23 internal nodes, representing lipoprotein levels and other patient specifics, and 26 leaves, representing treatment recommendations.
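As an illustration only, the pre-selection and secondary-prevention parts described in Sect. 2 can be sketched as a small decision procedure. The sketch below is not the 23-node tree used in the study: the field names, the order of the pre-selection tests, and the recommendation strings are assumptions, and the primary-prevention risk assessment, which the text does not specify, is left as a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All lipid levels in mmol/l; field names are hypothetical.
    total_cholesterol: float
    triglycerides: float
    hdl: float
    ldl: float
    short_life_expectancy: bool
    manifest_vascular_disease: bool
    diabetes_or_hypertension: bool
    on_medication: bool


def recommend(p: Patient) -> str:
    """Return one recommendation, following the thresholds quoted in Sect. 2."""
    # Pre-selection: extreme lipid levels lead to referral.
    if p.total_cholesterol > 8 or p.triglycerides > 4 or p.hdl < 0.6:
        return "refer to specialist"
    # Pre-selection: short life expectancy, no explicit management.
    if p.short_life_expectancy:
        return "no explicit management"
    # Secondary prevention: patients with manifest vascular disease.
    if p.manifest_vascular_disease:
        if p.total_cholesterol > 5 or p.ldl > 3.2:
            if p.on_medication:
                return "review current treatment"
            return "start drug treatment"
        return "no treatment indicated"
    # Primary prevention: the risk computation is not given in the text.
    if p.diabetes_or_hypertension:
        return "assess ten-year coronary risk"
    return "outside the scope of this sketch"
```

A patient with manifest vascular disease, a total cholesterol level of 4.9 mmol/l and an LDL level of 3.19 mmol/l falls just under both thresholds, so this sketch returns "no treatment indicated"; this is the boundary situation examined in Sect. 5.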

For our study, we had the medical records of over 3000 patients at our disposal. These data had been collected at the University Medical Center Utrecht (UMCU) for the SMART (Second Manifestations of ARTerial disease) study [5]; included were all patients between 18 and 79 years of age who were referred to the UMCU since 1996, for vascular disease or with an increased risk of vascular disease. For each patient, a large number of data had been recorded, including various blood measurements. These data had been reviewed by a multidisciplinary team of clinicians, consisting of an internist, a cardiologist, a vascular surgeon, and a nurse practitioner; if necessary, the team had been extended by a neurologist. Supported by the consensus, this SMART team had constructed, for each patient, a treatment recommendation, which had been included in the patient’s record; the team had been allowed to enter multiple recommendations.

From the SMART data collection, we selected the records of all patients who had been included between 1998 and 2001; the records from before 1998 were considered inappropriate for our compliance study, because the hyperlipidaemia consensus had changed since then. From the selected records, we further excluded patients who had been referred to the UMCU for treatment of hyperlipidaemia. Since the hyperlipidaemia consensus is aimed at general practitioners, it does not apply to patients who are referred to a specialist in internal and vascular medicine. All in all, we selected the medical records of a total of 1328 patients.

In our compliance study, we computed, for each patient, a single treatment recommendation from the hyperlipidaemia consensus, by means of the classification tree. The computed recommendation was subsequently compared against the primary recommendation that had been given by the SMART team.

4 Results

From the total of 1328 patients, the consensus recommended referral to a specialist in internal and vascular medicine for 132; for 191 patients, the consensus concluded, based upon their life expectancy, that explicit management was not indicated. From the 1005 patients remaining for further analysis, the consensus identified 592 patients as manifesting vascular disease; these patients were entered into the secondary-prevention part of the consensus. The other 413 patients, with diabetes or hypertension, were entered into the primary-prevention part.

Upon comparing treatment recommendations, we found that the alternative selected by the SMART team differed from the recommendation yielded by the consensus for some 10% of the patients, that is, the team did not comply with the consensus for one out of every ten patients. As an example category of patients for which we found a considerable number of discrepancies, we briefly review the results obtained for the 592 patients with manifest vascular disease. For these patients, the recommendation of primary interest is whether or not to start, or adapt, drug treatment. Table 1 summarises the results with respect to this recommendation. The columns indicate the numbers of patients per recommendation by the SMART team, while the rows indicate these numbers for the consensus; the column labelled unknown lists the numbers of patients for whom the SMART team did not provide an explicit recommendation with respect to medication. The SMART team did not offer information about medication for 349 patients, or 59%. For 48 of the remaining 243 patients, or 20%, the SMART team and the consensus explicitly disagreed about drug treatment. When absence of an explicit recommendation by the SMART team is construed as a recommendation not to start with medication, discrepancies are found for 17% of the patients.

Table 1. The results for the 592 patients entered into the secondary-prevention part of the consensus, pertaining to the recommendation to start, or adapt, drug treatment

                     SMART team:
                     yes     no    unknown    total
Consensus:   yes     180     27       53       260
             no       21     15      296       332
             total   201     42      349       592
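The percentages quoted for the secondary-prevention group follow directly from the counts in Table 1. The short Python check below recomputes them; the variable names are illustrative, and the code is only a worked calculation, not part of the study's tooling.

```python
# Counts from Table 1, keyed by (consensus, SMART team) recommendation.
counts = {
    ("yes", "yes"): 180, ("yes", "no"): 27, ("yes", "unknown"): 53,
    ("no", "yes"): 21, ("no", "no"): 15, ("no", "unknown"): 296,
}

total = sum(counts.values())                                   # 592
unknown = sum(n for (_, team), n in counts.items()
              if team == "unknown")                            # 349
explicit = total - unknown                                     # 243

# Explicit disagreement: both sides gave a recommendation and they differ.
explicit_disagree = counts[("yes", "no")] + counts[("no", "yes")]   # 48

# Reading "unknown" as a SMART-team "no", count every mismatch.
implied_disagree = sum(
    n for (consensus, team), n in counts.items()
    if consensus != ("yes" if team == "yes" else "no")
)                                                              # 101


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print(pct(unknown, total))               # 59
print(pct(explicit_disagree, explicit))  # 20
print(pct(implied_disagree, total))      # 17
```

The 17% figure thus corresponds to 101 of the 592 patients: the 27 and 53 patients for whom only the consensus indicated drug treatment, plus the 21 patients for whom only the SMART team did.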

5 Discussion

Upon analysing the results obtained from our compliance study, we noticed that, while for some categories of patients there were hardly any discordances between the SMART team and the consensus, for other categories there were large numbers of discrepancies. The diverging pattern of discrepancy suggested that the SMART team might have had valid reasons for not complying with the consensus. To identify possible causes for the discrepancies for the various categories, sample patients from these categories were put to a specialist in internal and vascular medicine. We briefly review one of the causes of non-compliance that we thus found.

For 21 of the 592 patients with manifest vascular disease, the SMART team recommended starting with, or adapting, medication, while the consensus advised against drug treatment, as shown in Table 1. We observed that the consensus had based its recommendation on the levels of total cholesterol and LDL in these patients: it states that drug treatment is indicated if a patient’s total cholesterol level is higher than 5 mmol/l or his LDL level is over 3.2 mmol/l. For the 21 patients considered, we found that their total cholesterol levels were just below 5 mmol/l and their LDL levels were between 1.37 mmol/l and 3.19 mmol/l.


With these lipoprotein levels, these patients clearly were on the decision boundary of the consensus. We further found that most of these patients were already being treated with medication and suffered from a triglyceride problem. The SMART team apparently had identified these patients as boundary cases and had taken the additional triglyceride problem or the current treatment regime into account upon constructing their recommendation to start, or adapt, medication. They had thereby explicitly deviated from the recommendation suggested by the consensus.

6 Concluding Observations

We automated the hyperlipidaemia consensus for treatment recommendation for patients suffering from increased blood lipoprotein levels, by constructing a classification tree with the same functionality. The use of a classification tree for this purpose proved to be feasible, owing to the crispness and quality of the consensus. We studied compliance with the consensus by comparing the treatment recommendations available from a multidisciplinary team of clinicians against the automated protocol, for 1328 patients. We found that for some categories of patients especially, the clinicians often presented other recommendations than the ones computed from the consensus. We studied the results obtained for these categories of patients in detail and found that some of the discrepancies could be attributed to possible inaccuracies in the consensus. We are currently investigating how the consensus can be enhanced by the results from our study.

References

1. S. Quaglini, L. Dazzi, L. Gatti, M. Stefanelli, C. Fassino, and C. Tondini (1998). Supporting tools for guideline development and dissemination. Artificial Intelligence in Medicine, vol. 14, pp. 119 – 137.

2. M. Marcos, G. Berger, F. van Harmelen, A. ten Teije, H. Roomans, and S. Miksch (2001). Using critiquing for improving medical protocols: harder than it seems. In: S. Quaglini, P. Barahona, and S. Andreassen (editors). Artificial Intelligence in Medicine, Lecture Notes in Artificial Intelligence 2101, Springer, Berlin, pp. 431 – 441.

3. B. Seroussi, J. Bouaud, and E.-C. Antoine (1999). Enhancing clinical practice guideline compliance by involving physicians in the decision process. In: W. Horn, Y. Shahar, G. Lindberg, S. Andreassen, and J. Wyatt (editors). Artificial Intelligence in Medicine, Lecture Notes in Artificial Intelligence 1620, Springer, Berlin, pp. 76 – 85.

4. Treatment and Prevention of Coronary Heart Disease by Lowering the Plasma Cholesterol Concentration (in Dutch). CBO, Utrecht, 1998.

5. P.C.G. Simons, A. Algra, M.F. van der Laak, D.E. Grobbee, and Y. van der Graaf (1999). Second Manifestations of ARTerial disease (SMART) study: Rationale and design. European Journal of Epidemiology, vol. 15, pp. 773 – 781.


WoundCare: A Palm Pilot-Based Expert System for the Treatment of Pressure Ulcers

Douglas D. Dankel1, Mark Connor2, and Zulma Chardon3

1 CISE, Box 116120, University of Florida, Gainesville, FL 32611-6120, USA [email protected]

2 Atlantic Net, 2815 NW 13th St., Gainesville, FL 32605, USA [email protected]

3 B21 SHCC, University of Florida, Gainesville, FL 32611, USA [email protected]

Abstract. This paper presents preliminary results on the development of a Palm Pilot-based medical expert system for the treatment of pressure ulcers. Its deployment on a Palm Pilot allows it to be easily carried to the patient’s location, where it can collect information about a patient’s pressure ulcers (i.e., bed sores) and provide immediate and accurate recommendations for the treatment of those wounds. Using knowledge acquired from a wound care specialist, this expert system provides recommendations similar to those provided by a medical expert.

1 Introduction

Since their inception, expert systems have been developed in a variety of domains. Some of the earliest applications were in the fields of organic chemistry and medical diagnosis [1]. While many systems were developed for desktop and mainframe computers, their widespread use was restricted by cost or inconvenience. Professions such as medicine, one of the first application domains, are highly mobile in nature, with physicians and nurses traveling from room to room. Stationary expert systems cannot travel with these caregivers, making their use inconvenient and unpleasant.

Recent advances in computer technologies have given rise to the concept of mobile computing. Powerful handheld computers, performing at speeds comparable to lower-end desktop systems, have widespread acceptance among consumers and professionals due to their convenience and low cost.

This paper describes WoundCare, a prototype expert system providing advice on the treatment of stage 1 pressure ulcers (i.e., bed sores). Its deployment on a Palm Pilot allows collecting information about the patient’s wounds at the patient’s location and providing immediate and accurate treatment recommendations.

Section 2 provides some background on the use of Palm Pilots in the medical domain and expert system shells to assist in the development of decision support software. In Section 3 we provide a brief introduction to WoundCare with a walkthrough of the system in Section 4. Section 5 provides information on the system’s knowledge representation. Finally, extensions and further research are presented in Section 6.


2 Other Research

A Price Waterhouse Coopers 2001 survey found that 60% of the respondents reported that physicians in their organization used PDAs (personal digital assistants), compared with only 26% in 2000 [2]. This increase can in part be attributed to the wide range of available medical applications developed for PDAs [3, 4] and the need to reduce medical errors [5, 6, 7]. These applications range from medical reference texts (e.g., 5mID and PDRDrugs by skyscape.com, and Clinical Cardiology by Pacific Primary Care) providing work-ups, treatments, and differentials in almost every medical specialty to patient tracking software (e.g., GuineaTRACKER by guinea-soft.net) allowing practitioners to store notes about their patients.

One of the most popular medical programs is ePocrates Rx (epocrates.com), a clinical drug database, providing information on over 2,800 drugs, which “saves time during drug-related decision making, is easily incorporated into their usual workflow, and improves drug-related decision making” [8, p. 223]. Other applications include topsE&M (e-mds.com), which assists in determining appropriate evaluation and management billing codes based on the 1997 HCFA guidelines; ClaimTrack (akosys-tems.com), which assists in tracking expenses, insurance claims, and reimbursements; and MediCalc (smallsyssoft.com), which performs basic medical calculations.

In contrast, there are very few expert system shells currently available for PDAs, and those that are available are very restrictive and not useful for most applications. Kex is a simple shell developed by Danny Ayers [9]. In this system, facts are statements in the form of a string. Rules are structured as IF-THEN statements, where the condition is satisfied if there is a matching fact (or facts for compound conditions) and conclusions are simply facts that are asserted if the condition is satisfied. While Kex is successful in providing a basic expert system’s structure, it is restrictive and impractical for all but the simplest of applications.
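The rule model described for Kex, with string facts and IF-THEN rules whose conclusions are asserted once every condition matches a known fact, can be illustrated in a few lines of Python. This is a generic forward-chaining sketch and not Kex's actual code or API; the function name and the example facts are hypothetical.

```python
def forward_chain(facts, rules):
    """Assert rule conclusions until no rule adds a new fact.

    facts: iterable of strings.
    rules: list of (conditions, conclusions) pairs, each a list of strings.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusions in rules:
            satisfied = all(c in known for c in conditions)
            if satisfied and not set(conclusions) <= known:
                known.update(conclusions)
                changed = True
    return known


# Hypothetical example: one compound condition, one chained rule.
rules = [
    (["patient is bedridden", "no support surface"], ["pressure risk"]),
    (["pressure risk"], ["consider a support surface"]),
]
result = forward_chain(["patient is bedridden", "no support surface"], rules)
```

Because conditions are matched only by string equality, a rule cannot compare values or bind variables, which is one way to read the remark that such a shell suits only the simplest applications.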

PicoXpert is a decision-tree-based expert system, where knowledge is structured in a hierarchical manner providing a path of knowledge analysis that follows a strict order from start to finish. PicoXpert provides a nice user interface, but is highly restrictive and, therefore, not useful in most applications [10].

3 Introduction to WoundCare

WoundCare is a Palm Pilot-based medical expert system to assist healthcare providers in the treatment of pressure ulcers, also known as bedsores. A pressure ulcer is “any lesion caused by unrelieved pressure resulting in damage of underlying tissue” [11, p. 1]. Pressure ulcers occur in home settings as well as many healthcare facilities due to a patient’s immobility and inadequate or inappropriate support of the patient, with treatment often requiring expert medical knowledge.

WoundCare assists healthcare providers in treating pressure ulcers by analyzing data collected about a patient and providing recommendations for treatment. Additionally, WoundCare collects and organizes patient information, including medications, allergies, medical history, surgical history, and wound treatment history. The next section describes WoundCare and its use.


4 WoundCare’s Operation

WoundCare guides the caregiver through the process of collecting information about a patient and generates recommendations for treatment. WoundCare starts by presenting a list of patients. This list displays each patient’s full name, gender, and race. From this screen, the caregiver selects a patient record to view or modify, creates a new record, or deletes a record.

When the caregiver selects a patient from the Patient List screen, the first Patient Information screen is displayed. This screen is used to collect general information about the patient including the patient’s name, identification number, date of birth, gender, race, smoking habits, and alcohol use. From this screen, the caregiver may continue to the next screen or return to the Patient List. On the second Patient Information screen, the patient’s medications and allergies are displayed for review or modification. The third and final Patient Information screen displays the patient’s medical and surgical history. Again, the caregiver may view, modify, create, or delete entries.

The Visit History screen displays a list of previous visit records for the current patient. Each time a patient is treated, a new visit record is created. Each record stores the patient’s general health and wound assessments performed on that visit. The Visit History screen allows viewing a previous visit by selecting a date in the list or creating a new visit record.

The Patient Assessment screens collect information about a patient’s general health at the time of the visit. This includes the patient’s mental and behavioral status, nutrition, lifestyle, mobility, etc. WoundCare uses this information to generate treatment recommendations; therefore, changes to the data collected on these screens affect the recommendations of each wound record associated with the visit. From these screens, the caregiver may return to the Visit History screen or proceed to the Wound List screen.

The Wound List screen displays all wounds assessed during a visit. Wounds are listed by site and stage for easy identification.

The Wound Assessment screen collects information about a wound. The caregiver selects the wound site using a graphical display of a body. To select the wound site, the caregiver taps on the appropriate area of the image. Selecting a narrow region of the body is difficult due to the small screen size of Palm devices; therefore, selecting any portion of the image causes the magnification of the selected region. The caregiver selects the wound site by tapping on the appropriate location of this magnified image.

After selecting the wound site, the caregiver identifies the stage of the wound and its dimensions. Tapping on the “Next” button initiates diagnosis using the knowledge contained in the knowledge base and the collected patient information to generate recommendations for treatment. During execution, the system prompts for any additional required information.

When WoundCare completes its analysis, the Recommendations screen is displayed, listing the generated recommendations in order from highest priority to lowest priority. This screen also allows the user to view the facts retrieved from the knowledge base.


During the operation of the system, caregivers are not permitted to change information collected on previous visits. This restriction is imposed for security and recordkeeping purposes. They may create new visit records or modify records created during the current session, but may not modify any records created during a previous session.

5 Knowledge Representation

The knowledge used by the inference engine is stored in a Palm database. To ease identification, each piece of knowledge is stored as a separate record. To distinguish the pieces of knowledge, each is prefixed with a unique character. Facts are prefixed with the letter “f,” rules with the letter “r,” and questions with the letter “q.” See Table 1. Facts are expressed as attribute-value pairs. The table shows a fact representing that maceration exists for this patient. Rules have a unique identification number (<id>), a priority (ranging from –127 to 128), a repeat/no repeat flag, a condition to be satisfied (which can be compound), and a set of actions to perform if executed. The actions can be to AddFact, RemoveFact, AskQuestion, or Recommend. The table shows Rule 4, which has a priority of 0, is not repeated, checks for the condition “Mobility = Bedridden AND Support Surface = None,” and makes a recommendation of “Consider a support surface” having a priority of 0. Questions have a unique identification number (<id>), a format for display (<type>), a required/not required flag, the question to be asked, and answers that could be given. Associated with each answer is a number identifying the answer’s group. At most one answer can exist for each answer group. For example, the table shows Question 3, which is displayed as a checkbox, is required, asks “Is there maceration at wound site?,” and has two possible answers in group 1: “Yes” (deriving the fact “Maceration=Yes”) and “No” (deriving the fact “Maceration=No”).

A knowledge base is constructed by entering each piece of knowledge into a text file on a desktop computer. Each line of the text file contains a single piece of knowledge, with the knowledge structure in the text file being very similar to the structure in the Palm database. These structures are not optimized for performance; instead, they are organized to be understandable by the developer.

Table 1. Sample WoundCare Knowledge

Knowledge: fact
Format:    f|<fact>
Example:   f|Maceration=Yes

Knowledge: rule
Format:    r|<id>|<priority>|[no ]repeat|<condition>(|<action>)+
Example:   r|4|0|no repeat|and = Mobility Bedridden = "Support Surface" None|Recommend(0, "Consider support surface")

Knowledge: question
Format:    q|<id>|<type>|[not ]required|<question>(|<answer>)+
Example:   q|3|checkbox|required|Is there maceration at wound site?|Yes Maceration=Yes 1|No Maceration=No 1
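To make the record formats of Table 1 concrete, the following Python sketch parses the three sample records and evaluates the rule's prefix-notation condition against a set of facts. It is a desktop illustration under stated assumptions, not WoundCare's Palm implementation: the function names and dictionary keys are hypothetical, fields are assumed never to contain the "|" character, only the "=" and "and" operators seen in the sample rule are handled, and each answer is assumed to end in a space-free fact followed by its group number.

```python
import shlex


def parse_condition(tokens):
    """Parse one prefix-notation condition, consuming tokens from the front."""
    op = tokens.pop(0)
    if op == "=":
        attribute, value = tokens.pop(0), tokens.pop(0)
        return ("=", attribute, value)
    if op == "and":
        return ("and", parse_condition(tokens), parse_condition(tokens))
    raise ValueError(f"unsupported operator: {op!r}")


def holds(condition, facts):
    """Check a parsed condition against a dict of attribute -> value."""
    if condition[0] == "=":
        _, attribute, value = condition
        return facts.get(attribute) == value
    _, left, right = condition
    return holds(left, facts) and holds(right, facts)


def parse_record(line):
    """Turn one knowledge-base line into a dictionary."""
    fields = line.split("|")
    kind = fields[0]
    if kind == "f":
        attribute, value = fields[1].split("=", 1)
        return {"kind": "fact", "attribute": attribute, "value": value}
    if kind == "r":
        return {"kind": "rule", "id": int(fields[1]),
                "priority": int(fields[2]),
                "repeat": fields[3] == "repeat",
                # shlex keeps the quoted "Support Surface" as one token.
                "condition": parse_condition(shlex.split(fields[4])),
                "actions": fields[5:]}
    if kind == "q":
        answers = []
        for answer in fields[5:]:
            label, fact, group = answer.rsplit(" ", 2)
            answers.append({"label": label, "fact": fact, "group": int(group)})
        return {"kind": "question", "id": int(fields[1]), "type": fields[2],
                "required": fields[3] == "required",
                "text": fields[4], "answers": answers}
    raise ValueError(f"unknown record prefix: {kind!r}")


rule = parse_record('r|4|0|no repeat|and = Mobility Bedridden = '
                    '"Support Surface" None|Recommend(0, "Consider support surface")')
facts = {"Mobility": "Bedridden", "Support Surface": "None"}
fires = holds(rule["condition"], facts)  # condition of Rule 4 is satisfied
```

Splitting on "|" first and tokenizing the condition second mirrors the two levels of structure in the format: record fields on the outside, a small prefix expression inside the condition field.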


6 Conclusions

WoundCare is a stand-alone Palm OS expert system that collects and stores information about a patient and provides treatment recommendations. It is currently limited to treatment of stage one pressure ulcers, the simplest form of pressure ulcers. These wounds are defined by nonblanchable erythema of intact skin, with treatment involving determining the cause and preventing future deterioration of the skin [11].

Currently, the system does not support data transfers or sharing. All patient information is stored on the Palm with no integration with a centralized patient database. Integration with a central database would greatly enhance the application’s performance and functionality.

Currently, WoundCare treats each wound independently, while human experts adjust treatment strategies based on the presence and attributes of multiple wounds. To enhance the problem-solving capabilities of WoundCare, relationships should be developed between multiple wounds assessed on the same visit and between wounds occurring on multiple visits to track the treatment plan’s success or failure. Because of these limitations, WoundCare is currently a research tool and has not been deployed for general use.

References

1. Center for Advanced Medical Informatics at Stanford. Historical Projects, 1995. http://smi-web.stanford.edu/research/history.html

2. Gillingham, W., Holt, A., Gillies, J.: Hand-held Computers in Healthcare: What Software Programs are Available? N. Z. Med. J. (2002) Sep. 27 115 (1162) U185

3. Keplar, K. E., Urbanski, C. J.: Personal Digital Assistant Applications for the Healthcare Provider. Ann. Pharmacother. 37 (2003) 287-296

4. Fischer, S., Stewart, T. E., Mehta, S., Wax, R., Lapinsky, S. E.: Handheld Computing in Medicine. J. Am. Med. Inform. Assoc. 10 (2003) 139-149

5. Chen, E. S., Cimino, J. J.: Use of Wireless Technology for Reducing Medical Errors. J. Am. Med. Inform. Assoc. 9 (2002) S54-S55

6. Roth, A. C., Leon, M. A., Milner, S. M., Herting, R. L. Jr., Hahn, A. W.: A Personal Digital Assistant for Determination of Fluid Needs for Burn Patients. Biomed. Sci. Instrum. 34 (1997) 186-190

7. Bates, D. W., Cohen, M., Leape, L. L., Overhage, J. M., Shabot, M. M., Sheridan, T.: Reducing the Frequency of Errors in Medicine Using Information Technology. J. Am. Med. Inform. Assoc. 8 (2001) 299-308

8. Rothschild, J. M., Lee, T. H., Bae, T., Bates, D. W.: Clinician Use of a Palmtop Drug Reference Guide. J. Am. Med. Inform. Assoc. 9 (2002) 223-229

9. Gonzalez, A. J., Dankel, D. D.: The Engineering of Knowledge-Based Systems: Theory and Practice. Prentice Hall, Englewood Cliffs (1993)

10. Ayers, D.: Kex: KVM Expert System. http://www.isacat.net/2001/code/kex.htm

11. Bergstrom, N., Allman, R., Alvarez, O., Bennett, M., Carlson, C., Frantz, R., Garber, S., Jackson, B., Kaminski, M. Jr., Kemp, M., Krouskop, T., Lewis, V. Jr., Maklebust, J., Margolis, D., Marvel, E., Reger, S., Rodeheaver, G., Salcido, R., Xakellis, G., Yarkony, G.: Treatment of Pressure Ulcers. Clinical Practice Guideline, No. 15. AHCPR Publication No. 95-0652, Rockville (1994)


VIE-DIAB: A Support Program for Telemedical Glycaemic Control

Christian Popow1, Werner Horn2,3, Birgit Rami1, and Edith Schober1

1 Dept. Pediatrics and Adolescent Medicine, University of Vienna, A-1090 Wien, Währinger Gürtel 18-20, Austria [email protected]

2 Dept. Medical Cybernetics and Artificial Intelligence, University of Vienna, Austria

3 Austrian Research Institute for Artificial Intelligence (ÖFAI), Vienna, Austria

Abstract. Ambulatory care supporting long-term treatment of type I diabetes mellitus (DM) is based on the analysis of daily notes of serum glucose measurements, carbohydrate intake, and insulin dosage. In order to improve glycaemic control, telemedicine support aims at improving the communication between patients and diabetologists. Patient data are collected using mobile phone services. Weekly responses from the diabetes care center aim at helping the patient to optimize glycaemic control. The telemedical support system VIE-DIAB integrates data collection, visualization, and recommendation handling by using mobile phone and internet services. Its core is a module visualizing a summary of the patient’s diary. Data are displayed using 4x7 multiples that represent the serum glucose values of 28 days on one page.

1 Introduction

Ambulatory care of patients with type I diabetes mellitus (DM) is based on careful teaching about glycaemic control as related to serum glucose monitoring, diet and physical activity. In addition, monitoring of glycosylated haemoglobin levels (HbA1c, percent glycosylated over total haemoglobin) provides information about medium term (one month) glycaemic control. Ambulatory care is usually scheduled every 3-6 months. More frequent consultations could possibly improve diabetic control but this is usually not considered feasible from the point of view of both patients and diabetologists. One possibility to improve the communication between chronically ill patients and ambulatory care centers without unduly increasing the time spent at the hospital and the clinical workload has been brought about by telemedicine, the use of remote communication between patients and a health care center.

Telemedicine has been shown to improve glycemic control in adult patients with type I DM. Montori and Smith [1] found a significant decrease of glycosylated hemoglobin (HbA1c) in those patients who followed the recommendations given to the primary care provider by a diabetologist. Bellazzi et al. [2], in their type I DM telecare project, found a similar HbA1c decrease of 1.23% and a significant reduction of insulin requirements.

Telemedicine, however, has not gained widespread use, possibly because the workload associated with regularly typing in the data and supervising glycemic control is still too high. We therefore speculated that the workload could be reduced on the patient’s side by using mobile phone or handheld computer technology for automated data transmission, and on the caretaker’s side by preprocessing the data with automated data visualization and artificial intelligence (AI) techniques. Telecommunication support has proven feasible in interested adults [3-5]. Temporal data abstraction [6,7] provides a good means of reducing the complexity of collected data.

Based on our experience with the graphical handling of complex data in intensive care [8], we designed a telemedical support system with the following aims: to develop and test the necessary technical prerequisites, to evaluate the feasibility of such a telemedicine program in adolescent patients, to minimize the additional workload for physicians by creating an intuitive data display and, if possible, to prove that glycemic control can be improved at a reasonable and tolerable workload. The present paper gives an overview of the system architecture of our telemedical diabetes care support program, VIE-DIAB.

2 VIE-DIAB’s System Architecture

Following the intention of intensifying the communication between patients and physician, VIE-DIAB integrates three major functions:

• Continuously collecting the patients’ data that are submitted using mobile phone services (SMS or WAP). Each time a glucose measurement is taken, an insulin dosage is delivered, or carbohydrates are taken in, the patient submits one set of data including all this information. The submitted data are basically identical to those recorded in the commonly used diary;

• Visualising the patients’ data for quick analysis by the physician;

• Feedback from the physician to the patient, usually once weekly.

To support these three functions VIE-DIAB is built as a server-client application. The server collects all the data and keeps a history of all recommendations given by the physician. The client supports patient data management, data visualization and recommendation processing. The client is a Java application (and applet) with a multilingual interface in German and English. It uses a multithreaded client-server communication for data exchange.

2.1 Data Collection

VIE-DIAB collects glycaemic control data sets received from the patient in a telemedicine environment. Data are submitted using mobile phone services (WAP / GPRS). Because glucose measurements, insulin delivery, and carbohydrate intake usually occur at approximately the same time, all this information is included within one data set. Each data set is structured like one line of a diabetes diary: date and time, serum glucose value, insulin dosage (basic, bolus and correction dosage), carbohydrate intake, and event notes (flags for physical activity, postprandial measurement, acute infectious disease or signs of hypoglycaemia).
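The structure of one submitted data set can be pictured with a small record type. The following Python sketch is illustrative only: the class and field names, the units, and the choice of Python (the actual client is a Java application) are assumptions and are not taken from the VIE-DIAB implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Flag, auto
from typing import Optional


class Event(Flag):
    """Event notes of a diary line (flag names are hypothetical)."""
    NONE = 0
    PHYSICAL_ACTIVITY = auto()
    POSTPRANDIAL = auto()
    ACUTE_INFECTION = auto()
    HYPOGLYCAEMIA_SIGNS = auto()


@dataclass(frozen=True)
class DiaryEntry:
    """One data set, structured like one line of a diabetes diary."""
    timestamp: datetime
    glucose: Optional[float] = None   # serum glucose; unit (mg/dl) is an assumption
    insulin_basic: float = 0.0        # insulin units; field split follows the text
    insulin_bolus: float = 0.0
    insulin_correction: float = 0.0
    carbohydrates: float = 0.0        # unit is an assumption
    events: Event = Event.NONE

    @property
    def total_insulin(self) -> float:
        """Sum of basic, bolus and correction dosage."""
        return self.insulin_basic + self.insulin_bolus + self.insulin_correction


# Made-up example values, for illustration only.
entry = DiaryEntry(
    timestamp=datetime(2003, 5, 12, 7, 30),
    glucose=142.0,
    insulin_basic=8.0,
    insulin_bolus=4.0,
    carbohydrates=3.0,
    events=Event.POSTPRANDIAL | Event.PHYSICAL_ACTIVITY,
)
```

Combining the event notes in a single flag value keeps one record per submission, in line with the one-line-per-entry diary layout described above.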

2.2 Data Visualization

The main goal of VIE-DIAB’s data visualization is to provide the physician with an overview of the patient’s condition during the last 4 weeks. This overview should be as simple and comprehensive as possible. Graphic visualization using glyphs is used to support this comprehensiveness. Twenty-eight glyphs are arranged in 4 rows (weeks) and 7 columns (days of the week). Each glyph is an abstract object representing the serum glucose measurements of one day, grouped by time periods and by serum glucose concentrations, thereby reducing data complexity. For grouping by time period we use six categories of unequal length (modal day): early morning (03:00-08:00), morning (08:00-11:00), noon (11:00-14:00), afternoon (14:00-17:00), evening (17:00-20:00), and night (20:00-03:00).
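The mapping of a measurement time to one of the six modal-day periods can be sketched as follows. This is a minimal Python illustration, not the VIE-DIAB code; the function name and the treatment of boundary times (each period includes its start hour and excludes its end hour) are assumptions.

```python
from datetime import time

# (label, start hour inclusive, end hour exclusive); "night" wraps past midnight.
PERIODS = [
    ("early morning", 3, 8),
    ("morning", 8, 11),
    ("noon", 11, 14),
    ("afternoon", 14, 17),
    ("evening", 17, 20),
    ("night", 20, 3),
]


def period_of(t: time) -> str:
    """Return the modal-day period that contains the time of day t."""
    hour = t.hour
    for label, start, end in PERIODS:
        if start < end:
            if start <= hour < end:
                return label
        elif hour >= start or hour < end:  # period that wraps around midnight
            return label
    raise AssertionError("the six periods cover all 24 hours")
```

For instance, `period_of(time(13, 30))` yields `"noon"`, and a measurement at 02:59 is assigned to `"night"`.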

For each of these periods only one value, usually the highest, is displayed. An exception is the case when a “low” serum glucose value (hypoglycaemia) and a normal or high value are present within one time frame. In such cases both values are represented. Serum glucose values are categorized within five groups. Each category is represented by a vertical block of different colour and height. Normal values (green) are located in the center of the time frame. Low values (blue) go down, high values (three reds of different intensity) go up. In addition, a coloured summary bar at the right margin displays the relative distribution of the displayed serum glucose values. Figure 1 gives an example of the glyph display.
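The selection rule for one time period can be sketched in a few lines of Python. The category limits below are placeholders chosen only to make the example run; they are hypothetical, are not the limits used in VIE-DIAB, and carry no clinical meaning. The function names are likewise assumptions.

```python
from bisect import bisect_right

# Hypothetical category limits (mg/dl), for illustration only.
CATEGORY_LIMITS = [60, 160, 240, 320]
CATEGORIES = ["low", "normal", "high-1", "high-2", "high-3"]


def category(value: float) -> str:
    """Map a serum glucose value to one of the five display categories."""
    return CATEGORIES[bisect_right(CATEGORY_LIMITS, value)]


def displayed_values(values: list) -> list:
    """Values shown for one period: the highest one, plus the lowest one
    if it is "low" while the highest is normal or high."""
    if not values:
        return []
    lowest, highest = min(values), max(values)
    if category(lowest) == "low" and category(highest) != "low":
        return [lowest, highest]
    return [highest]
```

With these placeholder limits, the measurements 55, 120 and 190 within one period would produce two blocks (55 as "low" and 190 as "high-1"), whereas 100 and 130 would produce a single "normal" block for 130.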

Alternate views are also available for displaying the data more in detail and can be activated by mouse click. There are three alternative display modes:

• scatter plots displaying all available serum glucose values. The coloured background represents the categories of the glyph display;

• histograms combining displays of serum glucose values, carbohydrate intake and insulin dosage. It is intended to use this view for a more detailed analysis of the relationship between serum glucose values, carbohydrate intake and insulin dosages;

• a spreadsheet containing the raw data in order to facilitate conventional data analysis.

2.3 Recommendations

Once weekly the physicians at the health care center review the patient’s data collected during the last week. VIE-DIAB visualizes the data and presents an initial (partial) recommendation. This recommendation is completed by the physician and forwarded to the server. The server stores the message and sends it to the patient via SMS.

A rule-based approach is used to compute the initial recommendation. Text blocks are composed by applying the rules to the data received in the last seven days. In the near future we are planning to apply more advanced artificial intelligence techniques for (semi-)automated therapy recommendations.

3 Discussion

Visualising data on glycaemic control in patients with DM has many advantages since it is intuitive and concise. This is especially due to the reduction of the number of displayed data and to omitting the overabundance of oscillating single values displayed in other visualization concepts. The reduction of the displayed data is justified by the fact that the main information sought by diabetologists is how many values are out of range, when, and by how much. This question can easily be answered by the chosen view and is also supported by the summary bar on the right margin. If more detailed information is needed for analysing a specific situation, the detailed views can easily be called up. Data visualisation can also be used for teaching purposes because of its simplicity and the intuitive display.

Fig. 1. VIE-DIAB’s glyph data display. Serum glucose values are intuitively displayed as glyphs representing one day, 7 multiples in a row (days), 4 rows per sheet (weeks). The relative distribution of the serum glucose categories is displayed in the right margin.

Another argument in favour of our display is time saving: preliminary data on user satisfaction suggest that a diabetologist can obtain a monthly overview of the quality of diabetes control for a specific patient in less than half a minute, compared to at least 1-5 minutes when examining the diary. This adds up if several patients or a longer time period have to be analysed. Moreover, VIE-DIAB’s data processing module is able to produce (simple) suggestions for advising patients. This is planned to be extended by a more complex knowledge-based system for automated analyses of the data in order to provide better clues for improving glycaemic control.

In summary, we present VIE-DIAB, a novel telemedical support and visualization system for data on glycaemic control that should facilitate ambulatory care of patients with DM. The combination of mobile phone and internet services makes it possible to intensify the communication between patients and diabetologists. The disadvantage is the increased workload of physicians due to the weekly responses which have to be prepared. VIE-DIAB’s intuitive visualization system is a way to keep the workload within tolerable limits because it supports quick overviews of the patient’s condition. A thorough clinical cross-over study is currently in progress to investigate the potential of VIE-DIAB to improve glycaemic control.

Acknowledgements

We are grateful to Jochen Schneider for his in-depth analysis of visualization options for the VIE-DIAB system, and to Harald Geritzer for programming the core of the Java client. We appreciate the support given to the Austrian Research Institute for Artificial Intelligence (ÖFAI) by the Austrian Federal Ministry of Education, Science, and Culture.

References

1. Montori VM, Smith SA. Information systems in diabetes: in search of the holy grail in the era of evidence-based diabetes care. Exp Clin Endocrinol Diabetes 109:S358-S372 (2001)

2. Bellazzi R, Montani S, Riva A, Stefanelli M. Web-based telemedicine systems for home care: technical issues and experiences. Comput Methods Prog Biomed 64:175-187 (2001)

3. Montani S, Bellazzi R, Quaglini S, d’Annunzio G. Meta-analysis of the effect of the use of computer-based systems on the metabolic control of patients with diabetes mellitus. Diabetes Technol Ther 3:347-56 (2001)

4. Hejlesen OK, Plougmann S, Cavan DA. DiasNet--an Internet tool for communication and education in diabetes. Stud Health Technol Inform 77:563-567 (2000)

5. Smith L, Weinert C. Telecommunication support for rural women with diabetes. Diabetes Educ 26:645-55 (2000)

6. Shahar Y. A framework for knowledge-based temporal abstraction. Artif Intell 90:79-133 (1997)

7. Bellazzi R, Larizza C, Riva A. Temporal abstraction for interpreting diabetic patients monitoring data. Intell Data Anal 2:97-122 (1998)

8. Horn W, Popow C, Unterasinger L. Support for fast comprehension of ICU data: visualization using metaphor graphics. Meth Inform Med 40:421-424 (2001)

Drifting Concepts as Hidden Factors in Clinical Studies

Matjaz Kukar

University of Ljubljana, Faculty of Computer and Information Science, Trzaska 25, SI-1001 Ljubljana, [email protected]

Abstract. Most statistical, Machine Learning and Data Mining algorithms assume that the data they use is a random sample drawn from a stationary distribution. Unfortunately, many of the databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them may have changed during this time, sometimes radically (this is also known as concept drift). In clinical institutions, where the patients’ data are regularly stored in a central computer database, similar situations may occur. Expert physicians may easily, even unconsciously, adapt to the changed environment, whereas Machine Learning and Data Mining tools may fail due to their underlying assumptions. It is therefore important to detect and adapt to the changed situation. In the paper we review several techniques for dealing with concept drift in Machine Learning and Data Mining frameworks and evaluate their use in clinical studies with a case study of coronary artery disease diagnostics.

Keywords: concept drift, partial memory learning, windowing, gradual forgetting, clinical studies, Machine Learning, Data Mining.

1 Introduction

Clinical decision-making is a complicated process based on experience, judgement, and reasoning that should simultaneously integrate information from the medical literature and a variety of other sources, including quantitative results of clinical trials and, most importantly, diagnostic test results. In most clinical institutions the patients’ data are regularly stored in a central computer database. With time, more and more records that include confirmed diagnoses appear in the database. Such databases are frequently the subject of retrospective studies. The patients in whom the outcome has already occurred are selected and analyzed, thus looking backward to assess potential risk factors and diagnostic principles. Retrospective studies naturally fit in the Machine Learning and Data Mining application framework, which is becoming increasingly popular as a support tool in medical decision making.

Given a set of patient descriptions with confirmed diagnoses, the task of a Machine Learning algorithm is to automatically generate a model (a description) of the given data with respect to the correct diagnosis. A set of possible diagnoses is used as the target for the classification process. The generated model can subsequently be used for risk factor assessment and decision-making support.

Most statistical and Machine Learning/Data Mining algorithms assume that the data is a random sample drawn from a stationary distribution [7]. Unfortunately, many of the databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them may have changed during this time, sometimes radically.

In clinical trials, the experimental setup is supposed to be fixed and strictly controlled. It may therefore seem inappropriate to talk about drifting concepts. However, one must be aware that even in the most strictly controlled environments, unexpected changes may happen. For instance, a crucial piece of equipment may start to fail and later be replaced, personnel changes may happen, and new scientific discoveries may be absorbed into practice. While changes in the process may not be visible immediately, it is necessary to act as soon as they are discovered. While humans can gradually adapt to a changed situation with relative ease, the same does not hold for machines, not even for learning ones. Most Machine Learning algorithms already provide techniques for handling noise in training data (such as pruning of decision trees [2] and rules [1]). However, new examples, generated under changed conditions, may be considered noisy and therefore excluded from training. Consequently, generated models do not reflect changed conditions until enough new examples are collected. During this transition the model performance on new examples would be poor.

1.1 Motivation

The main motivation for this paper is a set of quite unexpected experimental results on the Nuclear dataset (nuclear diagnostics of coronary artery disease) [4, 12], which is by now very familiar to us. This is a two-class problem: diagnosing whether or not the patients suffer from coronary artery disease (CAD). The data were collected between 1991 and 1994. After performing leave-one-out testing on the whole dataset and ordering the examples by the date of final examination, we obtained the classification accuracy graphs depicted in Fig. 1. The decreasing classification accuracy for both the physicians and the naive Bayesian classifier in the last observed year (1994) could result either from a significantly changed class distribution or from a concept change. The former seems not to be the case in our problem, since the class distribution (see class prevalence in Fig. 1) does not change significantly over the observed time interval. This leads us to the question of what has happened and how we can deal with it.

It is important that retrospective studies, as well as ongoing (online) studies in which Machine Learning tools are being used, employ techniques for detecting and dealing with time-changing concepts. While concept change may not happen very often, it may seriously skew the results of otherwise perfectly valid studies. In order to compensate for changed conditions it is also important for a Machine Learning system to decide when to rebuild a model to account for newly arrived training examples and how much historical training data to use for learning.

The paper is organized as follows. In Sec. 2 we review some related work and proposed solutions for dealing with drifting concepts. In Sec. 3 we describe the dataset we are using for our case study. In Sec. 4 we present some experimental results on our case study. Finally, in Sec. 5 we present some conclusions and directions for future work.

2 Methods

Most statistical and Machine Learning algorithms assume that all data was generated by a single concept and is basically a random sample drawn from a stationary distribution [7]. In many cases, however, it is more accurate to assume that data was generated by a series of concepts, or by a concept function with time-varying parameters. Traditional Machine Learning systems learn incorrect models when they erroneously assume that the underlying concept is stationary if in fact it is changing or drifting [7]. For classification systems, which attempt to learn a discrete function given examples of its inputs and outputs, this problem takes the form of changes in the target function over time, and is known as concept drift [6, 8, 14, 15]. In this section we review some methods for dealing with concept drift.

Fig. 1. Time-based variation of classification accuracy in the Nuclear dataset.

2.1 Drifting Concepts

Recently, several systems have been developed that employ Machine Learning methods in real-life applications. They learn real-life concepts that tend to change over time [8, 14, 15]. An illustrative example comes from Text Mining, when learning shifting human interests [3].

The concept drift, whether abrupt or gradual [5, 6], occurs over time. The evidence for changes in a concept is represented by the training examples, which are distributed over time. Hence old observations can become irrelevant to the current time period, and thus the learned knowledge can become outdated. Several methods have been suggested to cope with this problem, either by forgetting outdated induced knowledge or by forgetting outdated training examples [3, 5, 9, 15].

Special techniques are applied when concepts can be expected to recur [5]. Recurring (oscillating) concepts may be due to cyclic phenomena or may be associated with irregular phenomena. In both cases the approach is to identify stable concepts and the associated context-specific, locally stable concepts, and store them to be reused when appropriate.


In the following sections we will focus on forgetting training examples (learning with partial memory), since this approach is more general and does not require significant changes in training algorithms.

2.2 Partial Memory Learning

Partial memory learners are systems that select and maintain a portion of the past training examples, which they use together with new examples in subsequent training episodes. Such systems can learn by memorizing selected new facts, or by using selected facts to improve the current concept descriptions or to derive new concept descriptions. Researchers have developed partial memory systems because they can be less susceptible to overtraining when learning concepts that change or drift, as compared to learners that use other memory models [13, 15]. The key issues for partial memory learning systems are how they select the most relevant examples from the input stream, maintain them, and use them in future learning episodes. These decisions affect the system’s classification accuracy, memory requirements, and ability to cope with changing concepts. For example, a selection policy might keep each training example that arrives, while the maintenance policy forgets examples after a fixed period of time.

These policies more or less bias the learner toward recent events, and, as a consequence, the system may forget about important but rarely occurring events. On the other hand, a learner that is strongly anchored to the past may perform poorly if concepts change or drift.

2.3 Learning to Forget

Most frequently, forgetting is implemented in an abrupt manner. That means that the examples that are irrelevant according to some time criterion (e.g. examples that are outdated) are deleted from the partial memory [13]. Hence, these instances are totally forgotten. The examples that remain in the partial memory are equally important for the learning algorithm. Another possibility is to use gradual forgetting [9]. It can be implemented with a time-based forgetting function, which provides each example with a weight according to its time of occurrence. The importance of an example diminishes with time. The drawback of this approach is that Machine Learning algorithms need to implement techniques for dealing with unequally important examples.

Abrupt Forgetting (Windowing). A common approach to learning from time-changing data is to repeatedly apply a traditional learner to a sliding window of w examples; as new examples arrive they are inserted into the beginning of the window, a corresponding number of examples is removed from the end of the window, and the learner is reapplied [15]. As long as w is small relative to the rate of concept drift, this procedure assures availability of a model for the current concept generating the data. If the window is too small, however, this may result in insufficient examples to satisfactorily learn the concept. Further, the computational cost of reapplying a learner may be prohibitively high, especially if examples arrive at a rapid rate and the concept changes quickly.
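The windowing procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `fit_majority` is a trivial stand-in learner introduced only to keep the example self-contained (it is not the classifier used in the experiments), and the function names are hypothetical.

```python
from collections import Counter, deque


def fit_majority(examples):
    """Stand-in learner: always predicts the most frequent label seen."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label


def sliding_window_models(stream, w, fit=fit_majority):
    """After each arriving example, re-learn a model on the w most recent ones."""
    window = deque(maxlen=w)  # appending to a full deque drops the oldest example
    for example in stream:
        window.append(example)
        yield fit(list(window))


# A stream of (features, label) pairs whose concept flips from 0 to 1 halfway.
stream = [("x", 0)] * 5 + [("x", 1)] * 5
models = list(sliding_window_models(stream, w=3))
```

Here the model obtained after the seventh example already predicts the new label, because two of the three examples in the window belong to the new concept. The learner is reapplied on every arrival, which is the computational cost noted above.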


Gradual Forgetting. The principal idea behind gradual forgetting is that natural forgetting is a gradual process. This means that newer training examples should be more important than older ones and their importance should decrease with time. The importance of an example is given by its weight w = f(t). The calculated weights must lie in an interval that is suitable for the applied learning algorithms.

Assuming that training examples arrive at equal time steps, Koychev [9] suggests using a linear gradual forgetting function, defined as follows:

$$w_i = -\frac{2k}{n-1}\, i + k + 1 \qquad (1)$$

where i is a counter of observations starting from the most recent one and going back over time, i = 0 . . . n−1, n is the length of the observed training sequence, and k is a parameter that represents the percentage by which the weight of the first (oldest) observation is decreased, and consequently the percentage by which the weight of the last (most recent) one is increased, in comparison to the average. By varying the parameter k, the slope of the forgetting function can be adjusted.
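As a worked illustration of Eq. 1, the following Python sketch computes the linear weights for a short sequence. The function name and the handling of the degenerate case n = 1 are assumptions made for this example.

```python
def linear_weights(n: int, k: float) -> list:
    """Weights of Eq. 1 for i = 0 (most recent) to n - 1 (oldest)."""
    if n == 1:
        return [1.0]  # assumption: a single example receives the average weight
    return [-2.0 * k / (n - 1) * i + k + 1.0 for i in range(n)]


weights = linear_weights(n=5, k=0.9)
```

For n = 5 and k = 0.9 the weights fall linearly from 1 + k = 1.9 for the most recent example to 1 − k = 0.1 for the oldest one, and their average is 1.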

Within the same framework, a kernel function can also be used for weighting the examples (Eq. 2).

$$w_i = \frac{1}{\sqrt{2\pi}\, k}\, e^{-\frac{d^2}{2k^2}} \qquad (2)$$

Here d = i/n is a relative time distance to the training example from the past, and k is a real-valued kernel parameter. Both forgetting functions (Eq. 1 and Eq. 2) were utilized in the experiments described in Sec. 4.
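The kernel weights of Eq. 2 can be computed in the same way. The sketch below is illustrative only; the function name is an assumption, and i is taken to run from 0 (most recent) to n − 1 as in Eq. 1.

```python
import math


def kernel_weights(n: int, k: float) -> list:
    """Weights of Eq. 2, with relative time distance d = i / n."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * k)
    return [norm * math.exp(-((i / n) ** 2) / (2.0 * k * k)) for i in range(n)]


kweights = kernel_weights(n=100, k=0.25)
```

The most recent example (d = 0) receives the largest weight, 1/(√(2π) k), and the weights decrease monotonically with the distance d; a smaller k makes the decrease steeper.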

Parameter Setting. While we have quite a few options for dealing with drifting concepts, they all require parameter adjustment (window size, slope of the linear function, kernel parameter). Because we cannot detect drift until it has happened, these parameters cannot be optimally set in advance, unless we know the exact extent of the drift. Therefore we always start with a certain amount of “drifted” data that can be used for parameter optimization [8].

3 Materials

In our experiments we focused on the Nuclear dataset [4, 12] for three reasons:

– we have been working on this dataset for quite some time and therefore know it pretty well,

– we have close relations with the physician who collected the data and who also provided the original diagnoses,

– it was possible to order the patients by the date of their examination, which is rare in publicly available datasets (such as the UCI repository). This is mostly because existing temporal information is not compiled by the experts preparing the data for analysis.


3.1 Clinical Diagnostics of Coronary Artery Disease (Nuclear Dataset)

Coronary artery disease (CAD) is the most important cause of mortality in all developed countries. It is caused by diminished blood flow through coronary arteries due to stenosis or occlusion. CAD produces impaired function of the heart and finally the necrosis of the myocardium (myocardial infarction).

In our study we used a dataset of 327 patients (250 males, 77 females) who had undergone clinical and laboratory examinations, exercise ECG, myocardial scintigraphy and coronary angiography because of suspected CAD. The features from the ECG and scintigraphy data were extracted manually by the clinicians.

Table 1. CAD data for different diagnostic levels.

Diagnostic level               Diagnostic attributes
                               Nominal   Numeric   Total
Signs, symptoms and history    23        7         30
Exercise ECG                   7         9         16
Myocardial scintigraphy        22        9         31
Coronary angiography           1         -         1
Total attributes               53        25        78

Disease prevalence             70% positive, 30% negative
Entropy of classes             0.89 bit

In 228 cases the disease was angiographically confirmed and in 99 cases it was excluded. 162 patients had suffered from recent myocardial infarction. The patients were selected from a population of approximately 4000 patients who were examined at the Nuclear Medicine Department between 1991 and 1994. We selected only the patients with complete diagnostic procedures (all four levels) [12]. Results of the fourth level (coronary angiography) were taken as a “gold standard”.

4 Results

For testing different methods for dealing with concept drift we applied the following methodology. All examples were ordered by the time of the patient’s examination. When we needed to start with some initial training set, we fixed for this purpose the first 100 out of 327 examples. Performance on this set was evaluated with a leave-one-out testing process. On the remaining examples we applied the different techniques described in Sec. 2. For windowing as well as for gradual forgetting, testing was done in single steps, where the potential training set consisted of the first n examples, 100 ≤ n ≤ 326, and the testing set consisted of the (n + 1)-st example. Of the training examples, either only the last w (window size) were used for training, or all were used with different weights assigned (gradual forgetting).
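The single-step testing protocol for the windowing case can be sketched as follows. This Python illustration makes several assumptions: `fit_majority` is a trivial stand-in for the actual learner, the function names are hypothetical, the data are synthetic, and the leave-one-out evaluation of the initial 100 examples is omitted.

```python
from collections import Counter


def fit_majority(examples):
    """Stand-in learner: always predicts the most frequent training label."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label


def stepwise_accuracy(examples, fit, start=100, window=None):
    """Train on the first n examples (or only the last `window` of them)
    and test on the (n + 1)-st, for n = start, ..., len(examples) - 1."""
    correct = 0
    steps = 0
    for n in range(start, len(examples)):
        train = examples[:n] if window is None else examples[max(0, n - window):n]
        x, y = examples[n]  # index n is the (n + 1)-st example
        correct += int(fit(train)(x) == y)
        steps += 1
    return correct / steps


# Synthetic, time-ordered data whose label flips after 150 examples.
data = [("x", 0)] * 150 + [("x", 1)] * 50
acc_all = stepwise_accuracy(data, fit_majority)                # full memory
acc_window = stepwise_accuracy(data, fit_majority, window=11)  # windowing
```

On this synthetic sequence the full-memory learner never catches up with the flipped label, whereas the windowed learner recovers after a few steps.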

Our experimental Machine Learning tool of choice was the naive Bayesian classifier. It was suitable for our purpose because of its very fast, incremental learning, because it usually performs well in medical diagnostic problems, and because it can easily be modified to deal with unequally important (weighted) training examples.
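One way in which a naive Bayesian classifier can handle weighted examples is to replace the counts by sums of example weights. The sketch below is a minimal illustration for nominal attributes, not the implementation used in the experiments; the class name and the Laplace-style smoothing parameter `alpha` are assumptions.

```python
import math
from collections import defaultdict


class WeightedNaiveBayes:
    """Naive Bayes for nominal attributes; counts are sums of example weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (an assumption of this sketch)

    def fit(self, X, y, weights=None):
        if weights is None:
            weights = [1.0] * len(y)
        self.class_weight = defaultdict(float)
        self.counts = defaultdict(float)  # (attribute, value, class) -> weight sum
        self.values = defaultdict(set)    # attribute -> set of observed values
        for row, label, w in zip(X, y, weights):
            self.class_weight[label] += w
            for j, v in enumerate(row):
                self.counts[(j, v, label)] += w
                self.values[j].add(v)
        return self

    def predict(self, row):
        total = sum(self.class_weight.values())
        best, best_score = None, -math.inf
        for label, cw in self.class_weight.items():
            score = math.log(cw / total)  # weighted class prior
            for j, v in enumerate(row):
                num = self.counts.get((j, v, label), 0.0) + self.alpha
                den = cw + self.alpha * len(self.values[j])
                score += math.log(num / den)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy data in time order: value "a" first indicates class 0, later class 1.
X = [("a",), ("a",), ("a",), ("b",), ("a",), ("a",)]
y = [0, 0, 0, 1, 1, 1]
plain = WeightedNaiveBayes().fit(X, y)
recent = WeightedNaiveBayes().fit(X, y, weights=[0.2, 0.2, 0.2, 1.0, 1.8, 1.8])
```

With equal weights the classifier still assigns ("a",) to class 0, while with larger weights on the recent examples it follows the changed concept and predicts class 1.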


4.1 Detecting the Concept Drift

The first task was to find out how concept drift can be detected. If we want to detect it, it must have already happened or started to happen. Thus we actually have a certain amount of data available for experimenting. In diagnostic problems, where sooner or later the diagnoses are confirmed, our task is easy. Once we have collected enough drifted examples, the drift is reflected in a significantly decreased average classification accuracy achieved in the recent past in comparison with the average classification accuracy achieved in the distant past (see average classification accuracy in Fig. 2; the recent past is the last 50 examples).

In prognostic problems, however, the situation is more difficult, since the actual outcomes may not be known for a long time, if ever. In such cases it may be more useful to use a measure of reliability estimation [10, 11] that assigns a kind of confidence value to every prediction (see reliability estimation in Fig. 2). Although the actual outcomes may not be known, one could detect the drift by a significantly decreased average reliability estimation in the recent past in comparison with the average reliability estimation calculated in the distant past (see average reliability estimation in Fig. 2; the recent past is the last 50 examples).
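The comparison of recent and distant averages can be sketched as follows. This is a simplified Python illustration: the function name and the fixed threshold `drop` are assumptions, and a threshold is used here in place of a proper significance test.

```python
def drift_suspected(scores, recent=50, drop=0.1):
    """Compare the average score of the last `recent` examples with the
    average over all earlier ones. `scores` holds either 0/1 correctness
    values (diagnostic problems) or reliability estimates in [0, 1]
    (prognostic problems), in time order."""
    if len(scores) <= recent:
        return False  # not enough history to compare against
    past, last = scores[:-recent], scores[-recent:]
    return sum(past) / len(past) - sum(last) / len(last) > drop


stable = [1] * 250
drifting = [1] * 200 + [1, 0] * 25  # accuracy falls to 50% in the last 50 examples
```

In this toy example `drift_suspected(drifting)` is true, whereas `drift_suspected(stable)` is false.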

[Figure: percentage (0% to 100%) plotted over the years 1991 to 1995. Series: reliability estimation, averaged reliability estimation, averaged classification accuracy.]

Fig. 2. Detecting concept drift with classification accuracy and reliability in the Nuclear dataset.

In Fig. 2 it can clearly be observed that the drift has been happening since the beginning of the last observed year (1994), that is, from example 261 on. We therefore selected the last 66 examples as the “drifted” set of interest.

4.2 Dealing with Concept Drift

For dealing with concept drift we applied windowing as well as linear and kernel-based gradual forgetting (Sec. 2.3). We used the first 261 “non-drifted” examples for parameter optimization (window size, slope and kernel parameter). In order to evaluate the quality of the obtained parameters, we also optimized the parameters on the “drifted” set. For comparison we selected the best achieved results. The obtained parameter values were tested on the last 66 “drifted” examples, and the results are compared in Tab. 2.


Table 2. Experimental results on the drifted data.

Naive Bayes            Parameter         Optimized           Best achieved       Overall
                                         Value   Accuracy    Value   Accuracy    accuracy
Windowed               Window size       100     85%         70      88%         95%
Linear                 Slope (k)         0.90    83%         0.9     83%         94%
Kernel                 Kernel size (k)   0.25    83%         0.17    86%         95%
Physicians                                       70%                 70%         85%
Ordinary Naive Bayes                             64%                 64%         91%

(a) Windowing. Notice the negative effect of too small a window size (w=50).

(b) Gradual forgetting. Differences between linear and kernel-based forgetting are almost negligible.

Fig. 3. Catching up with the drift with gradual forgetting in the Nuclear dataset.

As we can see, differences in accuracy between the optimized and the actual best parameter values exist, but they are small. By using the optimized parameter values the average performance on the whole dataset was 94-95% for all three methods. We cannot say that any of them (windowing, linear or kernel-based forgetting) is significantly better than the others. However, we can see an improvement in overall accuracy over the ordinary naive Bayesian classifier of about 4 percentage points. This is no small achievement, since it actually reduces the error rate by 44% (from 9% to 5%). Even more important, the performance on the “drifted” examples (last 66) is much higher than that of the ordinary naive Bayesian classifier, by about 20 percentage points (from 64% to 83-85%). This means that the performance on this problematic subset is almost level with the overall performance, and should equal it once a few more training examples arrive.
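The relation between the absolute gain in accuracy and the relative reduction of the error rate can be checked with a short calculation. The helper below is only an illustration of the arithmetic; the function name is an assumption.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original error rate that is removed."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# Overall accuracy of 91% (ordinary naive Bayes) versus 95% (with forgetting):
reduction = relative_error_reduction(0.91, 0.95)
```

The error rate falls from 9% to 5%, so the reduction is 4/9, i.e. about 44% of the original error.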

In Figs. 3(a) and 3(b) we compare the classification accuracy of the ordinary naive Bayesian classifier with that of the windowing and gradual forgetting methods for different parameter values. For the ordinary naive Bayesian classifier, the non-drifted examples (first 261) were used for training and evaluated with leave-one-out testing; the remaining examples were added to the training set and tested incrementally.


5 Discussion

We reviewed three different methods for dealing with changing (drifting) concepts: windowing and gradual (linear or kernel-based) forgetting. We found that in our case study of coronary artery disease diagnostics all of them perform reasonably well. While they all require setting certain parameters, these can be automatically tuned on the training set [8], and nearly optimal results can be expected. In windowing, at most n (size of the training set) re-runs of the training algorithm are required for window size selection, whereas for gradual forgetting a linear optimization of the respective parameter to the desired precision is sufficient.

We can also detect concept drift when it happens, either by using classification accuracy (in diagnostic problems) or reliability estimation (in prognostic problems). More formally, one could use a statistical test with a given confidence level to detect concept drift, and rebuild (re-learn) the model only when necessary and with suitable parameters (e.g. window size, slope or kernel parameter) to compensate for the drift. This is especially useful for practical applications, where rebuilding a model is not performed every time a new training example arrives. Namely, model rebuilding may require the presence of a Machine Learning expert, especially if learning parameters need to be changed. Often a generated model is stored and used independently of the learner (e.g. in a handheld device, or even printed on paper). In such cases a model should be rebuilt and deployed only when really necessary.
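As one possible instance of such a statistical test, the sketch below applies a one-sided two-proportion z-test to the numbers of correctly classified examples in the distant and in the recent past. The choice of test, the function name and the example counts are assumptions made for illustration; the paper does not prescribe a particular test.

```python
import math


def accuracy_drop_p_value(correct_past, n_past, correct_recent, n_recent):
    """One-sided two-proportion z-test: p-value for the hypothesis that
    recent accuracy is not lower than past accuracy."""
    p_past = correct_past / n_past
    p_recent = correct_recent / n_recent
    pooled = (correct_past + correct_recent) / (n_past + n_recent)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_past + 1.0 / n_recent))
    if se == 0.0:
        return 1.0  # all predictions right (or all wrong): no evidence of a drop
    z = (p_past - p_recent) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal


# Hypothetical counts: 240 of 261 correct earlier, 42 of 66 correct recently.
p_value = accuracy_drop_p_value(240, 261, 42, 66)
rebuild_model = p_value < 0.01  # rebuild only when the drop is significant
```

With these hypothetical counts the p-value is far below 0.01, so the model would be rebuilt; with equal accuracies in both periods the p-value is 0.5 and the stored model would be kept.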

In our case study of coronary artery disease diagnostics we achieved an overall improvement of 4% over the ordinary naive Bayesian classifier's result. Since this reduces the error rate by 44% (from 9% to 5%), it is no small achievement.
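The relative error reduction quoted above follows directly from the two error rates; the symbols below are introduced only for this worked check.

```latex
\[
\text{relative error reduction}
  = \frac{e_{\text{before}} - e_{\text{after}}}{e_{\text{before}}}
  = \frac{0.09 - 0.05}{0.09}
  \approx 0.44
\]
```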

The improvement was most notable in the last 66 “drifted” examples, where it was about 20% (from 64% to 83-85%). This means that the performance on this problematic subset is almost level with overall performance and should equal it after a few more training examples. We argue that any (online) learning system used in practice should use similar techniques to detect and deal with drifting concepts.

There are several things that can be done to further develop the described methods. Most notably, a weighting scheme for gradual forgetting should be devised that does not need to be recalculated every time a new training example arrives. This would enable true incremental learning; however, it would require the learner to cope with increasingly large, theoretically unlimited example weights.

One question is left unanswered: how and why a concept drift occurred in our case study. So far we have no definitive answer; however, it seems that the human factor was seriously involved.

Acknowledgements

I thank Dr. Ciril Groselj, from the Nuclear Medicine Department, University Medical Centre Ljubljana, for collecting the data. This work was supported by the Slovenian Ministry of Education and Science.


References

1. W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell, editors, Proc. 12th Intl. Conf. on Machine Learning ICML’95, pages 115–123, San Francisco, California, USA, 1995. Morgan Kaufmann.

2. F. Esposito, D. Malerba, and G. Semeraro. Simplifying decision trees by pruning and grafting: new results. In N. Lavrac and S. Wrobel, editors, Proc. Europ. Conf. on Machine Learning ECML’95, pages 287–290. Springer Verlag, 1995.

3. I. Crabtree and S. Soltysiak. Identifying and tracking changing interests. International Journal of Digital Libraries, 2:38–53, 1998.

4. C. Groselj, M. Kukar, J. Fettich, and I. Kononenko. Machine learning improves the accuracy of coronary artery disease diagnostic methods. In Proc. Computers in Cardiology, volume 24, pages 57–60, Lund, Sweden, 1997.

5. M. B. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning, 32:101–126, 1998.

6. D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14:27–45, 1994.

7. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the 7th ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining, pages 97–106, San Francisco, CA, 2001. ACM Press.

8. R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In P. Langley, editor, Proceedings of ICML-00, 17th International Conference on Machine Learning, pages 487–494, Stanford, US, 2000. Morgan Kaufmann Publishers, San Francisco, US.

9. I. Koychev. Gradual forgetting for adaptation to concept drift. In Proceedings of ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning, pages 101–106, Berlin, Germany, 2000.

10. M. Kukar. Making reliable diagnoses with machine learning: A case study. In S. Quaglini, P. Barahona, and S. Andreassen, editors, Proceedings of Artificial Intelligence in Medicine Europe, AIME 2001, pages 88–96, Cascais, Portugal, 2001. Springer.

11. M. Kukar and I. Kononenko. Reliable classifications with Machine Learning. In Proceedings of 13th European Conference on Machine Learning, ECML 2002, pages 219–231, 2002.

12. M. Kukar, I. Kononenko, C. Groselj, K. Kralj, and J. Fettich. Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artificial Intelligence in Medicine, 16(1):25–50, 1999.

13. M. A. Maloof and R. S. Michalski. Selecting examples for partial memory learning. Machine Learning, 41(1):27–52, 2000.

14. N. A. Syed, H. Liu, and K. K. Sung. Handling concept drifts in incremental learning with support vector machines. In Knowledge Discovery and Data Mining, pages 317–321, 1999.

15. G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.

Multi-relational Data Mining in Medical Databases

Amaury Habrard, Marc Bernard, and Francois Jacquenet

EURISE – Universite de Saint-Etienne – 23, rue du Dr Paul Michelon, 42023 Saint-Etienne cedex 2 – France

{Amaury.Habrard,Marc.Bernard,Francois.Jacquenet}@univ-st-etienne.fr

Abstract. This paper presents the application of a method for mining data in a multi-relational database that contains information about patients struck down by chronic hepatitis. Our approach may be used on any kind of multi-relational database and aims at extracting probabilistic tree patterns from a database using Grammatical Inference techniques. We propose to represent the database by trees in order to extract these patterns. Trees provide a natural way to represent structured information, taking into account the statistical distribution of the data. In this work we try to show how they can be useful for interpreting knowledge in the medical domain.

1 Introduction

The main objective of Data Mining techniques is to extract regularities from a large amount of data. For this purpose some efficient techniques have been proposed, like the apriori algorithm [2]. However, these techniques are closely tied to data stored in a flat representation, even though more and more structured data (like relational databases) are used in all domains of activity. Thus, to use algorithms that work on flattened data, some pre-processing steps are required, which unfortunately lose valuable information. Cios and Moore pointed out some features that make medical data mining unique [6]. We think that being able to deal with structured data is also especially important in medicine, where databases may contain a large number of tables.

Over the last years, several attempts have been made to extract regularities directly from databases without having to flatten the data. This has led to the emergence of an active field of research called multi-relational Data Mining [8]. For example, the ILP system WARMR [9] defines a generic framework to learn from a datalog representation of a database, keeping the structure of the data. Crestana-Jensen and Soparkar [7] propose an algorithm for mining decentralized data that exploits inter-table relationships to extract frequent itemsets on separate tables. In this paper we present a method for extracting knowledge from data structured as trees. Trees are natural candidates for representing the structured information of multi-relational databases. Mining probabilistic knowledge in this context leads to mining tree patterns that respect the statistical distribution observed over the data. A tree pattern can be viewed as an abstraction of trees and thus provides an interesting way to represent regularities observed in large amounts of data. The concept of tree patterns has received a lot of interest during the past two years. In the machine learning field, learning of tree patterns has been studied for example by Amoth, Cull and Tadepalli in [3] and by Goldman and Kwek in [13]. More recently, data mining approaches have been proposed to extract these patterns [4, 10, 17, 16, 18]. In this paper we are interested in statistical learning methods which may improve the Data Mining task. These methods are interesting for Data Mining because they provide a quantitative approach to weighting the evidence supporting alternative hypotheses. They are known to perform well with noisy data and may discover knowledge from positive examples only.

The method we present follows several steps. First we choose the table containing the data we want to focus on. From the rows of this table we generate a set of trees. For each row, a tree is generated recursively using the relationships defined between the various tables by foreign keys. Using probabilistic grammatical inference techniques, we learn a probabilistic tree grammar on the database. The tree grammar is represented by a stochastic tree automaton. Then we use a level-wise search approach to generalize the transition rules of the automaton. These generalized rules are finally used to produce a set of probabilistic tree patterns.

Each step of our method is described in the next two sections. Then, in section 4, we show how our system may be applied to the discovery of knowledge about chronic hepatitis by extracting probabilistic tree patterns from a relational database. We have focused our work on the relations between the level of liver fibrosis and laboratory examinations made on patients. The level of liver fibrosis is often closely related to the stage of hepatitis C, a disease which may lead to liver cirrhosis or hepatocarcinoma. The final objective is to point out laboratory examinations that can predict the level of liver fibrosis.

The work presented in this paper can be viewed as preliminary work on the extraction of probabilistic tree patterns from relational databases of medical data. The objective is to see how these patterns can be useful in multi-relational data mining tasks and how they can model structured data in relational databases.

2 Stochastic Tree Automata

A tree automaton [12] defines a regular tree language just as a finite automaton defines a regular language on strings. Stochastic tree automata are an extension of tree automata and define a statistical distribution on the tree language recognized by the automaton. Learning tree automata has received attention for some years. For example, Garcia and Oncina [11] and Knuutila and Steinby [15] have proposed algorithms for learning tree automata. In the probabilistic framework, Carrasco, Oncina and Calera [5] proposed an efficient algorithm to learn stochastic tree automata; Abe and Mamitsuka [1] dealt with learning stochastic tree grammars to predict protein secondary structure. In this section we mainly describe stochastic tree automata, which are the core of the system.


In fact we consider an extension of stochastic tree automata which takes sorts into account. Thus we consider many-sorted stochastic tree automata. We first define the concept of signature, which represents the set of constructible trees.

Definition 1 A signature Σ is a 4-tuple (S, X, α, σ). S is a finite set whose elements are called sorts. X is a finite set whose elements are called function symbols. α is a mapping from X into ℕ. α(f) will be called the arity of f. σ is a mapping from X into S. σ(f) will be called the sort of f.

Definition 2 A stochastic many-sorted tree automaton (SMTA) is a 5-tuple (Σ, Q, r, δ, p). Σ is a signature (S, X, α, σ). Q = ⋃_{s∈S} Q_s is a finite set of states, each state having a sort in S. r : Q → [0, 1] is the probability for the state to be an accepting state. δ : X × Q∗ → Q is the transition function. p : X × Q∗ → [0, 1] is the probability of a transition.

We denote by R_{A,g,q} the set of all rules of the form g(q1, . . . , qn) → q of a tree automaton A. Note that we do not allow overloading of function symbols.

An SMTA parses a tree using a bottom-up strategy. A state and a probability are associated with each node of the tree. The labelling of each node is defined by the transition function. The tree is accepted if the probability of its root node is strictly positive. Given an SMTA A, the probability of a tree t is computed as follows:

p(t | A) = r(δ(t)) × π(t)

where π(f(t1, . . . , tn)) is recursively computed by:

π(t) = p(f, δ(t1), . . . , δ(tn)) × π(t1) × · · · × π(tn)

Example The automaton A defined by the following transition rules is able torecognize, for instance, the tree g(f(c, b, a)) with an associated probability of0.15.

1.0 : a → q1                0.6 : f(q1,q2,q4) → q3     0.1 : f(q1,q2,q3) → q3
1.0 : b → q2                0.15 : f(q5,q2,q1) → q3    1.0 : g(q3) → q4
1.0 : c → q5                0.15 : f(q1,q4,q3) → q3    Final state : r(q4) = 1.0
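The computation of p(t | A) for the example automaton can be traced with a short sketch. The rule table is copied from the example; the tuple encoding of trees and the function names are assumptions made for illustration, and sorts are ignored.

```python
from typing import Dict, Optional, Tuple

# A tree is (symbol, children); a leaf has an empty tuple of children.
# Transition rules: (symbol, child states) -> (arrival state, probability).
RULES: Dict[Tuple[str, Tuple[str, ...]], Tuple[str, float]] = {
    ("a", ()): ("q1", 1.0),
    ("b", ()): ("q2", 1.0),
    ("c", ()): ("q5", 1.0),
    ("f", ("q1", "q2", "q4")): ("q3", 0.6),
    ("f", ("q5", "q2", "q1")): ("q3", 0.15),
    ("f", ("q1", "q4", "q3")): ("q3", 0.15),
    ("f", ("q1", "q2", "q3")): ("q3", 0.1),
    ("g", ("q3",)): ("q4", 1.0),
}
FINAL: Dict[str, float] = {"q4": 1.0}  # r(q) for accepting states


def parse(tree) -> Tuple[Optional[str], float]:
    """Bottom-up parse: return (delta(t), pi(t)), or (None, 0.0) if no rule applies."""
    symbol, children = tree
    states = []
    pi = 1.0
    for child in children:
        state, child_pi = parse(child)
        if state is None:
            return None, 0.0
        states.append(state)
        pi *= child_pi
    rule = RULES.get((symbol, tuple(states)))
    if rule is None:
        return None, 0.0
    state, prob = rule
    return state, prob * pi


def tree_probability(tree) -> float:
    """p(t | A) = r(delta(t)) * pi(t)."""
    state, pi = parse(tree)
    if state is None:
        return 0.0
    return FINAL.get(state, 0.0) * pi


# The tree g(f(c, b, a)) from the example.
example = ("g", (("f", (("c", ()), ("b", ()), ("a", ()))),))
p_example = tree_probability(example)
```

Here the only rule with probability below 1 on the path is f(q5,q2,q1) → q3, so `p_example` evaluates to 0.15, as stated in the example.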

A tree automaton is inferred from a sample of trees defining a dataset. We do not detail the procedure here; the interested reader may consult [5, 14]. The structure of the automaton is constructed iteratively using a state merging procedure. The probabilities are then computed from the training sample, taking into account the distribution of the data. We generalize the rules of the automaton, looking for frequent regularities in the database relative to the distribution of the dataset. This generalization step makes it possible to generate probabilistic tree patterns, modeling concepts stored in the database.


3 Generalization of a SMTA

We have proposed in [14] a technique to generalize an SMTA relative to a threshold γ. The idea is to generalize the transition rules of the automaton using a level-wise algorithm. The generalization algorithm is local, that is, it considers the generalization of an SMTA locally to a given arrival state and a given symbol.

The generalization process considers rules containing variables; these rules are called generalized rules. The score of a generalized rule is computed by adding the probabilities of the transition rules of the SMTA subsumed by the generalized rule. The algorithm looks for the most specific generalized rules having a score greater than γ. The algorithm presented in [14] extracts generalized rules from all the sets R_{A,f,q} definable in the SMTA.

Definition 3 Let V be a set of variables and Q a set of states. Let r1 and r2 be two rules: r1 = f(x1, . . . , xn) → q and r2 = f(x′1, . . . , x′n) → q′ such that xi ∈ Q ∪ V and x′i ∈ Q ∪ V. r1 is more general than r2 (we note r1 > r2) if and only if there exists a substitution θ such that f(x1, . . . , xn)θ = f(x′1, . . . , x′n). We say that r1 subsumes r2.

Definition 4 A rule r is called a generalized rule if there exists a rule r′ in R_{A,f,q} such that r > r′.
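The subsumption test of Definition 3 can be sketched as a search for a substitution θ over the argument lists of two rules with the same symbol. The convention that variables are written with a leading "?" and the function names are assumptions made for this illustration.

```python
from typing import Dict, Optional, Sequence


def is_variable(term: str) -> bool:
    # Assumed convention: variables start with "?", everything else is a state.
    return term.startswith("?")


def find_substitution(general: Sequence[str],
                      specific: Sequence[str]) -> Optional[Dict[str, str]]:
    """Return theta such that general * theta == specific, or None if none exists.

    A variable must be mapped to the same term everywhere it occurs;
    a state in `general` must match the corresponding term exactly.
    """
    if len(general) != len(specific):
        return None
    theta: Dict[str, str] = {}
    for g, s in zip(general, specific):
        if is_variable(g):
            if theta.setdefault(g, s) != s:
                return None  # the same variable would need two different values
        elif g != s:
            return None
    return theta


theta_ok = find_substitution(["q1", "?x", "q4"], ["q1", "q2", "q4"])
theta_clash = find_substitution(["?x", "?x", "q4"], ["q1", "q2", "q4"])
```

In this sketch r1 subsumes r2 exactly when `find_substitution` returns a dictionary (possibly empty) rather than `None`.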

For example, consider the following set of rules R_{A,f,q3} from an SMTA A:

0.10 : f(q1, q2, q3) → q3    0.15 : f(q1, q4, q3) → q3
0.60 : f(q1, q2, q4) → q3    0.15 : f(q5, q2, q3) → q3

If we fix the γ parameter to 0.6, the generalization algorithm will extract the following generalized rules:

0.6 : f(q1, _, q4), 0.7 : f(q1, q2, _) and 0.6 : f(_, q2, q4)
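The scores attached to these generalized rules can be checked with a small sketch. It implements only the scoring step (summing the probabilities of the subsumed transition rules), not the level-wise search of [14]; the dictionary encoding and the "_" wildcard for an anonymous variable are assumptions made for illustration.

```python
from typing import Dict, Tuple

# The rule set for symbol f and arrival state q3, keyed by argument states.
RULES_Q3: Dict[Tuple[str, ...], float] = {
    ("q1", "q2", "q3"): 0.10,
    ("q1", "q4", "q3"): 0.15,
    ("q1", "q2", "q4"): 0.60,
    ("q5", "q2", "q3"): 0.15,
}
WILDCARD = "_"  # anonymous variable: matches any state


def subsumes(pattern: Tuple[str, ...], args: Tuple[str, ...]) -> bool:
    """True if every position of `pattern` is a wildcard or equals the state in `args`."""
    return len(pattern) == len(args) and all(
        p == WILDCARD or p == a for p, a in zip(pattern, args)
    )


def score(pattern: Tuple[str, ...], rules: Dict[Tuple[str, ...], float]) -> float:
    """Sum of the probabilities of the transition rules subsumed by `pattern`."""
    return sum(prob for args, prob in rules.items() if subsumes(pattern, args))


scores = {
    pattern: round(score(pattern, RULES_Q3), 2)
    for pattern in [("q1", "_", "q4"), ("q1", "q2", "_"), ("_", "q2", "q4")]
}
```

For instance, f(q1, q2, _) subsumes the rules with probabilities 0.10 and 0.60, which gives the score 0.7 quoted above.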

In this work we are interested in extracting probabilistic tree patterns. These tree patterns give a probabilistic representation of the database. Informally, a tree pattern is a tree whose leaves can be either variables or symbols of arity zero. Formally, we define tree patterns as terms in first order logic.

Definition 5 Let Σ = (S, X, α, σ) be a signature and V a set of variables. A tree pattern on Σ ∪ V is defined recursively as follows:

– a symbol s ∈ X such that α(s) = 0 is a tree pattern

– a variable v ∈ V is a tree pattern

– if t1, . . . , tn are tree patterns, then f(t1, . . . , tn) is a tree pattern for any symbol f ∈ X such that α(f) = n

Definition 6 A probabilistic tree pattern is a couple (t, p) where t is a treepattern and p a probability (0 ≤ p ≤ 1).
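The recursive definition of a tree pattern translates directly into a well-formedness check. The sketch below ignores sorts, and the signature (symbols `a`, `b`, `f` and their arities), the "_" variable and the encoding of terms are hypothetical, chosen only to illustrate Definitions 5 and 6.

```python
# Hypothetical signature: arity of each function symbol.
ARITY = {"a": 0, "b": 0, "f": 2}
VARIABLES = {"_"}


def is_tree_pattern(term) -> bool:
    """Check Definition 5: a term is a leaf string or a (symbol, children) pair."""
    if isinstance(term, str):
        # Leaves are variables or symbols of arity zero.
        return term in VARIABLES or ARITY.get(term) == 0
    symbol, children = term
    # An inner node needs a symbol whose arity equals its number of children,
    # and every child must itself be a tree pattern.
    return ARITY.get(symbol) == len(children) and all(
        is_tree_pattern(child) for child in children
    )


def is_probabilistic_tree_pattern(pair) -> bool:
    """Check Definition 6: a pair (t, p) with t a tree pattern and 0 <= p <= 1."""
    term, prob = pair
    return is_tree_pattern(term) and 0.0 <= prob <= 1.0


valid = is_probabilistic_tree_pattern((("f", [("f", ["a", "b"]), "_"]), 0.3))
wrong_arity = is_tree_pattern(("f", ["a"]))
```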


To extract these patterns we use a depth-first search on the generalized rules of the SMTA. The idea is to start from the final states of the generalized SMTA (that is, states with a probability greater than zero of being a final state) and to construct the tree patterns recognized by the automaton. This process is recursive and uses the generalized rules of the automaton. The rules are chosen according to their arrival state; when several rules can be used, we generate tree patterns for each possibility. The probability of a tree pattern is the product of the probabilities of the rules used in the depth-first search process.
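A minimal sketch of this depth-first extraction is given below. It is not the implementation of the system: the generalized rules, state names and probabilities are invented (loosely following the shape of a pattern over PT_E, cons_ILAB_E and ILAB_E), and the depth bound is an assumption added so that recursive rules cannot loop forever.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

# Hypothetical generalized rules indexed by arrival state:
# state -> list of (probability, symbol, arguments); "_" denotes a variable.
GEN_RULES: Dict[str, List[Tuple[float, str, Tuple[str, ...]]]] = {
    "q_root": [(0.66, "PT_E", ("_", "q_list"))],
    "q_list": [(1.0, "cons_ILAB_E", ("q_exam", "_"))],
    "q_exam": [(1.0, "ILAB_E", ("_", "_"))],
}
FINAL_STATES: Dict[str, float] = {"q_root": 1.0}


def expand(state: str, depth: int) -> Iterator[Tuple[str, float]]:
    """Yield (pattern, probability) pairs derivable from `state` within `depth` levels."""
    if depth == 0:
        return
    for prob, symbol, args in GEN_RULES.get(state, []):
        # For each argument: a variable stays a variable (probability 1),
        # a state is expanded recursively into all of its patterns.
        options = [
            [("_", 1.0)] if arg == "_" else list(expand(arg, depth - 1))
            for arg in args
        ]
        for combo in product(*options):
            p = prob
            for _, child_prob in combo:
                p *= child_prob  # product of the probabilities of the rules used
            children = ", ".join(pattern for pattern, _ in combo)
            yield (f"{symbol}({children})" if args else symbol, p)


def extract_patterns(max_depth: int = 5) -> List[Tuple[str, float]]:
    """Start from every state with a non-zero final probability."""
    patterns: List[Tuple[str, float]] = []
    for state, r in FINAL_STATES.items():
        if r > 0:
            patterns.extend(expand(state, max_depth))
    return patterns


patterns = extract_patterns()
```

When several rules share an arrival state, `product` enumerates every combination, which mirrors generating a tree pattern for each possibility.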

4 Mining the Chronic Hepatitis Data

In this section we present the work we have done with our system in order to discover knowledge about chronic hepatitis. The data we used were prepared in the context of a collaboration between the Shimane Medical University, School of Medicine and Chiba University Hospital. The database stores 771 patients with hepatitis B and C who took examinations in the period 1982-2001. Hepatitis A, B and C are virus infections that affect the liver of the patient. Hepatitis B and C are especially important because they carry a risk of developing liver cirrhosis or hepatocarcinoma. An indicator that can be used to assess the risk of cirrhosis or hepatocarcinoma is fibrosis of hepatocyte. For instance, liver cirrhosis is characterized as the terminal stage of liver fibrosis. One way to evaluate the stage of liver fibrosis is to perform a biopsy, but this examination is invasive for patients; it would thus be interesting to use laboratory examinations as substitutes for biopsy.

In this work we propose to extract probabilistic tree patterns that try to link the stage of liver fibrosis with in-hospital and out-hospital laboratory examinations. For this purpose we focus on four of the five levels described in biopsy examinations, namely levels F1, F2, F3 and F4, the last being the most severe stage. We then construct a sample for each level; each sample is made up of trees taking into account laboratory examinations. The following section presents the preparation of the data.

4.1 Data Preparation

Data are organized in a relational database made up of six tables. One table gives information about patients (PT_E table), another gives results of biopsy on patients (BIO_E table), while information on interferon therapy is stored in the IFN_E table. Two tables give results of in-hospital and out-hospital examinations (ILAB_E and OLAB_E tables), and the last table (LABN_E table) gives information about measurements in in-hospital examinations. We stored the data in a relational database, using the relational database management system PostgreSQL.

Let us recall that our system extracts knowledge from data in the form of trees. Thus, using the database, we have to build those trees. For a comprehensive general presentation of the transformation process, the reader may refer to [14]. In this particular case, for each level of liver fibrosis, we choose the table PT_E to build the root of each tree and we select the tuples corresponding to patients having a biopsy with the considered level of liver fibrosis. In this table we only consider the MID attribute, which is the identifier of each patient. This attribute is duplicated to define two foreign keys: one on table OLAB_E, the second on table ILAB_E. Using table OLAB_E we continue building the trees by considering the attributes OLAB_Exam_Name (name of the out-hospital examination), OLAB_Exam_Result (result of the out-hospital examination), OLAB_Evaluation (evaluation of the result) and OLAB_Eval_SubCode (internal subcode of the evaluation items). Finally, using table ILAB_E we build the trees by considering the attributes ILAB_Exam_Name (name of the in-hospital examination) and ILAB_Exam_Result (result of the in-hospital examination). To construct the subtrees corresponding to the foreign keys on tables ILAB_E and OLAB_E, we consider only records for which the time between the date of examination and the date of the patient's biopsy does not exceed two days.
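The construction of one tree per patient can be sketched as follows. This is an illustration under stated assumptions rather than the transformation of [14]: rows are plain dictionaries with hypothetical keys (`name`, `result`, `date`), lists are encoded as nested cons cells using the node labels that appear in Fig. 1, the label UNKNOWN_ILAB_E is invented by analogy with UNKNOWN_OLAB_E, and the dates are made up. For brevity the OLAB_E subtree keeps only the name and result attributes.

```python
from datetime import date
from typing import Dict, List


def within_two_days(exam_date: date, biopsy_date: date) -> bool:
    """Keep only examinations at most two days away from the biopsy."""
    return abs((exam_date - biopsy_date).days) <= 2


def cons_list(label: str, items: List[tuple]) -> tuple:
    """Encode a list of subtrees as nested cons cells terminated by an end marker."""
    tree: tuple = ("end_cons_" + label,)
    for item in reversed(items):
        tree = ("cons_" + label, item, tree)
    return tree


def build_patient_tree(biopsy_date: date,
                       ilab_rows: List[Dict],
                       olab_rows: List[Dict]) -> tuple:
    """Build the tree rooted at PT_E from the rows reachable through the foreign keys."""
    ilab = [("ILAB_E", row["name"], row["result"])
            for row in ilab_rows if within_two_days(row["date"], biopsy_date)]
    olab = [("OLAB_E", row["name"], row["result"])
            for row in olab_rows if within_two_days(row["date"], biopsy_date)]
    olab_tree = cons_list("OLAB_E", olab) if olab else ("UNKNOWN_OLAB_E",)
    ilab_tree = cons_list("ILAB_E", ilab) if ilab else ("UNKNOWN_ILAB_E",)
    return ("PT_E", olab_tree, ilab_tree)


# Made-up rows: two in-hospital examinations near the biopsy, one too far away.
biopsy = date(1995, 3, 10)
rows = [
    {"name": "APTT", "result": "APTT_|0|", "date": date(1995, 3, 9)},
    {"name": "PT", "result": "PT_|0|", "date": date(1995, 3, 11)},
    {"name": "ALB", "result": "ALB_|0|", "date": date(1995, 4, 1)},
]
patient_tree = build_patient_tree(biopsy, rows, [])
```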

Values of the attributes ILAB_Exam_Result and OLAB_Exam_Result are discretized taking into account the information in table LABN_E. If an examination has an entry in this table, we consider the value as normal if it lies between the lower and upper bounds; otherwise we measure its level according to a discretization step. For other examinations, we compute a mean to define lower and upper bounds, then we apply the same policy.
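One possible reading of this discretization policy is sketched below. The exact step and bounds used by the system are not given in the text, so the function, its parameters and the example values (a lower bound of 3.9 g/dl, an upper bound of 5.1 g/dl and a step of 3.9) are assumptions made for illustration; the label format follows the leaves shown in the figures, such as APTT_|0|.

```python
import math


def discretize(value: float, lower: float, upper: float, step: float) -> int:
    """Return 0 for a normal value, otherwise the signed number of steps outside the range."""
    if lower <= value <= upper:
        return 0
    if value < lower:
        return -math.ceil((lower - value) / step)
    return math.ceil((value - upper) / step)


def leaf_label(exam_name: str, level: int) -> str:
    """Build a leaf label such as 'ALB_|0|' from the examination name and its level."""
    return f"{exam_name}_|{level}|"


# Hypothetical bounds and step for an albumin (ALB) measurement in g/dl.
normal_label = leaf_label("ALB", discretize(4.5, lower=3.9, upper=5.1, step=3.9))
low_label = leaf_label("ALB", discretize(3.0, lower=3.9, upper=5.1, step=3.9))
```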

The tree in figure 1 is an example of a tree produced for a patient at level F2. This patient has no out-hospital examinations and has two in-hospital examinations: one on activated partial thromboplastin time (APTT), which has a normal value, and one on prothrombin time (PT), which also has a normal value.

[Figure: tree rooted at PT_E, with node labels UNKNOWN_OLAB_E, cons_ILAB_E, ILAB_E, APTT, APTT_|0|, PT, PT_|0| and end_cons_ILAB_E]

Fig. 1. Tree associated with patient with MID=414

4.2 Experimentation on Level of Liver Fibrosis

We generate a sample of trees for each level of liver fibrosis (F1 to F4). Then, for each sample, we learn a generalized SMTA with different levels of generalization. Let us recall that the level of generalization depends on the γ parameter (0 ≤ γ ≤ 1): the higher this parameter, the more we generalize. Thus, if this parameter is high we extract very few patterns, which are often too general. On the other hand, if it is low, many patterns are extracted, which are very specific. For example, on patients at level F3 we extract 768320 patterns with γ = 0.15 and only 4 with γ = 0.85. This parameter may therefore be used to define the specialization of the extracted knowledge. For this purpose we learned generalized SMTA using values of the γ parameter from 0.1 to 0.85. Then, for each generalized model, we extract probabilistic tree patterns.

To illustrate the extracted knowledge, we give some examples of tree patterns and their probabilities. As discussed in a previous section, some leaves of the trees correspond to discretized attributes of the database, and the number of the interval of values for such attributes is written between two vertical lines. Understanding tree patterns is not always easy and may require joint work between data miners and medical experts. It could be useful to design a tool that converts tree patterns into natural language sentences, but discussing such a tool is not the aim of this paper.

Figure 2 shows a general pattern, of probability 0.66, extracted with γ = 0.7 from patients at level F2. Informally, this pattern says that patients at level F2 have at least one in-hospital examination with probability 0.66. Figure 3 shows a less general pattern saying that 20% of patients at level F1 have at least two in-hospital examinations, one of them on albumin (ALB).

[Figure: tree pattern of probability 0.66 rooted at PT_E, with node labels cons_ILAB_E and ILAB_E and variable leaves (_)]

Fig. 2. Tree pattern of level F2 extracted with γ = 0.7

[Figure: tree pattern of probability 0.20 rooted at PT_E, with node labels cons_ILAB_E, ILAB_E and ALB and variable leaves (_)]

Fig. 3. Tree pattern of level F1 extracted with γ = 0.55

Figure 4 shows a very specific pattern of probability 0.0004. This pattern says that 0.04% of patients at level F3 have at least one out-hospital examination on “D inshi” with no evaluation and a subcode equal to F504, and at least one in-hospital examination on albumin (ALB).

Figure 5 shows a less specific pattern. It says that 6.9% of patients at level F4 have one in-hospital examination on albumin (ALB) with a value between 0 and 3.9 g/dl.


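[Figure 4: tree pattern of probability 0.0004 rooted at PT_E, with node labels cons_OLAB_E, OLAB_E, D inshi, F504, cons_ILAB_E, ILAB_E and ALB and variable leaves (_)]

[Figure 5: tree pattern of probability 0.069 rooted at PT_E, with node labels cons_ILAB_E, ILAB_E and ALB_|−1| and variable leaves (_)]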

Figure 6 summarizes the influence of the γ parameter on the number of extracted tree patterns. For small values of γ, we expect weak generalization and thus a high number of patterns with a priori small probabilities. When γ tends to 1, we expect over-generalization and so a small number of patterns. Note that while overly general patterns are not very interesting, patterns with low probabilities can be useful because they can point out rare events.

We now give a summary of the probabilistic tree patterns extracted by our system. Note that in our work we only consider examinations made in the same period as the biopsy examination. Thus a larger study seems necessary to interpret these results.

– patients at level F1: For in-hospital examinations we find that 20% of patients have ALB examinations. We also extract patterns saying that 17% of patients at level F1 have normal values for this examination. For out-hospital examinations, the examination “ABO shiki ketsuekigata kensa” concerns about 50% of the patients. 20% of them have “O kata” and 15% have “B kata” as the result of this examination.

– patients at level F2: The out-hospital examination “ABO shiki ketsuekigata kensa” is present for 30% of patients. For in-hospital examinations we notice the presence of albumin (ALB) (20%) and alkaline phosphatase (ALP) (20%).

– patients at level F3: The out-hospital examination “ABO shiki ketsuekigata kensa” is also present, for 18% of patients. We also notice the presence of the “D inshi” examination for 17% of patients. These two examinations are both linked with the following subcodes (uniformly distributed): E499, E500, E501, E502, F503, F504 and F505. The albumin (ALB) examination is also present among in-hospital tests for 26% of patients.


[Figure: number of patterns (logarithmic axis, 1 to 1e+07) plotted against the gamma parameter (0.1 to 0.8), one curve per level F1, F2, F3 and F4]

Fig. 6. Number of patterns relative to γ

– patients at level F4: Among in-hospital examinations, we notice the albumin (ALB) examination for 20% of the patients, and direct bilirubin (D-BIL) and cholinesterase (CHE) for 7% of patients. Note that for albumin exams, the patients have a probability of 12% of having a result lower than the normal range.

5 Conclusion

In this paper we have experimented with a method to extract probabilistic tree patterns from a medical database. This method is based on a representation of the database by a set of trees; the inductive phase consists in first learning an SMTA and then generalizing this model relative to a parameter. In the context of a database about chronic hepatitis, we are able to link data from many relations of the database, such as out-hospital and in-hospital examinations.

One interesting perspective of this work would be to define a language bias with medical experts in order to embed all the relevant information in trees. We may then try to use the method to extract other kinds of knowledge. Another perspective would be to work on the way probabilistic tree patterns could be used as a condensed representation of the database.

References

1. N. Abe and H. Mamitsuka. Predicting protein secondary structure using stochastic tree grammars. Machine Learning, 29:275–301, 1997.

2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, September 12–15, 1994.

3. T. R. Amoth, P. Cull, and P. Tadepalli. On exact learning of unordered tree patterns. Machine Learning, 44(3):211, 2001.

4. H. Arimura, H. Sakamoto, and S. Arikawa. Efficient learning of semi-structured data from queries. In 12th International Conference on Algorithmic Learning Theory, volume 2225 of Lecture Notes in Computer Science, pages 315–331, 2001.

5. R.C. Carrasco, J. Oncina, and J. Calera. Stochastic Inference of Regular Tree Languages. Machine Learning, 44(1/2):185–197, 2001.

6. K.J. Cios and G.W. Moore. Uniqueness of medical data mining. Artificial Intelligence in Medicine, 26:1–24, 2002.

7. V. Crestana-Jensen and N. Soparkar. Frequent itemset counting across multiple tables. In 4th Pacific-Asian conference on Knowledge Discovery and Data Mining (PAKDD 2000), pages 49–61, April 2000.

8. L. De Raedt. Data mining in multi-relational databases. In 4th European Conference on Principles and Practice of Knowledge, 2000. Invited talk.

9. L. Dehaspe and H. Toivonen. Discovery of frequent DATALOG patterns. Data Mining and Knowledge Discovery, 3(1):7–36, 1999.

10. J. Ganascia. Extraction of recurrent patterns from stratified ordered trees. In 12th European Conference on Machine Learning (ECML’01), volume 2167 of LNCS, pages 167–178, Freiburg, Germany, 2001. Springer.

11. P. García and J. Oncina. Inference of recognizable tree sets. Research Report DSIC - II/47/93, Departamento de Sistemas Informaticos y Computacion, Universidad Politecnica de Valencia, 1993.

12. F. Gecseg and M. Steinby. Tree Automata. Akademiai Kiado, Budapest, 1984.

13. S. A. Goldman and S. S. Kwek. On learning unions of pattern languages and tree patterns. In 10th Algorithmic Learning Theory conference, volume 1720 of Lecture Notes in Artificial Intelligence, pages 347–363, 1999.

14. A. Habrard, M. Bernard, and F. Jacquenet. Generalized stochastic tree automata for multi-relational data mining. In Proceedings of the Sixth International Colloquium on Grammatical Inference (ICGI 2002), volume 2484 of LNCS, pages 120–133. Springer, 2002.

15. T. Knuutila and M. Steinby. Inference of tree languages from a finite sample: an algebraic approach. Theoretical Computer Science, 129:337–367, 1994.

16. R. Kosala, J. Bussche, M. Bruynooghe, and H. Blockeel. Information extraction in structured documents using tree automata induction. In 6th European Conference on Principles and Practise of Knowledge Discovery in Databases (PKDD’02), volume 2431 of LNCS, pages 299–310. Springer, 2002.

17. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda. Discovery of frequent tag tree patterns in semistructured web documents. In Sixth Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2002), Taipei, Taiwan, May 2002.

18. M. Zaki. Efficiently mining frequent trees in a forest. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, Canada, July 2002.


Is It Time to Trade “Wet-Work” for Network?

Computational Approaches Open up New Directions and Define Theoretical Limitations in Massively Parallel Biology

Zoltan Szallasi

Children’s Hospital Informatics Program, Boston, MA, 02215, USA [email protected]

www.chip.org

Abstract. Systems biology aims to create a predictive mathematical model of the living organism. Such a goal may seem an exciting challenge to those with a quantitative background, but it often sounds like a preposterous idea to others with less affinity for computers. Therefore, a talk on this topic should start by presenting several concrete examples of how systems biology may help solve both practical and theoretical problems in biomedical research. These will include the difficulties associated with finding combinatorial therapeutic targets and the related theoretical problem of understanding the robustness of genetic networks in order to facilitate efficient drug design. The limitations of massively parallel biological data acquisition will also be discussed in order to assess the quality of the data that quantitative approaches will rely on [1,2].

The basic steps of modeling, such as data collection, reverse engineering, forward simulations and parameter optimization, are well known, but each of these steps is associated with specific issues because the object of study is a highly complex, heterogeneous, adaptive system. There is no doubt that recent interest in systems biology has been ignited by the development of massively parallel measurement techniques in molecular biology, such as microarray chips. It is all the more important to point out that current modeling efforts can make little use of microarray data due to their well documented lack of accuracy [2]. Large-scale measurements, unless they significantly improve, can be used only in probabilistic models, where independent validation will be required before inclusion into reliable models. The noise level of measurements can be used to estimate the information content of these data sets, which in turn will impose limitations on modeling itself. The dichotomy of high throughput, low quality measurements versus low throughput, high quality measurements will be discussed in terms of actual applications to modeling efforts.

Reverse engineering spans from whole genome sequencing projects to parameter optimization of dynamic models. A major difficulty in reverse engineering human genetic networks is the fact that at present functional annotations exist for less than half of all human genes, and most of the existing annotations are incomplete. It seems, however, that in microorganisms unknown functional annotations can be correctly guessed with a reasonable probability by exploiting phylogenetic profiles, sequence homology and protein interaction databases [3]. This is based on the fact that proteins that share function tend to be present in the same microorganisms, tend to interact and tend to have sequence similarity.


We are in the process of extending this method to human proteins. We have created a microarray-based, large-scale (~10,000 genes) gene expression database on a wide variety of primary human cell lines. This will provide a “differentiation profile” of human proteins, analogous to the phylogenetic profile of genes in microorganisms. This information will be combined with other “knowledge databases” in order to produce probable functional annotations.

Reverse engineering from time-series data for a genetic network of a given complexity has a well-defined information requirement. These estimates, however, should be adjusted according to the specific characteristics of data derived from genetic networks [1].

Methods developed by the AI community are widely used in systems biology. Evolutionary algorithms, for example, are exploited both in parameter optimization of large data-driven models and in gene-expression-based classification of various disease states. As an example, our recent work on the application of evolutionary algorithms to breast cancer classification will be discussed.

Finally, understanding robustness is one of the deeper intellectual challenges of systems biology, with several practical implications. Fundamental concepts associated with robustness are well known to the AI community. I will, therefore, present some surprising results derived from breast-cancer-associated gene expression measurements that can be best explained in terms of abstract concepts such as “attractors”.

Taken together, I will argue as a practicing biologist that, although with due caution, it is time to add computational “genetic network” approaches to traditional “wet-lab” biology.

References

1. Szallasi, Z.: Genetic network analysis in light of massively parallel biological data acquisition. Pacific Symp. on Biocomputing. 4:5-16, 1999.

2. Yuen T, Wurmbach E, Pfeffer RL, Ebersole BJ, Sealfon SC. Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res. 2002 May 15;30(10):e48.

3. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D. A combined algorithm for genome-wide prediction of protein function. Nature. 1999 Nov 4;402(6757):83-6.

Robots as Models of the Brain: What Can We Learn from Modelling Rat Navigation and Infant Imitation Games?

Philippe Gaussier1, Pierre Andry1, Jean Paul Banquet1,2, Mathias Quoy1, Jacqueline Nadel3, and Arnaud Revel1

1 Neuro-cybernetic team, Image and Signal processing Lab. (ETIS), CNRS UMR 8051, UCP-ENSEA, 6 av du Ponceau, 95014 Cergy Pontoise, France

2 Neurosciences and modelisation institute, INSERM 483, Jussieu, Paris

3 Equipe Developpement et Psychopathologie, UMR CNRS 7593, Hopital de la Salpetriere, Paris

[email protected]

1 Introduction

Understanding the brain and its cognitive mechanisms is a central question for philosophers, neuroscientists, psychologists and engineers alike. In our team, we try to reconcile the old cybernetic approach with neural network modelling and artificial intelligence. This neurocybernetic approach aims to contribute to the effort to build a science of cognition. Our goal is clearly not to design optimal solutions for a particular problem but to try to understand the mechanisms that allow the brain to adapt itself in order to survive in a wide variety of unpredictable situations. Hence, robots can be seen as simulation tools that allow us to test the behavioral consequences of a particular model in almost real conditions. We use Koala mobile robots (see Fig. 1) equipped with one pan-tilt “head” and a Katana arm with 5 degrees of freedom. The pan-tilt head can rotate 180 degrees horizontally and vertically, and supports a single CCD color camera.

Our work focuses on how visual information and other related idiothetic information are managed by the brain in order to allow global and coherent behaviors such as homing, planning, object manipulation or more complex tasks including social interactions. We believe such global models are important for understanding, for instance, the effect of brain lesions or of particular drugs. For our part, we are involved in joint research programs devoted to the understanding of autism, schizophrenia and Alzheimer’s disease. In this paper, we discuss two examples that may seem completely different: a model of rodent navigation and planning, and a model of infant development in imitation games. We will try to show that both applications can benefit from the same inner global model, which can have important repercussions for brain understanding and medicine.



Fig. 1. Robotic setup: A Katana robotic arm and a home-made pan-tilt camera (right) are mounted on a mobile Koala robot (left).

2 A Model of the Prefrontal Cortex/Hippocampus/Basal Ganglia Loop

Since the initial observations of severe anterograde amnesia following medial temporal lobe1 resection [29], the hippocampus (Hs) has been known as a very important structure for human and primate memory. Electrical recordings of neuronal activity in the rodent Hs have also shown the existence of “place cells” that fire for specific places in the environment [15, 26]. O’Keefe and Nadel proposed that the Hs works like a Cartesian map of the environment, each place cell coding for a very limited and well-defined area [15]. Many experiments in the following three decades have confirmed the importance of the rat Hs in navigation tasks, but more precise experiments have also shown a wide variety of sometimes contradictory results. Studies mainly concerned with conditioning mechanisms have shown that the Hs is also activated during the acquisition of classical conditioning and that lesions of the Hs disrupt the acquisition of long-latency conditioned responses. More generally, Hs lesions seem to impair non-spatial relational tasks [6]. The Hs may be critical in situations in which the relevant stimuli do not occur contiguously in time [30]. These works on the role of the Hs in memory processes have also highlighted the importance of the memory capacities of the perirhinal (Pr) and parahippocampal (Ph) cortices [1], which are directly connected to the entorhinal cortex (EC), an input to the Hs. Hence, the function of a particular brain structure clearly depends heavily on its interconnections with other brain structures. Behaviors are emergent properties of the dynamical interactions between brain structures and also between the agent and its environment.

1 The main structures within the medial temporal lobe are the hippocampal region (the hippocampal field, the dentate gyrus, and the subiculum) and the adjacent entorhinal, perirhinal, and parahippocampal cortices.


To study the possible roles of the Hs, we have developed a neural network model which includes a visual system [18] with parallel feature extraction and attentional mechanisms, in order to build a “what” and “where” representation of the different visual stimuli in the image (associated with the temporal and parietal cortex [27]). The network learns on-line the local visual configurations (“what” information) and tries to associate them with the “where” information in order to build a representation robust to occlusions and object displacements.

[Figure 2. Panel A: floor plan of the experimental room (dimension labels 7.2 m and 5.4 m) with furniture such as tables, chairs, workstations, shelves, a cupboard, radiators, a door and a TV. Panel B: 25 small activity maps labelled ccm00 to ccm24, each with axis ticks at 2 and 4.]

Fig. 2. a) The room in which the experiments are performed. The little crosses represent the places where the robot has learned. b) 25 neurons, 25 places, 5 measures averaged per place, diffusion 35 degrees. Neurons are considered to be isolated (activation only comes from the direct input). Learning was supervised: each neuron is associated with a particular location in a 5 × 5 paving of the room in which the experiments have been carried out. Each rectangle is a map of the room. The curves show the activity of the neurons corresponding to the crosses of panel a).

We were surprised to notice that the field of our cells was really wide and pertinent over very long distances (typically 2 to 3 meters, see Fig. 2), allowing us to build very simple reactive homing strategies [12]. We have proposed that our simulated cells could be similar to the cells in the entorhinal cortex (EC), thus explaining why some kinds of navigation tasks remain possible after bilateral hippocampal ablation. Fig. 3 shows how parietal and temporal data could be merged in the perirhinal cortex to build large “place cells” in the entorhinal cortex (this network was used to obtain the results of Fig. 2; see also [9, 11]).
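The wide place fields described above can be illustrated with a small Python sketch. It is not the authors' network: it simply assumes that a place is recognized by comparing the azimuths of a few known landmarks with the azimuths learned at that place, with a linear tolerance whose width borrows the 35-degree diffusion value from Fig. 2. The function names, the landmark layout and the test positions are all hypothetical.

```python
import math

def azimuth(frm, to):
    """Bearing in degrees from point `frm` to point `to`."""
    return math.degrees(math.atan2(to[1] - frm[1], to[0] - frm[0]))

def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def place_activity(position, learned_position, landmarks, diffusion=35.0):
    """Mean landmark match in [0, 1]; 1.0 means 'exactly at the learned place'.

    Each landmark contributes 1 when it is seen under the learned azimuth and
    decays linearly to 0 once the azimuth error reaches `diffusion` degrees
    (an assumed tolerance shape, chosen only for illustration).
    """
    total = 0.0
    for lm in landmarks:
        d = angular_diff(azimuth(position, lm), azimuth(learned_position, lm))
        total += max(0.0, 1.0 - d / diffusion)
    return total / len(landmarks)

# Hypothetical landmarks at the corners of a 7.2 m x 5.4 m room.
landmarks = [(0.0, 0.0), (7.2, 0.0), (7.2, 5.4), (0.0, 5.4)]
learned = (3.6, 2.7)

at_place = place_activity(learned, learned, landmarks)
near = place_activity((3.9, 2.7), learned, landmarks)   # 0.3 m away
far = place_activity((6.5, 1.0), learned, landmarks)    # about 3.4 m away
```

Because azimuths to distant landmarks change slowly with position, the activity falls off gradually over meters rather than centimeters. A reactive homing strategy could then, for example, move in whatever direction increases this activity.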

[Figure 3 diagram. Panel a) labels: IT, PPC, object recognition, object location (azimuth), Pr-Ph merging, EC-DG, place or view recognition. Panel b) labels: EC, DG, CA3, CA1/SUB, PF, ACC, d/dt, Motivation, Drive, states B, C and D, transitions AB, BC, BD and CD, and G1, G2.]

Fig. 3. a) Merging of “What” and “Where” information for place recognition in the case of a high-level visual system. The lateral diffusion makes it possible to measure the difference between the learned azimuth and the current azimuth. b) Global architecture of the interconnections between the Hs (here the CA1 and Sub regions), the prefrontal cortex (PF) and the basal ganglia (the nucleus accumbens, ACC). We suppose that place recognition performed in the entorhinal cortex (EC) is delayed in the dentate gyrus (DG). Thus, neurons in the CA3 region can learn the transitions between places and build transition states useful for action selection in the ACC.

Later, we proposed a model of the Hs that reconciles the presence of neurons that look like “place cells” with the involvement of the Hs in other cognitive tasks (complex conditioning acquisition, memory tasks, etc.). We believe the role of the Hs is not fundamentally dedicated to navigation or map building. In our model, the Hs is used to learn, store and predict transitions between multimodal states. This transition prediction mechanism could be important for novelty detection but, above all, crucial for merging planning and sensory-motor functions in a single, coherent system. Fig. 3 gives a schematic view of how the transitions computed in the Hs (CA1 and subiculum regions) could be transmitted to the prefrontal cortex (PFC) to build transition maps (cognitive maps and high-order representations of the agent/environment interactions). At the same time, the transitions proposed by the Hs can trigger a motor action in the basal ganglia (here the nucleus accumbens). When several transitions are possible, the PFC could act as a bias to allow the selection of the most appropriate transition or action according to the plan built in the PFC [11, 16]. The complete model has been successfully tested on a mobile robot. Predictions about the nature and properties of the cells in the Hs are currently being tested by neurobiologists.
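The division of labor sketched in this section (learn and predict transitions, then bias the choice among them according to a plan) can be illustrated by a toy Python example. This is only a symbolic sketch under strong assumptions, not the neural model: states are plain labels, transitions are assumed to be usable in both directions, and the "plan" is a breadth-first distance to the goal. The class and method names are hypothetical.

```python
from collections import defaultdict

class TransitionMemory:
    """Stores observed transitions between discrete states (e.g. places)."""

    def __init__(self):
        self.successors = defaultdict(set)

    def learn(self, path):
        """Record every consecutive pair of states along an explored path."""
        for src, dst in zip(path, path[1:]):
            self.successors[src].add(dst)
            self.successors[dst].add(src)  # assumption: paths work both ways

    def predict(self, state):
        """Transitions that could be triggered from the current state."""
        return sorted((state, nxt) for nxt in self.successors[state])

    def goal_distance(self, goal):
        """Breadth-first spread of the goal over the transition graph."""
        dist = {goal: 0}
        frontier = [goal]
        while frontier:
            next_frontier = []
            for s in frontier:
                for n in self.successors[s]:
                    if n not in dist:
                        dist[n] = dist[s] + 1
                        next_frontier.append(n)
            frontier = next_frontier
        return dist

    def select(self, state, goal):
        """Bias the choice among predicted transitions toward the goal."""
        dist = self.goal_distance(goal)
        candidates = [t for t in self.predict(state) if t[1] in dist]
        if not candidates:
            return None
        return min(candidates, key=lambda t: dist[t[1]])

memory = TransitionMemory()
memory.learn(["A", "B", "C", "D"])
memory.learn(["B", "D"])

options_from_b = memory.predict("B")   # every transition known from B
chosen = memory.select("B", "D")       # the one the goal bias favors
```

In this toy version, `predict` plays the part attributed to the Hs (proposing possible transitions) and `select` the part attributed to the PFC bias. The real model learns these associations with neural maps rather than a lookup table.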

3 Modelling the Infant Development and the Imitation Games

For more complex tasks, interactions with a teacher or an expert can be very important to reduce the time necessary to discover how to perform the correct behavior. In collaboration with psychologists, we have developed very simple models allowing a robot to use imitation both for learning and communication [21, 24]. As far as medicine is concerned, understanding and modelling imitation mechanisms is very important since it has been shown that nonverbal children with autism can imitate but have difficulty recognizing that they are being imitated [22, 23]. Besides, some psychologists believe neonatal imitation is produced by the conjunction of rather “high-level” mechanisms such as a “supra-modal module” [19] linking vision and proprioception (afferent copy of the motor action) and a “social module” motivating the neonate to explore and discriminate the social environment [20]. Conversely, a study by Jacobson [13] tends to show that the neonatal imitation response could be a stereotyped low-level response determined by a particular spatio-temporal configuration of a non-human stimulus. Moreover, more complex imitative capabilities are observed as the baby progressively develops [25]. For example, imitation of arm movements is observed from 2 months of age, as soon as the baby starts to acquire arm coordination. Such examples lead us to ask the following question: at a given level of sensory-motor development, does imitation require much more than simple arm coordination? If a simple perception-action coupling between visual and motor information can explain tracking or pointing behaviors, can this coupling also explain imitative behaviors?

Our robotics experiments support the theory of a co-development of sensori-motor and imitation capabilities [10, 2, 4], in contrast to approaches that suppose a clear difference between human imitation capabilities and animal mimicking capabilities. We defend the idea that low-level imitation can be a side effect of a simple neural architecture devoted to the learning of sensori-motor coordination.

[Figure 4 diagram. Two panels, Learning Phase and Control Phase, each showing a CCD camera, movement detection, joint position, a controller and the robotic arm, with angles α and β.]

Fig. 4. Low-level imitation principle applied to a robotic arm. In a learning phase, a robot controller learns the correspondence between its arm proprioception (the joint position) and the position of the arm in its visual field. To do this, the controller detects movement. Once the associations are learned, if the robot focuses its attention on a human teacher’s moving hand, it will reproduce the teacher’s simple movement just because it perceives a difference between its proprioceptive and visual information. It will try to reduce the proprioceptive error of its arm position according to what it believes to be the visual information linked to its arm (the detection of movement in the visual field). An external observer will then deduce that the learner robot is imitating the teacher.


We have shown that a low-level imitative behavior can be obtained as a side effect of perception ambiguity [10]. “Perception ambiguity” must be understood here as a difficulty in discriminating objects (is this my arm or someone else’s?), or in deciding between different interpretations (is this a useful object or an obstacle?) without any additional information. According to this principle, an imitative behavior of an autonomous robot can be bootstrapped as follows (Fig. 4). Let us suppose a simple robot uses visual information to control the movements of its arm. Let us now suppose that this robot relies only on movement detection to perceive its own arm. Such a system cannot differentiate its extremity from another moving target. As a result, moving in front of the robot induces changes in perception that the robot treats as an unforeseen self-movement. It will then try to reduce the error by acting in the direction of the moving hand, and so follows the gestures of the demonstrator. Hence, low-level sensori-motor architectures could bootstrap more and more complex imitative capabilities (emergent behaviors), avoiding the need for complex innate capabilities such as agency (agentivity), which is the subject of many neuroimaging studies [14, 8].
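The bootstrapping argument can be made concrete with a one-dimensional Python sketch. It assumes, purely for illustration, that the learned visuo-motor correspondence is the identity (a seen position maps directly to an arm position) and that the controller is a simple proportional error-reduction loop. The function name and the gain value are hypothetical and do not come from the robot implementation.

```python
def follow_perceived_motion(arm_position, perceived_positions, gain=0.5):
    """Error-reducing controller that cannot tell its arm from other movers.

    `perceived_positions` is whatever the movement detector reports at each
    time step. The controller assumes it is seeing its own arm and acts to
    cancel the mismatch with proprioception.
    """
    trajectory = []
    for seen in perceived_positions:
        error = seen - arm_position      # visuo-proprioceptive mismatch
        arm_position += gain * error     # move to reduce the mismatch
        trajectory.append(arm_position)
    return trajectory

# A demonstrator's hand moves to position 10, holds, then moves to -5.
teacher = [10.0] * 8 + [-5.0] * 8
robot = follow_perceived_motion(arm_position=0.0, perceived_positions=teacher)
```

Nothing in the controller refers to a teacher or to imitation. The robot arm ends up near 10 and then near -5 only because the detected movement is treated as its own, which is the side effect described in the text.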

[Figure 5 panels: a) image; b) diagram labelled System 1, System 2, Interaction, perception, action, and temporal sequence learning.]

Fig. 5. a) Young children interacting and turn-taking. Imitation serves a communicative purpose, where synchronization and rhythm matter. Reproduced from [22]. b) Interconnection of two systems. Systems 1 and 2 have the same architecture. Each system has learned perception-action associations. The two systems must produce outputs (the same sequence of motor outputs, for example) at the same time.

In addition to its learning function, imitation also has a communicative function, which allows us to investigate how to enhance social relationships between agents. As observed among nonverbal children [22], imitation is a powerful tool for gestural interactions (Fig. 5). We have proposed a model in which the inability to predict the rhythm of an exchange with a teacher can be interpreted as a negative reward, because it may correspond to a break in the teacher’s rhythm. Conversely, the ability to predict the rhythm can be associated with good communication and hence with a positive reward. We have shown that the Hs network proposed in the previous section for navigation tasks can be used for this purpose. Indeed, if we suppose the cells in the dentate gyrus are different temporal filters (like a time base), then the cells in the CA3 region can predict both the transitions and the timing of the transitions [5]. Hence, a comparison between the prediction and the arrival of the event is sufficient to measure the efficiency of the rhythm prediction [3] and the adequacy of the “student” robot’s behavior.
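The comparison between predicted and actual event timing can be sketched in a few lines of Python. This is a deliberate simplification: a single running estimate of the inter-event interval stands in for the bank of temporal filters, and the reward is a plain +1 or -1. The function name, the tolerance and the learning rate are assumptions made for the example.

```python
def rhythm_rewards(event_times, tolerance=0.2, learning_rate=0.5):
    """Return a +1/-1 reward for each event from the third one onward.

    The predictor keeps a running estimate of the inter-event interval and
    compares the predicted arrival time of the next event with the actual one.
    """
    rewards = []
    expected_interval = event_times[1] - event_times[0]
    for previous, current in zip(event_times[1:], event_times[2:]):
        predicted = previous + expected_interval
        error = current - predicted
        on_time = abs(error) <= tolerance * expected_interval
        rewards.append(1 if on_time else -1)
        # Adapt the interval estimate toward the interval just observed.
        observed = current - previous
        expected_interval += learning_rate * (observed - expected_interval)
    return rewards

steady = rhythm_rewards([0.0, 1.0, 2.0, 3.0, 4.0])   # regular exchange
broken = rhythm_rewards([0.0, 1.0, 2.0, 4.5, 5.5])   # one long pause
```

With the regular sequence every prediction is met, so all rewards are positive. The long pause in the second sequence produces a negative reward, and the following event is also judged off-rhythm because the interval estimate has been disturbed.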

4 Conclusion

In this paper, we have shown an example of how neurobiology and psychology can change our way of thinking about artificial intelligence, by using the same model to solve navigation and imitation tasks. Many works in the cognitive sciences insist that the nature of our representations is fundamentally dynamic [17] and that perception cannot be decoupled from action (for instance, the same brain structures are activated when we execute an action and when we see somebody else performing the same action [28, 14, 7]). Moreover, recent works insist on the importance of emotions in all our cognitive processes (controlling planning, allowing fast reactions, etc.). It is clear that we can learn a lot from these studies in order to build new kinds of intelligent systems.

Conversely, neurobiologists and psychologists lack means for testing and proving the global coherence of their models. The artificial intelligence and cybernetics approaches can be a good way to validate the global coherence of cognitive models and to propose simpler alternative solutions. We can then truly participate in this new science devoted to the study of cognitive mechanisms by developing tools to analyze brain activity, but also through our capability to propose new models of cognition and new mathematical tools to analyze cognitive mechanisms. Modelling is essential to bridge the gap between neurobiological and psychological data. It has to be clear that today we are far from a medical use of brain modelling, but a cross-fertilization between robotics and biology/psychology is possible and fruitful (changing the way of thinking of psychologists and neurobiologists as they change our way of thinking as well).

Acknowledgments

This work has been supported by one French “Cognitique” project on imitation and autism and two ACI on computational neurosciences devoted to modelling the loop between the prefrontal cortex, the hippocampus and the basal ganglia. The neurocybernetic team and the psychopathology and development team constitute a joint team project on “imitation in robotics and psychology” financed by the STIC department of the CNRS.

References

1. J.P. Aggleton and M.W. Brown. Episodic memory, amnesia, and the hippocampal-anterior thalamic axis. Behav. Brain Sci., 22:425–489, 1999.

2. P. Andry, P. Gaussier, S. Moga, J.P. Banquet, and J. Nadel. Learning and communication in imitation: An autonomous robot perspective. IEEE Transactions on Systems, Man and Cybernetics, Part A, 31(5):431–444, 2001.

3. P. Andry, S. Moga, P. Gaussier, A. Revel, and J. Nadel. Imitation: learning and communication. In J. A. Meyer, A. Berthoz, D. Floreano, H. Roitblat, and S. Wilson, editors, Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior, pages 353–362, Paris, 2000. The MIT Press.

4. P. Andry, P. Gaussier, and J. Nadel. Development of the firsts sensori-motor stages: a contribution to imitation. In J. A. Meyer, A. Berthoz, D. Floreano, H. Roitblat, and S. Wilson, editors, Seventh International Conference on Simulation of Adaptive Behavior, 2002.

5. J.P. Banquet, P. Gaussier, J.L. Contreras-Vidal, and Y. Burnod. The cortical-hippocampal system as a multirange temporal processor: A neural model. In R. Parks and D. Levine, editors, Fundamentals of neural network modeling for neuropsychologists, Boston, 1998. MIT Press.

6. M. Bunsey and H. Eichenbaum. Conservation of hippocampal memory function in rats and humans. Nature, 379:255–257, 1996.

7. J. Decety. Do imagined and executed actions share the same neural substrate? Cognitive Brain Research, 3:87–93, 1996.

8. J. Decety, J. Grezes, N. Costes, D. Perani, M. Jeannerod, E. Procyk, F. Grassi, and F. Fazio. Brain activity during observation of actions. Brain, 120:1763–1777, 1997.

9. P. Gaussier, C. Joulain, J.P. Banquet, S. Lepretre, and A. Revel. The visual homing problem: an example of robotics/biology cross fertilization. Robotics and Autonomous Systems, 30, 2000.

10. P. Gaussier, S. Moga, M. Quoy, and J.P. Banquet. From perception-action loops to imitation processes: a bottom-up approach of learning by imitation. Applied Artificial Intelligence, 12(7-8):701–727, Oct-Dec 1998.

11. P. Gaussier, A. Revel, J.P. Banquet, and V. Babeau. From view cells and place cells to cognitive map learning: processing stages of the hippocampal system. Biological Cybernetics, 86:15–28, 2002.

12. P. Gaussier and S. Zrehen. Perac: A neural architecture to control artificial animals. Robotics and Autonomous Systems, 16(2-4):291–320, December 1995.

13. S. Jacobson. Matching behaviour in the young infant. Child Development, 50:425–430, 1979.

14. M. Jeannerod. To act or not to act: perspectives on the representation of actions. Quarterly Journal of Experimental Psychology, 52A:1–29, 1999.

15. J. O’Keefe and L. Nadel. The hippocampus as a cognitive map. Clarendon Press, Oxford, 1978.

16. S. Moga, A. Revel, J.P. Banquet, P. Gaussier, and C. Joulain. Learning, recognition and generation of temporo-spatial sequences by a cortico-hippocampal system: A neural network model. In Proceedings of the conference on Vision, Recognition, Action: Neural Models of Mind and Machine, page 46, 1997.

17. J.A.S. Kelso. Dynamic patterns: the self-organization of brain and behavior. MIT Press, 1995.

18. S. Lepretre, P. Gaussier, and J.P. Cocquerez. From navigation to active object recognition. In The Sixth International Conference on Simulation for Adaptive Behavior SAB’2000, pages 266–275, Paris, 2000. MIT Press.

19. A. Meltzoff and M. K. Moore. Explaining facial imitation: A theoretical model. Early Development and Parenting, 6, 1997.


20. A. Meltzoff and M. K. Moore. Persons and representation: why infants imitation is important for theories of human development. In J. Nadel and G. Butterworth, editors, Imitation in Infancy, pages 9–35. Cambridge: Cambridge University Press, 1999.

21. D. Muir and J. Nadel. Infant social perception. In A. Slater, editor, Perceptual development, pages 247–285. Hove: Psychology Press, 1998.

22. J. Nadel. The functional use of imitation in preverbal infants and nonverbal children with autism. In A. Meltzoff and W. Prinz, editors, The Imitative Mind: Development, Evolution and Brain Bases (in press). Cambridge: Cambridge University Press, 2000.

23. J. Nadel. Imitation and imitation recognition: their functional use in preverbal infants and nonverbal children with autism. In A. Meltzoff and W. Prinz, editors, The Imitative Mind: Development, Evolution and Brain Bases, pages 42–62. Cambridge: Cambridge University Press, 2002.

24. J. Nadel, C. Guerini, A. Peze, and C. Rivet. The evolving nature of imitation as a format for communication. In J. Nadel and G. Butterworth, editors, Imitation in Infancy, pages 209–234. Cambridge: Cambridge University Press, 1999.

25. J. Nadel and C. Potier. Imiter et etre imite: leur role dans le developpement de l’intentionnalite. In J. Nadel and J. Decety, editors, Imitation, representations motrices et intentionnalite. Paris: PUF: Sciences de la Pensee, 2002.

26. J. O’Keefe. The hippocampal cognitive map and navigational strategies. In J. Paillard, editor, Brain and Space, pages 273–295. Oxford University Press, 1991.

27. M.W. Oram and D.I. Perrett. Modelling visual recognition from neurobiological constraints. Neural Networks, 7(6/7):945–972, 1994.

28. G. Rizzolatti. From mirror neurons to imitation: Facts and speculations. In A. Meltzoff and W. Prinz, editors, The Imitative Mind: Development, Evolution and Brain Bases (in press). Cambridge: Cambridge University Press, 2002.

29. W.B. Scoville and B. Milner. Loss of recent memory after bilateral hippocampal lesions. Journal of Neurology, Neurosurgery and Psychiatry, 20:11–21, 1957.

30. R.F. Thompson. Neural mechanism of classical conditioning in mammals. Phil. Trans. R. Soc. Lond. B, 329:161–170, 1990.

web site (papers): http://www-etis.ensea.fr/~neurocyber/Publications/Equipe NeuroCyber Publications.html

web site (videos): http://www-etis.ensea.fr/~neurocyber/Videos/index videos.html

Author Index

Abu-Hanna, Ameen 61
d’Aquin, Mathieu 304
Alonso-Betanzos, Amparo 284
Andreassen, Steen 264, 274
Andry, Pierre 377

Balser, Michael 132
Banquet, Jean Paul 377
Barahona, Pedro 324
Barry, Catherine 209
Bartenstein, Peter 117
Baud, Robert 199
Bellazzi, Riccardo 11
Bellazzi, Roberto 11
Bernard, Marc 365
Beveridge, Martin 76
Bey, Pierre 304
Bichindaritz, Isabelle 314
Bielza, Concha 299
Boaz, David 21
Boiocchi, Lorenzo 163
Bouaud, Jacques 46, 168
Bratko, Ivan 229
Buchholz, Hans-Georg 117
Burgun, Anita 51
Bury, Jonathan 158
Bush, Nigel 314

Caffi, Ezio 163
Capelle, Anne-Sophie 112
Castillo, Fortunato D. 335
Cavazza, Marc 101
Charbonnier, Sylvie 1
Chardon, Zulma 345
Ciccarese, Paolo 163
Ciocchetta, Federica 239
Colot, Olivier 112
Combi, Carlo 36
Connor, Mark 345
Cornet, Ronald 61
Cruz, Jorge 324

Dameron, Olivier 51
Dankel, Douglas D. 345
Darmoni, Stefan J. 81, 209
Dell’Anna, Rossana 239
Demichelis, Francesca 239
Demsar, Janez 229
Dhillon, Amar Paul 239
Dojat, Michel 91
Donaldson, Gary 314
Duelli, Christoph 132

Ewing, Gary 41

Fernandez-Maloigne, Christine 112
Fernandez del Pozo, Juan A. 299
Fontenla-Romero, Oscar 284
Fox, John 76, 142, 158, 335
Fraga-Iglesias, Ana del Rocío 284
Frank, Uwe 274
Freer, Yvonne 41

Gaag, Linda C. van der 254, 294, 340
Galperin, Maya 122
Gamberger, Dragan 244
Garbay, Catherine 91
Gaussier, Philippe 377
Geissbuhler, Antoine 199
Geldof, Marije 173
Georg, Gersende 168
Gibaud, Bernard 51
Gierl, Lothar 31
Glasspool, David W. 335
Golbreich, Christine 51
Graaf, Yolanda van der 340
Grabar, Natalia 189
Grasso, Floriana 179
Guijarro-Berdinas, Bertha 284

Habrard, Amaury 365
Halevy, Assaf 163
Harmelen, Frank van 132, 173
Hessing, Alon 122
Horn, Werner 350
Hunter, Jim 41
Hurt, Chris 158

Jacquenet, Francois 365
Jakulin, Aleks 229

Kansu, Emin 314
Kavalar, Maja Skerbinjek 249
Kokol, Peter 249
Kosara, Robert 152
Kristensen, Brian 274
Kukar, Matjaz 355
Kumar, Anand 71, 163

Larizza, Cristiana 11
Lavrac, Nada 244
Leibovici, Leonard 274
Lenic, Mitja 249
Lieber, Jean 304
Logie, Robert 41
Lu, Chuan 219
Lucas, Peter 299

Magni, Paolo 11
Marcos, Mar 132, 173
Mayaffit, Alon 122
McCue, Paul 41
McIntosh, Neil 41
Miksch, Silvia 152
Milward, David 76
Modgil, Sanjay 214
Moinpour, Carol 314
Monaghan, Victoria E.L. 335
Moret-Bonillo, Vicente 284
Moskovitch, Robert 122
Munn, Katherine 86
Murley, David 264

Nadel, Jacqueline 377
Napoli, Amedeo 304
Neveol, Aurelie 81

Oehm, Sebastian 117
Oliboni, Barbara 36

Papakin, Igor 86
Popow, Christian 350
Povalej, Petra 249

Quaglia, Alberto 239
Quaglini, Silvana 163
Quoy, Mathias 377

Rami, Birgit 350
Rasmussen, Bodil 264
Rees, Stephen 264
Renooij, Silja 294
Revel, Arnaud 377
Richard, Nathalie 91
Rijsinge, Wouter P. van 340
Rios, Maria 304
Rogozan, Alexandrina 81
Rossato, Rosalba 36
Ruch, Patrick 199

Saha, Vaskar 158
Sauvagnac, Catherine 304
Sboner, Andrea 239
Schmidt, Rainer 31
Schober, Edith 350
Schønheyder, Henrik C. 274
Sent, Danielle 254
Seroussi, Brigitte 46, 168
Shahar, Yuval 21, 122
Shalom, Erez 122
Siessmeier, Thomas 117
Simo, Altion 101
Smith, Barry 71, 86
Smrke, Dragica 229
Soualmia, Linda Fatima 209
Steele, Rory 142
Stefanelli, Mario 163
Stiglic, Milojka Molan 249
Sullivan, Keith M. 314
Suykens, Johan A.K. 219
Szallasi, Zoltan 375

Teije, Annette ten 132, 173
Timmerman, Dirk 219
Touzet, Baptiste 46

Uthmann, Thomas 117

Van-Gestel, Tony 219
Van-Huffel, Sabine 219
Vergote, Ignace 219
Visseren, Frank 340
Votruba, Peter 152, 173

Young, Ohad 122

Zalounina, Alina 274
Zavrsnik, Jernej 249
Zupan, Blaz 229
Zweigenbaum, Pierre 189
