


Journal of Medical Systems, Vol. 11, Nos. 2/3, 1987

Locative Inferences in Medical Texts

Paula S.D. Mayer, Guy H. Bailey, Richard J. Mayer, Argye Hillis, and John E. Dvoracek, M.D.

Medical research relies on epidemiological studies conducted on a large set of clinical records that have been collected from physicians recording individual patient observations. These clinical records are recorded for the purpose of individual care of the patient with little consideration for their use by a biostatistician interested in studying a disease over a large population. Natural language processing of clinical records for epidemiological studies must deal with temporal, locative, and conceptual issues. This makes text understanding and data extraction of clinical records an excellent area for applied research. While much has been done in making temporal or conceptual inferences in medical texts, parallel work in locative inferences has not been done. This paper examines the locative inferences as well as the integration of temporal, locative, and conceptual issues in the clinical record understanding domain by presenting an application that utilizes two key concepts in its parsing strategy--a knowledge-based parsing strategy and a minimal lexicon.

INTRODUCTION

The goal of our research is to develop a computer software system that takes as input clinical medical records transcribed from physician dictation, "understands" the text within a restricted medical domain, and builds a database from individual patient data that when aggregated together can be used as the basis of population studies on trends of diseases. As such, our research is an application in an area--natural language processing--where applications must rely upon "scruffy" techniques in designing a software system. Our research was immediately motivated by the needs of an allergist and a statistician who wanted to study the effects of residing in several different locales on allergy patients. On the basis of their experience, as many as 20,000 records must be processed in order to obtain 250 usable data points, making manual processing prohibitive in most studies.

The information to be extracted from the physicians' consultation records can be characterized into eight major areas: (1) the geographic location and duration of the current and all prior residences, (2) the age and location of the patient at emergence of the

From the Knowledge Based Systems Laboratory, Department of Industrial Engineering, Texas A&M University, College Station, Texas 77843, and the Scott and White Hospital, Temple, Texas 76508.


0148-5598/87/0600-0123$05.00/0 © 1987 Plenum Publishing Corporation


symptoms, (3) the allergist's diagnosis, (4) the recorded symptoms that led to the diagnosis, (5) the description of the recorded symptoms (severity, periodicity) as well as skin test results, (6) the recommended treatment and its effects, (7) occupational information including location, type of activity, and materials present in the workplace, and (8) demographic information, including date of birth, sex, marital status, and race of patient. The focus to date has been on areas 1, 2, 7, and 8. This paper will specifically report on the work in progress on area 1.

One of the interesting facets of this research has been to coordinate the relationship of time, event, and locale as they relate to the patient history. Individual clinical records contain such information, but because the data exist in a number of different formats and are gathered by a number of different physicians for the purpose of recording an examination of a single person, and not with population studies in mind, it currently must be collected, tabulated, and interpreted manually. Our research attempts explicitly to confront these problems by providing a computer system to read the texts, build patient profiles, infer missing or incomplete data whenever possible, and build an appropriate database.

Developing a natural language processing system (NLPS) 1,2 is an extremely difficult task, although it might at first glance seem deceptively easy because of the ease with which humans, including small children, manipulate and understand language. Understanding natural language involves not only understanding the meaning of individual words but also understanding the meaning and function of those words within a sentence or even a body of discourse. However, many English words are ambiguous; that is, the same word can be used either as different parts of speech or with different meanings. Further, in developing an applications-oriented NLPS, we have the classic problems of linguistic theory, including anaphora, ellipsis, and coordinate conjunctions, that remain open research questions, as well as the inherent ambiguity of English syntax. Humans process ambiguity because of a background or world knowledge that sets the context of the discourse and builds a script of what is expected. It is the difficulty of building into an NLPS this background knowledge component and a method for recognizing the functions of words in blocks of discourse that makes the task of designing computer NLPS so difficult.

MATERIALS AND METHODS

An NLPS consists of a parser and a semantic interpreter. The parser determines the syntactic structure of a sentence while the interpreter assigns meanings to these structures. To assign words to syntactic categories or parts of speech, the parser makes reference to a lexicon. Once assigned a part of speech, these lexical items are then assigned to larger syntactic structures, such as noun, verb, or prepositional phrases, according to a grammar component.

In its simplest form, the lexicon is a dictionary of possible lexical items and their associated syntactic categories. However, such a simple lexicon has a serious drawback: Many English words are inherently ambiguous on several different levels. First, a word may be syntactically ambiguous, with several possible category assignments. Thus the


word run may be either a noun or verb; the word that may be a pronoun, determiner, or subordinator. Second, a word may be lexically ambiguous, with several different meanings associated with the same syntactic category, as in the word run, which can be used as either a verb or a noun. As a verb, it has a range of meanings from "run a race" to "run away" to "run a risk" to "run a fever," while as a noun it can have such various meanings as in "a 10K run," "a long run," "the usual run of men," or "a run in my stocking." Humans easily process and understand these various senses because of the semantic cues provided by the larger body of discourse. Unfortunately, a large number of English words are ambiguous in one or both ways.
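The two kinds of ambiguity described above can be illustrated with a toy lexicon in which each word maps to a set of possible syntactic categories. The entries below are illustrative assumptions, not the paper's actual lexicon.

```python
# A toy lexicon: each word maps to its possible syntactic categories.
# Words and category sets are illustrative, not taken from the paper.
LEXICON = {
    "run":   {"noun", "verb"},
    "that":  {"pronoun", "determiner", "subordinator"},
    "he":    {"pronoun"},
    "fever": {"noun"},
    "a":     {"determiner"},
}

def categories(word):
    """Return the set of possible parts of speech for a word."""
    return LEXICON.get(word.lower(), set())

def is_ambiguous(word):
    """A word is syntactically ambiguous if it has several category assignments."""
    return len(categories(word)) > 1
```

A simple lexicon like this records only syntactic ambiguity; lexical ambiguity (the many senses of run within one category) would require additional sense entries per category.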

The best strategy for actually parsing and understanding sentence components of text is an open research question. Of the many prevailing approaches to NLPS, most fall into two broad categories--linguistic 3 or conceptual. 4 Linguistic systems maintain the parser and interpreter as separate components, but conceptual systems use some elements of the interpreter to constrain the parser. One such constraint mechanism is the use of case frames, 5 that is, the assignment of syntactic categories based on the semantic/syntactic relationship of constituents to the predicate of the sentence. Another is the use of a sublanguage system, 6 or a lexicon constrained by a particular domain.

For instance, each lexical item in the sentence

(1) He had experienced nasal congestion

represents a single syntactic category except for the word experienced, which has two syntactic categories. The interpreter of a linguistic parser would have to decide between these two interpretations, one of which would take the word experienced as the past participle of a verb synonymous with suffer. The second would assign experience to the adjective category, with the meaning "made capable by reasonable experience." A domain-specific conceptual parser might eliminate the second reading altogether by assuming that in the medical domain experience will only modify human nouns when it is used as an adjective. Such a semantic constraint would be imposed as part of the information given in the lexicon, either about the syntactic category of experience or about the case roles that it may fulfill. Conceptual parsing strategies differ in how lenient these constraints are.

Linguistically based systems rely primarily on knowledge of grammar (the syntax and morphology of a particular language) rather than on knowledge of a particular domain. Because these systems are syntax-driven, they may actually generate several meanings for the same sentence and may in fact generate interpretations that make no sense in the real world. Conceptual parsers, on the other hand, are guided in their parsing by their knowledge of some domain and thus eliminate interpretations that have no meaning for a particular domain. However, since conceptual systems are domain-specific, they cannot easily be generalized without the regeneration of the knowledge component. The grammar in a linguistic parser usually consists of a set of phrase structure rules used to bundle syntactic categories into syntactic constituents, but in a conceptual parser the grammar forms a set of sentence patterns anticipated by the system. Conceptual parsers attempt to incorporate some of the human's ability to disambiguate by providing extensive knowledge about a limited domain, thus imposing severe constraints on the use and cooccurrence of certain lexical items. For instance, in a conceptual parser,


the verb snore would require a human agent and an optional indicator of manner. These constraints are imposed by syntactic category restrictions on words in the lexicon and by case frames associated with verbs.

Regardless of the approach, parsers accept as input a subset of the text (we will assume a sentence-by-sentence parse) and match each lexical item (or word) with possible syntactic categories (or parts of speech). The string of syntactic categories is bundled into larger and larger syntactic constituents by the parser until the sentence can be processed, and the information content of the sentence incorporated into some representation strategy.
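A case-frame constraint of the kind just described for snore can be sketched as follows. The frame contents, role names, and the set of human nouns are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical case-frame check: the frame for "snore" requires a human
# AGENT and allows an optional MANNER slot, as described in the text.
# The frame table and feature marking are assumptions for illustration.
CASE_FRAMES = {
    "snore": {"required": {"AGENT": "human"}, "optional": {"MANNER": None}},
}

HUMAN_NOUNS = {"patient", "he", "she", "child"}

def satisfies_frame(verb, fillers):
    """Accept a parse only if every required case role is filled by a
    constituent of the required semantic type."""
    frame = CASE_FRAMES.get(verb)
    if frame is None:
        return False
    for role, sem_type in frame["required"].items():
        filler = fillers.get(role)
        if filler is None:
            return False
        if sem_type == "human" and filler not in HUMAN_NOUNS:
            return False
    return True
```

A parse whose AGENT is not a human noun is rejected outright, which is how a conceptual parser prunes readings that a purely syntactic parser would generate.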

RESULTS

Parsing Strategy

In our work we have adopted the conceptual approach because we believe that successful, efficient text processing and understanding depends upon building a computer system that combines grammatical knowledge with the expectations and constraints that a physician brings to the reading of a clinical record. Thus, we reject the notion (for our application) that a parser can operate autonomously on syntax, without regard to the domain of discourse. Our parser uses expectations of textual structure built into patterns based on the information structure of the material as well as grammatical expectations or case frames. For this reason, our system would not understand texts that lie outside that domain, such as newspaper articles or even articles in medical journals. The patterns for medical texts would expect information relating to a patient, either reported by the patient or observed by the doctor. The patterns also anticipate either the verb and preposition or the verb and case frame cooccurrence structures that help establish case frames. Since these structures are complex, the parser must sometimes rely on a scale of probability for understanding the meanings of lexical items that have a wide range of meanings. Our parser, for instance, would attempt to understand run as referring to a fever (as a verb) or perhaps to exercise (as a noun or verb); it would not, however, understand "run in a stocking" or other meanings of run outside of the medical domain.

In the domain of medical texts, for example, we have structured our set of patterns to expect text in the first or third person. For instance, we certainly do not expect imperative sentences. Moreover, most of our patterns anticipate that text in the third person refers to the patient; other references written in the third person are expected only in the section of the patient history text that describes family history. Further, these patterns are associated with processing sentence frames that anticipate certain types of information once the verb phrase is isolated. Since this information is often denoted by the cooccurrence of verbs and prepositions or verbs and syntactic functions (such as subject or direct object), our processor is tuned to search for the anticipated semantic case markers.

Because of their importance in medical research, our NLPS must deal with time, place, movement, symptoms, attributes of patients, their relationships, and diagnosis. In the development of our prototype, we have concentrated on formulating the basic concepts necessary for making inferences about the effects of geographic movement. We


have tried to develop these concepts independent of any single parsing strategy. However, our current parsing strategy is based on a control mechanism as follows: First, a gross parse of the sentence is made in order to get the essence of its information content. This is accomplished through the use of a heuristic of trying to isolate the verb of the sentence and determine the number of words in the sentence that can be recognized. If the gist of the sentence is understood in this first pass, it is sent to be processed by a subset of the grammar and lexicon dealing with the particular information content associated with that verb in the second pass of the parser.

For instance, using the locative domain, the first step in this processing can be thought of as scanning the sentence to determine if locative information is present. The initial step scans the sentence for verbs that provide locative information (such as live or reside), for geographic designators (such as the words area, south, or coast), and for prepositions that often mark locative cases (such as in or from . . . to). This initial step, then, is actually a partial parse to determine whether or not the sentences contain data that are of interest, with those that do not being discarded. Among the sentences that would be isolated on the initial scan are the following:

(2) He was born and raised in Houston for the first 7 years of his life.
(3) He has lived in Dallas the past 5 1/2 years.
(4) She moved to Central Texas in July 1984.
(5) He has lived in Central Texas for over 40 years.
(6) He lives in a 7-year-old house.

but not

(7) As a child, she had asthma especially when the trees were pollenating. (8) She does have an indoor cat and this does seem to aggravate her.

Sentence (6) succeeded in fooling the first pass of the parser, but it is to be discarded in the second pass as explained below. It should also be noted that this strategy is heuristically based; that is, it is not immune to giving false results. It may, in fact, misidentify a word that is ambiguous. The robustness of the heuristics is based on the completeness of the cooccurrence sets and patterns defined. These sets and patterns have been derived from linguistic analysis on actual medical texts.
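The first-pass scan can be sketched as a keyword filter over a sentence. The small word sets below are assumptions for illustration, standing in for the paper's larger cooccurrence sets; note that the filter reproduces the false positive on sentence (6).

```python
# First-pass filter, following the paper's description: keep a sentence
# if it contains a locative verb, a geographic designator, or a
# preposition that often marks locative case. These word sets are
# illustrative assumptions; the paper's actual sets are larger.
LOCATIVE_VERBS = {"live", "lives", "lived", "reside", "resided",
                  "moved", "born", "raised"}
DESIGNATORS = {"area", "south", "coast"}
LOCATIVE_PREPS = {"in", "from", "to"}

def first_pass_keep(sentence):
    """Partial parse: return True if the sentence may carry locative data."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return bool(words & (LOCATIVE_VERBS | DESIGNATORS | LOCATIVE_PREPS))
```

Sentence (6), "He lives in a 7-year-old house.", is kept by this filter even though it carries no geographic information; as the text explains, the second pass is responsible for discarding it.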

In the second parsing stage, the processor further parses and builds a data structure of the information content into a locative case frame. This is accomplished by further analyzing those sentences on the basis of the type of verb that occurs. Verbs that carry locative information can be grouped into a small set of classes based on the case frames with which they are used, thus providing the parser with a limited set of patterns. We have found four groups of verbs that either contain or imply locative information in our corpus of medical texts. These verb classes are shown in Figure 1. These verbs differ in the type of locative and temporal case frame markers that follow. The case markers, generally prepositions, are grouped into classes on the basis of two criteria. First, the locative and temporal case markers are in separate groupings. Then, the case markers--whether locative or temporal--are also grouped into classes based upon the syntactic patterns in which they occur. Figure 2 shows a hierarchy of the case marker groups.

In our system, the processing of locative information uses the notion of a minimal


Verb Classes

Group 1: live, reside, skin-test, occur, born and raised, born, raised
Group 2: see, return, come, go
Group 3: move, remain, grow up
Group 4: be, transfer, leave

Figure 1. Verb classes indicating locative information.

lexicon. We cannot incorporate every geographic place name (state, county, city, country, area, etc.) in the world in the lexicon; further, some places may be indicated by directional indicators (e.g., "west of here") rather than by specific place names. We have built our system on the basis of characteristic geographic patterns derived from our linguistic analysis of the medical texts. Locative information may be present as either the object of prepositions in verb/preposition cooccurrence sets or in the various syntactic functions associated with verbs, but both rely on successfully isolating the verb. Once the verb is isolated, the system will detect geographic locative information in one of two ways: by actually matching the cooccurrence sets in the case of locative information marked by designators or by eliminating all other possibilities. This elimination of all other possibilities is used in identifying proper nouns of geographic locales. For instance,

[Figure 2 shows the hierarchy of case marker groups. Markers divide into Locative Markers and Temporal Markers. Locative Markers comprise State Markers (at, in, on, near, and no marker) and Process Markers (from, to). Temporal Markers comprise Duration Markers ((for) over, (for) as long as, (for) at least, for, in, during, and no marker), To Markers (until, through, before), and Punctual Markers (since, after, when, as soon as, at).]

Figure 2. Locative and temporal case markers.


if a phrase is determined to be locative but does not fit a pattern, it is assumed to be a proper place name. As a result, we have only the smallest lexicon necessary for isolating sentences with information about place and time.

In the following examples we would expect our system to isolate sentences (9) and (10), but not sentence (11):

(9) She lived in the northeastern area . . .
(10) She lived in Paris, Texas . . .
(11) She lived in a white house . . .

All three sentences contain a verb-preposition combination that usually signals locative information. The additional presence of a geographic designator, area, provides further help in isolating (9). Neither (10) nor (11) contains such a designator, but the parser must somehow determine that (10) refers to a geographic locale while (11) does not and is probably not useful. At this point, the parser must make use of lexical information, which indicates that house is not a geographic locale. Example (10) is retained by the processor because Paris, Texas does not match either any known designated geographic pattern or a pattern that is known not to be a geographic pattern. Therefore, it is assumed to be a place name. Thus, (11) is eliminated while (10) is retained for further parsing.
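The two detection routes just described, designator matching and elimination of all other possibilities, can be sketched as follows. The designator and non-geographic word lists are illustrative assumptions, standing in for the system's minimal lexicon.

```python
# Classify the object of a locative verb-preposition combination, per the
# paper's two routes: (a) a geographic designator marks it as locative;
# (b) otherwise it is assumed to be a proper place name unless the lexicon
# lists its head noun as a known non-geographic object. Word lists are
# illustrative assumptions.
DESIGNATORS = {"area", "south", "coast", "northeastern"}
NON_GEOGRAPHIC_NOUNS = {"house", "cat", "job"}  # known non-places in the lexicon

def classify_object(phrase):
    words = [w.strip(".,").lower() for w in phrase.split()]
    if any(w in DESIGNATORS for w in words):
        return "locative (designator)"
    if words and words[-1] in NON_GEOGRAPHIC_NOUNS:
        return "not locative"
    return "locative (assumed place name)"
```

This is why the minimal lexicon suffices: only the designators and the known non-places need entries, and everything else that survives elimination is treated as a place name, as with Paris, Texas in (10).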

It is interesting to note that, in fact, a case marker need not be present to indicate temporal or locative information. For instance,

(12) He lived in West Texas 40 years

has no case marker indicating 40 years as a temporal (durative) case.

(13) He resided there since 1968

has no case marker indicating there as a locative case. In this case, our processor relies on the word order following the verb. For each class of verbs, a word-order pattern for locative and temporal information can be found. In our classes of verbs, a locative is expected to precede a temporal when a single case marker is absent.
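The word-order fallback for unmarked constituents can be sketched as below. The crude test for temporal-looking constituents (digits or time-unit words) is an assumption for illustration; the paper's system uses per-verb-class word-order patterns.

```python
# When no case marker is present, the paper's rule is that a locative
# constituent precedes a temporal one after the verb. The "looks
# temporal" test here is a simplifying assumption.
TIME_UNITS = {"year", "years", "month", "months", "week", "weeks"}

def looks_temporal(constituent):
    words = constituent.lower().split()
    return any(w.isdigit() or w in TIME_UNITS for w in words)

def assign_unmarked(constituents):
    """Assign case roles to unmarked post-verbal constituents:
    a locative is expected to precede a temporal."""
    roles = {}
    for c in constituents:
        if "LOCATIVE" not in roles and not looks_temporal(c):
            roles["LOCATIVE"] = c
        elif looks_temporal(c):
            roles["TEMPORAL"] = c
    return roles
```

For sentence (12), the unmarked "40 years" is assigned the temporal role; for sentence (13), the unmarked "there" is assigned the locative role.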

In determining both the verb subclasses and the case frames, we have made extensive use of domain-specific constraints. Our treatment of the single lexical item, the case marker in, demonstrates the use of domain-specific constraints in parsing. The word in can be used as a preposition (in Texas), an adverb (come in), an adjective (the in crowd), or a noun (he has an in with the boss). Within our domain, however, in is used almost exclusively as a preposition; we have restricted its use accordingly. As a preposition, though, in still has a variety of functions, indicating manner (written in pencil), state of being (in luck), time (in the summer, in 1968), or place (in the lake, in Texas). Only two of these, place (locative) and time (temporal), are common in our domain, so we further restrict in to those uses. Our system distinguishes locative and temporal uses of in by examining its cooccurrence with particular verbs. For example, when used with move, in is most likely temporal, as in

(14) He moved to Waco in 1982.

but not

(15) *He moved in Waco.


The input 'FRED J SMITH RESIDED IN THE CENTRAL TEXAS AREA FOR OVER 40 YEARS' was parsed into {*LIVE* (*MOOD DECLARATIVE) (AGENT (FRED J SMITH)) (LOCATIVE (THE CENTRAL TEXAS AREA)) (DURATION (40 YEARS)) (*VOICE ACTIVE) (*TENSE (PAST))}. The case frame was then mapped into {*LIVE* (*TENSE (PAST)) (*VOICE ACTIVE) (DURATION (40 YEARS)) (LOCATIVE (THE CENTRAL TEXAS AREA)) (AGENT (FRED J SMITH)) (*MOOD DECLARATIVE)}.

Figure 3. Language Craft output of Fred J. Smith resided in the Central Texas area for over 40 years.

The input 'SHE HAS BEEN IN CENTRAL TEXAS SINCE 1968' was parsed into {*BE* (*MOOD DECLARATIVE) (AGENT SHE) (LOCATIVE (CENTRAL TEXAS)) (FROM-TEMPORAL 1968) (*VOICE ACTIVE) (*TENSE (PRESENT PERFECT))}. The case frame was then mapped into {*BE* (*TENSE (PRESENT PERFECT)) (*VOICE ACTIVE) (FROM-TEMPORAL 1968) (LOCATIVE (CENTRAL TEXAS)) (AGENT SHE) (*MOOD DECLARATIVE)}.

Figure 4. Language Craft output of She has been in Central Texas since 1968.


The input 'ANNA JONES WAS BORN AND RAISED IN NEW YORK FOR THE FIRST 7 YEARS OF HER LIFE' was parsed into {*LIVE* (*MOOD DECLARATIVE) (ABSENT-TEMPORAL (THE FIRST 7 YEARS OF HER LIFE)) (AGENT (ANNA JONES)) (LOCATIVE (NEW YORK)) (*TENSE (INDETERMINANT))}. The case frame was then mapped into {*LIVE* (*TENSE (INDETERMINANT)) (LOCATIVE (NEW YORK)) (AGENT (ANNA JONES)) (ABSENT-TEMPORAL (THE FIRST 7 YEARS OF HER LIFE)) (*MOOD DECLARATIVE)}.

Figure 5. Language Craft output of Anna Jones was born and raised in New York for the first 7 years of her life.

When used with live, it can be both locative and temporal, as in

(16) He lived in Central Texas in 1982.

The use of in as both locative and temporal case markers might seem to be a potential source of ambiguity, but word-order patterns serve to disambiguate them nicely. The kind of temporal information conveyed by in is quite different according to the verb it occurs with. With move, in marks an event that happens at one particular time; with live, in indicates duration.

While these different uses of in (as a marker of locative case and as a marker of two different kinds of temporals) might seem troublesome for parsing at first glance, the cooccurrence of verb and preposition and sentence fragment patterns resolve potential ambiguities and allow for an accurate reading of the text. Because the locative information in our texts is contained in clauses headed by a limited number of verbs, these relationships can easily be made explicit, and the locative and temporal data can be extracted in an efficient manner.
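The verb-conditioned reading of in can be sketched as a small lookup. The table covers only move and live, and the digit test for year objects is an illustrative assumption standing in for the system's richer patterns.

```python
# Disambiguating the preposition "in" by verb cooccurrence, per the paper:
# with "move" an in-phrase is read as punctual temporal; with "live" it
# may be locative or durative temporal. The digit test for the object is
# a simplifying assumption.
def classify_in_phrase(verb, obj):
    obj_is_year = obj.strip().isdigit()
    if verb == "move":
        # "He moved to Waco in 1982." is fine; "*He moved in Waco." is not.
        return "TEMPORAL (punctual)" if obj_is_year else "ill-formed"
    if verb == "live":
        # "He lived in Central Texas in 1982.": locative + durative temporal.
        return "TEMPORAL (durative)" if obj_is_year else "LOCATIVE"
    return "unknown"
```

The classification matches examples (14) through (16): in 1982 with move is punctual, in Waco with move is rejected, and with live the two in-phrases split into locative and temporal readings.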

For our prototype, we are utilizing the Language Craft 7 package marketed by Carnegie Group because it utilizes the case frame grammar that forms the basis of our linguistic analysis. Figures 3 through 5 illustrate sample parses of sentences. This is output generated by Language Craft. After sentences are isolated from the text on the first scan,


the parser assigns case frame labels to each of the constituents. Note that constituents may be phrasal as well as single words. Time information, unlike that about place, is actually categorized into several different case frames according to the kind of temporal relations specified. The various locative and temporal case frames then become the input to the inferential mechanism.

Inferential Mechanism

A clinical researcher collects a set of possibly dissimilar observations about an occurrence of a disease in a population and attempts to find a pattern or cause to account for the observations. This is inductive reasoning, in contrast to the deductive reasoning used by the physician/practitioner to diagnose a disease given a set of symptoms and laboratory results for an individual. It is this difference in the reasoning done that will account for many of the research questions in developing an inferential mechanism. For our prototype, we have not yet dealt with the symptom/disease aspect, but only with the extraction of locative and temporal information, which utilizes inductive reasoning.

Simply extracting locative and temporal information, of course, will not answer our medical research questions. Understanding written text sufficiently to extract information required for determining the geographic location and duration of the current and all previous residences involves the extraction of situation descriptions from the parsed input forms and then reasoning across those descriptions to deduce the proper data. That is, the record of the physician's consultation with a patient does not normally include a specific (carefully researched) history of the patient's prior life. What does appear frequently is a series of indicators that can be used to determine part of that history. Thus, scattered throughout the text one may find such references as

He is a white 40-year-old migrant worker from Waco . . . . He received treatment for allergic rhinitis while working in Brownsville for several years after coming to this country . . . . He was born and raised in Matamoros, Mexico . . . . He has not received any treatment for the past 11 years.

This type of information describes the timing of treatments and symptom emergence but only indirectly describes the past history of movement and residence of the patient. In processing a text, the human reader can readily make temporal and locative inferences, but duplicating the same ability in a computer system is computationally difficult. For that reason, while the inferences may seem simple to the human reader, they, in fact, make transparent the difficulty of the underlying inferential mechanism. Some of the desired information can be reasoned using reconstructed timeline manipulation and from exploiting the common uses of durative phrases. Thus, for example, in our texts, the use of the phrase born and raised without any additional qualification referred to a time period of 16 to 21 years. Knowing this, we can, in the preceding example, hypothesize an accounting for the first 21 years of the patient's life. We can signal the text processor that this is a hypothesis and hence specifically look for contraindications. We also know that he resided in the Brownsville area for at least 7 years. However, we cannot explicitly account for his location for the past 11 years since all we know from the text is that during that time he did not receive any treatment.

Temporal information in the clinical records is of two kinds: point-time and durative. Time markers (e.g., lexical items such as dates and prepositional phrases) usually mark events--such as kindergarten, graduation, World War II--or implicitly mark a specific time relative either to the individual or to a generally known event. However, we have to infer much of the durative information from sequences of point-time events. In a few cases, we must infer the actual sequence of events from durative markers. To accomplish this, we treat the extracted time data points as either a point or a line segment in a graph with links representing precedence relationships.

Figures 6 through 8 illustrate the inferential mechanism. In Figure 6, we have an example of an initial inference that is then replaced by a stronger, second inference. Both Figures 6 and 7 illustrate cases where time and locale can be coordinated for the patient's entire lifetime. However, Figure 8 illustrates text where locale and time cannot always be correlated because we cannot infer that the list of places is given chronologically. In this case there are too few data to make any kind of logical inference. Thus, it is important that constraints be built into the inferential mechanism that prohibit it from making inferences where they are not warranted.

She was born and raised in and near Hamilton, Texas. She lived in Mississippi during the winter of 1969 and was in Laredo, Texas from 1969 through 1972. She returned to Central Texas where she has remained since 1972.

[Figure 6 depicts a timeline from the patient's date of birth (1947) to the date of exam (1986), with markers at 1954, 1968, 1969, and 1972. Central Texas, Laredo, Texas, and Mississippi are known from the text; Hamilton, Texas is inferred. First inference: she lived in Hamilton, Texas from 1947 until 1954. Second inference: she lived in Hamilton, Texas from 1947 until 1968.]

Figure 6. Second inference replaces first inference.
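The replacement of a weak first inference by a stronger one, as in Figure 6, amounts to extending each known stay to the start of the next anchored event; when a later sentence anchors a nearer event, re-running the fill yields the stronger interval. A minimal sketch, with our own names rather than the original system's:

```python
def fill_gaps(residences):
    """residences: list of (place, start_year) in chronological order.
    Returns (place, start, end) triples, extending each stay to the start
    of the next anchored stay. The final stay is left open-ended here."""
    out = []
    for (place, start), (_, nxt) in zip(residences, residences[1:]):
        out.append((place, start, nxt))
    return out

# First inference: only a 1954 anchor is known to follow the Hamilton stay.
print(fill_gaps([("Hamilton, TX", 1947), ("?", 1954)]))
# [('Hamilton, TX', 1947, 1954)]

# Second inference: the Mississippi winter pins the end at 1968 instead.
print(fill_gaps([("Hamilton, TX", 1947), ("Mississippi", 1968)]))
# [('Hamilton, TX', 1947, 1968)]
```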


He was born in California. He has lived in Central Texas 1 1/2 years and was in southeastern Colorado a year and a half before that, Arizona a year and a half before that, and California before that.

[Figure 7 depicts a timeline from the patient's date of birth (1958) to the date of exam (1986): Central Texas from 1984, southeastern Colorado from 1983, and Arizona from 1981 are known from the text. Inference: he lived in California from 1958 until 1981.]

Figure 7. Inference mechanism.
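The back-chaining in Figure 7 starts at the exam date and subtracts each "a year and a half before that" in turn, with the birthplace absorbing the remaining span back to the date of birth. A sketch under those assumptions (fractional years are truncated for display, which reproduces the years shown in the figure; all identifiers are our own):

```python
def back_chain(exam_year, birth_year, stays):
    """stays: list of (place, duration_years), most recent first.
    Chains durations backwards from the exam date; the final (birthplace)
    interval absorbs whatever span remains back to the birth year."""
    intervals, end = [], float(exam_year)
    for place, dur in stays:
        start = end - dur
        intervals.append((place, int(start), int(end)))
        end = start
    # "He was born in California" supplies the earliest residence.
    intervals.append(("California", birth_year, int(end)))
    return intervals

for iv in back_chain(1986, 1958,
                     [("Central Texas", 1.5),
                      ("SE Colorado", 1.5),
                      ("Arizona", 1.5)]):
    print(iv)
```

Running this yields the 1984, 1983, and 1981 boundaries of Figure 7 and the inference that he lived in California from 1958 until 1981.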

DISCUSSION

Our work with clinical records suggests that the efficient, accurate retrieval of data from those records requires a conceptual approach to natural language processing. As a result, we are devising a conceptual approach that makes use of the following:

1. Knowledge of the domain (i.e., the kinds of expectations that a physician would bring to the text).

2. Knowledge of the discourse and linguistic structure of medical texts to derive expected syntactic and semantic patterns based on classes of verbs and the case frames associated with them.

3. A minimal lexicon (i.e., a lexicon that does not, and in fact cannot, list all of the possible items).

4. A successive parsing procedure, with preliminary parses isolating parts of text containing relevant information and successive parses extracting that information.

5. A lexicon and set of syntactic patterns that make use of domain-specific constraints to eliminate ambiguities.

6. A parsing strategy that makes use of multiple structural cues, such as verb type and verb/preposition co-occurrence, to assign structures to phrases and clauses.
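Items 4 through 6 can be illustrated with a two-pass sketch: a cheap first pass isolates sentences carrying residence cues, and a second pass extracts place/duration pairs from them. The cue list and patterns below are our own simplifications, not the lexicon or patterns of the actual system.

```python
import re

# Minimal cue lexicon standing in for the domain-specific verb classes.
CUES = ("lived", "born", "raised", "moved", "resided")

def first_pass(text):
    """Preliminary parse: keep only sentences likely to carry residence data."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text)
            if any(c in s.lower() for c in CUES)]

def second_pass(sentence):
    """Successive parse: extract (place, years) pairs like 'in Houston 3 years'."""
    return re.findall(r"in ([A-Z][\w, .]*?) (\d+(?:\s*1/2)?) years?", sentence)

text = ("He was born in California. The exam was unremarkable. "
        "He has lived in Central Texas 2 years.")
kept = first_pass(text)
print(kept)                    # two cue-bearing sentences survive
print(second_pass(kept[-1]))   # [('Central Texas', '2')]
```

The point of the staged design is that the expensive extraction patterns only ever run on the small subset of text the preliminary pass has already deemed relevant.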


She has lived in the Hillsboro area for 2 1/2 years and before that lived in Houston 1 1/2 years, Florida 1 1/2 years, Houston 3 years, Paris, Texas 7 years, Memphis, Tennessee 6 months, Alabama 1 year, Washington State 2 years, Ohio 3 years, Tennessee 2 years, and about one year each in Mississippi, Hawaii, and Georgia, and before that was raised in Mississippi.

[Figure 8 depicts a timeline from the patient's date of birth (1952) to the date of exam (1986), with the listed stays unanchored in time. Inference: she lived in Mississippi from 1952 to 1969. No other temporal/locative inferences can be made.]

Figure 8. Sample text for which an inference cannot be made.
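Figure 8 motivates the guard mentioned above: when the stays cannot be anchored in chronological order, the mechanism should return no intervals rather than guess. A sketch of such a guard, where the `ordered` flag is our own stand-in for whatever evidence of chronological ordering the parser has gathered:

```python
def chain_if_anchored(anchor_year, stays, ordered):
    """Chain (place, duration) stays backwards from anchor_year, but only
    when the stays are known to be listed chronologically; otherwise make
    no inference at all."""
    if not ordered:
        return None  # too few constraints: refuse to guess
    intervals, end = [], anchor_year
    for place, dur in stays:
        intervals.append((place, end - dur, end))
        end -= dur
    return intervals

# The Figure 8 text gives no ordering evidence, so nothing is inferred.
print(chain_if_anchored(1986, [("Hillsboro", 2.5)], ordered=False))  # None
```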

Although we have developed a promising prototype that will extract geographic and temporal information from medical records and will organize that information into a time frame, much remains to be done. We still must coordinate that information with data about symptoms and their severity before we can begin answering the research questions put forward by our medical team. However, we believe that the approach outlined here is the most promising method for answering those questions.
