a method of measuring information in language, applied to medical texts

21
0306-4’93’8.’ 53 00 + .oo C 1985 Pergamon Press Ltd. A METHOD OF MEASURING INFORMATION IN LANGUAGE, APPLIED TO MEDICAL TEXTS DANIEL B. GORDON and NAOMI SAGER Linguistic String Project. New York University, 251 Mercer Street, New York. NY 10012 (Revisedfor publication October 1984) Abstract-In this study, quantitative measures of the information content of textual material have been developed based upon analysis of the linguistic structure of the sentences in the text. It has been possible to measure such properties as: (1) the amount of information contributed by a sentence to the discourse; (2) the complexity of the information within the sentence, including the overall logical structure and the contributions of local modifiers; (3) the density of information based on the ratio of the number of words in a sentence to the number of information-contributing operators. Two contrasting types of texts were used to develop the measures. The measures were then applied to contrasting sentences within one type of text. The textual ma- terial was drawn from narrative patient records and from the medical research lit- erature. Sentences from the records were analyzed by computer and those from the literature were analyzed manually, using the same methods of analysis. The results show that quantitative measures of properties of textual information can be devel- oped which accord with intuitively perceived differences in the informational com- plexity of the material. INTRODUCTION The problem of measuring information is not new. Yet the impressive results in this domain (e.g. information theory) have dealt with statistical properties of messages as they bear on the capacity of channels to transmit messages, rather than with the content of the messages themselves. Thus, Shannon writes in his Introduction to The Math- ematical Theory of Communication[ 11: The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the mes- sages have meaning; that is, they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communi- cation are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. (Italics in original) The approach taken in this study is somewhat different. Rather than investigating the degradation in the amount of information contained in a message, we have endea- vored to develop quantitative measures of the amount of information contained in a message. Such measures would make it possible (1) To compare the amount of information in two different texts; (2) To characterize a text by its quantity of information; (3) To compute and compare the densities and complexities of information in various texts. Shannon makes an important point in saying that such an approach cannot be taken without reference to “some system with certain physical or conceptual entities”. Text must be interpreted in the light of a theory which accounts for how the text carries information. We are able to approach these questions because of advances over the past 25 years in the description and computation of language structure. Measuring structural properties of textual information depends on the ability to separate the information-

Upload: daniel-b-gordon

Post on 02-Sep-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A method of measuring information in language, applied to medical texts

0306-4’93’8.’ 53 00 + .oo C 1985 Pergamon Press Ltd.

A METHOD OF MEASURING INFORMATION IN LANGUAGE, APPLIED TO MEDICAL TEXTS

DANIEL B. GORDON and NAOMI SAGER Linguistic String Project. New York University, 251 Mercer Street, New York. NY 10012

(Revisedfor publication October 1984)

Abstract-In this study, quantitative measures of the information content of textual material have been developed based upon analysis of the linguistic structure of the sentences in the text. It has been possible to measure such properties as: (1) the amount of information contributed by a sentence to the discourse; (2) the complexity of the information within the sentence, including the overall logical structure and the contributions of local modifiers; (3) the density of information based on the ratio of the number of words in a sentence to the number of information-contributing operators.

Two contrasting types of texts were used to develop the measures. The measures were then applied to contrasting sentences within one type of text. The textual ma- terial was drawn from narrative patient records and from the medical research lit- erature. Sentences from the records were analyzed by computer and those from the literature were analyzed manually, using the same methods of analysis. The results show that quantitative measures of properties of textual information can be devel- oped which accord with intuitively perceived differences in the informational com- plexity of the material.

INTRODUCTION

The problem of measuring information is not new. Yet the impressive results in this domain (e.g. information theory) have dealt with statistical properties of messages as they bear on the capacity of channels to transmit messages, rather than with the content of the messages themselves. Thus, Shannon writes in his Introduction to The Math- ematical Theory of Communication[ 11:

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the mes- sages have meaning; that is, they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communi- cation are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. (Italics in original)

The approach taken in this study is somewhat different. Rather than investigating the degradation in the amount of information contained in a message, we have endea- vored to develop quantitative measures of the amount of information contained in a message. Such measures would make it possible

(1) To compare the amount of information in two different texts; (2) To characterize a text by its quantity of information; (3) To compute and compare the densities and complexities of information in various

texts. Shannon makes an important point in saying that such an approach cannot be taken

without reference to “some system with certain physical or conceptual entities”. Text must be interpreted in the light of a theory which accounts for how the text carries information.

We are able to approach these questions because of advances over the past 25 years in the description and computation of language structure. Measuring structural properties of textual information depends on the ability to separate the information-

Page 2: A method of measuring information in language, applied to medical texts

270 D.B. GORDON and N. SAGER

bearing portions of sentences from the results of all other linguistic operations that have entered into the final form. The first step toward such a description of language structure was the discovery of linguistic transformations by Z. Harris[Z]. with the result that every sentence could be analyzed into component elementary sentences with cer- tain transforming and combining operations on them, and the further result that these operations (transformations) were themselves composed of elementary operators that on the whole were either incremental (information-bearing) or paraphrastic[3]. The isolation of informational processes in language was then formalized in a theory of grammar in which the information-contributing operations in building a sentence are a separate component of the grammar and produce a predicational structure for sen- tences that corresponds. recognizably, to the information carried by the sentence [4]. This structure can be represented by a parenthesized structure or by an operator- argument tree, in which each higher operator is interpreted as a predication on the argument(s) it immediately dominates. For example, from [4] (as reprinted, p. 674):

relate to (1. grow (tree). increase (rain))

= 1 relate the tree’s growth to the rain’s increase.

Here, the incremental informational operator relate to has the three arguments I, gro\\‘, itlcrease (syntactically the subject and two-part object of relate ro): the arguments grolt’ and increase are themselves informational operators with the respective arguments tree and vain (where the subjects of grolt- and increase appear as modifiers of the nominalized forms of the verbs).

The remaining portion of the operator grammar (the non-information-contributing part) is found to consist of “reductions”, i.e. shortenings of the full information form by such devices as substituting a distinguished word (pronoun), dropping repeated or indefinite reconstructable words (zeroing) and shortening words to become affixes, etc.[5. 61. These operations. in various combinations. ti,hen applied to the information form of the sentence. result in the sentence as we see it. For example. if the above example sentence had as subject the indefinite sot?reo)ze in place of I. then the permitted reduction of indefinites to zero. in combination with certain other operations. would provide the passive form: The tree’s gro~,rh is related to rhe rain’s increase (or: to the increase in rain). which contains no additional substantive information over the purely informational form.

Harris shows in [6] that this type of analysis can be applied to the whole language, laying bare, in the composition of each sentence, the contribution of each informational operator and its relation to the other informational operators and primitive arguments in the sentence. This result provides us, finally, with units suitable for quantitative measurement of properties of information that are related to its structure. At the sim- plest, one would expect that sentences carrying, intuitively, more information than others, will contain more operators of the informational type. Other informational prop- erties should also be measurable based on the number, types and configurations of the informational operators.

During this same period in which a theory of language structure related to infor- mation was developing. the techniques for computerized natural language processing were also increasing in extent and power. Comprehensive bibliographies of these de- velopments are to be found in several recent books, where the work is viewed in the contexts of information retrieval[7], artificial intelligence and cognitive science[8], se- mantic representation in limited domains[9]. text processing and informational repre- sentation[ lo], discourse processes[ 111, and speech understanding[ 121, to name a few.

Regardless of the ultimate purpose, workers have found that to obtain useful rep- resentations of language material and to process it with some degree of generality, regularities within the language data have to be formulated and made part of the pro- cessing procedures. In our own case (the LSP system of NYU) and in several other systems (e.g. the IBM REQUEST/TQA system[l3, 141). a first stage of parsing that captures “surface” grammatical regularities is followed by a stage of transformational

Page 3: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts 271

analysis that produces a more regular, underlying, or “deep”, structure of the sen- tence[l5, 161.

The use of linguistic transformations to regularize parse tree structures in both the systems cited above is in the service of reaching a final representation suited to a particular application. TQ.4 is a natural-language question-answering system, serving as a front-end to numerical databases. In the LSP case, the purpose is to obtain a database of structured information from free-text input[l7], and, in related research, to develop a natural-language query system for such a database[l8]. It has not been necessary, in the material processed to date, to carry out a full operator-argument decomposition in the transformational stage of processing. Still. many of the operations are the same or similar to those that would result in an operator-argument analysis, so that, with adjustments. we have been able to use the LSP transformational implemen- tation to obtain the structures that figure in the present investigation.

METHODS

1. Measures based OH operator coutlts In this study,. we based the measurement of information on the operator-argument

type of sentence analysis, although the full operator grammar of [6] could not be used. What was possible was to adjust the results of sentence decomposition for the different types of texts so that comparisons of operator counts would be valid. In addition, in our decomposition trees, as in the operator-argument analysis of [6]. a hierarchy of operators in the grammatical decomposition of the sentence corresponds to a hierarchy of predications in the sentence. Thus. the number and distribution of the operators in the operator-argument tree can serve as the basis for the.measurement of the infor- mation contained in a sentence. Moreover, the conversion of a text into its operator- argument representation depends mainly on general properties of natural language rather than domain-specific information. If texts similar in complexity should turn out to have similar values in measurements based on an operator-argument analysis, then a quite general method would exist for quantified comparison of the amount of infor- mation and related properties in texts in different subject areas.

We decided to make the following counts and calculations. for each sentence and its operator-argument tree:

(1) The number of words in each sentence: (2) The number of operators in each sentence; (3) The ratio of the number of words to the number of operators (the inverse of the

density operator); (4a) The maximum depth of nesting of each operator-argument tree, including that

of local modifiers; (4b) The maximum depth, not including that of local modifiers;

(5) The product of (2) and (4a), as a measure of overall complexity. The motivation for these measures was straightforward. Item (2), the number of

operators in a sentence, is an index of the amount of predication contained in a sentence. Since the predications installed by a sentence are at least one sense of the “meaning” of a sentence, we reasoned that (2) would be one measure of the amount of information carried by a sentence. The inverse of (3) is at least a rough measure of the density of information per sentence. It reflects the amount of compacting (dropping of words that can be reconstructed on the basis of grammatical or contextual regularity). We felt that (4) would measure one feature of informational complexity, viz. the extent to which the predications of the sentence were predicated “on top of’ each other. A sentence could have a large number of operators without those operators being deeply nested (e.g. by each element having one or more local modifiers); measuring (4b) would give an independent estimate of the complexity of predication within a sentence. The product of (2) and (4a), item (5), would be a more satisfactory measure of complexity because it would “reward” a sentence for having either high operator content or deep nesting, and would serve as a composite index of the relative effects of the two.

Page 4: A method of measuring information in language, applied to medical texts

2-E D. B. GORDON and N. SAGER

2. Measures delqeloped for contrasting types of text One obvious problem in developing measures of such properties of information as

amount and complexity is the need for an independent source ofjudgment as to whether the measurements do indeed capture our intuitive (human-perceived) evaluation of these properties. Suppose the results of our calculations of complexity were to show that sentence SI is twice as complex as S2. Who would judge this result, and on what basis say “Yes. Sl is twice as complex as S2” or “No, Sl is oniy lf times as complex as S2”?

To avoid the large task of setting up scaled subjective judgments of human ob- servers, we divided our problem into two stages. First we would use highly contrastive types of text-say, Tl and TZ-for which there was general agreement, because of the type of text, that T2 was carrying more, and more complex. information than was Tl. and we would develop measures whose numbers clearly reflected this contrast. Then. having gained a certain amount of confidence in the measures, we would examine sentences, within the same type of text, whose values of the measure indicated they should have differing properties, and see whether our human reaction as readers dis- cerned these differences. A controlled experiment with human subjects would have been the ideal situation, but in the absence of this possibility, the techniques we used had gratifyingly consistent results that suggest that the method is worthy of further development and testing.

The contrasting types of texts used in the investigation were: Tl Factual reports-documents containing the narrative of the events in one partic-

ular situation; T2 Theoretical texts-portions of published articles describing mechanisms of action

of complex processes. For the Tl material, we chose texts extracted from hospital discharge summaries.

Figure 1 shows an example of this type of text. Typically, the sentences in this material are short and declarative, and their concatenation into discourse represents only an accumulation of facts.

For the T2 material, we chose an article reviewing advances in the theory of lipid metabolism, and portions of other articles in this field. In sharp contrast to the discharge summary material, these sentences are individually complex, and the discourse as a whole is marked by the reference in later sentences to earlier sentences. A page of the review text is shown in Fig. 2.

The conversion of the discharge summary material to an operator-argument form was performed by computer. For this we made use of the LSP natural-language pro- cessing system. Although the system as a whole is designed to convert natural-language texts into database entries, the first two components- the English parse and the English transformational decomposition- are perfectly general and were adaptable to our task.

The English parse[lO] produces a structure for each sentence representing the “surface” grammatical features of the sentence. The output of the parser for one sen- tence from the discharge summary material is shown in Fig. 3.

The transformational component operates on such a parse tree. Its effect is to undo the paraphrastic (non-information-bearing) reductions of the full information form that are found in the surface text, leaving a tree with only the information-bearing relations represented in the structure. From the standpoint of the laws governing the “packing” of information into natural-language text, the details of these transformations, their mode of operation and their integration into a consistent sequence of processing are subjects of some interest. These are described in detail in (161. What was needed for measurements, however, was a “skeleton” of this tree, where only the operators, arguments and their relations of dominance were expressed in the tree structure. For this purpose, additional computer processing of the discharge summary material was necessary.

A program of some generality for making the “skeletons” was developed. The input to the program is a specification of which substructures are required and how they are to be represented. The “skeletonizing” program scans each parse tree, ex-

Page 5: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts 273

Ped. Ser.

HIS : This is the 2nd adm. for this 13 and lo/‘12 yr. old Hispanic male who presented with a history of one day of headache, vomit- ing, fever, anorexia and the onset of a stiff neck on the morning of adm.

The patient was in his usual state of good health until the morning prior to adm., when he developed a severe bifrontal headache with no diplopia, photophobia, tinnitus or seizure activity. Shortly after the onset of the headache, vomiting ensued with 8 episodes by history. No blood or coffeeground material or bilious material was noted. The patient also had a low grade fever on the morn- ing of adm. He noted the onset of a stiff neck and a rash that day. He was rushed to the PES, seen and immediately admitted to rule out meningitis.

The patient apparently had a URI last week which resolved. On trans- fer to the floor, he was lethargic and complaining of a very severe stiff neck.

PAST MED. HIS: The patient %as previously admitted to Center secondary to an auto accident in which he sustained a blowout fracture of the right orbit and a fracture of the floor of the left orbit, as well as a fractured zygoma. This required extensive repair. He has had residual ptosis and blurred vision.

In Aug. of 1975 the patient had headaches, vomiting, facial rash, enuresis and again a stiff neck while in Florida. He was admitted to Hospital in Orlando, Florida and diagnosed by spinal tap to have a meningitis; etiology not quite clear. The patient was treated with 10 days of IV antibiotics.

ALLERGIES: None known.

IMMUNIZATIONS: Complete.

ACUTE INFECTION DISEASE: The patient has had the chickenpox, measles and the mumps.

ROS : Noncontributory.

Fig. 1. Sample of REPORT textual material

tracting the desired substructures in the desired mode of representation. This sequence of substructures is then submitted to a simple bottom-up parser for correct nesting and marking of distinguished operators.

We hope in further research in this area to link the “skeletonizing” program to a high-level description of the substructures of interest, so that researchers sophisticated in linguistics or information science but unsophisticated in computer science can use the program to query collections of parse trees. Such a system would compile the user’s description into tables suitable for input to the skeletonizing program, which would then display, store and statistically manipulate the substructures described.

For the research summarized in this article, tables for extracting simplified op- erator-argument skeletons from the decomposition trees were manually constructed

Page 6: A method of measuring information in language, applied to medical texts

274 D. B. GORDON and N. SAGER

Regulation of Plasma Cholesterol by

Lipoprotein Receptors

Michael S. Brown, Petri T. Kovanen. Joseph L. Goldstein

Recent advances tn the gencttcs and

cellular btoiog) of cholesterol metabo-

ltsm hare provided new Instights mto the

control of plasma cholesterol levels tn

ma” It is now apparent that normal

humans possess effictent mechamsms for

the removal of cholesterol from plasma

I/). Th!c dtsposal process depends on

receptors located on the surface of cells

I” the ltver and extrahepauc tissues The

receptors btnd ctrculatlng lipoproteins

thar transpon cholesterol tn the blood-

stream. thereby Inlrlarmg a process b,

wh!ch the Itpoprowms are lake” up and

degraded by cells. yxlding their choles-

terol for cellular uses (2).

experimenrs show that the number of

Itpoprotetn receptors m the Itver. and

hence the rate of removal of cholesIerol

from plasma. IS under regularton and that

the “umber of receptors can be Increased

b) certatn cholesterol-lowcnng drugs 13.

41. Thts neu tnformatlon eventually may

reveal wh) dIfferen indlbtduals respond

d!KerenIly to dIetar) cholesterol

In lhts antcle we ret1.w the general

features of the llpoprors~n transpon 5)s.

tern worked out In srudles from man)

laboratones o\er the past ?Z years (5)

We then focus on the hpoprote~n recep-

lors of liver and extrahepauc t~cwes and

their role in reguldtlng plasma choiester-

Summary The ltpoprotern transpofl system holds the key to understandmg the mechanisms by which genes, duel. and hormones mleract to regulate the plasma cholesterol level I” man. Crucral components of thrs system are kpoprotetn receptors In the liver and extrahepatic hssues that mediate the uptake and degradation 01 cholesterol-carryng kpoprotems The number of llpoproteln receplors. and hence the eflciency of drsposal of plasma cholesterol, can be Increased by cholesterol-lowerlng drugs. Regulatton of lipoprotein receplors can be explolted pharmacologically in the therapy 01 hypercholesterolemra and atherosclerosis In man

The lipoprotem receptors are compo- 01 levels Fmall). we ratse the posslblltt)

“ents of a” inregrated Iranspon system that pharmacologic mantpulatton of the

that shuttles cholesterol contmuousl~ hcparic and extrahepatlc recrptors ma)

among Intestine. Iwer. and extrahepatlc habe therapeutic lmponance m Ihe treat-

tissues (I). An tnreresttng feature of the men1 of hypercholesterolemia and arh-

system ts that the hpoprotems are de- erosclerosls I” man.

graded as they dehbcr thctr cholesterol

IO Iissues, while the cholesterol sur-

vives, cvenrually to be excreIed from the The Lipoprotein Transport System

tissues bound IO new hpoproreln carrI-

ers. ExII of cholesterol from the body The ltpoprotein transpon system car-

occurs only when the sterol is transport- nes IWO classes of hkdrophobtc Itplds.

cd IO the liver for cxcrelion tnto the bile. triglycerides lcsrers of gl\ccrol and long-

Because of the continuous cycling of cha1” fatty acids) and cholesteryl esters

cholesterol into and oul of the blood. (esters of cholesterol and long-chat” fal-

stream, the plasma cholesterol concen- ty acids). Before they can be used by

tration is not a simple additwe function cells, the triglycendcs and cholestcryl

of dietary choleslerol Intake and endoge- esters must be hydrolyzed to liberate

n’ous cholesterol synthesis. Rather. tt fatty acids and unestertfied cholesterol.

reflects the rates of synthesis of the respecwcly (I). Tnglycerides are deltv-

cholesterol-carrying lIpoproteini and the l red primanly IO adtpose tissue and mus-

eUicIency of rhe receptor mcchantsms cle where the fatty actds are sIored or

that determine their calaboltsm. Recent oxidized for energy. Cholcsteryl esters

628 m3t-ruv~a~iOsoww8so: ww cOpynshl c tw ~~4s

are transported IO all body cells where

the unesrenficd sterol IS used as a strt~c.

fur-al COmponent of plasma membranes

These esters also supply cholesterol for

syntheses of sterotd hormones and bile

actds.

For transporr in plasma. tnglycetides

and cholestetyl esters are packaged tnto

lipoprotcm particles tn which they form a

hydrophobic core surrounded by a sur-

face monolayer of polar phosphohptds

The surface coar also contains unesten-

lied cholesterol in relatively small

amounts together uith protetns called

apoprotems (5). Through inIeracrlons

wtth enzymes and cell surface receptors.

the apoprorems dtrect each hpoprotem

to 11s site of metaboltsm.

The lipoprotem transpon pathway can

be dIvtded conceptually tnto exogenous

and endogenous systems that Iranspotl

hptds of dietary and hepallc ongtn. re-

spcclwely tFtg. I) Both s)stems be@”

utrh the recreuon of tnglycertde-rxh

lipoprolem+-mrertmal ch)lomtcron< I”

the exogenous ,y~tem and hepattc \er!-

lowdensity ltpoprolems (VLDL) m rhe

endogenous system Each of these panl-

cles contams an apoproteln called apoB.

uhtch malnlalns 11s structural mtegrrl)

and uhlch remains wrh the panicle

lhroughout 11s Intercon\erstons I” the

plasma. Recent srudtes lndlcate that the

apoB of ch)lomlcrons ts no1 tdenwai 10

that of VLDL (6).

&.10Ke”<)US ilpld ,r‘l”Y,OQrr. A lyplcal

Amencan adult absorbs about IO0 grams

of tnglycende and 250 mlllqrams of cho-

le\terol from the diet dad) tFtg Il. The

tntesltne tncorporales there ltptds into

chylomlcrons. huge hpopro~e~ns Idlame-

ter. 800 IO .Co(K, angstromst that are se.

creted tnto the lymph and from there

enter the bloodstream. Inasmuch as chb-

lomlcrons are IOO large IO cross the c”-

dotheltal bamer. they must be metabo-

hzed uhlle ~1111 tn lhe bloodstream For

thts purpose the chylomtcrons btnd to

ltpopro~em ltpase, a” enzyme tE C.

3 I.1 34) that IS fixed to the lumlnal

surface of the endotheltal cells that it”c

caplllanes of achpose and muscle ItssUes

(Ftg. ?A). The chylomicrons contat” a”

apoprorem (C-11) that acttvates the lt.

pase. which hberates free fatty acids and

monoglycendes The fatly acids enter

Ihe adJacen1 muscle or a&pose cells

where Ihey are either oxidtzed or rees-

rentied for storage (I). As the triglyccr-

ide core is depleted. the chylomicro”

Fig. 2. Sample of theoretical textual material.

Reprinted by permission from Scwn~~. Vol ?I?. pp. 62X-635. Ftg.. 8 Ma) 1981. Wmlcn b! M 5. Broun er al C 1981 AAAS

and run on the decomposition trees from the discharge summary material. The skeleton of the example sentence of Fig. 3 is shown in Fig. 4. In Fig. 4 it should be noted that the transformational expansion of conjunctions results in four separate assertions for the four symptoms mentioned (stiff neck, abnormal posruring, diarrhea and other focal signs), each carrying the modifier no history of, and connected via the relational ad- jective compatible with to meningitis. Note also that a transformation replaced the or

Page 7: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts yij

occurring under negation by arzd (No X or Y + No X and No Yj. A full listing of the discharge summary sentences used in this research and their skeletons is given in [19].

The lipid-review sentences of T2 were analyzed manually into an operator-argu- ment tree structure, A sampfe sentence and its tree representation are shown in Fig, 5, and the fufi set of sentences, together with their representations and an explanation of methods and conventions used in the notation, is given in [XII.

One feature 0-f the analyses of the [E) theorry type sentences wks the ~~~~~~~~j~~ made in these analyses between what we may call ifngical complexity, which is based on the global relations between arguments and operators, and the ~a1 complexity, which reflects the contribution of modifiers whose action is purely local. As an example

The cholesterol ester formed in HDL . . .

is analyzed as

cholesterol ester * (be formed in/ cholesterol ester/ HDL)

where “be formed in HDL” is a local modifier of “choJesteroJ ester”* Xn grammatical terms, “formed in HDL” is a reduced relative cfause modifying the noun cholesterol ester. This constructian would be represented in the output of the computerized trans- formational decomposition as

(REL-CLAUSE . . .

(ASSERTfUN =

(SUBJECT- ‘cholesterol ester’)

res3 of assertiotl

f.4SSERTfuN =

(SUBJECT = “cholesterol ester’)

(VERB = ‘FORM’ ‘1N‘)

(OBJECT = “HDL’)

Since we wished to accumulate measures for the lacal modifiers (“adjuncts”) sepam rately, we needed a way to make the two representations commensurate. A diagram showing this transformation appears In Fig. 6. Because of the more regular format of the cuMp~t~r-generated i~forrnatjo~~ it was easier to map trees from the discharge- materiaf sentences to the lipid sentences rather than Gee versa. Also, the mapping was done only virtually in the execution space of the program in order to save the original skeleton representations of the discharge-material trees.

Another step to make the two representations commensurate was necessary. The transformational prucessor in the LSf system treats afl ~o~ju~ct~ons as binary operators (a sequence of conjoined ASSERTfONS has a cunju~ctjon connective between each ASSERTION), while in the lipid materia1 all conjuncts at the same fevel were (more properly, from the point of view of information complexity studies) treated as n-ary operators (a sequence of conjoined subtrees is operated on by a single conjunction operatsr). la order to make the skeleton data conform to this analysis, the bottom-up parser of the “skeletonizer” program was used to mark the first conjunct on each feveI.

Page 8: A method of measuring information in language, applied to medical texts

TN

TE

NC

E

TE

XT

LET

iLD

S

EN

TE

NC

E

MO

RE

SE

N

I INT

RO

DU

CE

R-

CE

NT

ER

-EN

qMA

RK

AS

SE

RT

ION

‘.

*

SU

BJE

CT

-TE

NS

E-V

ER

B

YB

JEC

T -R

V

“TH

ER

E

* I LV

-VV

AR

-RV

;B

JEC

TB

E

:V

OB

JBE

&A

S

;ST

G

P

I m

LN

R

L!N

0 N

VA

R

I -7

” s

iPO

S -

QP

OS

-AP

QS

-NS

PO

S-N

PO

S

“1

TN

P

$

I!TR

H

X

‘;”

$ I LT

-T

2

A0

“‘--‘;

-”

TG

o 7

OF

N

ST

G

CA

I b

LNR

F

I LN

I

N;A

R-C

OM

MA

ST

G-R

N

I I

TqO

S-Q

PO

S-A

PpS

-NS

PO

S..P

OS

‘;’

*I

+ *2

*

LTR

A

DJA

OJ

NE

CK

I

I LT

LA

RI

I LA-?

VA

R-

RA

’ LC

DA

-AD

J ShF

F

Fig

. 1.

Pat

-w t

ree

of

HID

SM

1.

1.13

.

Page 9: A method of measuring information in language, applied to medical texts

*I+

‘:’

II I ,

“-N

OT

OP

T-

SA

CC

NJ

-0-C

ON

J A

DJI

NR

N

I LA

R

I LA

dV

AR

-R

I I

I L

CD

A -

AD

J P

N

I

I LP

-P

-NS

TG

O

I I

CO

LlP

AT

lBL

E

WIT

H

NS

TG

I LNR

I LN

N

VA

R

-RN

I I

TP

OS

-O

PO

S-

AP

OS

-N

SP

OS

-NP

OS

N

I I

LT

R

ME

NIN

GIT

IS

I LT

LN

-NV

AR

--C

OM

MA

ST

G

I “’

V

ING

,

“-N

OT

OP

T-S

AC

ON

J -0

-CO

NJ

I PO

STU

RIN

G

I YA

R

- O

YS

TG

N

I “O

R”-

SA

CO

NJ-

0-C

ON

J

DIA

RR

HE

A

I LN -

N

VA

R

I N I

TP

DS

-O

PD

S-

AP

OS

-NS

WS

-NP

OS

F

OC

AL

--S

IGN

S

I I

LT

R

AD

JAD

J

I I

LT

L

AR

I

I T

PO

S -

OP

OS

-AP

CS

-NS

PO

S

-NP

OS

LA

-AV

AR

-A

1

I I

I L

TR

A

DJA

DJ

LC

DA

-AO

J I

I

I L

T

LA

R

(

AB

NO

RM

AL

I L A

--:V

AR

-

RA

3

IhA

-AD

J I OT

HE

R

Fig

. 3.

(C

on

/inrm

f).

Page 10: A method of measuring information in language, applied to medical texts

X8 D. B.GORDON and N. SAGER

EXPLANATION OF NODE NAME ABBREVIATIONS

TEXTLET nodes to allow more than

one CENTER within

MORESENT a sentence

CENTER independent unit terminated by, an ENDMARK

INTRODUCER slot for paragraph headings (e.g. HISTORY:)

Terminal symbols

N noun

ADJ adjective

TV tensed verb

T article

P preposition

Coordinate Conjunction Strings/Nodes

COMMASTG conunction string headed by comma

ORSTG conjunction string headed by or

NOTOPT optional nor following conjunction

SACONJ subset of sentence adjuncts in conjunction strings

Q-CON J body of conjunction string, mimics host

Phrase types

TENSE

OBJECTBE

OBJBE

NSTG

NSTGO

LNR

LARI

LTR

slot for modals (e.g. will, may)

object set of the verb be, including participles

non-verb containing objects of be

noun string

noun string in object

noun with left and right adjuncts prenominal adjective with left and right adjuncts

article with left and right adjuncts

Left adjuncts

LN left adjuncts of noun LV left adjuncts of verb

LA left adjuncts of adjective

LCDA left adjunct position of compound adjective

LT left adjunct of article

LP left adjunct of preposition

Right adjuncts-similar to left adjuncts

Local variants

LVAR variants of noun

AVAR variants of adjective

VVAR variants of verb

Left noun adjuncts (LN) positions

TPOS article position

QPOS quantifier position

APOS adjective position

NSPOS possessive-of-type position

NPOS noun position %‘o~es t(, Fig. 3.

Page 11: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts

+ HIDSM 1 1 13

l THERE WAS No HX OF STIFF NECK ABNORMAL

l POSTURING DIARRHEA OR OTHER FOCAL SIGNS

l COMPATIBLE WITH MENINGITIS

I lCONJCINEC=',', ( (REL-CLAUSE='TRN-FILLIK' ! (ASSERTION=

( SUBJECT = INEG-FIND='NC' 'HISTORY 'PAST' 'OF' 'STIFF' 'NECK')

~vERB='EXIST' 'PAST']

( IRELATION='COMPATIBLE' 'IFITH'I (FRAGMENT=

(NSTG='STIFF' 'NECK')

(FRAGMENT= (NSTG='MENINGITIs',

(CONJOINED=','1

(REL-CLAUSE='TRN-FILLIN', (ASSERTION=

ISUBJECT= (NEG-FIND='NO' 'HISTORY' 'PAS?' 'OF' 'ABNoRMAL' 'POSTURING'J

(VERB='EXIST' 'PAST',

(RELATION='COMPATIBLE' 'WITH' 1 (FRAGMENT=

(NSTG='ABNORMAL' 'POSTURIN;'

;FRAGMENT= INSTG='MENINGITIs'

;-CONJOINED='AND'I

(REL-CLAUSE='TRN-FILLIN' / (ASSERTION=

(SUBJECT= (NEG-FIND='NO' 'HISTORY' 'PAST' 'OF' 'DIARRHEA',

(VERB=~EXIST' 'PAST' I

( (RELATION='COMPATIBLE' 'WITH' / (FRAGMENT=

(NSTG='DIARRHEk' 1 (FRAGMENT=

INsTG='MENINGITI",

(REL-CLAUSE='TRN-FILLIN' (ASSERTION=

(SUBJECT= (NEG-FIND='NO' 'HISTORY' 'PAST' 'OF' 'OTHER' 'FOCAL' 'SIGK'

'PLURAL')

~VERB='EXIST' 'PAS?')

1 (RELATION='COMPATIBLE' 'WITH', (FRAGMENT=

(NSTG='FOCAL' 'SIGN' 'PLURAL',

(FRAGMENT= (NSTG='MENINGITIS',

Fig. 4. Skeleton of HIDSM I. 1. I?

Page 12: A method of measuring information in language, applied to medical texts

2X0 D. B.GORDON and N. SAGER

UBS 6- Cholesterol ester is removed from LDL in the splanchnic bed and LDL part;cie stability maintained by addltron of triglyceride.

SBS 6:l iand (be removed from *In the splanchnic bed /cholesterol ester /LDL 1

(by (malntalned istability/LDLi) (addition of/triglyceride)))

Fig. ?. Example of transformational decomposition in operator-argument form.

For statistical purposes, only this distinguished conjunct was counted in the operator scores.

Other points that were considered in relation to commensurability of the two rep- resentations are the following.

Both vocabularies have phrases in which the component elements are not treated separately. although in other environments one of the morphemes may be an op- erator. Some examples from the REPORT texts are: teeth chatterkg chiIls, alive and well (a standard expression in Family History), respiration rate, localizing signs. Some examples from the THEORY texts are: the 2-position of lecithin, mutarrr FH cells. membrane bindirrg assay. These are fixed phrases in these sub- languages. In both types of texts adverbs and adverbial PN’s are considered local adjuncts of the associated operator and are not counted as separate operators. This includes Iocatives and time expressions. Time expressions are more frequent in the RE-

REPORT MATERIAL

ASSERTION 2 SUBJ VERB 2 OBJ 2

OPERiTOR 2 ARG 4 ARG 3

EQUIVALENCE MAP

Page 13: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts 281

PORT (clinical) texts, whereas locatives are more frequent in the THEORY (phys- iological mechanism) texts. In the latter, locatives include some words that would not ordinarily be thought of as locations (e.g. diet): these are source-sites for the substances that appear as arguments of associated operators. Some examples of adverbial modifiers are the following. LOCATIVES: to thefloor, in Orlando, Flor-

ida, around both eyes, in capillaries, in tissue culture; TIME: last brseek, after a

single high cholesterol meal: OTHER ADVERBS: grossly, mostly, intral,enously.

3. Negation and modal operators (e.g. no, \tsould) are counted as operators in both data sets.

4. The conventions regarding quantity expressions are the following: (4a) Quantity nouns (amount, rate, etc.), occurring as the argument of a quantity

verb (increase, decrease, etc.) are taken as separate operators since their dominating operator requires it (but respiration rate is not decomposed. cf. (1) above).

(4b) If a quantity verb occurs without a quantity noun as its argument. an abstract operator (such as QAMT) is supplied as its argument since one is required (this happens only rarely in the data of this study).

(4~) A numerical quantity with its units is treated as a separate operator. For example, in the REPORT texts.

EXDSM 3.1.5 RESP. 14.

contains one operator. Similarly (with the verb be as carrier of the operator)

HIDSM 1.1.16 CHILD WAS 5 LBS. 1 OZ.

is counted as containing just one operator. (4d) Non-numerical quantifiers (e.g. much and se\.ere) are treated as operators.

5. The same connectives are treated as operators in the decompositions of the two types of texts, with several exceptions that did not affect the results. These are noted in the full report on this work in 1191. Their representation in the decom- positions is different, and required the mapping described above.

6. Verbs are counted as operators in both sets of decompositions. In the THEORY text decompositions nominalized verbs are transformed to the infinitive form of the verb. In the computer outputs of decompositions of REPORT text sentences the occurrence of ASSERTION or FRAGMENT implies the presence of an op- erator. Most nominalized forms of verbs in this sublanguage occur under a verb operator (had 2 transfusions during hospitalization) or as a noun FRAGMENT derived from such a construction (2 transjirsions during hospitalization). Counting ASSERTIONS and FRAGMENTS gives the same count as if the nominalized verbs were decomposed as far as main clauses of sentences are concerned.

RESULTS

Tabulations were obtained for the following variables: WORDS Number of words in sentence OPS Total number of operators w/o Ratio of words to operators ADJ Number of operators in “adjuncts” (local modifiers) NAJ Number of operators not in “adjuncts” DMAX Maximum depth of nesting including “adjuncts” DNAJ Maximum depth of nesting not including “adjuncts” O*D Product of DMAX and OPS, the “overall complexity”.

The results for the two types of textual material are summarized in Table I, which also shows the ratios between the measures for the two sets of data. Complete tables of results for all sentences of both types are given in [21].

We included the number of words in the sentence as an intuitive indicator of the

Page 14: A method of measuring information in language, applied to medical texts

282 D. B. GORDOX and N. SAGER

amount and complexity of the information in a sentence. We were interested in including this measure (w:hich is the only measure obtainable in many installations) in order to compare it to our more theoretically motivated measures.

The number of operators is one measure of the amount of information in a sentence. It should be noted that sentences in the theoretical texts, in a number of cases, refer to structures in previous sentences. This referencing mechanism can be thought of as introducing prior information into the later sentence. As we have not counted these referred-to operators as part of the referring sentence, the operator count measures the ne\t~ information contributed to the ongoing discourse by the referring sentence. By including these referred-to operators we would obtain a measure of the total in- formation included in the sentence, Such a measure of total information would show a considerably elevated THEORY/REPORT ratio in respect to number of operators, since the THEORY texts contain many more referentials than do the REPORT texts. For the total operator counts in Table 1, the ratio between the theoretical texts and the factual report texts is 2.37.

The number of words per operator is a rough inverse measure of the density of information in a sentence. The two types of texts were roughly comparable in this variable (the ratio is 1.24). despite the great difference in types of textual material. Part of this similarity! is an artifact of the processing of the theoretical texts, where long metatextual material (e.g. “It seems clear from the studies of BG . . .” in UBS 8) and muhi-w-ord phrases (e.g. “integrated transport system . . .” in UBS 12) were analyzed as a single operator, On the other hand. there must exist a limit to the amount of packing and reduction that can be performed on information in the process of encoding it into

Table 1. Comparison of measures for contrasting types of text

1. REPORT Texts CT1 t

**““*TOTALS

Total number of sentences Average number of words Average number of operators Average number of v.ords per operator Average number of operators in adjuncts Average number of operators not in adjuncts Average depth of nesting Average depth of nesting ewluding adjuncts Average operator-nesting product

2. THEORY Texts CT21

*****TOTALS

TotaI number of sentences Average number of words Average number of operators Average number of words per operator Average number of operators in adjuncts Average number of operators not in adjuncts Average depth of nesting Average depth of nesting excluding adjuncts Average operator-nesting product

3. T2Tl Ratios

*****TOTALS

Total number of sentences 62 Average number of \sords 21.61 Average number of operators 6.53 Average number of words per operator 3.79 Average number of operators in adjuncts 1.40 Average number of operators not in adjuncts 5.11 Average depth of nesting 4.03 Average depth of nesting excluding adjuncts 3.25 Average operator-nesting product 29.24

Theory Text

113 6.73 2.75 3.06 0.38 2.37 1.92 1.86 7.02

62 21.61 6.53 3.79 I .40 5.12 4.03 3.25

29.24

Report Text

113 6.73 2.75

Ratio

- 3.21 2.37

3.06 I .24 0.38 3.68 2.37 2.16 1.92 2.10 1.86 1.75 7.02 4.17

Page 15: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts 283

sentences, and material at or near this limit would be expected to have the same density, regardless of the complexity of the information. Greater complexity could arise in such a case only from a longer sentence.

The number of operators in adjuncts is a measure of how complicated the local modifiers are in a sentence. The number of operators not in adjuncts is a measure of how complicated the logical structure of the sentence as a whole is. The ratios in both variables are significantly greater than 1 in favor of the THEORY texts (3.68 for the operators in adjuncts. 2.16 for the operators not in adjuncts), indicating that the the- oretical material is more complex both in the local modifiers and in the overall logical structure.

One feature of complexity is depth of nesting. For our sentences, we measured both the maximum depth of nesting and the maximum depth excluding local modifiers. In the report material, the two measures are essentially the same (1.92 vs. 1.86). This accurately reflects the shallow structure of local modifiers in the report material. In the theoretical texts, on the other hand, there is a more pronounced difference (4.03 vs 3.31), indicating greater complexity of modifier structures. In fact, the ratio (not shown in table 1) between the depths of nesting in adjuncts alone would be 13.0. The ratios for the given depth-of-nesting indices are 2.10 and 1.75.

As expected, the product of the number of operators and depth of nesting enhanced the contrast between sentences with regard to complexity of information. The ratio of the averages for the two types of text is 4.17. Thus, while the ratio of sentence lengths was 3.21, the ratio of complexity, as measured by O*D, is almost one-third higher. reflecting what we believe to be the case for textual material-namely, that the increase in complexity is not a linear function of the number of words in the sentence.

Looking at Table 1 as a whole, it is clear that the differences in information com- plexity between two radically different types of textual material can be expressed quan- titatively, based on the structural properties of the sentences. Indeed, with the excep- tion of the inverse density, the various measures are roughly in the same ratio, about 2.75.

Since the measurements on the two contrasting types of texts were at least grossly consistent with what one would expect for such different materials, it was now possible to study the performance of these measures on sentences within-a single type of textual material.

Table 2 shows the high and low extremes of sentence groups from the theory

Table 2. High- and low-complexity groups of T2 sentences

A. High O’D B. Low O*D Sentence # O*D OPS Sentence # O’D OPS

Al. O*D>60 B1. O*D = 12 UBKG 2.7.3 102 17 UBKG 2.5.1 12 4 UBS II 84 I? UBKG 2.5.3 12 4 UBKG 3.35 78 13 UBKG 2.7.5 12 4 UBKG 4.7.3 77 11 UBKG 2.9.2 12 4 UBS 13 72 1’ Fig. 16.2 60 lo’

AZ. 50 < O*D < 59 B2. O’D = 9

UBS 4 55 11 UBS 2 3 UBKG 3.1.7 55 11 UBS 5 ; 3 UBKG 4.6.7 55 11 UBS 12 9 3 UBKG 2.8.3 54 9 UBKG 2.8.2 9 3

UBKG 2.9.6 9 3

A3. 40 -=z O’D < 49 B3. O’D < 6

UBS 3 45 9 UBKG 2.4.7 6 3 UBKG 2.7.4 44 11 UBS 8 4 2 UBKG 2.9.5 45 9 UBKG 2.4.9 1 1 UBKG 4.6.3 45 9 UBKG 2.8.4 40 8 UBKG 8.4.3 40 8

Page 16: A method of measuring information in language, applied to medical texts

284 D. B. GORDON and N. SAGER

Table 3. Low-complexity groups

BI. O’D = 12

U BKG 25.1: The depleted chylomicron is released from the capillary wall and reenters the circulation, U BKG 2.5.3: The remnant (diameter, 300 to 800 A) is carried to the liver where it binds to receptors on

the surface of hepatic cells. U BKG 2.7.5: With the typical American diet, which is high in cholesterol, about 1100 mg of sterol is lost

from the body each day. U BKG 2.9.2: The VLDL particles interact with lipoprotein lipase in capillaries, releasing most of their

triglycerides.

B2. O*D = 9

UBS 2: HDL are thought to accumulate cholesterol from peripheral tissues. UBS 5: Alternatively, VLDL may be the exclusive source of LDL cholesterol. UBS 12: The lipoprotein receptors are components of an integrated transport system that shuttles cholesterol

continuously among intestine, liver, and extrahepatic tissues. U BKG 2.8.2: These lipoproteins also contain cholesterol, which will be delivered to extrahepatic cells. U BKG 2.9.6: The excess surface materials, mostly phospholipids and cholesterol, are transferred to HDL.

B3. O’D = 6

U BKG 2.4.7: As the triglyceride core is depleted. the chylomicron shrinks. UBS 8: It seems clear from the studies of B & G that LDL is also the maior vehicle for transnort of cholesterol

to peripheral tissues. U BKG 2.4.9: A similar transfer reaction occurs in the endogenous lipoprotein transport system.

Table 4. High-complexity groups

Al. O*D 2 60

U BKG 2.7.3: Much of the cholesterol and bile acid secreted by the liver is reabsorbed in the intestine and again delivered to the liver for excretion, thus forming an enterohepatic circulation.

UBS 11: The receptors bind circulating lipoproteins that transport cholesterol in the bloodstream, thereby initiating a process by which the lipoproteins are taken up and degraded by cells yielding their cholesterol for cellular use.

U BKG 3.3.5: The mutant FH cells grow in tissue culture because they adapt to the unavailability of LDL- cholesterol by increasing HMG CoA reductase activity and cholesterol synthesis.

U BKG 4.7.3: These particles accumulate in plasma because their removal rate fails to increase in proportion to the increased production of chylomicrons and VLDL that occurs during cholesterol feeding.

UBS 13: Recent experiments show that the number of lipoprotein receptors in the liver, and hence the rate of removal of cholesterol from plasma, is under regulation and that the number of receptors can be increased by certain cholesterol lowering drugs.

U BKG Fig. 16.2: This pharmacologic manipulation takes advantage of the knowledge that the receptors are normally under feedback regulation and that the number of receptors in the liver can be increased when the liver’s demand for cholesterol is increased.

A2. 50 I O*D I 59

UBS 4: In either case, with VLDL breakdown, the cholesterol ester would arrive at LDL. U BKG 3. I .7: The number of receptors in fibroblasts can be increased by certain hormones that stimulate

cell growth. including insulin, thyroxine. and platelet-derived growth factor. U BKG 4.6.7: This increased efftciency of LDL clearance, plus a 50 percent reduction in the synthetic rate

of LDL produced by mevinolin, contributed to a remarkable 75 percent drop in the plasma level of LDL-cholesterol in the treated dogs.

U BKG 2.8.3: When dietary cholesterol is available, the liver uses that source of sterol. derived from the receptor-mediated uptake of chylomicron remnants, for lipoprotein synthesis.

A3. 40 c O*D = 49

UBS 3: The cholesterol ester formed in HDL may be transferred either to VLDL in exchange for triglyceride or. in part, directly to LDL.

U BKG 2.7.4: During each cycle a portion of the cholesterol and bile acid escapes reabsorption and is lost in the feces.

U BKG 2.9.5: As the size of the VLDL particle diminishes owing to its interaction with lipoprotein lipase, its density increases, and the particles are converted to intermediate density lipoproteins (IDL).

U BKG 4.6.3: By trapping bile acids in the intestine, colestipol causes the liver to convert more cholesterol to bile acids and hence increases the heptic demand for cholesterol.

U BKG 2.8.4: When dietary cholesterol is insufficient, the liver synthesizes its own cholesterol by increasing the activity of a rate-controlling enzyme, 3-hydroxy-3-methlglutarylcoenzyme A reductase (HMG CoA reductase).

U BKG 8.4.3: Epidemiologic evidence suggests that excessive dietary intake of fat, cholesterol, and calories is responsible for the increased blood cholesterol.

Page 17: A method of measuring information in language, applied to medical texts

Measuring information in language, applied to medical texts 285

articles based on values of the overall complexity (O*D). In order that the reader may verify the discriminatory effect of grouping by complexity values, Table 3 shows the text from the low-complexity groups of THEORY sentences, each group in turn divided into subgroups, and Table 4 shows the original text of all sentences from the high- complexity groups of THEORY sentences.

DISCUSSION

Although the REPORT material was too simple to divide meaningfully into groups, there are a number of sentences with relatively high overall complexities, compared with others in the set, that merit discussion. Table 5 lists all sentences from the REPORT texts with complexities of 6 or more. Of these 39 sentences, 31 contained either co- ordinate conjunctions. semicolons or both. (They are marked in the CONJ column of Table 5 with an asterisk.) Although these repeated conjunctions were not themselves counted as operators, the operator count for these sentences was higher because of the expansion of conjoined nouns into conjoined assertions. In these sentences, the inverse density is quite low, reflecting the presence of the zeroed elements.

In the REPORT sentences there were few connectives other than coordinate con- junctions. Coordinate conjunctions are a relatively weak semantic link between as- sertions, as compared with, e.g. causal or instrumental connectives.

With regard to the THEORY material, a qualitative summary of the main linguistic-

Table S. High-complexity sentences from REPORT texts

Sent Sent WDS O’D Conj

HlDShl I. 1.1 17 6 HIDSM 1. I .2 16 12 HlDSM1.1.3 20 24 HIDSh11.1.8 II 9 HlDShll.1.12 9 6 HIDSM1.1.13 17 84 HIDShl2.1 ..s 13 6 HIDSMZ. 1.8 16 24 HlDSM2.1.13 23 24 HIDSM?. 1 .I7 14 8 HIDSM3.1.2 12 6 HIDSM3. I .3 10 8 HIDSM3. I .6 8 6 HIDSM3.1.13 9 6 HIDSM3.1.14 7 27 EXDSM 1.1.1 6 I’ EXDShl1. I. 11 5 II EXDSM1.1.15 5 9 EXDSM1.1.19 10 12 EXDSM?. 1.4 12 20 EXDShQ. 1.7 11 16 EXDSM2.1.10 7 12 EXDSM2.1.14 11 24 EXDSM3.1.8 7 15 EXDSM3.1.9 5 15 EXDSM3.1.17 12 33 EXDSM3.1.24 4 6 LDDSMI.I.1 11 36 LDDSM1.1.2 19 16 LDDSM2.1.14 7 6 LDDSM3.1.6 10 8 LDDSM3.1.8 16 14 LDDSM3.1.12 20 30 LDDSM3.1.15 10 I LDDSM3.1.18 9 12 LDDSM3.1.22 9 6 CODSM 1.1.5 9 18 CODSM1.1.6 9 9 CODSM?. 1.8 9 9

* * *

* I

*

*

*

*

*

*

*

*

*

f *

1;

*

*

*

*

*

*

*

+

*

*

*

1;

+

*

Page 18: A method of measuring information in language, applied to medical texts

286 D. B. GORDON and N. SAGER

informational features of sentences in portions of Tables 3 and 4 may be helpful in interpreting the results. The summary appears in Table 6. In the summary we note the number of facts (independent assertions) contained in the sentence, and distinguish complexfacrs as those in which two informational verb operators are involved, e.g. escapes reabsorption in U BKG 2.7.4 in A3 of Table 4. Among the binary operators we note the presence of modal operators (e.g. can, could, would, should, may), meta operators (e.g. if is rhoughr, 1t.e assume), and quantity operators, both nouns and verbs (e.g. rate, increases). We note among binary operators the presence of connectives (e.g. \%*hen, because, -ing, as). as distinct from coordinate conjunctions. Each of these different types of linguistic elements makes a distinct informational contribution to the sentence.

In the low-complexity groups, as the value of O*D rises, the number of facts rises from 1 or 2 to 2 or 3, and facts begin to appear under quantity or metaimodal operators. There are few connectives and few coordinate conjunctions, the additional facts coming in as local modifiers.

As we jump to the lowest of the high-complexity groups, (O*D = 40-49), the average number of facts per sentence is 3, which is not very much higher than the highest of the low-complexity group. However. the average number of higher-level elements (meta, modal. quantity operator, connective, coord-conj) is equal to 4, in contrast to the corresponding average number for the low-complexity groups, which was less than I.

Climbing up the complexity scale, in the O*D = 50-59 group, the average number of facts per sentence is 4 or more (depending on how UBS 4 is counted; see below) and the average number of higher-level elements is also 4 or more.

In the O*D 2 60 group. the relations are rather too complicated to summarize in the same fashion as above. The number of facts ranges from 4-8, with a corresponding

Table 6. Summarl of linguistic-informational features of sentences

Lov+-Complexit) Group of THEORY Text Sentences (Cf. Table 3)

O’D = 6 1 fact (BKG 2.4.9) 1 fact; I meta operator (UBS 8) 2 facts: 1 connective (BKG 2.4.7)

O*D = 9 1 fact: I meta’modal operator (UBS 2. UBS 5) 2 facts: 1 (local) coord-conj CUBS 12. BKG 2.9.6) 2 facts: I modal operator (BKG 2.8.2)

O*D = 12 2 facts: 1 quantit) operator (BKG 2.9.2) 3 facts; I quantity operator (BKG 2.7.5) 3 facts (BKG 25.1) 3 facts (BKG 25.3)

High-Complexit!, Group of THEORY Text Sentences (Cf. Table 4)

O*D = 40-49 2 complex (double operator) facts under quantity operators: 2 coord-conj 3 facts: 1 connective, 2 modals. I coord-conj 4 facts. two under quantity operators: 2 connectives. I coord-conj 3 facts. one complex. one under quantity operator; 2 connectives. 1 coord-conj 4 facts, one under quantity operator. one definitronal: 2 connectives 3 facts: tB’o under quantity operators. one containing a local (not-expanded) coord-conj:

1 connective. 1 meta.

O*D = 50-59 2 facts. one under a modal. both under a connective, the total unit present twice+ under

a coord-conj 4 facts: 5 quantity operators. I connective 5 facts: a quantity operator, 2 connectives. 1 coord-conj 4 facts. I complex (derived from uptake): 2 connectives

(BKG 2.7.4) (UBS 3) (BKG 2.9.5) (BKG 4.6.3) (BKG 2.8.4) (BKG 8.4.3)

CUBS 4)

(BKG 4.6.7) (BKG 4.6.7) tBKG 2.8.3)

t See discussion of UBS 4. below

Page 19: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts 287

number of higher-level operators. In the texts of this subject matter, the quantity op- erators become important and interrelated in the two groups of highest complexity.

With regard to the text materials in Table 4, two anomalous sentences should be noted.

(I) In Table 4, group A 2, the presence of UBS 4 requires some explanation, since the other sentences in this group seem manifestly more complicated. The reason for UBS 4’s presence is the existence of the logical connector “In either case . . 9 ” within the sentence. This connector requires expansion, i.e. “In either case, z” 4 “In case X. then 2. or in case Y, then 2”. Here X and Y are references to sections of the previous sentence. In UBS 4, 2 contains four operators, which must be repeated for the two cases, giving a higher operator count and a corre- spondingly higher overall complexity.

(2) In Table 4, group A3. UBKG 2.7.4 is interesting because it has a relatively large number of operators (11) but a complexity value (44) that is less than that of a number of others with fewer operators (UBS 3 O*D/OPS=45/9, UBKG 2.8.3 54/ 9, UBKG 2.9.5 4519. UBKG 4.63 4519). The reason appears to be the number of coordinate conjunctions that lead to expansion. This sentence has two nouns that are under a quantifier and a coordinate conjunction, which are then the arguments of two verbs under a coordinate conjunction. Expansion of all coordinate con- junctions results in a total operator count of 11 but a relatively low depth of nesting 14), yielding a complexity value of 44, The fact that the operators are of the “zer- oing” type is reflected in the low inverse density value (1.72). One might contrast this sentence with ones that have a lower operator count and a higher complexity value (UBKG 2.8.3 from group A2 of Table 4 and UBKG FIG. 16.2 of group Al of Table 4). These two sentences have complexity/operator ratios 5419 and 6000, respectively. Neither of these sentences has an expanded coordinate conjunction that restores zeroed material. While use of the operator-argument representation provides a uniform basis for

measuring structural properties. the measures based on counting the operators, without further distinction as to their type. is admittedly crude. Future refinements of the meas- ures should take account of (at least) the differences between (a) A sublanguage verb: this operator carries a whole proposition in the “object lan-

guage” and should have more weight than (b) or (c); (b) A sentence operator (or “meta” operator), such as know,. beliel>e, that occurs with

an embedded subject or object; this operator should carry perhaps half the weight of a full proposition;

(c) A connective (e.g. because, when) or connective verb (e.g. in&cares, leads to) in contrast with coordinate conjunctions; the former provides substantive relational information among propositions while the latter are logical connectors, with and often serving as a loose link. Their weights should be investigated.

CONCLUSlONS

In this study, it was demonstrated that a linguistically based analysis of the sttuc- tural properties of text sentences can lead to quantitative measures of the amount and complexity of the information carried by the sentences. Types of texts which clearly differ in these properties turn out to have correspondingly contrastive measures. Fur- thermore, within a single type of text, groups of sentences graded according to one measure (the overall complexity) appear to reflect perceived differences in complexity of information.

These results are a first step toward defining and measuring important properties of textual material that have hitherto been only qualitatively perceived and described. Many practical and theoretical questions have to be answered before we can implement a quantitative system of some generality.

The first issue is finding a common representation that can be obtained by computer from a wide variety of texts. For the purposes of this study, we worked out the cor-

Page 20: A method of measuring information in language, applied to medical texts

288 D. B.Go~oox and K. SAGER

respondences between different representations so that valid comparisons could be made. For future work, it would be desirable to map all representations to a “lowest- common-denominator’* informational form. This is not an elementary task, but it is within reach of current techniques.

Further study is also required to refine the measures with regard to different types of operators that have been counted, to give different weight to the amount of infor- mation contributed by each type of operator. For example, more weight could be given to operators that bring new propositions into the discourse, as opposed to operators that connect existing propositions. Also, the relatively weak conjunction AND should perhaps have less weight than would other conjunctions and certain subordinate con- junctions.

In addition, it would be desirable to account for the role of reference in the total information content of an individual sentence, a project that was beyond the scope of the present investigation. With such an accounting for the weight of total information in each sentence of a discourse, it would be possible to map the informational com- plexity of a discourse as a function of sentence position in a paragraph, or progression through a paragraph or section.

With regard to applications. a measurement of information in a discourse could be of benefit to systems for indexing or abstracting from free text. Selection of terms for indexing could take account of the complexities of the sentences into which those terms entered. Retrievals of text passages in a text-retrieval system could be weighted by the measures of amount, density and complexity of the information in each candidate passage in a list of possible retrievals.

More generally, situations arise in which it is desired to grade text as to level of difficulty. Examples are training manuals .and graded reading materials for use in schools. Current applications use measures whose relation to actual complexity of information are not clearly defined.

In another area, the designers of text-derived database systems need information about the maximum depth of nesting and overall complexity to be expected in texts that will be mapped into the database.

Finally. continued research in this area can help to define more precisely the properties that distinguish textual information from purely quantitative data, and can contribute to a precise de~nition of information that takes account of properties beyond those studied in the mathematical theory of information.

Acl;no,r,l~dRr~le,lrs-This research uas supported in pan b! the National Science Foundation under Grant IST79-20788 from the Division of Information Science and Technology. and in part by Kational Librar), of Medicine Grant I-ROI-LM02616. auarded b) the National Institutes of Health. DHHS.

111

El

131

143

PI

161

I71

REFERENCES

C. SH.~NKON. Tire ,2~athematical Theor? of Commro?icatiorr. University of Illinois Press, Urbana. II (1949). Z. HARRIS, Co-occurrence and transformation in linguistic structure. Presidential address, Linguistic Society of America. Language 1957, 33(3), 283-340. Z. HARRIS, The elementary transformations. Tra~sf~rmarions and Discozrrse Anal~sjs Pa- pers (T.D.A.P.) 1964. 54, University of Pennsylvania. Reprinted in part in Z. S. Harris, Papers in Structural and Trattsformational Linguistics. D. Reidel. Dordrecht, Holland (1970). Z. HARRIS. The two systems of grammar: report and paraphrase. T.D.A.P. 1969, 79. Re- printed in Papers in Swuctural and Transformational Lingiristics. D. Reidel. Dordrecht, Holland (1970). Z. HARRIS, A theory of language structure. American P?lilosop~~icai Quarterly 1976, 13, 237-255. Z. HARRIS. A Grammar of English on Mathemafical Principles. Wiley Interscience, New York (1982). G. SALTON. and M. J. MCGILL. introduction to Information Retrieval. McGraw-Hill. New York (1983).

Page 21: A method of measuring information in language, applied to medical texts

Measuring information in language. applied to medical texts 289

181 T. WINOGRAD. Langlrap~ as a Coglli(i\.e PI-ocess. Addison-Wesley,. Reading. Mass. (1983). [9] R. KITTREDGE and J. LEHRBERGER. Sllblonglrage: Strtdies of Larlglrage in Resrricted Se-

mantic Domains. Walter de Gruyter, Berlin (1982). [IO] N. SAGER. Arafura1 Lang/tape lnformarion Processing. Addison-Wesley. Reading. Mass.

(1981). [I I] A. JOSHI. B. L. WEBBER and I. S.~G. E/e/?Iellts qf Diswrcrsc Undersrundirq. Cambridge

University Press. Cambridge. England (1981). [12] D. E. WALKER. C’udersrandirlg Spoken Lnupunge. Elsevier!North-Holland. Nen York

(1978). 1131 W. PLATH. REQUEST: A natural language question-ansuerinp system. IBM Jo~rrnal of

Research and De\,elopmeur 1976. 20(4). 326-335. [14] S. PETRICK. Transformational analysis. In ,l’rrtrrral Language Undrrsrarldiug (edited by R.

Rustin). Alporithmics Press, New York (1973). [IS] J. HOBBS and R. GRIsHhl.w, The automatic transformational analysis of English sentences:

An implementation. Inrernarioual Joltrun/ of Comprtrrr Mathematics 1976. 5. Section A, 267-283.

(161 E. MARSH. L. HIRSCHMAN and C. FRIED~IA~. The implementation of English transformations. Srring Program Reporrs 1982. No. 15. pp. 59-82. Linguistic String Project, New York University,.

[17] N. SAGER. Natural language information formatting: The automatic conversion of texts to a structured data base. In Adl,unces in Compurers 17 (edited by M. C. Yovits). 89-162. Academic Press. New York (1978).

[18] R. GRISHMAN and L. HIRSCHSIA~. Question answering from natural language medical data bases. Artificial Inrelligerlce 1978. 11, 25-43.

[19] E. MARSH C. CHRISTENSON and C. WHITE. Transformational decomposition of a fact-report text. String Program Reports 1982. No. 15. Linguistic String Project. New, York University.

[20] M. KOSAKA, Transformational decomposition of a theoretical, text. Sfl.irlg Program Reports 1982. No. IS. Linguistic String Project. Neu York Universit!‘.

[21] D. GORDOS and N. SAGER, The measurement of properties of textual information. Srring Program Reporrs 1982. No. 15. Linguistic String Project. New York Uni\,ersity.