1958 - the automatic creation of literature abstracts

Upload: franck-dernoncourt

Post on 09-Apr-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/7/2019 1958 - The Automatic Creation of Literature Abstracts

    1/7

    H. P. Luhn

    The Automatic Creation of Literature Abstracts*

    Abstract: Excerptsof technical papers andmag azine articles that serve the purposes of conventionalabstracts have been created entirely by automatic means. n the exploratory research described, the com-plete text of an article n machine-readable form i s scanned by an IBM 704 data-processing machine an danalyzed in accordance with a standard program. Statistical information derived from word frequency anddistribution i s used by the machine to compute a relative measure of significance, first for individual wordsan d then for sentences. Sentences scoring highest in significance are extracted an d printed out to becomethe auto-abstract.

    IntroductionThe purposeof abstracts in technical literature is to facil-itatequick andaccurate identification of the topic ofpublished papers. The objective is to save a prospectivereader time and effort in finding useful information in agiven article or report.

    The prepara tion of abstracts is an intellectual effort,requiring general familiarity with the subject. To bringout the salient points of an authors argument calls forskill and experience. Consequently a considerable mountof qualified manpower that could be used o advantage nother ways must be diverted to he task of facilitatingaccess to informa tion. This widespread problem is beingaggravated by the ever-increasing output of technicalliterature. But another problem- erhaps equally acute-is hat of achievingconsistence and objectivity inabstracts.

    The abstracters product is almost always influenced byhis background, attitude, and disposition. The abst ract-ers own opinions or immediate interests may sometimesbias his interpretation of the authors ideas. The qualityof anabstract of a given article may hereforevarywidely among abstracters, and if the same person wereto abstract an article again at some other time, he mightcome up with a different product.

    The appli catio n of machinemethods o iteraturesearching is currently receiving a great deal of attentionand now indicates that both human effort and bias maybe eliminated from he abstracting process. Althoughrapid progress is being made in the development of sys-tems using modern electronicdata-processing devices,*Presentedat IRE NationalConvention, N e w York, March 24, 1 9 5 8 .

    their efficiency depends on availability of literary infor-mation in machine-readable form. It is evident that thetranscription of existing printed text into this form wouldhave to be done manually at this time.In hefuture,however, rint-reading devices should e sufficientlydeveloped for this ask. For materialnotyetprinted,tape-punching devices attached to typewriters and type-setting machines could readilyproduce machine-readablerecords as by-products.

    Thispaper describes some exploratory research onautomatic methods of obtainingabstracts. The systemoutlined here begins with the document in machine-read-able formand proceedsbymeans of a programmedsampling process comparable to the scanning a humanreader would do. However, instead of sampling at ran-dom, as a reader normally does when scanning, the newmechanical method selects those among all the sentencesof an article that are themost representative of pertinentinformation. Thesekey sentences are then enumerated oserveas clues for judging the charac ter of thearticle.Thus, citations of the authors own statements constitutethe auto-abstract.

    The programs for creating uto-abstractsmust ebased on properties of writing ascertained by analys is ofspeci f ic types of i terature. Because the use of abstractsis an established practice n science and technology, itseemed desirable to develop the method first for pape rsand articles in this area. A primary objective of the de-velopment was to arrive at a system that could take fulladvantage of the capabilities of a modern electronic data-processing system such as the IB M 704 or 705, while atthe ame time eeping the scheme as simple as possible. 1 5 9

    IBM J O U R N A L A P R IL 1958

  • 8/7/2019 1958 - The Automatic Creation of Literature Abstracts

    2/7

    Measur ing significanceTo determine which sentences of an articlemay bestserve as the auto-abstract,a measure is required by whichthe information content of all the sentences can be com-pared and graded. Since the suitability of each sentenceis relative, a value can be assigned to each in accordancewith the quality criterion of significance.

    The significance factor of a sentence is derived froman analysis of its words. It is here proposed that the fre-quency of word occurrence in an article urnishesausefulmeasurement of word significance. It is furtherproposed that the relative position within a sentence ofwords having given values of significance furnishes a use-ful measurement for determining the significance of sen-tences. The significance factor of a sentencewill thereforebe based on a combination of these two measurements.

    It should be emphasized that this system is based onthe capabilities of machines, not of human beings. There-fore, regrettable as it might appear,he intellectualaspects of writing and of meaning cannot serve as ele-ments of such machine systems. To a machine, words canbe only so many physical things. It can find out whetheror not certain such things are similar and how many ofthem there are. The machine canemember such findingsand anperform rithmetic on thosewhich can becounted. It can do all of this by means of suitable pro-gram instructions. Thehuman intellect need be reliedupon only to prepare theprogram.Establishing a set of significant wordsThe justification of measuring word significance by use-frequency is based on hefact hat awriternormallyrepeats certa in words as he advances or varies his argu-ments and as he elaborates on anaspect of a subject. Thismeans of emphasis is taken as an ndicator of signifi-cance. The more often certain words are found in eachothers company within a sentence, the more significancemay be attributed to each of these words. Though certainother words must be present to serve the important func-tion of tying these words together, the type f significancesought here does not reside in such words. If such com-mon words can be segregated substantially by non-intel-lectualmethods,hey ouldhen be excluded fromconsideration.

    This rather unsophisticated argument on significanceavoids such linguistic implications as grammar and syn-tax. In general, themethod doesnot even propose todifferentiate between word forms.Thushe variantsdi f f e r ,d i f f e ren t i a t e ,d i f f e ren t ,d i f f e ren t l y ,d i f f e ren ce anddif ferent ial could ordinarily be considered dentical no-tions and regarded as the same word. No attention is paidto the logical and semantic relationships the author hasestablished. In other words, an inventory is taken and aword list compiled in descending orde r of frequency.

    Procedures as simple as hese, of course, a re rewardingfrom the standpoint of economy. The more complex themethod, the more operations must the machine performI 6 0 and herefore themore costly will be the process. But in

    this case an even more fundamental justification for sim-plicity can be found in the nature of technical writing.Within a technical discussion, there is a very small proba-bility that a given word is used to reflect more than onenotion. The probability is also small that an author willuse different words to reflect the same notion. Even if theauthor makes a reasonable effort to select synonyms forstylistic reasons, he soon runs out of legitimate alterna-tives and falls into repetition if the notion being ex-pressed was potentially significant in the first place.

    A word list compiled in accordance with the methodoutlined will generally take the form of the diagram inFig. 1 . The presence in the region of highest frequencyof many of the words previously described as too commonto have the type of significance being sought would con-stitute noise in the system. This noise can be materiallyreduced by an elimination technique in which text wordsare compared with a stored common-word list. A sim-pler way might be to determine a high-frequency cutoffthroughstatisticalmethods oestablishconfidencelimits. If the line C in the figure represents this cutoff,only words to its right would be considered suitable forindicating significance. Sincedegree of frequency hasbeen proposed as a criterion, a lower boundary, line D ,would also be established to bracket the portion of thespectrum hat would contain he mostuseful range ofwords.Establishing optimum locations forboth lineswould be a matter of experience with appropriately largesamples of published articles. It should even be possibleto adjust hese locations to alter the haracteristics of theoutput.

    The curve for the egree of discrimination, or resolv-ing power, of the bracketed words in the figure mightlook something like the dotted line, E . It is apparent thatwords that cannot be put n he category of commonwords may sometimes fall to he left of line C. If theprogram has been properly formulated, the location ofthese words on the diagram would indicate their loss ofdiscriminatory power. The word cell in an article onbiology may be an example of this. It may be anticipatedthat the cutoff line, once established, may be stable overmany different degrees of specialization within a field, oreven over many different fields. Moreover, the resolvingpower would increase with the need for finer resolution.The case of a common word falling in the region to theright of line C can be olerated because of its lesserdegree of interference.Establishing relat ive significance of sentencesAs pointed out earlier, the method to be developed hereis a probabilistic one based on the physical properties ofwritten texts. No consideration is to be given to themean-ing of words or the arguments expressed by word com-binations. nstead it is hereargued hat, whatever hetopic, the closer certain words are associated, the morespecifically an aspect of the subject is being treated.There fore, wherever the greatest number of frequentlyoccurr ing different words are found in greatest physicalproximity to each other, the probability is very high that

    I B M JOURNALAPRIL 1958

  • 8/7/2019 1958 - The Automatic Creation of Literature Abstracts

    3/7

    the information being conveyed is most representative ofthe article.

    The significance of degree of proximity is based on thecharacteristics of spokenand written anguage in thatideas most closely associated intellectually are found tobe implemented by words most closely associated physi-cally. The divisions of writtenext into sentences,paragraphs, chapters, et cetera, is another physical mani-festation of the graduating degree of association of ideas.These aspects have been discussed in detail in an earlierpaper by the writer.

    From these considerations a significance factor canbe derived which reflects the number of occurrences ofsignificant words within a sentence and the linear distancebetween them due to the intervention of non-significantwords. All sentencesmay be ranked norder of theirsignificance according to this factor, and one or severalof the highest ranking sentences may then be selected toserve as the auto-abstract.*H . P. Luhn , A Statisticalpproach to Mechanizedncoding andw d o p m e n f , 1, No . 4, 309- 317 O ct ob er1957) .Searching fLiterary nformation, IB M Journal of R esea rchan d De -

    It must be kept in mind that, when a statistical proce-dure is applied to produce such ankings, the criterion isthe relationship of the significant words to each otherrather han their distribution over a whole sentence. Ittherefore appears proper to consider only those portionsof sentenceswhich arebracketed by significant wordsand to set a limit for the distance at which any two sig-nificant words shall be considered as being significantlyrelated. A significant word beyond that limit would thenbedisregarded from consideration in a given bracket,although it might form a bracket, or cluster, in conjunc-tionwith other words n the sentence. An analysis ofmany documents has indicated that a useful limit is fouror five non-significant words between significant words.If with this separation two o r more clusters result, thehighest one of the several significance factors is taken asthe measure for that sentence.

    A cheme for computing the significance factor isgiven by way of example in Fig. 2 . It consists of ascer-taining he extent of a cluster of words by bracketing,counting the number f significant words contained n thecluster, and dividing the square of this number by the

    Figure I Word-frequency diagram.-Abscissa represents individual words arran ged in orde r of requ en cy .c D

    N O R D S

    \\

    161

    iBM . J O U R N A L * APRIL 1958

  • 8/7/2019 1958 - The Automatic Creation of Literature Abstracts

    4/7

    ( - -Sentence- -

    Significant Words* - * * *1 2 3 4 5 6 7

    All Words- -Portio n of senten ce bracke ted byand including significant words notmore han ournon-significantwordsapart. I f eligible, hewholesentence is cited.

    - 1

    - 1

    Figure 2 Computation of significance factor.The square of the number of bracketed signif-icant words (4 ) divided by the otal numberof bracketed words (7 ) = 2 . 3 .

    total number of words within hiscluster. The resultsbased on this formula, as performed on about 0 articlesranging from 300 to 4,500 wordseach,havebeenen-couraging enough for furt her evaluation by a psychologi-calexperiment involving 100 people. This experimentwill determine on an objective basis the effectiveness ofthe abstracts generated.

    The resolving power of significant words derived underthe method described depends on he otalnumber ofwords comprising an article and will decrease as the totalnumber of words increases. In order o overcome thiseffect, the abstracting process may be performed on sub-divisions of the article, and the highest ranking sentencesof each of these divisions may then be selected and com-bined to constitute the auto-abstract. In many cases theauthor provides such divisions as part of the organizationof his paper, and they may therefore serve for the ex-tended process. Whereuch deliberate divisions areabsent they can be made arbitrarily in accordance withsome criteria established by experience. These divisionswould be arranged in such a way that they overlap eachother, for lack of any simple means of mechanically de-tecting the exact point of the authors transition to a newsubject subdivision.

    A more detailed account of these and othe r computingmethods, as well asdetails on programmingelectronicdata-processing machines for this procedure, will be givenin subsequent papers.

    162 By way of example, two auto-abstractsre included

    in this paper. Exhibit 1 shows four selected sentences ofa2,326-wordarticle from The Scientific American. Atable of word frequency is also given. Exhibit 2 shows thehighest ranking sentence of a 783-word article from theScience Section of TheNewYorkTimes. Acompletereproduction of this article is given.Machine proceduresThe abstracts described in this paper were prepared byfirst punching the documents on cards. Punctuationmarks in the printed text not available on the standardkey punch were replaced by other key-punch characters.The cards thusproduced constitute the machine-readableform of the document.

    The abstracting process was initiated by transcribingthe card record onto magnetic tape by means of an auxil-iary card-to-tape unit. The resulting tape was introducedintoanIBM 704 data-processingmachine,which wasprogrammed to read the taped text to separate it into itsindividual words, to note he position of each word inthe document, thesentence and paragraph n which itappeared,and onote hepunctuation preceding andfollowing it. Concurrently, common words such as pro-nouns, prepositions, and articles were deleted from helist by a able-lookup outine. Thisoperation was fol-lowed by a sorting program which arranged the remain-ing words in alphabetic order.

    The next step of the machine operation was a consoli-dation of words which are spelled in the same way attheir beginning, such as similar and similarity. This pro-cedure was a simple statistical analysis routine consistingof a etter-by-lettercomparison of pairs of succeedingwords in the alphabetized list. From hepoint whereletters failed to coincide, a combined count was taken ofthe non-similar subsequent letters of both words. Whenthis count was six or below, the wordswere assumedto be similarnotions; above six, different notions. Al-though this method of word consolidation is not infallible,errors up to 5 % did not seem to affect the final results ofthe abstracting process. The mach ine then counted theoccurrence of similar words derived in this way. Wordsof a stipulated low frequency were then deleted from thelist and locations of the remainingwordsweresortedintoorder. Thesewords herebyattained the status ofsignificant words.

    The significance factor for eachsentence was deter-mined by a computing routine in accordance with theformula previously mentioned. All sentences which scoredabove a predetermined cutoff value were written on anoutput tape along with their respective values. The basisfor this cutoff value depends on the amount of detailedinformation needed for a given type of abstract. Resultswere then printed out from this tape.Extended applicationsAlthough a standard abstract has thus far been assumedin order to simplify the explanation of the machine pro-cess, extracts or condensations of literature are used fordiverse purposes and may vary in length and orientation.

    IBM JOURNAL - A P R I L 1958

  • 8/7/2019 1958 - The Automatic Creation of Literature Abstracts

    5/7

    Exhibit 1Source: Th e Scientific American, Vol. 196, No. 2, 86-94,February, 1957Title:Messengers of theNervousSystemAuthor: Amodeo S. MarrazziEditor's Sub-heading: Th e internal communication of the body is mediated by chemicals as well as by nerve impulses.Study of their interaction has developed important leads to the understanding and therapy of mental illness.

    Auto-Abstract*It seems reasonable to credi t the single-celled organisms also with a system of chemical communicat ion by di'flusion ofstimulating substances hrough he cell, and these correspond to the chemical messengers (e.g., hormones) that carrystimuli from cell to cell in the more complex organisms. (7.0)PFinally, in the vertebrate animals there are special glands (e.g., the adrenals) for producing chemical messengers, andthe nervous and chemical communication systems are intertwined: or nstance, release of adrenalin by he adrenalgland is subject to control both by nerve impulses and by chemicals brought to the gland by the blood. (6.4)The experim ents clearly demonstrated that acetylcholine (and related substances) and adrenalin (and its relatives) exertopposing actions which maintain a balanced regulation of the transmiss ion of nerve impulses. (6.3)It is reasonable to uppose hat he ranquilizingdrugscounteract he nhibitoryeffec t of excessive adrenalinorserotonin or some related inhibitor in the human nervous system. (7.3)tSignificarlce factor is given at the en d of each sentence.*Sentences selected by me a n s of statistical analysis as hat iog n degree of significance of 6 an d over .

    Significant words in descending order of frequency (common words omitted).4640282219I 81816161515131313

    nervechemicalsystemcommunicationadrenalincellsynapseimpulsesinhibitionbraintransmissionacetylcholineexperimentsubstances

    1212121212I OI O887777

    bodyeffectselectricalmentalmessengerssignalsstimulationactionganglionanimalblooddrugsnormal

    665555555555

    disturbance 1 4relatedcontroldiagramfibersglandmechanismsmediatorsorganismproduceregulateserotonin

    4444444444444

    accumulatebalanceblockdisordersendexcitationhealthhumanoutgoingreachingrecordingreleasesupplytranquilizing

    Total word occurrences in the document:Different words in document:

    . . . . . . . 2326Tota l of different words . . . . . . . . . . . . . . . . . . . 741Less different common words . . . . . . . . . . . . . . . . . 170

    Different non-common words . . . . . . . . . . . . . . . . 571Ratio of all word occurrences to different non-common words . . . . . . . . -4: lNon-common words having a frequency of occurrence of 5 and over:

    _.

    Total occurrences . . . . . . . . . . . . . . . . . . . . 478Different words . . . . . . . . . . . . . . . . . . . . . 3 9 163

    IB M J O U R N A L A P R I L 1958

  • 8/7/2019 1958 - The Automatic Creation of Literature Abstracts

    6/7

    164

    Exhibit 2Source: The New York Times, September8, 1957, page E l lTitle: Chemistry Is Employed in a Search for New Met hods to onquer Mental IllnessAuthor: RobertK . Plumb

    easerocesseshemselves. Only__ndring t o fruitionansongthen will the metaboli c era mature ~- hoped foralvationromheav-At thesychologistseetinghere, a technique or racingclec-trical ctivitvn oecific ortions

    ~

    I/ ~ o n e vo financeesrea& on the chiatrists o predict hat he admin- ast week n an announcement from; inenta l illness is Isfration of ACTH a.nd cortisoneWashington. I

    heir I oppositeandevenmutua lly excl- let ter sdescribing work theyma yof Physiologic dis- have inprogress-to the Technical3ances; theyaid.nformationnit of th eenter in I

    IBM JOURNAL APRIL 1958

  • 8/7/2019 1958 - The Automatic Creation of Literature Abstracts

    7/7

    Exhibit 2 Auto-AbstractT w o major recent developmen ts have called the attention of chemists , physiologists , physicists and other scientists tomental diseases: I t has been foun d that extremely minu te quantities of chemicals can induce hallucinations and bizarrepsychic d is turbances in normal people , and m ood-al ter ing drugs ( tranquil izers , for ins tance) have m ade long-ins t i tut ion-alized people amenable to the r apy . (4.0)This poses new possibilities for studying brain chemistry chan ges in health and sickness and their alleviation, the Cali-fornia researchers emphas ized.( 5 . 4 )Thenew tudies of brain chemistryhaveprovidedpractical herapeutic esultsand remendousencouragement othose who must care for mental patients . ( 5 . 4 )

    A condensation of a document to a given fract ion of theoriginal could be readily accomplished with the systemoutlined by adjusting the cutoff value of sentence signifi-cance. On the other hand , a fixed number of sentencesmight be required irrespective of document length. Hereit would be a simple matter to print out xactly that num-ber of the highest ranking sentences which fulfilled therequirement.

    Inmany instances condensations of documents aremade emphasizing the relationship of the information inthe document to a special nterest or field of investiga-tion. In such ases sentences could be weighted by assign-ing a premium value to a predetermined class of words.

    These two features of the auto-abstract, variable lengthand emphasis,might at times be usefully combined.In he case of aong, omprehensive paper, severalcondensed versions could be prepared, each of a lengthsuitable to the requirements of its recipient and biased tohis particular sphere of interest.

    Along these same lines, a specificity ranking techniquemight prove feasible. If none of the sentences in an articleattaineda certa in significance factor, it would be pos-sible to reject the article as too generalized for the pur-pose at hand.

    Incertain cases an abstract might be amplified byfollowing it with anenumeration of specifics, such asnames of persons, places, organizations, products, mate-rials, processes, et cetera. Such specific words could beselected by the machine either because they are capital-ized or by means of lookup in a stored special dictionary.

    Auto-abstractingcould also be used to alleviate thetranslation burden. To avoid total ranslation initially,auto-abstracts of appropriate length could be producedin the original language and only the abstracts translatedfor subsequent analysis.

    Finally, the process of deriving key words for encodingdocuments for mechanical information retrieval could besimplified by auto-abstracting techniques.

    ConclusionsThe results so far obtained for technicalarticles haveindicated the feasibility of automatically selecting sen-tences that will indicate the general subject matter, verymuch as do conventional bstracts. What uchauto-abstracts might lack in sophistication they will more thancompensate for by theiruniformity of derivation. Be-cause of the absence of the variations of human capabili-ties and orientation, auto-abstracts have a high degree ofreliability, consistency, and stability, as theyare the prod-uct of a statistical analysis of the authors own words. Inmany cases the abstract obtained is the type generallyreferred to as the indicative abstract.

    Once auto-abstracts are generally available, their userswill learn how to interpret them and how to detect theirimplications. They will realize, for instance, that certainwords contained in the sample sentencesstand for notionswhich must have been elaborated upon somewhere in thearticle. If this were not so for a substantial portion ofthe words in the selected sentences, these sentences couldnot have attained their status based on word frequency.

    There is, of course, the chance that an uthors style ofwriting deviates from the average to an extent thatmightcause the method to select sentences of inferior signifi-cance. Since the title of the paper is always given in con-junction with the auto-abstract, there is a high probabilitythat it will favorably supplement the abstract. However,there will always be a residue of inadequate results, andit appears to be entirely feasible to establish criteria bywhich a machine may recognize such exceptions and ear-mark them for human attention.

    If machines can perform satisfactorily within the rangeoutlined in this paper, a ubstantial and worthwhilesaving in human effort will have been realized. Theauto-abstract is perhaps the first example of a machine-generated equivalent of a completely intellectual task inthe field of literature evaluation.Receivedecember 2,1957 165

    IB M JOURNAL APRIL 1958