i sessionno. i - stacksfv118rr1216/fv118rr1216.pdf · "1 40 i i i f sessionno. 2 application*...

9
"1 I 40 I i f Session No. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward A. Felganbaum Joshua Ladarbarg Stanford University Stanford, California, USA Abstract The Meta-DENDRAL program Is a vehicle for studying problems of theory formation in science. The general strategy of Meta-DENDRAL is to reason from data to plausible generalizations and then to organize the generalizations into a unified theory. Three main subproblems are discussed: (1) explain the experimental data for each indi- vidual chemical structure, (2) generalize the results from each structure to all structures, and (3) organize the generalizations into a uni- fied theory. The program is built upon the con- cepts and progranmed routines already available in the Heuristic DENDRAL performance program, but goes beyond the performance program in attempting to formulate the theory which the performance program will use. I. Introduction Theory formation in science embodies many elements of creativity which make it both an in- teresting and challenging task for artificial intelligence research. One of the goals of the Heuristic DENDRAL project has long been the study of processes underlying theory formation. This paper presents the first steps we are taking to achieve that goal, in a program called Meta- DENDRAL . Because we believe there is value in repor- ting ideas in their formative stages —in terms of feedback to us and, hopefully, stimulation of the thinking of others--we are presenting here a des- cription of work on Meta-DENDRAL even though not all of the program has been written. Just like the scientists we attempt to model, we often fail to make explicit the thinking steps we go through. the designs of the unfinished pieces of program are described as they will be initially programmed, and several outstanding problems are mentioned. It is hoped that this discussion will provoke comments and criticisms, for that is also part of its purpose. *This research was supported by the Advanced Research Projects Agency (SD-183). The Heuristic DENDRAL project has concentrated its efforts on the inductive analysis of empirical data for the formation of explanatory hypotheses. This is the type of inference task that calls for the use of a scientific theory by a performance program, but not for the formation of that theory. When we started on Heuristic DENDRAL we did not have the insight, understanding, and daring to tackle ab initio the problem of theory formation. But now we feel the time is ripe for us to turn our attention to the problem of theory formation. Our understanding and our technical tools have matured along with the Heuristic DENDRAL program to the point where we now see clear ways to proceed. As always, the proper choice of task environ- ment is crucial, but for us the choice was abso- lutely clear. Because the Heuristic DENDRAL per- formance program uses the theory of a specialized branch of chemistry, formulating statements of that theory is the task most accessible to us. The theory itself will be briefly introduced in Section 11, although it is not expected that readers understand it to understand the directions of this paper. The goal of the Meta-DENDRAL program is to infer the theory that the performance program (Heuristic DENDRAL) uses to analyze experimental chemical data from a mass spectrometer. The following table attempts to sketch some differences between the programs at the performance level and the meta-level. Input : Heuristic DENDRAL - The analytic data from a molecule whose structure is not known (except of course in our test cases) . Meta-DENDRAL - A large number of sets of data and the associated (known) molecular struc- tures. Output: Heuristic DENDRAL - A molecular structure in- ferred from the data. Meta-DENDRAL - A set of cleavage and rearrange- ment rules constituting a subset of the theory of mass spectrometry.

Upload: others

Post on 01-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

"1I40

I if

Session No. 2 Application*

A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE*

Bruce G. BuchananEdward A. Felganbaum

Joshua Ladarbarg

Stanford UniversityStanford, California, USA

Abstract

The Meta-DENDRAL program Is a vehicle forstudying problems of theory formation in science.The general strategy of Meta-DENDRAL is to reasonfrom data to plausible generalizations and thento organize the generalizations into a unifiedtheory. Three main subproblems are discussed:(1) explain the experimental data for each indi-vidual chemical structure, (2) generalize theresults from each structure to all structures,and (3) organize the generalizations into a uni-fied theory. The program is built upon the con-cepts and progranmed routines already availablein the Heuristic DENDRAL performance program, butgoes beyond the performance program in attemptingto formulate the theory which the performanceprogram will use.

I. Introduction

Theory formation in science embodies manyelements of creativity which make it both an in-teresting and challenging task for artificialintelligence research. One of the goals of theHeuristic DENDRAL project has long been the studyof processes underlying theory formation. Thispaper presents the first steps we are taking toachieve that goal, in a program called Meta-DENDRAL .

Because we believe there is value in repor-ting ideas in their formative stages —in terms offeedback to us and, hopefully, stimulation of thethinking of others--we are presenting here a des-cription of work on Meta-DENDRAL even though notall of the program has been written. Just likethe scientists we attempt to model, we often failto make explicit the thinking steps we go through.

Therefore,

the designs of the unfinished piecesof program are described as they will be initiallyprogrammed, and several outstanding problems arementioned. It is hoped that this discussion willprovoke comments and criticisms, for that is alsopart of its purpose.

*This research was supported by the AdvancedResearch Projects Agency (SD-183).

The Heuristic DENDRAL project has concentratedits efforts on the inductive analysis of empiricaldata for the formation of explanatory hypotheses.This is the type of inference task that calls forthe use of a scientific theory by a performanceprogram, but not for the formation of that theory.When we started on Heuristic DENDRAL we did nothave the insight, understanding, and daring totackle ab initio the problem of theory formation.But now we feel the time is ripe for us to turnour attention to the problem of theory formation.Our understanding and our technical tools havematured along with the Heuristic DENDRAL programto the point where we now see clear ways to proceed.

As always, the proper choice of task environ-ment is crucial, but for us the choice was abso-lutely clear. Because the Heuristic DENDRAL per-formance program uses the theory of a specializedbranch of chemistry, formulating statements ofthat theory is the task most accessible to us.The theory itself will be briefly introduced inSection 11, although it is not expected thatreaders understand it to understand the directionsof this paper.

The goal of the Meta-DENDRAL program is toinfer the theory that the performance program(Heuristic DENDRAL) uses to analyze experimentalchemical data from a mass spectrometer. Thefollowing table attempts to sketch some differencesbetween the programs at the performance level andthe meta-level.

Input :Heuristic DENDRAL - The analytic data from a

molecule whose structure is not known (exceptof course in our test cases) .

Meta-DENDRAL - A large number of sets of dataand the associated (known) molecular struc-tures.

Output:Heuristic DENDRAL - A molecular structure in-

ferred from the data.Meta-DENDRAL - A set of cleavage and rearrange-

ment rules constituting a subset of thetheory of mass spectrometry.

Page 2: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

Session No. 2 Applications 41

1

i

i

Example:Heuristic DENDRAL - Uses alpha-carbon fragmen-

tation theory rules in planning and In val-idat lon .

.!■"! _i -DI-'NUIIAI. MLucovars (and valldatoa) alpha-carbon 1 r<i ginenlatlon rules In a space ofpossible patterns of cleavage. Uses set ofprLmitlve concepts but does not invent newprimitives

In our view, the continuity evident in thistable reflects a continuity in the processes ofinductive explanation in science. Moves toward

_<;

ca-levels of scientific inference are movescoward encompassing broader data bases and con-structing more general rules for describing regu-larities in the data.

Beyond this level of Meta-DENDRAL there arestill higher levels. Not all theory formation isas simple as the program described here assumes itts. For example, the representation of chemicalmolecules and the list of basic processes are bothfixed for this program, yet these are conceptswhich a higher level program should be expected todiscover. Also, there is no postulation of newtheoretical entities in this program. But, again,higher levels of theory formation certainly do in-clude this process.

The task of theory formation can be and hasleen discussed out of the context of any particu-lar theory. (4) However, writing a computer pro-gram to perform the general task Is more difficultthan working within the context of one particularscientific discipline. While it is not clear howscience proceeds in general, it may be possibleto describe in detail how the scientists in oneparticular discipline perform their work. Fromthere, it is not a large step to designing a com-puter program. Thus this paper attacks the generalproblems of theory formation by discussing theproblems of designing a computer program to formu-late a theory in a specific branch of science(cf. 2).

The general strategy of Meta-DENDRAL is toreason from data to plausible generalizations andChen to integrate the generalizations into a uni-fied theory. The input to the Meta-DENDRAL systemis a set of structure-data pairs. It receivesessentially the same data as a chemist might choosewhen he attempts to elucidate the processes under-lying the behavior of a class of molecules in amass spectrometer. When chemists turn their atten-tion to a class of chemical compounds whose massspectrometric processes (MS processes) are not wellunderstood, they must collect mass spectrometrydata for a number of the compounds and look forgeneralizations. The generalizations have to betested against new data and against the estab-lished theory. If new data provide counterexamples,the generalizations are changed. If the general-isations are not compatible with the old theory-tner the old theory or the generalizations are

anged.

This paper is organized by the three mainsubproblems around which the program is also or-ganized. The first is to explain the experimentaldata of each Individual molecular structure. Thatla, determine the processes (or alternative lataof processes) which account for tha experimentaldata. The second subproblem ls to generalize theresults from each structure to all structures. Inother words, find the common processes and sets ofprocesses which can explain several sets of experi-mental data. The last ls to integrate the gen-eralizations into the existing theory in such away that the theory is consistent and economical.Within each of the three main sections, the sub-sections indicate further subproblems which theprogram must solve.

11. The Problem Domain

Because this paper discusses theory formationin the context of a particular branch of science,mass spectrometry, the theory of this science willbe explained briefly for readers wishing an under-standing of the Meta-DENDRAL program at this level.

The mass spectrometer is an analytic instru-ment which bombards molecules of a chemical samplewith electrons and records the relative numbers ofresulting charged fragments by mass. When mole-cules are bombarded, they tend to fragment atdifferent locations and fragments tend to rearrangeand break apart as determined by the environmentsaround the critical chemical bonds and atoms. Thedescription of these processes is called "massspectrometry theory". The output of the instrumentthe mass spectrum or fragment-mass table (FMT)*,is commonly represented as a graph of masses offragments plotted against their relative abundance.By examining the FMT, an analytic chemist oftencan determine the molecular structure of the sampleuniquely.

Mass spectrometry theory (MS theory), as usedby the DENDRAL programs and many chemists, is acollection of statements about the fragmentationpatterns of various types of molecules upon elec-tron impact. It contains, for example, numerousstatements about the likelihood that links (bonds)between chemical atoms will break apart or remainstable, in light of the local environment of thebonds within the graph structure of the molecule.The probability of a fragment splitting off themolecule is determined by the configurations ofchemical atoms and bonds in the fragment and inits complement. Further splitting of the fragmentis determined in like manner. In addition to rulesabout fragmentations, the theory also containsru l is relating graph features of molecules andfragments to the probabilities that an atom or

*The term 'fragment-mass table' is used here inplace of the slightly misleading term 'mass spec-trum' . The latter is well entrenched in the lit-erature, but the former is more suggestive ofthe form of the data.

Page 3: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

42 Session No. 2 Applications

group of atoms will migrate from one part of thegraph to another. Fortunately, mass spectrometryresults are reproducible, or nearly so, whichmeans that identical sampLes will produce nearlyidentical FMT's (under the same operating con-ditions of the same type of instrument) .

As mentioned earlier, there are alternativelevels for expressing this, as any other theory.The model in whose terms the theory is stated isa "ball and stick" model of chemistry, in which'atom' and 'bond' are primary terms, and not, forexample, an electron density model. Some of theprimitive terms of the program's theory are listedin Appendix A.

111. First Subproblem: Explaining Each Spectrum

The so-called "method of hypothesis" inscience is sometimes proposed as the essence ofscientific work. Restating it, in a deliberatelyimprecise way, the method is to formulate a hypo-thesis to account for some of the observed dataand make successively finer adjustments to it asmore observations are made. Very little is knownabout the details of a scientist's intellectualprocesses as he goes through the method. Thinkingof hypotheses, for example, is a mysterious taskwhich must be elucidated before the method can beprogrammed. That is the task we have designatedas the first subproblem.

The program starts with individual structure-FMT (fragment-mass table) pairs as separate fromone another. It constructs alternative explana-tions for each FMT and then considers the FMT'sall together. An explanation, for the program, asfor the chemist, is a plausible account of the MSprocesses (or mechanisms) which produced themasses in the FMT. The explanation is somethingiiki_ a story of the molecule's adventures in thersass spectrometer: certain data points appear asa result of cleavage, others appear as a resultof more complex processes. At this stage of de-velopment of the theory, the chemist's story doesnot account for every data point because of thecomplexities of the instrument and the vast amountof missing information about MS theory.

The well-known problem of choosing a repre-sentation for the statements of a scientifictheory and the objects mentioned by the theory iscommon to all sciences. In computer science itis recognized as a crucial problem for the effic-ient solution (or for any solution) to each prob-lem.

Some

ways of looking at a problem turn outto be much less helpful than others, as, forexample, considering the mutilated checkerboardproblem (5) as simply a problem of covering rec-tangles (with dominoes) instead of as a parityproblem. At this stage there are no computer pro-grams which successfully choose the representationof objects in a problem domain. Therefore we, thedesigners of the Meta-DENDRAL system, have chosen

representations with which we have some experienceand for which programmed subroutines have alreadybeen written in the Heuristic DENDRAL performancesystem.

It was natural to use these representationssince the meta-program itself will not only inter-face with the Heuristic DENDRAL performance pro-gram, but is built up from many of the LISP func-tions of the performance program. Specifically,for this program, the input data are chemicalstructures paired with their experimental data:

structure-1 - FtfT-l

structure-n - FMT-n

The representation of chemical structures lsjust the DENDRAL representation used In the Heur-istic DENDRAL system. It has been described indetail elsewhere (see 1): essentially It is alinear string which uniquely encodes the graphstructure of the molecule. The FMT's, also, arerepresented in the same way as for the HeuristicDENDRAL performance system. Each FMT Is alist of x-y pairs, where the x-points are massesof fragments and the y-points are the relativeabundances of fragments of those masses.

The Predictor program of the Heuristic DENDRALsystem has been extensively revised so that theinternal representations of molecular structuresand of MS theory statements would be amenable tothe kind of analysis and change suggested in thiswork. As mentioned, Appendix A contains examplesof the terms which are used in statements of thetheory.

B. Search

It is not clear what a scientist does whenhe "casts about" for a good hypothesis. Intuition,genius, insight, creativity and other facultieshave been invoked to explain how a scientistarrives at the hypothesis which he later rejectsor comes to believe or modifies in light of newobservations. From an information processingpoint of view it makes sense to view the hypo-thesis formation problem as a problem of searchinga space of possible hypotheses for the most plaus-ible ones. This presupposes a generator of thesearch space which, admittedly, remains undis-covered for most scientific problems.

In the Heuristic DENDRAL performance systemthe "legal move generator" is the DENDRAL algo-rithm for constructing a complete and Irredundantset of molecuLar models from any specified collec-tion of chemical atoms. Heuristic search throughthis space produces the molecular structures whichare plausible explanations of the data. The meta-problem of finding sets of MS processes to explaineach data set is also conceived as a heuristicsearch problem. Writing a computer program whichsolves a scientific reasoning problem is facilitatedby seeing the problem as one of heuristic search.This is as true of the meta-program which reasons

A . Representation

Page 4: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

Sesaion No. 2 Applications 43

from collections of data to generalizations as forthe performance system which reasons from one set

of data to an explanation. For this reason wehive called the process of induction "a processot efficient selection from the domain of allpossible structures." (3)

In broad terms, the program contains (1) agenerator of the search space, (2) heuristics forpruning the tree, and (3) evaluation criteria forguiding the search. Except for problems inherentin the task, then, the problems of such a programare reasonably well understood. These three maincomponents of the heuristic search program areconsidered one at a time in the Immediate dis-cussion.

1.

Cenerator

For this part of the Meta-DENDRAL system,the generator is a procedure for systematicallybreaking apart chemical molecules to representall possible MS processes. In addition to singlecleavages, the generator must be capable of pro-ducing all possible pairs of cleavages, all poss-ible triples, and so forth. And, for each cleav-age or set of cleavages it must be able to repro-duce the result of atoms or groups of atoms mi-grating from one fragment to another. For exam-ple, after the single break labeled (a) Inr'igure I below, subsequent cleavage (b) may alsooccur. The result of (a) + (b) is the simplefragment CH3.

0CH3 j C | CH2 - CH2 - CH3

(b) (a)

FIGURE

1

Or, for the same molecule, cleavage (c) maybe followed by migration of one hydrogen atomfrom the gamma position (marked with an asterisk)to the oxygen, as shown in Figure 2:

0" SCH3 - C - CH2 j CH2 - CH3

(c) *

FIGURE

2

The generator of the search space will pos-tulate these processes as possible explanationsof the FMT data points at masses 15 (CH3)and 58 (C3H6O) for this particular molecule. Butit will also postulate the simple cleavage (b) inFigure 1 as the explanation of the peak at mass15. And for the peak at mass 58 from the processin Figure 2 it will postulate the alternativemigration of a hydrogen atom from the beta posi-tion (adjacent to the asterisk). From the gen-erator's point of view these processes are at■east as good as the more or less accurate pro-cesses shown in Figures 1 and 2.

Chemists also appeal to the localization ofthe positive charge in the charged molecule toexplain why one peak appears in a set of data butanother does not. Since it is known that only thecharged fragments are recorded by the mass spectrometer, the generator program must also manipulatecharges to account for the data.

The primitive mechanisms of the generator arecharge localization, cleavage, and group migration(where a group can be a positive charge, a singleatom, or a set of connected atoms) . The generatoris a procedure for producing all possible chargedfragments, not just all possible fragments, inother words. Putting these mechanisms togetherin all possible ways leads to an extremely largespace of possible explanations for the peaks inthe experimental FMT of a molecule. The pruningheuristics discussed in the next section alleviatethat problem. Briefly, let us turn to the actualdesign of the generator.

At the first level of branching in the treeall possible single cleavages are performed onthe original molecular structure resulting In allpossible primary fragments. At the next level,the positive charge ls aasigned to all possibleatoms in the fragments. (Switching these two stepsgives the same results and is closer to the concep-tualization used by the chemist; it results in aless efficient program, however.) Starting withlevel

3,

the procedure for generating successivelevels 13 recursive: for each charged fragmentat level n (n > 2) produce the charged fragmentsresulting from (i) cleavage of each bond in thefragment and (ii) migration of each group from itsorigin to each other atom in the fragment, where'group' currently means 'positive charge or hydro-

gen

atom.

2. Pruning Heuristics

Three simple pruning techniques are currentlyused by the program. (1) Since the result ofbreaking a pair of bonds (or n bonds) is inde-pendent of the order in which the bonds are broken,allow only one occurrence of each bond set; (2)Since MS processes tend to follow favorable path-ways, prune any branch in the tree which is nolonger

favorable,

as evidenced by failure of afragment's mass to appear in the experimental FMT;(3) Limit the number of allowable group migrationsafter each cleavage.

The first pruning technique hardly needs ex-plaining: duplications of nodes in the searchspace are unnecessary in this case and can beavoided by removing a bond frcm considerationafter all possible results of breaking it havebeen explored. The second technique carries anelement of risk, because mass spectrometry theoryincludes no guarantee that every fragment in adecomposition pathway will produce a peak in theexperimental FMT. In

fact,

the pruning can only bedone after a complete cycle of cleavage plus mi-gration because these processes occur together in

Page 5: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

44 Session No. 2 Applications

In this example, the program used no migra-tions or charge localization

information,

for pur-poses of simplicity. The program explored allsimple cleavages and found peaks corresponding toevery resulting fragment but two.* For each ofthe successful fragments, the program broke eachof the remaining bonds. From all the secondarybreaks considered, the resulting fragments corres-ponded to only four additional peaks in

the mass spectrometer--without the appearance ofthe intermediate fragments. The third techniquealso is truly heuristic since there are no theo-retical reasons why group migrations might notoccur in complex and exotic patterns betweencleavages. The bias of mass spectroscopists to-ward simple mechanisms, however, leads us to be-lieve that they would place little faith in exoticmechanisms as explanations of peaks in the data,at least not without other corroborating evidence. the FMT. So these four branches of the search tree

were each expanded by one more simple cleavage. Nontof the tertiary fragments were found in the FMTso the program terminated.

3. Evaluation

Evaluation of alternative paths in the searchtree is necessary, either during generation orafter it is completed, in order to distinguish thehighly attractive explanatory mechanisms fromthose which are merely possible. However, withoutbuilding in- the biases of experts toward theircurrent theory it is difficult to evaluate mecha-nisms at all.

The output of this phase of the program is aset of molecule-process pairs. For the one exampleshown in Figure 3, thirteen such pairs would beincluded in the output: the molecule shown therepaired with each of the thirteen processes.

IV. Second Subproblem:Generalizing To All Structures

The program's evaluation routine presentlycontains only one a priori principle, a form ofOccam's razor. In an attempt to measure the sim-plicity of the statements describing mechanisms,the program counts the number of primitive mecha-nisms

necessary

to explain a peak. Thus when

The method of hypothesis, mentioned earlieras a vague description of scientific work, suggeststhat a plausible hypothesis can be successivelymodified in light of new experience to bring ascientist closer and closer to satisfactory ex-planations of data. Apart from the problem offormulating a starting hypothesis discussed aboveand the problem of terminating the procedure, itis not at all clear how the adjustments are to bemade nor how to select the new experiences so asto make the procedure relatively efficient, or atleast workable. These are well-known problems inthe methodology of science. In other terms, theproblem of successive modifications can be viewed

there are alternative explanations of the samedata point, the program chooses the simplest one,that is, the one with the fewest steps. Simplecleavage is preferred to cleavage plus migrationplus cleavage, for example.

The result of the generation process as des-cribed so far, with pruning and evaluation, is aset of candidate MS processes for each structurewhich provides alternative explanations for datapoints In the associated mass table (FMT) . For in-stance, the program breaks the molecular structureshown in Figure 3 ot individual bonds or pairs ofbonds to give the following Information (atoms inthe structure are numbered from left to right) :

as a problem of generalizing a hypothesis fromone set of observations to a larger set.

The task for the second main part of theMeta-DENDRAL system is to construct a consistentand simple set of situation-action (S-A) rulesout of the numerous instances of rules generatedby the first phase of the system. It is necessaryfor this program to determine (a) when two in-stances (molecule-process pairs) are instances ofthe same general S-A rule, and (b) the form of thegeneral rule. In other terms, the program isgiven a set of input/output (I/O) pairs, with re-spect to the MS theory in a "black box". The taskof the program is to construct a model of what isinside the black box. Thus it needs methods for(a) determining when two outputs (processes) areof the same class and (b) constructing an input/output transformation rule which accounts for theinputs (molecules) as well as the outputs.

MassExplained Process

Breakbond: C2-C1or Breakbond: C6-C7

103

89 Breakbond: S3-C2or Breakbond: C5-C6

75 Breakbond: C4-C5Breakbond: S3-C461Breakbond: C4-C5 _ C2-C1Breakbond: C4-S3

6057

Breakbond: S3-C4 _ C2-C1Breakbond: C5-C4

464342 Breakbond: C6-C7 _ C4-S3

Breakbond: C2-S329 For each molecule there will be several assoc-iated processes, as seen from the example from28 Breakbond: C5-C6 _ C4-S3

12 3 4 5 6 7

CH3

- CH2 -

S

- CH2 - CH2 - CH2 - CH3 *The CH3 fragment was produced twice but peaks oflow masses were not recorded in the FMT.

FIGURE

3

Page 6: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

Ses.ion No. 2 Applications 45

Section 111 (Figure 3). So the same molecule willappear in several I/O pairs. Moreover, since theaolecules are chosen for the test because they are

known to exhibit similar MS behavior, there willbe a number of instances of each general MS rule.If the program is

successful,

the resulting set ofexplanations will be a unified description of theyiS behavior of all the molecules in the class. Inoperational terms, this means, at least, that thefinal set of explanations will be smaller than theunion of instances.

The program itself has not been completed.It is hoped that this sketch shows enough detailthat it will be instructive and provocative. Yetwe do not wish to emphasize unfinished pieces ofprograms .

As in Section 111, the issues of represen-tation and search are discussed separately in thissect ion .A . Representation

The general form of the rules the program isto infer has been fixed as S-A rules, as mentionedabove. But representing the instances from whichto infer the rules presents other difficulties.It has been difficult to decide how to representthe instances in such a way that they can be com-pared and

unified,

without building in conceptswhich uould beg the theory-formation question.For example, representing the chemical graphs byfeature vectors is attractive because It is easyto give the program just the right informationfor efficient comparisons. But this is the danger,too, for omitting "superfluous" Information givesthe program much too great a head start on theproblem. It might discover what we believe arein the data--the old principles--but it wouldnever discover anything new.

The difficulty with the representation ofthe instances, i.e., the molecule-process pairsin the input stream, is that the numbering of atomsin the molecules, and the corresponding numberingsin the function arguments of the processes, do notallow simple comparisons. However, by comparingrules two at a time it is possible to determinemappings between the atoms, and the function argu-ments, so that the program can make comparisons.This is described below, as part of the scheme forgeneralizing rules.

B. Search

The program has been designed to generalizeon situations which exhibit the same processes.If situations Ml and M2both exhibit process P,for example, the program attempts to construct arule (S -. P) where S captures the common featuresof Ml and M2. This procedure requires that theprogram knows enough about the syntax of the pro-cess language that it can recognize the "same"

in different contexts. Also, this pro-requires that the

program

can find common

features of situations which satisfy some criteriaof non-triviality.

As in sny learning problem there will need tobe many readjustments of the learned generalizationsas new data are considered. In this case, theaddition of each new molecule-process pair bringsthe potential for revising any S-A rule in theemerging MS theory. Since each molecule initiallyconsidered may be associated with a dozen or moreprocesses, and the emerging theory may containmany dozens of S-A rules, the generalization pro-cess will be lengthy.

All of the molecule-process pairs, which areinstances of the rules the program is supposed to

find,

are compared among themselves. The resultof this comparison is a set of generalized des-criptions which account for the input data. Thisresulting set is then organized hierarchically toform the program's MS theory by the process des-cribed In Section V.

The comparison of the instances is conductedpairwise. The first molecule-process pair ispostulated as a situation-action rule, R. A newmolecule-process rule, N, (the next one) is thencompared with Rin the following way. (1) TheMS processes, or actions, of N and R are comparedat a gross level. (2) If this comparison holds,the graph structures (situations) of N and R arecompared to find common subgraphs. If there areno common subgraphs, N is compared with the nextrule, or, if no more rules, N is postulated as anew rule. (3)

Otherwise,

the common subgraph, S,is expanded to S' to capture alternative allowableatoms beyond the common subgraph as indicated bythe situations of N and R. (4) Finally, the graphof R is replaced by S. These four steps will beillustrated and briefly described below.

Consider the rule12 3 4 5 6

(R) : CH3 - CH2 - NH - CH2 - CH2 - CH3

_>

Breakbond (4 5)

and the new molecule-process pair12 3 4 5 6

(N) : CH3 - NH - CH2 - CH2 - CH2 - CH3-» Breakbond (34).

(1) Compare the processes of R and N (theright-hand sides of R and N) , disregarding thearguments of functions. Comparison of just thenames of the processes shows that both R and Nfollow the same syntactic rules, and thus deservecloser comparison. This is made possible by thegenerator of processes described in Section 111,which names processes and sets of processes uni-quely. Had the form of the processes been differ-ent, N would be compared with the next rule (ifany).

(2) Compare the graph structures in N andR, ignoring hydrogen atoms (H) for the moment.Using the clue that the atoms involved in the

Page 7: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

46

Session No. 2 Applications

processes of both N and R are important, the pro-gram looks for a way of matching these atoms. Thenthe "interesting" subgraphs in both N and R areexpanded, starting with the important atoms andbuilding the greatest subgraph, S, which is commonto both N and R. The criteria of "interesting"subgraphs and for "greatest" common subgraph areheuristic and are specific to chemistry.

Since the nitrogen atom, N, and the adjacentright-hand carbon atom, C, are both involved inthe Breakbond process for both rules (N) and (R) ,the6e are recognized as important atoms. Thusthe subgraph common to (R) and (N) must containthese nodes. Using the numbering of the graphof (R) , nodes 2-6 are found to be common to bothgraphs. This is an "interesting" subgraph because,for example, it contains at least one non-carbonatom and contains more than two nodes. Moreover,it is the greatest subgraph common to the two.Without H's, this subgraph,

S,

is:

2 3 4 5 6(S) : C - N - C - C - C

(3) Expand the subgraph (S) . Now, recon-sider the hydrogen atoms ignored in step (2) .Nodes 2 and 6 in S fail to match exactly on thenumber of hydrogens, but the rest do match. Both2 and 6 are connected to at least two hydrogens,but in each case, the last connection may be toeither an Hora C. This is reflected in the ex-panded subgraph:

(S'): (C,H) - CH2 -NH - CH2 - CH2 - CH2 - (C,H)

The parentheses indicate alternative choices forthe atom linked by the adjacent bond.

The program now extends subgraphs only oneatom beyond the greatest common subgraph (in eachdirection) , but this clearly should be a parameterwhich the system can set.

(4) Replace the graph of R with S. Theresult of comparing N with R, then, has been tochange the conditions under which the process ofR has been observed to apply. The old rule Risreplaced by a revised rule, R' , in which thesituation is

modified,

but the action remains thesame.

(R'): (C,H) - CH2 -NH - CH2 - CH2 - CH2 - (C,H)-» Breakbond (4 5)

The result of this whole process is a set of

S-A

rules which can account for the observed data.This part of the program cautiously tries not togeneralize beyond the observed situations. So itmay miss some sweeping generalizations ("brilliantInsights") which explain several of these cautiousrules. But its result will not, at least, containn "rules" to explain n observations, unless theinput data are wildly discrepant.

V. Third Subproblem: OrganizingNew Rules and Integrating Them

Into The Existing Theory

The scientist's problem does not necessarilyend with the satisfactory formulation of generalstatements explaining all the observed data. Ifhe is working in a discipline for which there isno existing theory, he will still want to organizethe statements. But it is rare to be out of anytheoretical context. Typically, the hypothesesare formulated as extensions of some existingtheory. Thus, the Meta-DENDRAL program must beprepared to merge new MS rules into the theorypreviously constructed by the program (or by achemist). However, as a test exercise we want tosee whether the meta-program builds approximatelythe same MS theory as the performance program nowcontains.

One of the reasons we have rewritten theDENDRAL system's mass table predictor was toseparate the MS theory from the LISP functionsit drives. Making changes to the theory, then,does not require reprogramming, in the usualsense. Consequently, writing a program whichupdates the theory no longer seems to be an in-surmountable task.

The problems of organizing a set of new rulesor integrating new rules into the old theory areindependent of the source of those rules. Inorder to study these problems we have written aprogram which (a) accepts new rules from humanchemists and (b). updates the theory table of theprogram. The program for doing (a) , called thedialog program, is not central to this paper,thus this section will focus on the work to accom-plish (b) , organizing and updating the theory.

In short, the program organizes the new ruleseither into a fresh theory or into an old theory(depending on the test) in the same way. The ruletable is organized hierarchically according tothe situations in the rules. Because the situ-ations are graph structures, determining situationlevels is just determining whether one graph is

contained within another. For example, the graph-NH2 is contained in the graph -CH2-NH2 , sothe former is a higher-level situation in the ruletable. If neither situation is a subgraph of theother and they are not identical, they are put at

the same level in the rule table.

The performance program's MS theory Is rep-resented as a table of situation-action rules(S-A rules), patterned after Waterman's table ofheuristics for good poker play. (6) Situations^are predicate functions which evaluate to 'trueor 'false' in a specific context. For simplicity,only two predicate functions are allowed as situ-

ations at this time (in addition to 'T') --al tho.gha wide range of arguments may be supplied.

Also,

only one simple predicate function at a time can

12 3 4 5 6 7A. Representation

Page 8: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

Session No. 2 Applications 47

serve as a situation; Boolean expressions of predi-cates are not allowed. The first simplifying re-striction will be easy to loosen as new predicatefunctions are discovered which will be useful.Limiting a situation to a single predicate, how-ever, is an important way of limiting the diffi-culties encountered in revising the program's MStheory or analyzing it. Actions are sequences ofprimitive MS processes constituting, in fact, re-write rules for transforming one structural frag-ment into another. In this system, an action placecan also be filled by another S-A rule, allowingnesting of rules in a manner quite natural to thecurrent textbook descriptions of MS theory.

The structure of the rule table in the program,which constitutes the program's MS theory, can beexpressed in a Backus normal form:

<o-A rule> : := (<situation> <action> I(<situation> <default><S-A _ul_> ... <S-A r_le> )

situation.-* ::= (ISIT <subgraph nam_>) |(CHECKFOR <variable n_me>

"..?tion>** ::= (<function name> <arguments>) I(PROG () <action> ...

The performance program is driven by the MStneary in the rule table by the following proce-dure. The program picks up the S-A rule immediatelyfollowing the default action and checks to see if:he current context satisfies the situation by ex-ecuting the named predicate function (with approp-riate arguments). If it does, the program performst.".e associated action by executing the named (ordescribed) function (with appropriate arguments).

"The function ISIT determines whether the subgraphrimed in its argument place is contained in thechemical graph under consideration. The functionCHECKFOR checks to see whether the current valueof the named variable is equal to the value speci-:ied. This predicate allows checking global con-text before determining answers to specific ques-tions about subgraph matching.

♦■♦The basic actions (function names) known to thesystem are listed In Appendix A. Any action whichis built out of several basic functions can begiven its own name. In fact, the MS theory incne present version of the performance programcontains many named complex actions.

The very first situation, *_', Is certain to besatisfied (since 'T* evaluates to 'true'), so thedefault action will be executed if none of theother situations are satisfied.

A simple illustration will make the structureof the rule table clear. Suppose it contains rulesfor two distinct situations: ethers and alcohols,plus a subrule for a special class of ethers, namedetherl. The table would look like:

(T default(alcohol -situation alcohol-action)(ether-situation ether-action

(etherl-situation etherl-action)))

If a compound satisfies the etherl situation,neither the default action nor the ether actionwill be executed. All the processes for eachsituation are collected in the corresponding actionThis may cause duplication if some of the processesin a rule also apply to the subrules. But modifi-cation of the rule table is made easier because ofthis unification.

B. Organization and Integration

The output from the generalization programdiscussed in Section IV is a set of S-A rules (withaccompanying definitions of the situations andactions) . The set of new S-A rules is organizedwithout reference to any existing theory or inte-grated into an existing theory by exactly the sameprocess. Each S-A rule is considered in turn. Itis postulated as a new S-A rule at the top levelof the rule table if its situation does not appearelsewhere in the rule table. If a new situation,

Si,

subsumes a situation, S2, already in the ruletable (i.e., SI is more general than

S2,

or SIis contained in S2) , then the new rule is insertedin the rule table so that the old rule, with

S2,

is below the new one. Or, the reverse may be thecase, namely, that the new situation (SI) is sub-sumed by a situation (S2) already defined. Thenthe new rule must be inserted below the old one inthe hierarchy. These three cases all depend onlyupon the program's ability to determine when onegraph is contained within another. They are brieflyillustrated below.

(1) If the situation does not appear else-where in the rule table, the new S-A rule is merelyadded to the top level of the rule table. Forexample, adding an amine rule to the sample ruletable above would result in

(T default(alcohol -situation alcohol -act ion)(ether-situation ether-action

(etherl-situation etherl-action))(amine-situation amlne-action))

(2 &3) If the situation of the new S-A rulesubsumes a previously defined situation, the oldS-A rule becomes a sub-rule of the new rule. If

<rule table> ::= ((T <default> <S-A rule><S-A rule>)

<_efault> ::= <action>

<value>) |T

<action>)

Page 9: I SessionNo. I - Stacksfv118rr1216/fv118rr1216.pdf · "1 40 I I i f SessionNo. 2 Application* A HEURISTIC PROGRAMMING STUDY OF THEORY FORMATION IN SCIENCE* Bruce G. Buchanan Edward

48 Session No. 2 Applications

the situation of the new rule can be subsumed underan existing one, the new rule becomes a sub-rule ofthe old one. These two cases are both illustratedby the following example. Suppose the program addsa rule (ether2-situation ether2-action) to the ruletable above, where ether2-aituation is an instanceof ether but more general than etherl-situation.This would result in(T default

(alcohol-situation alcohol -action)(ether-situation ether-action

(ether2-situation ether2-action(etherl-situation etherl-action)))

(amine-situation amine- act ion))

After deciding where the rule must be inserted,the program adds the definitions of the new situ-ation names and action names to the system.

As this part of the program becomes more soph-isticated it will have to (a) check the rules tobe sure there are instances which actually dis-tinguish them, (b) look for less cautious ways ofgeneralizing, and (c) associate a measure of con-fidence with each rule so that it can resolve con-flicts between rules.

VI. Conclusion

The Meta-DENDRAL program described here is avehicle for studying problems of theory formationin science. It is built upon the concepts and pro-grammed routines already available in the HeuristicDENDRAL performance program, which uses a scientifictheory to explain analytical data in organic chem-istry. The Meta-DENDRAL system goes beyond theperformance program, however, in attempting toformulate the theory which the performance programwill use.

The Meta-DENDRAL program works much like achemist who is extending his theory of mass spec-trometry by looking at collections of experimentalresults. The data, for both the chemist and pro-gram, are the results of mass spectrometry experi-ments (called FMT's here) and the associated molec-ular structures. By selecting some "typical" ex-amples, first-order general hypotheses about thewhole collection of data can be proposed. Then,by subsequent adjustments, the generalizations aremodified to explain all the data. The new rulesare then integrated into the existing corpus oftheoretical statements in ways dictated by consid-erations of simplicity and personal preference.

The version of the meta-program which is des-cribed here suggests that the design is workable.But it accentuates the arbitrariness of our designdecisions and raises the questions of what alter-native designs would look like and how good theywould be. It also raises a number of issues im-portant to understanding scientific methodology ingeneral. The design question is certainly one suchissue. Others are questions concerning the cri-teria of acceptable generalizations, criteria ofgood scientific theories, and criteria for deciding

on a set of primitive concepts for a theory. Noneof these general issues will be resolved satis-factorily in the context of this program. Yetnone can be resolved for this program withoutsaying something about the general solutions.

References

(1) B.G. Buchanan and J. Lederberg, "The HeuristicDENDRAL Program for Explaining Empirical Data"In Proceedings of the IFIP-71 Congress (inpress). (Also Stanford Artificial IntelligenceProject Memo No. 141, April, 1971.)

(2) C.W. Churchman and B.G. Buchanan, "On theDesign of Inductive Systems: Some PhilosophicalProblems". British Journal for the Philosophyof

Science,

20 (1969), pp. 311-323.(3) J. Lederberg, G.L.

Sutherland,

B.G. Buchanan,and E.A. Feigenbaum, "A Heuristic Program forSolving a Scientific Inference Problem:Summary of Motivation and Implementation". InTheoretical Approaches to Non-Numerical ProblemSolving (R. Banerji and M.D. Mesarovic, eds.)Springer-Verlag, New York (1970). (Also Stan-ford Artificial Intelligence Project MemoNo. 104, November, 1969.)

(4) P.B. Medewar, Induction and Intuition inScientific Thought. American PhilosophicalSociety, Philadelphia (1969).

(5) A. Newell, 'limitations of the Current Stockof Ideas about Problem Solving". In ElectronicInformation Handling (A. Kent and O.E. Taulbee,eds.) Spartan (1965).

(6) D.A. Waterman, "GeneralizationLearning Tech-niques for Automating the Learning of Heuris-tics". Artificial Intelligence, 1:121 (1970).