
ACM SIGSOFT SOFTWARE ENGINEERING NOTES vol 16 no 2 Apr 1991 Page 33

Toward a Design Apprentice: Supporting Reuse and Evolution in Software Design

Richard C. Waters Yang Meng Tan

MIT AI Laboratory 545 Technology Square Cambridge MA 02139

Starting from a specification/high-level-design, the Design Apprentice (DA) will be able to assist a programmer in the detailed design of a program. The centerpiece of the DA is a library of commonly-used fragments of specifications, designs, and algorithms. By reusing fragments of specifications and high-level designs, a software engineer can describe a program quickly and concisely. By using its knowledge of fragments of low-level designs and algorithms, the DA can automate many aspects of detailed design. As part of this process, the DA can detect simple kinds of errors stemming from inconsistency and incompleteness in the program description.

The DA will support the methodology of programming by successive elaboration. This methodology encourages the layered, breadth-first description of programs and recognizes modification as the dominant programming activity. The DA supports layered description through the use of abstract fragments and a highly interactive approach to communicating with the user. The DA supports the evolution of programs by keeping explicit dependencies between the design decisions that underlie them.

The DA has not yet been completed; however, significant steps have been taken toward its implementation.

1. The Programmer's Apprentice

The Programmer's Apprentice project [8] is studying how software engineers analyze, synthesize, modify, specify, verify, and document software systems and how these tasks can be automated. Recognizing that it will be a long time before it is possible to fully automate any of these tasks, the near-term goal of the project is the development of a system, called the Programmer's Apprentice, which can act as a software engineer's junior partner and critic, taking over simple tasks completely and assisting with more complex tasks.

Viewed most simply, the software-development process has, at one end, the desires of an end-user and, at the other end, a program that can be executed on a machine (see Figure 1). The part of the software process closest to the user is typically called requirements acquisition; the part of the process nearest the machine is typically called implementation; the area in the middle is generally described as design. The project is following the strategy of building demonstrations of parts of the Apprentice, working inward from the two ends of Figure 1.

[Figure: Requirements, Design, Implementation]

Figure 1: Incremental approach to the Apprentice.

The central result of the project's research on the left end of Figure 1 is a demonstration system called the Requirements Apprentice [6], which can assist an analyst in the creation and modification of software requirements.

The first phase of research on the right end of Figure 1 resulted in the implementation of a demonstration system called the Knowledge-Based Editor in Emacs (KBEmacs) [8], which can automatically implement a program once a software engineer selects the algorithms to use.

Work is currently underway on a Design Apprentice (DA) that will go beyond KBEmacs by taking over the low-level aspects of design. This system will be able to critique a specification/high-level-design provided by a software engineer and automatically make low-level design decisions that follow from it.

The approach taken by the DA is presented in Section 2. The main body of the paper (Section 3) discusses a scenario illustrating how the DA is intended to operate. In Section 4, the paper concludes with a discussion of the current status of the DA and the future goals of research on the DA.

2. The Design Apprentice

The DA will assist a software engineer in the design, implementation, and later maintenance of programs. The goal is for the DA to take over the grunt work that forms a large part of the process while leaving the key hard decisions to the engineer. In general, this will mean that the engineer makes high-level design decisions while the DA makes low-level design and implementation decisions. However, the engineer can intervene directly at any level.

The input to the DA will be in the form of concise specifications/high-level-designs and requests to make changes that are phrased at this same high level. The principal capability of the DA will be to consider the basic consequences of the statements made by the engineer, making the detailed design decisions that follow from these statements and detecting any obvious conflicts.

The output of the DA will be in the form of interactive commentary, program code, and a detailed machine-manipulable design record, which keeps track of the decisions underlying the program created. Recognizing that change is the dominant fact of life for almost any software system, the DA will concentrate on supporting change both during the initial stages of a design and during later maintenance.

Two basic emphases underlie the approach undertaken by the DA: an emphasis on the reuse of standard forms and an emphasis on supporting evolution and change.

Reuse of Clichés

Expert engineers rarely construct complex artifacts (automobiles, electronic circuits, or program designs) by starting from first principles. Rather, they bring to the task their previous experience in the form of knowledge of the commonly occurring structures in the domain. The term cliché is used here to refer to these commonly occurring structures. In normal usage, the word cliché has a pejorative sound that connotes overuse and a lack of creativity. However, in the context of engineering problem solving, this kind of reuse is a positive feature.

Abbreviation in the form of references to clichés is essential for effective communication on any topic. For instance, it is difficult, if not impossible, to understand a description unless one already has a large amount of relevant knowledge. Imagine trying to communicate the design for an inventory control system to a person who knows nothing about either information systems or inventories.

A major goal of research on the DA is the codification of clichés relevant to software specification, design, and implementation. This codification includes both clichés of broad applicability and more specific clichés in particular application areas. An important benefit of orienting the DA around the use of clichés is that domain-specific knowledge can be provided as data, rather than built into the system. New domains can be covered by defining new clichés. The importance of codifying domain knowledge is a theme which the DA shares with systems such as Φ-NIX [2] and DRACO [5].

Design by Successive Elaboration

The DA is designed to support design by successive elaboration as opposed to top-down design or transformational implementation of a specification. The successive elaboration approach views design as a process of continual evolution and provides superior communication between the software engineer and the DA.

The dominant nature of change. Most traditional design methodologies implicitly assume that the specifications for the system to be designed are unchanging and known in advance. Hierarchical decomposition and/or correctness-preserving transformations are then applied when developing a design.

In contrast, design by successive elaboration takes the stance that the dominant activity in design (especially if you consider the entire lifecycle, including maintenance) is non-correctness-preserving lateral change. One typically starts with only a partial specification, which is subsequently fleshed out and modified over the lifetime of the system. The focus of design by successive elaboration is placed on perspicuously describing augmentations and changes.

Layered exposition. A key feature of both top-down design and transformational development is their incrementality. This means that a large design does not have to be understood all in one big chunk, but rather is broken down into separately understandable, smaller chunks, whose understandings can then be combined to understand the whole. In top-down design, the incremental chunks are modules; in transformational development, they are the transformation steps.

[Figure: Input Description, Program Feedback, Program Text; Translator, Designer, Coder, Cliché Library; Reasoning + Dependency System (Cake); Design Record]

Figure 2: The architecture of the DA.

Design by incremental elaboration goes beyond these traditional methodologies by supporting other kinds of incrementality as well. In particular, it does not restrict the structure of a design to be strictly tree-like, or force the designer to grow the design from the root to the leaves. In addition, it allows a designer to use the expositional technique of telling "white lies": first presenting an easy-to-understand, but somewhat inaccurate, description and later correcting the anomalies that arise due to it.

Efficient communication. Interaction with the DA is in the nature of a dialog. The engineer provides a brief initial description and then fleshes it out in response to questions put by the DA. The obvious alternative would be to have the engineer simply give a full initial description that is detailed enough to guarantee that the DA would not have any questions. However, this would require the engineer to say much more than is necessary.

In a situation where a listener (here the DA) has a significant amount of knowledge, but the speaker (here the engineer) does not know exactly how much the hearer knows, a dialog is much more efficient than a monolog, because it allows the speaker to discover what the hearer understands and address later statements only to those areas that the hearer does not understand. In a monolog, the speaker must describe everything that the hearer might not understand. In a dialog, the speaker need only describe what the hearer in fact does not understand.

Architecture of the Design Apprentice

Figure 2 shows the architecture of the DA. The foundation of the DA is the Cake knowledge representation and reasoning system [7]. Everything the DA knows (from general clichés to the engineer's specific input to the final program and its underlying detailed design) is represented using Cake's support for frames and the Plan Calculus [8].

The translator module converts design descriptions into the Plan Calculus. The coder module converts plans representing detailed program designs into program code. The DA's knowledge of commonly used specifications, designs, and algorithms is stored in the cliché library.

The designer module manages the interaction with the user. More importantly, it manages the detailed design process by controlling the reasoning performed by Cake. The steps of the design process are recorded in a design record represented in Cake. In the long run, it is expected that the design record will be the most important output of the DA, serving as a vehicle for communicating everything the DA knows about a given program to other programming tools.

The concepts and supporting technologies behind the DA are predominantly programming-language independent. In particular, all of the modules in the system are programming-language independent except for a part of the translator module and perhaps half of the coder module. As a result, while the initial target language of the DA will be Common Lisp, it should be straightforward to extend the DA to operate on any conventional programming language.

Summary of Capabilities

By way of summary, the following are the six key capabilities of the DA.

Concise input: By reference to its library of clichés, the DA will understand commonly used software-engineering terminology. By using these terms, an engineer can work faster and with fewer errors.

Propagation of decisions: The DA will automatically make subsidiary decisions that follow from decisions made by the engineer.

Error detection: The DA will check for obvious inconsistencies, both within what an engineer has said and between what the engineer has said and the DA's knowledge of clichés.

Support for evolution: The DA will keep track of dependencies throughout the software-engineering process to help the engineer carry out modifications reliably and assess their potential impact.

Explanation: The DA will be able to explain its decisions and actions so that an engineer can easily override them. Explanations can be interactive or in the form of permanent documentation.

Automation of details: Certain routine aspects of software engineering will be totally automated.

3. A Design Scenario

This section presents an imaginary scenario illustrating the capabilities of the DA and how they are expected to be used. It is a simplified and abbreviated version of a longer scenario in [9].

The scenario presented here shows the construction and subsequent modification of a paragraph justification program called justify. The basic goal of paragraph justification is to break paragraphs into lines and insert extra space where needed so that the lines have the same length. A secondary goal is to do this while minimizing the expansion of the gaps in the lines.

The justify program uses a simplified version of the justification algorithm used by the TeX document preparation system [3]. The essence of this algorithm is the realization that satisfying the secondary goal of justification can be viewed as finding the shortest path in a graph where the paths represent possible ways of breaking a paragraph into lines.

Scene 1: An Initial High-Level Description

To construct justify using the DA, a software engineer begins by giving the high-level description of the program shown in Figure 3. This approach reflects the assumption that only the engineer knows what needs to be done, and only the engineer has the insight to make the key high-level design decisions.

    (design justify (infile outfile)
      (declare ((file-of ascii) infile)
               ((file-of ascii) outfile))
      (for-each ((p (segment (tokenize infile) 'para-break)))
        (for-each ((arc (shortest-path (build-graph p))))
          (lineout outfile arc))
        (output outfile newline)))

    (implementation-guidelines
      (prefer time-efficiency)
      (ignore error-checking))

Figure 3: Initial high-level description.

The description is divided into two parts: a description of justify and a set of implementation guidelines. The implementation guidelines are used by the DA when making low-level design decisions. Here, the guidelines direct the DA to prefer time efficiency (over space efficiency) and, for the time being, to ignore error checking (such as checking for unusual boundary situations).

The description of justify illustrates three basic features of the description language: logical statements, references to clichés, and algorithmic connective tissue.

The form declare is used to introduce logical statements representing arbitrary specification information. Here, it is specified that infile and outfile hold files containing ASCII characters.

The heart of the description consists of references to the clichés ASCII, tokenize, segment, build graph, shortest path, and output. These clichés are discussed in detail in the next subsection. Here, it suffices to say that the use of clichés is what allows the engineer to communicate a significant amount of information to the DA in a very compact description.

Programming constructs are used to specify algorithmic information about the way the clichés are combined. For instance, the for-each forms specify nested loops and variables are used to specify data flow. Interaction with clichés is facilitated by rendering cliché references in the same way as subroutine calls. (Since Common Lisp is the initial target language of the DA, a Lisp-like syntax is used for algorithmic information.)

Figure 3 specifies that infile should first be tokenized (to find the words and paragraph breaks in it) and segmented by dividing the input at the places where paragraph breaks appear. Then, for each paragraph in the input, a graph should be constructed (representing the possible ways of breaking the paragraph into lines) and a shortest path through this graph should be found. Then, for each arc in the path, a line of text should be output to outfile (using the so-far-unspecified subroutine lineout). Finally, an extra newline character should be printed after the last line of each paragraph.

A Library of Clichés

In this scenario, it is assumed that the DA does not know about paragraph justification in particular, but that it knows a number of clichés that are relevant to the program justify. In a given situation, the DA could have either more or less relevant knowledge than is postulated above. This level of knowledge was chosen to illustrate a useful middle ground. On one hand, it is unrealistic to assume that the DA knows clichés corresponding to the complete program to be designed. On the other hand, it is unlikely that any tool that seeks to dramatically improve the productivity and reliability of the programming process can do so without a rich store of relevant prior knowledge.

[Figure: Shortest Path; Single Source Shortest Path (Bellman-Ford); All Pairs Shortest Path; Non-Negative Cycles SSSP (Dijkstra); Non-Negative Integer Weights SSSP (Odin); DAG SSSP]

Figure 4: Part of the shortest path cliché family.

Conceptually, a cliché consists of a set of roles and constraints between them. The roles of a cliché are the parts that vary from one use of the cliché to the next. The constraints specify how the roles interact and place limits on the parts that can be used to fill the roles. The properties of the key clichés used in the scenario are briefly described below.
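The roles-and-constraints view can be illustrated with a minimal sketch. The class name `Cliche`, its `critique` method, the message wording, and the example constraint are all hypothetical; the DA itself represents clichés in Cake and the Plan Calculus, not as Python objects.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Cliche:
    """Hypothetical record: a name, role names, and named constraints over the roles."""
    name: str
    roles: List[str]
    constraints: Dict[str, Callable[[Dict[str, Any]], bool]] = field(default_factory=dict)

    def critique(self, fillers: Dict[str, Any]) -> List[str]:
        """Report unfilled roles; if none are missing, report violated constraints."""
        missing = [role for role in self.roles if role not in fillers]
        if missing:
            return [f"For {self.name}: {', '.join(missing)} unspecified."]
        return [
            f"Constraint violation within {self.name}: {label}"
            for label, holds in self.constraints.items()
            if not holds(fillers)
        ]


# Roles taken from the description of the shortest path cliche; the constraint is invented.
shortest_path = Cliche(
    name="shortest-path",
    roles=["graph", "start-nodes", "end-nodes", "cost-function"],
    constraints={"start-nodes must be non-empty": lambda f: len(f["start-nodes"]) > 0},
)

print(shortest_path.critique({"graph": {}, "start-nodes": ["root"]}))
```

Keeping constraints as data attached to the cliché, rather than as code in the tool, mirrors the point that new domains should be covered by defining new clichés.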

Shortest path. This cliché captures knowledge about algorithms for finding the shortest paths between nodes in a graph. The primary roles are a set of start nodes, a set of end nodes, a cost function specifying the weights of arcs, and the graph itself.

There are a variety of different shortest path algorithms. Which one is best to use depends on exactly what you want to compute and on the exact properties of the graph and the cost function. In cognizance of this fact, the shortest path concept is represented as a family of related subclichés (see Figure 4) rather than as a single monolithic cliché.

The topmost cliché in Figure 4 captures the general idea of computing one or more shortest paths between one or more nodes in an unrestricted graph. Below this are clichés with more restricted specifications. The cliché single source shortest path is appropriate when you are only interested in looking at the paths starting from one particular node. In contrast, the cliché all pairs shortest path determines the shortest paths between every pair of nodes in a graph.

Because the single source shortest path cliché is of particular importance in the scenario, the complete subfamily it heads is shown in Figure 4. Within this subfamily, the clichés are classified based on properties of the graph. Single source shortest path itself, which corresponds to the Bellman-Ford algorithm, is applicable to any kind of graph. More efficient algorithms are applicable if the graph is known not to contain any cycles with negative costs, or better yet if it is known that every arc weight is a non-negative integer. Best of all, a linear time algorithm can be used if the graph is a directed acyclic graph (DAG).
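The linear time DAG case can be sketched as follows. This is a standard textbook technique rather than code from the DA, and it assumes the caller supplies the nodes in a topological order; the names `dag_sssp`, `order`, and `arcs` are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Arc = Tuple[Hashable, float]  # (destination node, arc weight)


def dag_sssp(order: List[Hashable], arcs: Dict[Hashable, List[Arc]], source: Hashable):
    """Single-source shortest paths on a DAG.

    `order` must list the nodes in topological order (assumed, not checked).
    Each node and each arc is examined once, so the running time is linear.
    Returns (dist, pred): shortest distances and predecessor links.
    """
    dist = {node: float("inf") for node in order}
    pred: Dict[Hashable, Hashable] = {}
    dist[source] = 0.0
    for u in order:
        if dist[u] == float("inf"):
            continue  # u is not reachable from the source
        for v, weight in arcs.get(u, []):
            if dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
                pred[v] = u
    return dist, pred


example_arcs = {"a": [("b", 1), ("c", 4)], "b": [("c", 2), ("d", 5)], "c": [("d", 1)]}
dist, pred = dag_sssp(["a", "b", "c", "d"], example_arcs, "a")
print(dist["d"], pred)
```

Because every arc leads forward in the given order, a node's distance is final by the time the loop reaches it, which is why a single pass suffices.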

Build graph. This cliché captures knowledge about constructing a graph. As with shortest path, build graph is represented as a family of subclichés. Which subcliché is appropriate depends on the properties of the graph to be created (e.g., whether or not it is acyclic) and on the kinds of information the graph is being constructed from (e.g., whether one is basically starting from a set of nodes or a set of arcs). In this scenario, the relevant situation is one where you have a set of items potentially corresponding to nodes and an arc-test that can be used to determine whether or not an arc exists between a given pair of nodes.

Prorate. Proration is the process of distributing a quantity across the members of a set, for example, allocating 10 packages to be carried by 4 people. Its two main roles specify the number to be prorated and the relative shares to be allocated to each individual. For example, if 10 packages are prorated over 4 people in the relative shares 2 0 1 2, the result is 4 0 2 4. As above, there is a family of subclichés depending on factors such as whether the sum of the shares is known in advance and how one is to handle the situation where the number to be prorated is not divisible by the sum of the shares.
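The worked example can be reproduced with a short sketch. The function name `prorate` and the largest-remainder rule used when the number is not divisible by the sum of the shares are assumptions; the text leaves that case to the choice of subcliché.

```python
from typing import List


def prorate(total: int, shares: List[int]) -> List[int]:
    """Distribute `total` whole units in proportion to non-negative integer `shares`."""
    share_sum = sum(shares)
    if share_sum == 0:
        raise ValueError("at least one share must be positive")
    # Whole units each member receives outright.
    result = [total * share // share_sum for share in shares]
    leftover = total - sum(result)
    # Assumed tie-breaking rule: leftover units go to the largest fractional remainders.
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: (total * shares[i]) % share_sum,
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        result[i] += 1
    return result


print(prorate(10, [2, 0, 1, 2]))  # the example from the text: [4, 0, 2, 4]
```

Whatever rule is chosen for the indivisible case, the allocations should still sum to the number being prorated, which the sketch preserves.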

Tokenize. The tokenize cliché is analogous to the popular UNIX facility LEX. It captures the basic notion of converting a sequence of characters into a sequence of tokens, e.g., preparatory to parsing. Its primary role is a grammar that specifies how characters are to be grouped into tokens. Each token has a contents, which corresponds to the characters parsed to create it, and a type, which corresponds to the grammar rule used to create it.

ASCII. The ASCII cliché captures knowledge about the ASCII character set, e.g., the identity of the characters and familiar concepts such as the difference between printable and non-printable characters.

Segment. This cliché takes a sequence and a predicate and divides the sequence into chunks by cutting it at the points where the items satisfy the predicate.
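A minimal sketch of this behavior follows. The name `segment` matches the cliché, but two details are assumptions the text does not settle: the items that satisfy the predicate are dropped, and empty chunks are not reported.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def segment(items: Iterable[T], is_cut: Callable[[T], bool]) -> List[List[T]]:
    """Cut `items` into chunks at every item for which `is_cut` holds."""
    chunks: List[List[T]] = []
    current: List[T] = []
    for item in items:
        if is_cut(item):
            # Assumption: the cutting item is discarded and empty chunks are skipped.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(item)
    if current:
        chunks.append(current)
    return chunks


tokens = ["a", "b", "BREAK", "c", "BREAK", "BREAK", "d"]
print(segment(tokens, lambda t: t == "BREAK"))
```

In the scenario the sequence would be the token stream and the predicate would test for paragraph-break tokens.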

Output. The output cliché codifies knowledge about how to output different kinds of objects into different kinds of files. It is represented as a family of clichés corresponding to object types and file types.

Selecting clichés. From the detailed nature of the clichés described above, it should be clear that there are thousands of clichés the DA could be expected to know, just as there are thousands of things a good software engineer knows. This brings up the obvious problem of how a software engineer is going to be able to select the right cliché to use at the right time.

The DA will provide several kinds of support for cliché selection. Most importantly, the clichés in the DA's library are arranged in families as shown in Figure 4. The DA selects the correct cliché from a family automatically. As a result, the engineer never has to refer to detailed clichés such as non-negative cycles SSSP, but rather only to abstract ones like shortest path. This reduces by an order of magnitude or more the number of clichés the engineer ever has to deal with explicitly.
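One way to picture automatic selection within a family is a table of applicability conditions that is searched from the most specific member to the most general. The sketch below is only an illustration: the property names, the table `SSSP_FAMILY`, and the function `select_sssp` are hypothetical, and the ordering simply follows the preference stated in the description of the single source shortest path subfamily.

```python
from typing import List, Set, Tuple

# (cliche name, graph properties it requires), most preferred first.
# The property names are invented for this sketch.
SSSP_FAMILY: List[Tuple[str, Set[str]]] = [
    ("DAG SSSP", {"acyclic"}),
    ("non-negative integer weights SSSP", {"non-negative-integer-weights"}),
    ("non-negative cycles SSSP", {"no-negative-cycles"}),
]
GENERAL_SSSP = "single source shortest path"


def select_sssp(known_properties: Set[str]) -> str:
    """Return the most preferred family member whose requirements are all known to hold."""
    for name, required in SSSP_FAMILY:
        if required <= known_properties:
            return name
    return GENERAL_SSSP  # applicable to any kind of graph


print(select_sssp({"directed", "acyclic"}))
print(select_sssp({"directed"}))
```

The engineer only names the abstract cliché; facts established elsewhere in the description (for instance, that a graph is acyclic) are what drive the choice.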

However, there are still a large number of abstract clichés that the engineer does have to know about. It is anticipated that the DA will support facilities for browsing in the cliché library and for retrieving clichés based on features of their specifications. However, to a certain extent, the engineer will have to study these clichés and become familiar with them in order to communicate effectively with the DA, just as engineers have to learn a common vocabulary in order to communicate with each other.

    In justify: para-break is undefined.
    In justify: lineout is undefined.
    For tokenize: grammar defining the tokens unspecified.
    For shortest-path: start-nodes, end-nodes and cost-function unspecified.
    For build-graph: description too vague to be implemented.

Figure 5: Critique by the DA.

    (within tokenize ()
      (is grammar
          "para-break => bof (space | newline)*
                       | (space | newline)* eof
                       | (space)* newline (space)* newline (space | newline)*
           word => (non-blank-printable-characters)+
           gap => (space | newline)+")
      (override
        (for-all ((gap gap-token))
          (let ((prev (contents (previous-token gap))))
            (= (contents gap)
               (if (member (last prev) "!?.") ........ ))))))
    (design width (token) (length (contents token)))
    (design stretch (token)
      (if (gap-token-p token) (width token) 0))

Figure 6: Specifying tokens in more detail.

Scene 2: Elaboration

Returning to the scenario, Figure 5 shows how the DA responds to the initial description of justify shown in Figure 3. Based on knowledge of the clichés used, the DA can build up an extensive description of the program. However, the software engineer's initial input is too fragmentary for the DA to proceed very far with detailed design.

In Figure 6, the engineer begins a second, more detailed, layer of description of justify by giving further specifications for the tokenization. The form within is used to specify information about a particular use of a cliché. The form is is used to specify how a role is filled in. Here, this is used to define the token grammar. (In the grammar, bof matches the beginning of the file and eof matches the end.) The terms space, newline, and non-blank-printable-characters are defined by the ASCII cliché.

The form override is used to specify a change in the tokenize cliché. Normally, the contents of a token are the input characters parsed to create it. Here, it is specified that a gap token contains either one space or two spaces (depending on whether or not the prior word ended with punctuation) no matter how many spaces and/or newlines were parsed.
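The effect of the grammar together with the override can be approximated by the following sketch. It is a simplification under stated assumptions, not the DA's tokenizer: the function name `tokenize` and the tuple representation of tokens are hypothetical, every character outside the printable non-blank ASCII range is treated like a space, and a paragraph break at the beginning or end of the file is only emitted when some blank characters are actually present there.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (type, contents)

# Runs of printable non-blank ASCII form words; every other run is blank space.
_RUN = re.compile(r"[\x21-\x7e]+|[^\x21-\x7e]+")


def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = []
    for match in _RUN.finditer(text):
        run = match.group()
        if "\x21" <= run[0] <= "\x7e":
            tokens.append(("word", run))
        elif match.start() == 0 or match.end() == len(text) or run.count("\n") >= 2:
            # Blank run at bof or eof, or one containing two newlines.
            tokens.append(("para-break", run))
        else:
            # Override: a gap holds two spaces after "!", "?" or ".", otherwise one,
            # regardless of what was parsed. A gap always follows a word here.
            previous_word = tokens[-1][1]
            tokens.append(("gap", "  " if previous_word[-1] in "!?." else " "))
    return tokens


print(tokenize("Hi there.\nBye\n\nNext"))
```

The override changes only the contents recorded for gap tokens; words and paragraph breaks keep the characters that were parsed.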

The use of override illustrates an important principle. One of the central features of clichés is their ability to be tailored to the context of a particular use. Significant tailoring can be done by filling in roles. However, as a practical matter, this is not always enough. It is sometimes necessary to modify a cliché in order to adapt it to the situation at hand. In recognition of this, the DA will support a variety of cliché-modification facilities such as override.

    Constraint violation within tokenize:
    The characters allowed in the token grammar do not span the input possibilities.
    Note, the following are standard ways of extending the grammar.
    (1) Ignore other characters.
    (2) Treat other characters like spaces.
    (3) Consider other characters to be errors.

    (use 2)

Figure 7: A constraint violation.

    (within build-graph (para)
      (declare (directed graph)
               (= (domain input-to-node-map) (select 'gap-token-p para))
               (forward-wrt-input graph))
      (is input-preprocessing
          (setq para (concatenate (instance 'gap-token :contents *indent* :stretch 0)
                                  para
                                  (instance 'gap-token :contents "" :stretch 10000))))
      (is root (input-to-node-map (first para)))
      (is arc-test
          (lambda ((x gap-token) (y gap-token))
            (let ((r (ratio (seq-between x y))))
              (and (node-p (input-to-node-map x))
                   (<= 0 r *threshold*))))))

    (design ratio (tokens)
      (/ (extras tokens) (sum (map 'stretch tokens))))

    (design extras (tokens)
      (- *line-length* (sum (map 'width tokens))))

Figure 8: Specifying how to build the graph.

In the last part of Figure 6, the engineer defines two additional properties of tokens. Width is defined as a convenient abbreviation for the length of the contents of a token. Stretch is a fundamental part of the justification algorithm as will be seen below.

The DA responds to the description in Figure 6 by noting that a constraint has been violated (see Figure 7). A fundamental assumption of the tokenize cliché is that the grammar must specify how every possible input character should be handled. Here, however, the fate of non-printable characters other than newline has been left unspecified. Based on knowledge associated with the tokenize cliché about how to handle this problem, the DA is able to suggest several solutions. The engineer can choose one of these solutions or specify some other alternative or ignore the problem and deal with it later. In Figure 7, the engineer decides to treat the remaining characters the same as if they were spaces. This will cause them to be included as parts of paragraph breaks and gaps rather than words.

In Figure 8, the engineer gives additional specifications for how to build the graph required. The declarations state that the graph to be produced is a directed graph, the nodes of the graph correspond to gap tokens in the input, and the arcs in the graph are forward with respect to the input (i.e., if there is an arc from m to n then the token corresponding to m must precede the token corresponding to n in the input). It is useful to know that the arcs are forward with respect to the input both because this implies that the graph is acyclic and because it points the way to particularly efficient graph construction algorithms.

Like the override form, the input-preprocessing role is one of the DA's cliché adaptation mechanisms. This role can be used in conjunction with any cliché to specify preprocessing that modifies one or more input values before the main processing of the cliché is applied. In Figure 8, this is used to specify that the input token sequence should be prefixed by a token containing *indent* (which represents the paragraph indentation desired) and followed by a token of very large stretchability (which causes the justification algorithm to handle the last line of the paragraph properly).

    (design lineout (outfile l)
      (for-each ((token l)
                 (padding (prorate (extras l) (map 'stretch l))))
        (output outfile (content token))
        (output outfile space :repetition padding))
      (output outfile newline))

Figure 9: Specifying how to output a line.

By specifying the root role, the engineer indicates that the graph being created is rooted and what the root is. (The mapping input to node map is a partial function that is defined as part of the build graph cliché. It specifies which input items map to nodes and what nodes they map to.)

The use of the arc test role indicates that the engineer wishes to follow a graph-building strategy based foremost on determining what the arcs should be, rather than on determining what the nodes should be. The arc test specified states that, given two tokens x and y corresponding to possible nodes, an arc should be created from a node corresponding to x to a node corresponding to y, if and only if there already is a node corresponding to x and the ratio associated with the subsequence of tokens delimited by x and y is within the specified bounds. The ratio is a measure of how ugly the subsequence would be if printed as a justified line. It is computed by dividing the number of spaces that have to be added to fill out the subsequence into a complete line, by the stretchability of the subsequence.
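The arc-driven strategy can be sketched as a single forward pass over the candidate items. This is an illustrative reading of the description, not the algorithm the DA would generate: items are identified by their positions in the input, item 0 plays the role of the root, and the names `build_forward_graph` and `arc_test` are hypothetical.

```python
from typing import Callable, Dict, List


def build_forward_graph(n_items: int, arc_test: Callable[[int, int], bool]) -> Dict[int, List[int]]:
    """Grow a rooted graph over item positions 0 .. n_items - 1 (n_items >= 1 assumed).

    An arc i -> j with i < j is created only when i is already a node and
    arc_test(i, j) holds; j becomes a node as soon as some arc reaches it.
    """
    arcs: Dict[int, List[int]] = {0: []}  # the root is a node from the start
    for i in range(n_items):
        if i not in arcs:
            continue  # no arc reached item i, so it never became a node
        for j in range(i + 1, n_items):
            if arc_test(i, j):
                arcs[i].append(j)
                arcs.setdefault(j, [])
    return arcs


# Toy arc test: an arc exists when the items are two or three positions apart.
graph = build_forward_graph(6, lambda i, j: j - i in (2, 3))
print(graph)
```

Since every arc runs from an earlier position to a later one, the input order is a topological order of the result, so the graph is acyclic by construction and each item needs to be visited only once.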

An interesting aspect of the description in Figure 8 is that it introduces three global parameters of justify. The variable *indent* holds a string containing the indentation to use when starting a paragraph. The variable *line-length* holds the desired width of the justified lines. The variable *threshold* holds a number that limits the arcs created. It is used to prune the graph by eliminating arcs that correspond to lines that are so ugly when justified that they should never be considered as part of the justified paragraph.
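The definitions of width, stretch, extras, and ratio from Figures 6 and 8 translate directly into a small sketch. The constants `LINE_LENGTH` and `THRESHOLD` stand in for *line-length* and *threshold* with arbitrary values, tokens are represented as hypothetical (type, contents) pairs, and the handling of a subsequence with no stretchability is an assumption, since the scenario deliberately ignores error checking.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (type, contents)

LINE_LENGTH = 20  # stands in for *line-length*; value chosen for the example
THRESHOLD = 2.0   # stands in for *threshold*; value chosen for the example


def width(token: Token) -> int:
    return len(token[1])


def stretch(token: Token) -> int:
    # As in Figure 6: only gaps stretch, in proportion to their width.
    return width(token) if token[0] == "gap" else 0


def extras(tokens: List[Token]) -> int:
    """Spaces that must be added to fill the tokens out to a complete line."""
    return LINE_LENGTH - sum(width(t) for t in tokens)


def ratio(tokens: List[Token]) -> float:
    total_stretch = sum(stretch(t) for t in tokens)
    if total_stretch == 0:
        # Assumption: an exact fit is perfect; anything else cannot be justified.
        return 0.0 if extras(tokens) == 0 else float("inf")
    return extras(tokens) / total_stretch


def acceptable(tokens: List[Token]) -> bool:
    """The bounds check used by the arc test: 0 <= ratio <= threshold."""
    return 0 <= ratio(tokens) <= THRESHOLD


line = [("word", "shortest"), ("gap", " "), ("word", "path"), ("gap", " "), ("word", "found")]
print(extras(line), ratio(line), acceptable(line))
```

A negative ratio means the tokens are already wider than the line, and a ratio above the threshold means the gaps would have to be stretched too far; both are rejected by the bounds check.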

In Figure 9, the engineer completes the second-level description of justify by describing the operation of lineout. Given a sequence of tokens corresponding to a line of output, the tokens are printed interspersed with extra spaces. How many extra spaces to print after each token is determined by prorating the total number of extra spaces required in proportion to the stretchability of the tokens.

The top part of Figure 10 shows the DA's response once the engineer has completed the second-level description of justify. It indicates that there are still a few things the DA has not been able to figure out.

To start with, the engineer neglected to say anything more about the shortest path computation. Because the graph is rooted, the shortest path cliché assumes by default that the root is the one and only start node. However, there is still no basis for assuming what the end-nodes are or what the cost-function is. In the middle of Figure 10, the engineer specifies that there is a single end node corresponding to the last token in the paragraph and that the ratio associated with an arc is its cost.

For shortest-path: end-nodes and cost-function unspecified.
For lineout: how should an arc be coerced into a sequence of tokens?

(within shortest-path (graph)
  (is cost-function 'ratio)
  (is end-node (input-to-node-map (last para))))

(to-coerce arc (sequence-of token)
  (seq-between (node-to-input-map (source arc))
               (node-to-input-map (destination arc))))

Figure 10: Clearing up a couple of details.

The second question asked by the DA concerns lineout. From the way it is used in Figure 9, it is clear that the second argument to lineout needs to be a sequence of tokens. However, from the way lineout is used in justify (see Figure 3), it is clear that the second argument of lineout is an arc in the shortest path. The DA needs to figure out how to coerce this arc into a sequence of tokens.

The problem of coercion comes up in many places. For example, in Figure 3, para-break has to be coerced from the name of a type of token to a predicate that can be applied to tokens, while infile and outfile have to be coerced from file names to streams that can be used for input and output. It should not be hard for the DA to handle simple one-step coercions like these. However, the coercion from an arc to a sequence of tokens is considerably more complex. In the bottom of Figure 10, the engineer specifies that the required sequence is the sequence of tokens starting after the one corresponding to the source node of the arc and ending before the destination node of the arc.
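The arc-to-tokens coercion amounts to taking the tokens strictly between two positions. A minimal Python sketch follows; the classes Node and Arc and the function arc_to_tokens are hypothetical, and the field token_index is an assumed stand-in for the node-to-input mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    token_index: int   # position of the node's token in the paragraph


@dataclass(frozen=True)
class Arc:
    source: Node
    destination: Node


def arc_to_tokens(arc, tokens):
    """Tokens after the source node's token and before the destination's."""
    start = arc.source.token_index
    end = arc.destination.token_index
    return tokens[start + 1:end]
```

With the paragraph `["", "The", " ", "cat", " ", "sat", " "]`, an arc from the node at position 0 to the node at position 4 is coerced to `["The", " ", "cat"]`: the tokens at both endpoints are left out.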

The descriptions in Figures 3-10 are a mixture of a variety of different kinds of statements, some of which are similar to traditional kinds of specifications and some of which are not. In particular, the statements do not confine themselves to describing "what" instead of "how". Some parts of the descriptions (e.g., the tokenizing grammar and some of the statements about the graph to be built) merely say what should be. Other parts (e.g., the description of lineout) essentially just say how. Still other parts fall somewhere in between. For instance, the uses of clichés like shortest path and prorate can be looked upon as shorthands for the associated specifications or as analogous to uses of generic procedures, depending on how you want to look at them.

"How" statements are included in the descriptions for two reasons. For one thing, it is assumed that the DA is not capable of making the central high-level design decisions and must be told. More importantly, while it is often said that specifications should be stated in terms of what and not how, it is not at all clear that this is a favor to the user. It is likely that what users really want is to say as little as possible. If saying "how" is easier than saying "what", that is what they would rather say.


The Code Produced by the DA

The descriptions in Figures 3-10 are close enough to complete that the DA should be able to implement justify more or less completely. A portion of the code that will result is shown in Figure 11. The full code (see [9]) consists of some 6-8 pages of Common Lisp, depending on how many of the subroutines you wish to discount on the theory that they are utilities that might be used by other programs as well. (The engineer could have asked the DA to create code earlier in the scenario; however, the code would have been much more fragmentary.)

The first four forms in Figure 11 define data structures corresponding to tokens, nodes, arcs, and the graph. The fields of the token structure contain a keyword indicating the type (:gap, :word, or :para-break), a pointer to the following token, a string recording the content, and numbers indicating the width and stretch. The first field of the node structure points to the corresponding token (i.e., it implements the node to input mapping, see Figure 10). The second and third fields are used by the shortest path computation. The fields of the arc data structure record the node structures at the beginning and end of the arc and the cost associated with the arc. The fields of the graph data structure contain a list of the nodes (sorted in the order of the corresponding tokens) and a list of the arcs (sorted in the order of their source nodes).

The fifth form in Figure 11 defines the justify program itself. The code for justify looks very much like the description in Figure 3, because this description is primarily algorithmic in nature and the DA has decided to make separate subroutines corresponding to the various clichés used in the description. There are, however, a number of detailed differences. The term para-break has been replaced by an appropriate predicate. Code has been added to open and close infile and outfile, and the resulting streams have been used where required. The use of the output cliché has been replaced by a call to the appropriate Lisp primitive. The specification construct for-each has been replaced by the iterate construct. The shortest path, which at the level of the description in Figure 3 is an abstract sequence, has been concretely implemented as a list. The function scan is used to access the elements of this list.

The forms iterate and scan are notable because they are part of the series extension to Common Lisp [10]. This extension supports a new data type called series, which corresponds to potentially unbounded sequences of values. The extension provides several dozen functions and macros that create, consume, and manipulate series. For example, the form iterate is analogous to dolist: the computation in its body is applied to every element of a series. The function scan creates a series containing the elements of a list.

Without any loss of efficiency, the series extension makes it possible to write most programs without having to use any loops. This makes programs more compact and easier to understand. It also makes programs significantly easier for the DA to manipulate, by eliminating the need to reason about loops.

As is the case with justify, the code for lineout in Figure 11 is similar to the engineer's description (see Figure 9). However, there are again a number of detailed differences.

(defstruct token type next content width stretch)
(defstruct node token min-cost best-arc)
(defstruct arc source destination ratio)
(defstruct graph nodes arcs)

(defun justify (infile outfile)
  (with-open-file (instream infile :direction :input)
    (with-open-file (outstream outfile :direction :output)
      (iterate ((para (segment (tokenize instream)
                               #'para-break-token-p)))
        (iterate ((line (scan (shortest-path
                                (build-graph para)))))
          (lineout outstream line))
        (write-char #\newline outstream)))))

(defun lineout (outstream line)
  (let* ((tokens (scan-tokens-in-arc line))
         (extras (extras tokens))
         (total-stretch
           (collect-sum (map-fn #'token-stretch tokens)))
         (tokens (scan-tokens-in-arc line))
         (stretches (map-fn #'token-stretch tokens)))
    (iterate ((token tokens)
              (padding (prorate extras stretches total-stretch)))
      (write-string (token-content token) outstream)
      (dotimes (i padding) (write-char #\space outstream)))
    (write-char #\newline outstream)))

(defun extras (tokens)
  (declare (optimizable-series-function))
  (- *line-length*
     (collect-sum (map-fn #'token-width tokens))))

(defun shortest-path (graph)
  ;; The nodes are sorted in token order: the root is first, the end node last.
  (let ((root (car (graph-nodes graph)))
        (sink (car (last (graph-nodes graph)))))
    ;; Every node starts out unreachable, except the root.
    (iterate ((node (scan (graph-nodes graph))))
      (setf (node-min-cost node) most-positive-single-float))
    (setf (node-min-cost root) 0.0)
    ;; Relax the arcs, which are already in topologically sorted order.
    (iterate ((arc (scan (graph-arcs graph))))
      (let ((destination (arc-destination arc))
            (cost (+ (node-min-cost (arc-source arc))
                     (arc-ratio arc))))
        (when (> (node-min-cost destination) cost)
          (setf (node-min-cost destination) cost)
          (setf (node-best-arc destination) arc))))
    ;; Follow the best arcs back from the end node to the root.
    (nreverse
      (collect
        (scan-fn-inclusive t
          #'(lambda () (node-best-arc sink))
          #'(lambda (a) (node-best-arc (arc-source a)))
          #'(lambda (a) (eq (arc-source a) root)))))))

Figure 11: Part of the Lisp code to be produced.

The uses of the output cliché have been replaced by appropriate Lisp code. The use of prorate has been replaced by a function call. This function returns a series of the prorated values. It is convenient for the function prorate to have an additional argument (the third) containing the sum of the shares provided as the second argument.

The function scan-tokens-in-arc generates a series of the tokens corresponding to an arc in the shortest path. (It corresponds to the coercion specification at the bottom of Figure 10.) The function extras takes a series of tokens and returns the total amount of padding that is required. (The definition of extras, which corresponds to the last statement in Figure 8, is shown in the middle of Figure 11.) The function map-fn maps a function over a series, computing a new series of the results. The function collect-sum computes the sum of the elements in a series.


(explain-design-of shortest-path)
(1) Shortest-path uses the DAG-SSSP (directed acyclic graph single source shortest path) algorithm.
  (1.1) The return value is a shortest path represented as a list of arcs.
  (1.2) The topological-sort step is omitted.

(why 1)
(2) The DAG-SSSP algorithm is the best choice, because
  (2.1) DAG-SSSP is applicable, because
    (2.1.1) Graph is directed (a premise).
    (2.1.2) Graph is acyclic, because
      (2.1.2.1) Graph is forward-wrt-input (a premise).
    (2.1.3) There is only one start node.
      (2.1.3.1) If a graph has a root, it is the start node (a heuristic).
      (2.1.3.2) Graph has a single root (a premise).
  (2.2) DAG-SSSP is faster than other applicable algorithms.
    (2.2.1) DAG-SSSP is linear.
    (2.2.2) All other algorithms are worse than linear.
  (2.3) Prefer time-efficiency is an implementation guideline.

Figure 12: The DA can explain its decisions.

The last form in Figure 11 shows the definition of the function shortest-path. As will be discussed in conjunction with Figure 12 below, the code shown comes from the DAG-SSSP cliché (see Figure 4). It embodies the standard method for finding a shortest path in a directed acyclic graph. The min-cost and best-arc fields in the node data structure are used to keep track of intermediate values while searching for a shortest path. The function scan-fn-inclusive generates a series given three functions as arguments. The first function is used to compute the initial element of the series. The second function is used to compute each subsequent element from the one before. The series contains elements up to and including the first element for which the third function is true. The function collect collects the elements of a series into a list.
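The method can be restated outside the series notation. The following Python sketch is an illustration only, not the code the DA would produce. It assumes that the nodes are given in topological order with the root first and the end node last, and that the arcs are sorted by source node in that same order, so no separate sorting step is needed. Raising an error when the end node is unreachable is a choice made for this sketch.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    name: str
    min_cost: float = math.inf          # cheapest known cost from the root
    best_arc: Optional["Arc"] = None    # arc that achieves min_cost


@dataclass(eq=False)
class Arc:
    source: Node
    destination: Node
    ratio: float                        # the cost of using this arc


def shortest_path(nodes, arcs):
    """Return the list of arcs on a cheapest path from nodes[0] to nodes[-1]."""
    root, sink = nodes[0], nodes[-1]
    for node in nodes:
        node.min_cost = math.inf
        node.best_arc = None
    root.min_cost = 0.0
    # One pass suffices because every arc is seen after all arcs into its source.
    for arc in arcs:
        cost = arc.source.min_cost + arc.ratio
        if cost < arc.destination.min_cost:
            arc.destination.min_cost = cost
            arc.destination.best_arc = arc
    if sink.min_cost == math.inf:
        raise ValueError("no path from the root to the end node")
    # Follow best_arc links back from the sink, then reverse.
    path = []
    arc = sink.best_arc
    while arc is not None:
        path.append(arc)
        arc = arc.source.best_arc
    path.reverse()
    return path
```

The single pass over the arcs is what makes the method linear in the size of the graph.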

Scene 3: Explaining Design Rationale

In Figure 12, an engineer asks the DA to explain its design for the shortest-path subroutine. Explanations of this kind are important for anyone who is trying to understand what the code does at a later date. They are also important as a way for the original engineer to inspect the work of the DA.

In its first response in Figure 12, the DA describes the algorithm used in shortest-path as being an instance of DAG-SSSP. It notes that the return value is a list of the arcs in the shortest path as opposed to the length of the path. It also notes that the topological-sort step of the algorithm has been omitted.

Ordinarily the first step in the DAG-SSSP algorithm sorts the arcs in an order that is consistent with the partial order they impose on the nodes. Here this step can be omitted: because the graph is forward with respect to the sequence of input tokens, it is easy to create the arcs in topologically sorted order in the first place. It is expected that this kind of efficiency can be achieved through a kind of balancing between the build graph and shortest path clichés.

In its second response in Figure 12, the DA explains why it picked the DAG-SSSP algorithm. This choice is made based on the properties of the graph and the fact that the engineer stated a preference for time efficiency.

Plan Calculus
Frames
Algebraic Reasoning
Propositional Logic

Figure 13: The layered architecture of Cake.

A Hybrid Reasoning System

The degree of automation the DA can provide depends ultimately on its ability to reason about structured objects (clichés, designs) and their properties. This rests on the capabilities of the Cake hybrid knowledge representation and reasoning system [7].

Cake supports reasoning through a combination of special-purpose techniques and general-purpose logical reasoning. Special-purpose representations and algorithms are essential to avoid the combinatorial explosions that typically occur in general-purpose logical reasoning systems. On the other hand, logic-based reasoning is very valuable when used, under tight control, as the glue between inferences made in different special-purpose representations.

Figure 13 shows the architecture of Cake. Note that Cake combines special-purpose representations, such as frames, with general-purpose logical and mathematical reasoning. Each layer of Cake builds on facilities provided by the more primitive layers below.

The propositional layer of Cake provides four principal facilities. First, it automatically performs simple one-step logical deductions (technically, unit propositional resolution). Placing tight limits on the kinds of deductions that are performed automatically is essential to avoid combinatorial explosions.
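Unit propositional resolution can be illustrated with a short Python sketch. It is not Cake's implementation. Clauses are assumed to be sets of nonzero integers, where a negative number stands for a negated proposition, and the function name unit_propagate is hypothetical. Unlike Cake, which records a contradiction and carries on, this sketch simply reports the first contradiction it finds.

```python
def unit_propagate(clauses):
    """Repeatedly assert the last open literal of any clause that has only one.

    Returns (assigned_literals, contradiction_found).
    """
    clauses = [set(clause) for clause in clauses]
    assigned = set()
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(literal in assigned for literal in clause):
                continue                      # clause is already satisfied
            remaining = [lit for lit in clause if -lit not in assigned]
            if not remaining:
                return assigned, True         # every literal is known false
            if len(remaining) == 1:
                assigned.add(remaining[0])    # the one-step deduction
                changed = True
    return assigned, False
```

From the clauses `{1}`, `{-1, 2}`, and `{-2, 3}` the sketch derives 1, 2, and 3. From `{1, 2}` and `{-1, 2}` it derives nothing, even though 2 follows by case analysis, which shows how tightly the automatic deductions are limited.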

Second, the propositional layer supports a general mechanism for the pattern-directed invocation of demons. This can be used to simulate various kinds of quantified reasoning. Together with the support for unit resolution, this is used to support the reasoning capabilities of the DA. Because these facilities are relatively weak, the DA will only be capable of simple deductions. However, because these facilities are general purpose, the DA will be capable of a wide range of deductions.

Third, the propositional layer detects a certain class of shallow contradictions. (In conjunction with a grammar-analyzing demon associated with the tokenize cliché, this could be used to detect the constraint violation described in Figure 7.) Importantly, contradictions are represented explicitly in such a way that reasoning can continue without having to resolve the contradiction immediately. This feature is motivated by a desire for the DA to support an evolutionary process wherein the software engineer is always in control of the order of events.

Fourth, the propositional layer acts as a recording medium for dependencies (what is often called a truth-maintenance system). Inasmuch as the dependencies are a trace of the system's reasoning, they can be used as the basis for explanation as shown in Figure 12. They are also important as the basis for retraction (nonmonotonic reasoning). This is important for providing efficient support for the modification of clichés (e.g., the override construct in Figure 6) and changes such as those shown in the remainder of the scenario below. These facilities are motivated by the observation that when you delegate work to an assistant such as the DA, you must have accountability and the ability to recover from mistakes, in case it does not do what you expect.
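The two uses of dependencies, explanation and retraction, can be sketched in Python. This is a deliberately simplified illustration, not Cake's truth-maintenance machinery: the class DependencyRecord and its methods are hypothetical, and each fact is assumed to have a single justification.

```python
class DependencyRecord:
    """Records, for each fact, the facts it was derived from."""

    def __init__(self):
        self.support = {}   # fact -> tuple of supporting facts (empty for premises)

    def premise(self, fact):
        self.support[fact] = ()

    def derive(self, fact, *reasons):
        missing = [r for r in reasons if r not in self.support]
        if missing:
            raise KeyError(f"unsupported reasons: {missing}")
        self.support[fact] = reasons

    def explain(self, fact, depth=0):
        """Return the justification of `fact` as indented lines."""
        reasons = self.support[fact]
        suffix = ", because" if reasons else " (a premise)"
        lines = ["  " * depth + fact + suffix]
        for reason in reasons:
            lines.extend(self.explain(reason, depth + 1))
        return lines

    def retract(self, fact):
        """Remove `fact` and everything that depends on it; return what was removed."""
        removed = set()
        stack = [fact]
        while stack:
            current = stack.pop()
            if current in removed or current not in self.support:
                continue
            removed.add(current)
            stack.extend(f for f, reasons in self.support.items() if current in reasons)
        for gone in removed:
            del self.support[gone]
        return removed
```

Recording "graph is acyclic" as derived from "graph is forward-wrt-input" means that retracting the latter also withdraws the former and any conclusion built on it, while unrelated premises stay in place.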

The algebraic layer of Cake has special-purpose decision procedures for equality (congruence closure), common algebraic properties of operators (such as commutativity, associativity, and transitivity), partial functions, and the algebra of sets. For example, the congruence closure algorithm determines whether or not terms are equal by substitution of equal subterms. The decision procedure for transitivity determines when elements of a binary relation follow by transitivity from other elements. The algebra of sets involves the theory of membership, subset, union, intersection, and complements. The algebraic layer of Cake also extends the basic propositional logic with the addition of typing (sorts) and limited quantificational facilities. All of these facilities are important as part of the DA's general reasoning abilities.
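Congruence closure can be shown in miniature. The Python sketch below is a naive fixpoint version written for clarity, not the algorithm used in Cake. It assumes that constants are strings and that applications are tuples whose first element is the operator name; the class name CongruenceClosure is hypothetical.

```python
def subterms(term, out):
    """Collect `term` and all of its subterms into the set `out`."""
    out.add(term)
    if isinstance(term, tuple):
        for argument in term[1:]:
            subterms(argument, out)


class CongruenceClosure:
    def __init__(self):
        self.parent = {}   # union-find forest over all known terms

    def add(self, term):
        found = set()
        subterms(term, found)
        for t in found:
            self.parent.setdefault(t, t)

    def find(self, term):
        while self.parent[term] != term:
            term = self.parent[term]
        return term

    def _close(self):
        # Merge applications with the same operator and pairwise-equal arguments
        # until nothing more can be merged.
        changed = True
        while changed:
            changed = False
            apps = [t for t in self.parent if isinstance(t, tuple)]
            for i, s in enumerate(apps):
                for t in apps[i + 1:]:
                    if (s[0] == t[0] and len(s) == len(t)
                            and self.find(s) != self.find(t)
                            and all(self.find(a) == self.find(b)
                                    for a, b in zip(s[1:], t[1:]))):
                        self.parent[self.find(s)] = self.find(t)
                        changed = True

    def assert_equal(self, s, t):
        self.add(s)
        self.add(t)
        self.parent[self.find(s)] = self.find(t)
        self._close()

    def equal(self, s, t):
        self.add(s)
        self.add(t)
        self._close()
        return self.find(s) == self.find(t)
```

After `assert_equal("a", "b")`, the query `equal(("f", "a"), ("f", "b"))` succeeds by substitution of equal subterms, while `equal(("f", "a"), ("g", "a"))` does not.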

The frames layer of Cake supports the standard notions of slots, inheritance, and instantiation. A notable feature of Cake's frame system is that constraints between the slots of a frame can be reasoned about in a general way. This helps the DA acquire information incrementally and in any order. As discussed in the next subsection, the frames layer is used to represent clichés in the cliché library. Frame inheritance is the primary organizing principle in the library.

The Plan Calculus [8] layer of Cake supports an internal representation for algorithms that is well suited for automated reasoning and other manipulations. This layer figures prominently in the representation of algorithmic clichés and will be the foundation of the DA's ability to manipulate programs, just as an earlier implementation of plans was the foundation of KBEmacs's abilities.

It should be realized that while Cake is central to the DA, it is essentially passive. It acts as an intelligent data base, performing a small amount of forward deduction whenever a piece of information is put into it and a moderate amount of backward-chained deduction whenever a question is put to it. To utilize Cake, there has to be a control module that ensures that the right information is put into Cake and the right questions are asked. In the case of the DA, this module is the designer module (see Figure 2).

Representing Clichés

It is expected that three things will figure prominently in the DA's representation for clichés: frames, plans, and program generators. Much of the information about clichés will be represented by frames. Each cliché will correspond to a frame type and each use of a cliché will correspond to an instance of the associated frame type. Hierarchies of clichés such as the one shown in Figure 4 will be represented by hierarchies of cliché frame types.

The algorithms associated with most clichés (e.g., the ones in Figure 4) will be represented using the Plan Calculus. As discussed at length in [8], using this algorithmic lingua franca facilitates the analysis, combination, and modification of clichés.

Our approach is influenced by our view on the nature of programming. Hence, before delving into the depths of our approach, we briefly characterize our view of programming.

Programming is Knowledge-Intensive: Different sources of knowledge are required. Knowledge of data structures and algorithms are key components of programming, so are program structuring techniques, program specification, and knowledge about the application domain. We believe that much of the knowledge needed in programming can be codified so that a computer program can make use of it mechanically.

Figure 14: Initial (buggy) output of justify.

(within build-graph (para)
  (is input-preprocessing
      (setq para
            (concatenate (instance 'gap-token :contents "" :stretch 0)
                         (instance 'gap-token :contents *indent* :stretch 0)
                         para
                         (instance 'gap-token :contents "" :stretch 10000)
                         (instance 'gap-token :contents "" :stretch 0)))))

Figure 15: Fixing a Bug.

However, it is sometimes the case that the knowledge associated with a cliché is too complex to be readily represented in terms of a family of subclichés linked to plans. When this happens, a program generator will be used instead. A good example of a situation where this is needed is the tokenize cliché. The relationship between the token grammar and the algorithm required is sufficiently complex that it is not practical to represent it in a declarative way.

An important goal of research on the DA is investigating how to combine program-generator-like clichés with clichés represented declaratively in terms of subclichés linked to plans. The key to this will be having the generator associated with a cliché generate a plan including detailed dependencies showing its relationships to other clichés. In effect, the generator is used to represent an unbounded family of clichés. When the generator runs, it produces the same information that the DA would have obtained if the cliché had been declaratively represented. As a result, it should be possible to freely combine the plans created by a program generator with plans derived from other sources.

Scene 4: Fixing a Bug

Returning to the scenario, since an initial version of justify is complete, the engineer decides to test it. Given a string of spaces for *indent*, 60 for *line-length*, and 2.0 for *threshold*, and a test input file, the program produces the output shown in Figure 14. Inspection of this output reveals that while things are mostly correct, the indentation is getting lost and the last line of each paragraph is justified even though it should not be.

After some thought, the engineer notices that the way arcs are converted into sequences of tokens (see the bottom of Figure 10) incorrectly clips off the special tokens added at the beginning and end of each paragraph (to handle the indentation and to prevent the justification of the last line) in the same way that it correctly clips off the gap tokens corresponding to the points where lines are broken. The easiest way to fix this is to add two additional dummy tokens around the paragraph as shown in Figure 15. The DA changes the design as requested and automatically recodes the program. Using the dependency information in the design record, it is able to do this with minimum effort.

(implementation-guidelines (consider error-checking))

In tokenize: I cannot prove that previous-token is not applied to the first token. If it is, a run time error will occur.

In ratio: I cannot prove that the second argument of / is not zero. If it is, a run time error will occur.

In lineout: I cannot prove that whenever the first argument of prorate is greater than zero, the third argument will be as well. If it is not, the sum of the output will be zero.

In shortest-path: I cannot prove that end-node is a node in graph. If it is not, a run time error will occur.

In shortest-path: I cannot prove that there is a path from root to end-node. If there is not, a run time error will occur.

Figure 16: Critique of error conditions.

(within tokenize
  (declare (not (gap-token-p (first tokens)))))

(redesign ratio (tokens)
  (let ((extras (extras tokens)))
    (if (= extras 0)
        0
        (let ((total (sum (map 'stretch tokens))))
          (if (= total 0)
              10000
              (/ extras total))))))

Figure 17: Specifying error handling.

Scene 5: Considering Errors

A more interesting example of the support provided by the DA is illustrated in Figure 16. Up to this point in the scenario, the engineer has instructed the DA not to worry about error checking associated with special situations such as boundary conditions. With the command at the top of the figure, the engineer tells the DA to start considering these questions. In response to this, the DA points out a number of detailed difficulties.

This is an important aspect of the design by successive elaboration approach: worrying about the main features of an algorithm before worrying about special cases. The advantage of this approach is that it enables an engineer to determine whether the overall approach is sound before diving into the full details. This helps avoid wasting time on details that turn out to be irrelevant in the long run.

The points brought up by the DA in Figure 16 can be classified in two ways. On one hand, some of the points correspond to boundary conditions (e.g., what happens when you ask for the predecessor of the first token) and other points concern constraints that the DA is unable to verify (e.g., the requirement that the end node must be a node in the graph). While error checking is being ignored, the DA optimistically assumes that anything that it cannot prove false is true. However, when error checking is being considered, it assumes that anything it cannot prove true is probably false.

On the other hand, some of the problems are not real problems, because the indicated boundary condition cannot actually occur or because the constraints involved are true even though the DA cannot prove them. Other problems can be ignored, because even though they are real, their associated behaviors are acceptable. Still other problems require changes in the design.

(design justify (infile outfile)
  (declare (ascii-file infile outfile))
  (for-each ((p (segment (tokenize infile) 'para-break)))
    (for-each ((arc (shortest-path (build-graph p))))
      (lineout outfile arc))
    (output outfile newline)))

(within tokenize ()
  (declare (not (gap-token-p (first tokens))))
  (is grammar
    "para-break => bof (space | newline)*
                   (space | newline)* eof
                   (space)* newline (space)* newline (space | newline)*
     word => (non-blank-printable-characters)+
     gap => (space | newline)+")
  (override (for-all ((gap gap-token))
              (let ((prev (contents (previous-token gap))))
                (= (contents gap)
                   (if (member (last prev) "!?.") "  " " ")))))
  (is process-unexpected-characters (replace-by space))
  (design width (token) (length (contents token)))
  (design stretch (token)
    (if (gap-token-p token) (width token) 0)))

Figure 18: Part of a summary description.

The engineer's response to the first two problems in Figure 16 is shown in Figure 17. The first problem is a red herring. If the DA had a sufficiently deep understanding of the tokenization grammar, it would be able to figure out that the first token cannot be a gap and therefore the boundary situation described cannot occur. The engineer adds a declaration to guide the DA in the right direction.

The second problem is more interesting. It is a symptom of the fact that the definition of ratio overlooks the possibility that a sequence of tokens might consist of just a single word. To deal with this problem, the engineer changes the definition to specify that the ratio is zero whenever the amount of padding needed is zero and that one-word lines are otherwise unacceptable.

The third problem in Figure 16 can safely be ignored. If padding is required when there is no stretchability, there is nothing that can be done in any case. It is acceptable to simply print no padding.

The last two problems in Figure 16 are more interesting. They reflect a fundamental oversimplification in the justification algorithm as presented so far. Because grossly unacceptable arcs are pruned out of the graph, there is in fact no guarantee that there will be any path from the start node to the end node. (For example, this problem arises whenever there is a word longer than a line, or three words in a row where each word is shorter than a line, but the first and second, and second and third words together are longer than a line.) Dealing with this reasonably without losing the benefits of pruning bad arcs out of the graph requires careful modification of the strategy for building the graph. However, in the interest of brevity, the scenario is brought to a close at this point.

While changes are being made, the DA continually updates the code for justify. In addition, it is valuable for the DA to maintain an updated version of the high-level description as shown in Figure 18. It is expected that most of the engineer's interaction will be in terms of this description. One of the goals of research on the DA is ensuring that every kind of change the engineer can make can be directly reflected in the high-level description.

4. Current Status and Future Goals

The implementation of the DA is far from complete; however, as discussed in [9], progress has been made on several fronts. The underlying reasoning module (Cake, see Figure 2) is essentially complete and has been used to support much of the work in the Programmer's Apprentice project over the past several years.

The translator and coder modules have not yet been implemented. However, they need not differ significantly from the analyzer and coder modules of KBEmacs. They have to be reimplemented so that they interface with Cake; however, little if any additional research on them is required.

A considerable amount of work has been done codifying the clichés needed to support the scenario in Section 3. Detailed Plan Calculus representations have been constructed for several relevant families of clichés.

Design Steps

The primary focus of work on the DA has been, and will continue to be, on the designer module. The designer module is intended to proceed using the following design cycle.

1. Determine all the choice points in the current design where a selection has to be made between alternative ways to complete the design. If there are none, the design is complete.

2. Select a choice point and select an appropriate alternative.

3. Extend the design by introducing the selected alternative and propagate the new information throughout the design. Sometimes, the new information will imply that only one alternative is acceptable at some other choice point. Whenever this happens, make this choice as well.

If the design cycle leads down a blind alley ending in a contradiction or a state where no further progress can be made, the DA will either have to ask the engineer for assistance, or back up and consider other alternatives.
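The cycle can be made concrete with a toy Python sketch. It is only an illustration under strong assumptions, not the designer module: choice points are modeled as named sets of alternatives, constraints as pairwise compatibility tests, and a blind alley is handled by plain chronological backtracking rather than by asking the engineer. All names and the example alternatives are hypothetical.

```python
def propagate(domains, constraints):
    """Drop alternatives ruled out by other choice points; False means a dead end."""
    changed = True
    while changed:
        changed = False
        for a, b, compatible in constraints:
            keep_a = {x for x in domains[a]
                      if any(compatible(x, y) for y in domains[b])}
            keep_b = {y for y in domains[b]
                      if any(compatible(x, y) for x in keep_a)}
            if keep_a != domains[a] or keep_b != domains[b]:
                domains[a], domains[b] = keep_a, keep_b
                changed = True
            if not keep_a or not keep_b:
                return False
    return True


def design(domains, constraints):
    """Return one alternative per choice point, or None if no design exists."""
    domains = {point: set(alts) for point, alts in domains.items()}
    if not propagate(domains, constraints):
        return None                                   # contradiction reached
    # Step 1: find the choice points that are still open.
    open_points = [p for p, alts in domains.items() if len(alts) > 1]
    if not open_points:
        return {p: next(iter(alts)) for p, alts in domains.items()}
    # Step 2: select a choice point and try its alternatives in turn.
    point = open_points[0]
    for alternative in sorted(domains[point]):
        # Step 3: extend the design; propagation happens in the recursive call.
        trial = dict(domains)
        trial[point] = {alternative}
        result = design(trial, constraints)
        if result is not None:
            return result
    return None                                       # back up to the caller
```

If the output must be a list of arcs and one candidate algorithm can only produce a distance matrix, propagation alone forces the other algorithm, leaving only unconstrained choice points to be selected explicitly.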

The scenario in Section 3, and others like it, have been studied in detail to determine exactly what kind of choices have to be made in order to extend the engineer's input into a detailed design. This study has revealed the need for the six basic kinds of choices described below.

It comes as no surprise that two important kinds of choices are algorithm selection and data structure selection. For instance, one has to decide which particular shortest path algorithm to use and what concrete data structures to use when representing the graph.

A more interesting kind of choice involves sharing and omitting parts of computations. As a design evolves, there are typically many opportunities for sharing redundant subcomputations and omitting parts of a computation that are unnecessary in a specific context. This happens even more often than one might expect when combining standard components from a library, because each library component has to be stored in complete generality, which is more generality than is necessary in almost any particular situation.

Several additional kinds of choices stem from the need to exercise judgment when interpreting the statements in a software engineer's description of a high level design. An important way to make descriptions concise is to leave out details that are obvious. The designer module has to fill in these details or complain when it cannot figure them out.

A key step is disambiguating statements. When the engineer says (shortest-path x), the designer module has to figure out what flavor of shortest-path algorithm is desired (e.g., finding all paths or just one path), what role of shortest-path x corresponds to (e.g., the input graph or perhaps the start node), and what kind of output is desired (e.g., a path or just its length). In essence, the designer module has to figure out how to connect each user statement with a cliche.
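One part of this disambiguation can be sketched as a lookup from the type of an argument to the cliche roles it could fill. The cliche names, role names, and the `candidate_bindings` function below are hypothetical and stand in for whatever the cliche library actually records.

```python
# Hypothetical cliche library: cliche name -> {role name: required type}.
from typing import Dict, List, Tuple

CLICHES: Dict[str, Dict[str, str]] = {
    "single-pair-shortest-path": {"graph": "graph", "start": "node", "end": "node"},
    "all-pairs-shortest-paths": {"graph": "graph"},
}


def candidate_bindings(arg_type: str,
                       cliches: Dict[str, Dict[str, str]] = CLICHES
                       ) -> List[Tuple[str, str]]:
    """List every (cliche, role) pair that an argument of `arg_type` could fill."""
    return sorted((name, role)
                  for name, roles in cliches.items()
                  for role, role_type in roles.items()
                  if role_type == arg_type)
```

Under these assumptions, knowing that x is a graph settles its role but leaves two candidate cliches, while knowing that x is a node settles the cliche but leaves two candidate roles. Each remaining ambiguity is a choice point for the designer module.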

An important process applied by the designer is propagation of type constraints in order to disambiguate the types of the variables in a program description and the type signatures of the functions. This often reveals that the description has been simplified by leaving out type converters. (For example, an arc in a path is used where a series of tokens is desired.) The designer module has to decide what type coercion to use in each such situation.
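The idea of discovering omitted type converters can be illustrated with a small sketch. It is deliberately simplified: types flow bottom-up only, every signature is known in advance, and at most one converter exists per pair of types. The signatures, the converter `path->tokens`, and the function names are all hypothetical.

```python
# Hypothetical signatures: operator -> (parameter types, result type).
from typing import Dict, Tuple, Union

Expr = Union[str, tuple]  # a variable name, or (operator, arg1, arg2, ...)

SIGNATURES: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "shortest-path": (("graph", "node", "node"), "path"),
    "print-tokens": (("token-series",), "none"),
    "path->tokens": (("path",), "token-series"),  # a converter
}

# (type found, type wanted) -> converter to insert
COERCIONS: Dict[Tuple[str, str], str] = {("path", "token-series"): "path->tokens"}


def type_of(expr: Expr, env: Dict[str, str]) -> str:
    if isinstance(expr, str):
        return env[expr]
    return SIGNATURES[expr[0]][1]


def insert_coercions(expr: Expr, env: Dict[str, str]) -> Expr:
    """Return `expr` with a converter wrapped around each mistyped argument."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    params, _ = SIGNATURES[op]
    new_args = []
    for arg, want in zip(args, params):
        arg = insert_coercions(arg, env)
        have = type_of(arg, env)
        if have != want:
            converter = COERCIONS.get((have, want))
            if converter is None:
                raise TypeError(f"{op}: no coercion from {have} to {want}")
            arg = (converter, arg)
        new_args.append(arg)
    return (op, *new_args)
```

For instance, with `g` a graph and `a`, `b` nodes, the description `("print-tokens", ("shortest-path", "g", "a", "b"))` is rewritten so that `path->tokens` is applied to the result of `shortest-path`. When several converters are possible for the same pair of types, picking one becomes a choice of the kind described above.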

Finally, design is an under-constrained problem. Often the only way that one can forge ahead is to make an assumption, e.g., assume a default value for some role in a cliche. An assumption may eventually prove wrong and have to be retracted. However, the ability to push on without asking the engineer for confirmation of every little detail is well worth the cost of occasionally making incorrect assumptions.

The design record is composed of a sequence of steps, where each step records a choice having been made and shows the new design that results. Cake dependencies are used to record the relationship between different choices so that dependency-directed backtracking can be used instead of mere chronological backtracking.
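The benefit of recording dependencies between steps can be seen in a much-simplified sketch. Cake is a full reasoning system; the `DesignRecord` class below is hypothetical and shows only the retraction side, comparing what is discarded when an assumption is withdrawn under each backtracking policy.

```python
# Hypothetical design record; step names and kinds are illustrative only.
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Step:
    name: str
    kind: str  # e.g. "algorithm", "data-structure", "assumption"
    depends_on: Set[str] = field(default_factory=set)


class DesignRecord:
    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add(self, name: str, kind: str, depends_on: Iterable[str] = ()) -> None:
        self.steps.append(Step(name, kind, set(depends_on)))

    def dependents_of(self, name: str) -> List[str]:
        """Steps that would be discarded using recorded dependencies."""
        doomed = {name}
        changed = True
        while changed:
            changed = False
            for step in self.steps:
                if step.name not in doomed and step.depends_on & doomed:
                    doomed.add(step.name)
                    changed = True
        return [s.name for s in self.steps if s.name in doomed]

    def later_than(self, name: str) -> List[str]:
        """Steps that would be discarded by chronological backtracking."""
        index = next(i for i, s in enumerate(self.steps) if s.name == name)
        return [s.name for s in self.steps[index:]]


record = DesignRecord()
record.add("use-shortest-path", "engineer-statement")
record.add("assume-nonnegative-weights", "assumption")
record.add("adjacency-list", "data-structure", ["use-shortest-path"])
record.add("dijkstra", "algorithm", ["use-shortest-path", "assume-nonnegative-weights"])
record.add("binary-heap", "data-structure", ["dijkstra"])
```

In this example, withdrawing `assume-nonnegative-weights` discards only the algorithm choice and the heap that depends on it. Chronological backtracking would also discard the unrelated `adjacency-list` decision, simply because it was made later.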

Besides steps corresponding to the six types of choices above, the design record has to contain a seventh kind of step. Much of what is in the design record is not there because of anything the designer module decides, but rather just because the engineer said so.

The only parts of the designer that have been implemented to date are the type information propagator and the choice-point locator. These have been tested by applying them to hand-generated plans representing program descriptions like Figure 18. In the future, work will concentrate first on completing the designer module and then on the other modules of the DA.

The Search Problem

Stepping back a bit, it can be seen that the general outline of what the designer module does is not unlike what transformational implementation [1] or data structure selection [4] systems do. As with these other kinds of systems, it is easy enough to show that a particular high-level design can be extended to an acceptable low-level design if the right sequence of design steps is applied. However, the real question is not whether this is possible, but rather whether an automatic system can succeed in selecting the correct design path out of the myriad of incorrect paths.


In recognition of this, the central research goal of the DA is attacking the search problem underlying design. In particular, the success of the DA will depend on the extent to which it solves (or avoids) this search problem. Four lines of attack are being used on the problem: two that seek to solve parts of the search problem and two that seek to avoid parts of it.

Operating at an abstract level. One line of attack is to operate at the abstract level of plans and design dependencies, rather than at the level of textual (or parse tree) representations for specifications, designs, and programs. This has two important virtues.

First, moving to a more canonical representation that abstracts away from unimportant details lessens the need for fix-up transformations that rearrange these details in order to allow other transformations to apply. For example, an oft-needed fix-up transformation is swapping the order of a pair of statements that are inherently unordered. In the Plan Calculus, this kind of fix-up is not required, because unordered statements are directly represented as unordered. There is never a need to fix up an incorrect order, because a gratuitous order is never forced in the first place.
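A toy illustration of this point follows. It is not the Plan Calculus: it assumes single-assignment statements written as (result, operation, inputs) triples, and the names `to_plan` and `ordering_constraints` are hypothetical. The sketch shows that two texts differing only in the order of independent statements map to the same unordered representation.

```python
# Toy stand-in for an order-free representation; not the Plan Calculus itself.
from typing import FrozenSet, List, Set, Tuple

Statement = Tuple[str, str, Tuple[str, ...]]  # (result, operation, inputs)


def to_plan(statements: List[Statement]) -> FrozenSet[Statement]:
    """Forget textual order; data flow stays implicit in the input names."""
    return frozenset(statements)


def ordering_constraints(plan: FrozenSet[Statement]) -> Set[Tuple[str, str]]:
    """Pairs (producer, consumer): the only ordering the data flow requires."""
    produced = {result for result, _, _ in plan}
    return {(inp, result)
            for result, _, inputs in plan
            for inp in inputs
            if inp in produced}


text_a: List[Statement] = [
    ("n", "count-nodes", ("g",)),
    ("w", "total-weight", ("g",)),
    ("avg", "divide", ("w", "n")),
]
text_b: List[Statement] = [text_a[1], text_a[0], text_a[2]]  # first two swapped
```

Here `text_a` and `text_b` differ as texts, yet `to_plan` gives the same result for both. No swapping transformation is ever needed, and the only ordering retained is that `avg` must follow both `n` and `w`.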

Reducing the number of fix-up transformations required reduces the length of design paths. More importantly, eliminating fix-up transformations eliminates some of the transformations that aggravate the search problem the most. The problem with fix-up transformations is that they can be applied almost anywhere. In contrast, the kind of transformations that actually advance the design typically only apply in a few specific situations.

The second virtue of operating at an abstract level is that it gives primacy to features that can actually guide the design. Rather than being mere decoration on a lower-level representation, such as a parse tree, high-level summary information becomes the centerpiece of the representation where it can be easily accessed, manipulated, and matched against.

Using large amounts of knowledge. The most important approach used by the DA to solve the search problem is to use large amounts of knowledge. Instead of having a small amount of general-purpose knowledge, the DA seeks to have a very large amount of knowledge relevant to particular situations. This is important, because short sequences of powerful steps are much easier to find than long sequences of general-purpose steps.

Consider a powerful set of general-purpose transformations such as folding and unfolding. The virtue of such operations is that a great many things can be done with them. However, they have the problem that, since they are applicable almost everywhere, it is hard to decide where to use them fruitfully.

In contrast, consider the DA's knowledge of shortest path algorithms. This knowledge is only useful when one happens to be choosing a shortest path algorithm. However, it has the virtue that when it is applicable, it is very powerful and very easy to determine that it is applicable.

Avoiding algorithm discovery. Programming can be divided into two basic areas: the creative process of discovering new algorithms and the much more mundane (and much more common) process of piecing well-known algorithms together. Systems that attack the task of algorithm discovery are trying to do what it is hardest for software engineers to do. The DA assiduously takes the opposite approach, attacking the most mundane parts of design. Its goal is to select the right algorithm from a family, not to discover new members of the family. Similarly, it seeks to combine algorithms in a straightforward way, making only simple efficiency transformations. The rest is left to the engineer.

Getting help from the user. Above all else, the whole design of the DA is oriented toward getting help from the software engineer. The challenge in this is that it forces the DA to do everything in a way that it can explain to the engineer. However, if this is done well enough, the DA can ask a question understandable to the engineer whenever it needs guidance.

Acknowledgments

In addition to the authors, several people have made major contributions to the research on the DA. In particular, Charles Rich has contributed to every aspect of the DA from its earliest inception. Further, Rich and Yishai Feldman implemented the Cake reasoning system on which the DA is based.

Support for research on the DA has been provided in part by the National Science Foundation under grant CCR-898273, in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-88-K-0487, and in part by the IBM, NYNEX, Siemens, and Microelectronics and Computer Technology corporations.

References

[1] D.R. Barstow, "An Experiment in Knowledge-Based Automatic Programming," Artificial Intelligence, 12(1):73-119, 1979.

[2] D.R. Barstow, "Domain-Specific Automatic Programming," IEEE Transactions on Software Engineering, 11(11):1321-1336, November 1985.

[3] D.E. Knuth and M.P. Plass, "Breaking Paragraphs Into Lines," Software - Practice and Experience, 11:1119-1184, 1981.

[4] J.R. Low, "Automatic Data Structure Selection: An Example and Overview," Comm. ACM, 21(5):376-384, May 1978.

[5] J.M. Neighbors, "The Draco Approach to Constructing Software from Reusable Components," IEEE Transactions on Software Engineering, 10(5):564-574, September 1984.

[6] H.B. Reubenstein and R.C. Waters, "The Requirements Apprentice: Automated Assistance for Requirements Acquisition," IEEE Transactions on Software Engineering, 17(3), to appear March 1991.

[7] C. Rich, "The Layered Architecture of a System for Reasoning about Programs," Proc. of the 9th Int. Joint Conference on Artificial Intelligence, pp. 540-546, August 1985.

[8] C. Rich and R.C. Waters, The Programmer's Apprentice, Addison-Wesley, Reading MA, and ACM Press, Baltimore MD, 1990.

[9] Y.M. Tan, Supporting Reuse and Evolution in Software Design, MIT/AIM-1256, October 1990.

[10] R.C. Waters, "Automatic Transformation of Series Expressions into Loops," ACM Transactions on Programming Languages and Systems, 13(1):52-98, January 1991.
