assistant: an intelligent system for i/i ...the documentation assistant: an intelligent system for...

RD-RI54 709 THE DOCUMENTATION ASSISTANT: AN INTELLIGENT SYSTEM FOR i/I.DOCUMENTATION..(U) ADVANCED INFORMATION AND DECISION

SYSTEMS MOUNTAIN YIEWd CA J S DEAN ET AL. 22 APR 85

UNCLASSIFIED AI/DS-TR-i047-029-i N88814-83-C-0444 F/G 9/2 N

ail

.1" bo. 2.0 'i :

1.6.

1111 1. E'

MICROCOPY RESOLUTION TEST CHARTNATIONAL BUREAU OF STANDARDS-l93-A

O%

% " --*:-'"-"'*" "*;. . " ' " " " "" " "'-"i-"-;"""""-.....: " " ..... .. -' "'

ADVANCED INFORMATION '- .

& DECISION SYSTEMS

201 San Antonio Circle, Suite 286Mountain View, CA 94040(415) 941-3912 "

TR-1047-020-1

THE DOCUMENTATION ASSISTANT:

An Intelligent System for Documentation

I Jeffrey S. Dean

Brian P. McCuneSusan G. Rosenbaum

April 22, 1985

Final Report for 1 July 1984 - 12 December 1984

Approved for public release; distribution unlimited

Prepared for:Office of Naval Research S:T"CDepartment of the Navy E.-..E800 North Quincy Street CArlington, Virginia 22217 JUN 6

L 'ip The views, opinions, and/or findings contained in this report are those of the9- author(s) and should not be construed as an official Department of the Nav.-.position, policy, or decision, unless so designated by other official documentation.'

%%,

... 85 ('068

JKLMSIVI.

r SCURITY CLASSIFICATION OF TFIIS PAGE

REPORT DOCUMENTATION PAGEIs, REPORT SECURITY CLASSIFICATION lb, 1ft!;III0l t§VL MAliKINGtS

* ~Unclassified _______________________

2s. SECURITY CLASSIFICATION AUTHO0RITY 3. DIStIIIII~rIONAVALABILITY OF 14tL6'jfl

__________________________________ Approved for public release;

Zb. 0(CLAS$IF ICA TIONtOOWAIGRAIDING SCHEDULE distribution unlimited

%4. PERFORMING ORGANIZATION REPORT P4UMUERISS 5. mutJilIDIISPNC. 001f.ANOZATION REPORT IdutAVE-RIS)

TR-1047-020-1

G& NAME OF PERFOR4MING ORGANIZATION a, OFFICE SYMBOL Is NAAr (#I 06(,N1T-'IIIING OfGAtjsZArI0O4

Advanced Information &rfplaei Office of Naval ResearchDecision SystemsI

6c. AOORESS (City. State *,ad ZIP* Cadet 7~ AoIi.!*-; lWt,. .1,..a.d ZIP C ...

201 San Antonio Circle, Suite 286 Department of the Navy* Mountain View, CA 94040-1270 800 N. Quincy Street

_______________________________ Arlington,_Virginia_22217L

* . S..NAME OF FUNDINGSOSRN Sta. OFFICE SYMBOL 9, PROCU#IE hdENT INSTRUMENT IOENTIFICA TICJII NUMSER

* . ORGANIZATIONf

Ik ADDRESS W~ily. Sile wad /I1' Code)10 SO 50U14 I 'ifF UP.L)INCA NOS.

PHUI40jPRJECT TASK11 WORK UN4iTELEM\EliJ 1(0 No NO NO.

'e uocumentatlon = ; tuIant: An intelligentSynteam for Dgmntation ________________________

'Den Jefe . McCune, Brian P.; Rosenbaum. Susan G.

I I. 13s. TYPE Of REPORT 113b. Time COVERED 14 DA )I RE PUJIl -V1. Mo1 .. 114Y# T% PAGE COUNTFinal FROM 84/07/01 yB4/1 2 /011 85/04/22 68

* W1E. SUPPLEMENTARY NOTArTiN

57. COSATI CODES IS. SUBJECT TER06S g~misn.,........... . and Idenify 6blI... au-nbeII

#IIEDO RAP SUB G tended M ewra Vqdel, Documentq tion Standards,- L.~ . OOU UU.G.ocumentat o, Sortare Engineering, Prograusuing Environmen0902 Intelligent Program Editor, Artificial Intelligence,

09Documentation Structure _______________

19. ABSTRACT ICungenue on res-erse if necessary and We*n (aI0 bv bl...b te,I,,bet

IThe purpose of this document is to describe a semi-automated system for the documentation ofcomputer software; this system is called the Documentation Assistant (DA). To motivate

'his system, a general discussion of documentation issues and problems is interwoven with

descriptions of how the DA would address many of these topics. A feasibility assessment and ~.plan for this approach is presented.

The Documentation Assistant project is a research effort at Advanced Information &Decision *~

Systems. The purpose of this effort is to study advanced techniques which address one of the

L ~ most pressing problems during the software life cycle: the process of documentation.

lCurrently, the DA exists on paper only; one goal of the research effort is to develop a

prototype version, which would be incorporated in the Intelligent Program Editor (another

researchprototype being developed at AI&DS under ONR sponsorship).

20 OI!,TniBUTIONfAVAILARILIrY OF ASRI is .1 I iI. .A ,IU lI

.JldCASSIFIEO/UtNLIMITEO - SAME AS APT, oti usk~l Unclassified

22. NAME Of A* SPON!;,oLE INDIVIDUAL 721, I It V- '0 -. ,I. 2 .

Dr. Robert B. Grafton '

(202) 696-4713 Code 430

L [2 00 FORM 1473,83 APR EIIVI)N OF I ),%N 73 IS 01II . I,l UNCLASSIFIED 11TIS1..

CONTENTS

TABLE OF CONTENTS

Page

1. OVERVIEW I

1.1 PURPOSE OF THIS DOCUMENT 11.2 PROBLEM 11.3 APPROACH I1.4 FEATURES 31.5 SCENARIOS 31.6 GUIDE TO READING 5

2. THE DOCUMENTATION PROCESS 6

2.1 THE NATURE OF DOCUMENTATION 62.1.1 Name 72.1.2 Structure 72.1.3 Attributes 82.1.4 The Documentation Taxonomy 92.1.5 Classifying Documentation 10

2.2 THE REPRESENTATION OF DOCUMENTATION 11 -'2.2.1 The Extended Program Model (EPM) 13

2.3 THE CONTEXT MODEL 142.3.1 Understanding What the Programmer is Doing 142.3.2 Tracking the Documentation and Code 16

2.4 DOCUMENTATION POLICIES 16

3. CONTROLLING DOCUMENTATION 18

3.1 POLITICAL ASPECTS OF DOCUMENTATION CONTROL 183.2 TECHNICAL ASPECTS OF DOCUMENTATION CONTROL 20

3.2.1 The State of the Documentation and Code 203.2.2 The State of the Programmer 213.2.3 A Rule Base for Controlling Documentation 21

4. THE USER INTERFACE 23

4.1 VIEWS 234.2 INTERACTION CONTROL --. 254.3 TRAVERSAL :> 264.4 RETRIEVAL , 274.5 FORMATTING ,;:,. _ 27

5. FEASIBILITY - 28

5.1 IMPLEMENTATION FEASILLJITY 2-28

.~i .i:.'-. .o'-2

F "o-." K2)h2:t. ~ *- F..~ A ., * * . . .~'* **** **~ " .*.. '. " •

~ j ~ -7. - - .- - - . a - .,- W-7--17 - - a

S . *.

CONTENTS

5.1.1 The Intelligent Program Editor 285.1.2 Documentation Database 295.1.3 Detection of Outdated Documentation 295.1.4 Documentation Retrieval 30 ,..'.5.1.5 Documentation Formatting and Analysis 30

5.2 DEPLOYMENT FEASIBILITY 305.2.1 Current Documentation Problems 305.2.2 Documentation Life Cycle Support 315.2.3 Supporting Documentation Standards 325.2.4 Knowledge Acquisition and Maintenance 39

5.3 FEASIBILITY SUMMARY 40

6. WORK PLAN 41

6.1 TASK SUMMARY 416.2 TASK DESCRIPTIONS 42 -

7. FUTURE RESEARCH 44

7.1 FUTURE RESEARCH ON PROGRAMMING ENVIRONMENTS 447.2 FUTURE RESEARCH ON DOCUMENTATION 44

8. CONCLUSION 47

9. REFERENCES 48

APPENDIX A: THE INTELLIGENT PROGRAM EDITOR

APPENDIX B: RUBRIC: A SYSTEM FOR RULE-BASEDINFORMATION RETRIEVAL

% '. *5

SW.,

•* -

... o.

.; .. ..-. .. "

FIGURES

LIST OF FIGURES

Page S.

4-1: The Structure of a View 244-2: The Module Info View 24 -

4-3: The Algorithm Info View 245-1: Documentation Life Cycle Support 335-2: Overview of SDS Documentation 345-3: The SDS Standard (Top Level) 34 -5-4: Data Item Descriptions (Partial List) 355-5: Software Test Plan DID 365-6: Attributes of the Software Test Plan DID 375-7: Example of Formal Test Requirements 375-8: Representation of the SDS Documentation Hierarchy 387-1: An Architecture For Advanced Programming Environments 45

Overview Section I

1. OVERVIEW

1.1 PURPOSE OF THIS DOCUMENT

The purpose of this document is to describe a semi-automated system forthe documentation of computer software; this system is called the DocumentationAssistant (DA). To motivate this system, a general discussion of documentationissues and problems is interwoven with descriptions of how the DA would addressmany of these topics. A feasibility assessment and plan for this approach ispresented.

The Documentation Assistant project is a research effort at Advanced Infor-mation & Decision Systems. The purpose of this effort is to study advanced tech-niques which address one of the most pressing problems during the software lifecycle: the process of documentation, Currently, the DA exists on paper only; onegoal of the research effort is to develop a prototype version. which would beincorporated in the Intelligent Program Editor (another research prototype beingdeveloped at AI&DS under ONR sponsorship).

1.2 PROBLEM

The software development and maintenance processes currently consumeextraordinary quantities of resources. A great deal of this cost can be attributedto the loss of information and knowledge during the software life cycle [Dean-831.As people work on software, they learn a great deal about it; much of this infor-mation is forgotten (as they move on to new things) or lost (as they change jobs).

The purpose of documentation is to provide a means for capturing informa-tion. Unfortunately, current documentation practices fall far short of being ableto stem the loss of information, since documentation is treated as a separate (andoften less important) activity. The result of this is that documentation is inade-quate: it is often incomplete, out of date, and inaccurate.

Improving the documentation process can make a considerable impact on* the software development and maintenance process. Clearly, the process of writ-

ing documentation is expensive - it is not amenable to full automation, anddoing it correctly is more work than doing it incorrectly. However, over the longterm (and especially in the maintenance phase), the cost of good documentationwill pay for itself many times over.

1.3 APPROACH

The Documentation Assistant addresses issues relevant at all stages of thedocumentation process, from requirements analysis in the beginning to mainte-nance in the end. However, as a starting point, the research Uescribed in this

-1-_-u-..-

Overview Section 1

document focuses primarily on that documentation which is written and main-tained by programmers (i.e., in-line program comments and related documenta-tion). This focus should not be construed to imply that these ideas are usefulonly to programmers. The capabilities provided by the Documentation Assistantwill benefit all those who work with documentation.

The Documentation Assistant presents a unique approach that will alter theway people deal with documentation. It will provide:

" Integration: The DA will be part of the programming environment; it canbe used just like other programming tools in the environment. There isno need to switch contexts in order to work with documentation.

* Assistance: The DA will help the programmer perform documentationtasks by providing both tools and structure. It will not automate docu-mentation - people are an essential part of the documentation process.

" Intelligence: To ensure that large quantities of documentation are kept upto date and consistent requires considerable knowledge about the docu-mentation process; to do this without overburdening or interfering withthe user requires an equal amount of knowledge about users and howthey will use the system.

Not only does the DA provide new ways of working with documentation, it alsoprovides different ways of representing documentation. Documentation will he:

* On-line: The computer is the home for all documentation. While a hard-copy form of documentation can be produced, the primary/original formis always on the computer.

* Structured: Rather than being viewed simply as text, documentation isrecognized to have structure, and this structure is used to help guide thedocumentation process.

* Traceable: The dependencies between different documents (or differentparts of a single document) will be represented by the DA.

e Controlled: The creation and modification of documentation will begoverned by the DA, allowing the system to keep track of the document.a-tion, and ensure that documentation is handled correctly.

The DA represents a paradigm shift in the way documentation is used, by treat-ing documentation with the same care and formality which has been applied inthe past. only to code. By providing new tools for handling documentation andnew techniques for representing documentation, the DA has the potential to %.significantly improve the production and maintenance of documentation.

-2- V.

Overview Section 1

1.4 FEATURES

The DA will provide the following features:

" Integr'ated programming environment: Documentation support is providedas part of the programming environment.

" Structured editor: Documentation is created and modified with an editorknowledgeable about the structure of documentation.

" Documentation tied to programs: Programs are explicitly linked to relateddocumentation.

" Navigation aids: Interactive tools are provided for browsing, traversing, -and searching documentation. ,.-'

* Document formatting: Documents can be formatted using standard textformatting facilities.

" Detection of outdated documentation: Missing or outdated documentationis automatically detected.

" Policy support model: documentaTion policies, standards, and guidelinesare explicitly represented in a parameterized model.

* User preference model: The user interface is based on parameterizedinformation about user preferences.

1.5 SCENARIOS

To provide a better idea of how the DA might appear in use, this sectionpresents a scenario of such a system in operation. The scenario is based on aprogrammer working in a maintenance environment, who is responsible for themaintenance of both code and documentation of a subsystem. The version of theDA described is imbedded in the program editor which the programmer normallyuses.

Two caveats are in order. First, this scenario is hypothetical; there iscurrently no system that does any of this. Second, the interaction between a userand the system would, in practice, be primarily graphical; the DA will make useof graphics for both input and output. Unfortunately, there is no good way toshow this here, and so the interaction between the user and the DA is presentedin narrative form.

-3.

14N,

- - - - - - - - - - - - - - - - -

Overview Section I

Referencing documentation:

The user is examining a program, trying to understand it in order to fix a prob-lem. He brings the program up on the screen and sees a call to a procedure withwhich he is unfamiliar, so he uses the mouse to indicate that he is interested inthat function. A pop-up menu of available documentation for that function ap- .-.pears. The user again uses the mouse to select the entry on the ment4 thatcorresponds to documentation on how the function works. A new window contain-ing the requested information appears on the screen.

Tracing through documentation:

The documentation makes reference to another procedure which he doesn't know,and so the user follows a similar procedure, selecting the function (though thistime the function reference appears inside documentation, and not inside code)and then selecting the documentation entry from the pop-up menu, giving him adisplay of documentation describing how this new procedure works.

Restoring context:

When he is finished reading this documentation, the user asks the system to popback to where he was; the documentation for the second procedure disappearsfrom the screen, and the documentation for the first procedure reappears. Theuser asks to pop back again, and the documentation again disappears, leaving himin the code where he originally started.

Creating documentation and code:

The user feels that he has an adequate understanding of the problem now, and soproceeds to start fixing things. He first adds some new lines of code to the part ofthe program that appears incorrect, but quickly realizes that he needq to define anew procedure to perform a calculation. He moves to an appropriate place in thecode and starts the definition of the new procedure. Since he is writing a newprocedure, the system displays a standard procedure header form on the screen forhim to fill out as he writes the function. While he is working, he alternatesbetween writing code and writing documentation.

Explanation by example:

There is one part of the documentation form that he doe6 not understand, and "•he selects that part with the mouse; when a pop-up menu appears, he selects theentry for sample documentation. The part of the documentation he was unsure ofis now filled in with a sample of what this type of documentation should look like.This -tample makes it clear to him how this field should look, and he then finishesthat part of the documentation.

-4-

_- ...., a. _. ., a. _ .,..a £,.~a... .. .. .....................•.....-.,,.......-....,-.-....-,...............

Overview Section 1

Reminder to finish documentation:

After finishing the new procedure, the user tries to pop back to the original pro-gram he was fixing. However, he has left out some important parts of the pro-cedure header documentation, and so the system asks him if he wishes to write thedocumentation now. He does some of the required documentation, but then de-cides to leave the rest for later, and he tells the system he does not want to updatethe rest of the documentation right now.

Reminders to add new and update old documentation:

Now that he has written the new procedure, he is ready to finish the fix to the firstfunction. He makes the appropriate code changes, and then tries to save thechanges. The system prompts him for new documentation describing the changes 1he has made fnote how he was prompted for the changes after he was finished, ,.,rather than after he made the first few changes). The system then prompts him .

with old documentation that might need to be updated. He realizes that some ofthis old documentation does indeed need changing, and so he updates this docu-mentation.

Reminders to finish documentation:

After completing this documentation, the system then asks him to finish the docu-mentation for the new procedure that he left incomplete. Since he would like totest the program before completing the documentation on that code, he declines tofinish the documentation now, knowing that he will be reminded in the future thatthe documentation needs updating. He saves his work and exits the system.

1.6 GUIDE TO READING

Sections 2 through 4 represent a design plan for the DA; Section 2 describesthe process and structure (and hence representation) of documentation; Section 3addresses the issues of controlling documentation; and Section 4 presents userinterface techniques. The feasibility of the DA approach is covered in Section 5.A plan for implementing the DA is presented in Section 6. Possible directions forfuture research are briefly discussed in Section 7. Section 8 is the conclusion, andreferences for this report are in Section 9. Appendix A is a reprint of a paper ona related effort, the Intelligent Program Editor. Appendix B is a reprint from,another related effort, the RUBRIC information retrieval system.

-5-

The Documentation Process Section 2

2. THE DOCUMENTATION PROCESS

The Documentation Assistant is designed to provide intelligent assistance inall phases of documentation production and maintenance. We use the term %"software documentation" to refer to all written pieces of information pertinentto a software system throughout its life cycle, including (but not limited 1o) prequirements, specifications, design, design rationale, source code, in-line com-ments, test plans, test data, test results, modification history, problem reports,user manuals, operations manuals, and maintenance manuals.

A necessary part of any system purporting to provide intelligent behavior is -.a model of the process/environment in which the system functions. We break lok-the process of documentation into three components:

" structure of documentation, i.e.. the form of the documentation itself

* context/state model, which tracks significant events in environment

" policy model, which represents constraints on documentation such as stan-dards, guidelines, and preferences

The following sections discuss these components in more detail.

2.1 THE NATURE OF DOCUMENTATION

The prerequisite for understanding the documentation process is to under-stand documentation itself. However, there are many views and opinions onwhat documentation is. The view taken by the DA treats documentation interms of the following three components:

" name: Each piece of documentation has a name that can be used to ref'r-ence it.

" structure. Pieces of documentation can be interconnected to furni a iruc-ture.

" attributes: Each piece of documentation may have certain properties or

additional information associated with it

The following sections cover these components in more detail, discussingrepresentation techniques that will be used in the DA. The views taken here arepartially motivated by research in the areas of semantic networks and object-oriented programming.

... - - -... ,;-.-.. . . . .

------- .- - - -- - - T h ' -Iii"

S


2.1.1 Name

Every piece of documentation has a name associated with it. The narn, or .. '.a document has the same usefulness and functionality as the name of a person: it.can be used to refer to that document or person. When objects are to be deltwith a s individuals, names are an obvious characteristic (objects which are d':ilt. ". ..

with as aggregates, on the other hand, do not need to be individually nar,d:they can be named descriptively or procedurally). Names do not have to bemeaningful, though in certain domain such as programming, it is desirable for thenames to have some meaningful interpretation.

As with people, namcs may not necessarily be unique. When this happeris,additional information is needed to provide disambiguation. For example, "JohnSmith who lives on Short Street" can be used to specify a person; "the Require-ments Specification Document for Accounting Package" can be used to specify adocument. Note that both of these specifications might be ambiguous in a glob! l."setting (e.g., "Short. Street in which city?", "Accounting Package for which con-pany?"), but it is necessary to specify only enough information for disambigua- .: . .1tion with respect to some local context.

Names can refer to classes of documentation as well as instances. Forexample, "the user manual for the Emacs editor on TOPS20" is an instance ofdocumentation; "the user manuals for editors" is a claSs uf documentationdescribing a set of documents. From a representational viewpoint, classes andinstances are treated identically.

2.1.2 Structure

Structure is the way in which pieces of documentation are woven together.It is important for those who read documentation as well as for those who writeit.; it, is also a logical model for the representation of documcntation used inter-nally by documentation systems. Knowledge of document structure can assist inmany parts of the documentation process, including:

* creation: Structure can guide the creation process by inakirig sure thatdocumentation is assembled properly.

" searching: Instead of scanning the entire document, structural knowledgeallows searching to be limited to the relevant (sub)section.

" understanding documentation: The connection of high level abstractionsto lower level details provides traceability that makes it easier to under-stand documentation.

It is natural for different kinds of documentation to be structured in avariety of ways; possible structures include sequential/linear (one dimensional),hierarchies/trees (two dimensional), and graphs/networks (n-dimensional). For

-7-

........................................................

Controlling Documentation Section 3

scenario ("if it looks like it has changed, then it has"), this estimate can be car-ried along to the next step of the process.

3.2.2 The State of the Programmer

The preferences of the user are related to the context model, thus allowingthe l)A to make decisions based on user preference with respect to a given state.For example, a user might specify that he wants to update procedure level docu- --

mentation whenever he edits and then leaves the procedure; another user mightwant to do the updating only at the end of the edit session; and another usermight not want to do updates until the code has been tested and is known to beworking.

Thus, via the preference mechanism, users can control when they will be .prompted for documentation creation/update. Since the DA will be capable ofinterrupting the user to ask for documentation, this type of control is importantto prevent the system from getting in the way of the user. The quantity ofinterruptions is also moderated by the mechanisms that allow unimportantchanges to be ignored or put aside.

3.2.3 A Rule Base for Controlling Documentation

The process of asking the user to update documentation is rule-based. Therules are based on various criteria mentioned earlier, such as importance of docu-mentation, likelihood of semantic change, preference of user, state of user. etc.The set. of rules is not fixed; it can be tailored to specific environments. .

The rest of this section is an example of how a rule-base might. be used.Rather than presenting specific rules, however, we present rule specifications.which describe a class of rules. These specifications are in the form of functionalmappings: they map the space of the left side (the Cartesian product of theifdependent variables) to the right hand side (the dependent variable).

Suppose it is necessary to determine when to notify a user about a particu-lar documentation object (attached to some segment. of code) being out of date.To answer this question, start with rule (1), whose right hand side can answerthis question. To evaluate this rule, it. is necessary to evaluate all the dependentvariables on the left hand side. That is, to determine when to notify thc userabout out of date documentation, there are three things to look for: the probabil-ity that the documentation is really out of date, the importance of the documen-tation for that particular part of the program, and the user's preferences.

.robabily of

(I) docmentation X j of link of ipporerncee I notify ,,.er .

out of date

-21- S

. . . .- .° . .°. ° .° , , °. -

., ° , . °L. . . . . . . . . . . . . . . . . . . . . . ."-3i' .d-::- ,. , ,. ,':""". " - . ... ' " "'":. , .- : . , _ : ;,,.. ... .: ....-.-.-.. . .. :.,.' i....: .-


administratively and technically, it is difficult for people to become conversantwith (and able to apply) all appropriate policies. 1By providing tools that helpapply policies, it becomes more likely that policies will be followed. Second, poli-ies arc sometimes neglected because they are difficult or cumbersome to practice.

For example, policies requiring documentation to be kept up to date with thehat.t code require programmers to continually keep track of change anti of thedocumentation affected by those changes. Documentation tools can help alleviatethis problem by providing support for the more mundane documentation choresand thus reduce the effort required to do things correctly.

3.2 TECHNICAL ASPECTS OF DOCUMENTATION CONTROL

The process of keeping documentation updated is controlled by several fac-tors: the state of the documentation and code, the state of the programmer, anda rule-base for controlling documentation.

3.2.1 The State of the Documentation and Code

The DA will keep track of the state of the documentation. By virtue of the LExtended Program Model on which the IPE and DA will be built, it is possible tomaintain a better idea of what is happening to code and documentation than haspreviously been possible. As a program is changed, the PE will have the capa-bility to determine if the meaning of the program has changed.' This is possiblebecause of the multiple program representations provided by the EPM. Forexample, if a program were textually reformatted but unchanged, it is easy todetermine that the program semantics are unaffected, since there is no change inthe syntactic structure; if a statement were inserted into the middle of a cliche, itmight be determined that the cliche didn't, change by examining the flowrepresentation and realizing that the graph of the cliche was disjoint from thegraph of the new statement.

At the coarsest level, the EPM will be able to determine if an object ha .really changed. Detected program changes can be propagated back to documen-tation (following the links that connect the documentation to the EPM describedearlier). At the next level, it, is possible to draw associations between certai,semantic changes and documentation objects. For example, there is an obviousronnertion to the syntactic object. parameterlist and the documentation objectparameters. If a procedure changes, but the parameter list does not, ther, 'here isno need to change the parameters documentation, even though other documenta-tion linked to that. procedure might need to be changed.

The idea of program change is not, a boolean decision. There may be tinicswhen a likelihood of semantic change (and propagation to documentation) can beassessed on a probabilistic scale. Instead of always assuming the worst-case

I Aclira , i! will mak, a irbilisic' h .,is :Ihoif ,Oht /iges il sclalics. s(ir,, il is impos- "'I"t"tofI, d,, ,h' tc .arbfilriryv se r rii' ,h;ms',.

-20-

........................................ ....


suggestion). There will always be events that, rail on the vital end of the scale;there is simply no way of avoiding certain things, and when it comes to the issueor vital policies, preference must give way. On the other hand, to gain accep-tance, it is crucial to avoid annoying the user and continually overruling hispreferences. Luckily, there are many events that do not fall at the exLreme endof the scale, and in these cases, there are several techniques for working out . - -'

compromises.

0 schedule negotiation: If policy requires certain documentation, but doesn'tsay when it is required, it would be possible to make a compromise witha programmer whose preference was not to do the documentation. Thecompromise is basically this: the documentation does not have to be donenow, but it has to be done at some definite time in the future (e.g., beforethe software is released). In this case, longer term policy needs are over-ridden by shorter term preferences. By warning the programmer thatthere is a policy requirement that must be met, there may be time toallow this information to sink into the programmer's mind; knowingabout this future requirement, the programmer might even rearrangethings (e.g., do some planning) so that the documentation will be easierto do.

* balancing policy against preference: If there is some flexibility in policy, itis possible to weigh policy against preference. A user can associate adegree of importance to preferences; this can be modified as the userdesires. As choices/compromises are made by the system, the user canadjust preferences to achieve desired results.

* following policy to the letter. There may be some flexibility in policy thatwill allow adjudicating in favor of preference over policy. For example, ifpolicy required a document, but certain parts of that document were notas important, it would be possible to avoid those sections if the user wereso inclined.

At first glance, these compromises may be viewed as a method for allowing a pro-grammer to avoid responsibility. However, these techniques serve an importantrole in making the DA a system that people will want to use. We believe thatproviding a means for balancing policy and preference will help achieve accep-tance and will not be abused.

Thus, we believe that programmers are generally willing to followpolicy;' however, they often need help in applying policy consistently and intelli-gently. This philosophy is based on two observations from our studies ofsoftware maintenance environments. First, policies are often neglected because ofignorance. Given the complexity of programming environments, both

I This is a lce.sary a.s.mption for "ny 1wy-b:aed Iools; if programmers iuust be forced Ioadhrv to policy, tleii there im a nmuh larger problvm that ImXI alone canlot solve.

-19- ,

, .. b, *,* '. '. *";." :

- V.-I.- T.- 1 7 .7.7"T - -


3. CONTROLLING DOCUMENTATION

The discussion so far has focused primarily on documentation itself andhow the DA handles it. We now turn to how the DA will "handle" the user.This section focuses primarily on the issues of maintaining documentation in a"proper" state; the next section looks at the user interface issues.

3.1 POLITICAL ASPECTS OF DOCUMENTATION CONTROL

To keep documentation in a proper state, it is first necessary to define justwhat the proper state is. The idea of a proper state is relative to each documen- .tation taxonomy; what is good practice in one environment may be forbidden inanother. Of the three components that constitute the documentation process, it.is the policy portion that really determines what is good and what is bad. Policysays what things should be, what things might be, and what things should not.be. By definition, if policy is followed, then things will be in a proper state.

Keeping documentation in a proper state is only half the picture; the otherhalf is helping the user. A documentation system should not be the master thatcontrols the people who use it. Just the opposite: the users should control thedocumentation system. The only way to do documentation correctly is to havethe support and help of the users. A documentation system cannot possibly workcorrectly if people refuse to use it or thwart its activities. Thus, there is a deli-cate balance that must be achieved: on one hand, there are policies that, withincertain limitations, specify how things are to be done; on the other hand, peoplehave their own ideas about how to do things.

The balance can be achieved by providing a policy-driven mechanism th.atakes great care in trying to accommodate individual preferences. There is oftenmore compatibility between policy and preference than might be evident at firstglance. Compatibility may be in the form of non-interference, where policy andpreference do not conflict; it may be in the form of compromise, where there aremutually satisfiable alternatives; or it may be in the form of goal revision, wherethe original plan is modified. The DA will try to traverse this tightrope, simplybecause it is the most reasonable path; it is unrealistic to expect to change eitherpolicy or preference to suit the needs of a documentation system.

Achieving this goal is not an easy problem. The DA will not contain built-in mechanisms that will automatically solve the tension between policy aridpreference. Rather, it will provide a framework; in any particular environment,,the ability of the system to achieve a proper balance rests in the hands of theindividuals who create the documentation taxonomy.

The key idea behind achieving this goal is to balance the policy require-nients, which are generally longer term, against the user preferences, which tendto be shorter term. When dealing with policy, the DA can choose where or) aspectrum of importance any particular issue falls. The spectrum ranges fromvital (the policy must be followed) to inconsequential (as in the case of a

- 1 8-:''"

...............................................

. . -- ,

- -.. . . o


Since it is rare for programmers to work under the constraints of a singleset of policies, it is expected that there will be a great deal of interconnectivitybetween different levels of policy in the policy model. In most organizations, .. -,

there is an apparent hierarchical structuring of policies. For example, there maybe a set of documentation standards for the entire Department of Defense. TheNavy may have documentation standards which augment the DoD standards. Acommand in the Navy may in turn have its own standards; a programmingorganization under that command may have its standards. Finally, a program-mer may have his own preferences for things not specified by any of the stan-dards. In this hierarchy of standards, lower level standards generally refine thehigher level standards, not replace them. However, this need not always be true;organizations may have permission to override standards at a higher level.

Looking at content alone, there is a great deal of similarity between pro- _cedures, standards, guidelines, and preferences. They all specify a way of doingthings; it is primarily in importance or necessity that they differ. Thus, the DAcan represent policies in a uniform way, in terms of constraints that apply to thedocumentation process. However, the importance of the constraints isrepresented separately. For example, suppose there is a constraint specifyingthat module level documentation include a revision history. If this is specified bya standard, then the constraint is vital; if this is specified as a guideline, then theconstraint is recommended; if it is specified as a preference, then the constraint isweak.

-17-

o,

W-"V Q~ 7-----.- V. 77~-.~4II7 MIE


what the programmer is doing.

2.3.2 Tracking the Documentation and Code:-. -.

The next step in tracking the programming process is to keep tabs on the ....status of the documentation and the code. There are a number of states asoftware system may be in, including: preliminary design, design, development, .debugging, unit testing, system testing, beta testing, and released. Knowing thestate of a system as a whole does not mean that all the components of the systemare in the same state; each component, subsystem, etc., must also be tracked.Tracking the documentation may be a somewhat easier job than tracking theprogrammer; since states do not change all that often, it is not unreasonable forthe programmer to tell the system the state of documentation and code.

However, the system is capable of tracking documentation or code that haschanged. This information is used to provide a more accurate assessment of thestate of documentation and code; it is stored in the documentation database (andis not lost between sessions). For example, if the system is in the "released"state, and the code is modified, it should be infer ed that the state of the systemhas changed.

2.4 DOCUMENTATION POLICIES

Thus far, we have talked about two dimensions of the documentation pro-cess: the nature and structure of documentation, and the tracking of program-mers and documentation. The final dimension necessary for the DA concerns thechoice of what and how documentation is created and updated.

The decisions as to what documentation should be written, how it shouldbe written, how it should be updated, etc., are not determined just by the pro-grammer. There are usually administrative procedures that specify what docu-mentation is required, standards/guidelines spe,.ifying/suggesting how to writedocumentation; and only if there are decisions that are unspecified is the pro-grammer allowed to follow his own preferences. These procedures, standards,and guidelines are collectively referred to here as documentation policies.

The DA will maintain models of these policies, which it will use to guide theprocess of creating and updating documentation. Policies will be represented in astructured fashion; connections between policies will be explicitly noted. Thus, itcan easily be determined if part of one policy refers to or overrides (or conflicts)part of another.

The explicit representation of policy is ment to provide for easy accommo-dation of different policies. While different organizations may have similar (oroverlapping) policies, it is rare for separate grolps to have identical sets of poli-cies. The policy model factors out this information, ensuring that knowledge "-'.about. policy is not hardwired into the DA. To make the DA support the policiesof a different organization, it would only be necessary to modify the policy model,rather than rewrite the I)A itself.

-18-


In order to do th;s intelligently, it is firsi necessary for the I)A to under-stand what the programmer is doing. The con .xt which we focus on here is theediting context, where the programmer may be creating, modifying, or readingprograms. Since the [)A will be tied to the Intelligent Program Editor, thischoice of context is logical. From a larger perspective, the context, should cer-tainly include programming activities outside the scope of the editor; it mighteven include activities outside the scope of the computer system (e.g., "what. timedoes the programmer leave for the day?" or "when does the programmer go onvacation?").

There are two basic approaches for determining what the programmer isdoin, id where he is doing it):

* announcement: The programmer "announces" to the system what he isdoing (i.e., the user does the work).

* inference: The system watches the individual actions the programmertakes, and tries to piece them together into a plan to provide a largermodel of what the programmer is doing (i.e., the system does the work).

The first approach is cumbersome, but nonetheless useful because there are cer-tain times when the only way to figure out what is happening is by asking theuser. The second approach requires a good deal of intelligence on the part of thesystem. To achieve this level of understanding, it is first necessary to build alibrary of plans that describe common sequences of actions. Then, techniques forsorting through a large number of plans in search of the plausible one(s) arenecessary. Finally, it is necessary to determine which plans really fit what theuser is doing. Based on current technology, this approach may be rather expen-sive.

However, there is another alternative that is essentially a combination ofthe two. Suppose that an editor had certain functions that, when used by the

programmer, would give insights as to what was currently happening. That is,these functions would be designed so that when they are invoked, the editorwould be able to guess fairly easily and reliably what the programmer was doing. terFor example, in the Emacs editor, there is a command for compiling code withoutleaving the editor; use of this command is an indication that the programmerthinks that the code is complete (at least, complete enough to run). As anotherexample, imagine an editor that has a special command.for editing procedures;when the command is invoked with a procedure as an argument, only that pro-cedure is displayed on the screen, and editing continues on that procedure untilthe programmer gives a command to edit another procedure. This commandwould allow the editor to infer what function the programmer was editing.

Thus, by providing commands that work on semantic units, instead of tex-tual units, an editor may be able to infer a great deal about, what the program-mer is doing. By virtue of the Extended Program Model provided by the IPE,the IPE is in a good position to provide extended commands that are based onthe syntactic or semantic structure of programs, thus providing better clues as to

-15- ,'-


a single monolithic database), an incremental approach to EPM development canbe taken. Databases and tools for their manipulation can be developed and thenintegrated into the EPM. Separate databases for the EPM also mean that eachrepresentation can be stored in the most appropriate type of database. Forexample, the database representation for text (i.e., a linear program representa-tion) will be quite different than the representation for a syntax tree (i.e., a twodimensional representation).

Thus, the architecture of the EPM provides a natural way of adding addi-tional representations. In the case of documentation, a new database supportingdocumentation structure and operations would be added. The documentationdatabase will provide three basic components: documentation objects (represent-ing the documentation itself), attributes (representing properties of documenta- . -

tion), and relationships (representing the connections that tie documentationobjects together).

2.3 THE CONTEXT MODEL

The next step in building an intelligent documentation tool is to build toolsfor understanding the process of documentation. In order to manipulate docu-mentation properly, it is necessary to understand what the user is currently doingand the current state of the documentation. The focus here is narrowed to theprocess of documentation from a programmer's point of view; hence, theemphasis will be primarily on in-line documentation. However, these ideas arenot restricted to in-line documentation; given that documentation is on-line,these ideas can be applied to all phases of the documentation process.

The Context Model is a part of the DA that deals with keeping track of allthat is happening in the programmer's environment. This includes keeping trackof what the programmer is doing (e.g., writing code, debugging code, fixing docu-

0 mentation) and keeping track of the status of code and documentation. TheContext Model has two components, a passive component, in which all relevantinformation is stored, and an active component, that is responsible for collectingand maintaining the information in the database. The Context Model is an-integral part of the DA, and is not directly visible to users; thus, most userswould not even be aware of its existence as a separate component.

2.3.1 Understanding What the Programmer is Doing

One of the key ideas behind the DA is to help the programmer write andupdate documentation without being intrusive. If the system gets in the way ofthe programmer, he will eventually turn it off or ignore it. On the other hand, inorder to ensure that documentation is kept up to date, it may be necessary forthe DA to intrude on the programmer. The DA will be an active partner in thedocumentation process, and, unlike conventional tools, it will be capable of tak-ing the initiative and asking the programmer to do something.

-14-d-::::

.......................................................................................


documentation that hide much of this detail. " "-.

The DA will be able to provide this documentation structuring because itwill build on the Intelligent Program Editor (IPE), which provides a rich environ-ment for program manipulation. In particular, the Extended Program Modelpart of the IPE will provide the mechanisms necessary for this cross linking.'

2.2.1 The Extended Program Model (EPM)

The Extended Program Model can be thought of as a database that main-

tains multiple representations of programs. Currently, there are plans for sixrepresentations: text, syntax, flow, segmented parse, cliche, and intentional aggre-gate [Shapiro-841. The EPM will maintain consistency among these representa-tions, and the IPE will allow any of these representations to be directly viewedand manipulated. For example, if an IPE user were examining the syntacticrepresentation and made a change, the corresponding change would be made tothe text and other representations.

The representations in the EPM are linked together, allowing the IPE tomap between objects in different representations. By adding documentation tothe EPM as an additional representational level, the D. will be able to use the •EPM to connect documentation to program objects as well as to other documen-tation.

Thus, by building on top of the EPM, the DA will be able to integratedocumentation with code. Moreover, the DA can make use of the variousrepresentations in the EPM, enabling documentation to be linked to code at anylevel, and not just the textual level. For example, the documentation for a pro-cedure header might be linked to the syntactic unit corresponding to the entireprocedure, instead of (as in current practice) placing the documentation text justabove the procedure in its text form. This means that the binding between thecode and the documentation is increased; if, for example, the procedure were tobc removed, it would be clear that the documentation should also be eliminat,ed.

The DA should also be able to make use of the version control facilityplanned for the EPM. Since the documentation is so closely tied to the EPM, it,is anticipated that the EPM's version control mechanism will also be able to pro-vide version control for docume itation. Because of the linkages between docu-mentation and the EPM, the version control of documentation will be intimatelytied to the version control of the code itself. It will not be possible to lose syn-chronization between the code and the documentation, as the separation betweencode and documentation itself will be blurred.

While the EPM will provide a uniform view of different program representa-tions, it. is internally composed of a number of databases, linked together to pro-vide required connectivity. By building upon separate databases (instead of using

I The IitA'ligel; ,. Program I,,iiwi r (and i. I .xle h' I'r ogran Mt x li) is viiig dev,-l'I 'd mh'i r ;

ri-se (* ch ttiti t rag p , iis( rtd I ty ( )NH

t .%

-13-

, '.-.. ." ,,,''.'" ._' :.. r ".,"4 ", "- ; ", .",,," ", •. .. "-..., ' " .. .",,- -..--. - . . - -'- .' . -.-.- .- '. .''

* ** .ni e a Dn nmi I ~ l H mH ... . .

The Documentation Proces Section 2

objects, and providing mechanisms for connecting and describing those objects, itbecomes feasible to build tools for documentation manipulation.

The decomposition of documentation should be considered an internalmechanism for the DA; the user does not see his documentation shredded intothousands of objects and thrown together into some incomprehensible structure.Documentation structure is retained using the relationships mechanism describedearlier. One can think of a process where documentation is divided up intoobjects, each object is labelled, and then connections between the objects are

" drawn.

The next step towards elevating the treatment of documentation is thestoring of all documentation objects in a database, separately from all else. Thedatabase provides two functions: it provides a convenient method for storing L

IN. large quantities of documentation objects, and it provides a means for regulatingthe modification of documentation. The means of this regulation is simple: sincedocumentation is no longer treated as just a text file, there is no way of directlygoing into the database to edit the text of a document. The reason for this is topreserve and control the structure of the documentation. If documentation isdecomposed into component objects, it does not make sense to allow unrestrictedediting of documentation, since this might destroy the structure. Instead, the DAwill allow editing of documentation objects in a controlled context, so that struc- .. 'Zture is preserved. Moreover, this level of control allows the status of documenta-tion to be tracked; when documentation is updated, it is easy to determine which

-* part of the documentation was modified.

In addition to the structure inherent in the documentation itself, the DAprovides a way of directly linking documentation to the program code. This isvery different from the traditional practice of in-line comments. In-line corn-ments have no formal standing with respect to most programming tools, whichgenerally discard comments during parsing; even when comments are preserved,there is no formal way of associating comments with a given piece of code. Thisis a task that is very easy for a human user but intractable for a computer. If auser sees a comment next to or on top of a piece of code, the user can generallymake the assumption that this comment refers to the adjacent code. Unfor-tunately, this is an assumption that a computer, unlike a human user, has noway of verifying. Imagine that a comment refers to code which is then changed(or even worse, deleted). What happens to the comment? By providing a formallinkage between code and documentation, the impact on documentation of chang-ing or deleting code can be assessed.

With the same mechanism, documentation that is not normally considered& in-line, such as specification and design documents, can be linked directly to

code. This provides the ability to look at a piece of code and then trace back tothe design or specification; similarly, one can look at the specification and Ir.ethrough to the code which implements part. of that specification.

The result of decomposing documentation into objects and linking them.directly to the EPM results in a plethora of objects and links, possibly to thepoint of unmanageability for the human user. However, this structure is meantonly for the DA; the user interface provides higher level access mechanisms to

-12-

r

~ *.** ** ,~..*.. *'.*-..o" .. * |


desired result. For example, to determine the importance of a particularitem of documentation, the DA might look to see if there is an impor-tance attribute associated with that item; if not, then the DA might try .to see if there is some general rule about the importance for all documen-tation of that type; and if that fails, the DA might try to find similardocumentation objects and see what their importance is. It is possiblethat the inferencing mechanism will fail entirely to come up with theanswer;' in this case, the DA could either make a worst case assumption,or could try another approach.

e automated analysis: Some attributes might be determined by an analyticroutine specifically designed for this purpose. For example, if readabilitywere a document attribute that was needed for some purpose, and adocument did not have that attribute, then a readability metric might beapplied to the document to determine the appropriate value for the attri-bute.

* human analysis: The human user can be called upon to do the job ifnecessary. This should be a last resort, since the user should not have tobother with details that the computer could figure out.

2.2 THE REPRESENTATION OF DOCUMENTATION

One of the primary differences between the DA and current documentation "tools is that the DA treats documentation as a first class citizen. This meansthat documentation is not considered unstructured text, simply appearing in

* manuals or in-line, adjacent to program code. Documentation has both proper-" ties and structure; one should be able to point to a piece of documentation and

say "What kind of documentation is that?" or "To what does this documentatinnrefer?".

The DA will use several techniques to elevate documentation to this level.The above discussion on documentation structure loosely referred to "pieces" of

* documentation. Pieces of documentation are formally called documentationobjects. A documentation object has a name, attributes, and relationships withother documentation objects as well as relationships with parts of the progrm ".7

* code.

The purpose of decomposing documentation into a set of objects is to pro-

vide a handle for the computer-assisted manipulation. While it may be easy Iror'- people to look at documentation and intuit structure and meaning from it, th is ," an intractable problem for the computer. By breaking the documentation itito

t2 I This might happen if t.he answer wt.,4 not. deW rrinalle from thO k nowledge Ie; it, might '. ohappen if the itiferenci ig 'n gine wa., not tising the nevesary straltgiv, to find the' a twer.

-11- -..

':" *"


At this point, there is an interesting observation that can be made aboutthe documentation taxonomy. In a certain sense, names, relationships, and attri-

*-. butes are themselves a form of documentation. One can think of documentationas describing some system, while names, relationships, and attributes are the next

",'. level up, describing the documentation itself.

2.1.5 Clasifying Documentation

The Documentation Taxonomy provides a mechanism for talking aboutdocumentation, but in order to use this mechanism, it is necessary to classifydocumentation with respect to the taxonomy. To classify documentation, it. is

- necessary to decompose the documentation into its component pieces, and thendetermine the name of each piece and its relationship to other pieces. There arebasically three mechanisms the DA will employ for doing this:

definition-time analysis: When documentation is created using the DA, itwill be immediately classified; the decomposition, name, and structure isimplicit in the interface for most documentation (i.e., the documentationis entered in a form that is easily decomposed). This makes classificationprimarily a definition-time operation (as compared to a run-time opera-

-.tion), since the bulk of the work is done when the taxonomy is initiallydefined.

e run-time analysis: If documentation is not created via the DA, it is much" -more difficult to classify. It may be possible to provide automated tools

to segment and classify documentation items. Unfortunately, unless a* "fairly strict set of documentation guidelines have been followed, this

method is both weak and prone to errors. ..*

o human analysis: The alternative to automatic post-analysis is humananalysis. This is the most arduous technique of all, but it is more reliablethan run-time analysis.

Determining attributes calls for slightly different mechanisms because attri-butes are used differently. Since the DA needs the decomposition, names, andrelationships of documentation to construct documentation networks, this infor-mation is necessary just. to get, the documentation into the DA system. Attri-butes, however, are not essential until the associated documentation is actuallymanipulated. When it is necessary to determine attributes, there are several

.techniques (analogouis. but not identical, to the above techniques):r

e automated inference: Attributes can often be logically inferred using ot herinformation in the )A. Inferencing means that by ascertaining a set oflogical premises, one can make certain deductions. InI the simplest case.

L inferencing is simply looking up the fact in the knowledge base. Moreoften, inferencing involves a chain of implications which lead to Ihe

-10-

L:: :': 4.

.8 ,"~~ t %'


programming languages, the semantics of attributes such as type will not be builtinto the interpretatiosi mechanism; instead, there will be support for the declara-tive specification of these properties.

The choice of attributes, and the values which attributes may take on, is;-. partially dependent on the environment. While there are some attributes, such

as type, which are always needed, there are other attributes that may not beneeded. Moreover, even required attributes vary in the values which they canhave; the possible values for the type attribute will vary across environments.

2.1.4 The Documentation Taxonomy

The notions of name, relationship, and attribute provide a means for talk-ing about documentation in general. A Documentation Taxonomy is a set ofnames, relationships, and attributes that can be used to talk about the particu-lars of documentation.

The Documentation Taxonomy provides a basis for reasoning about docu-mentation. The structure provided by the relationships and attributes allows the 1krecording of many kinds of information about documentation. However, thisinformation about documentation need not be complete. By applying inheritancerules to documentation structure, it may be possible to infer facts about docu-mentation objects without having that information explicitly represented. Forexample, to determine how to format a documentation object, the first thing tocheck is to see if there is an attribute for that object which talks about format-ting specifications. If not, then the parent of the object might be checked to see F !if it has any formatting specifications. If the parent doesn't have the attribute,then the search continues up the hierarchy, until an ancestor with the appropri-ate attribute is located.'

One of the more challenging aspects of building a system to manipulatedocumentation is that there are so many ways of using documentation. Notethat in the previous discussion of documentation, there were no absolute state-ments about documentation requirements; for any particular application andenvironment, the particular names, relationships, and attributes will vary. Whilethere may be commonality between different environments, the design of the DArecognizes this as an option rather than a given.

The goal of the documentation taxonomy is to provide a descriptive (rathcrthan prescriptive) framework, allowing the DA to be tailored to particularenvironments. In many cases, taxonomies may be the same for different environ-ments; certainly, parts of taxonomies will always be common to all environments.The process of building up a taxonomy for a new environment might be compar-able to the process of moving an expert system to a different, task in the samedomain.

.I ,.ri. athvr n re dilfi' i. . l ii iquIs for eh~.trmini g d.e vahi e or nissing altri utes; tis, thn , imiist

Iv" rigi's Io delermitill el h liii lt'. call Iw i I hi. heril(d and which ItIil, be l li .enhi ,d ill so.ic.

ol hir waY (si th ws ask i ig I h,' Ii.5Cr).

"-9-

. . .. LiI


example, hierarchical structures are common for textual kinds of documental ion,such as manuals. Unfortunately, most existing systems that provide docuuentstructuring capabilities (such as word processors, mail programs, and documenta-tion browsers) force documentation to fit into one particular type of structure.Because it is based on a more general mechanism for creating different types ofstructures, the DA will have considerably more flexibility.

In the DA, the idea of a relationship is used to capture structural informa-tion. A relationship is a connection among n (usually two) pieces of documenta-tion. Connections themselves have names, which indicate the type of relation-ship. By providing connections with names, a diverse set of structures can besupported. Examples of connection names are: contains ("a procedure header

- contains the name of the procedure"), references ("the Program DescriptionDocument references the Program Design Specification"), depends on ("the Com-puter Program Test Report depends on the Computer Program TestSpecification"), etc.

For example, if a particular procedure header contains information on the

author of that procedure, then the relationship contains will hold between thef procedure header documentation and the author documentation. Relationshipscan be among documentation classes as well as documentation instances, thusproviding a way of specifying that all objects of some class have a certain rela-tionship with all objects of another class (i.e., inheritance). As another example,if all procedure headers contain information pertaining to the date the procedurewas written, then the relationship contains will hold between the classprocedure_header and the class date_written.

2.1.3 Attributes

Attributes of documentation are descriptions or properties that pertain ton the documentation. An attribute refers to a particular piece of documentation.

There are many attributes that one might use to describe documentation, such aspurpose, readability, completeness, correctness, quality, or currency. Attributescan be created and accessed by users as well as by documentation tools.

Attributes are easily represented as property-value pairs. For each piece ofdocumentation, there can be an arbitrary number of property-value pairs. If theDA were posed a question of the form "What is the status of this piece of docu-mentation?", it would attempt to answer this by checking the value of the statusproperty. For example, if a program comment were added to clarify a bug fix,there may be an attribute purpose which has the value "clarifies the bug fix"; if aprogram comment were written by a particular programmer, there may be an

r attribute author which has as its value the name of the programmer.

There are certain attributes that are required for all documentation objects.For example, the type attribute is appropriate for all objects, regardless ofenvironment. As in programming languages, the type attribute is a name whichhas certain semantic associations that characterize the referenced object. Type is

U an important vehicle for talking about documentation. It conveys informationand expectations to anyone who knows the definition of the type. But unlike

• :-.:-8-

II


Rule (2) says that the likelihood that documentation is out of date can bedetermined by the type of the documentation object, the type of the programobject, and the type of the change that was made to the program object. Thefirst two clauses can be determined by lookup; the other clause must be deter-mined by the EPM.

J type of J typtype 1 type ofp probability of

(2) documentation X program X change documentation

object[ object out of date

There are two more clauses from rule (1) that need to be evaluated. Theimportance of a link can be determined by evaluating rule (3), which says that,the importance of a link can be determined by looking at the type of documenta-tion object and the type of program object. For example, if the program objectis a procedure and the documentation object is a procedure header, then theimportance is "high".

-""type of type of {importance

(3) documentation X program - oflink

object object

Finally, the user's preference can be determined by evaluating rule (4),which requires evaluation of the type of documentation and program ohject

.* (already evaluated by the above two rules) and the state of the context model,which can be looked up.

j type of 1 Jtype of) state of)uc(4) documentation X program X context -- r ec

object object model preference

This process of starting from a goal and recursively searching through arule-base in an attempt to find a way to reach the goal is known as goal-directedinferencing or backward chaining. As values are determined for each clause, thevalues are passed hack up, until the goal is reached. Since the proposed evalta-tion scheme for the system is based on a multi-valued logic, answers can be some-where between "yes" and "no". Thus, reaching an answer may require the selec-tion and application of cutoff levels.

-": -22-

L

...............

The User Interface Section 4

4. THE USER INTERFACE

Understanding and manipulating documentation is one half of the DA; theuser interface is the other half. The phrase "Documentation Assistant" is used toconnote a partnership between the computer and the user. The human is anessential part of the documentation process; the purpose of the DA is to help, not p9_replace, the user. To provide this help to the user, the DA must provide an inter-face that is intelligent and user friendly.

The most important aspect of the interface provided by the DA is the- integration of the documentation into the programming environment; in particu-

lar, the DA will be built on the Intelligent Program Editor, and will allow natural paccess to documentation through the editor. There should be no need for theuser to switch context in order to manipulate documentation.

The IPE will provide a window-oriented user interface. It will present adisplay consisting of a series of windows, allowing for the user to know the stateof the system at all times through the visual representation on the screen. Thedocumentation and code under scrutiny can be displayed in a multiple number ofviews. The user can move easily among the different views by moving between(or creating new) windows.

Multiple modes of interaction will be provided by the IPE. User input canoccur through both menu item selection and keyboard input. Each of thesemodes will be available simultaneously; the user can use whichever is more con-venient. The system will be highly customizable, allowing the interface to bemodified to suit the preferences of individual users.

In addition to the PE functions, made available by building th DA on topof the IPE, the user interface will provide a number of documentation-specificfunctions. The rest of this section will describe some of those functions.

4.1 VIEWS

Previous discussion on documentation representation described a way ofsplitting documentation into small chunks or objects. This provides a naturalway for a computer system to manipulate documentation. However, it piovides amost inconvenient model for people, who may prefer to deal with documentationin larger units. The view mechanism of the DA provides a user level mechanismfor handling documentation.

A view is like a template: it provides a frame for displaying a set of docu-mentation objects together (Figure 4. 1).

Views are named entities. For example, to see general information about amodule, a user might ask to see the Module info view for that module (Figure4.2). There may be many views for one particular object; a user might ask to seethe Algorithm Info view for the same module (Figure 4.3).

-23-

LL.

.L ._ .:.. ., , ,. -. . .. . ., _ .. . .-. ., ,:.. . .-.. ...: -. ..-. -.-. : ,. ,u ,, - . .': ' .: .. .... .. . -. .'. • .. .. . . -.- .-

i TOT.


view- name

link-name link-value "-.

link-name line-value

Figure 4-1: The Structure of a View

Module InfoName sortAuthor Joe AdaDate August 1984Revision 2.1

Figure 4-2: The Module Info View

Algorithm InfoName sortKeywords sort, bubble sortDescription bubble sort algorithm using ...Parameters integer arra to be sortedReference Knuth, vol. 3, page xxx

Figure 4-3: The Algorithm Info View

There is no one set of views that will be adequate for all environments,since the specification of views is based on the taxonomy itself. Different librariesof views will need to be provided for each environment; but, since views are pri-marily meant to be a user interface tool, the DA will also permit the creation ofnew views by individual users in order to suit their own needs.

-24-

. . . . . . . . . . . . . . . . . .. .~.., .... ..-

t .. ".. o.


In the above examples, views were used to examine existing documentation.Views can also be used for creating or modifying documentation. To document amodule, a user might ask the DA to create a Module Info view for that module.The system would display the template (with known values, such as date orauthor, automatically filled in); creation of the documentation then becomes theprocess of filling in the template. Such a technique provides a way of structuringthe documentation creation process. If the user wants to set up general moduleinformation, the view makes it clear what should be included in that information.It also provides a way of analyzing documentation: when filling in a template, thedecomposition of the documentation into objects is obvious (since each slot in aview is a separate object).

To go a step further, if the user needs some assistance in writing documen-tation, it would be possible to display an example of how that documentationmight look. This sample could be a fake version, created specially for this pur-pose, or it could be actual documentation for another part of the system. Fordocumentation that is standardized, this information should be readily available,since standards generally give examples or descriptions of how documentationshould look.

4.2 INTERACTION CONTROL

It has been customary for the programmer to have control over the pro-gramming environment; actions are initiated on request only. This mode ofinteraction is known as user initiative. User initiative places the responsibilityfor all actions on the user. In complex environments, there may be too many fac-tors (such as programming standards, guidelines, administrative procedures,module interconnectivity, etc.) for the programmer to deal with in any reasonablefashion. The alternate approach is called system initiative, wherein actions areinitiated by the system without user intervention. System initiative is particu-larly useful in complex environments because computers are good about keepingtrack of large numbers of (sometimes seemingly unimportant) details. On theother hand, system initiative is inadequate for total control because the systemdoesn't always know what the programmer should be doing. The idea of a mixedinitiative approach is a good compromise, where some actions are user initiatedand some are system initiated.

The DA will provide a mixed initiative form of interaction. The user initia-tive part is fairly obvious: the user can ask to create, modify, search, and viewdocumentation. The system initiative aspect is somewhat unique, and providesthe DA with a capability not generally found in programming environments.Based on the model for determining when documentation needs to becreated /updated, and based on the context model of what the programmer isdoing and what state the system is in, the DA can take the initiative to ask theprogrammer to update documentation at an "appropriate" time.

Having the DA prompt the user for documentation is a significant step. Itmeans that remembering to find and update (or create) appropriate documenta-

2 tion is no longer strictly the programmer's responsibility. It means that the pro-grammer does not have to switch contexts in order to update documentation. It

-5-

L , . ."......

,..-.-:


means that the programmer does not even have to be concerned with using thecorrect tools for performing the update. The act of updating documentation isconsidered an integral part of the editing environment; moreover, keeping trackof the state of all the documentation becomes the responsibility of the systemand not the programmer.

p.. 4.3 TRAVERSAL

The term traversal refers to moving through the documentation. There aretwo basic modes of traversal: browsing and navigating. Browsing, or undirectedtraversal, is the process of going through the documentation without looking foranything in particular. For example, if you were given a program you had neverseen before and told to find the bug in the program, the first thing you might dowould be to browse through the documentation just to see what it looks like.Browsing is characterized by wandering or undirected search; if you were to lookover a browser's shoulder, it would be difficult to figure out what he was lookingfor. Browsing is unstructured: the user neither needs nor wants restraints.

On the other hand, navigating, or directed traversal, is a more goal-orientedsearch. For example, if you were given a program you were familiar with andtold to find a bug in the program, you might have a very good idea of where tostart looking. V. .'.

Navigating is a more structured process, and providing support to helpstructure and track the search space is natural. For example, there are two basicstrategies for navigating hierarchies: breadth-first and depth-first.i To providesupport for a user operating in either of these modes, a system can keep track ofwhere the user has been and which place should be visited next. Thus, when alevel is popped, the user will find himself where he left off, without keeping men-tal or written notes. Since it is often the case that a person will use a crossbetween these two strategies, it is important to provide a mechanism that is flexi-ble enough to provide both alternatives.

There is also another technique that is useful during navigation and possi-bly during browsing. This is the idea of guided trails. If you were maintaining aprogram, and knew quite well how that program worked, you might want tosomehow encapsulate that information so that others could learn without repeat-ing all your efforts. Imagine that you were to explain a certain aspect of the pro-gram to someone. You would start out in a certain place, point out relevant ." '".things, move to another place, point out other things, etc. The path followed asyou move through the code is a guided trail; you have figured out what parts ofthe system are necessary to examine in order to understand the operation of someaspect. If there is a way of saving this trail, then it could be "played back" later

(to others, or even to yourself, since you might have forgotten the trail).

I In a study on program comprehension, we found that these straWgies were an imIportant 2..

charm(leri.stic or the process or deimggiftg programs (I)omeshek-841.

-26-

% %n.#!.h

U * * L * -. - . . . .. ---.. . . .. U* ,', ., . - . f * - *- -


4.4 RETRIEVAL

Documentation retrieval is similar in many ways to navigation: searchingthrough the documentation space for particular information. The difference isthat in navigation, each move may be charted on the basis of the previous move;in retrieval, it is known how to get directly to the information.

The standard technique for retrieving documentation (or, for that matter,most any kind of on-line information) is via information/database retrievallanguages. The IPE will provide a retrieval language called the Program Refer-ence Language (PRL); while this language is aimed at retrieving programs, itcould also be used to retrieve other kinds of objects. The PRL may be wellsuited to documentation retrieval because it is designed for retrieval of structuredobjects.

Another technique for retrieving documentation from the IPE is fairly obvi-ous: pointing. If an object is on the screen, retrieval can be done by pointing to Lthe object (with a cursor or a mouse) and asking for all the documentation associ-ated with that object. Retrieval is done simply by traversing the links from theprogram database to the documentation database.

4.5 FORMATTING

While the DA is prnarily concerned with on-line documentation, it needsto be capable of converting documentation into a hardcopy form. Conversion ofdocuments that have a "book-like" form (e.g., manuals, specifications, plans) intodocuments is fairly straightforward; it is only necessary to provide output that iscompatible with a text formatter. On the other hand, conversion of in-line com-ments into hardcopy form requires much more work. Since there may not be any-.-\":*explicit linear structuring in this documentation, the DA will provide a means forselecting and composing these objects.

-27-

..............................

f. . .* . -. ... **.***.*.*. .. ** *.** *.~* *.... ...... -.

I

Feasibility Section 5

5. FEASIBILITY

To assess the feasibility of the DA, we examine here two major issues: feasi-bility of building the system ("implementation") and feasibility of placing the - -

system in use ("deployment").

5.1 IMPLEMENTATION FEASIBILITY

The DA will be rather eclectic: some sections of the DA represent new codethat must be written specially for this application; some sections represent experi-mental code borrowed from other research efforts; and other sections may becommercially available software. These various pieces of software are discussedbelow.

5.1.1 The Intelligent Program Editor

Many of the ideas for the DA have come out of the IPE effort at AI&DS; infact, the DA will actually be built upon/into the IPE system. With thisbootstrapping, much of the start-up effort that would have been renuired for theDA will not be necessary. The IPE will provide part of the database (the EPM)as well as a user interface.

However, it should be noted that the IPE is a research effort. The IPE isstill in the design phase, and it will be some time before any of the IPE systemcould be used to support the DA. Moreover, while one of the goals of the IPEeffort is to produce a runnable prototype, there are no assurances that the IPEeffort will actually produce a system that provides a usable base; as a researcheffort, the goals of the project are to demonstrate concept feasibility, not todeliver usable software.

The DA could be implemented without the Extended Program Model of theIPE. However, the DA would then forfeit the ability to link code directly todocumentation (and hence the ability to automatically detect outdated d)cumen-tation). The DA would still have a structured documentation database (thoughsomewhat less sophisticated), tools for manipulating that database, a policymodel for controlling documentation, and a user preference model for conformingto user desires.

The DA could also be implemented without the IPE user interface. How-ever, unless additional effort were put into the development of a DA user inter-face, much of the interactive, integrated, window-oriented nature planned for theDA would be lost. The DA would still have much of its original functionality; itwould just not be as easy to use.

Thus, the best approach to building the DA would be to utilize as much ofthe IPE as is available. Building on the IPE can only help the I)A; its unavaila-bility would impact only the time necessary to build the DA.

-28-

r

. . . -..

'. . . . . . . .

I ~I *II~I ~ 2-. . -.. .


5.1.2 Documentation Database

The documentation database for the DA will be developed in two steps.The first step is to develop a stand-alone mode: a database for documentationthat provides support for just documentation and is not linked to the ExtendedProgram Model. The second step is to integrate this database into the EPM, sothat documentation and programs can be linked together. *-

Developing a stand-alone documentation database is definitely feasible.The best approach would be to develop a special purpose database in order toaddress intricacies specific to documentation. As a backup approach, an existingdatabase system could be used, and necessary database routines could be added.

Integrating the documentation database into the EPM is more of anunknown. Since the EPM is still under design, determining the ease of adding anew component is difficult. However, the EPM is being designed with the inten-tion of being able to add documentation. Thus, it is likely that adding the docu-mentation database to the EPM will be significantly less work than the develop-ment of the EPM itself.

5.1.3 Detection of Outdated Documentation

One of the most unique features of the DA is its ability to automaticallydetermine when documentation is out of date. There is a great deal of variabilitypossible in the effectiveness of this task. To better describe this process, it is use-ful to borrow a few terms from the field of information retrieval. In the contextof the DA, the term recall refers to the percentage of outdated documents thatwere detected; the term precision refers to the percentage of detected documentsthat were actually outdated documents.

The goal of the DA (and of any retrieval system) is to achieve maximumrecall and maximum precision. Unfortunately, it is often the case that strategiesfor trying to increase one measure results in decreasing the other. Therefore, anystrategy aimed at increasing one measure needs to be carefully monitored todetermine the effect on the other.

The simplest strategy for detecting outdated documentation is to flag docu-mentation as outdated if anything it references is changed. This is likely toresult in high recall 'jut very low precision; along with the documentation thatactually needs changing, the system will flood the user with documentation thatis not outdated. Based on current technology, this is about the best that can bedone. The DA should be able to surpass this, achieving a similar degree of recallbut with a significantly higher degree of precision.

The most prudent approach to increasing precision appears to be the incre-mental addition of knowledge about the documentation taxonomy (provided bythe DA itself) and knowledge about the program semantics (provided through theEPM). The utilization of this knowledge makes it possible to eliminate more

-29-

.............................-

%-%7.m

'.7 --7 •


pieces of documentation from the list of possibly outdated documentation. Thepurpose of incremental addition of knowledge is to add knowledge in smallchunks, making it easier to determine the kinds of knowledge have a negativeside effect (in terms of precision and recall).

By using these techniques, it. is quite likely that the DA can achieve its goalof higher recall and precision. Just what levels can be achieved is an open ques-tion, as is the question of what levels users will find acceptable.

5.1.4 Documentation Retrieval

The documentation retrieval facilities discussed earlier should be relativelyeasy to build. In the case that users find themselves in need of a more generalpurpose retrieval capability (e.g., unformatted text retrieval), it may be possibleto make use of an existing information retrieval system. Commercially availableinformation retrieval systems provide basic capabilities for locating documenta-tion. Better yet, AI&DS has already developed a prototype of an intelligentinformation retrieval system that provides retrieval capabilities beyond that ofany commercially available system [McCune-83]. Incorporating one of theseretrieval systems should be a straightforward effort.

5.1.5 Documentation Formatting and Analysis

The DA effort will not require the development of any new documentationformatting tools; rather, it will rely on existing document/text formatters. Theonly difficult part of document formatting is linearizing the database/network ofdocumentation, coercing it into a form acceptable by the text formatter. Giventhat the documentation database (and the tools for manipulating it) are avail-able, there should be little difficulty in achieving this. Similarly, existing tools forchecking spelling, grammar, diction, readability, etc. could easily be added to thesystem.

5.2 DEPLOYMENT FEASIBILITY

Assuming that the DA can be successfully built, the next question is thefeasibility of moving such a system out of the research environment and into aproduction environment. The following subsections provide some insight intothese issues.

5.2.1 Current Documentation Problems

The need for a tool like the DA is great, especially in government/militaryprogramming environments. Current documentation practices in these environ-ments suffer from many problems. A number of these problems were identified inan earlier study of software maintenance [Dean-83], which found that documenta-tion takes a back seat to software, for a variety of reasons:

-30-

...................................................................

7- V


1. Documentation is not considered an important part of the end product:

" only a portion of the total documentation is specified as adeliverable

" the requirements for deliverable documentation are much lessrigid than the corresponding requirements for the software

" documentation is often not done until after the programming

2. Documentation is not considered an important part of the software lifecycle:

" documentation is alloy ed to lag behind the software

* programmers dislike writing documentation

" documentation writers have inadequate training

" writers are presented with little structure and inadequate guide-

lines

3. Documentation is poorly handled:

" tools for manipulating documentation are inadequate

* documentation is sometimes done off-line

" it is difficult to evaluate the completeness or correctness of docu-mentation

The DA provides a basis for helping reduce these problems. The first problemwould be partially alleviated if contractors used the DA during the programdevelopment process, since their documentation task would be eased, and theircoding environment would be integrated with their documentation environment.The second problem is also partially addressed by the DA; while no tool canmake someone write documentation, a good tool can encourage and assist theprocess. The DA certainly forces the recognition that documentation is a criticalpart of the software life cycle. The last problem is directly addressed by thecapabilities of the DA.

5.2.2 Documentation Life Cycle Support

The DA, as described in this document, represents only one viewpoint of adocumentation system (that of the programmer). For any documentation systemto be truly usable, it must provide facilities for all those involved in the produc-tion and maintenance of documentation, including designers, managers, technicalwriters, typists, and testers. Each of these user categories has its own uniqueneeds. For example, a designer might want support for the development of

-31-

- -- -""-- -


documentation in the form of a program design language; a project managermight want tools to do consistency checking on documentation, to make sure .that the documentation corresponds to the code.

The DA is being designed to accommodate the needs of all these users. Thebasic underlying mechanisms (i.e., database, user interface) will be in place; allthat will be required is the development of additional tools, built on the existingsystem, to support these different needs. Thus, the proposed version of the DArepresents one facet of a documentation support system (see Figure 5.1). It willeventually be capable of supporting the entire documentation life cycle, but thiswill require further development.

5.2.3 Supporting Documentation Standards L_

The DA is designed to support different documentation standards. To givea better picture of how this might be done, this section presents a portion of theproposed SDS military standard [SDS-83]. In the example presented here, justone path of the SDS documentation tree is traversed, and an example of how theDA might represent this information is shown. This example is meant to providea rough sketch; it should not be construed as a complete picture of SDS nor as acomplete picture of documentation representation techniques.

Figure 5.2 serves as a map to this discussion; it shows the path through theSDS documentation that will be traversed in this example. As presented here,the documentation looks hierarchically arranged, but this is a simplification madefor the purpose of presentation. At each level, the item in boldface is the itemthat will be focused on at the next level.

At the top level of the hierarchy (Figure 5.3) are two items, DOD-STD-SDS, the proposed military standard on defense system software development,and a set of Data Item Descriptions (DIDs). The SDS standard references all ofthe DIDs.

The Data Item Descriptions describe the documents that contractors mustdeliver with software. There are approximately 25 DIDs referenced by SDS.Each DID may reference a number of other DIDs. Figure 5.4 shows the overlapbetween DIDs. The figure also distinguishes between several different categoriesof DIDs: requirements/design documents, testing documents, and user documents.

Each DID describes the structure of a document. Figure 5.5 presents thestructure in the Software Test Plan (STP). From Figure 5.4, it can be seen thatthe STP references several other documents. These references are shown in moredetail in Figure 5.5, where the arrows represent specific places in the STP thatreference other documents.

In addition to the document structure, each documentation object has attri-butes associated with it, as discussed earlier. In Figure 5.6, there is an exampleof some of the attributes that might be associated with the Software Test Plan VDID.

-32-


,PROGRAMMING1

DESIGN B TESTING

NTION OCMNAIVMANAGEMENT

CHANGE INTEGRATIONCONTROL

QUAL ITYASSURANCE

Figure 5-1: Documentation Life Cycle Support

-33-


Standard on Defense System Software Development (Figure 5.3)

Data Item Descriptions (Figure 5.4)

Software Test Plan (Figure 5.5)

Formal Test Requirements (Figure 5.6)

Figure 5-2: Overview of SDS Documentation

DOD-STD-SDS

Data Item Descriptions

Figure 5-3: The SDS Standard (top level)

There may be additional substructure to the structure presented in Figure5.5. Figure 5.7 presents an example of how to fill in the section entitled "FormalTest Requirements." Unlike the higher level structures, this structure presentssuggestions to the author of the document, rather than stating necessities.

The DA weaves these levels together into a network that represents therelationships and interconnections. Figure-5.8 is an example of how this networkmight be represented by the DA. Objects on the left hand side of the figurerepresent classes of objects; objects on the right hand side represent instances orsubclasses of these classes. For example, DOD-STD-SDS is an instance of a stan-dard; the Software Test Plan is a subclass of Data Item Descriptions. L

SDS is but one example of what documentation structure might look like;certainly, many other structurings are possible. As indicated by this example,the structuring mechanisms provided by the DA are meant to be quite general inorder to provide support for any type of documentation structure.

-34-

L °

. ..........

COMPUTER SYSTEMSORAORS MANUAL

* SOFTWARE USER'S MANUAL

SOFTWARE REQU IREMENTS SPEC IF ICAT ION-

/ 0 INTERFACE 0,,,,,.,.,,,,,,,, ~REQUIREMENTS

FTWAR TESTSPECIFICATIONSOFTWARE

REPORT PLN 0

___________ /INTERFACE -

____ _ /DESIGN DOCUMENT

SOFT WARE TESTSOFTWARE

PROCEDURE TS

DESCRI PT ION

00SOFTWARE ~ ,'~

01TOP LEVEL /DESIGN SOFTWAREDOUMN 00 - DETAILED -1

/ - DESIGNDOCUMENT 00

tj iire 5-4: 1):i1t ,xI I v m I (sc r1 iti s (p r IM Ia I st.

-35-

APPENDIX A

A Knowledge Base for Supporting an Intelligent Program Editor

References Section 9

9. REFERENCES

[Dean-831 Dean, J. and B. P. McCune, "An Informal Study of Software Mainte-nance Problems," Software Maintenance Workshop, Montery, Call-fornia, December 1983.

[Domeshek-84]Domeshek, E., D. Shapiro, J. Dean, and B. McCune, "An InformalStudy of Program Comprehension," AI&DS TM-1014-3, March1984.

[McCune-831 McCune, B., R. Tong, J. Dean, and D. Shapiro, "RUBRIC: A Sys--tern for Rule-Based Information Retrieval," Proceedings, COMP-SAC '83, IEEE Computer Society, pp. 166-172.

[SDS-83] Proposed Military Standard on Defense System Software Development,DOD-STD-SDS, December 1983.

[Shapiro-841 Shapiro, D., J. Dean, and B. McCune, "A Knowledge Base for Sup-porting an Intelligent Program Editor," Proceedings, 7th Interna-tional Conference on Software Engineering, March 1984, pp. 381-386.

-48-

7-,-"

. . . . . . . . . . . .-. . . .. . . .

Conclusion Section 8

8. CONCLUSION L..

Despite the tremendous need for support of the documentation process,there has been little research in this area aimcd at making significant improve-ments to the current techniques used. As an intelligent tool designed to assistusers in all phases of the documentation life cycle, the Documentation Assistantrepresents a significant step in this direction.

The focus in this report has been on documentation from the programmingviewpoint; this is a logical place to start, given that programmers already usevarious computer-based tools on a regular basis. In addition, current research onprogramming environments is leading towards tools like the Intelligent ProgramEditor (under development at AI&DS), which provide a natural base for buildingdocumentation tools.

However, to reach the goal of providing total documentation life cycle sup-port, it is clearly necessary to address the needs of the many other users of docu-mentation. This report should be viewed as a first step in that direction. TheDA provides a framework for manipulating documentation that should easilyextend to provide this support. To provide support for the entire life cycle, addi-tional tools and techniques must be developed.

To construct the DA, it will be necessary to make use of existing technol-ogy, experimental technology, and brand new technology. The building of a sys-tem like the DA definitely has risks associated with it, especially the parts of thesystem that require new technology to be developed. However, given the mix oftechnologies that will be used by the system, it is likely that a significant portionof a DA prototype can be built with minimal risk.

With the rapidly escalating growth (and cost) of software maintenance,there is a clear need for the development of new tools and techniques to handlethe problems and bottlenecks associated with the software process. Of theseissues, documentation is possibly the biggest and most crucial one. The Docu-mentation Assistant is aimed precisely in this direction.

-47-

A... ...

S...... ...

. . ---------- ~.wI I. l -. - -. ,, ! - .- i . .... =.

Future Research Section 7

documentation that is not in the database. Tools and techniques for aid-ing the conversion of this documentation may eventually need to bedeveloped.

*Programming Context Model: The purpose of the Programming ContextModel is to keep track of what the programmer is doing. However,current techniques for doing this are overly simplistic, and if accuracy is -

required, a good deal of user cooperation is needed. The development ofmore sophisticated techniques for determining what the user is doingwould ease the task of the user, as well as increase the reliability of thiscomponent of the system.

* Improved Detection of Outdated Documentation: By making use of thesemantic information that the EPM will provide, it is possible to increasethe accuracy of determining outdated documentation. There is a greatdeal of information to be gained from the various semantic representa-tions; just how much accuracy can be achieved is an open question.

e Consistency Maintenance: The issue of consistency maintenance is a crit-ical one for both the IPE and the DA. Consistency checking can be doneat many levels. For example, when a documentation object changes, it isnecessary to track down any other dependent documentation objects.However, determining these dependencies can be difficult. Even if thedependent documentation explicitly references the changed documenta-tion, it is unclear if the change actually affects the dependent object; evenworse, the dependent object might reference the changed documentationimplicitly. As the DA grows more sophisticated, it is necessary for theconsistency maintenance mechanisms to follow.

e Knowledge Acquisition: A great deal of information/knowledge is neces-sary for the optimal functioning of the DA. To port the DA to differentenvironments (or to modify what is known about current environments)old knowledge must be modified and new knowledge must be added.Tools for aiding this knowledge acquisition process are essential, espe-cially if the task is to be performed by someone other than the systemdevelopers. For example, in the case of rule-based systems, knowledgeacquisition tools include structured rule editors (to insure that only syn-tactically correct rules are entered), rule evaluators (for determining theeffect of rule sets), rule analyzers (for determining properties such as sen-sitivity, connectivity, consistency), etc. Similar tools will be needed forthe DA.

-46-

....................................................................................................... %

-TV


PROGRAMMER

V.-.

INTELLIGENT

PROGRAMMING

ENVIRONMENT

DOCUMENTATION

PROGRAM DEBUGGER *ASSISTANT I

EDITOR I

-45-EXENEDPRGAMDAABS

Fgr7-:AArhtcueFrAv nce rgamn Enironmen-s

-4.5-f:

%7. -51.


7. FUTURE RESEARCH

The plan for building a prototype of the DA necessarily omits many of theissues and areas requiring significant research efforts. Some of the research issuesthat might be addressed in the future are discussed below.

7.1 FUTURE RESEARCH ON PROGRAMMING ENVIRONMENTS

The IPE and the DA are two important components of an advanced pro-gramming environment currently being developed at AI&DS. The architecturalbasis for this environment is the intelligent program/documentation databasecapability provided by the Extended Program Model, which will provide func-tionality usable by a variety of tools. While the combined IPE/DA system wouldprovide a great boost in capability over existing programming environments,there is still at least one important aspect of programming environments that ourresearch does not currently address.

To increase the usefulness of the IPE/DA, the next step would be to incor-porate support for the dynamic aspects of programming (i.e., tools to supportprogram execution). These tools would also be based on the EPM (Figure 7.1),and thus would benefit by having access to the multiple representations providedby the EPM, including the linkages connecting various program segments anddocumentation. With access to this additional information, these tools could pro-vide capabilities beyond those provided by current runtime support environ-mentd . .:ii

7.2 FUTURE RESEARCH ON DOCUMENTATION

With respect to documentation, there are a number of directions that wouldbe logical for further advanced study. A brief discussion of these issues follows.

*Documentation Life Cycle Support: Support for all people involved withthe production and maintenance of documentation must be provided.This includes support tools for different documentation types (e.g.,requirements, specification), document entry (e.g., checking spelling,grammar, style), document monitoring and control, and consistencychecking. The database provided by the DA will be an ideal base for thedevelopment of these tools, since it provides documentation already in ahighly structured form suitable for machine analysis.

9 Retrofitting to Existing Systems: The discussion so far has assumed thatall documentation is already in the documentation database. For newsystems that are built using the IPE/DA, this might be a reasonableassumption, since documentation will be entered into the database assoon as it is entered into the computer. However, for programs that havebeen developed without the aid of these tools, there will be -:

-44- .'

.. ,. ... .'. .. - . - .. .. • ... F

... ,....-...... ............................................... . ...

Work Plan Section 6

moderate, primarily because the IPE must be modified to accommodate docu-mentation. However, the highest risk is assuming the availability of a workableversion of the IPE. Even if a prototype of the IPE is available, it is unclear howeasy it will be to modify the system.- .

Year 5 (1989):

The fifth year of the project will result in a more fully integrated version ofthe IPE/DA, allowing a user to easily move between programs and documenta-tion. Advanced capabilities will begin to be developed, including the ability tomodel user preferences about various states of the system, as well as the abilityto automatically detect outdated documentation.

As in the previous year, if the IPE system is not sufficiently developed,integration of the DA into the IPE will be hindered. The task of providingadvanced capabilities for user modelling and outdated documentation detectionshould be considered research topics, and thus may be fairly high risk.

The integrated version of the DA will be demonstrated at the end of thisyear. An evaluation of the system will also be performed, by applying the DA tosome subset of the code and documentation for a real software system (to beselected at some future time).

-43,'..U :

-43-..

*..* ~ * * * ~ *1U Ug*~**. .*b -... *.* . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . .- U.- .' * -~ %.,'- % * %' U~.% % .% %' ~ ** ***. * * *

Work Plan Section 6

6.2 TASK DESCRIPTIONS

The following paragraphs describe the tasks in greater detail and describethe risk associated with each set of tasks.

Year 1 (1985):

The major focus of the first year of effort will be the design of the DA sys-tem. A large part of the effort will be to examine currently existing databases -"and database tools to discover current technology that could be used in the sys-tem itself or as an aid during system development. During this year, the internalrepresentation for the documentation will be developed. The design of the userinterface will be undertaken and will make use of the Rapid Interface Prototypersystem that is currently under development at AI&DS. This system allows forthe user interface designer to study different styles of user interface display andactions before actually committing the ideas to code. "'

Years 2 & 3 (1986/1987):

The second and third years of effort will be concerned with the constructionof the actual database, making use of existing tools and technology whenever pos-sible. The database manipulation functions will be developed during this time.To enable incorporating existing documentation into the DA, methods will bedeveloped to allow the user to interact with the system to add preexisting text tothe database. Finally, a primitive "word processing" mode will be added to theuser interface to ease the process of entering larger amounts of text into the sys-tern.

Depending on the availability of existing software, the risk involved in theconstruction of the documentation database may vary; it would range from lowrisk (in the case that an existing database system could be used) to moderate risk(in the case that the database must be designed from scratch).

This initial version of the DA will be used for internal experimentation. Atthe end of this period, a demonstration of this system will be provided.Year 4 (1988):

Much of the fourth year of effort will be spent in the integration of the DAinto the IPE. As a major step in this task, the ability to associate documentationwith code will be added. This will allow the user to point at a piece of code,request to see the documentation associated with that code, and have the textautomatically displayed. In a similar fashion, it will be easy to add code-leveldocumentation to the database at the same time the code is being written. Thisyear will also include work on providing documentation formatting functions thatwill allow hardcopy documents to be produced from the documentation database.

For tasks not involving the IPE, the risk for this year is low; it mainlyinvolves providing different ways of manipulating documentation. Since a largeportion of the effort for this year will build on the IPE, the risk is compounded.Assuming that the IPE prototype is available, the risk for the remaining tasks is

-42-

Work Plan Section 6

6. WORK PLAN L

This section presents a plan for the design and implementation of anexploratory development prototype of the DA. This prototype will be able tomake use of the technology and tools being developed as part of the IntelligentProgram Editor Project, another Navy research project currently in progress atAI&DS.

6.1 TASK SUMMARY

The tasks involved in building a research prototype of the DA are summar-ized below on a yearly basis.

Year 1 (1985) [estimated effort: 0.8 person/years]

e Develop an overall system design* Design the documentation represeptation formalism* Design and prototype user interface

Years 2 & 3 (1986/1987) [estimated effort: 1.0 person/years per year]

* Construct database* Design and implement database functions* Implement a "word processing" mode for documentation* Demonstrate initial prototype (end of 1987)

Year 4 (1988) [estimated effort: 1.5 person/years]

* Design needed functionality to integrate into IPE• Associate documentation with codee Provide documentation formatting capability

Year 5 (1989) [estimated effort: 2-5 person/years]

* Complete integration into IPE, with ability to document different views* Basic method for automatically detecting outdated documentationo Basic user modeling capabilityo Demonstrate and evaluate integrated prototype (end of 1989) r

-41-

. . **.. **'P** € . o

.. ... . ... . . -._ ,. -, -,. -. .

mr- ~~ ~ V-V: -_


It will also be important for knowledge-based tools to provide functionalityeven in applications where only a minimal amount of knowledge has been col-lected. Of course, there is a real tradeoff here: as knowledge is reduced, the abil-ity to act intelligently is also reduced. It makes little sense to employ intelligenttools without any of the knowledge needed to act intelligently (one might saythat this is a non-intelligent application of technology). While the DA is depen-dent on a documentation knowledge base, it does provide a number of usefulcapabilities requiring only a small amount of knowledge engineering. Thesefeatures include the documentation database, documentation interconnection(though all interconnections must be explicitly specified by the user), and track-ing (though the user must manually enter the appropriate information).

These tradeoffs can be described as striking a balance between the workrequired by the user and the work required by the system. As the knowledgebase grows (along with the ability of the system to handle that knowledge base),the amount of work required by the user decreases. Cutting back on theknowledge base does not eliminate the possibility of using an intelligent system;it just shifts the burden for much of the legwork from the computer to the user.

The issue of who performs knowledge engineering (and its subsequentmaintenance) is one that the research community has still not adequatelyanswered. To date, the people who have done knowledge engineering have beenAl researchers (or people trained by them). The issue is particularly important inproduction environments, where continual environmental changes necessitatecorresponding changes to appropriate knowledge bases. For intelligent systems toachieve wider usage, it is necessary to develop tools and techniques for allowingpeople who are not AI experts to perform knowledge engineering and mainte-nance tasks.

5.3 FEASIBILITY SUMMARY

The more "intelligent" aspects of the DA will be based primarily onresearch already in progress at AI&DS (as part of the ONR-funded IntelligentProgram Editor project). Existing software tools that provide other capabilities(e.g., text formatting, information retrieval) could be incorporated into the DAsystem.

Many current documentation problems faced in real programming environ-ments are addressed, either directly or indirectly, by the DA. The primary costassociated with using the DA (apart from development costs) is the initial collec-tion and codification of the documentation knowledge base. This is a one-timecost for each programming project/environment; however, much of this informa-tion should be reusable, especially between similar environments.

- 40-

-40-


*. 5.2.4 Knowledge Acquisition and Maintenance

The difference between the current generation of programming systems andfuture generations (e.g., the so-called Fifth Generation systems) is knowledge.Current generation systems have very little knowledge about how they are beingused and about the semantics of their intended application. For example, pro-grammers today usually use text editors to edit programs. There is a great dealof syntactic and semantic information that a program editor could use; ignoringthis information reduces the capabilities of the system and fails to reflect the con- -ceptual levels at which programmers work. Yet this increased capability is pre-cisely what is required to increase the usefulness and productivity of computersystems.

There is, however, a price to be paid for making systems more intelligent,

and that is the cost of knowledge. Knowledge is a rather expensive commodity. ..

It is costly to hire an expert to help apply knowledge; it is costly to codifyknowledge; it will be costly to take that knowledge and incorporate it intosoftware that exhibits intelligence.

-'. The artificial intelligence community is quite aware of the value (and cost)of knowledge. They have coined the term knowledge engineering to describe theprocess of acquiring and codifying knowledge. Most Al systems are restricted tonarrow domains in order to reduce the amount of knowledge necessary, thusU reducing acquisition costs as well as reducing the size of the search space.

Despite the cost of knowledge, intelligent systems can indeed be worth theexpense. The gains in productivity and effectiveness provided by intelligenceshould more than compensate for the increased costs. However, it should berecognized from the beginning that it will cost more to develop intelligent sys-

w tems; the payoffs come later as systems are put to actual use, and the intelligentsystems prove more economical than their "dumb" counterparts.

There are two basic approaches to reducing the costs and risks associatedwith intelligent systems development. First, as knowledge based systems becomemore common, it may be possible to reuse existing knowledge, at a much lowercost than redoing the entire knowledge engineering process. It will also be possi-

- ble to share knowledge among different instances of a particular system. Second,*: as a means for evolving gradually towards more intelligent systems, intelligent

tools can be usable even with minimal knowledge bases. By reducing the initialknowledge engineering effort, costs can be controlled until the concepts have beenproven.

The knowledge in the DA takes many forms; for example, there isknowledge about the structure of documentation, knowledge about the documen-tation process, knowledge about what the user is doing, knowledge about policiesand user preferences, and knowledge about interacting with the user. Some ofthis knowledge is environment or site specific, and some of it will be valid formany sites. The cost of reusing knowledge will be less than the cost of regenerat-ing it (just as the cost of reusing software is less than the cost of rewriting it).

-39-

L"- A._Z


STANDARD

Iliiisupercedes: MIL-STD-167S

DATA ITEM DESCRIPTIONHAS-A

IS-ASOFTWARE TEST PLAN DID

t number: DI-T-X116

abbreviation: STP

supercedes: DI-T-2142

phase: testing

HAS-A

TABLE OF CONTENTS

*IS-A 'SOFTWdARE TEST PLAN TOC

sections: 1

SECTION

IS-A FORMAL TEST RQIEET

L Figure 5-8: Representation of the SDS Documentation Hierarchy

-38-


Software Test Planattribute value

Number DI-T-X 116Abbreviation STPPredecessor DI-T-2142

Agency NavyPhase testing

Scope single CSCILength 13 pagesDate 5 December 1983

*Figure 5-6: Attributes of the Software Test Plan DID

All formal tests shall include the following test requirements:

* The size and execution shall be measured.

- .* Nominal, maximum, and erroneous input and output values.

* Error detection and proper error recovery, including appropriate errormessages.

Formal Tests for radar tracking requirements shall include the following testrequirements:

o Simulated test data on all possible combinations of environmental con-ditions.

e Input data taken from the environment ("live data").

I [Figure 5-7: Example of Formal Test Requirements

•., ..'. -37-

LI-.

• ". 4."


Limitations/TraceabilityLimitations

__ oftware Rtequirements SperifiationTraceability Itrar1rurmnsS~fet~Informal Test Plans

Unit TestingUnit Test RequirementsUnit Test ManagementUnit Test Schedule

CSC Integration TestingCSC Integration Test RequirementsCSC Integration Test ManagementCSC Integration Test ClassesCSC Integration Tests

CSC Integration Test Table -. Software Requirements SpecificationCSC Integration Test Schedule

Resources RequiredFacilitiesHardwareIn terfaci ng/Su pport SoftwareSource

Formal Test PlansFormal Test RequirementsFormal Test ManagementFormal Test ClassesFormal Tests

Data RQetion and AnalysisFormal Test Table . Software Requirements Specification

is Formal Test ScheduleFormal Test Reports -~Software Test ReportResources Required

Facilities -.

HardwareInterfacing/Sup port SoftwareSource

NotesAppendix

Figure 5-5:- Software Test Plan DID

-36-

I~ii i ~t~'r i~ , ) i ;inI ~ . I t i I. I ~ I k 8."',~~~ ~~~ ;I~tl .tn, ldt 1~ ,ti~. .1 1 r 1 19 /1

I of 6

A KNOWLEDGE RASE FOR SUPPORTING AN INTELLIGENT PROGRAM EDITOR

Daniel G. Shapiro

Jeffrey S. Dean

Brian P. McCune

Advanced Information & Decision Systems

201 San Antonio CircleMountain View, CA 94040

ABSTRACT context of a program search.

This paper presents work in progress towards a 2. H)TIV£TOIprogram development and maintenance aid called theIntelligent Program .ditor (IZl), which appliesartificial intelligence techniques to the task of An intelligent editing system is a sophisti-

manipulating and analyzing programs. The IPE is a cated tool for developing and maintaining programs.

knowledge based tool: it gains its power by expli- The goal, insofar as it is possible, is to decreasecitly representing textual, syntactic, and many of the amount of information a programmer needs to

the semantic (meaning related) and pragmatic supply in order to create and maintain a program.(application oriented) structures in programs. To and to simultaneously increase the reliability of

demonstrate this approach, we implement a subset of the resulting code. This can be accomplished bythis knowledge base, and a search mechanism called incorporating knowledge about the structure and

the Program Reference Language (PIL), which is able intention of programs into the editing tools used

to locate portion@ o programs based on a descrip- to develop and maintain them. Perhaps the best way

tion provided by a user. to illustrate this approach is to present anallegory having to do with the production of atechnical manuscript.

This research was supported by the Air Force Officeof Scientific eearch under contract F420--C- a manuscript which needs06 SitOfice a Research under contract to be typed for publication. If it is given to a~", .-.. 0067, the Office of Naval Research under contra ct tysthooenoseaEgisherut ud

'00014-82-C-0119, and Rose Air Development Center typist who does not speak English. the result wouldunder contract F30602-80-C-0176. be, at best, a word-for-word copy of the originalu o manuscript. If it is given to an English-speaking

typist, simple errors, such as misspellings and1. INTODCIUON punctuation problems, might be fixed during the

typing process. If the manuscript is given to anEnglish teacher moonlighting as a typist, the

The effort and expense involved in software result might well be a version in which the prosemaintenance have been recognized as a major limits- is smoothed and otherwise improved. Finally, iftion on the capabilities of current software sys- one is lucky enough to find a typist familiar withtems. In a study on software maintenance issues in the domain of discourse (such as the author), thethe Air Force. we found that the process of resulting document might even have factual errorscomprehending the form and function of existing corrected and incomplete thoughts identified.software (i.e., what it does and how it does it) isthe largest task in the maintenance process (21. A programmer selecting an editor system for

writing code is in a similar situation. A standardThe basic cause of this "comprehension prob- text editor is comparable to the non-English-

lea" is the loss of knowledge during the program- speaking typist; text appears exactly as it is

r ming process, caused by factors such as poorly typed, with no enhancements. The English-speakingwritten software, inadequate documentation, pro- typist could be compared to a syntax-oriented edi-grammar forgetfulness, and personnel turnover. To tor, which can eliminate syntactic program errorsaddress these issues, we have started a project to and misspelled keywords. The Englishdevelop intelligent, knowledge-based programing teacher/typist knows about the language itself butaids, designed to help the programmer overcome Iim- not about the content of the thoughts. This situs-itations of more traditional tools. This paper tion is comparable to a programming language-describes the initial phase of one of these tools, specific editor which applies knowledge about thean editor known as the Intelligent PLgja Editor donain of programming; this editor can instantiate(lPE). The following sections discuss the motive- general programming techniques, catch certain typestjon behind intelligent editing, the design of an of semantic errors, asks style suggestions, andintelligent editor, a database for the editor, and improve the overall flow of the program. Thea scenario demonstrating an actual implementation technical typist who understands the content ofof a portion of the lIPE' database, used in the what is being said is analogous to an editor -hat

%......................................... . .. .. "'' . .

2 of 6

utilizes knowledge about the applhcation domain; it 4. 115 EMPlDXD PROGRAM MODEL

can help in algorithm development and can catchcertain types of pragmatic errors which are depen- T t o dl )edent upon the specific application domain. The Extended Program Model CEPM) provides a ..~dnew way of representing and accessing programs by

3. ii zIrICJLLIGIa fGIUI EDzTOR defining a vocabulary for discussing programs whichuses terms that are much closer to the ones whichusers naturally employ. The EPH provides this

The Intelligent Program Editor (IPE) described capability through the use of a database thatin this paper most closely corresponds to the represents the structure of programs ram a variety.English teacher/typist mentioned above, in that it of vews. . The EPK can form the backbone for a

will support textual and syntactic manipulations, number of systems which exhibit a deep understand-

and have the ability to assist in the implementa- ing of the organizational structure and meaning of f.tion of typical programing actions. This power is code.obtained through the use of a database that expli- T i o c t f ocitly represents the functional organization of The EPM is constructed in terms of two major

programs in terms of textual, syntactic, and subsystems (see Figure 1) : a program structuresintention-oriented structures. With this database, database and a search and update component calledthe IPE is in a position to become more of a pro- the Program Reference Language, which providesgramming environment than solely an editing tool. access to the database. In addition, the EPM willIn this vein, we are interested in addressing the contain a library of "rational form" constraints ,_folloving design goals 15). that will monitor program composition for its

structure and intentional content. As a whole, theThe lPE should provide a means for naturally system can be thought of as a database management

incorporating documentation into the program system for creating and maintaining code. It pro-development process. In our view, this requires vides a search language for accessing itsthe ability to link documentation into the organi- knowledge, a facility for performing updates, aszational structure of a oroeram (similar to well as a set of semantic integrity and consistencyhelson's (31 concept of hypertext), and the ability constraints for monitoring the validity of the datato actively use any information that is supplied it contains.(to provide programmers with a motivation forincluding descriptive data). In the IPE, documen- EPhItation will provide input to a program search *'.

facility.

The system should support incremental program-analysis. The object here is to employ the SEARCH Isystem's understanding of program structure to (PR MANIPULATION-catch syntactic and certain semantic errors prior L".'"to execution. Examples include identifying vari-ables that are accessed before being set (via dataflow analysis) and detecting programming clichesthat have been incompletely implemented. There isalso a role for error prevention: some editors PROGRAM STRUCTURES

* (e.g., [61) prevent syntactic errors from ever DATA BASEoccurring.

The IPE will allow the user to employ alter-

nate program visualizations. This means allowingthe programmer to examine or modify code throughany of the representations mentioned above. For 9 CONSISTENCY CONSTRAINTS

example, a syntax based approach might be appropri-ate during program construction, while a graphical

" data flow display may be useful within the debug-ging process.

All of these capabilities require the use of Figure I. The Extended Program Modelmultiple program representations, as well asmechanisms for searching and manipulating the 4.1 TUE PROGRAM ST&UCTURES r&TA BASEinformation they contain. Therefore, in the firstphase of the IPE project, we constructed a proto- The EPM's program structures database is con-type version of this program database, called the structed in terms of a collection of represents-E. xtended Praa Model (EPM), and demonstrated it tions which reflect the transition from a syntactic

in the context of program search. The remainder of to a more intention-oriented analysis of code (Fig-this paper discusses the EPH and the search example ure 2). We are considering these viewpoints to bethat was produced. abstract data types which facilitate different

sorts of retrieval operations.

L

.".

•.. ... .. ..... .. ;,,-.. .. '._...;.,.,.,.,-.., ........ ,.......,. -.- , ;-....

3 of 6

"- uKRIATlN defined a library of such TMe Mi1 (h. uses thetern cliche; in this paper. we use both terms

IsI(TIONAL AWGATES interchangeably).

The remaining databases (intentional aggre-

o P PAIERNS gates and documentation) provide methods for asso-ciating the intentions behind a program withspecific features of code. They capture pragmaticknowledge relating to the domain of application of

* .- ARSE the program. Intentional aggregates are a type of"' -formal documentation that allow the association of

larger program fragments with key concepts (sup-CplOI.M ATAFtWd plied by the user). They can be used to collect a

set of T??s and other program segments that imple-ment a single conceptual function; for example, a

S"T" collection of TPPs representing queue operationsmight be grouped (by the user) into an intentionalaggregate representing a scheduler.

The documentation database allows the user tofigure 2. Representation Levels in the £PH associate comments with any of the program features

already described. At the lowest (i.e., textual)level, this would take the form of in-line con-

The textual representation gives the EPH the meats. At other representational levals, the userview that most text editors provide. It is a low- could, for example, document the data flow in alevel approach, concerned with words and delim- particular module (saying why an input-output rela-

- iters. but it allows for important textual search tionship occurs), justify his use of particularoperations. TPPs, or explain why particular syntactic features

are employed. The advantage of this technique overThe syntactic viewpoint embodies the rules of current documentation practice is the ability to

grammar for particular programming languages. The make a direct association (via links maintained bysyntactic database provides the EKP with a vocabu- the IPZ) between the documentation and what itlary for programming constructs such as "for" talks about, at an appropriate conceptual level.loops, parameters, and procedures.

The next level of representation is the flow 4.2 KNOWLEDI ACQUISITIONlevel, which provides standard data and controlflow information. It provides a vocabulary relat-ing to the logical structure of programs. Since the EPM's database is intended to sup-

port an actual editing system in the near future,The segmented parse representation defines a it is important to address the question of where

vocabulary for a program in terms of its component its information is obtained. In our approach, thedata and control flow. For example, iterations are different knowledge sources are acquired in partdecomposed into a set of roles which identify the from the user, and in part by automatic means.subfunctions of a loop. In the breakdown we are Specifically, the syntactic representation can beusing, loops contain generators, filters, termina- obtained directly from the textual representation,tors, and augmentations [7). Generators are sag- and the segmented parse viewpoint can be con-seats which produce a sequence of values. They can atructed through data flow analysis techniques of,be further refined into initializations and a body, the kind developed by Waters 171.which is the portion that is executed many times.Filters restrict that sequence of values. A termi- The TPP structures are harder to obtain.nator is like a filter, except that it has the Recent research efforts indicate that generaladditional potential to stop execution of the loop. recognition of cliches may be possible Ill, but atAn augmentation consumes values and produces the current time, these techniques have not actu-results. There are other vocabulary elements for ally been demonstrated. The EPH will use manualdescribing straight line code. recognition techniques (at least until automatic

recognition techniques have been refined). ThereThe taxonomy discussed up to this point pri- are two manual recognition techniques planned for

marily captures information about the form of pro- the system. In the first, the user points to a[ grams (as opposed to their meaning). The only piece of code and identifies it as being a psrticu-

semantic elements we have introduced describe the La TP (as a way of documenting the system); atsubstructure of built-in entities such as loops, this point, once the scope has been narrowed down,The next (more abstract) viewpoint considers pro- it may be possible to identify the subcomponents ofgrams to be built of objects with stereotyped put- these programing cliches automatically. In theposes. These are called typical programing pat- second method, the user uses TPPs for program gen-terns (TFPs). Examples of TMPs include variable erstion (as in [81); by instantiating s TPP andinterchanges, list insertions, and hash table "filling in the blanks," the EPH can acquire allabstractions. These sbstractions are the tools the necessary information.employed by every expert programmer. Rich has

L 2%

* .*-... '.C''''*

4 of 6

The intentiomal aggregate and documentation representations do not necessarily have a one-to-views must be wholly obtained from the user. At a one correspondence. The information in each data-minimum, the EZP's planned consistency mechanisms base is either automatically derived, or can beviii identify any of this information that may be reasonably obtained from the user. In situationsout of date due to modifications to the code. where the latter is necessary, we have assumed that

information may be provided in an incomplete form.

5. 1E p"CCJA 3l3IUCX ICUIGE 5.1-=Z ?5.1 oDDE PAUMTING-",

In order to demonstrate the feasibility of the From a computational point of view, the manEPK, we implemented a portion of the database problem involved with this multiple representationdescribed above, and built a version of the EP's approach is to define a mechanism that is able tosearch facility, the Program Reference Language compare information obtained from the different(PL) which operates on that data. The PML is a sources of knowledge. The PRL accomplishes thistool for locating regions of program text based via the code region abstraction, which functions asupon a description provided by the user. As a sup- a common language that each of the representationsport systm. it provides programmers with an can use to communicate.intention-oriented vocabulary for specifying por-

" tions of programs in situations where they may be Code regions support two different approachesunfamiliar with the detailed structure of the code. to search. In the first method, which we callThis might occur in the process of editing programs sequential filtering, the user makes a gross stab 8which may be too large to remember explicitly, or at selecting a code region by generating all of the

" in the act of understanding code which has rarely elements which satisfy some fairly general condi-been seen before (as is often the case in mainte- tion. lie then sequentially restricts that set bynance). applying more and more conditions. For example. to

find "the loop which computes the sum of the testThe PRL demonstration system allows program scores', he locates the set of all loops, and then

* search based on four of the representations restricts it to the ones which involve test scores

described above, namely the textual, syntactic, and summations.segmented parse and typical programming pattern'views (Figure 3). These databases are connected In the second approach, the user identifies a

through a ode reion abstraction that associates collection of items, possibly from several dif- 'program features with physical sections of program ferent databases, and intersects them together totext. find the elements which satisfy all of the condi-

tions he wants to impose. In this "code painting"approach, the PRL combines these items ess.otiallyby overlaying the corresponding regions o! code.For example, locating "the loops which c. npute

n -sums" is done (figuratively) by coloring all loopsred and all placea that compute sums yellow. Anyregion which comes up orange has all of the proper-

ties that were desired. .

Code painting is a deliberately coarse affair.CM 410 It is designed to exploit the kind of incomplete or

even slightly inaccurate information which the EPM% will contain, given that much of the data is pro-

vided by the user. In some cases, code paintiagmay not identify the exact section of the program

"FIA which the user desired, but in the context of an155-1 ,useuIN interactive system with a screen oriented display,

£mI aL nt "close" will be good enough. To help the user seethe effects of code painting, it is possible tohighlight the identified section(s).

Figure 3. The Program ReferenceLanguage Implementation AS"A" " T '5.2 A SCEK.2J UI IG THE P&L-"-"

The PRL has a flat information struLture. It- represents each database in terms of a complex tree The following example shows how the PRL uses

12 or graph structure of frames. Although the system the code painting paradigm to answer the question

can arbitrarily convert between viewpoints by using "find the initializations of the loop which com-

code regions as an intermediary, the databases have putes the sum of the test scores, given the Ada .-.

no direct links between one another. These conver- program shown in Figure 4. %

sions are inherently heuristic since the separate %

L%

-. ,...-.,

.':.. ,.. ..,:.:. ,..,.,.

5 of 6

fat MAXIZE in 1.. 10 loopTOTAL:= ARRAYSIM (TEST-SCORES. &XSIZE);

put (TOTAL);end loop;

function ARRAYSUH (A: in ARRAY; H: in INTEGER) return INTEGER isbegin

SUN: REAL:- 0;for I in I..N loop ,

SUm:- SUM * A(l); ,end loop;return SUM;

end ARRAYSUN;

Figure 4. The Ada Program Used in the Scenario.

In this example. the user starts by identify- views this region from the segmented parse perspec-Lag three sets of data, corresponding to the summa- tive (where initializations are represented expli-tion TPPs, syntactic loops, and segmented parse citly), and scans it for segments of the appropri-frames involving the test score array. ate type. This is a filtering operation, in which

the user applies restrictions to a previously iden-

tified set of objects.> (index "sum-ation tpp-database)

-T IPPsetI> (Filter (Segs-Vithin CODE-REIO3),

> (index 'loop s yntax-database) '(Seg-Type "initialization"))- LOOPsetl:(length 21 - S]set2:Iletgth 21

> (index 'TEST-SCORES segp;-database)-> SBsetl:tlength 61 The P31. converts CODE-RE;IONI to a set of sag-

seated parse frames (a heuristic process), and thefunction Segs-Within enumerates the subsegments it

The program only contains one TPP, but there contains. The system identifies two initializa-are two loops, and several segments which relate to tions as a result. The user prints them by con-the variable TEST-SCORES. It is important to verting them to the textual view.notice that all of these segments use the data con-tained in the variable TEST-SCORES but do not

,- necessarily refer to it by that name (for example, > (showl SgsetZ) '-.', the literal A(1)" in the ARIAYSUM function o) for I in *al..N** loop

accesses the test score array). This association > **SUM: REAL:- 0;**is apparent from the data flow analysis within thesegmented parse.

The answers correspond to the initializationsThe user intersects these descriptions by of the iteration variable "I", and the accumulation

invoking the code painting paradigm. The code- variable, "SUM". Note that the P1l. retrieves the

painting algorithm returns the largest region of second initialization, even though it is lexicallytext which can be described in all three ways. outside of the sumation loop itself. It is iden-

tified from the segmented parse analysis, which> (overlay-code-regions TPPsetl LOOPsetl SIGsetl) associates a loop and its initializations no mattur

- CODE-REGIO11 how far apart they might have been in the original*efor I in 1..3 loop code.

SUM:- SUM + AM;and loop;**

6. CURRENT STATUS AND FUTURE WORKIn order to compute this information. the

overlay function automatically converts the inputsets into their corresponding regions of code. AI&DS is now developing a prototype version of ,Most of these translations are automatically avail- the IPE (in a three year. 2-3 person effort), whichable (though heuristic in nature). In the case of is intended to demonstrate the efficacy of our

the TPP, the user had to define that mapping at knowledge based approach to the design of program- "

some time. sing support tools. The prototype will embody a p

portion of sll of the facilities that have been

At this point, the user has identified a loop described. The IE is currently targeted for theAds language. It will initially run on a Symbolicswhict computes the su of the test scores. Inorder to find the initializations of this code, he 3600, a fast, personal LISP computer that features

iiil e

6 of 6

a bigb-resolutio bit-sap display, but it is beingdesigned to be portable to other systeme (in par-ticular. Unix).

We expect to augment the RPM's database toinclude more pragatic inforuation (e.g.. the rela-tion between requirments and program structures),and we intend to extend the PRL to the point whereit will be able to automatically plan and carry out

-" search requests of the kind demonstrated in this" paper (based on a single user query). When these

extensions are complete, the PRL will define a moreformal reference language.

" The task of building a prototype for the IPE*-. involves a number of issues including the incromen-

tal modification of databases, and the recognitionof user intentions in code. In order to solvethese problems in the context of our appliedresearch, we expect to rely heavily on methods foreliciting information from the user, and to focuson tmplste-oriented techniques for manipulating

*. programs.

Acknoledgments

rWe would like to thank Michael clzustowicz and UricDomaeshek for their contributions to this project.

-.

I. Brotsky, D., Naster's Thesis, MIT, forthcom-ing.

2. Dean, Jeffrey B., and Brian P. McCune,. "Advanced Tools for Software Maintenance".

A&DS T 3006-1, October 1982. %• •

p,. ,

3. Loa., T., "A New Rome for the Mind." ALL"&-Sal Lion. March 1962.

4. Rich, Charles. "Inspection Methods in Program-ming", Ai-TR-604, Artificial intelligenceLaboratory. MIT, 1981.

S. Shapiro. Daniel G., Brian P. McCune, andGerald A. Wilson. "Design of an Intelligent F:Program U4itor". A1836 TI 3023-1. September1982.

6. Teitelbaum. T., T. Reps, and S. Rorvitz. "rheWhy and Wherefore of the Cornell Program Syn-thesizer", Puoceedings &M SIGL.&II OAConferenc 9A JIM iulation, June 1981,pp. 8-16.

7. Vaters, Richard C.. "Automatic Analysis of theLogical Structure of Programs", AI-TR-492,Artificial Intelligence Laboratory, MIT, 1976.

S. Waters, R., "rhe Ptogrmoser's Apprentice:Knowledge Based Program Editing," ULgE Trn_ J\

sectionss 2&. Softwazrt gnain SDL -5 , 1,January 1982, pp. 1-12.

%, A.

'%.

L ,-- . .. ;"% ; -•-. ' -. ' . . ..-.. ., ., .-... ,% ' ' ... ' ' ' .,.' % ... , ,,. .. °.,.:' . ... ,°•-. - . . ,.-.

* . A a-.k..a. a ... b ~A S. .. 1..~ ..* - A .. - -- - a - -. . -. a -

I.*. -,

p

APPENDIX B

RUBRIC: A System ror Rule-BasedInformation Retrieval

I

I

I

I

* ..

'-a',

a *

6,

.d ~

~aA.

- I,

a.

s *.a%:..:.a: -.-. %~*.->.-:%->.-v~.~-*A'................*. 'a

[Invited paper, COMPSAC '83, November 1983.)

RUBRIC: A SYSTEM FOR RULKI-ASED IUFORMATION R.ETRIEVAL.

Brian P. McCune, Richard H. Tong, Jeffrey S. Dean. Daniel C. Shapiro

Advanced Information & Decision Systems -".Mountain View, California

ABSTRACT @ C., -, euemc,A research prototype softvare system for con- TATISTICAI 1116N CHIN

sfl~ACNAPPROACHceptual information retrieval has been developed. F G- _The goal of the system, called RUBRIC, is to pro- Wat" 92ESSIO-6vide-ore automated and relevant access to unfor- ,m.A,, Sic (81W450. .matted textual databases. The approach is to use ,-'%I-N.G

production rules from artificial intelligence todefine a hierarchy of retrieval subtopics, vith •(sAV.

fuzzy context expressions and specific word phrases csuagcs-at the bottom. RUBRIC allows the definition of '"""" /Si i.rttaccdetailed queries starting at a conceptual level, "rim ,',, IU

partial matching of a query and a document, selec- c(ua. 1.ccu

tion of only the highest ranked documents for "WL LsWA

presentation to the user. and detailed explanation v.srum.'of how and why a particular document was selected. .SE-MTICjInitial experiments indicate that a RUBRIC rule set APPROACHbetter matches human retrieval judgment than astandard Boolean keyword expression, given equalamounts of effort in defining each. The techniques Figure 1: The Information Retrieval Triangle

presented may be useful in stand-alone retrievalsystems, front-ends to existing informationretrieval systems, or real-time document filteringand routing. relatively naive computer users. For both classes

of users, it is important that future retrievalsystems possess the following attributes:

1. TEK INFORMATION RITRIEVAL PROILEI - Detailed queries should be posed at the user'sown conceptual. level, using his or her vocabu-lary of concepts and without requiring complex

The three most comon approaches to textual progrmming.

information retrieval (see the vertices of the tri-angle in Figure 1). when used in isolation, suffer - Partial matching of queries and documentsfrom problems of precision and recall, understands- should be provided, in order to mirror thebility. and scope of applicability. For example, imprecision of human interests.Boolean keyword retrieval systems such as the com-mertial DIL.OG system operate at a lexical level, - The number of documents retrieved should beand hence ignore much of the available information dependent upon the needs of the user (e.g..that is syntactic, semantic, pragmatic (subject- uses for the documents, time constraints onmtter specific), or contextual. The underlying reading them).reasoning behind the responses of statisticalretrieval systems (Salton 6 HcGill-831 is difficult - A logical, understandable, and intuitive expla-to explain to a user in an understandable and nation of why each document was retrievedintuitive way. Systems that rely on a semantic should be available.understanding of the natural language that ispresent in documents (Scbank & DeJong-791 oust The user should be able to easily experimentseverely restrict the vocabulary and document with and revise the conceptual queries, instyles allowed (e.g.. to partially formatted, order to handle changing interests or disagree-stereotypic messages). sent with previous system performance. '

In addition to being used by specialists, in - Conceptual queries should be easily stored forthe near future large on-line document repositories periodic use by their author and for sharingwill be made available via computer networks to with ocher users.

--% %

2

2. A EiSOWLZMg-BAS•D Ap.ROA"Although RUBRIC is a knowledge-based system,

it really is not an expert System in the usualsense. 1n an expert System the sytem's knowledge

In order to address the issues raised above, base is an attempt to define what is known aboutwe bare created a prototype knowledge-based full- some field of inquiry (e.g.. infectious diseases.text inforation retrieval system called bUBRIC geology) in a useful form analogous to that used by(f or RUle-Based Retrieval of Information by Com- human experts. Although the knowledge is neverputer). RUBRIC integrates sone of the best charac- complete and perhaps not agreed upon by allteriatics of all three basic approaches to inform- experts there exsts some underlying theory ortioe retrieval (Figure 1) within the framework of a physical model that all concerned believe. In thestandard artificial intelligence techniqu . case of information retrieval, as in other areas of -Queries are represented as a set of logical produc- preference such as politics or matters of style,tion rules that enable the user to define retrieval there is no "right" answer. Hence, RUBRIC iscriteria using much better semantic and heuristic really a system for capturing and evaluating humancontrols than can be found in current retrieval preferences. Preference systems are likely to playsystems. a much larger role in the future, as artificial

The rules define a hierarchy of retrieval intelligence tackles the problem of supporting con-topics (or concepts) and subtopics. By naming a plex, multi-attribute decision mking.single topic, the user automatically invokes agoal-oriented search of the tree defined by all ofthe subtopics that are used to define that topic.The lowest-level subtopics are defined in terms of 3. EXJZPSSIIG QUERY TOPICS AS PRODUCTION RULESpattern expressions in a text reference language,which alloys keywords, positional contexts, andsimple syntactic and semantic notions. Each rule RUBRIC gains its power from the knowledge basemay have a user-provided heuristic weight. This of retrieval rules at* its disposal. An example setweight defines how strongly the user believes that of rules that defines the topic of the 1982 World 1.the rule's pattern indicates the presence of the Series of Baseball is given in Figure 2. Theserule's subtopic. Technical issues that arise when fifteen rules define a main topic, calledinformation retrieval is viewed as a problem in World_Series, and a number of subtopics. The sub-evidentisry reasoning are discussed in (Tong et topics are used to define the main topic. but waysl.-83Bj, also be used as query topics on their own or as

subtopics of other main topics. This rule set isTo perform a retrieval RUBRIC uses the set of by no means complete; however, extensions in the

vres for a topic to create a heuristic Mlot goal. for of additional rules are teasy to sake.tree that defines at its leaves what patterns ofwords should be present in documents, and in what Each rule defines a logical implication; thatcombinations. is, the existence of the pattern on the leftband

side of the arrow ("->") implies the existence stDocument recall by RUBRIC is enhanced by the the topic named on the. rightband side. Thus, a

use of higher-level notions than simple Boolean rule definds the topic or concept named in itscombinations of keywords. Retrieval precision is righthand side. There may be multiple rules aboutimproved by the use of variable weights on each the same topic, and RUBRIC will use each as anrule to define the certainty of match. These equally valid alternate definition (i.e., there isweights make it possible to present to the user an implicit oR). The lefthand side of a rule isonly partial matches above some threshold. By its body, which defines a pattern to be matched.tracing through rule invocation chains, an explana- This can be the topic named in the righthand sidetion facility allows the user to see exactly why a of another rule, a text reference expressiondocument was retrieved and why it was assigned its (defined below), or a compound expression thatoverall certainty or importance weight. This pro- defines the logical AND (denoted by "&") or ORmotes experimentation and appropriate modification ("Mi) of two or more other rule topics or textof the rule base. The retrieval vocabulary to be reference expressions. Explicit text to be matchedused is unrestricted, being left up to whoever without further interpretation is surrounded bycreates the rules. Rule sets may be stored in pub- quotation marks; names of topics and text referencelic or private rule *libraries., so that useful language constructs are not. The last element in asubtopics may be shared among users, thus simplify- rule is its weight, which is a real number in thein& the task of defining new topics. interval (0.11. It represents the rule .definer's

confidence that the existence in a document of theA rule-based query can be more complex than pattern defined by the rule's lefthand side implies

the keyword expression that might be used with a that the document is about the topic named in theBoolean retrieval system. Therefore, we expect rule's righthand side. If a weight is omitted, itrule-based retrieval to be used initially for is assumed to be 1.0 (i.e., absolute confidence).applications in which the eame query is made Note that a weight is a number made up by a humanrepetitively over some period of time. In such user, based upon his or her experience and insight;situations people who are trained RUBRIC users but a weight is nML a statistical quantity.not programmers should be willing to expend moreeffort to develop a detailed rule-based definitionof the query topic.

%. %:

3

team I event VorldSeries increase (or decrease) our belief if seen Lu con-

juvction with some primary evidence. The form of

St. Louis Cardinals I Milwaukee Brevers - tem such a rule is

"Cardinals" -> St. Louis Cardinals (0.7) if A. the& C to degree vl;Cardinals..fullnsme -> St. LouisCardinals (0.9) but if also B. then C to degree w

saint & "Louis" & "C.ardinaWsai Cardinals ual.dname where if v1

is greater than v2 then 3 is discon-d ufirming auxiliary evidence, and if vI is less then

"St." "> saint (0.9) w2 then B is confirming auxiliary evLdence. This"Saint" -> saint has the effect of interpolating betveen v and w2.

depending upon the certainty computed for the auxi-

'Irevers" -> Nilvaukee Brewers (0.5) lisry clause B. Thus we might have a rule of the'Milwaukee Brevers" -) Milvaukee..Brewers (0.9) kind:

"World Series" -> event if (the story contains the literal string "bomb").baseballchampionship -> event (0.9) then (it is about an explosive evic)

to degree 0.6;baseball & championship -> baseball championahip but if also (it mentions a boxint u tch),

then (reduce the strength of the conclusion)"ball" -> baseball (0.5) to degree 0.3"baseball" -> baseball

Mere ye see the concept of disconfiruing evidence"championship" -> championship (0.7) in operation; notice that by itself being about the

concept boxing mUtc is not evidence that can beFigure 2: Rule Base for Topic of World.Series used to support or deny the conclusion ye are try-

ing to establish.

A text reference expression nay be a single Knowledge bases of rules are expected tokeyword or phrase, or a lexical context within evolve over time. Initially the set of rules pro-which two keywords or phrases must be found (e.g.. vided in a knovledge .base will capture a small por-word adjacency, same sentence, same paragraph). tion of the kinds of knowledge required. New rulesSo, for example, one can specify that two patterns are easily added to RUBRIC, currently by means of aare of interest only if they occur in the same standard display-oriented text editor. Existingsentence. Fuzzy (partial) matching versions of rules may be modified for experimentation to pro-these contexts are also allowed. RUBRIC's fuzzy vide feedback for honing their logical structure,pattern matcher returns a value in 10,11 that is keywords, and weights.proportional to the degree that the phrases are inthe desired context. i.e., inversely proportionalto the logical distance between the two objects inthe document. For example, when matching a fuzzy 4. QUFY PROCESSINCame-sentence cootext. two phrases in the ame sen-tence might receive a weight of 1.0, within adja-cent sentences 0.8, etc. A set of rules defines a logical hierarchy of

.retrieval topics and subtopics (Figure 3). ARules often define alternate terms, phrases, specific retrieval request is carried out by a

and spellings for the same concept. Thus, rules goal-oriented inference process similar to thatcan also provide a simple hierarchical thesaurus, used in the HYCIN medical diagnosis systemwith variable weights defining the degree of cer- (Shortliffe-76]. This process creates and evalu-tainty vith which a particular variant is to match. ates an AND/OR tree of logical retrieval patterns.for example, in English "St.* is used as the abbre- The root node of this tree represents a smanticviacion for both "Saint" and "Street4 . and thus topic or concept that the user wants retrieved;"St." is weighted less that the keyword 'Saint" in nodes farther down in the tree represent intermedi-Figure 2. Rules can also aid multilingual informa- ate topics with which the root topic is defined;tion retrieval. For example, if the database con- and nodes at the leaves of the tree represent pat- .'-'

tains text in multiple languages, then the lowest terns of words that are to be searched for in thelevel(s) of rules might define synonyms in each database. Each arc in the tree is weighted such .. --

language of interest, The more conceptual, that the intermediate topics and keyword expres-language-independent rules higher in the hierarchy sions contribute, according to their weight, to thewould remain unchanged, overall confidence that the root topic has also

been found. (Unlabeled arcs in Figure 3 have an ..

It has been found useful to provide a new type implicit weight of 1.0.) Arcs representing theof rule in RUBRIC, called a moifierz jrL. which onjuncts of an AND expression are linked togetherenables the user to incorporate auxiliary (or con- near their con base in Figure 3.textual) evidence into the query. Auxiliary evi-dence is evidence that by itself neither confirms RUBRIC supports a number of calculi for inter-one discoufirms a hypothesis, but which may preting the rule weights. Weights are treated as

certainty or partial truth values, not as

4KW,..tdl..G.,4(k.y1)

-kesoe.. a" *.( 4 "Wa I.Eas(h ..p*

'5.t.0.tI' .4-1 to) -vlid U.ib to

I. .j..SSL1 C:. 1=1.41

S 0 I.& .. ,() 5d ht'(0) t .G.1 i.01 jh . -0 p1041.1)

0..1

,.Iti (a %I.() (as ' do.I 0o) N.1II (1.0) 1 SI11l0 t.eetI 10

'I.* (oo *U4-4 (0)

Figure 3: Rule Evaluation Tree for World_,Series Topic

probabilities. Each calculus defines bow to comn- receive a weight of 0. The baseball code wouldbin. the uncertainties during such logical deduc- then be assigned the value 1.0 because that is thetions as AND. OR. and implication. The default maximum of (1.0 sultipled by 0.5) and (1.0 timesmethod is to use the functions minimum. maximum. 1.0). Similarly, the championship node receivesand product to propagate the weights across AND and the value 0.7. Then. because it is an AND code,01 arcs and! implication nodes, respectively the bseballcbampionship node gets the value 0.7.(Shortliffe-76I. vhich is 1.0 times the minimum of 1.0 and 0.7. The

event node then gets the value 0.63. which is theReferring to Figures 2 and 3. ye nov describe maximum of (0 times 1.0) and (0.7 times 0.9).

how RUBRIC processes a query. (Annotated traces of Sic thece are no keywords in the docuoment thatthe sytma operation are found in (McCune et support the team subtopic, the overall weight of&1.-831.) When the user types in the conceptual the match of the WorldSerios topic on this docu-query World Series. RUBRIC searches its rule base went is 0.63 (1.0 times the maximum of 0 and 0.63).for all rules that provide definitions for thistopic (i.e.. that have WorldSaries on their right- Other combinations of keywords and phrases inhand sides). There is only one such rule in Figure a document can satisfy the concept of World Series2, so RUBRIC expands that rule according to its to varying degrees. Figure 4. shows the- otherlefthand side. The result is that the weights possible for the World Series topic.World Series. tams, and event nodes of Figure 3 are depending upon the dominant phrases that occur increated, as veil as the two arcs between th em, the document.Since team and event art themselves the names Oftopics, rather than textual patterns. RUBRICsearches its rule base for their definitions. This Phrases Present in Document support forprocess continues recursively until all leaf nodes World Series Topicof the tree contain textual pattergs.

"World Series" 1.00At this point each document in the database is

watched against all of the phrases in the leaves of "Saint", "Louis". "Cardinals" 0.90the tree. For a given document, if a phrase is 'Milwaukee Brewers' 0.90found somewhere in the document, the correspondingnode in the tree is assigned a value of 1.0. other- "St.". "Louis", "Cardinials" 0.81vise 0. Then the weights at the leaves are com-bined &ad propagated up through the tree to deter- "Cardinals" 0.70mine the overall weight to be assigned to thisdocument. 'baseball", "championship" 0.63

For example, if a document contained the words 'brewers" 0.50"*ball ", "baseball", and "championship", and noother words referred to in the example rule base, "ball", "championship" 0.45then the modes of the tree would be assigned theveight@ shown is penittheses in Figure 3. The none of the above 0.00"ball". "baseball", and "championship" leaf nodesall receive a veight o1 1.0. and all other leaves Figure 4: Possible Weights for World-Sories Topic

N.~

7|

(denoted MM). The first definition therefore gives

S. USER INTERYACE us an insight into the system's ability to rejectunwanted stories (precision). whereas second givesus insight into the system's ability to select

A user need only see the highest weighted relevant stories (recall). . .-

documents. After the database has been searched.

each document that was considered has an associated We selected as a retrieval concept "violent

weight that represents the system's confidence that acts of terrorism". and then constructed an

the document is relevant to the topic requested by appropriate rule-based query. This is summarized

the user. RUBRIC sorts these documents into des- in Figure S. where we make extensive use of odif-

cending order based upon their weights, and groups ier rules. An auxiliary clause is shown linked to

the documents by applying statistical clustering its conclusion by a directed arc labeled '"odif- I..techniques to the weights. The user i's then ier". Application of this query to the story data-

presented with those documents that lie in a clus- base results in the story profile shown in Figure --

ter containing at least one document with a weight 6. (Notice that for presentation purposes the

above a threshold provided by the user (e.g.. 0.8 stories are ordered such that those determined to

or above). Clustering prevents an arbitrary three- be a priori relevant are to the left in Figure 6).

hold from splitting closely ranked documents. The The performance scores for this experiment are

threshold may be varied depending upon how much Precision: NF - 1 when we ensure that NM 0. and

time the user has available to read documents, bow

important it is not to miss any potentially eal NH-Shnwensrtat 7 -0

relevant ones, etc. R NM 5

RUBRIC is able to explain why a particular This is almost perfect performance, being marred

document was retrieved. This capability is very only by the selection of story 25, which, although

important for instilling confidence in users and it contains many of the elements of a terrorist

helping them get a good enough feel for the oper~a. article, is actually a description of an unsuccess-ful bomb disposal attempt.

tion of the system that they can successfully writeand use their own retrieval rules. RUBRIC can.display each rule that results in a non-zero weight ,.-..

being propagated, as well as the value of thatweight. RUBRIC can also show each attempt to matcha word or phrase to the document, along with REASONU*n 1.0

whether or not it matched.- .5 1.0

.6 REV~OLUIO SCTCErt(OPPOSTIEN.

COV"K5NIl

6. LI2ZRINIENTL RESULTS

We have done preliminary experiments with s

RUBRIC to examsine the improvements that can be ."

achieved over a conventional Boolean keyword SNTN IKILLING. POLTICIA)

approach. As an experimental database for testing Ithe retrieval properties of RUBRIC, ve have used a 1.0

selection of thirty stories taken from the Reuters,News Service. Our basic experimental procedure is 1.0 0

to rate the stories in the database by inspection .4 StIfIC.KACot C€sst-CTO.

(i.e., define a subjective ground truth), constructa rule-based representation of a typical query.apply the query to the database, and then comparethe rating produced by RUBRIC with the a priori VINT-C4EtU ^l e 1.0 I0LNTAF vTt

rating.

We concentrate on two basic measures of per- .i '0(5" "1CAT" "(5S"

foruance. Both of these are based on the idea ofusing a selection threshold to partition theordered stories so that those above it are TIOLtT.ACI

"relevant" (either fully or marginally) and those .O1.0

below it are "not relevant". In the first we lowerthe threshold until we include all those deemed "

riori relevant, and then count the number of. KILLING GOOING KIOlUtING (€OUWI[TE TKEOVER

unwanted stories that are also selected (denoted ,.,

N). In the second we raise the threshold until we

eclude all irrelevant stories, and then count the Figure 5: Rule Base Structure for Concept

ember of relevant ones that are not selected of Violent Acts of Terrorism

S. - - -o". . . .

6

To compare RUBRIC against a more conventional in the specification of his or ber query, therebyoproach. we constructed two Boolean queries by increasing both. precision and recall. Aing the rule-based paradigm and setting all rule traditional Boolean query tends either to over- oreights to 1.0 (thus incidentally @bowing that our under-'coosttain the search procedure. giving poortthod subsumes Boolean retrieval as a special recall or poor precision. Wie feel that. givense). One of these queries is shown in Figure 7. equal Amounts Of effort. RUBRIC allows betterian AND/OR tree of sub-concepts. The only models of human retrieval judgment than can be k

6fference between the two Boolean queries is that achieved with traditional Boolean mechanisms.kthe first vs insist on the conjunction of ACTORkd TERRORIST-EVENT (as shown), whereas in the We have also explored the effects of usingicond we require the disjunction of these con- different calculi for propagating the uncertaintypta. The conjunctive form of the Boolean query values within the system (Tong et al.-83Aj. AmongLsacs five relevant stories and selects one unin- these calculi are well-known classes such as those)rtant story; whereas the disjunctive form selects that use "Wax" and Isin' as disjunct and conjunctLI the relevant stories, but at the cost of also operators, and those (so-called 'layesism-like")riecting seven of the irrelevant ones, that use 'sum" and "product". Our initial conclu-

sion is that the calculus used is not the majorWhile these results represent only a prelim- determinant of performance. hut that it does

ary test. we believe that they indicate that the interact with how rules are defined.JBRIC approach allows the user to he more flexible

STOR RATINGto a .Rating assigned by

RUBRIC (normnalfaed)

rrelevant

STORY NUMER

Figure 6: Story Profile from RUBRIC Experiment

AIL5411SSASINA~TO SSCIFIC-ACTOO GEC15AIcoo

1&fIG ING YAO s ISAYIN POtIICIAN AQ 1w. 0PL 'IRA' '110M . rsx. sas

519155 (190&51O

Figure 7: AND/OR Concept Tree for Boolean Query

7- N. %7.

7

7.~~ ~ [~T~ OK Schank 4 DeJong-791 R. C. Schenk and CDeJong. "Purposive Understanding". Chapter 24.

Much additional research and system develop- in J. E. Hayes. D. Michie. and L. 1. Mikulich,t are needed to make RUBRIC usable. We art editors. Machin Intlli1Senc Volume 9. 1979,rently providing a better user interface and pages 459-478.duct ing sore complete experiments. The inter-e for end users vill include more focused -(Shortliffe-761 Edward Rance Shortliffe.eractive explanation. analysis of results for Comnutgr-ilased Medical Copsultations: !YIN!.sitivity to specific rules and weights. display American Elsevier Publishing Company. Inc.. Niewgraphs such as Figure 6. and rule editing. York. Rev York. 1976.

trimeatation will consist of defining, in con-ction with users, larger rule sets for a realia- - (Tong et &l.-83A1 Richard M. Tong, Daniel C.retrieval domain and then using these rules to Shapira. Jeffrey S. Dean, and Brian P. McCune.naeve documents from a realistic database. "A Comparison of Uncertainty Calculi in an

Expert System for Information Retrieval", inOther areas of possible future work include Alan Bundy. editor, Proceedinim Ao t Eighth-

ing rule evaluation and textual pattern matching International Joint Conference on Artificialt efficient, possibly through the use of beuris- Intelligence, William Kaufmann. Inc., Losa to limit rule evaluation; exploring additional Altos, California, August 1983. Volume 1. pagesa of representing and propagating uncertainty in 194-197.h numeric and symbolic representations; ablativeting to measure how useful each system feature - Tong et al.-83B1 Richard M. Tong, Daniel C.

extending the text reference language to allow Shapiro, Brian P. McCune. and Jeffrey S. Dean,cification of the syntactic role that a word "A Rule-Based Approach to Information

ys in a sentence (e.g., "ship" used as a noun Retrieval: Some Results and Comments",..

:us as a verb); constructing a more general Proceedings Ao th National Conference onsaurus that has a network structure rather than Artificial Intelligence, William Kau,.Aaan,hierarchical one like rules; and allowing Inc.. Los Altos, California. August 1983, pagesrieval from multiple remote databases. 411-415.

8. POTENTIAL APPLICLTIONS

Application systems based on RUBRIC may beFul for information routing and change detec-s, in addition to information retrieval. For)ruation retrieval RUBRIC could be extended toLon formatted documents such as messages oriogrsphlc entries, to work as a front end to

iting databases and information retrieval eye-iand to segment larger documents by subtopics.

LIC could be used to process messages in real-ifiltering the important once and routing themhe appropriate recipient (human or another pro-

). With RUBRIC, analyses of documents overcould detect statistical changes at a concep-level rather than just in the use of indivi-

keywords.

9. RZFEZENCES

[McCune et &1.-83) Brian P. MicCune. Jeffrey S.Dean, Richard M. Tong, and Daniel G. Shapiro.RUBRI: A Systjg jgyr Rule-Based InformationRetrievel, Technical Report 1018-1. AdvancedInformation 4. Decision Systems, mountain View.California, February 1983.

[Salton 4 McGill-831 rerard Salton and MichaelJ. McGill. Introduct-ion 12 Madern InformationReuievial~. Mc~raw-ill Book Company, N1ew York,New York. 1983.

FI.LMED

7-85

DTIC