

[IEEE Comput. Soc. Press 2nd Working Conference on Reverse Engineering - Toronto, Ont., Canada (14-16 July 1995)] Proceedings of 2nd Working Conference on Reverse Engineering - DECODE:

DECODE: A Cooperative Environment for Reverse-Engineering Legacy Software *

0-8186-7111495 $4.00 © 1995 IEEE

Alex Quilici
Dept. of Electrical Engineering
University of Hawaii
2540 Dole St., Holmes 483
Honolulu, HI 96822

David N. Chin
Dept. of Information and Computer Sciences
University of Hawaii
2565 The Mall
Honolulu, HI 96822

Abstract

While automated program understanders have had some success in partially extracting design information from source code, they are unlikely to be able to completely understand existing real-world legacy systems. To address this problem, we have been developing DECODE, an environment in which programmer and system cooperate to extract object-oriented designs from legacy systems. DECODE consists of three components: an automated program understander that extracts some initial stereotypical object-oriented design elements; a structured notebook that provides the user with a graphical view of the system’s understanding and the ability to extend this understanding by linking source code fragments to object-oriented design elements; and a query processor that uses this design information to support conceptual queries about the program’s code and design. This paper describes DECODE and our initial successes and failures with using it to reverse engineer several non-trivial COBOL programs.

1 Introduction

The goal of many reverse engineering researchers is an automated program understanding tool that can automatically extract all of the design underlying a piece of source code. Unfortunately, this goal is far beyond the current state of the art, which is based on trying to understand programs by recognizing instances of known code patterns [15, 8, 9, 23, 24, 6, 16, 11, 7]. Applying this paradigm to reverse engineering real-world legacy systems is problematic, as it appears to require enormous libraries of code patterns, relies on program-understanding algorithms that are not guaranteed to scale, and fails completely on existing idiosyncratic code that does not fit well into patterns [4, 12, 19].

* This project is supported by the KBSA project, Air Force Rome Labs, under Air Force contract #F30602-93-C-0257. We gratefully acknowledge W. Kozaczynski, J. Ning, and A. Engberts of Andersen Consulting for providing us with COBOL/SRE, upon which much of our work is based. Jimqun Cheng and Jeremy T. Harrison were responsible for implementation of significant portions of this project.

Despite these flaws, however, automated understanders clearly can understand useful portions of programs [8, 23]. As a result, it is useful to view software understanding as a cooperative process involving both programmer and system, where at a minimum the system extracts what it can automatically and then assists the programmer in augmenting this understanding. DECODE (Diagram-based Environment for Cooperative Object-oriented Design Extraction) is a prototype environment that supports this cooperative approach by automatically extracting some stereotypical portions of a program’s design, aiding programmers in extracting and recording additional program design information, and answering conceptual queries about the program using this jointly extracted design information.

The queries DECODE supports are conceptual in the sense that they cannot be answered by simple text-based searches, nor can they be answered by referring to structural relationships between program entities. Instead, they can only be answered by accessing the jointly extracted knowledge base describing the program’s design. Some examples of typical conceptual queries include questions such as “Where does this program implement ‘input validation’?” and “What transaction-related objects and operations does this imperative program implement?”.

DECODE combines three components: an automated programming plan recognizer (the APU), a knowledge base for recording extracted design information, and a structured notebook for editing and querying this design. The APU takes an existing program and forms an initial knowledge base by examining it for instances of standard code patterns (programming plans) and the design elements they implement. The programmer then uses the structured notebook to inspect this program’s source code and to extend the system’s original partial understanding. Essentially, whenever the programmer recognizes the purpose of some arbitrary code segment, the programmer selects this code and maps it to an object or operation in a programmer-created object-oriented hierarchy. The structured notebook supports creating this hierarchy, locating and linking code fragments to its elements, and asking queries about the code and its relationship to the extracted design.

Figure 1: A portion of the raw design that DECODE is capable of automatically recognizing. To aid readability, we have deleted some of the intermediate plans the APU recognized.

In this paper, we describe DECODE's approaches to automated and assisted understanding, as well as the particular conceptual queries it supports. We also discuss its shortcomings and our approaches to overcoming them.

Figure 1 shows a portion of the results of running DECODE's APU on a textbook COBOL program. The system has extracted simple plans which map directly to statements within the COBOL program, compound plans which are composed from a set of related plans, and some design elements (classes and operations) implemented by these plans.

DECODE's understander [15] is based on the Concept Recognizer [8, 9], which divides programming plans into two parts: a description of the plan's attributes (which are instantiated when a plan instance is recognized) and a set of common implementation patterns. It represents these code patterns as a combination of components (the language items or subplans that must be recognized to have a potential instance of the plan) and constraints (the relationships that must hold between these components). The Concept Recognizer used a straightforward top-down, library-driven approach to plan recognition, trying to recognize a plan by recursively trying to recognize its components and then verifying the constraints. Its representation of plans is simple and clear and its algorithm is successful at recognizing plans in real-world COBOL; however, the algorithm is slow and does not appear to scale well to large libraries or large programs [8].

DECODE's APU preserves the Concept Recognizer's [approach] to representing progr[amming plans, but extends] it to address the scaling problem in program understanding and to try to recognize conceptual design elements as well as programming plans. It attacks the scaling problem by using a code-driven (bottom-up) rather than library-driven (top-down) approach to recognizing plans, where the plan library is augmented with search control information. We use a code-driven approach so that only those plans potentially present in the program are considered, rather than all plans in the library. However, because using a bottom-up approach has the potential to lead to a combinatoric explosion of possible plans to try, we have extended the Concept Recognizer's plan representation to explicitly indicate [when plans shou]ld be considered and when one plan [is] a specialization or variant of another [...]. In addition, to attack the problem of recognizing conceptual design elements, we have extended the representation of plans to have links to stereotypical, high-level conceptual design elements.
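
The components-plus-constraints view of a plan can be illustrated with a small sketch. This is not the Concept Recognizer's or the APU's algorithm: it is a deliberately naive, exhaustive matcher, and the statement dictionaries, their field names, and the two-component plan are assumptions made up for the illustration. It does show why trying every combination of candidate statements becomes expensive as programs grow.

```python
from itertools import product

def recognize(components, constraints, statements):
    """Return every binding of statements to components that satisfies all constraints.

    components  -- list of (component name, statement kind) pairs
    constraints -- list of predicates over a {component name: statement} binding
    statements  -- list of dicts with at least a "kind" key
    """
    names = [name for name, _ in components]
    candidates = [[s for s in statements if s["kind"] == kind]
                  for _, kind in components]
    matches = []
    # Exhaustive search: try every combination of candidate statements.
    for combo in product(*candidates):
        binding = dict(zip(names, combo))
        if all(check(binding) for check in constraints):
            matches.append(binding)
    return matches

# Hypothetical, simplified statements (field names are illustrative only).
statements = [
    {"kind": "MOVE", "line": 141, "dest": "PRT-MESSAGE", "record": "PRINT-RECORD"},
    {"kind": "MOVE", "line": 157, "dest": "RP-INVALID-RECORD", "record": None},
    {"kind": "WRITE", "line": 143, "source": "PRINT-RECORD"},
]

components = [("Provide-Msg", "MOVE"), ("Dump-Rec", "WRITE")]
constraints = [
    # The MOVE must fill a field of the record that the WRITE dumps ...
    lambda b: b["Provide-Msg"]["record"] == b["Dump-Rec"]["source"],
    # ... and must come before the WRITE.
    lambda b: b["Provide-Msg"]["line"] < b["Dump-Rec"]["line"],
]

matches = recognize(components, constraints, statements)
print([(m["Provide-Msg"]["line"], m["Dump-Rec"]["line"]) for m in matches])
```

Only the MOVE on line 141 pairs with the WRITE on line 143; the other MOVE is rejected by the first constraint.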

2.1 The APU's Augmented Plan Library

Figure 2 contains an example of our extended representation for code patterns.

DISPLAY-LABELLED-RECORD(Record-Name: ?rec, Message: ?msg)
  components
    Clear-Rec:    FILL-WITH-SPACES(Dest: ?pr)
    Provide-Msg:  MOVE(Source: ?msg, Dest: ?msgfd)
    Provide-Rec:  MOVE(Source: ?rec, Dest: ?recfd)
    Dump-Rec:     WRITE(Source: ?pr)
  constraints
    Msg-B-Dump:   DataDep(Dump-Rec, Provide-Msg, ?msgfd)
    Rec-B-Dump:   DataDep(Dump-Rec, Provide-Rec, ?recfd)
    Clr-B-Msg:    CntlPath(Clear-Rec, Provide-Msg)
    Clr-B-Rec:    CntlPath(Clear-Rec, Provide-Rec)
    Msg-Is-Field: Field(?msgfd, ?pr)
    Rec-Is-Field: Field(?recfd, ?pr)
  indexes
    Dump-Rec when Msg-Is-Field, Rec-Is-Field

FILL-WITH-SPACES(Dest: ?pr) specializes MOVE(Source: Spaces, Dest: ?pr)

READ-ALL-RECORDS(File: ?f)
  components
    Reader: READ-NOTE-EOF(File: ?f, Ind: ?ind, Eof-Val: ?v)
    Looper: LOOP-UNTIL-EQ-PLAN(Acts: ?seq, Flag: ?ind, Val: ?v)
  constraints
    Within-Loop: CntlFlow(?seq, Reader)
  indexes
    Reader when Within-Loop
  implies VALID-INPUT-RECORDS(File: ?f, Cond: ?c) when
    with Notifier: NOTIFY-BAD-REC(Cond: ?c, Rec: ?r, Msg: ?r)
    In-Loop:   CntlFlow(?seq, Notifier)
    Each-Read: CntlFlow(Reader, Notifier)
    Same-File: RecordFile(?f, ?r)
  implies FILTER-COPY-INPUT(File: ?f, Cond: ?c) when
    with Cond-Dump: COND-WRITE(Cond: ?c, Record: ?r)
    In-Loop:   CntlFlow(?seq, Cond-Dump)
    Each-Copy: CntlFlow(Reader, Cond-Write)
    Same-File: RecordFile(?f, ?r)

Figure 2: An example of several code patterns from our APU's plan library.

Each plan has an index that says when it should be considered (i.e., fully matched against known program pieces and recognized plans). The index combines a plan component and one or more plan constraints and suggests that the plan should be considered whenever this component is recognized and the specified constraints hold. The plan DISPLAY-LABELLED-RECORD, for example, is indexed by a WRITE when the record written has two fields. This means that the APU considers this plan only when it encounters a WRITE, not every time it encounters a MOVE or a WRITE (as in most bottom-up approaches). The idea is that indexes suggest when plans are likely to occur as opposed to when plans might occur. In our tests with example programs, this significantly cuts down the number of times each plan is considered by a bottom-up understander. (For example, we applied our APU to one COBOL program that had 43 MOVEs and 12 WRITEs, and indexing DISPLAY-LABELLED-RECORD by WRITEs reduced the number of times it was considered by a factor of 10.)
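
The effect of indexing can be sketched with a simple count. The `Plan` class and `times_considered` function below are hypothetical names, and the sketch models only the indexing component, not the index constraint (that the written record has two fields). It therefore shows the direction of the saving rather than reproducing the factor of 10 reported above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    component_kinds: frozenset  # statement kinds that appear as plan components
    index_kind: str             # statement kind that triggers consideration

def times_considered(plan, statements, use_index):
    """Count how often a bottom-up recognizer would try to match `plan`."""
    triggers = {plan.index_kind} if use_index else plan.component_kinds
    return sum(1 for kind in statements if kind in triggers)

display_plan = Plan("DISPLAY-LABELLED-RECORD",
                    frozenset({"MOVE", "WRITE"}), "WRITE")

# Statement mix taken from the example in the text: 43 MOVEs and 12 WRITEs.
statements = ["MOVE"] * 43 + ["WRITE"] * 12

unindexed = times_considered(display_plan, statements, use_index=False)
indexed = times_considered(display_plan, statements, use_index=True)
print(unindexed, indexed)  # 55 12
```

Adding the index constraint would filter the 12 WRITEs further, which is presumably where the rest of the reported reduction comes from.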

Plans can also be defined as specializations of existing plans (i.e., as a set of constraints on a plan's attributes). The plan FILL-WITH-SPACES, for example, is defined as a specialization of a MOVE whose Source is SPACES. These specializations correspond to plans containing a single component (the plan being specialized), with constraints on the component's attributes, and the component as the plan's index (and, in fact, specializations are automatically translated into the standard plan definitions). The idea behind specializations is simply to make it easy to define one common class of plans and to encourage using the specialized plans as components and indexes of other plans, rather than the plans they specialize. This reduces matching when specialized plans are indices since they occur less frequently than the plans they specialize.
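
The translation of a specialization into a standard plan definition might look like the following sketch. The dictionary layout and the component name `Base` are assumptions made for illustration; the paper does not give the APU's internal format.

```python
def expand_specialization(name, base_plan, fixed_attributes):
    """Translate 'name specializes base_plan(...)' into a standard plan definition.

    The result has a single component (the plan being specialized), one
    constraint per fixed attribute, and that component as the plan's index.
    """
    component = "Base"  # hypothetical component name
    return {
        "name": name,
        "components": {component: base_plan},
        "constraints": [(component, attr, value)
                        for attr, value in sorted(fixed_attributes.items())],
        "index": component,
    }

fill = expand_specialization("FILL-WITH-SPACES", "MOVE", {"Source": "SPACES"})
print(fill)
```

Because the expanded form is an ordinary plan, a recognizer needs no special case for specializations.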

Finally, plans can be defined as being conditionally implied by other plans. Whenever the APU recognizes a plan that conditionally implies other plans, it checks whether these conditions hold (which involves searching for additional components and checking additional constraints). For example, the plan READ-ALL-RECORDS implies the plan VALID-INPUT-RECORDS when there is an additional NOTIFY-BAD-REC that conceptually follows the READ-NOTE-EOF within the input-reading loop. This reduces matching over the situation where VALID-INPUT-RECORDS is a separate plan that shares components in READ-ALL-RECORDS. It also reduces the complexity of plan definitions when VALID-INPUT-RECORDS contains READ-ALL-RECORDS as a component and there are constraints between the two plan implementations.
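
Conditional implication can be sketched as a check that runs once the implying plan has been recognized. The function name, the table layout, and the `constraint_holds` callback are assumptions; a real recognizer would evaluate constraints such as In-Loop against control and data flow rather than accept a stub.

```python
def apply_implications(recognized, implications, constraint_holds):
    """Add every plan that is conditionally implied by an already recognized plan.

    recognized       -- set of recognized plan names
    implications     -- {plan: [(implied plan, extra components, constraint names)]}
    constraint_holds -- predicate deciding whether a named constraint holds
    """
    result = set(recognized)
    for plan in recognized:
        for implied, extra_components, constraints in implications.get(plan, []):
            components_found = all(c in recognized for c in extra_components)
            if components_found and all(constraint_holds(c) for c in constraints):
                result.add(implied)
    return result

implications = {
    "READ-ALL-RECORDS": [
        ("VALID-INPUT-RECORDS", ["NOTIFY-BAD-REC"],
         ["In-Loop", "Each-Read", "Same-File"]),
        ("FILTER-COPY-INPUT", ["COND-WRITE"],
         ["In-Loop", "Each-Copy", "Same-File"]),
    ],
}

recognized = {"READ-ALL-RECORDS", "NOTIFY-BAD-REC"}
# Stub: assume every named constraint holds in this toy program.
result = apply_implications(recognized, implications, lambda name: True)
print(sorted(result))
```

VALID-INPUT-RECORDS is added because its extra component was found; FILTER-COPY-INPUT is not, because no COND-WRITE was recognized.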

2.2 APU Support for User-Extracted Design

Besides recognizing plans, DECODE's APU explicitly tries to connect these recognized plans to higher-level design elements (i.e., entities that users would discuss independently of how they might be implemented). For example, it links instances of the plan OPEN-VALIDATE-CLOSE to the ValidateRecords operation on objects in the class RecordFile. To support this behavior, DECODE extends the definition of a plan to explicitly indicate whether each plan is design-oriented (of interest to the user) or incremental (necessary for the recognizer but not of general interest to the user). This is done by labelling each high-level plan definition (where the plan's attributes are specified) with the operations it implements. For example, OPEN-VALIDATE-CLOSE's definition indicates that it implements the conceptual operation ValidateRecords on the class RecordFile.

plan OPEN-VALIDATE-CLOSE isa COMPLETE-FILE-PROCESSOR
  comments
    "Open a file, read through it, write a"
    "message for each record that fails a"
    "particular test, and then close the file."
  implements
    ValidateRecords on RecordFile

When the APU recognizes an instance of any plan, it automatically adds the corresponding design elements to the knowledge base. This allows the APU to automatically recognize some conceptual design elements, which means the user doesn't have to start extracting design elements from scratch.
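
A minimal sketch of this bookkeeping, assuming the knowledge base is a dictionary from (class, operation) pairs to linked source lines. The `IMPLEMENTS` table, the function name, the choice of READ-NOTE-EOF as an unlabelled plan, and the line numbers are all illustrative assumptions.

```python
# Plan -> (operation, class) labels; incremental plans have no entry.
IMPLEMENTS = {"OPEN-VALIDATE-CLOSE": ("ValidateRecords", "RecordFile")}

def record_instance(knowledge_base, plan_name, source_lines):
    """Record a recognized plan instance; return the design element it adds, if any."""
    label = IMPLEMENTS.get(plan_name)
    if label is None:
        return None  # incremental plan: no design element is added
    operation, cls = label
    knowledge_base.setdefault((cls, operation), set()).update(source_lines)
    return (cls, operation)

kb = {}
added = record_instance(kb, "OPEN-VALIDATE-CLOSE", {86, 87, 88})
skipped = record_instance(kb, "READ-NOTE-EOF", {135})
print(added, skipped, kb)
```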

The APU, however, is not always successful at recognizing high-level plans and instead may only be able to recognize a set of incremental plans. Even though these plans do not directly connect to abstract design elements, they can still aid the user in understanding the program. DECODE supports this by providing users the same visual access to recognized plans as to higher-level design elements. The user can then use the design editor to link these plans to user-provided design elements. In addition, the system also provides an explicit natural language description of each plan's function (which is simply a set of comments stored with the plan when it is defined). The user can then visually access both this plan's textual description and the relevant lines of the source file. This helps the user understand what abstraction the recognized plans capture.

3 Assisted Understanding

Usually, our APU will only be able to understand a portion of an existing program. As a result, after DECODE's initial stab at extracting design information, the programmer must continue the understanding process. DECODE provides a structured notebook that allows the user to view the source code and its relationship to the extracted object-oriented design, expand this design, and link design elements to arbitrary portions of the source code. The structured notebook consists of two main windows: a code-browser, which displays the source code, and a design-editor, which displays a graphical view of the object-oriented design.

Programmers record their understanding by interactively adding new conceptual design primitives and linking them to the program. Figure 3 shows a programmer-extended version of the APU's original understanding of the program. This extension is just the first part of the object-oriented design a programmer might extract from studying our example code. In particular, from examining both the source code and the automatically extracted plans, the programmer has recognized a new conceptual class, ReservationsFile (a particular type of RecordFile) and a new conceptual operation, ConstructFromTransaction (the operation of building a ReservationsFile by running through a TransactionFile).

Figure 4 shows how the programmer adds this design information. The programmer first uses the design editor to add the new design elements and then links them to highlighted chunks of source code in the browser window or to already recognized plans. The ability to link to existing plans lessens the need for programmers to identify all the relevant statements on their own. And by recognizing portions of the planning structure that captures the underlying relationships between these statements, the APU eases the programmer's task in making this link.

Figure 3: A screen dump showing a user extension to the system-extracted design.

The design-editor lets the user graphically record many of the common relationships between objects recorded in typical object-oriented CASE tools (such as creating classes and operations, indicating relationships between classes, and so on). However, CASE tools do not generally support the linkage actions and at best allow the user to indicate that certain sections of code correspond to certain functionality in block diagrams of the system.

The design notebook allows a design element to be linked to either a plan or to an arbitrary collection of statements. The user can graphically link a design element to a plan or to code by highlighting arbitrary (and possibly disjoint) sections of the program's source code by pointing in the code-browser window, pointing to an existing object-instance or implementation node, and finally invoking the Link menu item.

Figure 4: A screen dump showing how the user extends a system-extracted design.

The design notebook currently indicates links between the design space and code in two ways. One is that the code-browser window always displays a portion of the source code that is linked to the currently selected design element. In general, this is only a portion because the source code underlying a given plan may be spread out over far more lines than fit within the code-browser window. The other is through an explicit line-number attribute on each design element that lists the source lines it is linked to. This attribute is filled in automatically whenever a design element is linked to either plans or code. This method, however, is not entirely satisfactory for high-level plans which may involve many lines of source code. As a result, we are exploring the use of a window that provides a global visual representation of the source code and the pieces of it that are related to the current design element (others have used a similar technique to display program slices [1]).
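
Linking and the line-number attribute can be sketched as follows. The `DesignElement` class, the `link` signature, and the compact range format are assumptions; the paper says only that the attribute lists the linked source lines. The line numbers are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DesignElement:
    name: str
    linked_lines: set = field(default_factory=set)

def link(element, line_ranges=(), plan_lines=()):
    """Link an element to highlighted (first, last) ranges and/or a plan's lines."""
    for first, last in line_ranges:
        element.linked_lines.update(range(first, last + 1))
    element.linked_lines.update(plan_lines)

def line_number_attribute(element):
    """Render the linked lines as compact ranges, e.g. '115, 137-143'."""
    parts = []
    start = prev = None
    for n in sorted(element.linked_lines):
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = n
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ", ".join(parts)

validate = DesignElement("ValidateRecords")
# Two disjoint highlighted sections plus the lines of an already recognized plan.
link(validate, line_ranges=[(137, 143), (156, 157)], plan_lines={115, 135})
attribute = line_number_attribute(validate)
print(attribute)  # 115, 135, 137-143, 156-157
```

Even in this compact form the attribute grows quickly for high-level plans, which is the shortcoming noted above.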

4 Conceptual Querying

The result of this combined programmer/system understanding process is an extracted object-oriented design and a set of links from this design to both the code and its underlying plan structure. In essence, this extracted design provides a queryable form of program documentation for use by maintenance programmers. Our notebook supports several different types of conceptual queries:

Code Function: What is the purpose of a particular piece of source code? (e.g., Which conceptual operations require the data item TR-FLIGHT-NUMBER? Or, which conceptual operations involve a particular MOVE statement?)

Code Location: Where is the code corresponding to a particular design element? (e.g., What pieces of the source code implement the ReadRecord operation? What pieces of the source code are related to implementing any of the class TransactionRecord's operations?) These queries can be thought of as requesting conceptual slices, an alternative to structural program slicing [21].

Design Completeness: What design elements have not been recognized in the program? (e.g., Which ReservationFile operations are not implemented in this source file? Which lines of the program have not been linked to the design?)
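
Given explicit links from design elements to source lines, the three query types reduce to simple lookups. The sketch below assumes the links are stored as a dictionary from element names to sets of line numbers; the function names, the element names in dotted form, and the line numbers are illustrative, not DECODE's actual representation.

```python
def code_function(links, line):
    """Code Function: which design elements is this source line linked to?"""
    return sorted(e for e, lines in links.items() if line in lines)

def code_location(links, element):
    """Code Location: which source lines implement this design element?"""
    return sorted(links.get(element, ()))

def unimplemented(links, elements):
    """Design Completeness: which design elements have no linked code?"""
    return sorted(e for e in elements if not links.get(e))

def unlinked_lines(links, all_lines):
    """Design Completeness: which source lines are linked to no design element?"""
    linked = set().union(*links.values())
    return sorted(set(all_lines) - linked)

links = {
    "RecordFile.ValidateRecords": {115, 135, 137, 138, 139},
    "TransactionRecord.ReadRecord": {135},
    "ReservationsFile.ConstructFromTransaction": set(),
}

print(code_function(links, 135))
print(code_location(links, "RecordFile.ValidateRecords"))
print(unimplemented(links, links.keys()))
print(unlinked_lines(links, range(135, 141)))
```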

These queries can be asked simply by browsing an existing design and/or selecting queries from the Query menu. In particular, selecting any implementation or object-instance in the design-editor window causes the corresponding lines in the code-browser window to be automatically highlighted (scrolling as needed to make the first corresponding line visible). Similarly, when browsing the source code, the user's selecting lines will automatically highlight the corresponding nodes in the design-editor (likewise, scrolling as needed to make the node visible). This allows the user to easily see the connection between design and implementation.

Figure 5 is an example query. There, the user has requested a complete description of the ValidateRecords operation. This query can be effortlessly asked by highlighting an operation in the design window and using the Query menu of the Code Browser. The result is a window describing both conceptual design information (such as what types of objects this operation produces or consumes) as well as the particular source code statements that have been used to implement it. In addition, the relevant source code lines are highlighted in the code browser. By [...]

[Figure 6: source lines extracted from the example COBOL program]

   86   [...] RESERVATIONS-FILE PRINT-FILE
   87   PERFORM CREATE UNTIL RP-END-OF-TRANS = 1
   88   CLOSE RESERVATIONS-FILE
  115   PERFORM READ-TRANSACTION
  135   READ TRANSACTION-FILE
            AT END MOVE ’1’ TO RP-END-OF-TRANS
  137   MOVE ZERO TO RP-INVALID-RECORD
  138   PERFORM EDIT
  139   IF RP-INVALID-RECORD = ’1’
  140   MOVE SPACES TO PRINT-RECORD
  141   MOVE ’INVALID TRANSACTION’ TO PRT-MESSAGE
  142   MOVE TRANSACTION-RECORD TO PRT-REC
  143   WRITE PRINT-RECORD
  156   IF TR-FLIGHT-NUMBER IS NOT NUMERIC
  157   MOVE ’1’ TO RP-INVALID-RECORD

The user can also ask more general queries about an operation or class. In particular, when the user selects the operation/class and invokes the code-browser’s Query menu, the system displays all of the source code involved in that class or operation. DECODE also provides a selection box that allows the user to specify whether the user is interested in just this class or operation or in all subclasses and suboperations. For example, specifying interest in just the File class results in nothing of interest for our example program, but selecting the File class and all of its subclasses results in much of the program’s data division being hi[ghlighted]. This allows the user to quickly determine which parts of the program are involved with different high-level conceptual classes and operations.
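
The "this class only" versus "all subclasses" choice amounts to collecting the class's descendants before gathering linked lines. In the sketch below the hierarchy table, the function names, and the line numbers are assumptions; in particular, placing RecordFile directly under File is a guess made for the example.

```python
def with_subclasses(subclasses, cls):
    """Return cls plus every class below it in the hierarchy."""
    found, pending = set(), [cls]
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(subclasses.get(current, ()))
    return found

def lines_for_class(links, subclasses, cls, include_subclasses):
    """Collect the source lines linked to cls, optionally with its subclasses."""
    classes = with_subclasses(subclasses, cls) if include_subclasses else {cls}
    lines = set()
    for c in classes:
        lines |= links.get(c, set())
    return sorted(lines)

# Hypothetical hierarchy and links.
subclasses = {
    "File": ["RecordFile"],
    "RecordFile": ["ReservationsFile", "TransactionFile"],
}
links = {"ReservationsFile": {20, 21}, "TransactionFile": {30}}

print(lines_for_class(links, subclasses, "File", include_subclasses=False))  # []
print(lines_for_class(links, subclasses, "File", include_subclasses=True))   # [20, 21, 30]
```

As in the File example above, the class on its own yields nothing, while including its subclasses pulls in the lines linked further down the hierarchy.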

Whenever the user performs a query, the user is given the option to see the inversion of the query’s result, namely all code that is not involved in the selected class or operation. This allows the user to determine which parts of the program are not relevant to a given class or operation. Along the same lines, the user can use the code-browser’s Unknown menu to display all lines that have not been understood, that is, that have not been linked to any part of the object-oriented design. Likewise, the user can use this menu to ask DECODE to highlight any unconnected design elements. The Unknown queries help guide the user’s attention to those parts of the source code that are not yet understood and to those parts of the design that have not been found implemented in the source code.

In addition to displaying the results of queries by highlighting source code lines and design elements, DECODE also provides users with an Extract menu they can use to copy any highlighted lines into a separate window. This allows users to easily peruse lines that might be collected from a variety of locations in the original source code. The user can also save the extracted code into a file, which facilitates using existing code as the beginnings of a reuse library or an object-oriented reimplementation. These conceptual queries are particularly useful s[ince the] code that makes up a particular design element can be widely separated (or even located within separate fil[es]). For example, Figure 6 shows the result of extracting pieces of a COBOL progr[am ...] [imple]mentation of the conceptual operation [...]. These pieces are not contiguous, but [...] captures the un[...] this superficially unrelated set of statements, essentially forming a conceptual “slice” of the program. In this case, this conceptual slice captures the key portions of the program involved with validating input records.

All of the queries supported by DECODE are useful in understanding legacy software. It is only possible to answer them because DECODE explicitly represents the links between the conceptual design and the actual implementation and provides the programmer with a mechanism for recording these links. Given these links, however, all of these queries can be answ[ered qui]te efficiently, essentially with simple graph-travers[al ...]

5 Related Work

Our work is very closely related to COBOL/SRE [14]. That system allows its users to create segments by highlighting arbitrary sections of code. These segments can then be combined using set-like insertion, union, and difference operations, extracted from code into separate files, and stored along with comments describing their function. A major difference in DECODE is that rather than having users provide English “comments” that describe what segments do, DECODE forces users to associate segments with operations in a design hierarchy. While this requires more work on the part of the user, it allows DECODE to answer conceptual queries about segments. Another key difference is that although DECODE’s automated program understanding component is based on COBOL/SRE’s concept recognizer, it is designed to explicitly worry about scaling issues and extended to automatically generate object-oriented descriptions of code segments.

Figure 5: An example of the user requesting a complete description of the ValidateRecords operation.

Another closely related system is LaSSIE [5], which uses a classification hierarchy to describe the tasks ac[complished] by the modules that make up a large phone-[...] software system. It answers queries from pro[grammers] trying to locate modules implementing particular conceptual functions. This is a considerably more powerful query-answering capability than DECODE’s (since objects can be located by providing arbitrary constraints on types and attributes), but has the drawback that it requires a knowledge engineer to create the knowledge base, an approach that relies on pre-existing documentation often unavailable in legacy systems. Its other drawback is that it has knowledge only at the module level, so programmers cannot ask detailed queries about module components, such as what conceptual operations a piece of code implements. These types of detailed questions, however, are frequently asked by programmers trying to understand legacy software systems because they are essential for any modification, reimplementation, or porting of the detailed source code. DECODE can be viewed as a compromise: it has a much simpler knowledge base about the program, but it is at a lower level of detail and it is easier for programmers to construct. Finally, LaSSIE made no attempt to extract conceptual information automatically, where DECODE explicitly tries to integrate automated program understanding techniques with assisted ones.

There have been many other research efforts devoted to exploring methods for constructing knowledge bases describing existing software systems. However, most of this effort has gone into automatically forming structural knowledge bases (such as [20, 3, 22, 10]) and making this knowledge easy to access [18, 1]. These knowledge bases consist of information about a program’s structure, such as data flow diagrams and call graphs. While structural knowledge is often straightforward to extract automatically, its usefulness is limited to answering queries about structural relationships. Our focus is instead on extracting conceptual knowledge about the program that can be used to answer conceptual queries involving design relationships. This knowledge cannot always be extracted automatically, which forces us to consider how the programmer and system can work together to extract it.

6 Evaluation and Future Work

We have used DECODE to extract design information from several textbook COBOL programs. Doing so has highlighted several strengths and weaknesses of our approach and has suggested several areas for future exploration.

6.1 System Strengths

On the positive side, DECODE’s capabilities for plan and design element recognition are proving to be helpful. One clear win occurs when it manages to recognize delocalized plans or design elements that are implemented by a hierarchy of plans. Even recognizing only a few high-level design elements, such as ValidateInput or CopyFile, gives the programmer a starting point for extracting domain-specific design elements, such as understanding exactly what the input validation conditions are or what type of transactions are being copied. The other clear win is that, once design elements have been jointly extracted, the explicit linkages between design elements and the code do aid in understanding the relationships between different parts of the source code, relationships that often aren’t apparent when a user quickly scans the source code.

6.2 System Weaknesses

On the negative side, it is proving very difficult for those other than the system’s designers to formulate and enter plans. In addition, the system currently provides little support for programmers who want to perform top-down understanding of a program. Finally, programmers expect DECODE to be considerably more cooperative than it currently is.

Support for Programmer-Provided Code Patterns: We are currently exploring letting users provide code patterns by example. The idea is to let them select a program section that corresponds to an instance of a particular design element. This code section can be considered an overly constrained code pattern (one that will only recognize this one instance), and the system can then provide a list of all the specific constraints involved in the example, which the user can edit to create a final code pattern. We are also exploring how to let users extend an existing code pattern to make it fit a selected section of source code. The idea is to have the APU provide users with those code patterns that were indexed by elements of that section of code (even if they do not match it exactly), ranking those patterns in order of similarity. The user can then use these code patterns as templates for generating new code patterns, or edit their constraints until they match the current source code. Feedback from applying the newly modified code pattern to actual source code can be used to fine-tune the code pattern.
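The ranking step can be sketched as follows. This is a hypothetical illustration rather than the APU's actual indexing scheme: it assumes a code pattern is represented as a set of named constraints, treats the selected code section as an overly constrained pattern, and uses Jaccard overlap as the similarity measure (the paper does not say which measure is used). All constraint names are invented for the example.

```python
def jaccard(a, b):
    """Similarity of two constraint sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_patterns(example_constraints, library):
    """Return (name, score) pairs for every library pattern that shares at
    least one constraint with the example, most similar first."""
    scored = [
        (name, jaccard(example_constraints, constraints))
        for name, constraints in library.items()
        if example_constraints & constraints
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical pattern library keyed by design element name.
library = {
    "ValidateInput": frozenset(
        {"read-record", "loop-until-eof", "test-field", "report-error"}
    ),
    "CopyFile": frozenset({"read-record", "loop-until-eof", "write-record"}),
    "SumField": frozenset({"accumulate", "init-zero"}),
}

# Constraints extracted from the user's selected section (also hypothetical).
example = frozenset(
    {"read-record", "loop-until-eof", "test-field", "write-record"}
)
ranking = rank_patterns(example, library)
```

Here `ranking` places `CopyFile` ahead of `ValidateInput`, and `SumField` is not offered at all because it shares no constraint with the selection. The user would then edit the top-ranked pattern's constraints until it matches.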

Support for Top-Down Design Extraction: We are currently trying to extend DECODE to provide support for top-down design extraction by providing pre-existing design libraries that capture design elements common to specific classes of applications. The simplest approach is to allow users to merge existing collections of design elements into their current design (by providing the name of a previously saved file containing the desired design elements). However, we are also exploring how to extend DECODE’s plan indexing approach to have certain design elements index design libraries, so that when the APU recognizes a design element, it can automatically suggest or include certain design libraries. Along the same lines, we are exploring sketchy recognition of design elements (as in [2]), where we don’t verify the design element’s existence but merely suggest that it is likely to be present, and leave it up to the programmer to verify its presence.

Additional System Cooperation: Programmers expect more cooperation from DECODE than it currently provides. For example, when the user highlights a PERFORM as part of a recognized design element, they expect the system to automatically highlight the PERFORMed statements and include them as part of the design element’s implementation. Similarly, when the user extracts the portion of the source code that implements a particular design element, they expect DECODE to automatically extract the relevant data declarations. Users also expect that the extracted statements will be provided in the order in which they will be executed, not just in line number order as they are presently provided. We are currently providing DECODE with the detailed language knowledge necessary to provide the expected level of cooperation.
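The first and third expectations (following a PERFORM to the statements it invokes, and listing statements in execution order rather than line order) can be illustrated with a small Python sketch. It rests on strong simplifying assumptions that are not from the paper: the program is already parsed into a mapping from paragraph name to `(line, text)` pairs, only the simple `PERFORM paragraph-name` form is handled (no THRU ranges), and conditions and loop iteration counts are ignored.

```python
def expand_perform(paragraphs, start, _active=None):
    """Return (line, text) pairs in the order they would be reached when
    executing paragraph `start`, following simple PERFORM statements."""
    active = _active if _active is not None else set()
    if start in active:  # guard against recursive PERFORM chains
        return []
    active = active | {start}
    ordered = []
    for line, text in paragraphs[start]:
        ordered.append((line, text))
        words = text.rstrip(".").split()
        is_perform = len(words) >= 2 and words[0].upper() == "PERFORM"
        if is_perform and words[1] in paragraphs:
            # Include the PERFORMed paragraph right after the PERFORM itself.
            ordered.extend(expand_perform(paragraphs, words[1], active))
    return ordered


# Hypothetical fragment of a parsed COBOL program.
paragraphs = {
    "MAIN": [
        (10, "PERFORM READ-REC."),
        (11, "PERFORM CHECK-REC."),
        (12, "STOP RUN."),
    ],
    "CHECK-REC": [(20, "IF AMOUNT < 0 MOVE 'Y' TO ERR-FLAG.")],
    "READ-REC": [(30, "READ TRANS-FILE AT END MOVE 'Y' TO EOF-FLAG.")],
}
execution_order = [line for line, _ in expand_perform(paragraphs, "MAIN")]
```

In line-number order the statements would appear as 10, 11, 12, 20, 30. The expansion instead interleaves each PERFORMed paragraph after the statement that invokes it.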

Perhaps more importantly, programmers often want to highlight large chunks of related but delocalized code as implementing a particular high-level object-oriented design element. To do so now, they must carefully browse the program and determine exactly which lines are relevant to that element. A better alternative would be to allow programmers to see various types of program slices involving highlighted lines, and then indicate which parts of that slice belong to that particular design element.
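As a rough illustration of the slice-then-select workflow, the Python sketch below computes a backward slice as the transitive closure of a dependence relation and then lets the user keep only part of it. The dependence map is assumed to be given (computing it from COBOL source is the hard part and is not shown), and the line numbers are invented.

```python
def backward_slice(depends_on, criterion):
    """All lines that the criterion lines transitively depend on,
    including the criterion lines themselves, in line order."""
    seen = set()
    work = list(criterion)
    while work:
        line = work.pop()
        if line in seen:
            continue
        seen.add(line)
        work.extend(depends_on.get(line, ()))
    return sorted(seen)


# Hypothetical dependences: line -> lines whose values or control it depends on.
depends_on = {40: {12, 30}, 30: {12}, 55: {40}, 60: {5}}

candidate_lines = backward_slice(depends_on, [55])

# The user then confirms which candidate lines belong to the design element.
rejected_by_user = {12}
element_lines = [n for n in candidate_lines if n not in rejected_by_user]
```

The slice narrows the program to a candidate set, and the programmer's judgment decides which of those lines actually implement the design element.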

6.3 Other Future Directions

Besides the drawbacks that have already shown up with using DECODE to understand textbook COBOL programs, there are a variety of areas where the system will need work in the future. These include the completeness of the extracted design and support for dealing with multiple programmers extracting design information.

Support for More Complete Design Extraction: The designs we have extracted from several existing programs are only a fraction of the designs needed to reimplement those programs in another language. That’s because DECODE now only extracts and records classes and operations. However, a more complete object-oriented design also includes other elements such as explicit state transition diagrams that show the order in which operations occur, the conditions under which they occur, and so on [17]. These high-level state diagrams represent the program’s control flow at the conceptual level rather than at the statement level.

We are currently working on augmenting DECODE with mechanisms for recognizing and recording these object-oriented state transition diagrams. This will involve extending our design-editor to allow users to construct state diagrams and link them to recognized design elements, as well as modifying the system so that as users traverse these state diagrams, the system will automatically highlight the corresponding program source code. Doing so will allow the user to see the relationship between conceptual states and particular statements in the legacy system. For example, the user will be able to obtain answers to questions such as “What happens before (or after) the FilterCopy operation is applied to transaction-file?” Along similar lines, we are also attempting to extend our automatic program understanding component to recognize and record conceptual control-flow information (i.e., states and state transitions). Our current plan is to use the abstract control-flow information we maintain about recognized plans to determine the flow relationships between the operations these plans implement.
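A minimal Python sketch of such a diagram, under the assumption that conceptual states are simply named operations linked to source lines, shows how "before" and "after" queries and source highlighting could be served from one structure. The `StateDiagram` class, the operation names other than those mentioned in the text, and the line numbers are all hypothetical.

```python
from collections import defaultdict


class StateDiagram:
    """Conceptual control flow: operations as states, directed transitions."""

    def __init__(self):
        self._after = defaultdict(set)
        self._before = defaultdict(set)
        self._lines = {}

    def add_operation(self, name, lines):
        """Record an operation and the source lines that implement it."""
        self._lines[name] = sorted(lines)

    def add_transition(self, src, dst):
        """Record that operation `dst` can directly follow operation `src`."""
        self._after[src].add(dst)
        self._before[dst].add(src)

    def before(self, name):
        return sorted(self._before.get(name, ()))

    def after(self, name):
        return sorted(self._after.get(name, ()))

    def lines_for(self, name):
        """Lines to highlight when the user visits this state."""
        return self._lines[name]


diagram = StateDiagram()
diagram.add_operation("ValidateInput", {120, 121, 125})
diagram.add_operation("FilterCopy", {140, 141, 150})
diagram.add_operation("CloseFiles", {200})
diagram.add_transition("ValidateInput", "FilterCopy")
diagram.add_transition("FilterCopy", "CloseFiles")
```

With this data, `diagram.before("FilterCopy")` answers the "what happens before" question and `diagram.lines_for("FilterCopy")` gives the lines to highlight as the user traverses the diagram.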

Support for Team-Oriented Design Extraction: Finally, DECODE currently only supports a single programmer extracting the design for a single program. Unfortunately, the job of extracting the design of complex real-world legacy software systems is quite likely to require large teams of programmers perusing multiple modules. As a result, we are planning several extensions to DECODE to better support team-oriented design extraction.

One extension is to provide automatic notification of team members whenever one team member recognizes objects or operations that are relevant to another’s task. This occurs, for example, when one user recognizes an object or operation that is implemented by statements that are also part of objects or operations another user has recognized. We are addressing this issue by adding a standard centralized database for storing recorded designs and adding a mechanism for notifying users about newly recognized objects.

Another extension is to provide support for locating specific object classes and operations. This is necessary because when dealing with large software systems, the size of the object and operation design library becomes unwieldy. It can be difficult to find specific object classes among the large numbers of different classes. This arises, for example, because of people choosing different names for the same conceptual object or different organizations for the same conceptual hierarchy. We are planning to use simple techniques, such as synonym detection and keyword-based retrieval, to help users locate existing classes and operations.
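One simple way to combine synonym detection with keyword-based retrieval is sketched below in Python. It is only an illustration of the idea: the synonym table, the name-splitting rule, and the example names are assumptions, and a real design library would need a much richer vocabulary.

```python
import re

# Hypothetical synonym table mapping variant keywords to a canonical form.
SYNONYMS = {
    "customer": "client",
    "acct": "account",
    "verify": "validate",
    "check": "validate",
}


def keywords(name):
    """Split a CamelCase or hyphenated name into canonical lowercase keywords."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return {SYNONYMS.get(p.lower(), p.lower()) for p in parts}


def find(query, names):
    """Return names sharing at least one keyword with the query,
    those sharing the most keywords first (ties broken alphabetically)."""
    wanted = keywords(query)
    hits = [(len(wanted & keywords(n)), n) for n in names]
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    return [n for score, n in ranked if score > 0]


design_library = ["ValidateRecords", "CHECK-ACCT", "CopyFile", "ClientAccount"]
matches = find("VerifyCustomerAcct", design_library)
```

Because every name is reduced to canonical keywords before comparison, a query phrased with one team member's vocabulary can still locate classes and operations named by another.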

7 Conclusions

A completely automated, pattern-based program understanding system is an ideal that is likely to be impossible to attain. That makes it crucial to begin blending automated and assisted program understanding mechanisms. This paper has presented DECODE, a prototype environment intended to support programmers who are understanding and extracting designs from large, real-world legacy systems. DECODE includes an automatic program recognition component that explicitly addresses scaling issues, a structured notebook for recording system and programmer observations about a program’s design and its relationship to code, and a simple but powerful menu-based query mechanism.

DECODE is only a prototype (implemented in a combination of Common Lisp, C++, and X-Windows/Motif), its current code patterns are targeted to automatic extraction of domain-independent design information, and it supports recording only a limited amount of design information. Despite these limitations, however, it still proves surprisingly useful. The code patterns easily detect common constructs, such as input-validation loops, despite their being distributed over many different lines of code. Even this minimal amount of automated understanding eases the user’s task in figuring out what a program does and how it works. In addition, DECODE’s simple capabilities for visually recording user-extracted design information and linking it to code allow it to answer a wide range of conceptual queries, and this conceptual knowledge is crucial to being able to successfully modify, extend, or translate a legacy system.

The key idea in DECODE is that the system and programmer work together to understand the software. This takes advantage of the ability of automated program understanders to recognize many common low-level design elements in the code. It also takes advantage of the ability of programmers to impose a high-level design framework on a piece of software and then locate where the elements of that framework are implemented. DECODE attempts to combine both to speed the process of completely understanding a piece of legacy software. Our hope is that this cooperative approach will significantly reduce the time and effort it takes for programmers to perform maintenance and program understanding tasks.

References

[1] Ball, T.; and Eick, S. G. 1994. Visualizing Program Slices. In Proceedings of the 10th Annual Symposium on Visual Languages, St. Louis, MO, pp. 288-295.

[2] Biggerstaff, T. J.; Mitbander, B. G.; and Webster, D. E. 1994. Program understanding and the concept assignment problem. Communications of the ACM 37, 5 (May 1994), 72-82.

[3] Chen, Y. F.; Nishimoto, M. Y.; and Ramamoorthy, C. V. 1990. The C Information Abstraction System. IEEE Transactions on Software Engineering (March 1990).

[4] Detienne, F.; and Soloway, E. 1990. An Empirically-Derived Control Structure for the Process of Program Understanding. International Journal of Man-Machine Studies 33, 3 (1990), 323-342.

[5] Devanbu, P.; Brachman, R. J.; Selfridge, P. G.; and Ballard, B. W. 1991. LaSSIE: a knowledge-based software information system. Communications of the ACM 34, 5 (May 1991), 34-49.

[6] Hartman, J. 1991. Understanding natural programs using proper decomposition. In Proceedings of the International Conference on Software Engineering, TX, pp. 62-73.

[7] Johnson, W. L. 1986. Intention-Based Diagnosis of Novice Programming Errors. Morgan Kaufmann, Los Altos CA.

[8] Kozaczynski, W.; and Ning, J. Q. 1994. Automated Program Understanding By Concept Recognition. Automated Software Engineering 1, 1 (March 1994), 61-78.

[9] Kozaczynski, W.; Ning, J. Q.; and Engberts, A. 1992. Program concept recognition and transformation. IEEE Transactions on Software Engineering 18, 12 (December 1992), 1065-1075.

[10] Kuhn, D. R. 1987. A Source Code Analyzer for Maintenance. In Proceedings of the 1987 IEEE Conference on Software Maintenance, pp. 176-180.

[11] Letovsky, S. 1988. Plan Analysis of Programs. Ph.D. Thesis, Yale University, New Haven CT.

[12] Letovsky, S.; and Soloway, E. 1986. Delocalized plans and program comprehension. IEEE Software 3.

[13] Littman, D. C.; Pinto, J.; Letovsky, S.; and Soloway, E. 1986. Mental models and software maintenance. In Empirical Studies of Programmers, Soloway, E.; and Iyengar, S. (editors), Ablex, Norwood NJ.

[14] Ning, J. Q.; Engberts, A.; and Kozaczynski, W. 1993. Recovering Reusable Components from Legacy Systems by Program Segmentation. In Proceedings of the Working Conference on Reverse Engineering, Baltimore, May 1993, pp. 64-72.

[15] Quilici, A. 1994. A memory-based approach to recognizing programming plans. Communications of the ACM 37, 5 (May 1994), 84-93.

[16] Rich, C.; and Waters, R. C. 1990. The Programmer’s Apprentice. Addison Wesley, Reading MA.

[17] Rumbaugh, J.; Blaha, M.; Premerlani, W.; Eddy, F.; and Lorensen, W. 1991. Object-Oriented Modeling and Design. Prentice Hall, Englewood Cliffs NJ.

[18] Selfridge, P.; and Heineman, G. 1994. Graphical Support for Code-Level Software Understanding. In Proceedings of the Ninth Knowledge-Based Software Engineering Conference, Monterey CA, pp. 117-123.

[19] Soloway, E.; and Ehrlich, K. 1984. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering 10, 5 (1984), 595-609.

[20] Takeda, K.; Chin, D. N.; and Miyamoto, I. 1992. Meta language for software engineering. In Proceedings of the 4th International SEKE Conference, Capri, Italy, pp. 495-502.

[21] Weiser, M. 1984. Program Slicing. IEEE Transactions on Software Engineering 10, 4 (July 1984), 352-357.

[22] Wilde, N.; Huitt, R.; and Huitt, S. 1989. Dependency Analysis Tools: Reusable Components for Software Maintenance. In Proceedings of the 1989 Conference on Software Maintenance, pp. 126-127.

[23] Wills, L. M. 1992. Automated Program Recognition by Graph Parsing. Ph.D. Thesis, Technical Report 1358, MIT Artificial Intelligence Lab, Cambridge MA.

[24] Wills, L. M. 1990. Automated program recognition: a feasibility demonstration. Artificial Intelligence 45, 1-2 (1990), 113-172.
