
REVERSE ENGINEERING

Edited by

LINDA WILLS PHILIP NEWCOMB

KLUWER ACADEMIC PUBLISHERS

REVERSE ENGINEERING

edited by

Linda Wills Georgia Institute of Technology

Philip Newcomb The Software Revolution, Inc.

A Special Issue of AUTOMATED SOFTWARE ENGINEERING

An International Journal, Volume 3, Nos. 1/2 (1996)

KLUWER ACADEMIC PUBLISHERS Boston / Dordrecht / London

AUTOMATED SOFTWARE ENGINEERING

An International Journal

Volume 3, Nos. 1/2, June 1996

Special Issue: Reverse Engineering

Guest Editors: Linda Wills and Philip Newcomb

Preface Lewis Johnson 5

Introduction Linda Wills 7

Database Reverse Engineering: From Requirements to CARE Tools J.-L. Hainaut, V. Englebert, J. Henrard, J.-M. Hick and D. Roland 9

Understanding Interleaved Code Spencer Rugaber, Kurt Stirewalt and Linda M. Wills 47

Pattern Matching for Clone and Concept Detection K.A. Kontogiannis, R. DeMori, E. Merlo, M. Galler and M. Bernstein 77

Extracting Architectural Features from Source Code David R. Harris, Alexander S. Yeh and Howard B. Reubenstein 109

Strongest Postcondition Semantics and the Formal Basis for Reverse Engineering Gerald C. Gannod and Betty H.C. Cheng 139

Recent Trends and Open Issues in Reverse Engineering Linda M. Wills and James H. Cross II 165

Desert Island Column John Dobson 173

Distributors for North America: Kluwer Academic Publishers 101 Philip Drive Assinippi Park Norwell, Massachusetts 02061 USA

Distributors for all other countries: Kluwer Academic Publishers Group Distribution Centre Post Office Box 322 3300 AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 1996 by Kluwer Academic Publishers

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061

Printed on acid-free paper.

Printed in the United States of America

Automated Software Engineering 3, 5 (1996) © 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Preface

This issue of Automated Software Engineering is devoted primarily to the topic of reverse engineering. This is a timely topic: many organizations must devote increasing resources to the maintenance of outdated, so-called "legacy" systems. As these systems grow older, and changing demands are made on them, they constitute an increasing risk of catastrophic failure. For example, it is anticipated that on January 1, 2000, there will be an avalanche of computer errors from systems that were not designed to handle dates larger than 1999.

In software engineering the term "legacy" has a negative connotation, meaning old and decrepit. As Leon Osterweil has observed, the challenge of research in reverse engineering and software understanding is to give the term the positive connotation that it deserves. Legacy systems ought to be viewed as a valuable resource, capturing algorithms and business rules that can be reused in future software systems. They are often an important cultural heritage for an organization, embodying the organization's collective knowledge and expertise. But in order to unlock and preserve the value of legacy systems, we need tools that can help extract useful information, and renovate the code so that it can continue to be maintained. Thus automated software engineering plays a critical role in this endeavor.

Last year's Working Conference on Reverse Engineering (WCRE) attracted a number of excellent papers. Philip Newcomb and Linda Wills, the program co-chairs of the conference, and I decided that many of these could be readily adapted into journal articles, and so we decided that a special issue should be devoted to reverse engineering. By the time we were done, there were more papers than could be easily accommodated in a single issue, and so we decided to publish the papers as a double issue, along with a Desert Island Column that was due for publication. Even so, we were not able to include all of the papers that we hoped to publish at this time, and expect to include some additional reverse engineering papers in future issues.

I would like to express my sincere thanks to Drs. Newcomb and Wills for organizing this special issue. Their tireless efforts were essential to making this project a success.

A note of clarification is in order regarding the review process for this issue. When Philip and Linda polled the WCRE program committee to determine which papers they thought deserved consideration for this issue, they found that their own papers were among the papers receiving highest marks. This was a gratifying outcome, but also a cause for concern, as it might appear to the readership that they had a conflict of interest. After reviewing the papers myself, I concurred with the WCRE program committee; these papers constituted an important contribution to the topic of reverse engineering, and should not be overlooked. In order to eliminate the conflict of interest, it was decided that these papers would be handled through the regular Automated Software Engineering admissions process, and be published when they reach completion. One of these papers, by Rugaber, Stirewalt, and Wills, is now ready for publication, and I am pleased to recommend it for inclusion in this special issue. We look to publish additional papers in forthcoming issues.

W.L. Johnson

Automated Software Engineering 3, 7-8 (1996) © 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Introduction to the Special Double Issue on Reverse Engineering LINDA M. WILLS Georgia Institute of Technology

A central activity in software engineering is comprehending existing software artifacts. Whether the task is to maintain, test, migrate, or upgrade a legacy system or reuse software components in the development of new systems, the software engineer must be able to recover information about existing software. Relevant information includes: What are its components and how do they interact and compose? What is their functionality? How are certain requirements met? What design decisions were made in the construction of the software? How do features of the software relate to concepts in the application domain? Reverse engineering involves examining and analyzing software systems to help answer questions like these. Research in this field focuses on developing tools for assisting and automating portions of this process and representations for capturing and managing the information extracted.

Researchers actively working on these problems in academia and industry met at the Working Conference on Reverse Engineering (WCRE), held in Toronto, Ontario, in July 1995. This issue of Automated Software Engineering features extended versions of select papers presented at the Working Conference. They are representative of key technological trends in the field.

As with any complex problem, being able to provide a well-defined characterization of the problem's scope and underlying issues is a crucial step toward solving it. The Hainaut et al. and Rugaber et al. papers both do this for problems that have thus far been ill-defined and attacked only in limited ways. Hainaut et al. deal with the problem of recovering logical and conceptual data models from database applications. Rugaber et al. characterize the difficult problem of unraveling code that consists of several interleaved strands of computation. Both papers draw together work on several related, but seemingly independent problems, providing a framework for solving them in a unified way.

While Rugaber et al. deal with the problem of interleaving, which often arises due to structure-sharing optimizations, Kontogiannis et al. focus on the complementary problem of code duplication. This occurs as programs evolve and code segments are reused by simply duplicating them where they are needed, rather than factoring out the common structure into a single, generalized function. Kontogiannis et al. describe a collection of new pattern matching techniques for detecting pairs of code "clones" as well as for recognizing abstract programming concepts.

The recognition of meaningful patterns in software is a widely-used technique in reverse engineering. Currently, there is a trend toward flexible, interactive recognition paradigms, which give the user explicit control, for example, in selecting the type of recognizers to use and the degree of dissimilarity to tolerate in partial matches. This trend can be seen in the Kontogiannis et al. and Harris et al. papers.

Harris et al. focus on recognition of high-level, architectural features in code, using a library of individual recognizers. This work not only attacks the important problem of architectural recovery, it also contributes to more generic recognition issues, such as library organization and analyst-controlled retrieval, interoperability between recognizers, recognition process optimization, and recognition coverage metrics.

Another trend in reverse engineering is toward increased use of formal methods. A representative paper by Gannod and Cheng describes a formal approach to extracting specifications from imperative programs. They advocate the use of strongest postcondition semantics as a formal model that is more appropriate for reverse engineering than the more familiar weakest precondition semantics. The use of formal methods introduces more rigor and clarity into the reverse engineering process, making the techniques more easily automated and validated.
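As background for readers unfamiliar with the two semantics (the standard definitions for a simple assignment statement are recalled here; they are not quoted from the paper), the formulas suggest why the strongest postcondition suits the reverse direction: it maps existing code and a known precondition forward to a derived specification, whereas the weakest precondition reasons backward from a desired postcondition:

    wp(x := e, Q) = Q[e/x]
    sp(P, x := e) = \exists x_0 . (P[x_0/x] \wedge x = e[x_0/x])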

The papers featured here together provide a richly detailed perspective on the state of the field of reverse engineering. A more general overview of the trends and challenges of the field is provided in the summary article by Wills and Cross.

The papers in this issue are extensively revised and expanded versions of papers that originally appeared in the proceedings of the Working Conference on Reverse Engineering. We would like to thank the authors and reviewers of these papers, as well as the reviewers of the original WCRE papers, for their diligent efforts in creating high-quality presentations of this research.

Finally, we would like to acknowledge the general chair of WCRE, Elliot Chikofsky, whose vision and creativity have provided a forum for researchers to share ideas and work together in a friendly, productive environment.

Automated Software Engineering 3, 9-45 (1996) © 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Database Reverse Engineering: From Requirements to CARE Tools* J.-L. HAINAUT jlh@info.fundp.ac.be V. ENGLEBERT, J. HENRARD, J.-M. HICK AND D. ROLAND Institut d'Informatique, University of Namur, rue Grandgagnage, 21-B-5000 Namur

Abstract. This paper analyzes the requirements that CASE tools should meet for effective database reverse engineering (DBRE), and proposes a general architecture for data-centered applications reverse engineering CASE environments. First, the paper describes a generic DBMS-independent DBRE methodology, then it analyzes the main characteristics of DBRE activities in order to collect a set of desirable requirements. Finally, it describes DB-MAIN, an operational CASE tool developed according to these requirements. The main features of this tool that are described in this paper are its unique generic specification model, its repository, its transformation toolkit, its user interface, the text processors, the assistants, the methodological control and its functional extensibility. Finally, the paper describes five real-world projects in which the methodology and the CASE tool were applied.

Keywords: reverse engineering, database engineering, program understanding, methodology, CASE tools

1. Introduction

1.1. The problem and its context

Reverse engineering a piece of software consists, among other things, in recovering or reconstructing its functional and technical specifications, starting mainly from the source text of the programs (IEEE, 1990; Hall, 1992; Wills et al., 1995). Recovering these specifications is generally intended to redocument, convert, restructure, maintain or extend old applications. It is also required when developing a Data Administration function that has to know and record the description of all the information resources of the company.

The problem is particularly complex with old and ill-designed applications. In this case, not only can no decent documentation (if any) be relied on, but the lack of systematic methodologies for designing and maintaining them has led to tricky and obscure code. Therefore, reverse engineering has long been recognized as a complex, painful and prone-to-failure activity, so much so that it is simply not undertaken most of the time, leaving huge amounts of invaluable knowledge buried in the programs, and therefore definitively lost.

*This is a heavily revised and extended version of "Requirements for Information System Reverse Engineering Support" by J.-L. Hainaut, V. Englebert, J. Henrard, J.-M. Hick, D. Roland, which first appeared in the Proceedings of the Second Working Conference on Reverse Engineering, IEEE Computer Society Press, pp. 136-145, July 1995. This paper presents some results of the DB-MAIN project. This project is partially supported by the Région Wallonne, the European Union, and by a consortium comprising ACEC-OSI (Be), ARIANE-II (Be), Banque UCL (Lux), BBL (Be), Centre de recherche public H. Tudor (Lux), CGER (Be), Cockerill-Sambre (Be), CONCIS (Fr), D'Ieteren (Be), DIGITAL, EDF (Fr), EPFL (CH), Groupe S (Be), IBM, OBLOG Software (Port), ORIGIN (Be), Ville de Namur (Be), Winterthur (Be), 3 Suisses (Be). The DB-Process subproject is supported by the Communauté Française de Belgique.

In information systems, or data-oriented applications, i.e., in applications whose central component is a database (or a set of permanent files), the complexity can be broken down by considering that the files or databases can be reverse engineered (almost) independently of the procedural parts.

This proposition to split the problem in this way can be supported by the following arguments:

— the semantic distance between the so-called conceptual specifications and the physical implementation is most often narrower for data than for procedural parts;

— the permanent data structures are generally the most stable part of applications;

— even in very old applications, the semantic structures that underlie the file structures are mainly procedure-independent (though their physical structures are highly procedure-dependent);

— reverse engineering the procedural part of an application is much easier when the semantic structure of the data has been elicited.

Therefore, concentrating on reverse engineering the data components of the application first can be much more efficient than trying to cope with the whole application.

The database community considers that there exist two outstanding levels of description of a database or of a consistent collection of files, materialized into two documents, namely its conceptual schema and its logical schema. The first one is an abstract, technology-independent, description of the data, expressed in terms close to the application domain. Conceptual schemas are expressed in some semantics-representation formalisms such as the ERA, NIAM or OMT models. The logical schema describes these data translated into the data model of a specific data manager, such as a commercial DBMS. A logical schema comprises tables, columns, keys, record types, segment types and the like.

The primary aim of database reverse engineering (DBRE) is to recover possible logical and conceptual schemas for an existing database.

1.2. Two introductory examples

The real scope of database reverse engineering has sometimes been misunderstood, and presented as merely redrawing the data structures of a database into some DBMS-independent formalism. Many early scientific proposals, and most current CASE tools are limited to the translation process illustrated in figure 1. In such situations, some elementary translation rules suffice to produce a tentative conceptual schema.

Unfortunately, most situations are actually far more complex. In figure 2, we describe a very small COBOL fragment from which we intend to extract the semantics underlying the files CF008 and PFOS. By merely analyzing the record structure declarations, as most DBRE CASE tools do at the present time, only schema (a) in figure 2 can be extracted. It obviously brings little information about the meaning of the data.

However, by analyzing the procedural code, the user-program dialogs, and, if needed, the file contents, a more expressive schema can be obtained. For instance, schema (b) can be considered as a refinement of schema (a) resulting from the following reasonings:

create table CUSTOMER (
      CNUM numeric(6) not null,
      CNAME char(24) not null,
      CADDRESS char(48) not null,
      primary key (CNUM))

create table ORDER (
      ONUM char(8) not null,
      CNUM numeric(6) not null,
      ODATE date,
      primary key (ONUM),
      foreign key (CNUM) references CUSTOMER)

[Conceptual schema: CUSTOMER (CNUM, CNAME, CADDRESS; id: CNUM) --0-N-- passes --1-1-- ORDER (ONUM, ODATE; id: ONUM)]

Figure 1. An idealistic view of database reverse engineering.

Environment division.
Input-output section.
File control.
    select CF008 assign to DSK02:P12
        organization is indexed
        record key is K1 of REC-CF008-1.
    select PFOS assign to DSK02:P27
        organization is indexed
        record key is K1 of REC-PFOS-1.

Data division.
File section.
fd CF008.
01 REC-CF008-1.
    02 K1 pic 9(6).
    02 filler pic X(125).
fd PFOS.
01 REC-PFOS-1.
    02 K1.
        03 K11 pic X(9).
        03 filler pic 9(6).
    02 PDATA pic X(180).
    02 PRDATA redefines PDATA.
        03 filler pic 9(4)V99.

Working storage section.
01 IN-COMPANY.
    02 CPY-ID pic 9(6).
    02 C-DATA pic X(125).
01 CPY-DATA.
    02 CNAME pic X(25).
    02 CADDRESS pic X(100).
01 PKEY.
    02 K11 pic X(9).
    02 K12 pic X(6).

Procedure division.
    move zeroes to PKEY.
    display "Enter ID :" with no advancing.
    accept K11 of PKEY.
    move PKEY to K1 of REC-PFOS-1.
    read PFOS key K1
        on invalid key display "Invalid Product ID".
    display PDATA with no advancing.
    read PFOS.
    perform until K11 of K1 > K11 of PKEY
        display "Production : " with no advancing
        display PRDATA with no advancing
        display " tons by " with no advancing
        move K1 of REC-PFOS-1 to PKEY
        move K12 of PKEY to K1 of REC-CF008-1
        read CF008 into IN-COMPANY
            not invalid key move C-DATA to CPY-DATA
                            display CNAME of CPY-DATA
        end-read
        read next PFOS
    end-perform.

[Schema (a), extracted from the record declarations: record types REC-CF008-1 (K1, filler; id: K1) and REC-PFOS-1 (K1 (K11, filler), PDATA[0-1], PRDATA[0-1]; id: K1; excl: PDATA, PRDATA). Schema (b), after refinement: COMPANY (CPY-ID, CNAME, CADDRESS; id: CPY-ID), PRODUCT (PRO-ID, PNAME, CATEGORY; id: PRO-ID) and PRODUCTION (P-ID (PRO-ID, CPY-ID), VOLUME; id: P-ID; ref: P-ID.PRO-ID; ref: P-ID.CPY-ID). Schema (c), conceptual: entity types COMPANY and PRODUCT linked by relationship type PRODUCTION, which carries the attribute VOLUME.]

Figure 2. A more realistic view of database reverse engineering. Merely analyzing the data structure declaration statements yields a poor result (a), while further inspection of the procedural code makes it possible to recover a much more explicit schema (b), which can be expressed as a conceptual schema (c).


— the gross structure of the program suggests that there are two kinds of REC-PFOS-1 records, arranged into ordered sequences, each comprising one type-1 record (whose PDATA field is processed before the loop), followed by an arbitrary sequence of type-2 records (whose PRDATA field is processed in the body of the loop); all the records of such a sequence share the same first 9 characters of the key;

— the processing of type-1 records shows that the K11 part of key K1 is an identifier, the rest of the key acting as pure padding; the user dialog suggests that type-1 records describe products; this record type is called PRODUCT, and its key PRO-ID; visual inspection of the contents of the PFOS file could confirm this hypothesis;

— examining the screen contents when running the program shows that PDATA is made of a product name followed by a product category; this interpretation is given by a typical user of the program; this field can then be considered as the concatenation of a PNAME field and a CATEGORY field;

— the body of the loop processes the sequence of type-2 records depending on the current PRODUCT record; they all share the PRO-ID value of their parent PRODUCT record, so that this 9-digit subfield can be considered as a foreign key to the PRODUCT record;

— the processing of a type-2 record consists in displaying one line made up of constants and field values; the linguistic structure of this line suggests that it informs us about some Production of the current product; the PRDATA value is expressed in tons (most probably a volume), and seems to be produced by some kind of agents described in the file CF008; hence the names PRODUCTION for type-2 record type and VOLUME for the PRDATA field;

— the agent of a production is obtained by using the second part of the key of the PRODUCTION record; therefore, this second part can be considered as a foreign key to the REC-CF008-1 records;

— the name of the field in which the agent record is stored suggests that the latter is a company; hence the name COMPANY for this record type, and CPY-ID for its access key;

— the C-DATA field of the COMPANY record type should match the structure of the CPY-DATA variable, which in turn is decomposed into CNAME and CADDRESS.

Refining the initial schema (a) by these reasonings results in schema (b), and interpreting these technical data structures into a semantic information model (here some variant of the Entity-relationship model) leads to schema (c).

Despite its small size, this exercise emphasizes some common difficulties of database reverse engineering. In particular, it shows that the declarative statements that define file and record structures can prove a poor source of information. The analyst must often rely on the inspection of other aspects of the application, such as the procedural code, the user-program interaction, the program behaviour, the file contents. This example also illustrates the weaknesses of most data managers which, together with some common programming practices that tend to hide important structures, force the programmer to express essential data properties through procedural code. Finally, domain knowledge proves essential to discover and to validate some components of the resulting schema.


1.3. State of the art

Though reverse engineering data structures is still a complex task, it appears that the current state of the art provides us with concepts and techniques powerful enough to make this enterprise more realistic.

The literature proposes systematic approaches for database schema recovery:

— for standard files: (Casanova and Amaral de Sa, 1983; Nilson, 1985; Davis and Arora, 1985; Sabanis and Stevenson, 1992)

— for IMS databases: (Navathe and Awong, 1988; Winans and Davis, 1990; Batini et al., 1992; Fong and Ho, 1994)

— for CODASYL databases: (Batini et al., 1992; Fong and Ho, 1994; Edwards and Munro, 1995)

— for relational databases: (Casanova and Amaral de Sa, 1984; Navathe and Awong, 1988; Davis and Arora, 1988; Johannesson and Kalman, 1990; Markowitz and Makowsky, 1990; Springsteel and Kou, 1990; Fonkam and Gray, 1992; Batini et al., 1992; Premerlani and Blaha, 1993; Chiang et al., 1994; Shoval and Shreiber, 1993; Petit et al., 1994; Andersson, 1994; Signore et al., 1994; Vermeer and Apers, 1995).

Many of these studies, however, appear to be limited in scope, and are generally based on assumptions about the quality and completeness of the source data structures to be reverse engineered that cannot be relied on in many practical situations. For instance, they often suppose that

— all the conceptual specifications have been translated into data structures and constraints (at least until 1993); in particular, constraints that have been procedurally expressed are ignored;

— the translation is rather straightforward (no tricky representations); for instance, a relational schema is often supposed to be in 4NF; Premerlani and Blaha (1993) and Blaha and Premerlani (1995) are among the only proposals that cope with some non trivial representations, or idiosyncrasies, observed in real world applications; let us mention some of them: nullable primary key attributes, almost unique primary keys, denormalized structures, degradation of inheritance, mismatched referential integrity domains, overloaded attributes, contradictory data;

— the schema has not been deeply restructured for performance objectives or for any other requirements; for instance, record fragmentation or merging for disc space or access time minimization will remain undetected and will be propagated to the conceptual schema;

— a complete and up-to-date DDL schema of the data is available;

— names have been chosen rationally (e.g., a foreign key and the referenced primary key have the same name), so that they can be used as reliable definitions of the objects they denote.

In many proposals, it appears that the only databases that can be processed are those that have been obtained by a rigorous database design method. This condition cannot be assumed for most large operational databases, particularly for the oldest ones. Moreover, these proposals are most often dedicated to one data model and do not attempt to elaborate techniques and reasonings common to several models, leaving the question of a general DBRE approach still unanswered.

Since 1992, some authors have recognized that the procedural part of the application programs is an essential source of information on data structures (Joris et al., 1992; Hainaut et al., 1993a; Petit et al., 1994; Andersson, 1994; Signore et al., 1994).

Like any complex process, DBRE cannot be successful without the support of adequate tools called CARE tools. An increasing number of commercial products (claim to) offer DBRE functionalities. Though they ignore many of the most difficult aspects of the problem, those tools provide their users with invaluable help to carry out DBRE more effectively (Rock-Evans, 1990).

In Hainaut et al. (1993a), we proposed the theoretical baselines for a generic, DBMS-independent, DBRE methodology. These baselines have been developed and extended in Hainaut et al. (1993b) and Hainaut et al. (1994). The current paper translates these principles into practical requirements DBRE CARE tools should meet, and presents the main aspects and components of a CASE tool dedicated to database applications engineering, and more specifically to database reverse engineering.

1.4. About this paper

The paper is organized as follows. Section 2 is a synthesis of the main problems which occur in practical DBRE, and of a generic DBMS-independent DBRE methodology. Section 3 discusses some important requirements which should be satisfied by future DBRE CARE tools. Section 4 briefly presents a prototype DBRE CASE tool which is intended to address these requirements. The following sections describe in further detail some of the original principles and components of this CASE tool: the specification model and the repository (Section 5), the transformation toolkit (Section 6), the user interface (Section 7), the text analyzers and name processor (Section 8), the assistants (Section 9), functional extensibility (Section 10) and methodological control (Section 11). Section 12 evaluates to what extent the tool meets the requirements while Section 13 describes some real world applications of the methodology and of the tool.

The reader is assumed to have some basic knowledge of data management and design. Recent references Elmasri and Navathe (1994) and Date (1994) are suggested for data management, while Batini et al. (1992) and Teorey (1994) are recommended for database design.

2. A generic methodology for database reverse engineering (DBRE)

The problems that arise when recovering the documentation of the data naturally fall into two categories that are addressed by the two major processes in DBRE, namely data structure extraction and data structure conceptualization (Joris et al., 1992; Sabanis and Stevenson, 1992; Hainaut et al., 1993a). By naturally, we mean that these problems relate to the recovery of two different schemas, and that they require quite different concepts, reasonings and tools.


[Figure 3 diagram: the DMS-compliant optimized schema is successively processed by SCHEMA PREPARATION, SCHEMA UNTRANSLATION and SCHEMA DE-OPTIMIZATION, yielding a raw conceptual schema, which CONCEPTUAL NORMALIZATION turns into the normalized conceptual schema.]

Figure 3. Main components of the generic DBRE methodology. Quite naturally, this reverse methodology is to be read from right to left, and bottom-up!

In addition, each of these processes grossly appears as the reverse of a standard database design process (resp. physical and logical design (Teorey, 1994; Batini et al., 1992)). We will briefly describe these processes and the problems they try to solve. Let us mention, however, that partitioning the problems in this way is not proposed by many authors, who prefer proceeding in one step only. In addition, other important processes are ignored in this discussion for simplicity (see Joris et al. (1992) for instance).

This methodology is generic in two ways. First, its architecture and its processes are largely DMS-independent. Secondly, it specifies what problems have to be solved, and in which way, rather than the order in which the actions must be carried out. Its general architecture, as developed in Hainaut et al. (1993a), is outlined in figure 3.

2.1. Data structure extraction

This phase consists in recovering the complete DMS schema, including all the implicit and explicit structures and constraints. True database systems generally supply, in some readable and processable form, a description of this schema (data dictionary contents, DDL texts, etc.). Though essential information may be missing from this schema, the latter is a rich starting point that can be refined through further analysis of the other components of the application (views, subschemas, screen and report layouts, procedures, fragments of documentation, database content, program execution, etc.).

The problem is much more complex for standard files, for which no computerized description of their structure exists in most cases. The analysis of each source program provides a partial view of the file and record structures only. For most real-world (i.e., non academic) applications, this analysis must go well beyond the mere detection of the record structures declared in the programs. In particular, three problems are encountered that derive from frequent design practices, namely structure hiding, non declarative structures and lost specifications. Unfortunately, these practices are also common in (true) databases, i.e., those controlled by DBMS, as illustrated by Premerlani and Blaha (1993) and Blaha and Premerlani (1995) for relational databases.

Structure hiding applies to a source data structure or constraint S1, which could be implemented in the DMS. It consists in declaring it as another data structure S2 that is more general and less expressive than S1, but that satisfies other requirements such as field reusability, genericity, program conciseness, simplicity or efficiency. Let us mention some examples: a compound/multivalued field in a record type is declared as a single-valued atomic field, a sequence of contiguous fields is merged into a single anonymous field (e.g., as an unnamed COBOL field), a one-to-many relationship type is implemented as a many-to-many link, a referential constraint is not explicitly declared as a foreign key, but is procedurally checked, a relationship type is represented by a foreign key (e.g., in IMS and CODASYL databases).

Non declarative structures are structures or constraints which cannot be declared in the target DMS, and therefore are represented and checked by other means, such as procedural sections of the application. Most often, the checking sections are not centralized, but are distributed and duplicated (frequently in different versions), throughout the application programs. Referential integrity in standard files and one-to-one relationship types in CODASYL databases are some examples.

Lost specifications are constructs of the conceptual schema that have not been implemented in the DMS data structures nor in the application programs. This does not mean that the data themselves do not satisfy the lost constraint, but the trace of its enforcement cannot be found in the declared data structures nor in the application programs. Let us mention popular examples: uniqueness constraints on sequential files and secondary keys in IMS and CODASYL databases.

Recovering hidden, non-declarative and lost specifications is a complex problem, for which no deterministic methods exist so far. A careful analysis of the procedural statements of the programs, of the dataflow through local variables and files, of the file contents, of program inputs and outputs, of the user interfaces, of the organizational rules, can accumulate evidence for these specifications. Most often, that evidence must be consolidated by the domain knowledge.

Until very recently, these problems have not triggered much interest in the literature. The first proposals address the recovery of integrity constraints (mainly referential and inclusion) in relational databases through the analysis of SQL queries (Petit et al., 1994; Andersson, 1994; Signore et al., 1994). In our generic methodology, the main processes of DATA STRUCTURE EXTRACTION are the following:

• DMS-DDL text ANALYSIS. This rather straightforward process consists in analyzing the data structures declaration statements (in the specific DDL) included in the schema scripts and application programs. It produces a first-cut logical schema.

• PROGRAM ANALYSIS. This process is much more complex. It consists in analyzing the other parts of the application programs, a.o., the procedural sections, in order to detect evidence of additional data structures and integrity constraints. The first-cut schema can therefore be refined following the detection of hidden, non declarative structures.

• DATA ANALYSIS. This refinement process examines the contents of the files and databases in order (1) to detect data structures and properties (e.g., to find the unique fields or the functional dependencies in a file), and (2) to test hypotheses (e.g., "could this field be a foreign key to this file?"). Hidden, non declarative and lost structures can be found in this way; a small illustrative check is sketched after this list.

• SCHEMA INTEGRATION. When more than one information source has been pro­cessed, the analyst is provided with several, generally different, extracted (and possibly refined) schemas. Let us mention some common situations: base tables and views (RDBMS), DBD and PSB (IMS), schema and subschemas (CODASYL), file structures from all the application programs (standard files), etc. The final logical schema must include the specifications of all these partial views, through a schema integration process.
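As an illustration of the DATA ANALYSIS process described above, the following sketch tests two typical hypotheses on the files of figure 2: whether the key field of CF008 is unique, and whether the second part of the PRODUCTION key is a foreign key to CF008. It is only an illustration, not a DB-MAIN component; the file names, record widths and the convention used to recognize type-2 records are assumptions.

# Illustrative DATA ANALYSIS checks on fixed-length record files.
# File names, record widths and the type-2 filter are assumptions based on figure 2.

def load_fixed_records(path, width):
    """Read a text file of fixed-length records."""
    with open(path, "r", encoding="ascii") as f:
        data = f.read().replace("\n", "")
    return [data[i:i + width] for i in range(0, len(data), width) if data[i:i + width].strip()]

def is_unique(records, field):
    """Hypothesis 1: the field at slice `field` is an identifier (no duplicates)."""
    values = [r[field] for r in records]
    return len(values) == len(set(values))

def inclusion_holds(records, field, ref_records, ref_field):
    """Hypothesis 2: every value of the candidate foreign key occurs as a key
    of the referenced file (an inclusion dependency)."""
    ref_keys = {r[ref_field] for r in ref_records}
    return all(r[field] in ref_keys for r in records)

if __name__ == "__main__":
    cf008 = load_fixed_records("CF008.dat", width=131)      # K1 9(6) + filler X(125)
    pfos = load_fixed_records("PFOS.dat", width=195)        # K11 X(9) + filler 9(6) + PDATA X(180)
    production = [r for r in pfos if r[9:15] != "000000"]   # assumed marker of type-2 records

    print("key unique in CF008:", is_unique(cf008, slice(0, 6)))
    print("second key part of PRODUCTION references CF008:",
          inclusion_holds(production, slice(9, 15), cf008, slice(0, 6)))

Such checks cannot prove a constraint, but a single counter-example immediately refutes a hypothesis, which is exactly the kind of evidence the data analysis step accumulates.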

The end product of this phase is the complete logical schema. This schema is expressed according to the specific model of the DMS, and still includes possible optimized constructs, hence its name: the DMS-compliant optimized schema, or DMS schema for short.

The current DBRE CARE tools offer only limited DMS-DDL text ANALYSIS functionalities. The analyst is left without help as far as PROGRAM ANALYSIS, DATA ANALYSIS and SCHEMA INTEGRATION processes are concerned. The DB-MAIN tool is intended to address all these processes and to improve the support that analysts are entitled to expect from CARE tools.

2.2. Data structure conceptualization

This second phase addresses the conceptual interpretation of the DMS schema. It consists for instance in detecting and transforming or discarding non-conceptual structures, redundancies, technical optimization and DMS-dependent constructs. It consists of two sub-processes, namely Basic conceptualization and Conceptual normalization. The reader will find in Hainaut et al. (1993b) a more detailed description of these processes, which rely heavily on schema restructuring techniques (or schema transformations).

• BASIC CONCEPTUALIZATION. The main objective of this process is to extract all the relevant semantic concepts underlying the logical schema. Two different problems, requiring different reasonings and methods, have to be solved: schema untranslation and schema de-optimization. However, before tackling these problems, we often have to prepare the schema by cleaning it.


SCHEMA PREPARATION. The schema still includes some constructs, such as files and access keys, which may have been useful in the Data Structure Extraction phase, but which can now be discarded. In addition, translating names to make them more meaningful (e.g., substitute the file name for the record name), and restructuring some parts of the schema can prove useful before trying to interpret them.

SCHEMA UNTRANSLATION. The logical schema is the technical translation of conceptual constructs. Through this process, the analyst identifies the traces of such translations, and replaces them by their original conceptual construct. Though each data model can be assigned its own set of translating (and therefore of untranslating) rules, two facts are worth mentioning. First, the data models can share an important subset of translating rules (e.g., COBOL files and SQL structures). Secondly, translation rules considered as specific to a data model are often used in other data models (e.g., foreign keys in IMS and CODASYL databases). Hence the importance of generic approaches and tools.

SCHEMA DE-OPTIMIZATION. The logical schema is searched for traces of constructs designed for optimization purposes. Three main families of optimization techniques should be considered: denormalization, structural redundancy and restructuring (Hainaut et al., 1993b).

• CONCEPTUAL NORMALIZATION. This process restructures the basic conceptual schema in order to give it the desired qualities one expects from any final conceptual schema, e.g., expressiveness, simplicity, minimality, readability, genericity, extensibility. For instance, some entity types are replaced by relationship types or by attributes, is-a relations are made explicit, names are standardized, etc. This process is borrowed from standard DB design methodologies (Batini et al., 1992; Teorey, 1994; Rauh and Stickel, 1995).

All the proposals published so far address this phase, most often for specific DMS, and for rather simple schemas (e.g., with no implementation tricks). They generally propose elementary rules and heuristics for the SCHEMA UNTRANSLATION process and to some extent for CONCEPTUAL NORMALIZATION, but not for the more complex DE-OPTIMIZATION phase. The DB-MAIN CARE tool has been designed to address all these processes in a flexible way.

2.3. Summary of the limits of the state of the art in CARE tools

The methodological framework developed in Sections 2.1 and 2.2 can be specialized according to a specific DMS and according to specific development standards. For instance, Hainaut et al. (1993b) suggests specialized versions of the CONCEPTUALIZATION phase for SQL, COBOL, IMS, CODASYL and TOTAL/IMAGE databases.

It is interesting to use this framework as a reference process model against which existing methodologies can be compared, in particular, those which underlie the current CARE tools (figure 4). The conclusions can be summarized as follows:

• DATA STRUCTURE EXTRACTION. Current CARE tools are limited to parsing DMS-DDL schemas only (DMS-DDL text ANALYSIS).


[Figure 4 diagram: the DMS-DDL schema is parsed by DMS-DDL text ANALYSIS into a DMS-compliant optimized schema, which SCHEMA UNTRANSLATION maps to a conceptual-logical schema, which CONCEPTUAL NORMALIZATION turns into the normalized conceptual schema.]

Figure 4. Simplified DBRE methodology proposed by most current CARE tools.

All the other sources are ignored, and must be processed manually. For instance, these tools are unable to collect the multiple views of a COBOL application, and to integrate them to produce the global COBOL schema. A user of a popular CARE tool tells us "how he spent several weeks, cutting and pasting hundreds of sections of programs, to build an artificial COBOL program in which all the files and records were fully described. Only then was the tool able to extract the file data structures".

• DATA STRUCTURE CONCEPTUALIZATION. Current CARE tools focus mainly on untranslation (SCHEMA UNTRANSLATION) and offer some restructuring facilities (CONCEPTUAL NORMALIZATION), though these processes often are merged. Once again, some strong naming conventions must often be satisfied for the tools to help. For instance, a foreign key and the referenced primary key must have the same names. All performance-oriented constructs, as well as most non standard database structures (see Premerlani and Blaha (1993) and Blaha and Premerlani (1995) for instance), are completely beyond the scope of these tools.

3. Requirements for a DBRE CARE tool

This section states some of the most important requirements an ideal DBRE support environment (or CARE tool) should meet. These requirements are induced by the analysis of the specific characteristics of the DBRE processes. They also derive from experience in reverse engineering the files and databases of a dozen actual applications, often by hand or with very basic text editing tools.

Flexibility

Observation. The very nature of the RE activities differs from that of more standard engineering activities. Reverse engineering a software component, and particularly a database, is basically an exploratory and often unstructured activity. Some important aspects of higher level specifications are discovered (sometimes by chance), and not deterministically inferred from the operational ones.

Requirements. The tool must allow the user to follow flexible working patterns, including unstructured ones. It should be methodology-neutral, unlike forward engineering tools. In addition, it must be highly interactive.

Extensibility

Observation. RE appears as a learning process; each RE project often is a new problem of its own, requiring specific reasonings and techniques.

Requirements. Specific functions should be easy to develop, even for one-shot use.

Source multiplicity

Observation. RE requires a great variety of information sources: data structure, data (from files, databases, spreadsheets, etc.), program text, program execution, program output, screen layout, CASE repository and Data dictionary contents, paper or computer-based documentation, interview, workflow and dataflow analysis, domain knowledge, etc.

Requirements. The tool must include browsing and querying interfaces with these sources. Customizable functions for automatic and assisted specification extraction should be available for each of them.

Text analysis

Observation. More particularly, database RE requires browsing through huge amounts of text, searching them for specific patterns (e.g., programming cliches (Selfridge et al., 1993)), following static execution paths and dataflows, extracting program slices (Weiser, 1984).

Requirements. The CARE tool must provide sophisticated text analysis processors. The latter should be language independent, easy to customize and to program, and tightly coupled with the specification processing functions.
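As a rough illustration of this kind of pattern-based text analysis (a sketch only: the cliché set and the source file name are invented, and DB-MAIN's own pattern language is presented in Section 8), a few regular expressions already surface constructs that often betray hidden structures in COBOL source:

import re

# Invented example clichés; a real pattern library would be customizable per project.
CLICHES = {
    "REDEFINES clause (possible overloaded or hidden structure)":
        re.compile(r"^\s*\d{2}\s+\S+\s+REDEFINES\s+\S+", re.I | re.M),
    "anonymous FILLER field (possibly hiding a compound field)":
        re.compile(r"^\s*\d{2}\s+FILLER\s+PIC\s+\S+", re.I | re.M),
    "keyed READ (evidence of an access key or reference)":
        re.compile(r"\bREAD\s+\S+\s+KEY(\s+IS)?\s+\S+", re.I),
}

def scan(source_text):
    """Return (line number, cliché, matched text) triples, sorted by line."""
    hits = []
    for name, pattern in CLICHES.items():
        for m in pattern.finditer(source_text):
            line = source_text.count("\n", 0, m.start()) + 1
            hits.append((line, name, m.group(0).strip()))
    return sorted(hits)

if __name__ == "__main__":
    with open("orders.cob", "r", encoding="ascii") as f:
        for line, name, text in scan(f.read()):
            print(f"line {line:4}: {name}: {text}")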

Name processing

Observation. Object names in the operational code are an important knowledge source. Frustratingly enough, these names often happen to be meaningless (e.g., REC-001-R08, 1-087), or at least less informative than expected (e.g., INV-QTY, QOH, C-DATA), due to the use of strict naming conventions. Many applications are multilingual, so that data names may be expressed in several languages. In addition, multi-programmer development often induces inconsistent naming conventions.


Requirements. The tool must include sophisticated name analysis and processing functions.
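A small sketch of the kind of name processing meant here; the abbreviation dictionary is an invented example, and a real one would be project- and language-specific:

import re

# Toy abbreviation dictionary (an assumption, not DB-MAIN's).
EXPANSIONS = {
    "CPY": "COMPANY", "CUST": "CUSTOMER", "QTY": "QUANTITY",
    "INV": "INVOICE", "NBR": "NUMBER", "ADDR": "ADDRESS", "REC": "RECORD",
}

def normalize(name):
    """Split a data name into tokens and expand known abbreviations,
    e.g. 'CPY-ID' -> 'COMPANY ID', 'REC-CF008-1' -> 'RECORD CF008 1'."""
    tokens = re.split(r"[-_. ]+", name.strip().upper())
    return " ".join(EXPANSIONS.get(t, t) for t in tokens if t)

for n in ("CPY-ID", "INV-QTY", "REC-CF008-1", "C-DATA"):
    print(f"{n:12} -> {normalize(n)}")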

Links with other CASE processes

Observation. RE is seldom an independent activity. For instance, (1) forward engineering projects frequently include reverse engineering of some existing components, (2) reverse engineering shares important processes with forward engineering (e.g., conceptual normalization), (3) reverse engineering is a major activity in broader processes such as migration, reengineering and data administration.

Requirements. A CARE tool must offer a large set of functions, including those which pertain to forward engineering.

Openness

Observation. There is (and probably will be) no available tool that can satisfy all corporate needs in application engineering. In addition, companies usually already make use of one or, most often, several CASE tools, software development environments, DBMS, 4GL or DDS.

Requirements. A CARE tool must communicate easily with the other development tools (e.g., via integration hooks, communications with a common repository or by exchanging specifications through a common format).

Flexible specification model

Observation. As in any CAD activity, RE applies to incomplete and inconsistent specifications. However, one of its characteristics makes it intrinsically different from design processes: at any time, the current specifications may include components from different abstraction levels. For instance, a schema in process can include record types (physical objects) as well as entity types (conceptual objects).

Requirements. The specification model must be wide-spectrum, and provide artifacts for components of different abstraction levels.

Genericity

Observation. Tricks and implementation techniques specific to some data models have been found to be used in other data models as well (e.g., foreign keys are frequent in IMS and CODASYL databases). Therefore, many RE reasonings and techniques are common to the different data models used by current applications.

Requirements. The specification model and the basic techniques offered by the tool must be DMS-independent, and therefore highly generic.


Multiplicity of views

Observation. The specifications, whatever their abstraction level (e.g., physical, logical or conceptual), are most often huge and complex, and need to be examined and browsed through in several ways, according to the nature of the information one tries to obtain.

Requirements. The CARE tool must provide several ways of viewing both source texts and abstract structures (e.g., schemas). Multiple textual and graphical views, summary and fine-grained presentations must be available.

Rich transformation toolset

Observation. Actual database schemas may include constructs intended to represent conceptual structures and constraints in non standard ways, and to satisfy non functional requirements (performance, distribution, modularity, access control, etc.). These constructs are obtained through schema restructuration techniques.

Requirements. The CARE tool must provide a rich set of schema transformation techniques. In particular, this set must include operators which can undo the transformations commonly used in practical database designs.

Traceability

Observation. A DBRE project includes at least three sets of documents: the operational descriptions (e.g., DDL texts, source program texts), the logical schema (DMS-compliant) and the conceptual schema (DMS-independent). The forward and backward mappings between these specifications must be precisely recorded. The forward mapping specifies how each conceptual (or logical) construct has been implemented into the operational (or logical) specifications, while the backward mapping indicates of which conceptual (or logical) construct each operational (or logical) construct is an implementation.

Requirements. The repository of the CARE tool must record all the links between the schemas at the different levels of abstraction. More generally, the tool must ensure the traceability of the RE processes.
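One very simple way of recording such forward and backward mappings, sketched here with invented names and with no claim to match the actual DB-MAIN repository, is to log, for every transformation applied, which source constructs produced which target constructs:

from collections import defaultdict

class MappingLog:
    """Keeps forward (source -> target) and backward (target -> source) links
    so that a construct can be traced across abstraction levels."""

    def __init__(self):
        self.forward = defaultdict(set)
        self.backward = defaultdict(set)

    def record(self, transformation, sources, targets):
        for s in sources:
            for t in targets:
                self.forward[s].add((transformation, t))
                self.backward[t].add((transformation, s))

log = MappingLog()
# Example: the conceptual relationship type 'passes' of figure 1 was translated
# into the foreign key ORDER.CNUM of the relational schema.
log.record("rel-type -> foreign key",
           sources=["conceptual:passes(CUSTOMER,ORDER)"],
           targets=["logical:ORDER.CNUM references CUSTOMER"])
print(log.backward["logical:ORDER.CNUM references CUSTOMER"])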

4. The DB-MAIN CASE tool

The DB-MAIN database engineering environment is the result of an R&D project initiated in 1993 by the DB research unit of the Institute of Informatics, University of Namur. This tool is dedicated to database applications engineering, and its scope encompasses, but is much broader than, reverse engineering alone. In particular, its ultimate objective is to assist developers in database design (including full control of logical and physical processes), database reverse engineering, database application reengineering, maintenance, migration and evolution. Further detail on the whole approach can be found in Hainaut et al. (1994).


As far as DBRE support is concerned, the DB-MAIN CASE tool has been designed to address as much as possible the requirements developed in the previous section.

As a wide-scope CASE tool, DB-MAIN includes the usual functions needed in database analysis and design, e.g., entry, browsing, management, validation and transformation of specifications, as well as code and report generation. However, the rest of this paper, namely Sections 5 to 11, will concentrate only on the main aspects and components of the tool which are directly related to DBRE activities. In Section 5 we describe the way schemas and other specifications are represented in the repository. The tool is based on a general purpose transformational approach which is described in Section 6. Viewing the specifications from different angles and in different formats is discussed in Section 7. In Section 8, various tools dedicated to text and name processing and analysis are described. Section 9 presents some expert modules, called assistants, which help the analyst in complex processing and analysis tasks. DB-MAIN is an extensible tool which allows its users to build new functions through the Voyager-2 tool development language (Section 10). Finally, Section 11 evokes some aspects of the tool dedicated to methodological customization, control and guidance.

In Section 12, we will reexamine the requirements described in Section 3, and evaluate to what extent the DB-MAIN tool meets them.

5. The DB-MAIN specification model and repository

The repository collects and maintains all the information related to a project. We will limit the presentation to the data aspects only. Though they have strong links with the data structures in DBRE, the specification of the other aspects of the applications, e.g., processing, will be ignored in this paper. The repository comprises three classes of information:

— a structured collection of schemas and texts used and produced in the project,

— the specification of the methodology followed to conduct the project,

— the history (or trace) of the project.

We will ignore the two latter classes, which are related to methodological control and which will be evoked briefly in Section 11.

A schema is a description of the data structures to be processed, while a text is any textual material generated or analyzed during the project (e.g., a program or an SQL script). A project usually comprises many (i.e., dozens to hundreds of) schemas. The schemas of a project are linked through specific relationships; they pertain to the methodological control aspects of the DB-MAIN approach, and will be ignored in this section.

A schema is made up of specification constructs which can be classified into the usual three abstraction levels. The DB-MAIN specification model includes the following concepts (Hainaut et al., 1992a).

The conceptual constructs are intended to describe abstract, machine-independent, semantic structures. They include the notions of entity types (with/without attributes; with/without identifiers), of super/subtype hierarchies (single/multiple inheritance, total and disjoint properties), and of relationship types (binary/N-ary; cyclic/acyclic), whose roles are characterized by min-max cardinalities and optional names.

[Figure 5 diagram: entity types PRODUCT, CUSTOMER and ACCOUNT, record type ORDER with foreign keys ORIGIN and DETAIL[*].PRO, access keys on ORD-ID and ORIGIN, and file DSK:MGT-03.]

Figure 5. A typical data structure schema during reverse engineering. This schema includes conceptualized objects (PRODUCT, CUSTOMER, ACCOUNT, of), logical objects (record type ORDER, with single-valued and multivalued foreign keys) and physical objects (access keys ORDER.ORD-ID and ORDER.ORIGIN; file DSK:MGT-03).

A role can be defined on one or several entity types. Attributes can be associated with entity and relationship types; they can be single-valued or multivalued, atomic or compound. Identifiers (or keys), made up of attributes and/or roles, can be associated with an entity type, a relationship type and a multivalued attribute. Various constraints can be defined on these objects: inclusion, exclusion, coexistence, at-least-one, etc.

The logical constructs are used to describe schemas compliant with DMS models, such as relational, CODASYL or IMS schemas. They comprise, among others, the concepts of record types (or table, segment, etc.), fields (or columns), referential constraints, and redundancy.

The physical constructs describe implementation aspects of the data which are related to such criteria as the performance of the database. They make it possible to specify files, access keys (index, calc key, etc.), physical data types, bag and list multivalued attributes, and other implementation details.

In database engineering, as discussed in Section 2, a schema describes a fragment of the data structures at a given level of abstraction. In reverse engineering, an in progress schema may even include constructs at different levels of abstraction. Figure 5 presents a schema which includes conceptual, logical and physical constructs. Ultimately, this schema will be completely conceptualized through the interpretation of the logical and physical objects.
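As a rough sketch of what such a wide-spectrum model allows (the class names and fields below are invented for illustration and are not the actual DB-MAIN repository structure), a single in-progress schema can hold conceptual, logical and physical constructs side by side, in the spirit of figure 5:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Attribute:
    name: str
    min_card: int = 1                      # e.g. [0-1] optional, [1-20] multivalued
    max_card: int = 1
    physical_type: Optional[str] = None    # only meaningful at the physical level

@dataclass
class Construct:
    name: str
    level: str                             # "conceptual", "logical" or "physical"
    kind: str                              # "entity type", "record type", "file", ...
    attributes: List[Attribute] = field(default_factory=list)
    identifiers: List[List[str]] = field(default_factory=list)

@dataclass
class Schema:
    name: str
    constructs: List[Construct] = field(default_factory=list)

schema = Schema("in-progress DBRE schema", [
    Construct("CUSTOMER", "conceptual", "entity type",
              [Attribute("CUST-ID"), Attribute("NAME"), Attribute("ADDRESS")],
              identifiers=[["CUST-ID"]]),
    Construct("ORDER", "logical", "record type",
              [Attribute("ORD-ID"), Attribute("DATE"), Attribute("ORIGIN"),
               Attribute("DETAIL", 1, 20)],
              identifiers=[["ORD-ID"]]),
    Construct("DSK:MGT-03", "physical", "file"),
])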

Besides these concepts, the repository includes some generic objects which can be customized according to specific needs. In addition, annotations can be associated with each object. These annotations can include semi-formal properties, made of the property name and its value, which can be interpreted by Voyager-2 functions (see Section 10). These features provide dynamic extensibility of the repository. For instance, new concepts such as organizational units, servers, or geographic sites can be represented by specializing the generic objects, while statistics about entity populations, the gender and plural of the object names can be represented by semi-formal attributes. The contents of the repository can be expressed as a pure text file through the ISL language, which provides import-export facilities between DB-MAIN and its environment.

6. The transformation toolkit

The desirability of the transformational approach to software engineering is now widely recognized. According to Fikas (1985) for instance, the process of developing a program [can be] formalized as a set of transformations. This approach has been put forward in database engineering by an increasing number of authors for several years, either in research papers, or in textbooks and, more recently, in several CASE tools (Hainaut et al., 1992; Rosenthal and Reiner, 1994). Quite naturally, schema transformations have found their way into DBRE as well (Hainaut et al., 1993a, 1993b). The transformational approach is the cornerstone of the DB-MAIN approach (Hainaut, 1981, 1991, 1993b, 1994) and CASE tool (Hainaut et al., 1992, 1994; Joris et al., 1992). A formal presentation of this concept can be found in Hainaut (1991, 1995, 1996).

Roughly speaking, a schema transformation consists in deriving a target schema S' from a source schema S by some kind of local or global modification. Adding an attribute to an entity type, deleting a relationship type, and replacing a relationship type by an equivalent entity type, are three examples of schema transformations. Producing a database schema from another schema can be carried out through selected transformations. For instance, normalizing a schema, optimizing a schema, producing an SQL database or COBOL files, or reverse engineering standard files and CODASYL databases can be described mostly as sequences of schema transformations. Some authors propose schema transformations for selected design activities (Navathe, 1980; Kobayashi, 1986; Kozaczynsky, 1987; Rosenthal and Reiner, 1988; Batini et al., 1992; Rauh and Stickel, 1995; Halpin and Proper, 1995). Moreover, some authors claim that the whole database design process, together with other related activities, can be described as a chain of schema transformations (Batini et al., 1993; Hainaut et al., 1993b; Rosenthal and Reiner, 1994). Schema transformations are essential to define formally forward and backward mappings between schemas, and particularly between conceptual structures and DMS constructs (i.e., traceability).

A special class of transformations is of particular importance, namely the semantics-preserving transformations, also called reversible, since each of them is associated with another semantics-preserving transformation called its inverse. Such a transformation ensures that the source and target schemas have the same semantic descriptive power. In other words, any situation of the application domain that can be modelled by an instance of one schema can be described by an instance of the other. If we can produce a relational schema from a conceptual schema by applying reversible transformations only, then both schemas will be equivalent by construction, and no semantics will be lost in the translation process. Conversely, if the interpretation of a relational schema, i.e., its conceptualization (Section 2.2), can be performed by using reversible transformations, the resulting conceptual schema will be semantically equivalent to the source schema. An in-depth discussion of the concept of specification preservation can be found in Hainaut (1995, 1996).

To illustrate this concept, we will outline informally three of the most popular transformation techniques, called mutations (type changing), that are used in database design; as a consequence, their inverses will be used in reverse engineering.


Figure 6. Transforming a relationship type into an entity type, and conversely.

Figure 7. Relationship type R is represented by foreign key B1, and conversely.

Figure 8. Transformation of an attribute into an entity type: (a) by explicit representation of its instances, (b) by explicit representation of its distinct values.

To simplify the presentation, each transformation and its inverse are described in one figure, in which the direct transformation is read from left to right, and its inverse from right to left.

Figure 6 shows graphically how a relationship type can be replaced by an equivalent entity type, and conversely. The technique can be extended to N-ary relationship types.

Another widely used transformation replaces a binary relationship type by a foreign key (figure 7), which can be either multivalued (J > 1) or single-valued (J = 1).

The third standard technique transforms an attribute into an entity type. It comes in two flavours, namely instance representation (figure 8a), in which each instance of attribute A2 in each A entity is represented by an EA2 entity, and value representation (figure 8b), in which each distinct value of A2, whatever the number of its instances, is represented by one EA2 entity.
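The following Python fragment is a toy illustration of such a mutation and of its reversibility; it is not DB-MAIN code, and the dictionary-based schema representation and the function names are invented for the purpose of the example. The relationship type R of figure 7 is replaced by a foreign key, and the inverse mutation restores the original schema.

import copy

def rel_type_to_foreign_key(schema, rel_name):
    """Drop relationship type `rel_name` and add a foreign-key attribute
    to the 'many' side (cf. figure 7, read from left to right)."""
    s = copy.deepcopy(schema)
    rel = s["rel_types"].pop(rel_name)
    one_side, many_side = rel["one"], rel["many"]
    key = s["entity_types"][one_side]["id"]
    s["entity_types"][many_side]["attributes"].append(key)
    s["entity_types"][many_side]["refs"][key] = one_side   # foreign key
    return s

def foreign_key_to_rel_type(schema, ent_name, fk_attr, rel_name):
    """Inverse mutation: re-express foreign key `fk_attr` of `ent_name`
    as an explicit relationship type `rel_name` (figure 7, right to left)."""
    s = copy.deepcopy(schema)
    target = s["entity_types"][ent_name]["refs"].pop(fk_attr)
    s["entity_types"][ent_name]["attributes"].remove(fk_attr)
    s["rel_types"][rel_name] = {"one": target, "many": ent_name}
    return s

schema = {
    "entity_types": {
        "A": {"id": "A1", "attributes": ["A1"], "refs": {}},
        "B": {"id": "B2", "attributes": ["B2"], "refs": {}},
    },
    "rel_types": {"R": {"one": "A", "many": "B"}},
}

logical = rel_type_to_foreign_key(schema, "R")                  # design direction
conceptual = foreign_key_to_rel_type(logical, "B", "A1", "R")   # reverse direction
assert conceptual == schema   # the pair of mutations is semantics-preserving

The final assertion holds precisely because the two mutations are inverses of each other, which is the property exploited during conceptualization.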

DB-MAIN proposes a three-level transformation toolset that can be used freely, according to the skill of the user and the complexity of the problem to be solved. These tools



Figure 9. The dialog box of the Split/Merge transformation through which the analyst can either extract some components from the master entity type (left), or merge two entity types, or migrate components from an entity type to another.

are neutral and generic, in that they can be used in any database engineering process. As far as DBRE is concerned, they are mainly used in Data Structure Conceptualization (Section 2.2).

• Elementary transformations. The selected transformation is applied to the selected object:

apply transformation T to current object O

With these tools, the user keeps full control of the schema transformation. Indeed, similar situations can often be solved by different transformations; e.g., a multivalued attribute can be transformed in a dozen ways. Figure 9 illustrates the dialog box for the Split/Merge of an entity type. The current version of DB-MAIN proposes a toolset of about 25 elementary transformations.

• Global transformations. A selected elementary transformation is applied to all the objects of a schema which satisfy a specified precondition:

apply transformation T to the objects that satisfy condition P

DB-MAIN offers some predefined global transformations, such as: replace all one-to-many relationship types by foreign keys, or replace all multivalued attributes by entity types. However, the analyst can define their own toolset through the Transformation Assistant described in Section 9.


• Model-driven transformations. All the constructs of a schema that violate a given model M are transformed in such a way that the resulting schema complies with M:

apply the transformation plan which makes the current schema satisfy model M

Such an operator is defined by a transformation plan, i.e., a sort of algorithm comprising global transformations, which is proved (or assumed) to make any schema comply with M. A model-driven transformation implements formal techniques or heuristics applicable in such major engineering processes as normalization, model translation or untranslation, and conceptualization. Here too, DB-MAIN offers a dozen predefined model-based transformations, such as relational, CODASYL and COBOL translation, untranslation from these models, and conceptual normalization. The analyst can define their own transformation plans, either through the scripting facilities of the Transformation Assistant, or, for more complex problems, through the development of Voyager-2 functions (Section 10); a sketch of how the global and model-driven modes operate is given after this list.
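The following Python sketch illustrates, with invented names and a deliberately naive object representation, how the global and model-driven modes can be understood: a global transformation applies one operator to every object satisfying a precondition, and a model-driven plan repeats such global transformations until the schema complies with the target model.

def apply_global(schema_objects, precondition, transformation):
    """Global mode: apply `transformation` to every object satisfying `precondition`."""
    for obj in list(schema_objects):
        if precondition(obj):
            transformation(obj, schema_objects)

def apply_model_driven(schema_objects, plan, complies_with_model):
    """Model-driven mode: run a plan, i.e., a list of (precondition, transformation)
    pairs, until the schema complies with the target model M."""
    while not complies_with_model(schema_objects):
        for precondition, transformation in plan:
            apply_global(schema_objects, precondition, transformation)

# Example plan: "replace all multivalued attributes by entity types".
schema_objects = [
    {"kind": "attribute", "name": "PHONE", "owner": "CUSTOMER", "max_card": 5},
    {"kind": "attribute", "name": "NAME", "owner": "CUSTOMER", "max_card": 1},
]

def is_multivalued_attribute(obj):
    return obj["kind"] == "attribute" and obj["max_card"] > 1

def attribute_to_entity_type(obj, objects):
    obj["kind"] = "entity_type"          # promote the attribute to an entity type
    obj["linked_to"] = obj.pop("owner")  # keep a link to its former owner
    obj["max_card"] = 1

def complies(objs):
    return not any(is_multivalued_attribute(o) for o in objs)

plan = [(is_multivalued_attribute, attribute_to_entity_type)]
apply_model_driven(schema_objects, plan, complies)
assert complies(schema_objects)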

A more detailed discussion of these three transformation modes can be found in Hainaut et al. (1992) and Hainaut (1995).

7. The user interface

User interaction relies on a fairly standard GUI. However, DB-MAIN offers some original options which deserve discussion.

Browsing through, processing, and analyzing large schemas require an adequate presentation of the specifications. It quickly appears that more than one way of viewing them is necessary. For instance, a graphical representation of a schema allows an easy detection of certain structural patterns, but is useless when analyzing name structures to detect similarities, as is common in the DATA STRUCTURE EXTRACTION process (Section 2.1). DB-MAIN currently offers six ways of presenting a schema. Four of these views use a hypertext technique: compact (sorted list of entity type, relationship type and file names), standard (same + attributes, roles and constraints), extended (same + domains, annotations, ET-RT cross-reference) and sorted (sorted list of all the object names). Two views are graphical: full and compact (no attributes and no constraints). All of them are illustrated in figure 10.

Switching from one view to another is immediate, and any object selected in a view is still current when another view is chosen. Any relevant operator can be applied to an object, whatever the view through which it is presented. In addition, the text-based views make it possible to navigate from entity types to relationship types and conversely through hypertext links.

8. Text analysis and processing

Analyzing and processing various kinds of texts are basic activities in two specific processes, namely DMS-DDL text ANALYSIS and PROGRAM ANALYSIS.


Figure 10. Six different views of the same schema.

The first process is rather simple, and can be carried out by automatic extractors which analyze the data structure declaration statements of programs and build the corresponding abstract objects in the repository. DB-MAIN currently offers built-in standard parsers for COBOL, SQL, CODASYL, IMS, and RPG, but other parsers can be developed in Voyager-2 (Section 10).

To address the requirements of the second process, through which the preliminary specifications are refined from evidence found in programs or in other textual sources, DB-MAIN includes a collection of program analysis tools comprising, at the present time, an interactive pattern-matching engine, a dataflow diagram inspector and a program slicer. The main objective of these tools is to contribute to program understanding as far as data manipulation is concerned.

The pattern-matching function allows searching text files for definite patterns or cliches expressed in PDL, a Pattern Definition Language. As an illustration, we will describe one of the most popular heuristics to detect an implicit foreign key in a relational schema. It consists in searching the application programs for some forms of SQL queries which evoke the presence of an undeclared foreign key (Signore et al., 1994; Andersson, 1994; Petit et al., 1994). The principle is simple: most multi-table queries use primary/foreign key joins. For instance, considering that column CNUM has been recognized as a candidate key of table CUSTOMER, the following query suggests that column CUST in table ORDER may be a foreign key to CUSTOMER:

select CUSTOMER.CNUM, CNAME, DATE
from ORDER, CUSTOMER
where ORDER.CUST = CUSTOMER.CNUM


The SQL generic patterns

T1          ::= table-name
T2          ::= table-name
C1          ::= column-name
C2          ::= column-name
join        ::= begin-SQL select select-list from - {@T1 ! @T2 | @T2 ! @T1} where
                ! @T1"."@C1 _ "=" _ @T2"."@C2 ! end-SQL

The COBOL/DB2 specific patterns

AN-name     ::= [a-zA-Z][-a-zA-Z0-9]*
table-name  ::= AN-name
column-name ::= AN-name
-           ::= ({"/n"|"/t"|" "})+
_           ::= ({"/n"|"/t"|" "})*
begin-SQL   ::= {"exec"|"EXEC"}_{"sql"|"SQL"}_
end-SQL     ::= _{"end"|"END"}{"-exec"|"-EXEC"}_"."
select      ::= {"select"|"SELECT"}
from        ::= {"from"|"FROM"}
where       ::= {"where"|"WHERE"}
select-list ::= any-but(from)
!           ::= any-but({where|end-SQL})

Figure 11. Generic and specific patterns for foreign key detection in SQL queries. In the specific patterns, "-" designates any non-empty separator, "_" any separator, and "AN-name" any alphanumeric string beginning with a letter. The "any-but(E)" function identifies any string not including expression E. Symbols "+", "*", "/t", "/n", "|" and "a-z" have their usual grep or BNF meaning.

More generally, any SQL expression that looks like:

select ... from ... T1, ... T2 ... where ... T1.C1 = T2.C2 ...

may suggest that C1 is a foreign key to table T2, or C2 a foreign key to T1. Of course, this evidence would be even stronger if we could prove that C2—resp. C1—is a key of its table. This is just what figure 11 translates more formally in PDL.

This example illustrates two essential features of PDL and of its engine.

1. A set of patterns can be split into two parts (stored in different files). When a generic pattern file is opened, the unresolved patterns are to be found in the specified specific pattern file. In this example, the generic patterns define the skeleton of an SQL query, which is valid for any RDBMS and any host language, while the specific patterns complement this skeleton by defining the COBOL/DB2 API conventions. Replacing the latter will allow processing, e.g., C/ORACLE programs.

2. A pattern can include variables, the name of which is prefixed with @. When such a pattern is instantiated in a text, the variables are given a value which can be used, e.g., to automatically update the repository.

The pattern engine can analyze external source files, as well as textual descriptions stored in the repository (where, for instance, the extractors store the statements they do not understand, such as comments, SQL trigger and check statements). These texts can be searched for visual inspection only, but pattern instantiation can also trigger DB-MAIN actions. For instance, if a procedure such as that presented in figure 11 (creation of a referential constraint between column C2 and table T1) is associated with this pattern, this procedure can be executed automatically (under the analyst's control) for each instantiation of the pattern. In this way, the analyst can build a powerful custom tool which detects foreign keys in queries and which adds them to the schema automatically.
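A rough Python approximation of this heuristic is sketched below; it uses regular expressions instead of PDL and invented function names, but follows the same idea: locate primary/foreign key joins and report the suggested foreign keys.

import re

JOIN = re.compile(
    r"select\s+.+?\s+from\s+(\w+)\s*,\s*(\w+)\s+where\s+"
    r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE | re.DOTALL)

def suggest_foreign_keys(source_text, candidate_keys):
    """Yield (referencing_table, column, referenced_table) triples suggested by
    joins; `candidate_keys` maps a table name to its set of key columns."""
    for m in JOIN.finditer(source_text):
        t1, c1, t2, c2 = m.group(3), m.group(4), m.group(5), m.group(6)
        if c2 in candidate_keys.get(t2, set()):
            yield (t1, c1, t2)      # T1.C1 looks like a foreign key to T2
        if c1 in candidate_keys.get(t1, set()):
            yield (t2, c2, t1)      # and conversely

program = """
    EXEC SQL select CUSTOMER.CNUM, CNAME, DATE
             from ORDER, CUSTOMER
             where ORDER.CUST = CUSTOMER.CNUM END-EXEC.
"""
keys = {"CUSTOMER": {"CNUM"}}
print(list(suggest_foreign_keys(program, keys)))
# -> [('ORDER', 'CUST', 'CUSTOMER')]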

The dataflow inspector builds a graph whose nodes are the variables of the program to be analyzed, and the edges are relationships between these variables. These relationships are defined by selected PDL patterns. For instance, the following COBOL rules can be used to build a graph in which two nodes are linked if their corresponding variables appear simultaneously in a simple assignment statement, in a redefinition declaration, in an indirect write statement or in comparisons:

var_1     ::= cob_var;
var_2     ::= cob_var;
move      ::= "MOVE" - @var_1 - "TO" - @var_2;
redefines ::= @var_1 - "REDEFINES" - @var_2;
write     ::= "WRITE" - @var_1 - "FROM" - @var_2;
if        ::= "IF" - @var_1 - rel_op - @var_2;
if_not    ::= "IF" - @var_1 - "NOT" - rel_op - @var_2;

This tool can be used to solve structure hiding problems such as the decomposition of anonymous fields and procedurally controlled foreign keys, as illustrated in figure 2.
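The following Python sketch illustrates the principle with regular expressions standing in for the PDL patterns above; the representation is invented, and DB-MAIN of course works on the actual COBOL source rather than on this toy input.

import re
from collections import defaultdict

RULES = [
    re.compile(r"MOVE\s+([\w-]+)\s+TO\s+([\w-]+)"),
    re.compile(r"([\w-]+)\s+REDEFINES\s+([\w-]+)"),
    re.compile(r"WRITE\s+([\w-]+)\s+FROM\s+([\w-]+)"),
    re.compile(r"IF\s+([\w-]+)\s+(?:NOT\s+)?[=<>]+\s*([\w-]+)"),
]

def dataflow_graph(cobol_lines):
    """Link two variables whenever they occur together in a MOVE, REDEFINES,
    WRITE ... FROM or IF statement."""
    graph = defaultdict(set)
    for line in cobol_lines:
        for rule in RULES:
            for a, b in rule.findall(line):
                graph[a].add(b)
                graph[b].add(a)
    return graph

source = [
    "MOVE CUST-CODE TO ORD-CUST.",
    "01 ORD-DETAIL REDEFINES ORD-DATA.",
    "WRITE ORD-RECORD FROM ORD-DATA.",
    "IF ORD-CUST = WS-CUST",
]
g = dataflow_graph(source)
print(sorted(g["ORD-CUST"]))   # variables directly related to ORD-CUST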

The first experiments have quickly taught us that pattern-matching and dataflow inspection work fine for small programs and for locally concentrated patterns, but can prove difficult to use for large programs. For instance, a pattern made of a dozen statements can span several thousands of lines of code. With this problem in mind, we have developed a variant of a program slicer (Weiser, 1984), which, given a program P, generates a new program P' defined as follows. Let us consider a point S in P (generally a statement) and an object O (generally a variable) of P. The program slice of P for O at S is the smallest subset P' of P whose execution will give O the same state at S as would the execution of P in the same environment. Generally P' is a very small fragment of P, and can be inspected much more efficiently and reliably, both visually and with the help of the analysis tools described above, than its source program P. One application in which this program slicer has proved particularly valuable is the analysis of the statements contributing to the state of a record when it is written into its file.
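As an illustration of the principle only (the actual slicer handles real COBOL control flow), the following Python sketch computes a backward slice on straight-line code represented as (defined variable, used variables) pairs; all names are invented.

def backward_slice(statements, variable, point):
    """`statements` is a list of (defined_var, {used_vars}) pairs; return the
    indices of the statements in the slice of `variable` at index `point`."""
    relevant = {variable}
    slice_indices = []
    for i in range(point, -1, -1):
        defined, used = statements[i]
        if defined in relevant:
            slice_indices.append(i)
            relevant.discard(defined)
            relevant |= used          # the variables it depends on become relevant
    return sorted(slice_indices)

# MOVE A TO B; MOVE X TO Y; MOVE B TO C; WRITE REC FROM C
program = [
    ("B", {"A"}),      # 0
    ("Y", {"X"}),      # 1
    ("C", {"B"}),      # 2
    ("REC", {"C"}),    # 3
]
print(backward_slice(program, "REC", 3))   # -> [0, 2, 3]: statement 1 is irrelevant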

DB-MAIN also includes a name processor which can transform selected names in a schema, or in selected objects of a schema, according to substitution patterns. Here are some examples of such patterns:

"^C-" -> "CUST-" replaces all prefixes "C-" with the prefix "CUST-"; " DATE" -> " TIME" replaces each substring " DATE", whatever its position,

with the substring "TIME"; " ^CODE$" -> "REFERENCE" replaces all the names "CODE" with the new name

"REFERENCE".

In addition, it proposes case transformation: lower-to-upper, upper-to-lower, capitalize and remove accents. These parameters can be saved as a name processing script, and reused later.
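These substitution patterns behave much like anchored regular-expression rewrites; the following Python fragment is an approximation of the idea (the script format shown here is not DB-MAIN's own).

import re

name_script = [
    (r"^C-", "CUST-"),         # prefix substitution
    (r"DATE", "TIME"),         # substring substitution, any position
    (r"^CODE$", "REFERENCE"),  # whole-name substitution
]

def process_name(name, script):
    for pattern, replacement in script:
        name = re.sub(pattern, replacement, name)
    return name.upper()        # plus a case transformation, as in the tool

for n in ["C-CODE", "ORD-DATE", "CODE"]:
    print(n, "->", process_name(n, name_script))
# C-CODE -> CUST-CODE
# ORD-DATE -> ORD-TIME
# CODE -> REFERENCE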


Figure 12. Control panel of the Transformation assistant. The left-side area is the problem solver, which presents a catalog of problems (1st column) and suggested solutions (2nd column). The right-side area is the script manager. The worksheet shows a simplified script for conceptualizing relational databases.

9. The assistants

An assistant is a higher-level solver dedicated to coping with a special kind of problem, or performing specific activities efficiently. It gives access to the basic toolboxes of DB-MAIN, but in a controlled and intelligent way.

The current version of DB-MAIN includes three general purpose assistants which can support, among others, the DBRE activities, namely the Transformation assistant, the Schema Analysis assistant and the Text Analysis assistant. These processors offer a collection of built-in functions that can be enriched by user-defined functions developed in Voyager-2 (Section 10).

The Transformation Assistant (figure 12) allows applying one or several transformations to selected objects. Each operation appears as a problem/solution couple, in which the problem is defined by a pre-condition (e.g., the objects are the many-to-many relationship types of the current schema), and the solution is an action resulting in eliminating the problem (e.g., transform them into entity types). Several dozen problem/solution items are proposed. The analyst can select one of them, and execute it automatically or in a controlled way. Alternatively, (s)he can build a script comprising a list of operations, execute it, save and load it.

Predefined scripts are available to transform any schema according to popular models (e.g., Bachman model, binary model, relational, CODASYL, standard files), or to perform standard engineering processes (e.g., conceptualization of relational and COBOL schemas, normalization). Customized operations can be added via Voyager-2 functions (Section 10). Figure 12 shows the control panel of this tool. A second generation of the Transformation assistant is under development. It provides a more flexible approach to building complex transformation plans thanks to a catalog of more than 200 preconditions, a library of about 50 actions, and more powerful scripting control structures including loops and if-then-else patterns.

The Schema Analysis assistant is dedicated to the structural analysis of schemas. It uses the concept of submodel, defined as a restriction of the generic specification model described in Section 5 (Hainaut et al., 1992). This restriction is expressed by a boolean expression of elementary predicates stating which specification patterns are valid, and which ones are forbidden. An elementary predicate can specify situations such as the following: "entity types must have from 1 to 100 attributes", "relationship types have from 2 to 2 roles", "entity type names are less than 18 characters long", "names do not include spaces", "no name belongs to a given list of reserved words", "entity types have from 0 to 1 supertype", "the schema is hierarchical", "there are no access keys". A submodel appears as a script which can be saved and loaded. Predefined submodels are available: Normalized ER, Binary ER, NIAM, Functional ER, Bachman, Relational, CODASYL, etc. Customized predicates can be added via Voyager-2 functions (Section 10). The Schema Analysis assistant offers two functions, namely Check and Search. Checking a schema consists in detecting all the constructs which violate the selected submodel, while the Search function detects all the constructs which comply with the selected submodel.
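The following Python sketch illustrates the submodel idea with a few invented predicates and an invented object representation: a submodel is a conjunction of elementary predicates, and the Check function reports the constructs that violate it.

PREDICATES = {
    "entity types have from 1 to 100 attributes":
        lambda et: 1 <= len(et["attributes"]) <= 100,
    "entity type names are less than 18 characters long":
        lambda et: len(et["name"]) < 18,
    "names do not include spaces":
        lambda et: " " not in et["name"],
}

def check(entity_types, submodel):
    """Return (object name, violated predicate) pairs for the given submodel."""
    return [(et["name"], text)
            for et in entity_types
            for text in submodel
            if not PREDICATES[text](et)]

entity_types = [
    {"name": "CUSTOMER", "attributes": ["CNUM", "CNAME"]},
    {"name": "ORDER DETAIL LINE HISTORY", "attributes": []},
]
for name, rule in check(entity_types, list(PREDICATES)):
    print(name, "violates:", rule)

The Search function would simply keep the constructs for which all predicates hold instead of reporting the violations.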

The Text Analysis assistant presents in an integrated package all the tools dedicated to text analysis. In addition it manages the active links between the source texts and the abstract objects in the repository.

10. Functional extensibility

DB-MAIN provides a set of built-in standard functions that should be sufficient to satisfy most basic needs in database engineering. However, no CASE tool can meet the requirements of all users in any possible situation, and specialized operators may be needed to deal with unforeseen or marginal situations. There are two important domains in which users require customized extensions, namely additional internal functions and interfaces with other tools. For instance, analyzing and generating texts in any language and according to any dialect, or importing and exchanging specifications with any CASE tool or Data Dictionary System, are practically impossible, even with highly parametric import/export processors. To cope with such problems, DB-MAIN provides the Voyager-2 tool development environment allowing analysts to build their own functions, whatever their complexity. Voyager-2 offers a powerful language in which specific processors can be developed and integrated into DB-MAIN. Basically, Voyager-2 is a procedural language which proposes primitives to access and modify the repository through predicative or navigational queries, and to invoke all the basic functions of DB-MAIN. It provides a powerful list manager as well as functions to parse and generate complex text files. A user's tool developed in Voyager-2 is a program comprising possibly recursive procedures and functions. Once compiled, it can be invoked by DB-MAIN just like any basic function.

Figure 13 presents a small but powerful Voyager-2 function which validates and creates a referential constraint with the arguments extracted from a COBOL/SQL program by the pattern defined in figure 11. When such a pattern instantiates, the pattern-matching engine passes the values of the four variables T1, T2, C1 and C2 to the MakeForeignKey function.


function integer MakeForeignKey(string : T1, T2, C1, C2);
explain(* title = "Create a foreign key from an SQL join";
         help = "if C1 is a unique key of table T1 and if C2 is a column of T2,
                 and if C1 and C2 are compatible, then define C2 as a foreign key
                 of T2 to T1, and return true, else return false" *);

/* define the variables; any repository object type can be a domain */
schema : S;
entity_type : E;
attribute : A, IK, FK;
list : ID-LIST, FK-LIST;

S := GetCurrentSchema();   /* S is the current schema */

/* ID-LIST = list of the attributes A such that: (1) A belongs to an entity type E
   which is in schema S and whose name is T1, (2) the name of A is C1,
   (3) A is an identifier of E (the ID property of A is true) */
ID-LIST := attribute[A]{of:entity_type[E]{in:[S] and E.NAME = T1}
                        and A.NAME = C1 and A.ID = true};

/* FK-LIST = list of the attributes A such that: (1) A belongs to an entity type E
   which is in S and whose name is T2, (2) the name of A is C2 */
FK-LIST := attribute[A]{of:entity_type[E]{in:[S] and E.NAME = T2}
                        and A.NAME = C2};

/* if both lists are not empty, and if the attributes are compatible,
   then define the attribute in FK-LIST as a foreign key to the attribute in ID-LIST */
if not (empty(ID-LIST) or empty(FK-LIST)) then
   {IK := GetFirst(ID-LIST);
    FK := GetFirst(FK-LIST);
    if IK.TYPE = FK.TYPE and IK.LENGTH = FK.LENGTH then
       {connect(reference, IK, FK); return true;}
    else {return false;};}
else {return false;};

Figure 13. A (strongly simplified) excerpt of the repository and a Voyager-2 function which uses it. The repository expresses the fact that schemas have entity types, which in turn have attributes. Some attributes can be identifiers (boolean ID) or can reference (foreign key) another attribute (candidate key). The input arguments of the procedure are four names T1, T2, C1, C2 such as those resulting from an instantiation of the pattern of figure 11. The function first evaluates the possibility of attribute (i.e., column) C2 of entity type (i.e., table) T2 being a foreign key to entity type T1 with identifier (candidate key) C1. If the evaluation is positive, the referential constraint is created. The explain section illustrates the self-documenting facility of Voyager-2 programs; it defines the answers the compiled version of this function will provide when queried by the DB-MAIN tool.

11. Methodological control and design recovery^

Though this paper presents it as a CARE tool only, the DB-MAIN environment has a wider scope, i.e., data-centered applications engineering. In particular, it is intended to address the complex and critical problem of application evolution. In this context, understanding how the engineering processes have been carried out when legacy systems were developed, and guiding today's analysts in conducting application development, maintenance and reengineering, are major functions that should be offered by the tool. This research domain, known as design (or software) process modeling, is still under full development, and few results have been made available to practitioners so far. The reverse engineering process is strongly coupled with these aspects in three ways.

First, reverse engineering is an engineering activity in its own right (Section 2), and is therefore subject to rules, techniques and methods in the same way as forward engineering; it therefore deserves to be supported by methodological control functions of the CARE tool.

Secondly, DBRE is a complex process, based on trial-and-error behaviours. Exploring several solutions, comparing them, deriving new solutions from earlier dead-end ones, are common practices. Recording the history of a RE project, analyzing it, completing it with new processes, and replaying some of its parts, are typical design process modeling objectives.

Thirdly, while the primary aim of reverse engineering is (in short) to recover technical and functional specifications from the operational code of an existing application, a secondary objective is progressively emerging, namely to recover the design of the application, i.e., the way the application has (or could have) been developed. This design includes not only the specifications, but also the reasonings, the transformations, the hypotheses and the decisions the development process consists of.

Briefly stated, DB-MAIN proposes a design process model comprising concepts such as design product, design process, process strategy, decision, hypothesis and rationale. This model derives from proposals such as those of Potts and Bruns (1988) and Rolland (1993), extended to all database engineering activities. This model describes quite adequately not only standard design methodologies, such as the Conceptual-Logical-Physical approaches (Teorey, 1994; Batini et al., 1992), but also any kind of heuristic design behaviour, including those that occur in reverse engineering. We will shortly describe the elements of this design process model.

Product and product instance. A product instance is any outstanding specification object that can be identified in the course of a specific design. A conceptual schema, an SQL DDL text, a COBOL program, an entity type, a table, a collection of user's views, an evaluation report, can all be considered product instances. Similar product instances are classified into products, such as Normalized conceptual schema, DMS-compliant optimized schema or DMS-DDL schema (see figure 3).

Process and process instance. A process instance is any logical unit of activity which transforms a product instance into another product instance. Normalizing schema S1 into schema S2 is a process instance. Similar process instances are classified into processes, such as CONCEPTUAL NORMALIZATION in figure 3.

Process strategy. The strategy of a process is the specification of how its goal can be achieved, i.e., how each instance of the process must be carried out. A strategy may be deterministic, in which case it reduces to an algorithm (and can often be implemented as a primitive), or it may be non-deterministic, in which case the exact way in which each of its instances will be carried out is up to the designer. The strategy of a design process is defined by a script that specifies, among others, what lower-level processes must/can be triggered, in what order, and under what conditions. The control structures in a script include action selection (at most one, one only, at least one, all in any order, all in this order, at least one any number of times, etc.), alternate actions, iteration, parallel actions, weak condition (should be satisfied), strong condition (must be satisfied), etc.

Decision, hypothesis and rationale. In many cases, the analyst/developer will carry out an instance of a process with some hypothesis in mind. This hypothesis is an essential characteristic of this process instance since it implies the way in which its strategy will be performed. When the engineer needs to try another hypothesis, (s)he can perform another instance of the same process, generating a new instance of the same product. After a while, (s)he is facing a collection of instances of this product, from which (s)he wants to choose the best one (according to the requirements that have to be satisfied). A justification of the decision must be provided. Hypothesis and decision justification comprise the design rationale.

History. The history of a process instance is the recorded trace of the way in which its strategy has been carried out, together with the product instances involved and the rationale that has been formulated. Since a project is an instance of the highest level process, its history collects all the design activities, all the product instances and all the rationales that have appeared, and will appear, in the life of the project. The history of a product instance P (also called its design) is the set of all the process instances, product instances and rationales which contributed to P. For instance, the design of a database collects all the information needed to describe and explain how the database came to be what it is.

A specific methodology is described in MDL, the DB-MAIN Methodology Description Language. The description includes the specification of the products and of the processes the methodology is made up of, as well as of the relationships between them. A product is of a certain type, described as a specialization of a generic specification object from the DB-MAIN model (Section 5), and more precisely as a submodel generated by the Schema analysis assistant (Section 9). For instance, a product called Raw-conceptual-schema (figure 3) can be declared as a BINARY-ER-SCHEMA. The latter is a product type that can be defined by a SCHEMA satisfying the following predicate, stating that relationship types are binary and have no attributes, and that the attributes are atomic and single-valued:

(all rel-types have from 2 to 2 roles) and
(all rel-types have from 0 to 0 attributes) and
(all attributes have from 0 to 0 components) and
(all attributes have a max cardinality from 1 to 1);

A process is defined mainly by the input product type(s), the internal product type, the output product type(s) and by its strategy.

The DB-MAIN CASE tool is controlled by a methodology engine which is able to interpret such a method description once it has been stored in the repository by the MDL compiler. In this way, the tool is customized according to this specific methodology. When developing an application, the analyst carries out process instances according to chosen hypotheses, and builds product instances. (S)he makes decisions which (s)he can justify.


All the product instances, process instances, hypotheses, decisions and justifications, related to the engineering of an application make up the trace, or history, of this application development. This history is also recorded in the repository. It can be examined, replayed, synthesized, and processed (e.g., for design recovery).

One of the most promising applications of histories is database design recovery. Constructing a possible design history for an existing, generally undocumented database is a complex problem which we propose to tackle in the following way. Reverse engineering the database generates a DBRE history. This history can be cleaned by removing unnecessary actions. Reversing each of the actions of this history, then reversing their order, yields a tentative, unstructured, design history. By normalizing the latter, and by structuring it according to a reference methodology, we can obtain a possible design history of the database. Replaying this history against the recovered conceptual schema should produce a physical schema which is equivalent to the current database.
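The following Python fragment sketches the reversal step only, under the assumption that every recorded action has a known inverse; the transformation names are invented, and the cleaning, normalization and structuring steps are not shown.

INVERSE = {
    "entity_type_to_rel_type": "rel_type_to_entity_type",
    "foreign_key_to_rel_type": "rel_type_to_foreign_key",
    "attribute_to_entity_type": "entity_type_to_attribute",
}

def tentative_design_history(dbre_history):
    """dbre_history: list of (transformation, object) pairs recorded during
    reverse engineering, in execution order; reverse each action, then their order."""
    return [(INVERSE[t], obj) for t, obj in reversed(dbre_history)]

dbre_history = [
    ("foreign_key_to_rel_type", "ORDER.CUST"),
    ("attribute_to_entity_type", "CUSTOMER.PHONE"),
]
for step in tentative_design_history(dbre_history):
    print(step)
# ('entity_type_to_attribute', 'CUSTOMER.PHONE')
# ('rel_type_to_foreign_key', 'ORDER.CUST')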

A more comprehensive description of how these problems are addressed in the DB-MAIN approach and CASE tool can be found in Hainaut et al. (1994), while the design recovery approach is described in Hainaut et al. (1996).

12. DBRE requirements and the DB-MAIN CASE tool

We will examine the requirements described in Section 3 to evaluate how the DB-MAIN CASE tool can help satisfy them.

Flexibility. Instead of being constrained by rigid methodological frameworks, the analyst is provided with a collection of neutral toolsets that can be used to process any schema, whatever its level of abstraction and its degree of completion. In particular, backtracking and multi-hypothesis exploration are easily performed. However, by customizing the method engine, the analyst can build a specialized CASE tool that enforces strict methodologies, such as that which has been described in Section 2.

Extensibility. Through the Voyager-2 language, the analyst can quickly develop specific functions; in addition, the assistants and the name and text analysis processors allow the analyst to develop customized scripts.

Sources multiplicity. The most common information sources have a text format, and can be queried and analyzed through the text analysis assistant. Other sources can be processed through specific Voyager-2 functions. For example, data analysis is most often performed by small ad hoc queries or application programs, which validate specific hypotheses about, e.g., a possible identifier or foreign key. Such queries and programs can be generated by Voyager-2 programs that implement heuristics about the discovery of such concepts. In addition, external information processors and analyzers can easily introduce specifications through the text-based import-export ISL language. For example, a simple SQL program can extract SQL specifications from DBMS data dictionaries, and generate their ISL expression, which can then be imported into the repository.
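As an illustration, the following Python fragment generates such an ad hoc validation query for a suspected foreign key; the table and column names are those of the earlier example, and the SQL text is merely one possible formulation, not the one the tool would produce.

def foreign_key_check_query(table, column, ref_table, ref_column):
    """Build a query counting the values of a suspected foreign key that have
    no match in the referenced candidate key."""
    return (
        f"select count(*) from {table} t\n"
        f"where t.{column} is not null\n"
        f"  and not exists (select 1 from {ref_table} r "
        f"where r.{ref_column} = t.{column})"
    )

print(foreign_key_check_query("ORDER", "CUST", "CUSTOMER", "CNUM"))
# A result of 0 orphan values supports the foreign-key hypothesis.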


Text analysis. The DB-MAIN tool offers both general purpose and specific text analyzers and processors. If needed, other processors can be developed in Voyager-2. Finally, external analyzers and text processors can be used provided they can generate ISL specifications which can then be imported in DB-MAIN to update the repository.

Name processing. Besides the name processor, specific Voyager-2 functions can be developed to cope with more specific name patterns or heuristics. Finally, the compact and sorted views can be used as powerful browsing tools to examine name patterns or to detect similarities.

Links with other CASE processes. DB-MAIN is not dedicated to DBRE only; therefore it includes in a seamless way supporting functions for the other DB engineering processes, such as forward engineering. Being neutral, many functions are common to all the engi­neering processes.

Openness. DB-MAIN supports exchanges with other CASE tools in two ways. First, Voyager-2 programs can be developed (1) to generate specifications in the input language of the other tools, and (2) to load into the repository the specifications produced by these tools. Secondly, ISL specifications can be used as a neutral intermediate language to communicate with other processors.

Flexible specification model. The DB-MAIN repository can accommodate specifications of any abstraction level, based on various paradigms; if asked to, DB-MAIN can be fairly tolerant of incomplete and inconsistent specifications and can represent schemas which include objects of different levels and of different paradigms (see figure 5); at the end of a complex process the analyst can request, through the Schema Analysis assistant, a precise analysis of the schema to sort out all the structural flaws.

Genericity. Both the repository schema and the functions of the tool are independent of the DMS and of the programming languages used in the application to be analyzed. They can be used to model and to process specifications initially expressed in various technologies. DB-MAIN includes several ways to specialize the generic features in order to make them compliant with a specific context, such as processing PL/1-IMS, COBOL-VSAM or C-ORACLE applications.

Multiplicity of views. The tool proposes a rich palette of presentation layouts both in graphical and textual formats. In the next version, the analyst will be allowed to define customized views.

Rich transformation toolset. DB-MAIN proposes a transformational toolset of more than 25 basic functions; in addition, other, possibly more complex, transformations can be built by the analyst through specific scripts, or through Voyager-2 functions.

Traceability. DB-MAIN explicitly records a history, which includes the successive states of the specifications as well as all the engineering activities performed by the analyst and by the tool itself. Viewing these activities as specification transformations has proved an elegant way to formalize the links between the specification states. In particular, these links can be processed to explain how a conceptual object has been implemented (forward mapping), and how a technical object has been interpreted (reverse mapping).

13. Implementation and applications of DB-MAIN

We have developed DB-MAIN in C++ for MS-Windows machines. The repository has been implemented as an object-oriented database. For performance reasons, we have built a specific OO database manager which provides very short access and update times, and whose disc and core memory requirements are kept very low. For instance, a fully documented 40,000-object project can be developed on an 8-MB machine.

The first version of DB-MAIN was released in September 1995. It includes the basic processors and functions required to design, implement and reverse engineer large databases according to various DMS. Version 1 supports many of the features that have been described in this paper. Its repository can accommodate data structure specifications at any abstraction level (Section 5). It provides a 25-transformation toolkit (Section 6), four textual and two graphical views (Section 7), parsers for SQL, COBOL, CODASYL, IMS and RPG programs, the PDL pattern-matching engine, the dataflow graph inspector, the name processor (Section 8), the Transformation, Schema Analysis and Text Analysis assistants (Section 9), the Voyager-2 virtual machine and compiler (Section 10), and a simple history generator and its replay processor (Section 11). Among the other functions of Version 1, let us mention code generators for various DMS. Its estimated cost was about 20 man-years.

The DB-MAIN tool has been used to carry out several government and industrial projects. Let us describe five of them briefly.

• Design of a government agricultural accounting system. The initial information was found in the notebooks in which the farmers record the day-to-day basic data. These documents were manually encoded as giant entity types with more than 1850 attributes and up to 9 decomposition levels. Through conceptualization techniques, these structures were transformed into pure conceptual schemas of about 90 entity types each. Despite the unusual context for DBRE, we have followed the general methodology described in Section 2:

— Data structure extraction. Manual encoding; refinement through direct contacts with selected accounting officers;

— Data structure conceptualization, comprising the following steps:

— Untranslation. The multivalued and compound attributes have been transformed into entity types; the entity types with identical semantics have been merged; serial attributes, i.e., attributes with similar names and identical types, have been replaced with multivalued attributes;

— De-optimization. The farmer is requested to enter the same data at different places; these redundancies have been detected and removed; the calculated data have been removed as well;


— Normalization. The schema included several implicit IS-A hierarchies, which have been expressed explicitly;

The cost for encoding, conceptualizing and integrating three notebooks was about one person-month. This rather unusual application of reverse engineering techniques was a very interesting experience because it proved that data structure engineering is a global domain which is difficult (and sterile) to partition into independent processes (design, reverse). It also proved that there is a strong need for highly generic CASE tools.

• Migrating a hybrid file/SQL social security system into a pure SQL database. Due to a strict disciplined design, the programs were based on rather neat file structures, and used systematic cliches for integrity constraints management. This fairly standard two-month project comprised an interesting work on name patterns to discover foreign keys. In addition, the file structures included complex identifying schemes which were difficult to represent in the DB-MAIN repository, and which required manual processing.

• Redocumenting the ORACLE repository of an existing OO CASE tool. Starting from various SQL scripts, partial schemas were extracted, then integrated. The conceptualization process was fairly easy due to systematic naming conventions for candidate and foreign keys. In addition, it was performed by a developer having a deep knowledge of the database. The process was completed in two days.

• Redocumenting a medium size ORACLE hospital database. The database included about 200 tables and 2,700 columns. The largest table had 75 columns. The analyst quickly detected a dozen major tables with which one hundred views were associated. It appeared that these views defined, in a systematic way, a 5-level subtype hierarchy. Entering the description of these subtypes by hand would have required an estimated one week. We chose to build a customized function in PDL and Voyager-2 as follows (a rough sketch of this kind of view analysis is given after this list). A pattern was developed to detect and analyze the create view statements based on the main tables. Each instantiation of this pattern triggered a Voyager-2 function which defined a subtype with the extracted attributes. Then, the function scanned these IS-A relations, detected the common attributes, and cleaned the supertype, removing inherited attributes and leaving the common ones only. This tool was developed in 2 days, and its execution took 1 minute. However, a less expert Voyager-2 programmer could have spent more time, so that these figures cannot be generalized reliably. The total reverse engineering process cost 2 weeks.

• Reverse engineering of an RPG database. The application was made of 31 flat files comprising 550 fields (2 to 100 fields per file), and 24 programs totalling 30,000 LOC. The reverse engineering process resulted in a conceptual schema comprising 90 entity types, including 60 subtypes, and 74 relationship types. In the programs, data validation was concentrated in well-defined sections. In addition, the programs exhibited complex access patterns. Obviously, the procedural code was a rich source of hidden structures and constraints. Due to the good quality of this code, the program analysis tools were of little help, except to quickly locate some statements. In particular, pattern detection could be done visually, and program slicing yielded too large program chunks. Only the dataflow inspector was found useful, though in some programs this graph was too large, due to the presence of working variables common to several independent program sections. At that time, no RPG parser was available, so a Voyager-2 RPG extractor was developed in about one week. The final conceptual schema was obtained in 3 weeks. The source file structures were found rather complex. Indeed, some non-trivial patterns were largely used, such as overlapping foreign keys, conditional foreign and primary keys, overloaded fields, and redundancies (Blaha and Premerlani, 1995). Surprisingly, the result was estimated unnecessarily complex as well, due to the deep type/subtype hierarchy. This hierarchy was reduced until it seemed more tractable. This problem triggered an interesting discussion about the limits of this inheritance mechanism. It appeared that the precision vs. readability trade-off may lead to unnormalized conceptual schemas, a conclusion which has often been formulated against object class hierarchies in OO databases, or in OO applications.
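As an illustration of the kind of custom function mentioned in the hospital project above, the following Python sketch derives candidate subtypes from create view statements defined over a main table; it is a rough stand-in for the PDL/Voyager-2 solution, with invented names and a simplified view syntax.

import re

VIEW = re.compile(
    r"create\s+view\s+(\w+)\s*(?:\([^)]*\))?\s+as\s+select\s+(.+?)\s+from\s+(\w+)",
    re.IGNORECASE | re.DOTALL)

def subtypes_from_views(ddl_text, main_table):
    """Return {view_name: [columns]} for the views built on `main_table`,
    i.e., candidate subtypes of the corresponding entity type."""
    subtypes = {}
    for name, columns, table in VIEW.findall(ddl_text):
        if table.upper() == main_table.upper():
            subtypes[name] = [c.strip() for c in columns.split(",")]
    return subtypes

ddl = """
create view INPATIENT as
  select PAT_ID, WARD, BED from PATIENT where BED is not null;
create view OUTPATIENT as
  select PAT_ID, CLINIC from PATIENT where BED is null;
"""
print(subtypes_from_views(ddl, "PATIENT"))
# {'INPATIENT': ['PAT_ID', 'WARD', 'BED'], 'OUTPATIENT': ['PAT_ID', 'CLINIC']}

A second pass, not shown here, would factor out the columns common to all subtypes and leave them in the supertype only, as the actual function did.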

14. Conclusions

Considering the requirements outlined in Section 3, few (if any) commercial CASE/CARE tools offer the functions necessary to carry out DBRE of large and complex applications in a really effective way. In particular, two important weaknesses should be pointed out. Both derive from oversimplistic hypotheses about the way the application was developed. First, extracting the data structures from the operational code is most often limited to the analysis of the data structure declaration statements. No help is provided for further analyzing, e.g., the procedural sections of the programs, in which essential additional information can be found. Secondly, the logical schema is considered as a straightforward conversion of the conceptual schema, according to simple translation rules such as those found in most textbooks and CASE tools. Consequently, the conceptualization phase uses simple rules as well. Most actual database structures appear more sophisticated, however, resulting from the application of non-standard translation rules and including sophisticated performance-oriented constructs. Current CARE tools are completely blind to such structures, which they carefully transmit into the conceptual schema, producing, e.g., optimized IMS conceptual schemas instead of pure conceptual schemas.

The DB-MAIN CASE tool presented in this paper includes several CARE components which try to meet the requirements described in Section 3. The first version^ has been used successfully in several real-size projects. These experiments have also revealed several technical and methodological problems, which we describe briefly.

• Functional limits of the tool. Though DB-MAIN Version 1 already offers a reasonable set of integrity constraints, a more powerful model was often needed to better describe physical data structures or to express semantic structures. Some useful schema transformations were lacking, and the scripting facilities of the assistants were found very interesting, but not powerful enough in some situations. As expected, several users asked for "full program reverse engineering".

• Problem and tool complexity. Reverse engineering is a software engineering domain based on specific, and still unstable, concepts and techniques, and in which much remains to be learned. Not surprisingly, true CARE tools are complex, and DB-MAIN is no exception when used at its full potential. Mastering some of its functions requires intensive training which can be justified for complex projects only. In addition, writing and testing specific PDL pattern libraries and Voyager-2 functions can cost several weeks.


• Performance. While some components of DB-MAIN proved very efficient when processing large projects with multiple sources, some others slowed down as the size of the specifications grew. That was the case when the pattern-matching engine parsed large texts for a dozen patterns, and for the dataflow graph constructor which uses the former. However, no dramatic improvement can be expected, due to the intrinsic complexity of pattern-matching algorithms for standard machine architectures.

• Viewing the specifications. When a source text has been parsed, DB-MAIN builds a first-cut logical schema. Though the tool proposes automatic graphical layouts, positioning the extracted objects in a natural way is up to the analyst. This task was often considered painful, even on a large screen, for schemas comprising many objects and connections. In the same realm, several users found that the graphical representations were not as attractive as expected for very large schemas, and that the textual views often proved more powerful and less cumbersome.

The second version, which is under development, will address several of the observed weaknesses of Version 1, and will include a richer specification model and extended toolsets. We will mainly mention some important extensions: a view derivation mechanism, which will solve the problem of mastering large schemas, a view integration processor to build a global schema from extracted partial views, the first version of the MDL compiler, of the methodology engine, and of the history manager, and an extended program slicer. The repository will be extended to the representation of additional integrity constraints, and of other system components such as programs. A more powerful version of the Voyager-2 language and a more sophisticated Transformation assistant (evoked in Section 9) are planned for Version 2 as well. We also plan to experiment with the concept of design recovery on actual applications.

Acknowledgments

The detailed comments by the anonymous reviewers have been most useful to improve the readability and the consistency of this paper, and to make it as informative as possible. We would also like to thank Linda Wills for her friendly encouragements.

Notes

1. A table is in 4NF iff all the non-trivial multivalued dependencies are functional. The BCNF (Boyce-Codd normal form) is weaker but has a more handy definition: a table is in BCNF iff each functional determinant is a key.

2. A CASE tool offering a rich toolset for reverse engineering is often called a CARE (Computer-Aided Reverse Engineering) tool.

3. A Data Management System (DMS) is either a File Management System (FMS) or a Database Management System (DBMS).

4. Though some practices (e.g., disciplined use of COPY or INCLUDE meta-statements to include common data structure descriptions in programs), and some tools (such as data dictionaries) may simulate such centralized schemas.

5. There is no miracle here: for instance, the data are imported, or organizational and behavioural rules make them satisfy these constraints.


6. But methodology-aware if design recovery is intended. This aspect has been developed in Hainaut et al. (1994), and will be evoked in Section 11.

7. For instance, Belgium commonly uses three legal languages, namely Dutch, French and German. As a consequence, English is often used as a de facto common language.

8. The part of the DB-MAIN project in charge of this aspect is the DB-Process sub-project, fully supported by the Communauté Française de Belgique.

9. In order to develop contacts and collaboration, an Education version (complete but limited to small applications) and its documentation have been made available. This free version can be obtained by contacting the first author at jlh@info.fundp.ac.be.

References

Andersson, M. 1994. Extracting an entity relationship schema from a relational database through reverse engi­neering. In Proc. of the 13th Int. Conf on ER Approach, Manchester: Springer-Verlag.

Batini, C, Ceri, S., and Navathe, S.B. 1992. Conceptual Database Design. Benjamin-Cummings. Batini, C, Di Battista, 0., and Santucci, G. 1993. Structuring primitives for a dictionary of entity relationship data

schemas. IEEE TSE, 19(4). Blaha, M.R. and Premerlani, W.J. 1995. Observed idiosyncracies of relational database designs. In Proc. of the

2nd IEEE Working Conf. on Reverse Engineering, Toronto: IEEE Computer Society Press, Bolois, G. and Robillard, P. 1994. Transformations in reengineering techniques. In Proc. of the 4th Reengineering

Forum Reengineering in Practice, Victoria, Canada. Casanova, M. and Amarel de Sa, J. 1983. Designing entity relationship schemas for conventional information

systems. In Proc. of ERA, pp. 265-278. Casanova, M.A. and Amaral, De Sa 1984. Mapping uninterpreted schemes into entity-relationship diagrams: Two

applications to conceptual schema design. In IBM J. Res. & Develop., 28(1). Chiang, R.H,, Barron, TM,, and Storey, V.C. 1994. Reverse engineering of relational databases: Extraction of an

EER model from a relational database. Joum. of Data and Knowledge Engineering, 12(2): 107-142. Date, C,J. 1994. An Introduction to Database Systems. Vol. 1, Addison-Wesley. Davis, K.H. and Arora, A.K. 1985, A Methodology for translating a conventional file system into an entity-

relationship model. In Proc. of ERA, lEEE/North-HoUand. Davis, K.H. and Arora, A.K. 1988. Converting a relational database model to an entity relationship model. In

Proc. of ERA: A Bridge to the User, North-Holland. Edwards, H.M. and Munro, M. 1995. Deriving a logical model for a system using recast method. In Proc. of the

2nd IEEE WC on Reverse Engineering. Toronto: IEEE Computer Society Press. Fikas, S.F. 1985. Automating the transformational development of software. IEEE TSE, SE-11:1268-1277. Fong, J. and Ho, M. 1994, Knowledge-based approach for abstracting hierarchical and network schema semantics.

In Proc. of the 12th Int. Conf. on ER Approach, Arlington-Dallas: Springer-Verlag. Fonkam, M.M. and Gray, W.A. 1992. An approach to ehciting the semantics of relational databases. In Proc. of

4th Int. Conf on Advance Information Systems Engineering—CAiSE'92, pp. 463-480, LNCS, Springer-Verlag. Elmasri, R. and Navathe, S. 1994. Fundamentals of Database Systems. Benjamin-Cummings. Hainaut, J.-L. 1981. Theoretical and practical tools for data base design. In Proc. Intern. VLDB Conf, ACM/IEEE. Hainaut, J.-L, 1991. Entity-generating schema transformation for entity-relationship models. In Proc. of the 10th

ERA, San Mateo (CA), North-Holland. Hainaut, J.-L., Cadelli, M„ Decuyper, B., and Marchand, O. 1992. Database CASE tool architecture: Principles

for flexible design strategies. In Proc. of the 4th Int. Conf on Advanced Information System Engineering (CAiSE-92), Manchester: Springer-Veriag, LNCS.

Hainaut, J.-L., Chandelon M,, Tonneau C, and Joris, M. 1993a. Contribution to a theory of database reverse engineering. In Proc. of the IEEE Working Conf. on Reverse Engineering, Baltimore: IEEE Computer Society Press.

Hainaut, J.-L., Chandelon M., Tonneau C, and Joris, M. 1993b. Transformational techniques for database reverse engineering. In Proc. of the 12thlnt. Conf on ER Approach, Arlington-Dallas: E/R Institute and Springer-Verlag, LNCS.

44 HAINAUTETAL.

Hainaut, J.-L., Englebert, V., Henrard, J., Hick, J.-M., and Roland, D. 1994. Evolution of database applications: The DB-MAIN approach. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Hainaut, J.-L. 1995. Transformation-based database engineering. Tutorial notes, VLDB'95, Zürich, Switzerland (available at jlh@info.fundp.ac.be).
Hainaut, J.-L. 1996. Specification preservation in schema transformations: Application to semantics and statistics. Data & Knowledge Engineering, Elsevier (to appear).
Hainaut, J.-L., Roland, D., Hick, J.-M., Henrard, J., and Englebert, V. 1996. Database design recovery. In Proc. of CAiSE'96, Springer-Verlag.
Halpin, T.A. and Proper, H.A. 1995. Database schema transformation and optimization. In Proc. of the 14th Int. Conf. on ER/OO Modelling (ERA), Springer-Verlag.
Hall, P.A.V. (Ed.) 1992. Software Reuse and Reverse Engineering in Practice, Chapman & Hall.
IEEE, 1990. Special issue on Reverse Engineering, IEEE Software, January, 1990.
Johannesson, P. and Kalman, K. 1990. A method for translating relational schemas into conceptual schemas. In Proc. of the 8th ERA, Toronto, North-Holland.
Joris, M., Van Hoe, R., Hainaut, J.-L., Chandelon, M., Tonneau, C., Bodart, F. et al. 1992. PHENIX: Methods and tools for database reverse engineering. In Proc. of the 5th Int. Conf. on Software Engineering and Applications, Toulouse, December 1992, EC2 Publish.
Kobayashi, I. 1986. Losslessness and semantic correctness of database schema transformation: Another look of schema equivalence. Information Systems, 11(1):41-59.
Kozaczynski and Lilien. 1987. An extended entity-relationship (E2R) database specification and its automatic verification and transformation. In Proc. of ERA Conf.
Markowitz, K.M. and Makowsky, J.A. 1990. Identifying extended entity-relationship object structures in relational schemas. IEEE Trans. on Software Engineering, 16(8).
Navathe, S.B. 1980. Schema analysis for database restructuring. ACM TODS, 5(2).
Navathe, S.B. and Awong, A. 1988. Abstracting relational and hierarchical data with a semantic data model. In Proc. of ERA: A Bridge to the User, North-Holland.
Nilsson, E.G. 1985. The translation of COBOL data structure to an entity-rel-type conceptual schema. In Proc. of ERA, IEEE/North-Holland.
Petit, J.-M., Kouloumdjian, J., Bouliaut, J.-F., and Toumani, F. 1994. Using queries to improve database reverse engineering. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.
Premerlani, W.J. and Blaha, M.R. 1993. An approach for reverse engineering of relational databases. In Proc. of the IEEE Working Conf. on Reverse Engineering, IEEE Computer Society Press.
Potts, C. and Bruns, G. 1988. Recording the reasons for design decisions. In Proc. of ICSE, IEEE Computer Society Press.
Rauh, O. and Stickel, E. 1995. Standard transformations for the normalization of ER schemata. In Proc. of the CAiSE'95 Conf., Jyväskylä, Finland, LNCS, Springer-Verlag.
Rock-Evans, R. 1990. Reverse engineering: Markets, methods and tools, OVUM report.
Rosenthal, A. and Reiner, D. 1988. Theoretically sound transformations for practical database design. In Proc. of ERA Conf.
Rosenthal, A. and Reiner, D. 1994. Tools and transformations, rigorous and otherwise, for practical database design. ACM TODS, 19(2).
Rolland, C. 1993. Modeling the requirements engineering process. In Proc. of the 3rd European-Japanese Seminar in Information Modeling and Knowledge Bases, Budapest (preprints).
Sabanis, N. and Stevenson, N. 1992. Tools and techniques for data remodelling Cobol applications. In Proc. of the 5th Int. Conf. on Software Engineering and Applications, Toulouse, 7-11 December, pp. 517-529, EC2 Publish.
Selfridge, P.G., Waters, R.C., and Chikofsky, E.J. 1993. Challenges to the field of reverse engineering. In Proc. of the 1st WC on Reverse Engineering, pp. 144-150, IEEE Computer Society Press.
Shoval, P. and Shreiber, N. 1993. Database reverse engineering: From relational to the binary relationship model. Data and Knowledge Engineering, 10(10).
Signore, O., Loffredo, M., Gregori, M., and Cima, M. 1994. Reconstruction of E-R schema from database applications: A cognitive approach. In Proc. of the 13th Int. Conf. on ER Approach, Manchester: Springer-Verlag.


Springsteel, F.N. and Kou, C. 1990. Reverse data engineering of E-R designed relational schemas. In Proc. of Databases, Parallel Architectures and their Applications.
Teorey, T.J. 1994. Database Modeling and Design: The Fundamental Principles, Morgan Kaufmann.
Vermeer, M. and Apers, P. 1995. Reverse engineering of relational databases. In Proc. of the 14th Int. Conf. on ER/OO Modelling (ERA).
Weiser, M. 1984. Program slicing. IEEE TSE, 10:352-357.
Wills, L., Newcomb, P., and Chikofsky, E. (Eds.) 1995. Proc. of the 2nd IEEE Working Conf. on Reverse Engineering, Toronto: IEEE Computer Society Press.
Winans, J. and Davis, K.H. 1990. Software reverse engineering from a currently existing IMS database to an entity-relationship model. In Proc. of ERA: The Core of Conceptual Modelling, pp. 345-360, North-Holland.

Automated Software Engineering, 3, 47-76 (1996) © 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Understanding Interleaved Code

SPENCER RUGABER, KURT STIREWALT {spencer,kurt}@cc.gatech.edu
College of Computing, Georgia Institute of Technology, Atlanta, GA

LINDA M. WILLS linda.wills@ee.gatech.edu
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA

Abstract. Complex programs often contain multiple, interwoven strands of computation, each responsible for accomplishing a distinct goal. The individual strands responsible for each goal are typically delocalized and overlap rather than being composed in a simple linear sequence. We refer to these code fragments as being interleaved. Interleaving may be intentional (for example, in optimizing a program, a programmer might use some intermediate result for several purposes), or it may creep into a program unintentionally, due to patches, quick fixes, or other hasty maintenance practices. To understand this phenomenon, we have looked at a variety of instances of interleaving in actual programs and have distilled characteristic features. This paper presents our characterization of interleaving and the implications it has for tools that detect certain classes of interleaving and extract the individual strands of computation. Our exploration of interleaving has been done in the context of a case study of a corpus of production mathematical software, written in Fortran, from the Jet Propulsion Laboratory. This paper also describes our experiences in developing tools to detect specific classes of interleaving in this software, driven by the need to enhance a formal description of this software library's components. The description, in turn, aids in the automated component-based synthesis of software using the library.

With every leaf a miracle.

— Walt Whitman.

Keywords: software understanding, interleaving, domain models, specification extraction, analysis tools.

1. Introduction

Imagine being handed a software system you have never seen before. Perhaps you need to track down a bug, rewrite the software in another language or extend it in some way. We know that software maintenance tasks such as these consume the majority of software costs (Boehm, 1981), and we know that reading and understanding the code requires more effort than actually making the changes (Fjeldstad and Hamlen, 1979). But we do not know what makes understanding the code itself so difficult.

Letovsky has observed that programmers engaged in software understanding activities typically ask "how" questions and "why" questions (Letovsky, 1988). The former require an in-depth knowledge of the programming language and the ways in which programmers express their software designs. This includes knowledge of common algorithms and data structures and even concerns style issues, such as indentation and use of comments. Nevertheless, the answers to "how" questions can be derived from the program text. "Why" questions are more troublesome. Answering them requires not only comprehending the program text but relating it to the program's purpose: solving some sort of problem.


And the problem being solved may not be explicitly stated in the program text; nor is the rationale the programmer had for choosing the particular solution usually visible.

This paper is concerned with a specific difficulty that arises when trying to answer "why" questions about computer programs. In particular, it is concerned with the phenomenon of interleaving in which one section of a program accomplishes several purposes, and disentangling the code responsible for each purpose is difficult. Unraveling interleaved code involves discovering the purpose of each strand of computation, as well as understanding why the programmer decided to interleave the strands. To demonstrate this problem, we examine an example program in a step-by-step fashion, trying to answer the questions "why is this program the way it is?" and "what makes it difficult to understand?"

1.1. NPEDLN

The Fortran program, called NPEDLN, is part of the SPICELIB library obtained from the Jet Propulsion Laboratory and intended to help space scientists analyze data returned from space missions. The acronym NPEDLN stands for Nearest Point on Ellipsoid to Line. The ellipsoid is specified by the lengths of its three semi-axes (A, B, and C), which are oriented with the x, y, and z coordinate axes. The line is specified by a point (LINEPT) and a direction vector (LINEDR). The nearest point is contained in a variable called PNEAR. The full program consists of 565 lines; an abridged version can be found in the Appendix with a brief description of the subroutines it calls and the variables it uses. The executable statements, with comments and declarations removed, are shown in Figure 1.

The lines of code in NPEDLN that actually compute the nearest point are somewhat hard to locate. One reason for this has to do with error checking. It turns out that SPICELIB includes an elaborate mechanism for reporting and recovering from errors, and roughly half of the code in NPEDLN is used for this purpose. We have indicated those lines by shading in Figure 2. The important point to note is that although it is natural to program in a way that intersperses error checks with computational code, it is not necessary to do so. In principle, an entirely separate routine could be constructed to make the checks, and NPEDLN called only when all the checks are passed. Although this approach would require redundant computation and potentially more total lines of code, the resultant computations in NPEDLN would be shorter and easier to follow.

In some sense, the error handling code and the rest of the routine realize independent plans. We use the term plan to denote a description or representation of a computational structure that the designers have proposed as a way of achieving some purpose or goal in a program. This definition is distilled from definitions in (Letovsky and Soloway, 1986, Rich and Waters, 1990, Selfridge et al., 1993). Note that a plan is not necessarily stereotypical or used repeatedly; it may be novel or idiosyncratic. Following (Rich and Waters, 1990, Selfridge et al., 1993), we reserve the term cliche for a plan that represents a standard, stereotypical form, which can be detected by recognition techniques, such as (Hartman, 1991, Letovsky, 1988, Kozaczynski and Ning, 1994, Quilici, 1994, Rich and Wills, 1990, Wills, 1992). Plans can occur at any level of abstraction, from architectural overviews to code. By extracting the error checking plan from NPEDLN, we get the much smaller and, presumably, more understandable program shown in Figure 3.


Figure 1. NPEDLN minus comments and declarations.


Figure 2. Code with error handling highlighted.

The structure of an understanding process begins to emerge: detect a plan, such as error checking, in the code and extract it, leaving a smaller and more coherent residue for further analysis; document the extracted plan independently; and note the ways in which it interacts with the rest of the code.

We can apply this approach further to NPEDLN's residual code in Figure 3. NPEDLN has a primary goal of computing the nearest point on an ellipsoid to a specified line. It also has a related goal of ensuring that the computations involved have stable numerical behavior; that is, that the computations are accurate in the presence of a wide range of numerical inputs. A standard trick in numerical programming for achieving stability is to scale the data involved in a computation, perform the computation, and then unscale the results.


Figure 3. The residual code without the error handling plan.

The code responsible for doing this in NPEDLN is scattered throughout the program's text. It is highlighted in the excerpt shown in Figure 4.
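The shape of this trick can be sketched apart from the Fortran details. The short Python fragment below is an illustrative sketch only, not code from SPICELIB; the solve_scaled argument is a hypothetical stand-in for whatever computation the wrapper protects. It shows the three-part structure: scale the inputs by the largest semi-axis, compute on the scaled data, and unscale the results.

def nearest_point_scaled(semi_axes, line_point, solve_scaled):
    """Sketch of the scale/compute/unscale wrapper used for stability."""
    # Scale: divide all inputs by the largest semi-axis magnitude.
    scale = max(abs(a) for a in semi_axes)
    scaled_axes = [a / scale for a in semi_axes]
    scaled_point = [p / scale for p in line_point]

    # Compute: solve the problem on the well-conditioned, scaled data.
    near, dist = solve_scaled(scaled_axes, scaled_point)

    # Unscale: multiply the results back by the same factor.
    return [x * scale for x in near], dist * scale

# Stand-in solver that simply echoes the scaled point; prints ([16.0, 0.0, 0.0], 8.0).
print(nearest_point_scaled([2.0, 4.0, 8.0], [16.0, 0.0, 0.0],
                           lambda axes, pt: (pt, 1.0)))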

The delocalized nature of this "scale-unscale" plan makes it difficult to gather together all the pieces involved for consistent maintenance. It also gets in the way of understanding the rest of the code, since it provides distractions that must be filtered out. Letovsky and Soloway's cognitive study (Letovsky and Soloway, 1986) shows the deleterious effects of delocalization on comprehension and maintenance.

When we extract the scale-unscale code from NPEDLN, we are left with the smaller code segment shown in Figure 5 that more directly expresses the program's purpose: computing the nearest point.

There is one further complication, however. It turns out that NPEDLN not only computes the nearest point from a line to an ellipsoid, it also computes the shortest distance between the line and the ellipsoid. This additional output (DIST) is convenient to construct because it can make use of intermediate results obtained while computing the primary output (PNEAR).

This is illustrated in Figure 6. (The computation of DIST using VDIST is actually the last computation performed by the subroutine NPELPT, which NPEDLN calls; we have pulled this computation out of NPELPT for clarity of presentation.)

Note that an alternative way to structure SPICELIB would be to have separate routines for computing the nearest point and the distance. The two routines would each be more coherent, but the common intermediate computations would have to be repeated, both in the code and at runtime.

The "pure" nearest point computation is shown in Figure 7. It is now much easier to see the primary computational purpose of this code.


Figure 4. Code with scale-unscale plan highlighted.

Figure 5. The residual code without the scale-unscale plan.


Figure 6. Code with distance plan highlighted.

Figure 7. The residual code without the distance plan.


The production version of NPEDLN contains several interleaved plans. Intermediate Fortran computations are shared by the nearest point and distance plans. A delocalized scaling plan is used to improve numerical stability, and an independent error handling plan is used to deal with unacceptable input. Knowledge of the existence of the several plans, how they are related, and why they were interleaved is required for a deep understanding of NPEDLN.

1.2. Contributions

In this paper, we present a characterization of interleaving, incorporating three aspects that make interleaved code difficult to understand: independence, delocalization, and resource sharing. We have distilled this characterization from an empirical examination of existing software, primarily SPICELIB. Secondary sources of existing software which we also examined are a Cobol database report writing system from the US Army and a program for finding the roots of functions, presented and analyzed in (Basili and Mills, 1982) and (Rugaber et al., 1990). We relate our characterization of interleaving to existing concepts in the literature, such as delocalized plans (Letovsky and Soloway, 1986), coupling (Yourdon and Constantine, 1979), and redistribution of intermediate results (Hall, 1990, Hall, 1991).

We then describe the context in which we are exploring and applying these ideas. Our driving program comprehension problem is to elaborate and validate existing partial specifications of the JPL library routines to facilitate the automation of specification-driven generation of programs using these routines. We have developed analysis tools, based on the Software Refinery, to detect interleaving. We describe the analyses that we have formulated to detect specific classes of interleaving that are particularly useful in elaborating specifications. We then discuss open issues concerning requirements on software and plan representations that detection imposes, the role of application knowledge in addressing the interleaving problem, scaling up the scope of interleaving, and the feasibility of building tools to assist interleaving detection and extraction. We conclude with a description of how related research in cliche recognition as well as non-recognition techniques can play a role in addressing the interleaving problem.

2. Interleaving

Programmers solve problems by breaking them into pieces. Pieces are programming language implementations of plans, and it is common for multiple plans to occur in a single code segment. We use the term interleaving to denote this merging (Rugaber et al., 1995).

Interleaving expresses the merging of two or more distinct plans within some contiguous textual area of a program. Interleaving can be characterized by the delocalization of the code for the individual plans involved, the sharing of some resource, and the contribution of multiple, independent plans to the program's overall purpose.


Interleaving may arise for several reasons. It may be intentionally introduced to improve program efficiency. For example, it may be more efficient to compute two related values in one place than to do so separately. Intentional interleaving may also be performed to deal with non-functional requirements, such as numerical stability, that impose global constraints which are satisfied by diffuse computational structures. Interleaving may also creep into a program unintentionally, as a result of inadequate software maintenance, such as adding a feature locally to an existing routine rather than undertaking a thorough redesign. Or interleaving may arise as a natural by-product of expressing separate but related plans in a linear, textual medium. For example, accessors and constructors for manipulating data structures are typically interleaved throughout programs written in traditional programming languages due to their procedural, rather than object-oriented, structure. Interleaving cannot always be avoided (e.g., due to limitations of the available programming language) and may be desirable (e.g., for economy and avoiding duplication which can lead to inconsistent maintenance). Regardless of why interleaving is introduced, it complicates understanding a program. This makes it difficult to perform tasks such as extracting reusable components, localizing the effects of maintenance changes, and migrating to object-oriented languages.

There are several reasons interleaving is a source of difficulties. The first has to do with delocalization. Because two or more design purposes are implemented in a single segment of code, the individual code fragments responsible for each purpose are more spread out than they would be if they were segregated in their own code segments. Another reason interleaving presents a problem is that when it is the result of poorly thought out maintenance activities such as "patches" and "quick fixes", the original, highly coherent structure of the system may degrade. Finally, the rationale behind the decision to intentionally introduce interleaving is often not explicitly recorded in the program. For example, although interleaving is often introduced for purposes of optimization, expressing intricate optimizations in a clean and well-documented fashion is not typically done. For all of these reasons, our ability to comprehend code containing interleaved fragments is compromised.

Our goal is not to completely eliminate interleaving from programs, since that is not always desirable or possible to do at the level of source text. Rather, it is to find ways of detecting interleaving and representing the interleaved plans at a level of abstraction that makes the individual plans and their interrelationships clear.

We now examine each of the characteristics of interleaving (delocalization, sharing, and independence) in more detail.

2.1. Delocalization

Delocalization is one of the key characteristics of interleaving: one or more parts of a plan are spatially separated from other parts by code from other plans with which they are interleaved.


      SUBROUTINE NPEDLN ( A, B, C, LINEPT, LINEDR, PNEAR, DIST )

      CALL UNORM ( LINEDR, UDIR, MAG )
      ... [error checks]

      SCALE = MAX ( DABS(A), DABS(B), DABS(C) )
      SCLA  = A / SCALE
      SCLB  = B / SCALE
      SCLC  = C / SCALE
      ... [error checks]

      SCLPT(1) = LINEPT(1) / SCALE
      SCLPT(2) = LINEPT(2) / SCALE
      SCLPT(3) = LINEPT(3) / SCALE

      CALL VMINUS ( UDIR, OPPDIR )
      CALL SURFPT ( SCLPT, UDIR,   SCLA, SCLB,
     .              SCLC, PT(1,1), FOUND(1) )
      CALL SURFPT ( SCLPT, OPPDIR, SCLA, SCLB,
     .              SCLC, PT(1,2), FOUND(2) )
      ... [checking for intersection of the line with the ellipsoid]
      IF ( FOUND(I) ) THEN
         DIST = 0.0D0
         CALL VSCL ( SCALE, PNEAR, PNEAR )
         RETURN
      END IF
      ... [handling the non-intercept case]
      CALL VSCL ( SCALE, PNEAR, PNEAR )
      DIST = SCALE * DIST

      RETURN
      END

Figure 8. Portions of the NPEDLN Fortran program. Shaded regions highlight the lines of code responsible for scaling and unscaling.


The "scale-unscale" pattern found in NPEDLN is a simple example of a more general delocalized plan that we refer to as a reformulation wrapper, which is frequently interleaved with computations in SPICELIB. Reformulation wrappers transform one problem into another that is simpler to solve and then transfer the solution back to the original situation. Other examples of reformulation wrappers in SPICELIB are reducing a three-dimensional geometry problem to a two-dimensional one and mapping an ellipsoid to the unit sphere to make it easier to solve intersection problems.

Delocalization may occur for a variety of reasons. One is that there may be an inherently non-local relationship between the components of the plan, as is the case with reformulation wrappers, which makes the spatial separation necessary. Another reason is that the intermediate results of part of a plan may be shared with another plan, causing the plans to overlap and their steps to be shuffled together; the steps of one plan separate those of the other. For example, in Figure 8, part of the unscale plan (computing the scaling factor) is separated from the rest of the plan (multiplying by the scaling factor) in all unscalings of the results (DIST and PNEAR). This allows the scaling factor to be computed once and the result reused in all scalings of the inputs A, B, and C and in unscaling the results.

Realizing that a reformulation wrapper or some other delocalized plan is interleaved with a particular computation can help prevent comprehension failures during maintenance (Letovsky and Soloway, 1986). It can also help detect when the delocalized plan is incomplete, as it was in an earlier version of our example subroutine, whose modification history includes the following correction:

C- SPICELIB Version 1.2.0, 25-NOV-1992 (NJB)

C Bug fix: in the intercept case, PNEAR is now

C properly re-scaled prior to output. Formerly,

C it was returned without having been re-scaled.

2.2. Resource Sharing

The sharing of some resource is characteristic of interleaving. When interleaving is introduced into a program, there is normally some implicit relationship between the interleaved plans, motivating the designer to choose to interleave them. An example of this within NPEDLN is shown in Figure 9. The shaded portions of the code shown are shared between the two computations for PNEAR and DIST. In this case, the common resources shared by the interleaved plans are intermediate data computations. The implementations for computing the nearest point and the shortest distance overlap in that a single structural element contributes to multiple goals.

The sharing of the results of some subcomputation in the implementation of two distinct higher level operations is termed redistribution of intermediate results by Hall (Hall, 1990, Hall, 1991). More specifically, redistribution is a class of function sharing optimizations which are implemented simply by tapping into the dataflow from some value producer and feeding it to an additional target consumer, introducing fanout into the dataflow. Redistribution covers a wide range of common types of function sharing optimizations, including common subexpression elimination and generalized loop fusion. Hall developed an automated technique for redistributing results for use in optimizing code generated from general-purpose reusable software components.


      SUBROUTINE NPEDLN ( A, B, C, LINEPT, LINEDR, PNEAR, DIST )

      ... [first 100 lines of NPEDLN]

      CALL NPELPT ( PRJPT, PRJEL, PRJNPT )
      DIST = VDIST ( PRJNPT, PRJPT )
      CALL VPRJPI ( PRJNPT, PRJPL, CANDPL, PNEAR, IFOUND )
      IF ( .NOT. IFOUND ) THEN
         ... [error handling]
      END IF
      CALL VSCL ( SCALE, PNEAR, PNEAR )
      DIST = SCALE * DIST
      CALL CHKOUT ( 'NPEDLN' )
      RETURN
      END

Figure 9. Portions of NPEDLN, highlighting two overlapping computations.

Redistribution of results is a form of interleaving in which the resources shared are data values.
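A minimal illustration of redistribution, as a Python sketch rather than the Fortran in SPICELIB (the plane example and all names are ours): one intermediate value feeds two consumers, the nearest point and the distance, introducing fanout into the dataflow instead of recomputing the intermediate in two separate routines.

def nearest_point_and_distance(point, plane_normal, plane_constant):
    """The signed offset t is the shared intermediate result."""
    # Producer: one intermediate computation ...
    norm2 = sum(n * n for n in plane_normal)
    t = (plane_constant - sum(p * n for p, n in zip(point, plane_normal))) / norm2
    # ... consumer 1: the nearest point on the plane ...
    near = [p + t * n for p, n in zip(point, plane_normal)]
    # ... consumer 2: the distance, reusing the same intermediate.
    dist = abs(t) * norm2 ** 0.5
    return near, dist

# Nearest point on the plane z = 1 to the origin is (0, 0, 1), at distance 1.
print(nearest_point_and_distance([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], 1.0))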

The commonality between interleaved plans might be in the form of other shared resources besides data values, for example control structures, lexical module structures, and names. Often when interleaving is unintentional, the resource shared is code space: the code statements of two plans are interleaved because they must be expressed in linear text. Typically, intentional interleaving involves sharing higher level resources.

Control coupling. Control conditions may be redistributed just as data values are. The use of control flags allows control conditions to be determined once but used to affect execution at more than one location in the program. In NPEDLN, for example, SURFPT is called to compute the intersection of the line with the ellipsoid. This routine returns a control flag, FOUND, indicating whether or not the intersection exists. This flag is then used outside of SURFPT to control whether the intercept or non-intercept case is to be handled, as is shown in Figure 10.

The use of control flags is a special form of control coupling: "any connection between two modules that communicates elements of control (Yourdon and Constantine, 1979)," typically in the form of function codes, flags, or switches (Myers, 1975). This sharing of control information between two modules increases the complexity of the code, complicating comprehension and maintenance.

Content coupling. Another form of resource sharing occurs when the lexical structure of a module is shared among several related functional components. For example, the entire contents of a module may be lexically included in another. This sometimes occurs when a programmer wants to take advantage of a powerful intraprocedural optimizer limited to improving the code in a single routine. Another example occurs when a programmer uses ENTRY statements to partially overlap the contents of several routines so that they may share access to some state variables.


      CALL SURFPT ( SCLPT, UDIR,   SCLA, SCLB,
     .              SCLC, PT(1,1), FOUND(1) )
      CALL SURFPT ( SCLPT, OPPDIR, SCLA, SCLB,
     .              SCLC, PT(1,2), FOUND(2) )

      DO 50001 I = 1, 2
         IF ( FOUND(I) ) THEN
            ... [handling the intercept case]
            RETURN
         END IF
50001 CONTINUE

C     Getting here means the line doesn't
C     intersect the ellipsoid.

      ... [handling the non-intercept case]
      RETURN
      END

Figure 10. Fragment of subroutine showing control coupling.

This is sometimes done in a language, such as Fortran, that does not contain an encapsulation mechanism like packages or objects.

These two practices are examples of a phenomenon called content coupling (Yourdon and Constantine, 1979), in which "some or all of the contents of one module are included in the contents of another," and which often manifests itself in the form of a multiple-entry module. Content coupling makes it difficult to independently modify or maintain the individual functions.

Name Sharing. A simple form of sharing is the use of the same variable name for two different purposes. This can lead to incorrect assumptions about the relationship between subcomputations within a program.

In general, the difficulty that resource sharing introduces is that it causes ambiguity in interpreting the purpose of program pieces. This can lead to incorrect assumptions about what effect changes will have, since the maintainer might be focusing on only one of the actual uses of the resource (variable, value, control flag, data structure slot, etc.).

2.3. Independence

While interleaving is introduced to take advantage of commonalities, it is also true that the interleaved plans each have a distinct purpose. Because understanding relates program goals to program code, having two goals realized in one section of code can be confusing.

There are several ways of dealing with this problem. One way would be to make two copies of the code segment, each responsible for one of the goals, and both duplicating any common code. In the NPEDLN example, a separate routine could be provided that is responsible for computing DIST. Although this may make understanding each of the routines somewhat simpler, there are costs due to the extra code and the implicit need, often forgotten, to update both versions of the common code whenever it needs to be fixed.


A variant of this approach is to place the common code in a separate routine, replacing it in each of the two copies with a call to the new routine. This factoring approach works well when the common code is contiguous, but quickly becomes unworkable if the common code is interrupted by the plan-specific code.

The bottom line is that this style of intentional interleaving confronts the programmer with a tradeoff between efficiency and maintainability/understandability. Ironically, making the efficiency choice may hinder efforts to make the code more efficient and reusable in the long run, such as parallelizing or "objectifying" the code (converting it to an object-oriented style).

3. Case Study

In order to better understand interleaving, we have undertaken a case study of production library software. The library, called SPICELIB, consists of approximately 600 mathematical programs, written in Fortran by programmers at the Jet Propulsion Laboratory, for analyzing data sent back from space missions. The software performs calculations related to solar system geometry, such as coordinate frame conversions; intersections of rays, ellipses, planes, and ellipsoids; and light-time calculations. NPEDLN comes from this library.

We were introduced to SPICELIB by researchers at NASA Ames, who have developed a component-based software synthesis system called Amphion (Lowry et al., 1994, Stickel et al., 1994). Amphion automatically constructs programs that compose routines drawn from SPICELIB. It does this by making use of a domain theory that includes formal specifications of the library routines, connecting them to the abstract concepts of solar system geometry. The domain theory is encoded in a structured representation, expressed as axioms in first-order logic with equality. A space scientist using Amphion can schematically specify the geometry of a problem through a graphical user interface, and Amphion automatically generates Fortran programs to call SPICELIB routines to solve the described problem. Amphion is able to do this by proving a theorem about the solvability of the problem and, as a side effect, generating the appropriate calls. This is shown in the bottom half of Figure 11. Amphion has been installed at JPL and used by space scientists to successfully generate over one hundred programs to solve solar system kinematics problems. The programs consist of dozens of subroutine calls and are typically synthesized in under three minutes of CPU time on a Sun Sparc 2 (Lowry et al., 1994).

Amphion's success depends on how accurate, consistent, and complete its domain theory is. An essential program understanding task is to validate the domain theory by checking it against the SPICELIB routines and extending it when incompletenesses are found. To do this, we need to be able to pull apart interleaved strands. For example, one incompleteness in Amphion's domain theory is that it does not fully cover the functionality of the routines in SPICELIB. Some routines compute more than one result. For example, NPEDLN computes the nearest point on an ellipsoid to a line as well as the shortest distance between that point and the ellipsoid. However, the domain theory does not describe both of these values. In the case of NPEDLN, only the nearest point computation is modelled, not the shortest distance. In these routines, it is often the case that the code responsible for the secondary functionalities is interleaved with the code for the primary function covered by Amphion's domain theory.


Figure 11. Applying interleaving detection to component-based reuse.

Uncovering the secondary functionality requires unraveling and understanding two interleaved computations.

Another way in which Amphion's current domain theory is incomplete is that it does not express preconditions on the use of the library routines; for example, that a line given as input to a routine must not be the zero vector or that an ellipsoid's semi-axes must be large enough to be scalable. It is difficult to detect the code responsible for checking these preconditions because it is usually tightly interleaved with the code for the primary computation in order to take advantage of intermediate results computed for the primary computation.

In collaboration with NASA Ames researchers, we explored ways in which Amphion's domain theory is incomplete, and we built program comprehension techniques to extend it. As the top half of Figure 11 shows, we developed mechanisms for detecting particular classes of interleaving, with the aim of extending the incomplete domain theory. In the process, we also performed analyses to gather empirical information about how much of SPICELIB is covered by the domain theory.

We have built interleaving detection mechanisms and empirical analyzers using a commercial tool called the Software Refinery (Reasoning Systems Inc.). This is a comprehensive tool suite including language-specific parsers and browsers for Fortran, C, Ada, and Cobol, language extension mechanisms for building analyzers for new languages, and a user interface construction tool for displaying the results of analyses. It maintains an object-oriented repository for holding the results of its analyses, such as abstract syntax trees and symbol tables. It provides a powerful wide-spectrum language, called Refine (Smith et al., 1985), which supports pattern matching and querying the repository. Using the Software Refinery allows us to leverage a commercially available tool as well as to evaluate the strengths and limitations of its approach to program analysis, which we discuss in Section 4.4.


3.1. Domain Theory Elaboration in Synthesis and Analysis

Our motivations for validating and extending a partial domain theory of existing software come both from the synthesis and from the analysis perspectives. The primary motivations for doing this from the synthesis perspective are to make component retrieval more accurate in support of reuse, to assist in updating and growing the domain theory as new software components are added, and to improve the software synthesized.

From the software analysis perspective, the refinement and elaboration of the domain theory, based on what is discovered in the code, is a primary activity, driving the generation of hypotheses and informing future analyses. The process of understanding software involves two parallel knowledge acquisition activities (Brooks, 1983, Ornburn and Rugaber, 1992, Soloway and Ehrlich, 1984):

1. using domain knowledge to understand the code: knowledge about the application sets up expectations about how abstract concepts are typically manifested in concrete code implementations;

2. using knowledge of the code to understand the domain: what is discovered in the code is used to build up a description of various aspects of the application and to help answer questions about why certain code structures exist and what their purpose is with respect to the application.

We are studying interleaving in the context of performing these activities, given SPICELIB and an incomplete theory of its application domain. We are targeting our detection of interleaving toward elaborating the existing domain theory. We are also looking for ways in which the current knowledge in the domain theory can guide detection and ultimately comprehension.

3.2. Extracting Preconditions

Using the Software Refinery, we automated a number of program analyses, one of which is the detection of subroutine parameter precondition checks. A precondition is a Boolean guard controlling execution of a routine. Preconditions normally occur early in the code of a routine, before a significant commitment (in terms of execution time and state changes that must be reversed) is made to execute the routine. Because precondition checks are often interspersed with the computation of intermediate results, they tend to delocalize the plans that perform the primary computational work. Moreover, precondition computations are usually part of a larger plan that detects exceptional, possibly erroneous conditions in the state of a running program, and then takes alternative action when these conditions arise, such as returning with an error code, signaling, or invoking error handlers. In some instances the majority of the lines of code in a routine are there to deal with the preconditions and resulting exception handling rather than to actually implement the base plan of the routine.

We found many examples of precondition checks on input parameters in our empirical analysis of SPICELIB. One such check occurs in the subroutine SURFPT and is shown in Figure 12.


C$Procedure SURFPT ( Surface point on an ellipsoid )

      SUBROUTINE SURFPT ( POSITN, U, A, B, C, POINT, FOUND )

      DOUBLE PRECISION U ( 3 )
      ...declarations...

C     Check the input vector to see if it is the zero vector. If it is,
C     signal an error and return.
C
      IF ( ( U(1) .EQ. 0.0D0 ) .AND.
     .     ( U(2) .EQ. 0.0D0 ) .AND.
     .     ( U(3) .EQ. 0.0D0 ) ) THEN
         CALL SETMSG ( 'SURFPT: The input vector is the zero vector.' )
         CALL SIGERR ( 'SPICE(ZEROVECTOR)' )
         CALL CHKOUT ( 'SURFPT' )
         RETURN
      END IF

Figure 12. A fragment of the subroutine SURFPT in SPICELIB. This fragment shows a precondition check which invokes an exception if all of the elements of the U array are 0.

SURFPT finds the intersection (POINT) of a ray (represented by a point POSITN and a direction vector U) with an ellipsoid (represented as three semi-axis lengths A, B, and C), if such an intersection exists (indicated by FOUND). One of the preconditions checked by SURFPT is that the direction vector U is not the zero vector.

Parameter precondition checks make explicit the assumptions a subroutine places on its inputs. The process of understanding a subroutine can be facilitated by detecting its precondition checks and using the information they encode to elaborate a high-level specification of the subroutine. We have created a tool that detects parameter precondition checks and extracts the preconditions into a documentation form suitable for expression as a partial specification. The specifications can then be compared against the Amphion domain model.

Precondition checks are particularly difficult to understand when they are sprinkled throughout the code of a subroutine as opposed to being concentrated at the beginning. However, we discovered that, though interleaved, these checks could be heuristically identified in SPICELIB by searching for IF statements whose predicates are unmodified input parameters (or simple dataflow dependents of them) and whose bodies invoke exception handlers. The logical negation of each of the predicates forms a conjunct in the precondition of the subroutine. The analysis that decides whether or not IF statements test only unmodified input parameters is specific to the Fortran language; but the analysis that decides if a code fragment is an exception plan depends on the fact that exceptions are dealt with in a stylized and stereotypical manner in SPICELIB. The implication is that the Fortran-specific portion is not likely to need changing when we apply the tool to a new Fortran application, whereas the SPICELIB-specific portion will certainly need to change. With this in mind, we chose a tool architecture that allows flexibility in keeping these different types of pattern knowledge separate and independently adaptable.


Detecting Exception Handlers. In general, we need application-specific knowledge about usage patterns in order to discover exception handlers. For example, the developers of SPICELIB followed a strict discipline of exception propagation by registering an exception upon detection using a subroutine SIGERR and then exiting the executing subroutine using a RETURN statement. Hence, a call to SIGERR together with a RETURN indicates a cliche for handling an exception in SPICELIB. In some other application, the form of this cliche will be different. It is, therefore, necessary to design the recognition component of our architecture around this need to specialize the tool with knowledge about the system being analyzed.

The Software Refinery provides excellent support for this design principle through the use of the rule construct and a tree-walker that applies these rules to an abstract syntax tree (AST). Rules declaratively specify state changes by listing the conditions before and after the change without specifying how the change is implemented. This is useful for including SPICELIB-specific pattern knowledge because it allows the independent, declarative expression of the different facets of the pattern.

We recognize application specific exception handlers using two rules that search the AST for a call to SIGERR followed by a RETURN statement. These rules and the Refine code that applies them are presented in detail in (Rugaber et al., 1995).
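The Refine rules are not reproduced here. Purely to illustrate the cliche they encode, the Python sketch below scans a flat list of statement strings, a hypothetical stand-in for the AST walk, for a call to SIGERR that is eventually followed by a RETURN.

import re

# Hypothetical, simplified body of one IF statement in NPEDLN.
IF_BODY = [
    "CALL SETMSG ( 'Line direction vector is the zero vector.' )",
    "CALL SIGERR ( 'SPICE(ZEROVECTOR)' )",
    "CALL CHKOUT ( 'NPEDLN' )",
    "RETURN",
]

def is_exception_handler(statements):
    """SPICELIB-specific cliche: a CALL to SIGERR followed by a RETURN."""
    saw_sigerr = False
    for stmt in statements:
        if re.match(r"\s*CALL\s+SIGERR\b", stmt):
            saw_sigerr = True
        elif saw_sigerr and re.match(r"\s*RETURN\b", stmt):
            return True
    return False

print(is_exception_handler(IF_BODY))  # True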

Detecting Guards. Discovering guards, which are IF statements that depend only upon input parameters, involves keeping track of whether or not these parameters have been modified. If they have been modified before the check, then the check probably is not a precondition check on inputs. In Fortran, a variable X can be modified by:

1. appearing on the left hand side of an assignment statement,

2. being passed into a subroutine which then modifies the formal parameter bound to X by the call,

3. being implicitly passed into another subroutine in a COMMON block and modified in this other subroutine, or

4. being explicitly aliased by an EQUIVALENCE statement to another variable which is then modified.

Currently our analysis does not detect modification through COMMON or EQUIVALENCE because none of the code in SPICELIB uses these features with formal parameters. We track modifications to input parameters by using an approximate dataflow algorithm that propagates a set of unmodified variables through the sequence of statements in the subroutine. At each statement, if a variable X in the set could be modified by the execution of the statement, then X is removed from the set. After the propagation, we can easily check whether or not an IF statement is a guard.
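The flavor of this approximate analysis can be conveyed with a small Python sketch. The statement encoding is hypothetical (the real tool walks the Software Refinery's Fortran AST): a set of still-unmodified formal parameters is propagated through the statement sequence, and an IF whose test references only members of that set and whose body is an exception handler is reported as a guard.

# Hypothetical encoding: each statement is a (kind, payload) pair, where
#   ("assign", variable)                 marks a modification, and
#   ("if", (tested_vars, is_exception))  marks a conditional.
ROUTINE = [
    ("if", ({"U1", "U2", "U3"}, True)),   # guard over untouched inputs
    ("assign", "SCALE"),
    ("assign", "A"),                      # input A is modified here
    ("if", ({"A"}, True)),                # not a guard: A already modified
]

def extract_guards(statements, inputs):
    """Return the tests of IFs that guard unmodified inputs and whose
    bodies invoke the exception-handling cliche."""
    unmodified = set(inputs)
    guards = []
    for kind, payload in statements:
        if kind == "assign":
            unmodified.discard(payload)   # conservatively drop the variable
        elif kind == "if":
            tested, is_exception = payload
            if is_exception and tested <= unmodified:
                guards.append(tested)
    return guards

# Reports one guard, over U1, U2, and U3.
print(extract_guards(ROUTINE, {"U1", "U2", "U3", "A", "B", "C"}))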

Results. The result of this analysis is a table of preconditions associated with each subroutine. Since we are targeting partial specification elaboration for Amphion, we chose to have the tool output the preconditions in LaTeX form. Figure 13 gives examples of preconditions extracted for a few SPICELIB subroutines. Our tool generated the LaTeX source included in Figure 13 without change.
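The emission step itself is simple: negate each guard predicate and conjoin the negations. The Python sketch below is illustrative only; the predicate string is a hand-written stand-in for what the tool recovers from the AST.

def precondition_latex(guard_predicates):
    """Negate each guard predicate and conjoin the results in LaTeX."""
    return r" \wedge ".join(r"\neg(" + p + ")" for p in guard_predicates)

# Hypothetical guard predicate recovered for SURFPT.
print(precondition_latex(
    [r"(U(1) = 0.0D0) \wedge (U(2) = 0.0D0) \wedge (U(3) = 0.0D0)"]))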


RECGEO -n{F > 1) A ^{RE < O.ODO)

REMSUB -^({LEFT > RIGHT) V (RIGHT < 1) V (LEFT < 1) V (RIGHT > LEN(IN)) V (LEFT > LEN(IN)))

SURFPT -'((C/(l) = O.ODO) A (U(2) = O.ODO) A (U(3) = O.ODO))

XPOSBL -y((MOD(NCOL,BSIZE) 7« 0) V (MOD(NROW, BSIZE) 7 0)) A -^(NCOL < 1) A ^(NROW < 1) A -^(BSIZE < 1)

Figure 13. Preconditions extracted for some of the subroutines in SPICELIB.

Taken literally, the precondition for SURFPT, for example, states that one of the first three elements of the U array parameter must be non-zero. In terms of solar system geometry, U is seen as a vector, so the more abstract precondition can be stated as "U is not the zero vector." Extracting the precondition into the literal representation is the first step to being able to express the precondition in the more abstract form.

The other preconditions listed in Figure 13, stated in their abstract form, are the following. The subroutine RECGEO converts the rectangular coordinates of a point RECTAN to geodetic coordinates, with respect to a given reference spheroid whose equatorial radius is RE, using a flattening coefficient F. Its precondition is that the radius is greater than 0 and the flattening coefficient is less than 1. The subroutine REMSUB removes the substring (LEFT:RIGHT) from a character string IN. It requires that the positions of the first character LEFT and the last character RIGHT to be removed are in the range 1 to the length of the string and that the position of the first character is not greater than the position of the last. Finally, the subroutine XPOSBL transposes the square blocks within a matrix BMAT. Its preconditions are that the block size BSIZE must evenly divide both the number of rows NROW in BMAT and the number of columns NCOL and that the block size, number of rows, and number of columns are all at least 1.

3.3. Finding Interleaving Candidates

There are several other analyses that we have investigated using heuristic techniques for finding interleaving candidates.


3.3.1. Routines with Multiple Outputs

One heuristic for finding instances of interleaving is to determine which subroutines compute more than one output. When this occurs, the subroutine is returning either the results of multiple distinct computations or a result whose type cannot be directly expressed in the Fortran type system (e.g., as a data aggregate). In the former case, the subroutine is realized as the interleaving of multiple distinct plans, as is the case with NPEDLN's computation of both the nearest point and the shortest distance.

In the latter case, the subroutine may be implementing only a single plan, but a maintainer's conceptual categorization of the subroutine is still obscured by the appearance of some number of seemingly distinct outputs. A good example of this case occurs in the SPICELIB subroutine SURFPT, which conceptually returns the intersection of a vector with the surface of an ellipsoid. However, it is possible to give SURFPT a vector and an ellipsoid that do not intersect. In such a situation the output parameter POINT will be undefined, but the Fortran type system cannot express the type DOUBLE PRECISION ∨ Undefined. The original programmer was forced to simulate a variable of this type using two variables, POINT and FOUND, adopting the convention that when FOUND is false, the return value is Undefined, and when FOUND is true, the return value is POINT.

Clearly subroutines with multiple outputs complicate program understanding. We built a tool that determines the multiple output subroutines in a library by analyzing the direction of dataflow in parameters of functions and subroutines. A parameter's direction is either: in if the parameter is only read in the subroutine, out if the parameter is only written in the subroutine, or in-out if the parameter is both read and written in the subroutine. Multiple output subroutines will have more than one parameter with direction out or in-out.
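Assuming read/write facts per formal parameter are already available (the paper obtains them from the Software Refinery's structure charts), the classification can be sketched in Python as follows; the library facts shown are invented for illustration.

def direction(is_read, is_written):
    """Classify a formal parameter as in, out, or in-out."""
    if is_read and is_written:
        return "in-out"
    return "out" if is_written else "in"

def multiple_output_routines(routines):
    """Flag routines with more than one out or in-out parameter."""
    flagged = []
    for name, params in routines.items():
        outs = [p for p, (r, w) in params.items()
                if direction(r, w) in ("out", "in-out")]
        if len(outs) > 1:
            flagged.append((name, outs))
    return flagged

# Hypothetical read/write facts: parameter -> (is_read, is_written).
LIBRARY = {
    "NPEDLN": {"A": (True, False), "B": (True, False), "C": (True, False),
               "LINEPT": (True, False), "LINEDR": (True, False),
               "PNEAR": (False, True), "DIST": (False, True)},
    "VNORM":  {"V": (True, False), "NORM": (False, True)},
}
print(multiple_output_routines(LIBRARY))   # [('NPEDLN', ['PNEAR', 'DIST'])]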

Our tool bases its analysis on the structure chart (call graph) objects that the Software Refinery creates. The nodes of these structure charts are annotated with parameter direction information. The resulting analysis showed that 25 percent of the subroutines in SPICELIB had multiple output parameters. We were thus able to focus our work on these routines first, as they are likely to involve interleaving.

In addition, we performed an empirical analysis to determine, for those routines covered by the Amphion domain model (35 percent of the library), which ones have multiple output parameters, some of which are not covered by the domain model. We refer to outputs that are not mapped to anything in the domain model as dead end dataflows (similar to an interprocedural version of dead code (Aho et al., 1986)). Since the programs that Amphion creates can never make use of these return values, they have not been associated with any meaning in the domain theory. For example, NPEDLN's distance output (DIST) is a dead end dataflow as far as the domain theory is concerned. Dead end dataflows imply interleaving in the subroutine and/or an incompleteness in the domain theory. Our analysis revealed that of the subroutines covered by the domain theory, 30 percent have some output parameters that are dead end dataflows. These are good focal points for detecting interleaved plans that might be relevant to extending the domain theory.
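A minimal sketch of the dead end dataflow check follows, assuming the routine's output parameters and the set of outputs the domain model refers to are both available; the data structures shown are hypothetical.

# Minimal sketch (hypothetical data structures): an output parameter is a
# "dead end dataflow" if the domain model maps nothing to it.

def dead_end_outputs(output_params, domain_model_outputs):
    """Return the outputs of a routine that the domain model never refers to,
    and hence that generated programs can never use."""
    return [p for p in output_params if p not in domain_model_outputs]

# NPEDLN-like example: the domain theory covers the nearest point but not DIST.
print(dead_end_outputs(["PNEAR", "DIST"], {"PNEAR"}))   # ['DIST']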


3.3.2. Control Coupling

Another heuristic for detecting potential interleaving finds candidate routines that may be involved in control coupling. Control coupling is often implemented by using a subroutine formal parameter as a control flag. So, we focus on calls to library routines that supply a constant as a parameter to other routines, as opposed to a variable. The constant parameter may be a flag that is being used to choose among a set of possible computations to perform. The heuristic strategy we use for detecting control coupling first computes a set of candidate routines that are invoked with a constant parameter at every call-site in the library or in code generated from the Amphion domain theory. Each member of this set is then analyzed to see if the formal parameter associated with the constant actual parameter is used to conditionally execute disjoint sections of code. Our analysis shows that 19 percent of the routines in SPICELIB are of this form.
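A hedged sketch of the first step of this heuristic appears below, assuming call sites have been reduced to lists of actual arguments tagged as constants or variables; the follow-on check that the formal parameter guards disjoint code sections is not shown.

# Hedged sketch of the control-coupling heuristic: find formal parameter
# positions that are bound to a constant actual at every call site.  The
# call-site records and names are illustrative, not the actual SPICELIB
# analysis.

def constant_parameter_positions(call_sites):
    """call_sites: list of argument lists, one per call of the same routine.
    An argument is recorded as ('const', value) or ('var', name).
    Returns positions that receive a constant at every call site."""
    if not call_sites:
        return []
    arity = len(call_sites[0])
    candidates = []
    for i in range(arity):
        if all(site[i][0] == "const" for site in call_sites):
            candidates.append(i)
    return candidates

calls = [
    [("const", 1), ("var", "X")],
    [("const", 1), ("var", "Y")],
    [("const", 2), ("var", "Z")],
]
print(constant_parameter_positions(calls))   # [0] -- a likely control flag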

3.3.3. Reformulation Wrappers

A third heuristic for locating interleaving is to ask: Which pairs of routines co-occur? Two routines co-occur if they are always called by the same routines, they are executed under the same conditions, and there is a flow of computed data from one to the other. We would like to detect co-occurrence pairs because they are likely to form reformulation wrappers. Of course, in general we would like to consider any code fragments as potential pairs, not just library routines. Once co-occurrence pairs are detected, they must be further checked to see whether they are inverses of each other. For example, in the "scale-unscale" reformulation wrapper, the operations that divide and multiply by the scaling factor co-occur and invert the effects of each other; the inputs are scaled (divided) and the results of the wrapped computation are later unscaled (multiplied). Through empirical investigation of SPICELIB, we have discovered co-occurrence pairs that form reformulation wrappers and are building tools to perform this analysis automatically.
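A sketch of co-occurrence detection follows, under a simplifying assumption (two routines co-occur if they have exactly the same set of callers); the further checks for identical execution conditions, connecting dataflow, and inverse behavior are omitted, and the routine names in the example are hypothetical.

# Sketch of co-occurrence detection under simplifying assumptions: two routines
# co-occur here if they have exactly the same set of callers.  The checks for
# matching execution conditions, dataflow between the pair, and inverse
# behaviour are omitted.

from itertools import combinations

def co_occurrence_pairs(callers_of):
    """callers_of: dict mapping routine name -> set of routines that call it."""
    pairs = []
    for r1, r2 in combinations(sorted(callers_of), 2):
        if callers_of[r1] and callers_of[r1] == callers_of[r2]:
            pairs.append((r1, r2))
    return pairs

# A scale/unscale pair of a reformulation wrapper would show up this way
# (routine names are hypothetical).
print(co_occurrence_pairs({
    "SCALE_INPUTS":   {"NPEDLN", "NPELPT"},
    "UNSCALE_RESULT": {"NPEDLN", "NPELPT"},
    "UNORM":          {"NPEDLN"},
}))   # [('SCALE_INPUTS', 'UNSCALE_RESULT')]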

4. Open Issues and Future Work

We are convinced that interleaving seriously complicates understanding computer programs. But recognizing a problem is different from knowing how to fix it. Questions arise as to what form of representation is appropriate to hold the extracted information, how knowledge of the application domain can be used to detect plans, the extent to which the concept of interleaving scales up, and how powerful tools need to be to detect and extract interleaved components.

4.1. Representation

Our strategy for building program analysis tools is to formulate a program representation whose structural properties correspond to interesting program properties. A programming style tool, for example, uses a control flow graph that explicitly represents transfer of execution flow in programs. Irreducible control flow graphs signify the use of unstructured GO TO statements. The style tool uses this structural property to report violations of structured programming style. Since we want to build tools for interleaving detection, we have to formulate a representation that captures the properties of interleaving. We do this by first listing structural properties that correspond to each of the three characteristics of interleaving and then searching for a representation that has these structural properties.

The key characteristics of interleaving are delocalization, resource sharing, and independence. In sequential languages like Fortran, delocalization often cannot be avoided when two or more plans share data. The components of the plans have to be serialized with respect to the dataflow constraints. This typically means that components of plans cluster around the computation of the data being shared as opposed to clustering around other components of the same plan. This total ordering is necessary due to the lack of support for concurrency in most high level programming languages. It follows then that in order to express a delocalized plan, a representation must impose a partial rather than a total execution ordering on the components of plans.

The partial execution ordering requirement suggests that some form of graphical representation is appropriate. Graph representations naturally express a partial execution ordering via implicit concurrency and explicit transfer of control and data. Since there are a number of such representations to choose from, we narrow the possibilities by noting that:

1. independent plans must be localized as much as possible, with no explicit ordering among them;

2. sharing must be detectable (shared resources should explicitly flow from one plan to another); similarly, if two plans p1 and p2 both share a resource provided by a plan p3, then p1 and p2 should appear in the graph as siblings with a common ancestor p3;

3. the representation must support multiple views of the program as the interaction of plans at various levels of abstraction, since interleaving may occur at any level of abstraction.

An existing formalism that meets these criteria is Rich's Plan Calculus (Rich, 1981, Rich, 1981, Rich and Waters, 1990). A plan in the Plan Calculus is encoded as a graphical depiction of the plan's structural parts and the constraints (e.g., data and control flow connections) between them. This diagrammatic notation is complemented with an axiomatized description of the plan that defines its formal semantics. This allows us to develop correctness preserving transformations to extract interleaved plans. The Plan Calculus also provides a mechanism, called overlays, for representing correspondences and relationships between pairs of plans (e.g., implementation and optimization relationships). This enables the viewing of plans at multiple levels of abstraction. Overlays also support a general notion of plan composition which takes into account resource sharing at all levels of abstraction by allowing overlapping points of view.
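As a toy illustration of the partial-ordering requirement (not the Plan Calculus itself), a plan can be held as a graph whose only ordering comes from explicit constraint edges, so unrelated components remain unordered with respect to each other.

# Toy sketch (not the Plan Calculus): a plan as a graph whose edges are
# data/control-flow constraints, so components are only partially ordered.
# The plan and component names are hypothetical; an acyclic constraint graph
# is assumed.

class Plan:
    def __init__(self, name):
        self.name = name
        self.parts = set()          # component operations
        self.constraints = set()    # (before, after) dataflow/control edges

    def add_constraint(self, before, after):
        self.parts.update((before, after))
        self.constraints.add((before, after))

    def ordered(self, a, b):
        """True if a must precede b under the constraints (transitively)."""
        frontier = {a}
        while frontier:
            nxt = {y for (x, y) in self.constraints if x in frontier}
            if b in nxt:
                return True
            frontier = nxt
        return False

p = Plan("nearest-point")
p.add_constraint("scale-inputs", "project-ellipse")
p.add_constraint("project-ellipse", "unscale-result")
print(p.ordered("scale-inputs", "unscale-result"))    # True
print(p.ordered("scale-inputs", "compute-distance"))  # False: independent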


4.2. Exploiting Application Knowledge

Most of the current technology available to help understand programs addresses implementation questions; that is, it is driven by the syntactic structure of programs written in some programming language. But the tasks that require the understanding - perfective, adaptive, and corrective maintenance - are driven by the problem the program is solving; that is, its application domain. For example, if a maintenance task requires extending NPEDLN to handle symmetric situations where more than one "nearest point" to a line exist, then the programmer needs to figure out what to do about the distance calculation also computed by NPEDLN. Why was DIST computed inside of the routine instead of separately? Was it only for efficiency reasons, or might the nearest point and the distance be considered a pair of results by its callers? In the former case, a single DIST return value is still appropriate; in the latter, a pair of identical values is indicated. To answer questions like these, programmers need to know which plans pieces of code are implementing. And this sort of plan knowledge derives from understanding the application area, not the program.

Another example from NPEDLN concerns reformulation wrappers. These plans are inherently delocalized. In fact, they only make sense as plans at all when considered in the context of the application: stable computations of solar system geometry. Without this understanding, the best hope is to recognize that the code has uniformly applied a function and its inverse in two places, without knowing why this was done and how the computations are connected.

The underlying issue is that any scheme for code understanding based solely on a top-down or a bottom-up approach is inherently limited. As illustrated by the examples, a bottom-up approach cannot hope to relate delocalized segments or disentangle interleavings without being able to relate to the application goals. And a top-down approach cannot hope to find where a plan is implemented without being able to understand how plan implementations are related syntactically and via dataflows. The implication is that a coordinated strategy is indicated, where plans generate expectations that guide program analysis and program analysis generates related segments that need explanation.

4.3. Scaling the Concept of Interleaving

We can characterize the ways interleaving manifests itself in source code along two spectrums. These form a possible design space of solutions to the interleaving problem and can help relate existing techniques that might be applicable. One spectrum is the scope of the interleaving, which can range from intraprocedural to interprocedural to object (clusters of procedures and data) to architectural. The other spectrum is the structural mechanism providing the interleaving, which may be naming, control, data, or protocol. Protocols are global constraints, such as maintaining stack discipline or synchronization mechanisms for cooperating processes. For example, the use of control flags is a control-based mechanism for interleaving with interprocedural scope. The common iteration construct involved in loop fusion is another control-based mechanism, but this interleaving has intraprocedural scope. Reformulation wrappers use a protocol mechanism, usually at the intraprocedural level, but they can have interprocedural scope. Multiple-inheritance is an example of a data-centered interleaving mechanism with object scope. Interleaving at the scope of objects and architectures or involving global protocol mechanisms is not yet well understood. Consequently, few mechanisms for detection and extraction currently exist in these areas.

4.4. Tool Support

We used the Software Refinery from Reasoning Systems in our analyses. This comprehensive toolkit provides a set of language-specific browsers and analyzers, a parser generator, a user interface builder, and an object-oriented repository for holding the results of analyses. We made particular use of two other features of the toolkit. The first is called the Workbench, and it provided pre-existing analyses for traditional graphs and reports such as structure charts, dataflow diagrams, and cross reference lists. The results of the analyses can be accessed from the repository using small Refine language programs such as those described in (Rugaber et al., 1995). The Refine compiler was the other feature we used; it compiles Refine programs into compiled Lisp.

The approach taken by the Refine language and tool suite has many advantages for attacking problems like ours. The language itself combines features of imperative, object-oriented, functional, and rule-based programming, thus providing flexibility and generality. Of particular value to us are its rule-based constructs. Before-and-after condition patterns define the properties of constructs without indicating how to find them. We had merely to add a simple tree-walking routine to apply the rules to the abstract syntax tree. In addition to the rule-based features, Refine provides abstract data structures, such as sets, maps, and sequences, which manage their own memory requirements, thereby reducing programmer work. The object-oriented repository further reduces programmer responsibility by providing persistence and memory management.

We also take full advantage of Reasoning Systems' existing Fortran language model and its structure chart analysis. These allowed us a running start on our analysis and provided a robust handling of Fortran constructs that are not typically available from non-commercial research tools.

We can see several ways in which the Refine approach can be extended. In particular, the availability of other analyses, such as control flow graphs for Fortran and general dataflow analysis, would prove useful. Robust dataflow analysis is particularly important to the precision of precondition extraction.

5. Related Work

Techniques for detecting interleaving and disentangling interleaved plans are likely to build on existing program comprehension and maintenance techniques.


5.1. The Role of Recognition

When what is interleaved is familiar (i.e., stereotypical, frequently used plans), cliche recognition (e.g., (Hartman, 1991, Johnson, 1986, Kozaczynski and Ning, 1994, Letovsky, 1988, Quilici, 1994, Rich and Wills, 1990, Wills, 1992)) is a useful detection mechanism. In fact, most recognition systems deal explicitly with the recognition of cliches that are interleaved in specific ways with unrecognizable code or other cliches. One of the key features of GRASPR (Wills, 1992), for instance, is its ability to deal with delocalization and redistribution-type function sharing optimizations.

KBEmacs (Rich and Waters, 1990, Waters, 1979) uses a simple, special-purpose recognition strategy to segment loops within programs. This is based on detecting coarse patterns of data and control flow at the procedural level that are indicative of common ways of constructing, augmenting, and interleaving iterative computations. For example, KBEmacs looks for minimal sections of a loop body that have data flow feeding back only to themselves. This decomposition enables a powerful form of abstraction, called temporal abstraction, which views iterative computations as compositions of operations on sequences of values. The recognition and temporal abstraction of iteration cliches is similarly used in GRASPR to enable it to deal with generalized loop fusion forms of interleaving. Loop fusion is viewed as redistribution of sequences of values and treated as any other redistribution optimization (Wills, 1992).

Most existing cliche recognition systems tend to deal with interleaving involving data and control mechanisms. Domain-based clustering, as explored by DM-TAG in the DESIRE system (Biggerstaff et al., 1994), focuses on naming mechanisms, by keying in on the patterns of linguistic idioms used in the program, which suggest the manifestations of domain concepts.

Mechanisms for dealing with specific types of interleaving have been explicitly built into existing recognition systems. In the future, we envision recognition architectures that detect not only familiar computational patterns, but also recognize familiar types of transformations or design decisions that went into constructing the program. Many existing cliche recognition systems implicitly detect and undo certain types of interleaving design decisions. However, this process is usually done with special-purpose procedural mechanisms that are difficult to extend and that are viewed as having supporting roles to the cliche recognition process, rather than as being an orthogonal form of recognition.

5.2. Disentangling Unfamiliar Plans

When what is interleaved is unfamiliar (i.e., novel, idiosyncratic, not repeatedly used plans), other, non-recognition-based methods of delineation are needed. For example, slicing (Weiser, 1981, Ning et al., 1994) is a widely-used technique for localizing functional components by tracing through data dependencies within the procedural scope. Cluster analysis (Biggerstaff et al., 1994, Hutchens and Basili, 1985, Schwanke, 1991, Schwanke, 1989) is used to group related sections of code, based on the detection of shared uses of global data, control paths, and names. However, clustering techniques can only provide limited assistance by roughly delineating possible locations of functionally cohesive components. Another technique, called "potpourri module detection" (Calliss and Cornelius, 1990), detects modules that provide more than one independent service by looking for multiple proper subgraphs in an entity-to-entity interconnection graph. These graphs show dependencies among global entities within a single module. Presumably, the independent services reflect separate plans in the code.

Research into automating data encapsulation has recently provided mechanisms for hypothesizing possible locations of data plans at the object scope. For example, Bowdidge and Griswold (Bowdidge and Griswold, 1994) use an extended data flow graph representation, called a star diagram, to help programmers see all the uses of a particular data structure and to detect frequently occurring computations that are candidates for abstract functions. Techniques have also been developed within the RE² project (Canfora et al., 1993, Cimitile et al., 1994) for identifying candidate abstract data types and their associated modules, based on the call graph and dominance relations. Further research is required to develop techniques for extracting objects from pieces of data that have not already been aggregated in programmer-defined data structures. For example, detecting multiple pieces of data that are always used together might suggest candidates for data aggregation (as, for example, in NPEDLN, where the input parameters A, B, and C are used as a tuple representing an ellipsoid, and the outputs PNEAR and DIST represent a pair of results related by interleaved, highly overlapping plans).

6. Conclusion

Interleaving is a commonly occurring phenomenon in the code that we have examined. Although a particular instance may be the result of an intentional decision on the part of a programmer trying to improve the efficiency of a program, it can nevertheless make understanding the program more difficult for subsequent maintainers. In our studies we have observed that interleaving typically involves the implementation of several independent plans in one code segment, often so that a program resource could be shared among the plans. The interleaving can, in turn, lead to each of the separate plan implementations being spread out or delocalized throughout the segment.

To investigate the phenomenon of interleaving, we have studied a substantial collection of production software: SPICELIB from the Jet Propulsion Laboratory. SPICELIB needs to be clearly understood in order to support automated program generation as part of the Amphion project, and we were able to add to that understanding by performing a variety of interleaving-based analyses. The results of these studies reinforce our belief that interleaving is a useful concept when understanding is important, and that many instances of interleaving can be detected by relatively straightforward tools.

Acknowledgments

Support for this research has been provided by ARPA (contract number NAG 2-890). We are grateful to JPL's NAIF group for enabling our study of their SPICELIB software. We also benefited from insightful discussions with Michael Lowry at NASA Ames Research Center concerning this study and interesting future directions.


Appendix: NPEDLN with Some of Its Documentation

C$ Nearest point on ellipsoid to line.
      SUBROUTINE NPEDLN ( A, B, C, LINEPT, LINEDR, PNEAR, DIST )

      INTEGER               UBEL
      PARAMETER           ( UBEL = 9 )
      INTEGER               UBPL
      PARAMETER           ( UBPL = 4 )

      DOUBLE PRECISION      A
      DOUBLE PRECISION      B
      DOUBLE PRECISION      C
      DOUBLE PRECISION      LINEPT ( 3 )
      DOUBLE PRECISION      LINEDR ( 3 )
      DOUBLE PRECISION      PNEAR  ( 3 )
      DOUBLE PRECISION      DIST
      DOUBLE PRECISION      CANDPL ( UBPL )
      DOUBLE PRECISION      CAND   ( UBEL )
      DOUBLE PRECISION      OPPDIR ( 3 )
      DOUBLE PRECISION      PRJPL  ( UBPL )
      DOUBLE PRECISION      MAG
      DOUBLE PRECISION      NORMAL ( 3 )
      DOUBLE PRECISION      PRJEL  ( UBEL )
      DOUBLE PRECISION      PRJPT  ( 3 )
      DOUBLE PRECISION      PRJNPT ( 3 )
      DOUBLE PRECISION      PT     ( 3, 2 )
      DOUBLE PRECISION      SCALE
      DOUBLE PRECISION      SCLA
      DOUBLE PRECISION      SCLB
      DOUBLE PRECISION      SCLC
      DOUBLE PRECISION      SCLPT  ( 3 )
      DOUBLE PRECISION      UDIR   ( 3 )
      DOUBLE PRECISION      VDIST
      INTEGER               I
      LOGICAL               FOUND  ( 2 )
      LOGICAL               IFOUND
      LOGICAL               XFOUND
      LOGICAL               RETURN

      IF ( RETURN () ) THEN
         RETURN
      ELSE
         CALL CHKIN ( 'NPEDLN' )
      END IF

      CALL UNORM ( LINEDR, UDIR, MAG )
      IF ( MAG .EQ. 0 ) THEN
         CALL SETMSG ( 'Direction is zero vector.' )
         CALL SIGERR ( 'SPICE(ZEROVECTOR)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      ELSE IF (      ( A .LE. 0.D0 )
     .          .OR. ( B .LE. 0.D0 )
     .          .OR. ( C .LE. 0.D0 ) ) THEN
         CALL SETMSG ( 'Semi-axes: A=#,B=#,C=#.' )
         CALL ERRDP  ( '#', A )
         CALL ERRDP  ( '#', B )
         CALL ERRDP  ( '#', C )
         CALL SIGERR ( 'SPICE(INVALIDAXISLENGTH)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF

C     Scale the semi-axes lengths for better numerical behavior.
C     If squaring any of the scaled lengths causes it to underflow
C     to zero, signal an error.  Otherwise scale the point on the
C     input line too.
      SCALE = MAX ( DABS(A), DABS(B), DABS(C) )
      SCLA  = A / SCALE
      SCLB  = B / SCALE
      SCLC  = C / SCALE
      IF (      ( SCLA**2 .LE. 0.D0 )
     .     .OR. ( SCLB**2 .LE. 0.D0 )
     .     .OR. ( SCLC**2 .LE. 0.D0 ) ) THEN
         CALL SETMSG ( 'Axis too small: A=#,B=#,C=#.' )
         CALL ERRDP  ( '#', A )
         CALL ERRDP  ( '#', B )
         CALL ERRDP  ( '#', C )
         CALL SIGERR ( 'SPICE(DEGENERATECASE)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF
      SCLPT(1) = LINEPT(1) / SCALE
      SCLPT(2) = LINEPT(2) / SCALE
      SCLPT(3) = LINEPT(3) / SCALE

C     Hand off the intersection case to SURFPT.  SURFPT determines
C     whether rays intersect a body, so we treat the line as a pair
C     of rays.
      CALL VMINUS ( UDIR, OPPDIR )
      CALL SURFPT ( SCLPT, UDIR,   SCLA, SCLB, SCLC,
     .              PT(1,1), FOUND(1) )
      CALL SURFPT ( SCLPT, OPPDIR, SCLA, SCLB, SCLC,
     .              PT(1,2), FOUND(2) )
      DO 50001 I = 1, 2
         IF ( FOUND(I) ) THEN
            DIST = 0.0D0
            CALL VEQU   ( PT(1,I), PNEAR )
            CALL VSCL   ( SCALE, PNEAR, PNEAR )
            CALL CHKOUT ( 'NPEDLN' )
            RETURN
         END IF
50001 CONTINUE

C     Getting here means the line doesn't intersect the ellipsoid.
C     Find the candidate ellipse CAND.  NORMAL is a normal vector to
C     the plane containing the candidate ellipse.  Mathematically the
C     ellipse must exist; it's the intersection of an ellipsoid
C     centered at the origin and a plane containing the origin.  Only
C     numerical problems can prevent the intersection from being found.
      NORMAL(1) = UDIR(1) / SCLA**2
      NORMAL(2) = UDIR(2) / SCLB**2
      NORMAL(3) = UDIR(3) / SCLC**2
      CALL NVC2PL ( NORMAL, 0.D0, CANDPL )
      CALL INEDPL ( SCLA, SCLB, SCLC, CANDPL, CAND, XFOUND )
      IF ( .NOT. XFOUND ) THEN
         CALL SETMSG ( 'Candidate ellipse not found.' )
         CALL SIGERR ( 'SPICE(DEGENERATECASE)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF

C     Project the candidate ellipse onto a plane orthogonal to the
C     line.  We'll call the plane PRJPL and the projected ellipse
C     PRJEL.
      CALL NVC2PL ( UDIR, 0.D0, PRJPL )
      CALL PJELPL ( CAND, PRJPL, PRJEL )

C     Find the point on the line lying in the projection plane, and
C     then find the near point PRJNPT on the projected ellipse.  Here
C     PRJPT is the point on the line lying in the projection plane.
C     The distance between PRJPT and PRJNPT is DIST.
      CALL VPRJP  ( SCLPT, PRJPL, PRJPT )
      CALL NPELPT ( PRJPT, PRJEL, PRJNPT )
      DIST = VDIST ( PRJNPT, PRJPT )

C     Find the near point PNEAR on the ellipsoid by taking the inverse
C     orthogonal projection of PRJNPT; this is the point on the
C     candidate ellipse that projects to PRJNPT.  The output DIST was
C     computed in step 3 and needs only to be re-scaled.  The inverse
C     projection of PNEAR ought to exist, but may not be calculable
C     due to numerical problems (this can only happen when the
C     ellipsoid is extremely flat or needle-shaped).
      CALL VPRJPI ( PRJNPT, PRJPL, CANDPL, PNEAR, IFOUND )
      IF ( .NOT. IFOUND ) THEN
         CALL SETMSG ( 'Inverse projection not found.' )
         CALL SIGERR ( 'SPICE(DEGENERATECASE)' )
         CALL CHKOUT ( 'NPEDLN' )
         RETURN
      END IF

C     Undo the scaling.
      CALL VSCL ( SCALE, PNEAR, PNEAR )
      DIST = SCALE * DIST
      CALL CHKOUT ( 'NPEDLN' )
      RETURN
      END


Descriptions of subroutines called by NPEDLN:

     CHKIN      Module Check In (error handling).
     UNORM      Normalize double precision 3-vector.
     SETMSG     Set Long Error Message.
     SIGERR     Signal Error Condition.
     CHKOUT     Module Check Out (error handling).
     ERRDP      Insert DP Number into Error Message Text.
     VMINUS     Negate a double precision 3-D vector.
     SURFPT     Find intersection of vector w/ ellipsoid.
     VEQU       Make one DP 3-D vector equal to another.
     VSCL       Vector scaling, 3 dimensions.
     NVC2PL     Make plane from normal and constant.
     INEDPL     Intersection of ellipsoid and plane.
     PJELPL     Project ellipse onto plane, orthogonally.
     VPRJP      Project a vector onto plane orthogonally.
     NPELPT     Find nearest point on ellipse to point.
     VPRJPI     Vector projection onto plane, inverted.

Descriptions of variables used by NPEDLN:

     A          Length of semi-axis in the x direction.
     B          Length of semi-axis in the y direction.
     C          Length of semi-axis in the z direction.
     LINEPT     Point on input line.
     LINEDR     Direction vector of input line.
     PNEAR      Nearest point on ellipsoid to line.
     DIST       Distance of ellipsoid from line.
     UBEL       Upper bound of array containing ellipse.
     UBPL       Upper bound of array containing plane.
     PT         Intersection point of line & ellipsoid.
     CAND       Candidate ellipse.
     CANDPL     Plane containing candidate ellipse.
     NORMAL     Normal to the candidate plane CANDPL.
     UDIR       Unitized line direction vector.
     MAG        Magnitude of line direction vector.
     OPPDIR     Vector in direction opposite to UDIR.
     PRJPL      Projection plane, which the candidate ellipse is
                projected onto to yield PRJEL.
     PRJEL      Projection of the candidate ellipse CAND onto the
                projection plane PRJPL.
     PRJPT      Projection of line point.
     PRJNPT     Nearest point on projected ellipse to projection
                of line point.
     SCALE      Scaling factor.


References

Aho, A., R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.
Basili, V.R. and H.D. Mills. Understanding and documenting programs. IEEE Transactions on Software Engineering, 8(3):270-283, May 1982.
Biggerstaff, T., B. Mitbander, and D. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72-83, May 1994.
Boehm, Barry. Software Engineering Economics. Prentice Hall, 1981.
Bowdidge, R. and W. Griswold. Automated support for encapsulating abstract data types. In Proc. 2nd ACM SIGSOFT Symposium on Foundations of Software Engineering, pages 97-110, New Orleans, Dec. 1994.
Brooks, R. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies, 18:543-554, 1983.
Calliss, F. and B. Cornelius. Potpourri module detection. In IEEE Conference on Software Maintenance - 1990, pages 46-51, San Diego, CA, November 1990. IEEE Computer Society Press.
Canfora, G., A. Cimitile, and M. Munro. A reverse engineering method for identifying reusable abstract data types. In Proc. of the First Working Conference on Reverse Engineering, pages 73-82, Baltimore, Maryland, May 1993. IEEE Computer Society Press.
Cimitile, A., M. Tortorella, and M. Munro. Program comprehension through the identification of abstract data types. In Proc. 3rd Workshop on Program Comprehension, pages 12-19, Washington, D.C., November 1994. IEEE Computer Society Press.
Fjeldstad, R.K. and W.T. Hamlen. Application program maintenance study: Report to our respondents. In GUIDE 48, April 1979. Also appears in (Parikh and Zvegintozov, 1983).
Hall, R. Program improvement by automatic redistribution of intermediate results. Technical Report 1251, MIT Artificial Intelligence Lab., February 1990. PhD thesis.
Hall, R. Program improvement by automatic redistribution of intermediate results: An overview. In M. Lowry and R. McCartney, editors, Automating Software Design. AAAI Press, Menlo Park, CA, 1991.
Hartman, J. Automatic control understanding for natural programs. Technical Report AI91-161, University of Texas at Austin, 1991. PhD thesis.
Hutchens, D. and V. Basili. System structure analysis: Clustering with data bindings. IEEE Transactions on Software Engineering, 11(8), August 1985.
Reasoning Systems Incorporated. Software Refinery Toolkit. Palo Alto, CA.
Johnson, W.L. Intention-Based Diagnosis of Novice Programming Errors. Morgan Kaufmann Publishers, Inc., Los Altos, CA, 1986.
Kozaczynski, W. and J.Q. Ning. Automated program understanding by concept recognition. Automated Software Engineering, 1(1):61-78, March 1994.
Letovsky, S. Plan analysis of programs. Research Report 662, Yale University, December 1988. PhD thesis.
Letovsky, S. and E. Soloway. Delocalized plans and program comprehension. IEEE Software, 3(3), 1986.
Lowry, M., A. Philpot, T. Pressburger, and I. Underwood. Amphion: automatic programming for subroutine libraries. In Proc. 9th Knowledge-Based Software Engineering Conference, pages 2-11, Monterey, CA, 1994.
Lowry, M., A. Philpot, T. Pressburger, and I. Underwood. A formal approach to domain-oriented software design environments. In Proc. 9th Knowledge-Based Software Engineering Conference, pages 48-57, Monterey, CA, 1994.
Myers, G. Reliable Software through Composite Design. Petrocelli Charter, 1975.
Ning, J.Q., A. Engberts, and W. Kozaczynski. Automated support for legacy code understanding. Communications of the ACM, 37(5):50-57, May 1994.
Ornburn, S. and S. Rugaber. Reverse engineering: Resolving conflicts between expected and actual software designs. In IEEE Conf. on Software Maintenance - 1992, pages 32-40, Orlando, Florida, November 1992.
Parikh, G. and N. Zvegintozov, editors. Tutorial on Software Maintenance. IEEE Computer Society, 1983. Order No. EM453.
Quilici, A. A memory-based approach to recognizing programming plans. Communications of the ACM, 37(5):84-93, May 1994.
Rich, C. A formal representation for plans in the Programmer's Apprentice. In Proc. 7th International Joint Conference on Artificial Intelligence, pages 1044-1052, Vancouver, British Columbia, Canada, August 1981.
Rich, C. Inspection methods in programming. Technical Report 604, MIT Artificial Intelligence Lab., June 1981. PhD thesis.
Rich, C. and R.C. Waters. The Programmer's Apprentice. Addison-Wesley, Reading, MA and ACM Press, Baltimore, MD, 1990.
Rich, C. and L.M. Wills. Recognizing a program's design: A graph-parsing approach. IEEE Software, 7(1):82-89, January 1990.
Rugaber, S., S. Ornburn, and R. LeBlanc. Recognizing design decisions in programs. IEEE Software, 7(1):46-54, January 1990.
Rugaber, S., K. Stirewalt, and L. Wills. Detecting interleaving. In IEEE Conference on Software Maintenance - 1995, pages 265-274, Nice, France, September 1995. IEEE Computer Society Press.
Rugaber, S., K. Stirewalt, and L. Wills. The interleaving problem in program understanding. In Proc. of the Second Working Conference on Reverse Engineering, pages 166-175, Toronto, Ontario, July 1995. IEEE Computer Society Press.
Schwanke, R. An intelligent tool for re-engineering software modularity. In IEEE Conference on Software Maintenance - 1991, pages 83-92, 1991.
Schwanke, R., R. Altucher, and M. Platoff. Discovering, visualizing, and controlling software structure. In Proc. 5th Int. Workshop on Software Specification and Design, pages 147-150, Pittsburgh, PA, 1989.
Selfridge, P., R. Waters, and E. Chikofsky. Challenges to the field of reverse engineering - A position paper. In Proc. of the First Working Conference on Reverse Engineering, pages 144-150, Baltimore, Maryland, May 1993. IEEE Computer Society Press.
Smith, D., G. Kotik, and S. Westfold. Research on knowledge-based software environments at Kestrel Institute. IEEE Transactions on Software Engineering, November 1985.
Soloway, E. and K. Ehrlich. Empirical studies of programming knowledge. IEEE Transactions on Software Engineering, 10(5):595-609, September 1984. Reprinted in C. Rich and R.C. Waters, editors, Readings in Artificial Intelligence and Software Engineering, Morgan Kaufmann, 1986.
Stickel, M., R. Waldinger, M. Lowry, T. Pressburger, I. Underwood, and A. Bundy. Deductive composition of astronomical software from subroutine libraries. In Proc. 12th International Conference on Automated Deduction, pages 341-355, Nancy, France, 1994.
Waters, R.C. A method for analyzing loop programs. IEEE Transactions on Software Engineering, 5(3):237-247, May 1979.
Weiser, Mark. Program slicing. In 5th International Conference on Software Engineering, pages 439-449, San Diego, CA, March 1981.
Wills, L. Automated program recognition by graph parsing. Technical Report 1358, MIT Artificial Intelligence Lab., July 1992. PhD thesis.
Yourdon, E. and L. Constantine. Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design. Prentice-Hall, 1979.

Automated Software Engineering, 3, 77-108 (1996) © 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Pattern Matching for Clone and Concept Detection*

K. A. KONTOGIANNIS, R. DEMORI, E. MERLO, M. GALLER, M. BERNSTEIN

kostas@cs.mcgill.ca

McGill University School of Computer Science 3480 University St., Room 318, Montreal, Canada H3A 2A7

Abstract. A legacy system is an operational, large-scale software system that is maintained beyond its first generation of programmers. It typically represents a massive economic investment and is critical to the mission of the organization it serves. As such systems age, they become increasingly complex and brittle, and hence harder to maintain. They also become even more critical to the survival of their organization because the business rules encoded within the system are seldom documented elsewhere.

Our research is concerned with developing a suite of tools to aid the maintainers of legacy systems in recovering the knowledge embodied within the system. The activities, known collectively as "program understanding", are essential preludes for several key processes, including maintenance and design recovery for reengineering.

In this paper we present three pattern-matching techniques: source code metrics, a dynamic programming algorithm for finding the best alignment between two code fragments, and a statistical matching algorithm between abstract code descriptions represented in an abstract language and actual source code. The methods are applied to detect instances of code cloning in several moderately-sized production systems including tcsh, bash, and CLIPS.

The programmer's skill and experience are essential elements of our approach. Selection of particular tools and analysis methods depends on the needs of the particular task to be accomplished. Integration of the tools provides opportunities for synergy, allowing the programmer to select the most appropriate tool for a given task.

Keywords: reverse engineering, pattern matching, program understanding, software metrics, dynamic programming

1. Introduction

Large-scale production software systems are expensive to build and, over their useful lifetimes, are even more expensive to maintain. Successful large-scale systems are often called "legacy systems" because (a) they tend to have been in service for many years, (b) the original developers, in the normal course of events, move on to other projects, leaving the system to be maintained by successive generations of maintenance programmers, and (c) the systems themselves represent enormous corporate assets that cannot be easily replaced.

Legacy systems are intrinsically difficult to maintain because of their sheer bulk and because of the loss of historical information: design documentation is seldom maintained as the system evolves. In many cases, the source code becomes the sole repository for evolving corporate business rules.

* This work is in part supported by IBM Canada Ltd., the Institute for Robotics and Intelligent Systems (a Canadian Network of Centers of Excellence), and the Natural Sciences and Engineering Research Council of Canada. Based on "Pattern Matching for Design Concept Localization" by K.A. Kontogiannis, R. DeMori, M. Bernstein, M. Galler, E. Merlo, which first appeared in Proceedings of the Second Working Conference on Reverse Engineering, pp. 96-103, July 1995, © IEEE, 1995.


During system maintenance, it is often necessary to move from low, implementation-oriented levels of abstraction back to the design and even the requirements levels. The process is generally known as "reverse engineering". In (Chikofsky, 1990) there are definitions for a variety of subtasks, including "reengineering", "restructuring", and "redocumentation".

In particular, it has been estimated that 50 to 90 percent of the maintenance programmer's effort is devoted to simply understanding relationships within the program. The average Fortune 100 company maintains 35 million lines of source code (MLOC) with a growth rate of 10 percent per year just in enhancements, updates, and normal maintenance. Facilitating the program understanding process can yield significant economic savings.

We believe that maintaining a large legacy software system is an inherently human activity that requires knowledge, experience, taste, judgement and creativity. For the foreseeable future, no single tool or technique will replace the maintenance programmer nor even satisfy all of the programmer's needs. Evolving real-world systems requires pragmatism and flexibility.

Our approach is to provide a suite of complementary tools from which the programmer can select the most appropriate one for the specific task at hand. An integration framework enables exploitation of synergy by allowing communication among the tools.

Our research is part of a larger joint project with researchers from the IBM Centre for Advanced Studies, University of Toronto, and University of Victoria (Buss et al., 1994).

Over the past three years, the team has been developing a toolset, called RevEngE (Reverse Engineering Environment), based on an open architecture for integrating heterogeneous tools. The toolset is integrated through a common repository specifically designed to support program understanding (Mylopoulos, 1990). Individual tools in the kit include Ariadne (Konto, 1994), ART (Johnson, 1993), and Rigi (Tilley, 1994). ART (Analysis of Redundancy in Text) is a prototype textual redundancy analysis system. Ariadne is a set of pattern matching and design recovery programs implemented using a commercial tool called The Software Refinery. Currently we are working on another version of the Ariadne environment implemented in C++. Rigi is a programmable environment for program visualization. The tools communicate through a flexible object server and single global schema implemented using the Telos information modeling language and repository (Mylopoulos, 1990).

In this paper we describe two types of pattern-matching techniques and discuss why pattern matching is an essential tool for program understanding. The first type is based on numerical comparison of selected metric values that characterize and classify source code fragments.

The second type is based on Dynamic Programming techniques that allow for statement-level comparison of feature vectors that characterize source code program statements. Consequently, we apply these techniques to address two types of relevant program understanding problems.

The first one is a comparison between two different program segments to see if one is a clone of the other, that is, if the two segments are implementations of the same algorithm. The problem is in theory undecidable, but in practice it is very useful to provide software maintainers with a tool that detects similarities between code segments. Similar segments are proposed to the software engineer, who will make the final decision about their modification or other use.

The second problem is the recognition of program segments that implement a given programming concept. We address this problem by defining a concept description language called ACL and by applying statement-level comparison between feature vectors of the language and feature vectors of source code program statements.

1.1. The Code Cloning Problem

Source code cloning occurs when a developer reuses existing code in a new context by making a copy that is altered to provide new functionality. The practice is widespread among developers and occurs for several reasons: making a modified copy may be simpler than trying to exploit commonality by writing a more general, parameterized function; scheduling pressures may not allow the time required to generalize the code; and efficiency constraints may not admit the extra overhead (real or perceived) of a generalized routine.

In the long run, code cloning can be a costly practice. Firstly, it results in a program that is larger than necessary, increasing the complexity that must be managed by the maintenance programmer and increasing the size of the executable program, requiring larger computers. Secondly, when a modification is required (for example, due to bug fixes, enhancements, or changes in business rules), the change must be propagated to all instances of the clone. Thirdly, often-cloned functionality is a prime candidate for repackaging and generalization for a repository of reusable components, which can yield tremendous leverage during development of new applications.

This paper introduces new techniques for detecting instances of source code cloning. Program features based on software metrics are proposed. These features apply to basic program segments like individual statements, begin-end blocks and functions. Distances between program segments can be computed based on feature differences. This paper proposes two methods for addressing the code cloning detection problem.

The first is based on direct comparison of metric values that classify a given code fragment. The granularity for selecting and comparing code fragments is at the level of begin-end blocks. This method returns clusters of begin-end blocks that may be products of cut-and-paste operations.

The second is based on a new Dynamic Programming (DP) technique that is used to calculate the best alignment between two code fragments in terms of deletions, insertions, and substitutions. The granularity for selecting code fragments for comparison is again at the level of begin-end blocks. Once two begin-end blocks have been selected, they are compared at the statement level. This method returns clusters of begin-end blocks that may be products of cut-and-paste operations. The DP approach provides, in general, more accurate results (i.e., fewer false positives) than the one based on direct comparison of metric values at the begin-end block level. The reason is that comparison occurs at the statement level and informal information is taken into account (i.e., variable names, literal strings, and numbers).


1.2. The Concept Recognition Problem

Programming concepts are described by a concept language. A concept to be recognized is a phrase of the concept language. Concept descriptions and source code are parsed. The concept recognition problem becomes the problem of establishing correspondences, as in machine translation, between a parse tree of the concept description language and the parse tree of the code.

A new formalism is proposed to view the problem as a stochastic syntax-directed translation. Translation rules are pairs of rewriting rules and have an associated probability that can initially be set to uniform values for all the possible alternatives.

Matching of concept representations and source code representations involves alignment, which is again performed using a dynamic programming algorithm that compares feature vectors of concept descriptions and source code.

The proposed concept description language models insertions as wildcard symbols (AbstractStatement* and AbstractStatement+) and does not allow any deletions from the pattern. The comparison and selection granularity is at the statement level. Comparison of a concept description language statement with a source code statement is achieved by comparing feature vectors (i.e., metrics, variables used, variables defined, and keywords).

Given a concept description M = A1; A2; ... Am, a code fragment V = S1; S2; ... Sk is selected for comparison if: a) the first concept description statement A1 matches with S1, and b) the sequence of statements S2; ... Sk belongs to the innermost begin-end block containing S1.
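A hedged sketch of this selection rule follows, assuming statements and blocks have already been abstracted into simple records; statement_matches() is a placeholder for the real feature-vector comparison.

# Hedged sketch of candidate selection for concept-to-code matching: a code
# fragment qualifies if its first statement matches the first abstract
# statement A1 and the rest of the fragment lies in the innermost begin-end
# block containing that statement.

def statement_matches(abstract_stmt, code_stmt):
    # Placeholder: the real comparison uses metrics, uses/defs and keywords.
    return abstract_stmt["kind"] == code_stmt["kind"]

def candidate_fragments(concept, blocks):
    """concept: list of abstract statements A1..Am.
    blocks: list of innermost begin-end blocks, each a list of statements."""
    candidates = []
    for block in blocks:
        for i, stmt in enumerate(block):
            if statement_matches(concept[0], stmt):
                candidates.append(block[i:])   # S1 plus the rest of its block
    return candidates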

The use of a statistical formalism allows a score (a probability) to be assigned to every match that is attempted. Incomplete or imperfect matching is also possible, leaving the final decision on the similar candidates proposed by the matcher to the software engineer.

A way of dynamically updating matching probabilities as new data are observed is also suggested in this paper. Concept-to-code matching is under testing and optimization. It has been implemented using the REFINE environment and supports plan localization in C programs.

1.3. Related Work

A number of research teams have developed tools and techniques for localizing specific code patterns.

The UNIX operating system provides numerous tools based on regular expressions, both for matching and for code replacement. Widely-used tools include grep, awk, ed and vi. These tools are very efficient in localizing patterns but do not provide any way for partial and hierarchical matching. Moreover, they do not provide any similarity measure between the pattern and the input string.

Other tools have been developed to browse source code and query software repositories based on structure, permanent relations between code fragments, keywords, and control or dataflow relationships. Such tools include CIA, Microscope, Rigi, SCAN, and REFINE. These tools are efficient at representing and storing relationships between program components in local repositories. Moreover, they provide effective mechanisms for querying and updating their local repositories. However, they do not provide any mechanism to localize code fragments other than the stored relations. Moreover, no partial matching is possible and no similarity measures between a query and a source code entity can be calculated.

Code duplication systems use a variety of methods to localize a code fragment given a model or a pattern. One category of such tools uses structure graphs to identify the "fingerprint" of a program (Jankowitz, 1988). Other tools use metrics to detect code patterns (McCabe, 1990), (Halstead, 1977), common dataflow (Horwitz, 1990), approximate fingerprints from program text files (Johnson, 1993), text comparison enhanced with heuristics for approximate and partial matching (Baker, 1995), and text comparison tools such as Unix diff.

The closest tool to the approach discussed in this paper is SCRUPLE (Paul, 1994). The major improvement of the solution proposed here is a) the possibility of performing partial matching with feature vectors, providing similarity measures between a pattern and a matched code fragment, and b) the ability to perform hierarchical recognition. In this approach, explicit concepts such as iterative-statement can be used, allowing for multiple matches with a While, a For, or a Do statement in the code. Moreover, recognized patterns can be classified and stored so that they can be used inside other more complex composite patterns. An expansion process is used for unwrapping the composite pattern into its components.

2. Code to Code Matching

In this section we discuss pattern-matching algorithms applied to the problem of clone detection. Determining whether two arbitrary program functions have identical behavior is known to be undecidable in the general case. Our approach to clone detection exploits the observation that clone instances, by their nature, should have a high degree of structural similarity. We look for identifiable characteristics or features that can be used as a signature to categorize arbitrary pieces of code.

The work presented here uses feature vectors to establish similarity measures. Features examined include metric values and specific data- and control-flow properties. The analysis framework uses two approaches:

1. direct comparison of metric values between begin-end blocks, and

2. dynamic programming techniques for comparing begin-end blocks at a statement-by-statement basis.

Metric-value similarity analysis is based on the assumption that two code fragments C1 and C2 have metric values M(C1) and M(C2) for some source code metric M. If the two fragments are similar under the set of features measured by M, then the values of M(C1) and M(C2) should be proximate.

Program features relevant for clone detection focus on data and control flow program properties. Modifications of five widely used metrics (Adamov, 1987), (Buss et al., 1994), whose components exhibit low correlation (based on the Spearman-Pierson correlation test), were selected for our analyses:

1. The number of functions called (fanout);


2. The ratio of input/output variables to the fanout;

3. McCabe cyclomatic complexity;

4. Modified Albrecht's function point metric;

5. Modified Henry-Kafura's information flow quality metric.

Detailed descriptions and references for the metrics will be given later on in this section. Similarity of two code fragments is measured using the resulting 5-dimensional vector. Two methods of comparing metric values were used. The first, naive approach is to make O(n²) pairwise comparisons between code fragments, evaluating the Euclidean distance of each pair. A second, more sophisticated analytical approach was to form clusters by comparing values on one or more axes in the metric space.
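A minimal sketch of the naive approach follows, assuming each begin-end block has already been summarized by its 5-dimensional metric vector; the block identifiers, values, and threshold are illustrative.

# Sketch of the naive O(n^2) comparison: Euclidean distance between the
# 5-dimensional metric vectors of begin-end blocks; pairs below a threshold
# are reported as potential clones.

from itertools import combinations
from math import sqrt

def euclidean(m1, m2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(m1, m2)))

def potential_clones(blocks, threshold):
    """blocks: dict mapping block id -> 5-tuple of metric values."""
    return [(b1, b2) for b1, b2 in combinations(blocks, 2)
            if euclidean(blocks[b1], blocks[b2]) <= threshold]

blocks = {"f1:12-40": (4.0, 0.5, 3.0, 27.0, 16.0),
          "f2:55-83": (4.0, 0.5, 3.0, 28.0, 16.0),
          "f3:10-19": (1.0, 2.0, 7.0, 90.0, 4.0)}
print(potential_clones(blocks, threshold=2.0))   # [('f1:12-40', 'f2:55-83')]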

The selection of the blocks to be compared is based on the proximity of their metric values on a selected metric axis. Specifically, when the source code is parsed, an Abstract Syntax Tree (AST) Tc is created; five different metrics are calculated compositionally for every statement, block, function, and file of the program and are stored as annotations in the corresponding nodes of the AST. Once metrics have been calculated and annotations have been added, a reference table is created that contains source code entities sorted by their corresponding metric values. This table is used for selecting the source code entities to be matched based on their metric proximity. The comparison granularity is at the level of a begin-end block more than n lines long, where n is a parameter provided by the user.

In addition to the direct metric comparison techniques, we use dynamic programming techniques to calculate the best alignment between two code fragments based on insertion, deletion and comparison operations. Rather than working directly with textual representations, source code statements, as opposed to begin-end blocks, are abstracted into feature sets that classify the given statement. The features per statement used in the Dynamic Programming approach are:

• Uses of variables, definitions of variables, numerical literals, and strings;

• Uses and definitions of data types;

• The five metrics as discussed previously.

Dynamic programming (DP) techniques detect the best alignment between two code fragments based on insertion, deletion and comparison operations. Two statements match if they define and use the same variables, strings, and numerical literals. Variations in these features provide a dissimilarity value used to calculate a global dissimilarity measure of more complex and composite constructs such as begin-end blocks and functions. The comparison function used to calculate dissimilarity measures is discussed in detail in Section 2.3. Heuristics have been incorporated in the matching process to facilitate variations that may have occurred in cut and paste operations. In particular, the following heuristics are currently considered:

• Adjustments between variable names by considering lexicographical distances;


• Filtering out short and trivial variable names such as i and j, which are typically used for temporary storage of intermediate values and as loop index values. In the current implementation, only variable names more than three characters long are considered.

Dynamic programming is a more accurate method than the direct metric-comparison-based analysis (Fig. 2) because the comparison of the feature vector is performed at the statement level. Code fragments are selected for Dynamic Programming comparison by preselecting potential clone candidates using the direct metric comparison analysis. Within this framework only the begin-end blocks that have a dissimilarity measure less than a given threshold are considered for DP comparison. This preselection reduces the comparison space for the more computationally expensive DP match.
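A hedged sketch of the alignment computation follows, using an edit-distance style recurrence over per-statement feature sets; the dissimilarity function and gap costs here are illustrative, not the exact measure used by the Ariadne tools (discussed in Section 2.3).

# Hedged sketch of the dynamic-programming alignment: an edit-distance style
# computation over sequences of per-statement feature sets, with insertion,
# deletion and substitution costs.  Feature sets and costs are illustrative.

def stmt_dissimilarity(f1, f2):
    """Fraction of features (uses, defs, literals, ...) not shared."""
    union = f1 | f2
    return 0.0 if not union else 1.0 - len(f1 & f2) / len(union)

def block_dissimilarity(block1, block2, gap_cost=1.0):
    n, m = len(block1), len(block2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap_cost                      # deletions
    for j in range(1, m + 1):
        d[0][j] = j * gap_cost                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + gap_cost,
                          d[i][j - 1] + gap_cost,
                          d[i - 1][j - 1] + stmt_dissimilarity(block1[i - 1],
                                                               block2[j - 1]))
    return d[n][m]

a = [{"def:x", "use:n"}, {"use:x", "call:foo"}]
b = [{"def:x", "use:n"}, {"use:tmp"}, {"use:x", "call:foo"}]
print(block_dissimilarity(a, b))   # 1.0: cost of the one inserted statement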

The following sections further discuss these approaches and present experimental results from analyzing medium scale (< 100 kLOC) software systems.

2.1. Program Representation and the Development of the Ariadne Environment

The foundation of the Ariadne system is a program representation scheme that allows for the calculation of the feature vectors for every statement, block or function of the source code. We use an object-oriented annotated abstract syntax tree (AST). Nodes of the AST are represented as objects in a LISP-based development environment.

Creating the annotated AST is a three-step process. First, a grammar and object (domain) model must be written for the programming language of the subject system. The tool vendor has parsers available for such common languages as C and COBOL. Parsers for other languages may be easily constructed or obtained through the user community. The domain model defines object-oriented hierarchies for the AST nodes in which, for example, an If-Statement and a While-Statement are defined to be subclasses of the Statement class.

The second step is to use the parser on the subject system to construct the AST representation of the source code. Some tree annotations, such as linkage information and the call graph, are created automatically by the parser. Once the AST is created, further steps operate in an essentially language-independent fashion.

The final step is to add additional annotations into the tree for information on data types, dataflow (dataflow graphs), the results of external analysis, and links to informal information. Such information is typically obtained using dataflow analysis algorithms similar to the ones used within compilers.

For example, consider the following code fragment from an IBM-proprietary PL/1-like language. The corresponding AST representation for the if statement is shown in Fig. 1. The tree is annotated with the fan-out attribute, which has been determined during an analysis phase following the initial parse.

MAIN: PROCEDURE(OPTION);
    DCL OPTION FIXED(31);
    IF (OPTION>0) THEN
        CALL SHOW_MENU(OPTION);
    ELSE
        CALL SHOW_ERROR("Invalid option number");
END MAIN;

(Figure 1 legend: boxes are AST nodes; arrows are links from parent to child via named attributes; the fanout attribute contains an integer value.)

Figure 1. The AST for an IF Statement With Fanout Attributes.
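The annotated-AST idea can be sketched with a toy object model (not the Refine domain model): each node keeps named links to its children plus a dictionary of analysis annotations such as fan-out.

# Toy object model (illustrative only): AST nodes carry named children plus
# analysis annotations such as fan-out, computed compositionally.

class ASTNode:
    def __init__(self, kind, **children):
        self.kind = kind
        self.children = children     # named links, e.g. condition, then, else
        self.annotations = {}        # e.g. {"fanout": 1}

if_stmt = ASTNode(
    "If-Statement",
    condition=ASTNode("Greater", left=ASTNode("Var"), right=ASTNode("Literal")),
    then=ASTNode("Call-Statement"),
    otherwise=ASTNode("Call-Statement"),
)
if_stmt.children["then"].annotations["fanout"] = 1
if_stmt.children["otherwise"].annotations["fanout"] = 1
if_stmt.annotations["fanout"] = 2    # computed compositionally from children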

2.2. Metrics Based Similarity Analysis

Metrics based similarity analysis uses five source-code metrics that are sensitive to several different control and data flow program features. Metric values are computed for each statement, block, and function. Empirical analysis (Buss et al., 1994) shows that the metric components have low correlation, so each metric adds useful information.

The features examined for metric computation include:

• Global and local variables defined or used;


• Functions called;

• Files accessed;

• I/O operations (read, write operations);

• Defined/used parameters passed by reference and by value;

• Control flow graph.

Partial matching may occur because the metrics are not sensitive to variable names, source code white space, and minor modifications such as replacement of while with for loops and insertion of statements that do not alter the basic data and control flow of the original code structure.

A description of the metrics used is given below but a more detailed description can be found in (Adamov, 1987), (Fenton, 1991), (Moller93).

Let 5 be a code fragment. The description of the five modified metrics used is given below. Note that these metrics are computed compositionally from statements, to beg in-end blocks, functions, and files.

1. S-COMPLEXITY(s) = FANJOUT{sf where

• FAN_OUT(s) is the number of individual function calls made within s.

D_COMPLEXITY(s) = GLOBALS{S)/{FANJDUT{S) + 1) where

• GLOBALS(s) is the number of individual declarations of global variables used or updated within s. A global variable is a variable which is not declared in the code fragment s.

3. MCCABE(5) = € - n + 2 where

• € is the number of edges in the control flow graph

• n is the number of nodes in the graph.

Altematively McCabe metric can be calculated using

• MCCABE(s) = 1 + d, where d is the number of control decision predicates in j

4. ALBRECHT(s) = {

where,

f pi * VARSJJSEDJiNDJSET{s)-^ P2 * GLOBAL.VARS.SET{s)-\-P3 * USERJNPUT{s)+

[ P4 * FILEJNPUT{s)

86 KONTOGIANNIS ET AL.

VARSJJSED^NDJ5ET{s) is the number of data elements set and used in the state­ment s,

GLOBAL-VARSSET{s) is the number of global data elements set in the statement s,

USERJNPUT{s) is the number of read operations in statement s,

FILEJNPUT(s) is the number of files accessed for reading in 5-.

The factors pi,.., p4, are weight factors. In (Adamov, 1987) possible values for these factors are given. In the current implementation the values chosen are pi = 5, p2 = 4, p3 = 4 and, P4 = 7. The selection of values for the piS' ^0 does not affect the matching process.

5. KAFURA(s) = { {KAFURAJN{s) * KAFURA.OUT{s)y where,

• KAFURA JN(5) is the sum of local and global incoming dataflow to the the code fragment s.

• KAFURA_OUT(s) is the sum of local and global outgoing dataflow from the the code fragment s.

Once the five metrics Mi to M5 are computed for every statement, block and function node, the pattern matching process is fast and efficient. It is simply the comparison of numeric values.

We have experimented with two techniques for calculating similar code fragments in a software system.

The first one is based on pairwise Euclidean distance comparison of all begin-end blocks that are of length more than n lines long, where n is a parameter given by the user. In a large software system though there are many begin-end blocks and such a pairwise comparison is not possible because of time and space limitations. Instead, we limit the pairwise comparison between only these begin-end blocks that for a selected metric axis Mi their metric values differ in less than a given threshold di. In such a way every block is compared only with its close metric neighbors.

The second technique is more efficient and is using clustering per metric axis. The technique starts by creating clusters of potential clones for every metric axis A^^ (i = 1 .. 5). Once the clusters for every axis are created, then intersections of clusters in different axes are calculated forming intermediate results. For example every cluster in the axis Mi contains potential clones under the criteria implied by this metric. Consequently, every cluster that has been calculated by intersecting clusters in Mi and Mj contains potential clones under the criteria implied by both metrics. The process ends when all metric axis have been considered. The user may specify at the beginning the order of comparison, and the clustering thresholds for every metric axis. The clone detection algorithm that is using clustering can be summarized as:

1. Select all source code begin-end blocks B from the AST that are more than n lines long. The parameter n can be changed by the user.

2. For every metric axis Mi (i = 1.. 5) create clusters Cij that contain begin-end blocks with distance less than a given threshold di that is selected by the user. Each cluster

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 87

then contains potential code clone fragments under the metric criterion Mi. Set the current axis Mcurr = Mi, where i = 1. Mark Mi as used

3. For every cluster Ccurr,m in the current metric axis Mcurr > intersect with all clusters Cj^k in one of the non used metric axis Mj, j G {1 .. 5}. The clusters in the resulting set contain potential code clone fragments under the criteria Mcurr and Mj, and form a composite metric axis McurrOj- Mark Mj as used and set the current axis Mcurr ~ '^currQj'

4. If all metric axes have been considered the stop; else go to Step 3.

The pattern matching engine uses either the computed Euclidean distance or clustering in one or more metric dimensions combined, as a similarity measure between program constructs.

As a refinement, the user may restrict the search to code fragments having minimum size or complexity.

The metric-based clone detection analysis has been applied to a several medium-sized production C programs.

In tcsh, a 45 kLOC Unix shell program, our analysis has discovered 39 clusters or groups of similar functions of average size 3 functions per cluster resulting in a total of 17.7 percent of potential system duplication at the function level.

In bash, a 40KLOC Unix shell program, the analysis has discovered 25 clusters, of average size 5.84 functions per cluster, resulting to a total of 23 percent of potential code duplication at the function level.

In CLIPS, a 34 kLOC expert system shell, we detected 35 clusters of similar functions of average size 4.28 functions per cluster, resulting in a total of 20 percent of potential system duplication at the function level.

Manual inspection of the above results combined with more detailed Dynamic Program­ming re-calculation of distances gave some statistical data regarding false positives. These results are given in Table 1. Different programs give different distribution of false alarms, but generally the closest the distance is to 0.0 the more accurate the result is.

The following section, discusses in detail the other code to code matching technique we developed, that is based on Dynamic Programming.

2.3. Dynamic Programming Based Similarity Analysis

The Dynamic Programming pattern matcher is used (Konto, 1994), (Kontogiannis, 1995) to find the best alignment between two code fragments. The distance between the two code fragments is given as a summation of comparison values as well as of insertion and deletion costs corresponding to insertions and deletions that have to be applied in order to achieve the best alignment between these two code fragments.

A program feature vector is used for the comparison of two statements. The features are stored as attribute values in a frame-based structure representing expressions and statements in the AST. The cumulative similarity measure T> between two code fragments P , M, is calculated using the function

D{£{l,p,V),£{lJ,M)) = Mm{

88 KONTOGIANNIS ET AL,

D : Feature Vector X Feature^Vector —> Real

where:

A ( p , j ~ l , P , M ) + D{£{l,p,nS{lJ-l,M))

I{p-lj^V,M)^ (1)

C{p-lJ-l,V,M)-h D{£{l,p-l,V),£{lJ-l,M))

and,

• Al is the model code fragment

• 7 is the input code fragment to be compared with the model M

• £{h jt Q) is a program feature vector from position / to position y in code fragment Q

• -D(Vx , Vy) is the the distance between two feature vectors Vx, Vy

• A(i, J, 7 5 M) is the cost of deleting \hc']th statement of Al, at position / of the fragment V

• /( i , J, 7 , X ) the cost of inserting the ith statement of V at position^* of the model M and

• C(^, J, V^ M) is the cost of comparing the ith statement of the code fragment V with the j A fragment of the model M. The comparison cost is calculated by comparing the corresponding feature vectors. Currently, we compare ratios of variables set, used per statement, data types used or set, and comparisons based on metric values

Note that insertion, and deletion costs are used by the Dynamic Programming algorithm to calculate the best fit between two code fragments. An intuitive interpretation of the best fit using insertions and deletions is "if we insert statement i of the input at position 7 of the model then the model and the input have the smallest feature vector difference."

The quality and the accuracy of the comparison cost is based on the program features se­lected and the formula used to compare these features. For simplicity in the implementation we have attached constant real values as insertion and deletion costs.

Table 1 summarizes statistical data regarding false alarms when Dynamic Programming comparison was applied to functions that under direct metric comparison have given distance 0.0. The column labeled Distance Range gives the value range of distances between functions using the Dynamic Progranmiing approach. The column labeled False Alarms contains the percentage of functions that are not clones but they have been identified as such. The column labeled Partial Clones contains the percentage of functions which correspond

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 8 9

Table 1. False alarms for the Clips program

Distance Range 0.0

0.01 - 0.99

1.0-1.49

1.5-1.99

2.0 - 2.99

3.0 - 3.99

4.0 - 5.99

6.0 -15.0

False Alarms 0.0%

6.0%

8.0%

30.0%

36.0%

56.0%

82.0%

100.0%

Partial Clones 10.0%

16.0 %

3.0%

37.0 %

32.0 %

13.0 %

10.0 %

0.0%

Positive Clones 90.0%

78.0%

89.0%

33.0%

32.0%

31.0%

8.0%

0.0%

only in parts to cut and paste operations. Finally, the column labeled as Positive Clones contains the percentage of functions clearly identified as cut and paste operations.

The matching process between two code fragments M and V is discussed with an example later in this section and is illustrated in Fig.3

The comparison cost function C{i,j,M,V) is the key factor in producing the final distance result when DP-based matching is used. There are many program features that can be considered to characterize a code fragment (indentation, keywords, metrics, uses and definitions of variables). Within the experimentation of this approach we used the following three different categories of features

1. definitions and uses of variables as well as, literal values within a statement:

(A) Featurei : Statement —> String denotes the set of variables used in within a statement,

(B) Feature2 - Statement -^ String denotes the set of variables defined within a statement

(C) Features • Statement —> String denotes the set of literal values (i.e numbers, strings) within a statement (i.e. in a printf statement).

2. definitions and uses of data types :

(A) Featurei • Statement —> String denotes the set of data type names used in within a statement,

(B) Feature2 • Statement —> String denotes the set of data type names defined within a statement

The comparison cost of the ith statement in the input V and the jth statement of the model M. for the first two categories is calculated as :

90 KONTOGIANNIS ET AL.

. 1 Y^ card{InputFeaturem{Vi) O ModelFeaturem{-Mj)) *' ^ V ^ card{InputFeaturem{Vi)UModelFeaturemMj))

where v is the size of the feature vector, or in other words how many features are used,

3. five metric values which are calculated compositionally from the statement level to function and file level:

The comparison cost of the ith statement in the input V and the jth statement of the model M when the five metrics are used is calculated as :

C{VuMj) ^^{Mk{V,) - MUMj))^ (3) \ A:=l

Within this framework new metrics and features can be used to make the comparison process more sensitive and accurate.

The following points on insertion and deletion costs need to be discussed.

• The insertion and deletion costs reflect the tolerance of the user towards partial matching (i.e. how much noise in terms of insertions and deletions is allowed before the matcher fails). Higher insertion and deletion costs indicate smaller tolerance, especially if cut­off thresholds are used (i.e. terminate matching if a certain threshold is exceeded), while smaller values indicate higher tolerance.

The values for insertion and deletion should be higher than the threshold value by which two statements can be considered "similar", otherwise an insertion or a deletion could be chosen instead of a match.

A lower insertion cost than the corresponding deletion cost indicates the preference of the user to accept a code fragment V that is written by inserting new statements to the model M. The opposite holds when the deletion cost is lower than the corresponding insertion cost. A lower deletion cost indicates the preference of the Ubcr to accept a code fragment V that is written by deleting statements from the model M. Insertion and deletion costs are constant values throughout the comparison process and can be set empirically.

When different comparison criteria are used different distances are obtained. In Fig.2 (Clips) distances calculated using Dynamic Programming are shown for 138 pairs of func­tions (X - axis) that have been already identified as clones (i.e. zero distance) using the direct per function metric comparison. The dashed line shows distance results when def­initions and uses of variables are used as features in the dynamic programming approach, while the solid line shows the distance results obtained when the five metrics are used as features. Note that in the Dynamic Programming based approach the metrics are used at

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 91

Distances between Function pairs (Clips) Distances between Function Pairs (Bash)

- Distances on definitions and uses of variables

_ Distances on data and control flow measurements.

40 60 80 Function Pairs

3h

1

C 100 120 140 0

- Distances on definitions and uses of variables

_ Distances on data and control flow measurements

Figure 2. Distances between function pairs of possible function clones using DP-based matching.

the statement level, instead of the begin-end block level when metrics direct comparison is performed.

As an example consider the following statements M and V:

ptr = head;

while(ptr != NULL && !found)

{ i f ( p t r - > i t e i t i == s e a r c h l t e m )

found = 1 e l s e ptr = ptr->next;

while(ptr != NULL && !found)

{ if(ptr->item == searchltem)

92 KONTOGIANNIS ET AL.

^ ptr I- . . i£() .

•Ise-part

then-part

ptr->lten •>

1 y. t^—1— \ ^-^

M A

£ounJk> 1

^ - l - T -i2i

ptx->it«m H . . than-purt alfls part prlntfO. . found • 1

Figure 3. The matching process between two code fragments. Insertions are represented as horizontal hnes, deletions as vertical lines and, matches as diagonal hnes.

{

printf("ELEMENT FOUND

found = 1;

}

else

ptr = ptr->next;

%s\n", searchltem);

The Dynamic Programming matching based on definitions and uses of variables is illus­trated in Fig. 3.

In the first grid the two code fragments are initially considered. At position (0, 0) of the first grid a deletion is considered as it gives the best cumulative distance to this point (assuming there will be a match at position (0, 1). The comparison of the two composite while statements in the first grid at position (0, 1), initiates a nested match (second grid). In the second grid the comparison of the composite i f - t h e n - e l s e statements at position (1,1) initiates a new nested match. In the third grid, the comparison of the composite t h e -p a r t of the i f - t h e n - e l s e statements initiates the final fourth nested match. Finally, in the fourth grid at position (0, 0), an insertion has been detected, as it gives the best cumulative distance to this point (assuming a potential match in (1,0).

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 9 3

When a nested match process finishes it passes its result back to the position from which it was originally invoked and the matching continues from this point on.

3. Concept To Code Matching

The concept assignment (Biggerstaff, 1994) problem consists of assigning concepts de­scribed in a concept language to program fragments. Concept assignment can also be seen as a matching problem.

In our approach, concepts are represented as abstract-descriptions using a concept lan­guage called ACL. The intuitive idea is that a concept description may match with a number of different implementations. The probability that such a description matches with a code fragment is used to calculate a similarity measure between the description and the implemen­tation. An abstract-description is parsed and a corresponding AST Ta is created. Similarly, source code is represented as an annotated AST Tc. Both Ta and Tc are transformed into a sequence of abstract and source code statements respectively using transformation rules. We use REFINE to build and transform both ASTs. The reason for this transformation is to reduce the complexity of the matching algorithm as Ta and Tc may have a very complex and different to each other structure. In this approach feature vectors of statements are matched instead of Abstract Syntax Trees. Moreover, the implementation of the Dynamic Programming algorithm is cleaner and faster once structural details of the ASTs have been abstracted and represented as sequences of entities.

The associated problems with matching concepts to code include :

• The choice of the conceptual language,

• The measure of similarity,

• The selection of a fragment in the code to be compared with the conceptual represen­tation.

These problems are addressed in the following sections.

3,1, Language for Abstract Representation

A number of research teams have investigated and addressed the problem of code and plan localization. Current successful approaches include the use of graph granmiars (Wills, 1992), (Rich, 1990), query pattern languages (Paul, 1994), (Muller, 1992), (Church, 1993), (Biggerstaff, 1994), sets of constraints between components to be retrieved (Ning, 1994), and summary relations between modules and data (Canfora, 1992).

In our approach a stochastic pattern matcher that allows for partial and approximate matching is used. A concept language specifies in an abstract way sequences of design concepts.

The concept language contains:

94 KONTOGIANNIS ET AL.

• Abstract expressions £ that correspond to source code expression. The correspondence between an abstract expression and the source code expression that it may generate is given at Table 3

• Abstract feature descriptions T that contain the feature vector data used for matching purposes. Currently the features that characterize an abstract statement and an abstract expression are:

1. Uses of variables : variables that are used in a statement or expression

2. Definitions of variables', ariables that are defined in a statement or expression

3. Keywords: strings, numbers, characters that may used in the text of a code statement

4. Metrics : a vector of five different complexity, data and control flow metrics.

• Typed Variables X

Typed variables are used as a placeholders for feature vector values, when no actual values for the feature vector can be provided. An example is when we are looking for a Traversal of a list plan but we do not know the name of the pointer variable that exists in the code. A type variable can generate (match) with any actual variable in the source code provided that they belong to the same data type category. For example a List type abstract variable can be matched with an Array or a Linked List node source code pointer variable.

Currently the following abstract types are used :

1. Numeral: Representing Int, and float types

2. Character : Representing char types

3. List: Representing array types

4. Structure : Representing struct types

5. Named : matching the actual data type name in the source code

• Operators O

Operators are used to compose abstract statements in sequences. Currently the following operators have been defined in the language but only sequencing is implemented for the matching process :

1. Sequencing (;): To indicate one statement follows another

2. Choice ( 0 ) : To indicate choice (one or the other abstract statement will be used in the matching process

3. Inter Leaving (|| ) : to indicate that two statements can be interleaved during the matching process

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 95

Table 2. Generation (Allowable Matching) of source code statements from ACL statements

ACL Statement

Abstract Iterative Statement

Abstract While Statement

Abstract For Statement

Abstract Do Statement

Abstract Conditional Statement

Abstract If Statement

Abstract Switch Statement

Abstract Return Statement

Abstract GoTo Statement

Abstract Continue Statement

Abstract Break Statement

Abstract Labeled Statement

Abstract Statement*

AhstractStatement^

Generated Code Statement While Statement For Statement Do Statement

While Statement

For Statement

Do Statement

If Statement Switch Statement

If Statement

Switch Statement

Return Statement

GoTo Statement

Continue Statement

Break Statement

Labeled Statement

Zero or more sequential source code statements

One or more sequential source code statements

9 6 KONTOGIANNIS ET AL.

Table 3. Generation (Allowable Matching) of source code expressions from ACL expressions

ACL Expression Abstract Function Call

Abstract Equality

Abstract Inequality

Abstract Logical And

Abstract Logical Or

Abstract Logical Not

Generated Code Expression Function Call

Equality (==)

Inequality (\ =)

Logical And (Sz&z)

Logical Or (\\)

Logical Not (!)

• Macros M

Macros are used to facilitate hierarchical plan recognition (Hartman, 1992), (Chikof-sky, 19890). Macros are entities that refer to plans that are included at parse time. For example if a plan has been identified and is stored in the plan base, then special preprocessor statements can be used to include this plan to compose more complex patterns. Included plans are incorporated in the current pattern's AST at parse time. In this way they are similar to inline functions in C++.

Special macro definition statements in the Abstract Language are used to include the necessary macros.

Currently there are two types of macro related statements

1. include definitions: These are special statements in ACL that specify the name of the plan to be included and the file it is defined.

As an example consider the statement

include planl.acl traversal-linked-list

that imports the plan traversal-linked-list defined in file planl.acl.

2. inline uses : These are statements that direct the parser to inline the particular plan and include its AST in the original pattern's AST. As an example consider the inlining

p lan : traversal-linked-list that is used to include an instance of the traversal-linked-list plan at a particular point of the pattern. In a pattern more than one occurrence of an included plan may appear.

A typical example of a design concept in our concept language is given below. This pattern expresses an iterative statement (e.g. while ,for, do loop that has in its condition an inequality expression that uses variable ?x that is a pointer to the abstract type l i s t (e.g. array, linked list) and the conditional expression contains the keyword "NULL". The body of I t e r a t i v e - s t a t e m e n t contains a sequence of one or more stateme nts (+-statement)

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 9 7

that uses at least variable ?y (which matches to the variable obj) in the code below and contains the keyword meniber, and an Assignment-Statement that uses at least variable ?x, defines variable ?x which in this example matches to variable f i e ld , and contains the keyword next.

{

Iterative-statement(Inequality-Expression

abstract-description

uses : [ ?x : *list],

keywords : [ "NULL" ])

{

-(--Statement

abstract-description

uses : [?y : string, ..]

keywords : [ "member" ];

Assignment-Statement

abstract-description

uses : [?x, . . ] ,

defines : [?x], keywords : [ "next" ]

A code fragment that matches the pattern is:

{

while (field != NULL)

{

if (!strcmp(obj,origObj) ||

(!strcmp(field->AvalueType,"member") &&

notlnOrig ) )

if (strcmp(field->Avalue,"method") != 0)

INSERT_THE_FACT(o->ATTLIST[num].Aname,origObj,

field->Avalue);

field = field->nextValue;

}

}

3.2, Concept-tO'Code Distance Calculation

In this section we discuss the mechanism that is used to match an abstract pattern given in ACL with source code.

98 KONTOGIANNIS ET AL.

In general the matching process contains the following steps :

1. Source code (^i; ...5^) is parsed and an AST Tc is created.

2. The ACL pattern {Ai; ...A^) is parsed and an AST Ta is created.

3. A transformation program generates from Ta a Markov Model called Abstract Pattern Model (APM).

4. A Static Model called SCM provides the legal entities of the source language. The underlying finite-state automaton for the mapping between a APM state and an SCM state basically implements the Tables 2, 3.

5. Candidate source code sequences are selected.

6. A Viterbi (Viterbi, 1967) algorithm is used to find the best fit between the Dynamic Model and a code sequence selected from the candidate list.

A Markov model is a source of symbols characterized by states and transitions, A model can be in a state with certain probability. From a state, a transition to another state can be taken with a given probability. A transition is associated with the generation (recognition) of a symbol with a specific probability. The intuitive idea of using Markov models to drive the matching process is that an abstract pattern given in ACL may have many possible alternative ways to generate (match) a code fragment. A Markov model provides an appropriate mechanism to represent these alternative options and label the transitions with corresponding generation probabilities. Moreover, the Vitrebi algorithm provides an efficient way to find the path that maximizes the overall generation (matching) probability among all the possible alternatives.

The selection of a code fragment to be matched with an abstract description is based on the following criteria : a) the first source code statement Si matches with the first pattern statement Ai and, b) S2]S^]..Sk belong to the innermost block containing Si

The process starts by selecting all program blocks that match the criteria above. Once a candidate list of code fragments has been chosen the actual pattern matching takes place between the chosen statement and the outgoing transitions from the current active APM's state. If the type of the abstract statement the transition points to and the source code statement are compatible (compatibility is computed by examining the Static Model) then feature comparison takes place. This feature comparison is based on Dynamic Programming as described in section 2.3. A similarity measure is established by this comparison between the features of the abstract statement and the features of the source code statement. If composite statements are to be compared, an expansion function "flattens" the structure by decomposing the statement into a sequence of its components. For example an i f statement will be decomposed as a sequence of an express ion (for its condition), its then part and its e l s e part. Composite statements generate nested matching sessions as in the DP-based code-to-code matching.

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 9 9

3,3, ACL Markov Model Generation

Let Tc be the AST of the code fragment and Ta be the AST of the abstract representation. A measure of similarity between Tc and Ta is the following probability

where, (rci,...rc,,...rcj (5)

is the sequence of the grammar rules used for generating Tc and

{ra^,...ran"'raL) (6)

is the sequence of rules used for generating Ta. The probability in (1) cannot be computed in practice, because of complexity issues related to possible variations in Ta generating Tc. An approximation of (4) is thus introduced.

Let iSi, ..5fc be a sequence of program statements During the parsing that generates Ta, a sequence of abstract descriptions is produced. Each of these descriptions is considered as a Markov source whose transitions are labeled by symbols Aj which in turn generate (match) source code.

The sequence of abstract descriptions Aj forms a pattern A in Abstract Code Language (ACL) and is used to build dynamically a Markov model called Abstract Pattern Model (APM). An example of which is given in Fig.4.

The Abstract Pattern Model is generated an ACL pattern is parsed. Nodes in the APM represent Abstract ACL Statements and arcs represent transitions that determine what is expected to be matched from the source code via a link to a static, permanently available Markov model called a Source Code Model (SCM).

The Source Code Model is an alternative way to represent the syntax of a language entity and the correspondence of Abstract Statements in ACL with source code statements.

For example a transition in APM labeled as (pointing to) an Abs t rac t while S t a t e ­ment is linked with the while node of the static model. In its turn a while node in the SCM describes in terms of states and transitions the syntax of a legal while statement in c.

The best alignment between a sequence of statements S = 5i; ^2; 5fc and a pattern A = Ai;A2]....Aj is computed by the Viterbi (Viterbi, 1967) dynamic programming algorithm using the SCM and a feature vector comparison function for evaluating the following type of probabilities:

P , (5 i ,52 , . . .5 , |%,) ) (7)

where/(/) indicates which abstract description is allowed to be considered at step /. This is determined by examining the reachable APM transitions at the ith step. For the matching to succeed the constraint P^(*S'i|Ai) = 1.0 must be satisfied and ^/(fc) corresponds to a final APM state.

This corresponds to approximating (4) as follows (Brown, 1992):

Pr{Tc\Ta) c^ P.(5i; ..Sk\Ai- ..An) =

100 KONTOGIANNIS ET AL.

^maa : (P^ (5 l ;52 . .5 ,_ l |A l ;^2 ; . .%^- l ) ) •P r (5^ |%i ) ) ) (8) i=l

This is similar to the code-to-code matching. The difference is that instead of matching source code features, we allow matching abstract description features with source code features. The dynamic model (APM) guarantees that only the allowable sequences of comparisons are considered at every step.

The way to calculate similarities between individual abstract statements and code frag­ments is given in terms of probabilities of the form Pr{Si\Aj) as the probability of abstract statement Aj generating statement Si.

The probability p = Pr{Si\Aj) = Pscm{Si\Aj) * Pcomp{Si\Aj) is interpreted as "The probability that code statement Si can be generated by abstract statement Aj". The mag­nitude of the logarithm of the probability p is then taken to be the distance between Si and Aj.

The value ofp is computed by multiplying the probability associated with the correspond­ing state for Aj in SCM with the result of comparing the feature vectors of Si and Aj. The feature vector comparison function is discussed in the following subsection.

As an example consider the APM of Fig. 4 generated by the pattern ^ i ; ^2 5 ^3» where Aj is one of the legal statements in ACL. Then the following probabilities are computed for a selected candidate code fragment 5i , 52, S'a:

Figure 4. A dynamic model for the pattern Al\ A2*; A3*

Pr{Si\Ai) = 1.0 [delineation criterion) (9)

Pr{Su S2\A2) = PriSllAi) • Pr{S2\A2) (10)

PriSuS2\As) = PriSMl) ' Pr{S2\As) (11)

Pr{SuS2,S3\As)=Max Pr{SuS2\A2)'Pr{Ss\A3)

PriSi,S2\As)'Pr{Ss\As) (12)

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 101

Pr{SuS2,Ss\A2) = Pr{Si,S2\A2) ' Pr{Ss\A2) (13)

Note that when the first two program statements 81^82 have already been matched, (equations 12 and 13) two transitions have been consumed and the reachable active states currently are A2 or A3.

Moreover at every step the probabilities of the previous steps are stored and there is no need to be reevaluated. For example Pr{Si^S2\A2) is computed in terms of Pr{Si\Ai) which is available from the previous step.

With each transition we can associate a list of probabilities based on the type of expression likely to be found in the code for the plan that we consider.

For example, in the Traversa l of a l inked l i s t plan the while loop condition, which is an expression, most probably generates an i n e q u a l i t y of the form (list-node-ptr 1= NULL) which contains an identifier reference and the keyword NULL.

An example of a static model for the p a t t e r n - e x p r e s s i o n is given in Fig. 5. Here we assume for simplicity that only four C expressions can be generated by a P a t t e r n -Expression.

The initial probabilities in the static model are provided by the user who either may give a uniform distribution in all outgoing transitions from a given state or provide some subjectively estimated values. These values may come from the knowledge that a given plan is implemented in a specific way. In the above mentioned example of the Traversa l of a l inked l i s t plan the I t e r a t i v e - S t a t e m e n t pattern usually is implemented with a while loop. In such a scenario the i t e r a t i v e abstract statement can be considered to generate a while statement with higher probability than a for statement. Similarly, the expression in the while loop is more likely to be an inequality (Fig. 5). The preferred probabilities can be specified by the user while he or she is formulating the query using the ACL primitives. Once the system is used and results are evaluated these probabilities can be adjusted to improve the performance.

Probabilities can be dynamically adapted to a specific software system using a cache memory method originally proposed (for a different application) in (Kuhn, 1990).

A cache is used to maintain the counts for most frequently recurring statement patterns in the code being examined. Static probabilities can be weighted with dynamically estimated ones as follows :

Pscm{Si\Aj) = X . Pcache{Si\Aj) + (1 - A) • Pstatic{Si\Aj) (14)

In this formula Pcache{Si\Aj) represents the frequency that Aj generates Si in the code examined at run time while PstaticiSi\Aj) represents the a-priori probability of Aj gen­erating Si given in the static model. A is a weighting factor. The choice of the weighting factor A indicates user's preference on what weight he or she wants to give to the feature vector comparison. Higher A values indicate a stronger preference to depend on feature vector comparison. Lower A values indicate preference to match on the type of statement and not on the feature vector.

The value of A can be computed by deleted-interpolation as suggested in (Kuhn, 1990). It can also be empirically set to be proportional to the amount of data stored in the cache.

102 KONTOGIANNIS ET AL.

/ Pattern \ ,- -'*'' ^ y Inequality J

0.25 / ^^-.^^^

1 is-an-inequality

/ / Pattern \

/ 0.25^^^.^*\ Equality /

r Pattern \ is~an-equality

lExpression 7

I is-a-id-ref 7 Pattern \

\ V Id-Ref /

0.25 \

\ is-a-function-call

^'''^>..^/ Pattern \ V Fcn~Call /

1.0

expression

1.0

expression

1.0

id-ref

1.0

id-ref

Argl

Argl

Fen-Name

1.0

expression

1.0

expression

0.5

expression

Arg2

Arg2

-Args

expression

Figure 5. The static model for the expression-pattern. Different transition probability values may be set by the user for different plans. For example the traversal of linked-list plan may have higher probability attached to the is-an-inequality transition as the programmer expects a pattern of the form (field f= NULL)

As proposed in (Kuhn, 1990), different cache memories can be introduced, one for each Aj. Specific values of A can also be used for each cache.

3,4, Feature Vector Comparison

In this section we discuss the mechanism used for calculating the similarity between two feature vectors. Note that Si's and ^^'s feature vectors are represented as annotations in the corresponding ASTs.

The feature vector comparison of Si, Aj returns a value p = Pr{Si\Aj). The features used for comparing two entities (source and abstract) are:

1. Variables defined V : Source-Entity —> {String}

2. Variables usedU : Source-Entity —> {String}

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 1 0 3

3. Keywords /C : Source-Entity —> {String}

4. Metrics

• Fan out All : Source-Entity —> Number

• D-Complexity M2 - Source-Entity -^ Number

• McCabe Ms : Source-Entity -^ Number

• Albrecht M4 : Source-Entity —^ Number

• Kafura M5 : Source-Entity —> Number

These features are AST annotations and are implemented as mappings from an AST node to a set of AST nodes, set of Strings or set of Numbers.

Let Si be a source code statement or expression in program C and Aj an abstract statement or expression in pattern A. Let the feature vector associated with Si be Vi and the feature vector associated with Aj be Vj. Within this framework we experimented with the following similarity considered in the computation as a probability:

/CM \ -'• ^r^ car d{ Abstract Feature j^n^CodeF eaturci^n) comp % 3 ^ £^ card{AbstractFeaturej^n ^ CodeFeaturCi^n)

where v is the size of the feature vector, or in other words how many features are used, CodeFeaturei^n is the nth feature of source statement Si and, AbstractFeaturCj^n is the nth feature of the ACL statement Aj.

As in the code to code dynamic programming matching, lexicographical distances be­tween variable names (i.e. next, next value) and numerical distances between metrics are used when no exact matching is the objective. Within this context two strings are considered similar if their lexicographical distance is less than a selected threshold, and the comparison of an abstract entity with a code entity is valid if their corresponding metric values are less than a given threshold.

These themes show that ACL is viewed more as a vehicle where new features and new requirements can be added and be considered for the matching process. For example a new feature may be a link or invocation to another pattern matcher (i.e. SCRUPLE) so that the abstract pattern in ACL succeeds to match a source code entity if the additional pattern matcher succeeds and the rest of the feature vectors match.

4. System Architecture

The concept-to-code pattern matcher of the Ariadne system is composed of four modules. The first module consists of an abstract code language (ACL) and its corresponding parser.

Such a parser builds at run time, an AST for the ACL pattern provided by the user. The ACL AST is built using Refine and its corresponding domain model maps to entities of the C language domain model. For example, an Abstract-Iterative-Statement corresponds to an Iterative-Statement in the C domain model.

104 KONTOGIANNIS ET AL.

A Static explicit mapping between the ACL's domain model and C's domain model is given by the SCM (Source Code Model), Ariadne's second module. SCM consists of states and transitions. States represent Abstract Statements and are nodes of the ACL's AST. Incoming transitions represent the nodes of the C language AST that can be matched by this Abstract Statement. Transitions have initially attached probability values which follow a uniform distribution. A subpart of the SCM is illustrated in Fig. 5 where it is assumed for simplicity that an Abstract Pattern Expression can be matched by a C inequa l i t y , equa l i ty , i d e n t i f i e r reference , and a function c a l l .

The third module builds the Abstract Pattern Model at run time for every pattern provided by the user. APM consists of states and transitions. States represent nodes of the ACL's AST. Transitions model the structure of the pattern given, and provide the pattern statements to be considered for the next matching step. This model directly reflects the structure of the pattern provided by the user. Formally APM is an automaton <Q, E, 5, qo, F> where

• Q, is the set of states, taken from the domain of ACL's AST nodes

• S, is the input alphabet which consists of nodes of the C language AST

• <5, is a transition function implementing statement expansion (in the case of composite abstract or C statements) and the matching process

• qo,i^ the Initial state. The set of outgoing transitions must match the first statement in the code segment considered.

• F, is a set of final states. The matching process stops when one of the final states have been reached and no more statements from the source code can be matched.

Finally, the fourth module is the matching engine. The algorithm starts by selecting candidate code fragments V = Si; 82', .-Sk, given a model M = A\\A2\ ..An.

The Viterbi algorithm is used to evaluate the best path from the start to the final state of the APM,

An example of a match between two simple expresssions (di function call and an Abstract-Expression is given below :

INSERT_THE_FACT(o->ATTLIST[num].Aname,origObj,

field->Avalue);

is matched with the abstract pattern

Expression(abstract-description

uses : ["ATTLIST", "Aname", "Avalue"] Keywords : ["INSERT", "FACT"] )

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 105

In this scenario both abstract and code statements are simple and do not need expansion. Expression and INSERT_THE_FACT(...) are type compatible statements because an expression can generate a function call (Fig. 5) so the matching can proceed. The next step is to compare features, and lexicographical distances between variable names in the abstract and source statement. The finalvalue is obtained by multiplying the value obtained from the feature vectors comparison and the probability that Expression generates a Function Call. As the pattern statement does not specify what type of expression is to be matched the static model (SCM) provides an estimate. In the SCM given in Fig. 5 the likelihood that the Expression generates a function call is 0.25. The user may provide such a value if a plan favours a particular type instead of another. For example in the Traversa l of a 1 inked l i s t plan the loop statement is most likely to be a whi 1 e loop. Once a final value is set then a record < abstract jpattern, matched-Code, distance-value > is created and is associated with the relevant transition of the APM. The process ends when a final state of the APM has been reached and no more statements match the pattern.

With this approach the matching process does not fail when imperfect matching between the pattern and the code occurs. Instead, partial and inexact matching can be computed. This is very important as the programmer may not know how to specify in detail the code fragment that is sought.

To reduce complexity when variables in the pattern statement occur, Ariadne maintains a global binding table and it checks if the given pattern variable is bound to one of the legal values from previous instantiations. These legal values are provided by the binding table and are initialized every time a new pattern is tried and a new APM is created.

5. Conclusion

Pattern matching plays an important role for plan recognition and design recovery. In this paper we have presented a number of pattern matching techniques that are used for code-to-code and concept-to-code matching. The main objective of this research was to devise methods and algorithms that are time efficient, allow for partial and inexact matching, and tolerate a measure of dissimilarity between two code fragments. For code representation schemes the program's Abstract Syntax Tree was used because it maintains all necessary information without creating subjective views of the source code (control or data flow biased views).

Code-to-code matching is used for clone detection and for computing similarity distances between two code fragments. It is based on a) a dynamic programming pattern matcher that computes the best alignment between two code fragments and b) metric values obtained for every expression, statement, and block of the AST. Metrics are calculated by taking into account a number of control and data program properties. The dynamic programming pattern matcher produces more accurate results but the metrics approach is cheaper and can be used to limit the search space when code fragments are selected for comparison using the dynamic progranuning approach.

We have experimented with different code features for comparing code statements and are able to detect clones in large software systems > 300 KLOC. Moreover, clone detection is used to identify "conceptually" related operations in the source code. The performance

106 KONTOGIANNIS ET AL.

is limited by the fact we are using a LISP environment (frequent garbage collection calls) and the fact that metrics have to be calculated first. When the algorithm using metric values for comparing program code fragments was rewritten in C it performed very well. For 30KLOCS of the CLIPS system and for selecting candidate clones from approximately 500,000 pairs of functions the C version of the clone detection system run in less than 10 seconds on a Sparc 10, as opposed to a Lisp implementation that took 1.5 minutes to complete. The corresponding DP-based algorithm implemented in Lisp took 3.9 minutes to complete.

Currently the system is used for system clustering, redocumentation and program un­derstanding. Clone detection analysis reveals clusters of functions with similar behaviour suggesting thus a possible system decomposition. This analysis is combined with other data flow analysis tools (Konto, 1994) to obtain a multiple system decomposition view. For the visualization and clustering aspect the Rigi tool developed at the University of Victoria is used. Integration between the Ariadne tool and the Rigi tool is achieved via the global software repository developed at the University of Toronto.

The false alarms using only the metric comparison was on average for the three systems 39% of the total matches reported. When the DP approach was used,this ratio dropped to approximately 10% in average (when zero distance is reported). Even if the noise presents a significant percentage of the result, it can be filtered in almost all cases by adding new metrics (i.e. line numbers, Halstead's metric, statement count). The significant gain though in this approach is that we can limit the search space to a few hundreds (or less than a hundred, when DP is considered) of code fragment pairs from a pool of half a million possible pairs that could have been considered in total. Moreover, the method is fully automatic, does not require any knowledge of the system and is computationally acceptable 0{n * m) for DP, where m is the size of the model and n the size of the input.

Concept-to-code matching uses an abstract language (ACL) to represent code operations at an abstract level. Markov models and the Viterbi algorithm are used to compute similarity measures between an abstract statement and a code statement in terms of the probability that an abstract statement generates the particular code statement.

The ACL can be viewed not only as a regular expression-like language but also as a vehicle to gather query features and an engine to perform matching between two artifacts. New features, or invocations and results from other pattern matching tools, can be added to the features of the language as requirements for the matching process. A problem we foresee arises when binding variables exist in the pattern. If the pattern is vague then complexity issues slow down the matching process. The way we have currently overcome this problem is for every new binding to check only if it is a legal one in a set of possible ones instead of forcing different alternatives when the matching occurs.

Our current research efforts are focusing on the development of a generic pattern matcher which given a set of features, an abstract pattern language, and an input code fragment can provide a similarity measure between an abstract pattern and the input stream.

Such a pattern matcher can be used a) for retrieving plans and other algorithmic struc­tures from a variety of large software systems ( aiding software maintenance and program understanding), b) querying digital databases that may contain partial descriptions of data and c) recognizing concepts and other formalisms in plain or structured text (e.g.,HTML)

PATTERN MATCHING FOR CLONE AND CONCEPT DETECTION 1 0 7

Another area of research is the use of metrics for finding a measure of the changes introduced from one to another version in an evolving software system. Moreover, we investigate the use of the cloning detection technique to identify similar operations on specific data types so that generic classes and corresponding member functions can be created when migrating a procedural system to an object oriented system.

Notes

1. In this paper, "reverse engineering*' and related terms refer to legitimate maintenance activities based on source-language programs. The terms do not refer to illegal or unethical activities such as the reverse compilation of object code to produce a competing product.

2. "The Software Refinery" and REFINE are trademarks of Reasoning Systems, Inc.

3. We are using a commercial tool called REFINE (a trademark of Reasoning Systems Corp.).

4. The Spearman-Pearson rank correlation test was used.

References

Adamov, R. "Literature review on software metrics", Zurich: Institutfur Informatik der Universitat Zurich, 1987. Baker S. B, "On Finding Duplication and Near-Duplication in Large Software Systems" In Proceedings of the

Working Conference on Reverse Engineering 1995, Toronto ON. July 1995 Biggerstaff, T, Mitbander, B., Webster, D., "Program Understanding and the Concept Assignment Problem",

Communications of the ACM, May 1994, Vol. 37, No.5, pp. 73-83. P. Brown et. al. "Class-Based n-gram Models of natural Language", Journal of Computational Linguistics, Vol.

18, No.4, December 1992, pp.467-479. Buss, E., et. al. "Investigating Reverse Engineering Technologies for the CAS Program Understanding Project",

IBM Systems Journal, Vol. 33, No. 3,1994, pp. 477-500. G. Canfora., A. Cimitile., U. Carlini., "A Logic-Based Approach to Reverse Engineering Tools Production"

Transactions of Software Engineering, Vol.18, No. 12, December 1992, pp. 1053-1063. Chikofsky, E.J. and Cross, J.H. II, "Reverse Engineering and Design Recovery: A Taxonomy," IEEE Software,

Jan. 1990, pp. 13 -17. Church, K., Helfman, I., "Dotplot: a program for exploring self-similarity in millions of lines of text and code",

/. Computational and Graphical Statistics 2,2, June 1993, pp. 153-174. C-Language Integrated Production System User's Manual NASA Software Technology Division, Johnson Space

Center, Houston, TX. Fenton, E. "Software metrics: a rigorous approach". Chapman and Hall, 1991. Halstead, M., H., "Elements of Software Science", New York: Elsevier North-Holland, 1977. J. Hartman., "Technical Introduction to the First Workshop on Artificial Intelligence and Automated Program

Understanding" First Workshop on Al and Automated Program Understanding, AAAr92, San-Jose, CA. Horwitz S., "Identifying the semantic and textual differences between two versions of a program. In Proc. ACM

SIGPLAN Conference on Programming Language Design and Implementation, June 1990, pp. 234-245. Jankowitz, H., T, "Detecting plagiarism in student PASCAL programs". Computer Journal, 31.1, 1988, pp. 1-8. Johnson, H., "Identifying Redundancy in Source Code Using Fingerprints" In Proceedings of GASCON '93, IBM

Centre for Advanced Studies, October 24 - 28, Toronto, Vol.1, pp. 171-183. Kuhn, R., DeMori, R., "A Cache-Based Natural Language Model for Speech Recognition", IEEE Transactions

on Pattern Analysis and Machine Intelligence, Vol. 12, No.6, June 1990, pp. 570-583. Kontogiannis, K., DeMori, R., Bernstein, M., Merlo, E., "Localization of Design Concepts in Legacy Systems",

In Proceedings of International Conference on Software Maintenance 1994, September 1994, Victoria, BC. Canada, pp. 414-423.

1 0 8 KONTOGIANNIS ET AL.

Kontogiannis, K., DeMori, R., Bernstein, M., Galler, M., Merlo, E., "Pattern matching for Design Concept Localization", In Proceedings of the Second Working Conference on Reverse Engineering, July 1995, Toronto, ON. Canada, pp. 96-103.

"McCabe T., J. "Reverse Engineering, reusability, redundancy : the connection", American Programmer 3, 10, October 1990, pp. 8-13.

MoUer, K., Software metrics: a practitioner's guide to improved product development" Muller, H., Corrie, B., Tilley, S., Spatial and Visual Representations of Software Structures, Tech. Rep. TR-74.

086, IBM Canada Ltd. April 1992. Mylopoulos, J., "Telos : A Language for Representing Knowledge About Information Systems," University of

Toronto, Dept. of Computer Science Technical Report KRR-TR-89-1, August 1990, Toronto. J. NIng., A. Engberts., W. Kozaczynski., "Automated Support for Legacy Code Understanding", Communications

of the ACM, May 1994, Vol.37, No.5, pp.50-57. Paul, S., Prakash, A., "A Framework for Source Code Search Using Program Patterns", IEEE Transactions on

Software Engineering, June 1994, Vol. 20, No.6, pp. 463-475. Rich, C. and Wills, L.M., "Recognizing a Program's Design: A Graph-Parsing Approach," IEEE Software, Jan

1990, pp. 82 - 89. Tilley, S., Muller, H., Whitney, M., Wong, K., "Domain-retargetable Reverse Engineeringll: Personalized User

Interfaces", In CSM'94 : Proceedings of the 1994 Conference on Software Maintenance, September 1994, pp. 336 - 342.

Viterbi, A.J, "Error Bounds for Convolutional Codes and an Asymptotic Optimum Decoding Algorithm", IEEE Trans. Information Theory, 13(2) 1967.

Wills, L.M.,"Automated Program Recognition by Graph Parsing", MIT Technical Report, AI Lab No. 1358,1992

Automated Software Engineering, 3, 109-138 (1996) © 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netheriands.

Extracting Architectural Features from Source Code* DAVID R. HARRIS, ALEXANDER S. YEH drh@mitre.org The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA

HOWARD B. REUBENSTEIN * hbr@mitretek.org Mitretek Systems, 25 Burlington Mall Road, Burlington, MA 01803, USA

Abstract. Recovery of higher level design information and the ability to create dynamic software documen­tation is crucial to supporting a number of program understanding activities. Software maintainers look for standard software architectural structures (e.g., interfaces, interprocess communication, layers, objects) that the code developers had employed. Our goals center on supporting software maintenance/evolution activities through architectural recovery tools that are based on reverse engineering technology. Our tools start with existing source code and extract architecture-level descriptions linked to the source code firagments that implement architectural features. Recognizers (individual source code query modules used to analyze the target program) are used to locate architectural features in the source code. We also report on representation and organization issues for the set of recognizers that are central to our approach.

Keywords: Reverse engineering, software architecture, software documentation

1. Introduction

We have implemented an architecture recovery framework on top of a source code exam­ination mechanism. The framework provides for the recognition of architectural features in program source code by use of a library of recognizers. Architectural features are the constituent parts of architectural styles (Perry and Wolf, 1992), (Shaw, 1991) which in turn define organizational principles that guide a programmer in developing source code. Ex­amples of architectural styles include pipe and filter data processing, layering, abstract data type, and blackboard control processing.

Recognizers are queries that analysts or applications can run against source code to identify portions of the code with certain static properties. Moreover, recognizer authors and software analysts can associate recognition results with architectural features so that the code identified by a recognizer corresponds to an instance of the associated architectural feature. Within our implementation, we have developed an extensive set of recognizers targeted for architecture recovery applications. The implementation provides for analyst control over parameterization and retrieval of recognizers from a library.

* This is a revised and extended version based on two previous papers: 1. "Reverse Engineering to the Architectural Level" by Harris, Reubenstein and Yeh, which appeared in the Proceedings of the 17th International Conference on Software Engineering, April 1995, © 1995 ACM. 2. "Recognizers for Extracting Architectural Features from Source Code" by Harris, Reubenstein and Yeh, which appeared in the Proceedings of the 2nd Working Conference on Reverse Engineering, July 1995, © 1995 IEEE. The work reported in this paper was sponsored by the MITRE Corporation's internal research program and was performed while all the authors were at the MITRE Corp. This paper was written while H. Reubenstein was at GTE Laboratories. H. Reubenstein's current address is listed above.

Using these recognizers, we have recovered constituent features of architectural styles in our laboratory experiments (Harris, Reubenstein, Yeh: ICSE, 1995). In addition, we have used the recognizers in a stand-alone mode as part of a number of source code quality assessment exercises. These technology transfer exercises have been extremely useful for identifying meaningful architectural features.

Our motivation for building our recovery framework stems from our efforts to understand legacy software systems. While it is clear that every piece of software conforms to some design, it is often the case that existing documentation provides little clue to that design. Recovery of higher level design information and the ability to create as-built software documentation is crucial to supporting a number of program understanding activities. By stressing as-built, we emphasize how a program is actually structured versus the structure that designers sketch out in idealized documentation.

The problem with conventional paper documentation is that it quickly becomes out of date and it often is not adequate for supporting the wide range of tasks that a software maintainer or developer might wish to perform, e.g., general maintenance, operating system port, language port, feature addition, program upgrade, or program consolidation. For example, while a system block diagram portrays an idealized software architecture description, it typically does not even hint at the source level building blocks required to construct the system.

As a starting point, commercially available reverse engineering tools (Olsem and Sittenauer, 1993) provide a set of limited views of the source under analysis. While these views are an improvement over detailed paper designs in that they provide accurate information derived directly from the source code, they still only present static abstractions that focus on code level constructs rather than architectural features.

We argue that it is practical and effective to automatically (sometimes semi-automatically) recognize architectural features embedded in legacy systems. Our framework goes beyond basic tools by integrating reverse engineering technology and architectural style representations. Using the framework, analysts can recover multiple as-built views - descriptions of the architectural structures that actually exist in the code. Concretely, the representation of architectural styles provides knowledge of software design beyond that defined by the syntax of a particular language and enables us to respond to questions such as the following:

• When are specific architectural features actually present?

• What percent of the code is used to achieve an architectural feature?

• Where does any particular code fragment fall in an overall architecture?

The paper describes our overall architecture recovery framework including a description of our recognition library. We begin in Section 2 by describing the overall framework. Next, in Section 3, we address the gap between idealized architectural descriptions and source code and how we bridge this gap with architectural feature recognizers. In Section 4, we describe the underlying analysis tools of the framework. In Section 5, we describe the aspects of the recognition library that support analyst access and recognizer authoring. In Section 6, we describe our experience in using our recovery techniques on a moderately sized (30,000 lines of code) system. In addition, we provide a very preliminary notion of code coverage metrics that researchers can use for quantifying recovery results. Related work and conclusions appear in Sections 7 and 8 respectively.

2. Architecture Recovery - Framework and Process

Our recovery framework (see Figure 1) spans three levels of software representation:

• a program parsing capability (implemented using Software Refinery (Reasoning Systems, 1990)) with accompanying code level organization views, i.e., abstract syntax trees and a "bird's eye" file overview

• an architectural representation that supports both idealized and as-built architectural representations with a supporting library of architectural styles and constituent architectural features

• a source code recognition engine and a supporting library of recognizers

Figure 1 shows how these three levels interact. The idealized architecture contains the initial intentions of the system designers. Developers encode these intentions in the source code. Within our framework, the legacy source code is parsed into an internal abstract syntax tree representation. We run recognizers over this representation to discover architectural features - the components/connectors associated with architectural styles (selecting a particular style selects a set of constituent features to search for). The set of architectural features discovered in a program forms its as-built architecture containing views with respect to many architectural styles. Finally, note that the as-built architecture we have recovered is both less than and more than the original idealized architecture. The as-built is less than the idealized because it may miss some of the designer's original intentions and because it may not be complete. The as-built is also more than the idealized because it is up-to-date and because we now have on-line linkage between architecture features and their implementation in the code. We do not have a definition of a complete architecture for a system. The notions of code coverage described later in the paper provide a simple metric to use in determining when a full understanding of the system has been obtained.

The framework supports architectural recovery in both a bottom-up and top-down fashion. In bottom-up recovery, analysts use the bird's eye view to display the overall file structure and file components of the system. The features we display (see Figure 2) include file type (diamond shapes for source files with entry point functions; rectangles for other source files), name, pathname of directory, number of top level forms, and file size (indicated by the size of the diamond or rectangle). Since file structure is a very weak form of architectural organization, only shallow analysis is possible; however, the bird's eye view is a place where our implementation can register results of progress toward recognition of various styles.

In top-down recovery, analysts use architectural styles to guide a mixed-initiative recovery process. From our point of view, an architectural style places an expectation on what recovery tools will find in the software system. That is, the style establishes a set of architectural feature types which define the component and connector types to be found in the software. Recognizers are used to find the component/connector features. Once the features are discovered, the set of mappings from feature types to their realization in the source code forms the as-built architecture of the system.

[Figure omitted: the idealized architecture is implemented by the program; the program parses into an abstract syntax tree, which provides clues for recognizing architectural features; these combine using architectural styles to form views of the as-built architecture.]

Figure 1. Architectural recovery framework

2.1. Architectural Styles

The research community has provided detailed examples (Garlan and Shaw, 1993, Shaw, 1989, Shaw, 1991, Perry and Wolf, 1992, Hofmeister, Nord, Soni, 1995) of architectural styles, and we have codified many of these in an architecture modeling language. Our architecture modeling language uses entity/relation taxonomies to capture the component/connector style aspects that are prevalent in the literature (Abowd, Allen, Garlan, 1993, Perry and Wolf, 1992, Tracz, 1994). Entities include clusters, layers, processing elements, repositories, objects, and tasks. Some recognizers discover source code instances of entities where developers have implemented major components - "large" segments of source code (e.g., a layer may be implemented as a set of procedures). Relations such as contains, initiates, spawns, and is-connected-to each describe how entities are linked. Component participation in a relation follows from the existence of a connector - a specific code fragment (e.g., special operating system invocation) or the infrastructure that processes these fragments. This infrastructure may or may not be part of the body of software under analysis. For example, it may be found in a shared library or it may be part of the implementation language itself.

Figure 2. Bird's Eye Overview

As an illustration, Figure 3 details the task entity and the spawns relation associated with a task spawning style. In a task spawning architectural style, tasks (i.e., executable processing elements) are linked when one task initiates a second task. Task spawning is a style that is recognized by the presence of its connectors (i.e., the task invocations). Its components are tasks, repositories, and task-functions. Its connectors are spawns (invocations from tasks to tasks), spawned-by (the inverse of spawns), uses (relating tasks to any tasks with direct interprocess communications and to any repositories used for interprocess communications), and conducts (relating tasks to functional descriptions of the work performed).

Tasks are a kind of processing element that programmers might implement by files (more generally, by call trees). A default recognizer named executables will extract a collection of tasks. Spawns relates tasks to tasks (i.e., parent and child tasks respectively). Spawns might be implemented by objects of type system-call (e.g., in Unix/C, programmers can use a system, execl, execv, execlp, or execvp call to start a new process via a shell command). Analysts can use the default recognizer, find-executable-links, to retrieve instances of task spawning.

defentity TASK
    :specialization-of processing-element
    :possible-implementation file
    :recognized-by executables

defrel SPAWNS
    :specialization-of initiates
    :possible-implementation system-call
    :recognized-by find-executable-links
    :domain task
    :range task

Figure 3. Elements in an architecture modeling language

Many of the styles we work with have been elaborated by others (e.g., pipe and filter, object-oriented, abstract data type, implicit invocation, layered, repository). In addition we have worked with a few styles that have special descriptive power for the type of programs we have studied. These include application programming interface (API) use, the task spawning associated with real time systems, and a service invocation style. Space limitations do not permit a full description of all styles here. However, we offer two more examples to help the reader understand the scope of our activities.

Layered: In a layered architecture the components (layers) form a partitioning of a subset, possibly the entire system, of the program's procedures and data structures. As mentioned in (Garlan and Shaw, 1993), layering is a hierarchical style: the connectors are the specific references that occur in components in an upper layer and reference components that are defined in a lower layer. One way to think of a layering is that each layer provides a service to the layer(s) above it. A layering can either be opaque: components in one layer cannot reference components more than one layer away, or transparent: components in one layer can reference components more than one layer away.
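To make the opaque layering constraint concrete, the following sketch (Python, written for this discussion rather than drawn from our tool; the layer assignment and call list are hypothetical inputs produced by other analyses) flags references between components that sit more than one layer apart.

# Minimal sketch of an opaque-layering check (illustrative, assumed inputs).
# layer_of maps each procedure to its layer index (0 = bottom layer);
# calls is a list of (caller, callee) references recovered elsewhere.
def opaque_layering_violations(layer_of, calls):
    violations = []
    for caller, callee in calls:
        if caller not in layer_of or callee not in layer_of:
            continue  # procedures outside the layering are ignored in this sketch
        if abs(layer_of[caller] - layer_of[callee]) > 1:
            # an opaque layering forbids references that skip a layer
            violations.append((caller, callee))
    return violations

# Hypothetical example: "ui" sits two layers above "disk_io".
print(opaque_layering_violations(
    {"ui": 2, "session": 1, "disk_io": 0},
    [("ui", "session"), ("ui", "disk_io")]))   # -> [('ui', 'disk_io')]

Under the transparent variant described above, the distance test would simply be dropped.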


Data Abstractions and Objects: Two related ways to partially organize a system are to identify its abstract data types and its groups of interacting objects (Abelson and Sussman, 1984, Garlan and Shaw, 1993). A data abstraction is one or more related data representations whose internal structure is hidden to all but a small group of procedures, i.e., the procedures that implement that data abstraction. An object is an entity which has some persistent state (only directly accessible to that entity) and a behavior that is governed by that state and by the inputs the object receives. These two organization methods are often used together. Often, the instances of an abstract data type are objects, or conversely, objects are instances of classes that are described as types of abstract data.

3. Recognizers

Recognizers map parts of a program to features found in architectural styles. The recognizers traverse some or all of a parsed program representation (abstract syntax tree, or AST) to extract code fragments (pieces of concrete syntax) that implement some architectural feature. Examples of these code fragments include a string that names a data file or a call to a function with special effects. The fragments found by recognizers are components and connectors that implement architectural style features. A component recognizer returns a set of code-fragments in which each code-fragment is a component. A connector recognizer returns a set of ordered triples - code-fragment, enclosing structure, and some meaningful influence such as a referenced file, executable object, or service. In each triple, the code-fragment is a connector, and the other two elements are the two components being connected by that connector.

3.1. A Sample Recognizer

The appendix contains a partial listing of the recognizers we use. Here, we examine parts of one of these in detail.

Table 1 shows the results computed by a task spawning recognizer (named Find-Executable-Links) applied to a network management program. For each task to task connector, the ordered triple contains the special function call that is the connector, the task which makes the spawn (one end of the connector), and the task that is spawned (invoked) by the call (the other end). This recognizer has a static view of a task: a task is the call tree subset of the source code that might be run when the program enters the task.

Table 1. The results of task spawning recognition

Function Call    Spawning Task    Spawned Task
system(...       RUN_SNOOPY       SNOOPY
system(...       SNOOPY           EXNFS
system(...       SNOOPY           EX69
system(...       SNOOPY           EX25
system(...       SNOOPY           EX21
system(...       SNOOPY           SCANP
execlp(...       MAIN             RUN_SNOOPY

The action part of a recognizer is written in our RRL (REFINE-based recognition language). The main difference between RRL and REFINE is the presence in RRL of iteration operators that make it easy for RRL authors to express iterations over pieces of a code fragment. The RRL code itself may call functions written either in RRL or REFINE. Figure 4 shows the action part of the previously mentioned task spawning recognizer.

let (results = {})
    (for-every call in invocations-of-type('system-calls) do
        let (target = process-invoked(call))
            if ~(target = undefined) then
                let (root = go-to-top-from-root(call))
                    results <- prepend(results, [call, root, target]));
results

Figure 4. The action part of the task spawning recognizer Find-Executable-Links

This recognizer examines an AST that analysts generate using a REFINE language workbench, such as REFINE/C (Reasoning Systems, 1992). The recognizer calls the function invocations-of-type, which finds and returns a set of all the calls in the program to functions that may spawn a task. For each such call, the recognizer calls process-invoked to determine if a task is indeed spawned, and if so, get the target task being spawned. If process-invoked finds a target task, the recognizer then calls go-to-top-from-root, which finds the root of the task which made the call and then returns the entire call tree (the task) starting from that root. The target task is also in the form of the entire call tree starting from the target task's root function. These triples of function calls, spawning tasks and target tasks are saved in results and then returned by the recognizer.
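Seen purely as data, the recognizer's result is just a list of connector triples. The short sketch below (Python used only for illustration; the rows echo three entries of Table 1, and the adjacency map it builds is our own framing rather than a structure in the tool) shows the shape of such a result and how a spawning view can be read off of it.

from collections import defaultdict

# Connector triples: (connector code fragment, spawning task, spawned task),
# echoing three of the rows reported in Table 1.
spawn_links = [
    ("system(...", "RUN_SNOOPY", "SNOOPY"),
    ("system(...", "SNOOPY",     "EXNFS"),
    ("execlp(...", "MAIN",       "RUN_SNOOPY"),
]

# Group the triples into a spawning map: task -> tasks it spawns.
spawn_graph = defaultdict(list)
for call, spawner, spawned in spawn_links:
    spawn_graph[spawner].append(spawned)

print(dict(spawn_graph))
# {'RUN_SNOOPY': ['SNOOPY'], 'SNOOPY': ['EXNFS'], 'MAIN': ['RUN_SNOOPY']}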

Figure 5 shows what this task spawning recognizer examines when it encounters the special function call system(cmd), which is embedded in the task Run_snoopy and is used by Run_snoopy to spawn the task Snoopy. The command "system(cmd);" is a connector. Starting from that connector, the recognizer finds and connects Run_snoopy's call tree to Snoopy's call tree. The figure also shows processing details that are described in Section 4, where we highlight our underlying analysis capabilities.

[Figure omitted: the spawning task Run_snoopy (call tree rooted at main(argc, argv)) is connected to the spawned task Snoopy (call tree rooted in file snoopy.c); a map derived from the makefile locates the task root, and a special pattern identifies the spawned executable.]

Figure 5. Task spawning recognizer examines task Run_snoopy spawning task Snoopy via the connector "system(cmd);"

3.2. Rationale for Level of Recovery

In addition to architectural features actually found in the source code, we would like to recover the idealized architecture - a description of the intended structure for a program. Unfortunately, these idealized descriptions cannot be directly recognized from source code. The structural information at the code level differs from idealized descriptions in two important ways. First, while a program's design may commit to certain architectural features (e.g., pipes or application programming interfaces), actual programs implement these features with source code constructs (e.g., procedure parameter passing, use of Unix pipes, export lists) - a one-to-many conceptual shift from the idealized to the concrete. Second, there are differences due to architectural mixing/matching and architectural violations. Reasons for such violations are varied. Some are due to a developer's failure to either honor or understand the entailments of one or more architectural features. This erosion from the ideal usually increases over the life cycle of the program due to the expanding number of developers and maintainers who touch the code. Other violations are due to the inability of an existing or required environment (e.g., language, host platform, development tools, or commercial enabling software) to adequately support the idealized view and may occur with the earliest engineering decisions.

If we just target idealized architectures directly and do not search for architectural features as they are actually built in the source code, we risk missing important structures because the ideal does not exist in the code. To overcome this difficulty we use a partial recognition approach that does not require finding full compliance to an idealized architecture nor does it bog down in a detailed analysis of all of the source code. As described at the start of Section 3, we aim our recognizers at extracting code fragments that implement specific architectural features. Together, a collection of these code fragments forms a view on the program's as-built architecture, but generating such an aggregation is not the responsibility of the individual recognizers. Note that this restriction relaxes expectations that we will find fully formed instantiations of architectural styles in existing programs. Rather our recognizers will find partial instantiations of style concepts and are tolerant of architectural pathologies such as missing components and missing connectors.


The recognizers are designed to recognize typical and possible patterns of architectural feature implementations. The recognizers are not fool-proof. A programmer can always find an obscure way to implement an architectural feature which the recognizers will not detect, and a programmer may write code that accidentally aligns with an architectural feature. However, the recognizers written so far capture the more common patterns and have worked well on the examples we have seen. As we encounter more examples, we will modify and expand the recognizers as needed.

The more advanced recognizers from the set of recognizers (listed in the appendix) capture task spawnings and service invocations via slice evaluation and searching for special programming patterns. Section 4 highlights this analysis. In addition, (Holtzblatt, Piazza, Reubenstein, Roberts, 1994) describes our related work on CMS2 code. In most of the other cases, the features are not difficult to recognize. Among other things, the recognizers cover a wide spectrum of components and connectors that C/Unix programmers typically use for implementing architectural features.

4. Analysis Tools for Supporting Recognition

The recognizers make use of commercially available reverse engineering technology, but there are several important analysis capabilities that we have added. The capabilities themselves are special functions that recognizer authors can include in a recognizer's definition. The most prominent capabilities find potential values of variables at a given line of source code, analyze special patterns, manage clusters (i.e., collections of code fragments), and encode language-specific ways of accomplishing abstract tasks.

4.1. Values of Variables

Several recognizers use inter-procedural dataflow. We implement this analysis by first computing a program slice (Gallagher and Lyle, 1991), (Weiser, 1984) that handles parameter passing and local variables. From the slice, we compute a slice evaluation to retrieve the potential variable values at given points in the source code. This approach is used for finding users of communication channels, data files that a procedure accesses or modifies, and references to executable programs. Figure 6 shows two code fragments that illustrate the requirements for the slice evaluator. Starting with the first argument to the system call or the fourth argument to the execlp call, the slice evaluator finds the use of C's sprintf to assign the cmd variable with a command line string. The string contains a pathname to an executable image.

Our "slice evaluator" algorithm makes several assumptions to avoid intractable compu­tation. Most notably it ignores control flow and finds the potential values of argument assignments but not the conditions under which different choices will be made. For exam­ple, if variable x is bound to 3 and 5 respectively in the "then" and "else" parts of an "if" statement, the slice evaluator identifies 3 and 5 as possible values, but does not evaluate the conditional part of the " i f statement.


4.2. Special Patterns

Slicing provides only part of the story for the examples in Figure 6. Programmers use stereotypical code patterns to implement frequently occurring computations. Some of these patterns can be easily recognized in abstract syntax trees. For example, the code in Figure 6 shows two standard ways of invoking an executable (and potentially invoking a task). To uncover this architectural feature, we need to exploit knowledge of two patterns. The first pattern identifies the position of the key command string that contains the name of the executable: the first argument for system calls, and the last argument before the terminating null for execlp. The second pattern describes potential ways programmers can encode pathnames in the command strings. In the first example, the function sprintf binds the variable cmd to the string "%s/snoopy" where the %s is replaced by the name of the directory stored in the variable bin_dir. In the second, the movement to the appropriate directory ("cd %s/bin;") is separated from the actual spawning of "snoopy". We designed our approach to catch such dominant patterns and to ferret out the names of files and executable images (possibly tasks) within string arguments.

1. sprintf(cmd, "%s/snoopy", bin_dir);

if ( debug == 0)

status = system (cmd);

2. sprintf(cmd,"cd %s/bin; ./snoopy",

top_dir);

if (forkO == 0) {

e x e c l p ( " / b i n / s h " , " s h " , " - c " , cmd, ( c h a r * ) 0 ) ; }

Figure 6. Two approaches for invoking an executable image

Other examples of patterns for C/Unix systems include the use of socket calls with connect or bind calls for creating client-server architectures, and the declaration of read/write modes in fopen calls. While our approach has been somewhat catch-as-catch-can, we have found that identifying only a few of these patterns goes a long way toward recovering architectural features across many architectural styles.
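As a hedged illustration of this kind of string-pattern analysis, the fragment below (Python; the regular expression and helper are ours and are not the patterns used in the tool) pulls candidate executable names out of command strings shaped like the sprintf results in Figure 6, treating a %s directive as an unknown directory prefix.

import re

def executables_in_command(cmd):
    """Return candidate executable names mentioned in a shell command string.
    Handles the two Figure 6 shapes: "%s/snoopy" and "cd %s/bin; ./snoopy"."""
    candidates = []
    # split on ';' so "cd %s/bin; ./snoopy" yields the trailing "./snoopy" part
    for part in cmd.split(";"):
        part = part.strip()
        if not part or part.startswith("cd "):
            continue  # directory changes are not invocations
        first_word = part.split()[0]
        # strip a leading "./", "%s/", or any other path prefix to get the basename
        basename = re.sub(r"^(\./|%s/|.*/)", "", first_word)
        if basename:
            candidates.append(basename)
    return candidates

print(executables_in_command("%s/snoopy"))            # ['snoopy']
print(executables_in_command("cd %s/bin; ./snoopy"))  # ['snoopy']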

4.3. Clustering

Clusters are groupings of features of the program - a set of files, a set of procedures, or other informal structures of a program. Some recognizers need to bundle up collections of objects that may be de-localized in the code. Clustering facilities follow some algorithm for gathering elements from the abstract syntax tree. They create clusters (or match new collections to an old cluster), and, in some cases, conduct an analysis that assigns properties to pairs of clusters based on relationships among constituent parts of the clusters.

For example, our OBject and Abstract Data type (OBAD) recovery sub-tool (Harris, Reubenstein, Yeh: Recovery, 1995) builds clusters whose constituents are collections of procedures, data structures, or global variables. OBAD is an interactive approach to the recovery of implicit abstract data types (ADTs) and object instances from C source code. This approach includes automatic recognition and semi-automatic techniques that handle potential recognition pitfalls.

OBAD assumes that an ADT is implemented as one or a few data structure types whose internal fields are only referenced by the procedures that are part of the ADT. The basic version of OBAD finds candidate ADTs by examining a graph where the procedures and structure types are the nodes of the graph, and the references by the procedures to the internal fields of the structures are the edges. The set of connected components in this graph forms the set of candidate ADTs. OBAD has automatic and semi-automatic enhancements to handle pitfalls by modifying what is put into the above graphs. Currently, OBAD constructs the graph from the abstract syntax tree. In the future, OBAD will use graphs made from the results returned by more primitive recognizers.

Also, recognizers can use clusters as input and proceed to detect relationships among clusters. For example, a computation of pairwise cluster level dominance looks at the procedures within two clusters. If cluster A contains a reference to an entry point defined in cluster B, while cluster B does not reference cluster A, we say that A is dominant over B. This notion of generalizing properties held by individual elements of groups occurs in several of our recognizers.
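Both computations lend themselves to small graph sketches. The Python fragment below (an illustrative approximation, not the OBAD implementation; the field-reference pairs and the call list are hypothetical) builds candidate ADT clusters as connected components of the procedure/structure graph and then applies the pairwise dominance test just described.

def candidate_adts(field_refs):
    """field_refs: (procedure, struct_type) pairs meaning the procedure references
    an internal field of the struct. Connected components become candidate ADTs."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for proc, struct in field_refs:
        union(("proc", proc), ("struct", struct))
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

def dominates(cluster_a, cluster_b, calls):
    """A dominates B if A references an entry point in B while B never references A."""
    a_procs = {name for kind, name in cluster_a if kind == "proc"}
    b_procs = {name for kind, name in cluster_b if kind == "proc"}
    a_to_b = any(c in a_procs and t in b_procs for c, t in calls)
    b_to_a = any(c in b_procs and t in a_procs for c, t in calls)
    return a_to_b and not b_to_a

# Hypothetical data: two table procedures share struct "table"; one list procedure.
adts = candidate_adts([("tbl_get", "table"), ("tbl_put", "table"), ("lst_add", "list")])
print(adts)
print(dominates(adts[0], adts[1], [("tbl_get", "lst_add")]))   # True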

4.4. Language/Operating-System Models

A design goal has been to write recognizers that are LOL-independent - independent of specific patterns due to the source code Language, the Operating system, and any Legacy system features. Our hope is that we will be able to reuse most recognizers across ASTs associated with different LOL combinations. While we have not explored this goal extensively, we have had some success with recognizers that work for both FORTRAN (under the MPX operating system) and C (under Unix). Our approach to this is two-fold. First, we write recognizers using special accessors and analysis functions that have distinct implementations for each LOL. That is, the special access functions need to be re-written for each LOL, but the recognizer's logic is reusable across languages. Second, we isolate LOL-specific function names (e.g., operating system calls) in separately loadable libraries of call specifications. Each call specification describes the language, operating system, and sometimes even target system approach for coding LOL-neutral behaviors such as system calls, time and date calls, communication channel creators, data accessing, data transmission, input/output calls, APIs for commercial products, and network calls. For example, Figure 7 is the C/Unix model for system-calls (i.e., calls that run operating system line commands or spawn a task) while Figure 8 shows an analogous FORTRAN/MPX model.


These specifications are also a convenient place for describing attributes of special patterns. In these examples, the key-positions field indicates the argument position of the variable that holds the name of the executable invoked.

defcalls SYSTEM-CALLS
    :call-desc "System Calls"
    :call-type system-call
    :call-ref-names "system", "execve", "execl", "execv",
                    "execlp", "execvp", "execle"
    :key-positions first, next-last, next-last, next-last,
                   next-last, next-last, next-last

Figure 7. A C/Unix Call Specification

defcalls SYSTEM-CALLS
    :call-desc "System Calls"
    :call-type system-call
    :call-ref-names "m::rsum", "m::sspnd"
    :key-positions first, first

Figure 8. A FORTRAN/MPX Call Specification
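Loaded into a recognizer, a specification like these behaves as a small lookup table. The sketch below is our own illustrative rendering (Python; the helper name is hypothetical, while the field names and values follow Figures 7 and 8): given a function name, it reports which argument position carries the executable name under the loaded specification, or None when the call is not covered.

# One loadable specification per language/operating-system (LOL) combination;
# field names mirror the defcalls forms of Figures 7 and 8.
C_UNIX_SYSTEM_CALLS = {
    "call-desc": "System Calls",
    "call-type": "system-call",
    "call-ref-names": ["system", "execve", "execl", "execv",
                       "execlp", "execvp", "execle"],
    "key-positions": ["first", "next-last", "next-last", "next-last",
                      "next-last", "next-last", "next-last"],
}

def key_position(spec, function_name):
    """Return the argument position holding the executable name, or None
    if the function is not named by this call specification."""
    names = spec["call-ref-names"]
    if function_name not in names:
        return None
    return spec["key-positions"][names.index(function_name)]

print(key_position(C_UNIX_SYSTEM_CALLS, "system"))   # first
print(key_position(C_UNIX_SYSTEM_CALLS, "execlp"))   # next-last
print(key_position(C_UNIX_SYSTEM_CALLS, "fopen"))    # None

Swapping in the FORTRAN/MPX specification of Figure 8 changes only the table, not the lookup, which is the point of keeping the specifications in separately loadable libraries.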

4.5. An Example - Putting it all together

We return to the find-executable-links recognizer described in Section 3.1. When faced with either code fragment of Figure 6, this recognizer will collect the appropriate triple. We explain this activity in terms of the above analysis capabilities. The functions go-to-top-from-root and invocations-of-type perform their job by traversing the program AST. invocations-of-type accesses the call-specification to tell it which functions in the examined program can implement some architectural style feature. For example, in the Unix operating system, the system-call specification names the functions that can spawn a task (i.e., system or members of the execlp family of functions). The function process-invoked uses slice evaluation to find the value(s) of the arguments to the function calls returned by invocations-of-type. process-invoked then uses special patterns to determine the name of the executable image within the command string. In addition, process-invoked consults a map to tell it which source code file has the root for which task. The map is currently hand generated from examining system makefiles. In the file with the root, process-invoked finds the task's root function (in the C language, this is the function named main) and then traverses the program AST to collect the call tree into a cluster starting at that root function. Figure 5 shows how these various actions are put together for the sample recognition described in Section 3.1.
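Read procedurally, this flow amounts to a short pipeline. The Python sketch below is a paraphrase for exposition only, not the RRL recognizer: the toy call records stand in for an annotated AST, the hand-built MAKEFILE_MAP plays the role of the map generated from the system makefiles, and the helper simply reuses the Section 4.2 idea of taking the basename of the command string.

# Toy "program": each record stands for one call site, already annotated with the
# command-string values that a slice evaluation (Section 4.1) would have produced.
TOY_CALLS = [
    {"function": "system", "task_root": "RUN_SNOOPY", "command_values": ["%s/snoopy"]},
    {"function": "execlp", "task_root": "MAIN", "command_values": ["cd %s/bin; ./run_snoopy"]},
    {"function": "fopen", "task_root": "SNOOPY", "command_values": ["log.txt"]},
]
SYSTEM_CALL_NAMES = {"system", "execl", "execv", "execlp", "execvp", "execve", "execle"}
MAKEFILE_MAP = {"snoopy": "SNOOPY", "run_snoopy": "RUN_SNOOPY"}   # executable -> task root

def executable_name(command):
    # Section 4.2 pattern idea: basename of the last word of the last command
    last_word = command.split(";")[-1].strip().split()[-1]
    return last_word.rsplit("/", 1)[-1]

def find_executable_links(calls, call_names, makefile_map):
    results = []
    for call in calls:
        if call["function"] not in call_names:        # invocations-of-type (Section 4.4)
            continue
        for command in call["command_values"]:        # slice evaluation results (Section 4.1)
            target = executable_name(command)         # special patterns (Section 4.2)
            if target in makefile_map:
                # (connector, spawning task, spawned task)
                results.append((call["function"], call["task_root"], makefile_map[target]))
    return results

print(find_executable_links(TOY_CALLS, SYSTEM_CALL_NAMES, MAKEFILE_MAP))
# -> [('system', 'RUN_SNOOPY', 'SNOOPY'), ('execlp', 'MAIN', 'RUN_SNOOPY')]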

The database of language and operating system specific functions, the program slicing (and slice evaluation), and the special patterns described in this section are all areas where our architecture recovery tool adds value beyond that of commercially available software reverse engineering tools.

5. Recognizers in Practice

As we developed a set of recognizers, it quickly became clear to us that we needed to pay attention to organization and indexing issues. Even at a preliminary stage addressing only a few architecture styles and a single implementation language, we found we could not easily manage appropriate recognizers without some form of indexing. Since we intend that software maintenance and analysis organizations treat recognizers as software assets that can be used interactively or as part of analysis applications, we have augmented the recognizer representations with retrieval and parameterization features. These features provide support so that families of recognizers can be managed as a software library. As part of this effort, we identified reusable building blocks that enable us to quickly construct new recognizers and manage the size of the library itself. This led us to codify software knowledge in canonical forms that can be uniformly accessed by the recognizers. In addition, we discovered that architectural commitments map to actual programs at multiple granularity levels and this imposed some interesting requirements on the types of recognizers we created.

In this section, we describe several of the features of our framework that facilitate recognition authoring and recognizer use. In particular, we describe a retrieval by effect mechanism and several recognizer composition issues.

5.1. Recognizer Authoring

Recognizer authors (indeed all plan/recognition library designers) face standard software development trade-off issues that impact the size of the library, the understandability of the individual library members, and the difficulty of composing new library members from old. While our REFINE-based recognition language (RRL) does not support input variables, it does have a mechanism for parameterization. These parameters have helped us keep the recognition library size small. The parameters we currently use are focus, program, reference, and functions-of-interest. The parameters provide quite a bit of flexibility for the recognizer author who can populate the library with the most appropriate member of a family of related recognizers. As an illustration, when functions-of-interest is bound to the set of names "system", "execve", "execl", "execv", "execlp", "execvp", and "execle" and reference is bound to "system-calls", the three fragments in Figure 9 yield an equivalent enumeration (over the same sets of objects in a legacy program).

1. let (function-names = FUNCTIONS-OF-INTEREST)
   (for-every item in program such-that
       function-call(item) and
       name(item) in function-names do

2. (for-every item in invocations-of-type(reference) do

3. (for-every item in invocations-of-type('system-calls) do

Figure 9. A family of recognizer fragments

The first fragment maximizes programming flexibility, but does require analyst tailoring (i.e., building the appropriate functions-of-interest list from scratch or setting it to some pre-defined list of special calls). In addition, more of the processing is explicitly stated, perhaps making the fragment more difficult to understand (i.e., lacking abstractions). In contrast, the third special purpose recognizer does not require any external parameter settings, but would co-exist in a library with many close cousins. The second fragment is a compromise. In general, our set of parameters allows recognizer authors to modulate abstraction versus understandability issues to produce a collection that best suits the needs of their specific user community.

5.2. Operation and Control

Analysts use recognizers in two ways. First, recognizers can be stand-alone analysis methods for answering a specific question about the source code. For example, an analyst might ask for the locations where the source code invokes the sendmail service. Second, within our architecture recovery implementation, recognizers are semi-automatically bundled together to produce a composite view. For example, Section 6 below shows a system's as-built architecture with respect to the task-spawning style. This view was constructed using the set of default recognizers associated with the entities and relations of the task-spawning style. Three recognizers were employed. The find-executable-links recognizer found instances of the spawns relation (encoded in the system or execlp calls of the program), a second recognizer found instances of file/shared-memory interprocess communication (through fopen and open calls), and a third looked for separate executables (identified by "main" procedures) that may not have been found by the other recognizers. Within our recovery framework, analysts can override the defaults by making selections from the recognition library. Thus, either in stand-alone or as-built architecture recovery modes, recovery is an interactive process and we need facilities that will help analysts make informed selections from the library.
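Bundling amounts to running each style's default recognizers and merging their triples into one view. The fragment below is a sketch under that reading (Python; the recognizer callables, the style table, and the sample triples are hypothetical stand-ins, and the real bundling is performed inside our framework rather than by user code like this).

def as_built_view(style, default_recognizers, program):
    """Run every default recognizer registered for a style and merge the results
    into one view: a list of (connector, component, component) records."""
    view = []
    for recognizer in default_recognizers.get(style, []):
        view.extend(recognizer(program))   # each recognizer returns connector triples
    return view

# Hypothetical defaults echoing the three recognizers used for the task-spawning view.
defaults = {"task-spawning": [
    lambda program: [("system(cmd)", "RUN_SNOOPY", "SNOOPY")],    # spawns relation
    lambda program: [("fopen(...)", "SNOOPY", "data-file")],      # file/shared-memory IPC
    lambda program: [],                                           # separate executables
]}
print(as_built_view("task-spawning", defaults, program=None))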


5.2.1. Recognizer Retrieval

Since the library is large (60 or more entries), we have provided two indexing schemes that help the analyst find an appropriate recognizer. The first scheme simply uses the text strings in a description attribute associated with each recognizer. The analyst enters a text string and the implementation returns a list of all recognizers whose description contains text that matches the string. The analyst can review the list of returned descriptions and select the recognizer that looks most promising.

The second scheme allows an analyst to see and select from all the recognizers that would return some type of information. While analysts may not remember the name of a recognizer, they will probably know the type of information (e.g., file, function-call, procedure) that they are looking for. To support this retrieval, we have attached effect descriptions to each recognizer. Since the result of running a recognizer may be that some part of the source code is annotated with markers, we think of the "effects" of running a recognizer on the AST. For example, the task-spawning recognizer in Figure 4 finds function calls and files (associated with tasks). The format for these effect descriptions is "[<category> <type>]" where <category> is either "know" or "check" and <type> is some entry in the type hierarchy. Such tuples indicate that the recognizer will "know" about fragments of the stated type or "check" whether fragments are of the stated type.

Figure 10 is the type taxonomy our implementation uses. Uppercase entries are top entries of taxonomies based on the language model (e.g., C, FORTRAN) along with our specializations (e.g., specializations of function call) and clustering extensions. The depth of indentation indicates the depth in a subtree.

When analysts select a type from this list, the system shows them a list of all the recognizers that find items of that type. Figure 11 is an example that shows the restricted menu of recognizers that achieve [know function-call]. In the event that the analyst does not find a relevant recognizer in the list, the system helps by offering to expand the search to find recognizers that know generalizations of the current type. For example, a request [know special-call] would be extended to the request [know function-call] and then to the request [know expression], climbing into the upper domain model for the legacy system's language.

Once a recognizer is selected, the system prompts the analyst for parameters that the recognizer requires. Analysts can set the reference parameter to the result of a previous recognition thus providing a mechanism for cascading several recognizers together to retrieve a complex pattern. In addition, there is an explicit backtracking scheme encoded for the recognizers. If a recognizer requires other recognizers to have been run (i.e., to populate some information on the AST) its representation indicates that the second recognizer is a pre-condition. The analyst can review the result and select some subset of the returned results for subsequent analysis. Reasons for only selecting a subset could range from abstracting away details (for understanding or analysis) to removing irrelevant details that cannot be detected syntactically (e.g., a module is only used for testing).
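A minimal sketch of the retrieval-by-effect lookup appears below (Python; the generalization table and the library entries are illustrative stand-ins, not the contents of our library). It returns the recognizers whose effects mention the requested type and, when nothing matches, climbs the generalization links in the way just described.

# Hypothetical generalization links, in the spirit of the Figure 10 taxonomy.
GENERALIZES_TO = {"system-call": "special-call",
                  "special-call": "function-call",
                  "function-call": "expression"}

# Hypothetical library index: recognizer name -> list of (category, type) effects.
LIBRARY = {
    "find-executable-links": [("know", "function-call"), ("know", "file")],
    "find-service-links":    [("know", "function-call"), ("know", "service")],
    "find-input-files":      [("know", "input-file")],
}

def retrieve_by_effect(wanted_type, category="know"):
    """Return (type actually used, matching recognizers); widen the request along
    generalization links when no recognizer achieves [category wanted_type]."""
    current = wanted_type
    while current is not None:
        hits = [name for name, effects in LIBRARY.items()
                if (category, current) in effects]
        if hits:
            return current, hits
        current = GENERALIZES_TO.get(current)   # climb toward the upper domain model
    return None, []

print(retrieve_by_effect("system-call"))
# -> ('function-call', ['find-executable-links', 'find-service-links'])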


CLUSTER
    Network-exchange
        RPC-exchange
        Port-exchange
        Pipe-exchange
            Unix-pipe
    Code-fragment
        Connector-fragment
    Module
    Service
    Non-source-file
        Shell-script
        Input-file
        Output-file
    Source-file
    Executable-object
FUNCTION-CALL
    Special-call
        Network-call
        System-call
        I/O-call
        Non-POSIX-compliant-call
FUNCTION-DEF
STRUCT-TYPE

Figure 10. A taxonomy of recognition types


[Screen image omitted: under the heading FUNCTION-CALL-ARTIFACT, the menu lists the following recognizers:
NETWORK-CALL : implementations of client process
NETWORK-CALL : implementations of server process
SERVICE : LINKS between the program and any network services or remote procedures
NETWORK-CALL : LINKS between procedures and some service
NETWORK-CALL : LINKS between procedures and network services
PROCESS-INVOCATION : LINKS between procedures and shell commands
SPECIAL-CALL : Connection family used in a network exchange
SPECIAL-CALL : Connection type used in a network exchange
PROCESS-INVOCATION : Spawning LINKS between executable modules
PROCESS-INVOCATION : Invocations that activate executables
FUNCTION-DEF : LINKS between local and remote procedures
SPECIAL-CALL : Function calls identified directly or by dereferenced function name
SPECIAL-CALL : Invocations of members of a family of functions
(Abort)]

Figure 11. Recognizers with effect [know function-call]

5.2.2. Recognizer Results

From among several possible representations our recognition results are either sets of objects from the AST or sets of tuples of objects from the AST. This choice has been motivated by the multiple purposes we envision for recognizer use. As we have mentioned, recognition results may stand by themselves in answering a question, they may be joined with other results to form a composite picture (i.e., this is how style recognition is accomplished), or they may be used as inputs to other recognizers in a more detailed analysis of the code. Standard output results are needed to support interoperability among recognizers and to provide a uniform API to applications. This notion needs to be balanced with the need to allow analysts to flexibly compose solutions to a wide variety of questions involving multiple aggregation modes. For example, many architectural features (e.g., tasks, functional units) require an analysis of a calling hierarchy. Given a set of procedures - perhaps a functionally cohesive unit - several aggregations are possible. We might be interested in identifying a set of common callers of these procedures, the entire calling hierarchy, a calling hierarchy that is mutually exclusive with some other set of procedures (i.e., a distinct functional unit), or a set of root nodes (i.e., candidates for task entry points). All of these are meaningful for identifying architectural components. Thus our library contains recognizers that return various aggregations within the calling hierarchy. The danger is that if we have too many different output forms, we will drastically limit our ability to compose recognition results.

Our solution deals with this problem in two ways. First, we output results in a manner that reduces the need for repeating computationally expensive analyses in subsequent recognizers of a cascaded chain. Second, we standardize output levels so that results can be compared and bundled together easily.

Avoiding Redundant Computations: One approach to recognition would be to assume that each recognizer always returns a single object and that adjoining architectural structures can be found piecemeal by following the AST (or using some of the analysis tools described above). We have found this approach to be unsatisfactory because many of the recognizers collect objects in the context of some useful larger structure. Rather, it is useful to return a structure (i.e. the ordered triples described above) that contains contextual information. For example, a slice evaluation coupled with the use of program patterns (e.g., the slice associated with the code in Figure 6) can be a relatively expensive computation. Once the recognizer completes this examination it caches the result as the third element of a triple (as in Table 1) to avoid re-computations. This format has enabled us to support extensive architecture recovery without excessive duplication of computations.

Standard Contexts: Each recognizer has only a local view; it cannot know how some other recognizer will use its results. The critical concern is to identify some standard contexts so that other parts of an analysis process can rely on a uniform type of response. If we do not have some standardization, the enclosing structure part of a recognition could be a procedure, a file, a directory, a task, or something else. This would require each recognizer to carry out a normalization step prior to using the results of another recognizer.

For the current framework, we selected the procedure level as a standard context. That is to say, unless there is reason to report some other structures, triples will be of the form <object, procedure, procedure>. Our justification for this is that, if necessary, coarser grained structure (e.g., file, directory) can be easily re-derived from the AST, while procedures offer an architecture level result that embodies the results of expensive lower-level analyses such as slice evaluation.

5.3. Recognizer Representation

We can summarize the above issues by displaying the internal representation we use for each recognizer. In our implementation, each recognizer is an object with a set of attributes that the implementation uses for composition and retrieval. The attributes are as follows:

• Name: a unique identifier

• Description: a textual description of what the recognizer finds (used in indexing)

• Effects: effects indicate the types of source code fragments that are found (also used in indexing)

• Pre-condition: other recognizers that must be run before this recognizer is run

• Environment: the set of parameters that analysts must set before invoking the recognizer

• Recognition method: the action part of the recognizer; written in RRL (as illustrated in Section 3.1 above)
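For concreteness, that attribute list maps naturally onto a small record type. The sketch below is a Python rendering of our own devising (it stands in for the object representation in our implementation; the trivial example recognizer at the end is hypothetical) and shows the six attributes plus the run order implied by the pre-condition field.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Recognizer:
    name: str                                    # unique identifier
    description: str                             # text used by the first indexing scheme
    effects: List[Tuple[str, str]]               # e.g. [("know", "function-call")]
    preconditions: List["Recognizer"] = field(default_factory=list)
    environment: List[str] = field(default_factory=list)    # parameters the analyst sets
    method: Callable[[Dict], object] = lambda env: None      # stands in for the RRL body

    def run(self, env: Dict) -> object:
        # run pre-condition recognizers first, then interpret this recognizer's method
        for pre in self.preconditions:
            pre.run(env)
        return self.method(env)

# Hypothetical usage: a trivial recognizer with one parameter and no pre-conditions.
r = Recognizer("find-input-files", "files opened for reading",
               [("know", "input-file")], environment=["program"],
               method=lambda env: f"scanning {env['program']}")
print(r.run({"program": "xsn"}))   # scanning xsn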


In summary, recognizer authors build the RRL descriptions using the RRL language constructs and special analysis functions. They set pre-conditions and environment attributes to link the recognizer into the library. At this time they may add the new recognizer's name to default recognizer lists for the style-level entities/relations.

Subsequently, during an investigation, an analyst retrieves the recognizer either by selecting an entity/relation with a default, by recognizer name, by indicating a text fragment of the description, or by indicating the effect desired. The implementation recursively runs recognizers in the pre-condition attribute, asks the analyst to set any of the required parameters, and interprets the RRL code in the recognizer's method.

If the analyst employed the recognizer in architecture recovery, the results are added to the as-built architecture with respect to some style. We provide additional support via specialization hierarchies among the architectural entities and relations. Upon finding that few examples of an architectural feature are recognized, the analyst has the option of expanding a search by following generalization and specialization links and searching for architecturally related information. This capability complements the recognizer indexing scheme based on code level relationships.

6. Experience

During the past year, we employed our architecture recovery tools on six moderate sized programs. Our most successful example was XSN, a MITRE-developed network-based program for Unix system management. The program contains approximately 30,000 source lines of code.

This program contains several common C/Unix building blocks and has the potential for matching aspects of multiple styles. It is built on top of the X window system and hence contains multiple invocations of the X application program interface. It consists of executable files for multiple tasks developed individually by different groups over time. These executables are linked in an executive task that uses operating system calls to spawn specific tasks in accordance with switches set when the user initiates the program. Each task is a test routine consisting of a stimulus, a listener, and analysis procedures. Calls using socket constructs provide communications between host platforms on the network to implement a client/server architecture.

Periodically, we were able to present our analysis to the original code developers and receive their feedback and suggestions on identifying additional architectural features in the code.

Our first recovery effort involved looking for the task spawning structure. XSN contains several tasks and specific operating system calls that are used to connect these modules. Figure 12 is a screen image of the graphical view of task spawning recovered from XSN. This view also contains elements of what we call the file-based repository style - the connections between the tasks and the data files that they access or modify. The rectangular boxes represent a static view of a task: the source code that may be run when entering that task. The diamonds represent data files. The data files' names (and indication of their existence) are recovered from the source code. The oval is an unknown module. The arrows indicate connections of either one task spawning another task or the data flow between data files and tasks. Figure 12 is actually a view of a thinned-out XSN: several tasks and data files have been removed to reduce diagram clutter. In the view's legend, "query" is another term for "recognizer".

We next looked for layering structure. We attempted an approach that bundled up cycles within the procedure calling hierarchy but otherwise used the procedure calling hierarchy in its entirety. This approach led to little reduction over the basic structure chart report. We felt that additional clustering was possible using either deeper dominance analysis or domain knowledge, but we did not pursue these approaches. We did build some preliminary capabilities based on advertised APIs for commercial subsystems or program layers. These capabilities found portions of the code identified as users of some API. One predominant example, particularly informative for XSN, was the code that accesses the underlying X window system. We have not yet been able to implement a method that would combine such bottom-up recognition with more globally-based layering recovery methods.

XSN acts as a client (sometimes a server) in its interactions with network services such as sendmail or ftp. A service-invocation recognizer shown in Figure 13 recovered elements of this style successfully. Over time we made several enhancements to the recognizer to improve its explanation power. First, we refined its ability to identify the source of an interaction. The notion we settled on was to identify the procedures that set port numbers (i.e., indicate the service to be contacted) rather than the procedure containing the service invocation call. Second, we enhanced the recognizer so that it would recognize a certain pattern of complex, but stereotypical client/server interaction. In this pattern, we see the client setting up a second communication channel in which it now acts as the server. It was necessary to recognize this pattern in order to identify the correct external program associated with the second channel.

At this point, we inspected the code to see if there were any obvious gaps in system coverage by the as-built architecture we had found. We discovered that there were several large blocks of code that did not participate in any of the styles. By examining the code, it was clear that the developers had implemented several abstract data types - tables, lists. Thus, we set about building and applying OBAD to the XSN system. These table and list abstractions were recognized interactively by our OBAD sub-tool (see Section 4.3).

We developed over sixty recognizers for this analysis. Thirteen were used for client/server recovery, seven for task spawning, nine were used for some form of layering, four for repository, seven for code level features, two for ADT recovery, and one for implicit invocation. Thirteen of the recognizers were utilities producing intermediate results that could be used for recovering features of multiple styles. The library also contains seven recognizers that make some simplifying assumptions in order to approximate the results of more computationally intensive recognizers. These recognizers proved to be particularly useful in situations where it was not possible to obtain a complete program slice (Section 4.1).

Since the above profile of recognizers is based on recognition adequacy with respect to only a few systems, the numbers should be taken in context. What is important is that they indicate the need for serious recognition library management of the form we have described in this paper.

Figure 12. Task spawning view of (a thinned-out version of) XSN

let (result = [] )
    for-every call in invocations-of-type('service-invocations)
        for-each port in where-ports-set(call)
            let (target = service-at(second(port)))
                let (proc = enclosing-procedure(first(port)))
                    (result <- prepend(result, [call, target, proc])),
result

Figure 13. A service recognizer uses the invocations-of-type construct.

We feel that we have gone a long way toward recognizing standard C/Unix idioms for encoding architectural features. We are still at a stage where each new system we analyze requires some modifications to our implementation, but the number of required modifications is decreasing. In one case, we encoded a new architecture style called "context" (showing the relationship between system processes and the connections to external files and devices) as a means to best describe a new system's software architecture. We were able to recognize all features of this style by just authoring one new recognizer and reusing several others. More frequently, we have found that the set of recognizers is adequate but we need to refine existing recognizers to account for subtleties that we had not seen before.

Table 2 summarizes the amount of code in XSN covered when viewed with respect to the various styles. The first row gives the percentage of the lines of code used in the connectors for that style. The second row gives the percentage of the procedures covered by that style. A procedure is covered if it is included in some component in that style.

Table 2. Code coverage measures for XSN

Style               ADT     API     C/S     Repository    Task Spawning
% connector LOC     0       0       0.3     2.2           0.7
% of procedures     39.3    13.9    3.3     13.1          2.5

Combining all the styles whose statistics are given results in a total connector coverage of about 3% of the lines of code and over 47% of the procedures. Procedure coverage total is less than the sum of its constituents in the above table because the same procedure may be covered by multiple styles.
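The arithmetic behind these coverage figures is simple to sketch. The Python fragment below (with made-up inputs rather than the XSN data) computes per-style and combined coverage, taking a set union over procedures so that a procedure shared by several styles is counted once, which is exactly why the combined procedure total can be smaller than the column sum.

def coverage(per_style, total_loc, total_procs):
    """per_style: style -> (connector LOC, set of covered procedure names)."""
    report = {}
    all_connector_loc = 0
    all_procs = set()
    for style, (connector_loc, procs) in per_style.items():
        report[style] = (100.0 * connector_loc / total_loc,
                         100.0 * len(procs) / total_procs)
        all_connector_loc += connector_loc
        all_procs |= procs                 # union: shared procedures counted once
    combined = (100.0 * all_connector_loc / total_loc,
                100.0 * len(all_procs) / total_procs)
    return report, combined

# Made-up illustration: 1000 LOC, 100 procedures, two styles sharing one procedure.
styles = {"ADT":        (0,  {"tbl_get", "tbl_put", "lst_add"}),
          "Repository": (22, {"tbl_put", "log_write"})}
per_style, combined = coverage(styles, 1000, 100)
print(per_style)    # per-style (connector LOC %, procedure %) pairs
print(combined)     # (2.2, 4.0): four distinct procedures, not five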

We offer these statistics as elementary examples of architectural recovery metrics. This endeavor is important both to determine the effectiveness of the style representations (e.g., what is the value-added of authoring a new style) and to provide an indicator for analysts of how well they have done in understanding the system under analysis.

The measures we provide are potentially subject to some misinterpretation. It is difficult to determine how strongly a system exhibits a style and how predictive that style is of the entire body of code. As an extreme example, one could fit an entire system into one layer. This style mapping is perfectly legal and covers the whole system, but provides no abstraction of the system and no detailed explanation of the components.


In spite of these limits, there are experimental and programmatic advantages for defining code coverage metrics. The maintenance community can benefit from discussion on establishing reasonable measures of progress toward understanding large systems.

7. Related Work

We can contrast our work with related work in recovery of high-level software design, top-down and bottom-up approaches to software understanding, and interactive reverse engineering.

7.1. Recovery of high-level design

Program structure has been analyzed independently of any pre-conceived architectural styles to reveal program organization as discussed in (Biggerstaff, 1989), (Biggerstaff, Mitbander, Webster, 1994), (Schwanke, 1991), (Richardson and Wilde, 1993). General inquiry into the structure of software can be supported by software information systems such as LaSSIE (Devanbu, Ballard, Brachman, Selfridge, 1991). LaSSIE represents programs from a relational view that misses some of the architectural grist we find deeply embedded in abstract syntax trees. However, their automatic classification capabilities are more powerful than our inferencing capabilities. In contrast to our work, DESIRE (Biggerstaff, Mitbander, Webster, 1994) relies on externally supplied cues regarding program structure, modularization heuristics, manual assistance, and informal information. Informal information and heuristics can also be used to reorganize and automatically refine recovered software designs/modularizations. Schwanke (Schwanke, 1991) describes a clustering approach based on similarity measurements. This notion matches well to some of the informal clustering that we are doing although their work is not used to find components of any particular architectural style.

Canfora et al. (Canfora, De Lucia, DiLucca, Fasolino, 1994) recover architectural modules by aggregating program units into modules via a concept of node dominance on a directed graph. This work addresses a functional architectural style that we have not considered, but again there are similarities to the clustering we perform within our OBAD subsystem.

7.2. Top-down approaches

Our recognizers are intended for use in explorations of architectural hypotheses - a form of top-down hypothesis-driven recognition coupled with bottom-up recognition rules. Quilici (Quilici, 1993) also explores a mixed top-down, bottom-up recognition approach using traditional plan definitions along with specialization links and plan entailments.

It is useful to compare our work to activities in the tutoring/bug detection community. Our context-independent approach is similar to MENO (Soloway, 1983). In the tutoring domain, context-independent approaches suffer because they cannot deal with the higher-level plans in the program. PROUST (Johnson and Soloway, 1985) remedies some of this via a combination of bottom-up recognition and top-down analysis, i.e., looking up typical patterns which implement a programmer's intentions. In contrast, in our setting, commitments to use a particular architectural style are made at the top level; thus the mapping between intentions and code is more direct.

7.3. Bottom-up approaches

The reverse engineering and program understanding community has generally approached software understanding problems with a bottom-up approach in which a program is matched against a set of pre-defined plans/cliches from a library. This work is not motivated by the architectural organizational principles essential for the construction of large programs. Current work on program concept recognition is exemplified by (Kozaczynski, Ning, Sarver, 1992), (Engberts, Kozaczynski, Ning, 1991), and (Dekker and Ververs, 1994), which continues the cliche-based tradition of (Rich and Wills, 1990). This work is based on a precise data and control flow match, which indicates that the recognized source component is precisely the same as the library template. Our partial recognition approach does not require algorithmic equivalence between a plan and the source being matched; rather, matches are based on events (Harandi and Ning, 1990) in the source code. That is to say, the existence of patterns of these events is sufficient to establish a match. Our style of source code event-based recognition rules is also exemplified in (Kozaczynski, Ning, Sarver, 1992) and (Engberts, Kozaczynski, Ning, 1991), which demonstrate a combination of precise control and data flow relation recognition and more abstract code event recognition.

7.4. Interactive Reverse Engineering

Wills (Wills, 1993) points out the need for flexible, adaptable control structures in reverse engineering. Her work attacks the important problem of building interactive support that cuts across multiple types of software analysis. In contrast, our work emphasizes authoring and application of multiple analysis approaches applicable for uncovering architectural features in the face of specific source code nuances and configurations.

Paul and Prakash (Paul and Prakash: patterns, 1994) (Paul and Prakash: queries, 1994) investigate source code search using program patterns. This work uses a query language for specifying high level patterns on the source code. Some of these patterns correspond to specific recognition rules in our approach. Our approach focuses more on analyst use of a pre-defined set of parameterizable recognizers each written in a procedural language. That is, we restrict analyst access to a set of predefined recognizers, but allow recognizer authors the greater flexibility of a procedural language.
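As a rough illustration of what an analyst-parameterizable recognizer might look like, the following sketch is offered; it is a hypothetical Python rendering, not the authors' Refine-based implementation, and the toy AST types and helper names are invented. It shows one recognizer in the spirit of Find-Interesting-Invocations, with the functions-of-interest parameter left for the analyst to set.

    # Hypothetical sketch of a parameterizable recognizer; the AST representation
    # is invented for illustration only.
    from dataclasses import dataclass, field

    @dataclass
    class Call:                      # a call node in a toy abstract syntax tree
        callee: str
        args: list = field(default_factory=list)

    @dataclass
    class Proc:                      # a procedure with a flat body of nodes
        name: str
        body: list = field(default_factory=list)

    def find_interesting_invocations(procs, functions_of_interest):
        """Return (procedure, callee) pairs for calls to any function of interest."""
        hits = []
        for proc in procs:
            for node in proc.body:
                if isinstance(node, Call) and node.callee in functions_of_interest:
                    hits.append((proc.name, node.callee))
        return hits

    if __name__ == "__main__":
        program = [Proc("main", [Call("fork"), Call("printf")]),
                   Proc("worker", [Call("execv")])]
        print(find_interesting_invocations(program, {"fork", "execv"}))

In this style, the analyst supplies only the parameter (here, the set of functions-of-interest), while the recognizer author retains the flexibility of a procedural traversal.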


8. Evaluation and Conclusions

We have implemented an architecture recovery framework that merges reverse engineering and architectural style representation. This is an important first step toward the long range goal of providing custom, dynamic documentation over a variety of software analysis tasks. The framework provides for analyst control over parameterization and retrieval of recognition library elements. We have described methods for recognizer execution and offered some recognizer authoring guidance for identifying recognizers that will interact well with other recognizers in the library.

The recognizers make use of commercially available reverse engineering technology, but there are several important analysis capabilities that we have added. In addition, one of our major contributions has been to determine the architectural patterns to recognize and to express these patterns with respect to the underlying analysis capabilities.

Our current recognition capabilities have been motivated by thinking about a C/Unix environment, which does have its unique programming idioms. While we phrase our recognizers at a general language/operating system independent level (e.g., task spawning or service invocation), there are some biases within the recognition library itself and we would like to extend our approaches to cover idioms of other high level languages and operating systems. Primarily, there is a dependence of a set of functions on specifics of the legacy language or operating system. In addition, many of the features that are recognized through low level patterns in C/Unix implementations (e.g., a call to the system function spawns a task, a struct-type) will appear explicitly in other languages/operating systems as special constructs (e.g., tasks, class definitions).

There are four broad areas in which we intend to extend our work:

• Additional automation

We would like to expand our ability to index into the growing library of recognizers and would like to develop additional capabilities for bridging the gap from source code to style descriptions. The ultimate job of recognizers is to map the entities/relations (i.e., objects in the domain of system design such as pipes or layers) to recognizable syntactic features of programs (i.e., objects in the implementation domain). Clearly, we are working with a moving target. New programming languages, COTS products, and standard patterns present the reverse engineering community with the challenge of recovering the abstractions from source code. We are hopeful that many of the mechanisms we have put in place will enable us to rapidly turn out new recognizers that can deal with new abstractions.

An enhancement that we intend to consider is the automatic generation of effect descriptions from information encoded in explicit output lists of recognizers. This scheme is similar to the transformation indexing scheme of Aries (Johnson, Feather, Harris, 1992).

• Combining Styles

We intend to investigate combining architectural styles. The as-built architectural views each provide only a partial view of the structure of a program, and such partial views can overlap in fundamental ways (e.g., a repository view emphasizing data elements has much in common with an interprocess-communication view emphasizing data transmissions through shared memory or data files on disk). In addition, style combinations can be used to uncover hybrid implementations in which individual components with respect to one style are implemented in terms of a second style.

• COTS modeling

Systems that we wish to analyze do not always come with the entire body of source code, e.g., they may make use of COTS (commercial off-the-shelf) packages that are simply accessed through an API. For example, from the analysis point of view, the Unix operating system is a COTS package. We have developed representations for COTS components that allow us to capture the interface and basic control and dataflow dependencies of the components. This modeling needs to be extended to represent architectural invariants required by the package.

• Requirements modeling

The distinction between functional and non-functional requirements suggests two broad thrusts for reverse engineering to the requirements level. For functional requirements we want to answer the important software maintenance question: "Where is X implemented?". For example, a user may want to ask where message decoding is implemented. Message and decoding are concepts at the user requirements level. Answering such questions will require building functional models of systems. These models will contain parts and constraints that we can use to map function to structure. For non-functional requirements, we need to first recognize structural components that implement the non-functional requirements. For example, fault tolerance requirements will to some degree be observable as exception handling in the code. We believe our framework is well suited for extensions in this direction. As a second step, we need to identify measures of compliance (e.g., high "coverage" by abstract data types means high data modifiability). Preliminary work in this area appears in (Chung, Nixon, Yu, 1995) and (Kazman, Bass, Abowd, Clements, 1995).

While we are continuing to refine our representations to provide more automated assistance both for recognizer authors and for analysts such as software maintainers, the current implementation is in usable form and provides many insights for long range development of architectural recognition libraries.

Acknowledgments

We would like to thank The MITRE Corporation for sponsoring this work under its internal research program. We also thank MITRE colleagues Melissa Chase, Susan Roberts, and Richard Piazza. Their work on related MITRE efforts and enabling technology has been of great benefit for the research we have reported on above. Finally, we acknowledge the many insightful suggestions of the anonymous reviewers.


Appendix

The Recognizer Library

Our recognition library contains approximately sixty recognizers directed toward discovery of the architectural components and relations of nine styles.

The following partial list of recognizers shows the variety of the elements of our recognition library and is organized by analysis method. Italicized words in the descriptions (e.g., focus, reference) highlight a parameter that analysts must set before running the recognizer.

1. Program structure - found directly on abstract syntax trees (ASTs)

• Find-Structure-With-Attribute: structures that have reference as an attribute name

• Find-Loops: find all loops

• Hill-Climbing: instances of hill-climbing algorithms

2. Typed function calls - use special call specifications

• Find-Interesting-Invocations: invocations of functions-of-interest

• Find-Invocations-Of-Executables: invocations that activate other executables

• Find-UI-Demons: registrations of user-interface demons

3. Forward references - procedures that use a variable set by a special call

• Envelope-Of-A-Conduit: procedures that use the communication endpoint created by focus

4. Clusters of objects

• Decomposables: decomposable objects of an architecture

• Top-Clusters: top clusters of the current architecture

5. Clusters derived from dependency analysis

• Find-Upper-Layers: clusters that are layered above the focus cluster

• Find-Global-Var-Based-Clusters: find clusters based on common global variable reference

6. Structures referenced in special invocations

• Find-Executable-Links: links between spawned tasks and the tasks that spawned them

• Task-Invocation: task invoked (spawned) by a special function call

• File-Access-Links: links between procedures and the files that they access

• File-IPC: files touched by more than one process

• Service-Thru-Port: all relations to a reference port


• Find-Port-Connections: links (relations) between program layers and local or network services

7. Clusters derived from calling hierarchy

• Find-Upper-Functional-Entry-Points: high level functional entry points

• Find-Mid-Level-Functional-Entry-Points: mid-level functional entry points

• Find-Common-Callers: common callers of a set of functions

• Who-Calls-It: procedures that call focus

8. Procedures within some context - using containment within clusters

• Find-Functions-Of-Cluster: Functions of a cluster

• Find-Exported-Functions: Exported functions of focus cluster

• Find-Localized-Function-Calls: procedure invocations within the focus procedure

• Has-Non-Local-Referents: non-local procedures that call definitions located in focus

References

H. Abelson and G. Sussman. Structure and Interpretation of Computer Programs. The MIT Press, 1984.

G. Abowd, R. Allen, and D. Garlan. Using style to understand descriptions of software architecture. ACM Software Engineering Notes, 18(5), 1993. Also in Proc. of the 1st ACM SIGSOFT Symposium on the Foundations of Software Engineering, 1993.

T. Biggerstaff. Design recovery for maintenance and reuse. IEEE Computer, July 1989.

T. Biggerstaff, B. Mitbander, and D. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5), May 1994.

G. Canfora, A. De Lucia, G. DiLucca, and A. Fasolino. Recovering the architectural design for software comprehension. In IEEE 3rd Workshop on Program Comprehension, pages 30-38. IEEE Computer Society Press, November 1994.

L. Chung, B. Nixon, and E. Yu. Using non-functional requirements to systematically select among alternatives in architectural design. In First International Workshop on Architectures for Software Systems, April 1995.

R. Dekker and F. Ververs. Abstract data structure recognition. In The Ninth Knowledge-Based Software Engineering Conference, 1994.

P. Devanbu, B. Ballard, R. Brachman, and P. Selfridge. Automating Software Design, chapter LaSSIE: A Knowledge-Based Software Information System. AAAI/MIT Press, 1991.

A. Engberts, W. Kozaczynski, and J. Ning. Concept recognition-based program transformation. In 1991 IEEE Conference on Software Maintenance, 1991.

K. Gallagher and J. Lyle. Using program slicing in software maintenance. IEEE Transactions on Software Engineering, 17(8), 1991.

D. Garlan and M. Shaw. An introduction to software architecture. Tutorial at 15th International Conference on Software Engineering, 1993.

M. Harandi and J. Ning. Knowledge-based program analysis. IEEE Software, 7(1), 1990.

D. Harris, H. Reubenstein, and A. Yeh. Recognizers for extracting architectural features from source code. In Second Working Conference on Reverse Engineering, July 1995.

D. Harris, H. Reubenstein, and A. Yeh. Recovering abstract data types and object instances from a conventional procedural language. In Second Working Conference on Reverse Engineering, July 1995.


D. Harris, H. Reubenstein, and A. Yeh. Reverse engineering to the architectural level. In ICSE-17 Proceedings, April 1995.

C. Hofmeister, R. Nord, and D. Soni. Architectural descriptions of software systems. In First International Workshop on Architectures for Software Systems, April 1995.

L. Holtzblatt, R. Piazza, H. Reubenstein, and S. Roberts. Using design knowledge to extract real-time task models. In Proceedings of the 4th Systems Reengineering Technology Workshop, 1994.

W. L. Johnson and E. Soloway. PROUST: Knowledge-based program understanding. IEEE Transactions on Software Engineering, 11(3), March 1985.

W. L. Johnson, M. Feather, and D. Harris. Representation and presentation of requirements knowledge. IEEE Transactions on Software Engineering, 18(10), October 1992.

R. Kazman, L. Bass, G. Abowd, and P. Clements. An architectural analysis case study: Internet information systems. In First International Workshop on Architectures for Software Systems, April 1995.

W. Kozaczynski, J. Ning, and T. Sarver. Program concept recognition. In 7th Annual Knowledge-Based Software Engineering Conference, 1992.

E. Mettala and M. Graham. The domain specific software architecture program. Technical Report CMU/SEI-92-SR-9, SEI, 1992.

M. Olsem and C. Sittenauer. Reengineering technology report. Technical report, Software Technology Support Center, 1993.

S. Paul and A. Prakash. A framework for source code search using program patterns. IEEE Transactions on Software Engineering, 20(6), June 1994.

S. Paul and A. Prakash. Supporting queries on source code: A formal framework. International Journal of Software Engineering and Knowledge Engineering, September 1994.

D. Perry and A. Wolf. Foundations for the study of software architecture. ACM Software Engineering Notes, 17(4), 1992.

A. Quilici. A hybrid approach to recognizing program plans. In Proceedings of the Working Conference on Reverse Engineering, 1993.

Reasoning Systems, Inc., Palo Alto, CA. REFINE User's Guide, 1990. For REFINE Version 3.0.

Reasoning Systems. Refine/C User's Guide, March 1992.

C. Rich and L. Wills. Recognizing a program's design: A graph parsing approach. IEEE Software, 7(1), 1990.

R. Richardson and N. Wilde. Applying extensible dependency analysis: A case study of a heterogeneous system. Technical Report SERC-TR-62-F, SERC, 1993.

R. Schwanke. An intelligent tool for re-engineering software modularity. In 13th International Conference on Software Engineering, 1991.

M. Shaw. Larger scale systems require higher-level abstractions. In Proceedings of the 5th International Workshop on Software Specification and Design, 1989.

M. Shaw. Heterogeneous design idioms for software architecture. In Proceedings of the 6th International Workshop on Software Specification and Design, 1991.

E. Soloway. MENO-II: An intelligent program tutor. Computer-Based Instruction, 10, 1983.

W. Tracz. Domain-specific software architecture (DSSA) frequently asked questions (FAQ). ACM Software Engineering Notes, 19(2), 1994.

M. Weiser. Program slicing. IEEE Transactions on Software Engineering, 10(4), July 1984.

L. Wills. Flexible control for program recognition. In Working Conference on Reverse Engineering, May 1993.

Automated Software Engineering, 3, 139-164 (1996) © 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Strongest Postcondition Semantics as the Formal Basis for Reverse Engineering*

GERALD C. GANNOD** AND BETTY H.C. CHENG† {gannod,chengb}@cps.msu.edu

Department of Computer Science, Michigan State University, East Lansing, Michigan 48824-1027

Abstract. Reverse engineering of program code is the process of constructing a higher level abstraction of an implementation in order to facilitate the understanding of a system that may be in a "legacy" or "geriatric" state. Changing architectures and improvements in programming methods, including formal methods in software development and object-oriented programming, have prompted a need to reverse engineer and re-engineer program code. This paper describes the application of the strongest postcondition predicate transformer (sp) as the formal basis for the reverse engineering of imperative program code.

Keywords: formal methods, formal specification, reverse engineering, software maintenance

1. Introduction

The demand for software correctness becomes more evident when accidents, sometimes fatal, are due to software errors. For example, recently it was reported that the software of a medical diagnostic system was the major source of a number of potentially fatal doses of radiation (Leveson and Turner, 1993). Other problems caused by software failure have been well documented, and with changes in laws concerning liability (Flor, 1991), the need to reduce the number of problems due to software increases.

Software maintenance has long been a problem faced by software professionals, where the average age of software is between 10 and 15 years old (Osborne and Chikofsky, 1990). With the development of new architectures and improvements in programming methods and languages, including formal methods in software development and object-oriented programming, there is a strong motivation to reverse engineer and re-engineer existing program code in order to preserve functionality, while exploiting the latest technology.

Formal methods in software development provide many benefits in the forward engineering aspect of software development (Wing, 1990). One of the advantages of using formal methods in software development is that the formal notations are precise, verifiable, and facilitate automated processing (Cheng, 1994). Reverse Engineering is the process of constructing high level representations from lower level instantiations of an existing system. One method for introducing formal methods, and therefore taking advantage of the benefits of formal methods, is through the reverse engineering of existing program code into formal specifications (Gannod and Cheng, 1994, Lano and Breuer, 1989, Ward et al., 1989).

This work is supported in part by the National Science Foundation grants CCR-9407318, CCR-9209873, and CDA-9312389. ** This author is supported in part by a NASA Graduate Student Researchers Program Fellowship. † Please address all correspondences to this author.

This paper describes an approach to reverse engineering based on the formal semantics of the strongest postcondition predicate transformer sp (Dijkstra and Scholten, 1990), and the partial correctness model of program semantics introduced by Hoare (Hoare, 1969). Previously, we investigated the use of the weakest precondition predicate transformer wp as the underlying formal model for constructing formal specifications from program code (Cheng and Gannod, 1991, Gannod and Cheng, 1994). The difference between the two approaches is in the ability to directly apply a predicate transformer to a program (i.e., sp) versus using a predicate transformer as a guideline for constructing formal specifications (i.e., wp).

The remainder of this paper is organized as follows. Section 2 provides background material for software maintenance and formal methods. The formal approach to reverse engineering based on sp is described in Sections 3 and 4, where Section 3 discusses the sp semantics for assignment, alternation, and sequence, and Section 4 gives the sp semantics for iterative and procedural constructs. An example applying the reverse engineering technique is given in Section 5. Related work is discussed in Section 6. Finally, Section 7 draws conclusions and suggests future investigations.

2. Background

This section provides background information for software maintenance and formal methods for software development. Included in this discussion is the formal model of program semantics used throughout the paper.

2.1. Software Maintenance

One of the most difficult aspects of re-engineering is the recognition of the functionality of existing programs. This step in re-engineering is known as reverse engineering. Identifying design decisions, intended use, and domain specific details are often significant obstacles to successfully re-engineering a system.

Several terms are frequently used in the discussion of re-engineering (Chikofsky and Cross, 1990). Forward Engineering is the process of developing a system by moving from high level abstract specifications to detailed, implementation-specific manifestations (Chikofsky and Cross, 1990). The explicit use of the word "forward" is used to contrast the process with Reverse Engineering, the process of analyzing a system in order to identify system components, component relationships, and intended behavior (Chikofsky and Cross, 1990). Restructuring is the process of creating a logically equivalent system at the same level of abstraction (Chikofsky and Cross, 1990). This process does not require semantic understanding of the system and is best characterized by the task of transforming unstructured code into structured code. Re-Engineering is the examination and alteration of a system to reconstitute it in a new form, which potentially involves changes at the requirements, design, and implementation levels (Chikofsky and Cross, 1990).


Byrne described the re-engineering process using a graphical model similar to the one shown in Figure 1 (Byrne, 1992, Byrne and Gustafson, 1992). The process model appears in the form of two sectioned triangles, where each section in the triangles represents a different level of abstraction. The higher levels in the model are concepts and requirements. The lower levels include designs and implementations. The relative size of each of the sections is intended to represent the amount of information known about a system at a given level of abstraction. Entry into this re-engineering process model begins with system A, where Abstraction (or reverse engineering) is performed to an appropriate level of detail. The next step is Alteration, where the system is constituted into a new form at a different level of abstraction. Finally, Refinement of the new form into an implementation can be performed to create system B.

[Figure 1 shows two sectioned triangles, System A and System B: Abstraction ("reverse engineering") moves up the levels of System A, Alteration maps across to System B, and Refinement ("forward engineering") moves down to an implementation of System B.]

Figure 1. Reverse Engineering Process Model

This paper describes an approach to reverse engineering that is applicable to the implementation and design levels. In Figure 1, the context for this paper is represented by the dashed arrow. That is, we address the construction of formal low-level or "as-built" design specifications. The motivation for operating in such an implementation-bound level of abstraction is that it provides a means of traceability between the program source code and the formal specifications constructed using the techniques described in this paper. This traceability is necessary in order to facilitate technology transfer of formal methods. That is, currently existing development teams must be able to understand the relationship between the source code and the specifications.

2.2. Formal Methods

Although the waterfall development life-cycle provides a structured process for developing software, the design methodologies that support the life-cycle (i.e., Structured Analysis and Design (Yourdon and Constantine, 1978)) make use of informal techniques, thus increasing the potential for introducing ambiguity, inconsistency, and incompleteness in designs and implementations. In contrast, formal methods used in software development are rigorous techniques for specifying, developing, and verifying computer software (Wing, 1990). A formal method consists of a well-defined specification language with a set of well-defined inference rules that can be used to reason about a specification (Wing, 1990). A benefit of formal methods is that their notations are well-defined and thus, are amenable to automated processing (Cheng, 1994).

2.2.1. Program Semantics

The notation Q {S} R (Hoare, 1969) is used to represent a partial correctness model of execution, where, given that a logical condition Q holds, if the execution of program S terminates, then logical condition R will hold. A rearrangement of the braces to produce {Q} S {R}, in contrast, represents a total correctness model of execution. That is, if condition Q holds, then S is guaranteed to terminate with condition R true.

A precondition describes the initial state of a program, and a postcondition describes the final state. Given a statement S and a postcondition R, the weakest precondition wp(S, R) describes the set of all states in which the statement S can begin execution and terminate with postcondition R true, and the weakest liberal precondition wlp(S, R) is the set of all states in which the statement S can begin execution and establish R as true if S terminates. In this respect, wp(S, R) establishes the total correctness of S, and wlp(S, R) establishes the partial correctness of S. The wp and wlp are called predicate transformers because they take a predicate R and, using the properties listed in Table 1, produce a new predicate.

Table 1. Properties of the wp and wlp predicate transformers

wp(S, A) = wp(S, true) ∧ wlp(S, A)
wp(S, A) ⇒ ¬wlp(S, ¬A)
wp(S, false) = false
wp(S, A ∧ B) = wp(S, A) ∧ wp(S, B)
wp(S, A ∨ B) ⇐ wp(S, A) ∨ wp(S, B)
wp(S, A → B) ⇒ (wp(S, A) → wp(S, B))

The context for our investigations is that we are reverse engineering systems that have desirable properties or functionality that should be preserved or extended. Therefore, the partial correctness model is sufficient for these purposes.

2.2.2. Strongest Postcondition

Consider the predicate ¬wlp(S, ¬R), which is the set of all states in which there exists an execution of S that terminates with R true. That is, we wish to describe the set of states in which satisfaction of R is possible (Dijkstra and Scholten, 1990). The predicate ¬wlp(S, ¬R) is contrasted with wlp(S, R), which is the set of states in which the computation of S either fails to terminate or terminates with R true.

An analogous characterization can be made in terms of the computation state space that describes initial conditions using the strongest postcondition sp(S, Q) predicate transformer (Dijkstra and Scholten, 1990), which is the set of all states in which there exists a computation of S that begins with Q true. That is, given that Q holds, execution of S results in sp(S, Q) true, if S terminates. As such, sp(S, Q) assumes partial correctness. Finally, we make the following observation about sp(S, Q) and wlp(S, R) and the relationship between the two predicate transformers, given the Hoare triple Q {S} R (Dijkstra and Scholten, 1990):

Q ⇒ wlp(S, R)   ≡   sp(S, Q) ⇒ R

The importance of this relationship is two-fold. First, it provides a formal basis for translating programming statements into formal specifications. Second, the symmetry of sp and wlp provides a method for verifying the correctness of a reverse engineering process that utilizes the properties of wlp and sp in tandem.
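As a small illustration of this symmetry (our own example, not taken from the paper, and using the assignment rules given later in Section 3.1), consider S as x := x + 1 with Q: x = 5 and R: x = 6; both characterizations yield the same check:

\begin{align*}
  wlp(x := x + 1,\; x = 6) &\equiv (x + 1 = 6) \equiv (x = 5),
      \quad\text{so } Q \Rightarrow wlp(S, R); \\
  sp(x := x + 1,\; x = 5)  &\equiv (\exists v :: v = 5 \wedge x = v + 1) \equiv (x = 6),
      \quad\text{so } sp(S, Q) \Rightarrow R.
\end{align*}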

2.2.3. sp vs. wp

Given a Hoare triple Q {S} R, we note that wp is a backward rule, in that a derivation of a specification begins with R, and produces a predicate wp(S, R). The predicate transformer wp assumes a total correctness model of computation, meaning that given S and R, if the computation of S begins in state wp(S, R), the program S will halt with condition R true.

We contrast this model with the sp model, a forward derivation rule. That is, given a precondition Q and a program S, sp derives a predicate sp(S, Q). The predicate transformer sp assumes a partial correctness model of computation, meaning that if a program starts in state Q, then the execution of S will place the program in state sp(S, Q) if S terminates. Figure 2 gives a pictorial depiction of the differences between sp and wp, where the input to the predicate transformer produces the corresponding predicate. Figure 2(a) gives the case where the input to the predicate transformer is "S" and "R", and the output of the predicate transformer (given by the box and appropriately named "wp") is "wp(S,R)". The sp case (Figure 2(b)) is similar, where the input to the predicate transformer is "S" and "Q", and the output of the transformer is "sp(S,Q)".

The use of these predicate transformers for reverse engineering has different implications. Using wp implies that a postcondition R is known. However, with respect to reverse engineering, determining R is the objective; therefore wp can only be used as a guideline for performing reverse engineering. The use of sp assumes that a precondition Q is known and that a postcondition will be derived through the direct application of sp. As such, sp is more applicable to reverse engineering.


[Figure 2 shows two black boxes: in (a), S and R enter the box labeled wp, which produces wp(S,R); in (b), S and Q enter the box labeled sp, which produces sp(S,Q).]

Figure 2. Black box representation and differences between wp and sp: (a) wp; (b) sp

3. Primitive Constructs

This section describes the derivation of formal specifications from the primitive programming constructs of assignment, alternation, and sequence. The Dijkstra guarded command language (Dijkstra, 1976) is used to represent each primitive construct, but the techniques are applicable to the general class of imperative languages. For each primitive, we first describe the semantics of the predicate transformers wlp and sp as they apply to each primitive and then, for reverse engineering purposes, describe specification derivation in terms of Hoare triples. Notationally, throughout the remainder of this paper, the notation {Q} S {R} will be used to indicate a partial correctness interpretation.

3.1. Assignment

An assignment statement has the form x := e, where x is a variable and e is an expression. The wlp of an assignment statement is expressed as wlp(x := e, R) = R[x := e], which represents the postcondition R with every free occurrence of x replaced by the expression e. This type of replacement is termed a textual substitution of x by e in expression R. If x corresponds to a vector y of variables and e represents a vector E of expressions, then the wlp of the assignment is of the form R[y := E], where each yi is replaced by Ei, respectively, in expression R. The sp of an assignment statement is expressed as follows (Dijkstra and Scholten, 1990):

sp(x := e, Q) = (∃v :: Q[x := v] ∧ x = e[x := v]),    (1)

where Q is the precondition, v is the quantified variable, and '::' indicates that the range of the quantified variable v is not relevant in the current context.

We conjecture that the removal of the quantification for the initial values of a variable is valid if the precondition Q has a conjunct that specifies the textual substitution. That is, performing the textual substitution Q[x := v] in Expression (1) is a redundant operation if, initially, Q has a conjunct of the form x = v. Refer to Appendix A where this case is described in more depth. Given the imposition of initial (or previous) values on variables, the Hoare triple formulation for assignment statements is as follows:


{ Q }                                  /* precondition */
x := e;
{ (x_(j+1) = e[x := x_j]) ∧ Q }        /* postcondition */

where x_j represents the initial value of the variable x, x_(j+1) is the subsequent value of x, and Q is the precondition. Subscripts are added to variables to convey historical information for a given variable.
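To illustrate the conjecture concretely (this worked instance is ours, not the authors'), suppose the precondition already records the prior value x_j of x, say Q ≡ (x = x_j) ∧ (y = 3), where y and the constant 3 are arbitrary:

\begin{align*}
  sp(x := x + 1,\; (x = x_j) \wedge (y = 3))
     &= (\exists v :: (v = x_j) \wedge (y = 3) \wedge (x = v + 1)) \\
     &\equiv (x = x_j + 1) \wedge (y = 3),
\end{align*}

so the existential quantifier is discharged directly, and the result matches the Hoare-triple form above, with x_(j+1) = x_j + 1 and the remainder of Q carried forward.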

Consider a program that consists of a series of assignments to a variable x: "x := a; x := b; x := c; x := d; x := e; x := f; x := g; x := h;".

(a) Code with strict sp application:

    {x = X}
    x := a;
    {x = a ∧ X = X}
    x := b;
    {x = b ∧ a = a}
    x := c;
    {x = c ∧ b = b}
      ...
    x := h;
    {x = h ∧ g = g}

(b) Code with historical subscripts:

    {x0 = X}
    x := a;
    {x1 = a}
    x := b;
    {x2 = b}
    x := c;
    {x3 = c}
      ...
    x := h;
    {x8 = h}

(c) Code with historical subscripts and propagation:

    {x0 = X}
    x := a;
    {x1 = a ∧ x0 = X}
    x := b;
    {x2 = b ∧ x1 = a ∧ ...}
    x := c;
    {x3 = c ∧ x2 = b ∧ ...}
      ...
    x := h;
    {x8 = h ∧ x7 = g ∧ ...}

Figure 3. Different approaches to specifying the history of a variable

Despite its simplicity, the example is useful in illustrating the different ways that the effects of an assignment statement on a variable can be specified. For instance, Figure 3(a) depicts the specification of the program by strict application of the strongest postcondition.

Another possible way to specify the program is through the use of historical subscripts for a variable. A historical subscript is an integer used to denote the i-th textual assignment to a variable, where a textual assignment is an occurrence of an assignment statement in the program source (versus the number of times the statement is executed). An example of the use of historical subscripts is given in Figure 3(b). However, when using historical subscripts, special care must be taken to maintain the consistency of the specification with respect to the semantics of other programming constructs. That is, using the technique shown in Figure 3(b) alone is not sufficient: the precondition of a given statement must be propagated to the postcondition, as shown in Figure 3(c). The main motivation for using histories is to remove the need to apply textual substitution to a complex precondition and to provide historical context to complex disjunctive and conjunctive expressions. The disadvantage of using such a technique is that the propagation of the precondition can be visually complex. Note that we have not changed the semantics of the strongest postcondition; rather, in the application of the strongest postcondition, extra information is appended that provides a historical context to all variables of a program during some "snapshot" or state of a program.

3.2. Alternation

An alternation statement using the Dijkstra guarded command language (Dijkstra, 1976) is expressed as

if
    B1 -> S1;
    [] ...
    [] Bn -> Sn;
fi;

where Bi -> Si is a guarded command such that Si is only executed if the logical expression (guard) Bi is true. The wlp for alternation statements is given by (Dijkstra and Scholten, 1990):

wlp(IF, R) = (∀i : Bi : wlp(Si, R)),

where IF represents the alternation statement. The equation states that the necessary condition to satisfy R, if the alternation statement terminates, is that given Bi is true, the wlp for each guarded statement Si with respect to R holds. The sp for alternation has the form (Dijkstra and Scholten, 1990)

sp(IF, Q) = (∃i :: sp(Si, Bi ∧ Q)).    (2)

The existential expression can be expanded into the following form

sp(IF, Q) = (sp(S1, B1 ∧ Q) ∨ ... ∨ sp(Sn, Bn ∧ Q)).    (3)

Expression (3) illustrates the disjunctive nature of alternation statements, where each disjunct describes the postcondition in terms of both the precondition Q and the guard and guarded command pairs, given by Bi and Si, respectively. This characterization follows the intuition that a statement Si is only executed if Bi is true. The translation of alternation statements to specifications is based on the similarity of the semantics of Expression (3) and the execution behaviour for alternation statements. Using the Hoare triple notation, a specification is constructed as follows


{ Q }
if
    B1 -> S1;
    [] ...
    [] Bn -> Sn;
fi;
{ sp(S1, B1 ∧ Q) ∨ ... ∨ sp(Sn, Bn ∧ Q) }
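For a concrete (invented) instance of Expression (3), consider a two-guard alternation statement that sets a sign flag s, with a precondition Q in which s does not occur:

% Assumed example: IF is  "if x >= 0 -> s := 1  []  x < 0 -> s := -1  fi".
\begin{align*}
  sp(\mathit{IF}, Q) &= sp(s := 1,\; (x \geq 0) \wedge Q) \;\vee\; sp(s := -1,\; (x < 0) \wedge Q) \\
                     &\equiv ((s = 1) \wedge (x \geq 0) \wedge Q) \;\vee\; ((s = -1) \wedge (x < 0) \wedge Q).
\end{align*}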

3.3. Sequence

For a given sequence of statements S1; ...; Sn, it follows that the postcondition for some statement Si is the precondition for the subsequent statement Si+1. The wlp and sp for sequences follow accordingly. The wlp for sequences is defined as follows (Dijkstra and Scholten, 1990):

wlp(S1; S2, R) = wlp(S1, wlp(S2, R)).

Likewise, the sp (Dijkstra and Scholten, 1990) is

sp(S1; S2, Q) = sp(S2, sp(S1, Q)).    (4)

In the case of wlp, the set of states for which the sequence S1; S2 can execute with R true (if the sequence terminates) is equivalent to the wlp of S1 with respect to the set of states defined by wlp(S2, R). For sp, the derived postcondition for the sequence S1; S2 with respect to the precondition Q is equivalent to the derived postcondition for S2 with respect to a precondition given by sp(S1, Q). The Hoare triple formulation and construction process is as follows:

{ Q }
S1;
{ sp(S1, Q) }
S2;
{ sp(S2, sp(S1, Q)) }
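A short worked instance of Expression (4) (our own, with concrete assignments and a precondition in which y does not occur) shows the composition:

% Assumed example: the sequence  x := x + 1; y := 2x  with precondition  x = 2.
\begin{align*}
  sp(x := x + 1,\; x = 2) &\equiv (x = 3), \\
  sp(x := x + 1;\; y := 2x,\; x = 2)
      &= sp(y := 2x,\; sp(x := x + 1,\; x = 2)) \\
      &= sp(y := 2x,\; x = 3) \equiv (x = 3) \wedge (y = 6).
\end{align*}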

4. Iterative and Procedural Constructs

The programming constructs of assignment, alternation, and sequence can be combined to produce straight-line programs (programs without iteration or recursion). The introduction of iteration and recursion into programs enables more compactness and abstraction in program development. However, constructing formal specifications of iterative and recursive programs can be problematic, even for the human specifier. This section discusses the formal specification of iteration and procedural abstractions without recursion. We deviate from our previous convention of providing the formalisms for wlp and sp for each construct and use an operational definition of how specifications are constructed. This approach is necessary because the formalisms for the wlp and sp for iteration are defined in terms of recursive functions (Dijkstra and Scholten, 1990, Gries, 1981) that are, in general, difficult to apply in practice.

4.1. Iteration

Iteration allows for the repetitive application of a statement. Iteration, using the Dijkstra language, has the form

do
    B1 -> S1;
    ...
od;

In more general terms, the iteration statement may contain any number of guarded commands of the form Bi -> Si, such that the loop is executed as long as any guard Bi is true. A simplified form of repetition is given by "do B -> S od".

In the context of iteration, a bound function determines the upper bound on the number of iterations still to be performed on the loop. An invariant is a predicate that is true before and after each iteration of a loop. The problem of constructing formal specifications of iteration statements is difficult because the bound functions and the invariants must be determined. However, for a partial correctness model of execution, concerns of boundedness and termination fall outside of the interpretation, and thus can be relaxed.

Using the abbreviated form of repetition "do B -> S od", the semantics for iteration in terms of the weakest liberal precondition predicate transformer wlp is given by the following (Dijkstra and Scholten, 1990):

wlp(DO, R) = (∀i : 0 ≤ i : wlp(IF^i, B ∨ R)),    (5)

where the notation "IF^i" is used to indicate the execution of "if B -> S fi" i times. Operationally, Expression (5) states that the weakest condition that must hold in order for the execution of an iteration statement to result with R true, provided that the iteration statement terminates, is equivalent to a conjunctive expression where each conjunct is an expression describing the semantics of executing the loop i times, where i ≥ 0.

The strongest postcondition semantics for repetition has a similar but notably distinct formulation (Dijkstra and Scholten, 1990):

sp(DO, Q) = ¬B ∧ (∃i : 0 ≤ i : sp(IF^i, Q)).    (6)

Expression (6) states that the strongest condition that holds after executing an iterative statement, given that condition Q holds, is equivalent to the condition where the loop guard is false (¬B), conjoined with a disjunctive expression describing the effects of iterating the loop i times, where i ≥ 0.

Although the semantics for repetition in terms of strongest postcondition and weakest liberal precondition are less complex than those of the weakest precondition (Dijkstra and Scholten, 1990), the recurrent nature of the closed forms makes the application of such semantics difficult. For instance, consider the counter program "do i < n -> i := i + 1 od". The application of the sp semantics for repetition leads to the following specification:

sp(do i < n -> i := i + 1 od, Q) = (i ≥ n) ∧ (∃j : 0 ≤ j : sp(IF^j, Q)).

The closed form for iteration suggests that the loop be unrolled j times. If j is set to n - start, where start is the initial value of variable i, then the unrolled version of the loop would have the following form:

i := start;
if i < n -> i := i + 1 fi;
if i < n -> i := i + 1 fi;
    ...
if i < n -> i := i + 1 fi;

Application of the rule for alternation (Expression (2)) yields the sequence of annotated code shown in Figure 4, where the goal is to derive

sp(do i < n -> i := i + 1 od, (start < n) ∧ (i = start)).

In the construction of specifications of iteration statements, knowledge must be introduced by a human specifier. For instance, in line 19 of Figure 4 the inductive assertion that "i = start + (n - start - 1)" is made. This assertion is based on a specifier providing the information that (n - start - 1) additions have been performed if the loop were unrolled at least (n - start - 1) times. As such, by using loop unrolling and induction, the derived specification for the code sequence is

((n - 1 < n) ∧ (i = n)).

For this simple example, we find that the solution is non-trivial when applying the formal definition of sp(DO, Q). As such, the specification process must rely on a user-guided strategy for constructing a specification. A strategy for obtaining a specification of a repetition statement is given in Figure 5.


1.  { (i = I) ∧ (start < n) }
2.  i := start;
3.  { (i = start) ∧ (start < n) }
4.  if i < n -> i := i + 1 fi
5.  { sp(i := i + 1, (i < n) ∧ (i = start) ∧ (start < n))
6.    ∨
7.    ((i >= n) ∧ (i = start) ∧ (start < n))
8.    =
9.    ((i = start + 1) ∧ (start < n)) }
10. if i < n -> i := i + 1 fi
11. { sp(i := i + 1, (i < n) ∧ (i = start + 1) ∧ (start < n))
12.   ∨
13.   ((i >= n) ∧ (i = start + 1) ∧ (start < n))
14.   =
15.   ((i = start + 2) ∧ (start + 1 < n))
16.   ∨
17.   ((i >= n) ∧ (i = start + 1) ∧ (start < n)) }
18.   ...
19. { ((i = start + (n - start - 1)) ∧ (start + (n - start - 1) - 1 < n))
20.   ∨
21.   ((i >= n) ∧ (i = start + (n - start - 2)) ∧ (start + (n - start - 2) - 1 < n))
22.   =
23.   ((i = n - 1) ∧ (n - 2 < n)) }
24. if i < n -> i := i + 1 fi
25. { sp(i := i + 1, (i < n) ∧ (i = n - 1) ∧ (n - 2 < n))
26.   ∨
27.   ((i >= n) ∧ (i = n - 1) ∧ (n - 2 < n))
28.   =
29.   (i = n) }

Figure 4. Annotated Source Code for Unrolled Loop


1. The following criteria are the main characteristics to be identified during the specification of the repetition statement:

• invariant (P): an expression describing the conditions prior to entry and upon exit of the iterative structure.

• guards (B): Boolean expressions that restrict the entry into the loop. Execution of each guarded command, Bi -> Si, terminates with P true, so that P is an invariant of the loop:

{P ∧ Bi} Si {P},   for 1 ≤ i ≤ n

When none of the guards is true and the invariant is true, then the postcondition of the loop should be satisfied (P ∧ ¬BB ⇒ R, where BB = B1 ∨ ... ∨ Bn and R is the postcondition).

2. Begin by introducing the assertion "Q ∧ BB" as the precondition to the body of the loop.

3. Query the user for modifications to the assertion made in step 2. This guided interaction allows the user to provide generalizations about arbitrary iterations of the loop. In order to verify that the modifications made by a user are valid, wlp can be applied to the assertion.

4. Apply the strongest postcondition to the loop body Si using the precondition given by step 3.

5. Using the specification obtained from step 4 as a guideline, query the user for a loop invariant. Although this step is non-trivial, techniques exist that aid in the construction of loop invariants (Katz and Manna, 1976, Gries, 1981).

6. Using the relationship stated above (P ∧ ¬BB ⇒ R), construct the specification of the loop by taking the negation of the loop guard and the loop invariant.

Figure 5. Strategy for constructing a specification for an iteration statement
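Applying steps 5 and 6 of this strategy to the counter loop above gives a sense of the payoff; the invariant below is one plausible choice of ours, not a derivation taken from the paper:

% Assumed invariant for "do i < n -> i := i + 1 od" with precondition (i = start) ∧ (start < n):
%   P : (start <= i) ∧ (i <= n),   guard  B : i < n.
\begin{align*}
  P \wedge \neg B \;\equiv\; (\mathit{start} \leq i) \wedge (i \leq n) \wedge (i \geq n) \;\equiv\; (i = n),
\end{align*}

which agrees with the specification derived from the unrolled loop in Figure 4.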

4.2. Procedural Abstractions

This section describes the construction of formal specifications from code containing the use of non-recursive procedural abstractions. A procedure declaration can be represented using the following notation


proc p (value x; value-result y; result z);
{P} (body) {Q}

where x, y, and z represent the value, value-result, and result parameters for the procedure, respectively. A parameter of type value means that the parameter is used only for input to the procedure. Likewise, a parameter of type result indicates that the parameter is used only for output from the procedure. Parameters that are known as value-result indicate that the parameters can be used for both input and output to the procedure. The notation (body) represents one or more statements making up the "procedure", while {P} and {Q} are the precondition and postcondition, respectively. The signature of a procedure appears as

proc p : (input-type)* -> (output-type)*    (7)

where the Kleene star (*) indicates zero or more repetitions of the preceding unit, input-type denotes the one or more names of input parameters to the procedure p, and output-type denotes the one or more names of output parameters of procedure p. A specification of a procedure can be constructed to be of the form

{ P : U }
proc p : E0 -> E1
(body)
{ Q : sp(body, U) ∧ U }

where E0 is one or more input parameter types with attribute value or value-result, and E1 is one or more output parameter types with attribute value-result or result. The postcondition for the body of the procedure, sp(body, U), is constructed using the previously defined guidelines for assignment, alternation, sequence, and iteration as applied to the statements of the procedure body.

Gries defines a theorem for specifying the effects of a procedure call (Gries, 1981) using a total correctness model of execution. Given a procedure declaration of the above form, the following condition holds (Gries, 1981)

{PR : P[x, y := a, b] ∧ (∀u, v :: Q[y, z := u, v] ⇒ R[b, c := u, v])} p(a, b, c) {R}    (8)

for a procedure call p(a, b, c), where a, b, and c represent the actual parameters of type value, value-result, and result, respectively. Local variables of procedure p used to compute value-result and result parameters are represented using u and v, respectively. Informally, the condition states that PR must hold before the execution of procedure p in order to satisfy R. In addition, PR states that the precondition for procedure p must hold for the parameters passed to the procedure and that the postcondition for procedure p implies R for each value-result and result parameter. The formulation of Equation (8) in terms of a partial correctness model of execution is identical, assuming that the procedure is straight-line, non-recursive, and terminates. Using this theorem for the procedure call, an abstraction of the effects of a procedure call can be derived using a specification of the procedure declaration. That is, the construction of a formal specification from a procedure call can be performed by inlining a procedure call and using the strongest postcondition for assignment.


begin
    {PR}
    p(a, b, c)
    {R}
end

begin
    declare x, y, z, u, v;
    {PR}
    x, y := a, b;
    {P}
    (body)
    {Q}
    y, z := u, v;
    {QR}
    b, c := y, z;
    {R}
end

Figure 6. Removal of procedure call p(a, b, c) abstraction

A procedure call p(a, b, c) can be represented by the program block (Gries, 1981) found in Figure 6, where (body) comprises the statements of the procedure declaration for p, {PR} is the precondition for the call to procedure p, {P} is the specification of the program after the formal parameters have been replaced by actual parameters, {Q} is the specification of the program after the procedure has been executed, {QR} is the specification of the program after formal parameters have been assigned with the values of local variables, and {R} is the specification of the program after the actual parameters to the procedure call have been "returned". By representing a procedure call in this manner, parameter binding can be achieved through multiple assignment statements and a postcondition R can be established by using the sp for assignment. Removal of a procedural abstraction enables the extension of the notion of straight-line programs to include non-recursive straight-line procedures. Making the appropriate sp substitutions, we can annotate the code sequence from Figure 6 to appear as follows:


{PR}
x, y := a, b;
{ P : (∃α, β :: PR[x, y := α, β] ∧ x = a[x, y := α, β] ∧ y = b[x, y := α, β]) }
(body)
{Q}
y, z := u, v;
{ QR : (∃γ, ζ :: Q[y, z := γ, ζ] ∧ y = u[y, z := γ, ζ] ∧ z = v[y, z := γ, ζ]) }
b, c := y, z;
{ R : (∃φ, ψ :: QR[b, c := φ, ψ] ∧ b = y[b, c := φ, ψ] ∧ c = z[b, c := φ, ψ]) }

where α, β, γ, ζ, φ, and ψ are the initial values of x, y (before execution of the procedure body), y (after execution of the procedure body), z, b, and c, respectively. Recall that in Section 3.1, we described how the existential operators and the textual substitution could be removed from the calculation of the sp. Applying that technique to assignments and recognizing that formal and actual result parameters have no initial values, and that local variables are used to compute the values of the value-result parameters, the above sequence can be simplified using the semantics of sp for assignments to obtain the following annotated code sequence:

{PR}
x, y := a, b;
{ P : PR ∧ x = a ∧ y = b }
(body)
{Q}
y, z := u, v;
{ QR : Q ∧ y = u ∧ z = v }
b, c := y, z;
{ R : QR ∧ b = y ∧ c = z }

where Q is derived using sp((body), P).

5. Example

The following example demonstrates the use of four major programming constructs described in this paper (assignment, alternation, sequence, and procedure call) along with the application of the translation rules for abstracting formal specifications from code. The program, shown in Figure 7, has four procedures, including three different implementations of "swap". AUTOSPEC (Cheng and Gannod, 1991, Gannod and Cheng, 1993, Gannod and Cheng, 1994) is a tool that we have developed to support the derivational approach to the reverse engineering of formal specifications from program code.


program MaxMin ( input, output );
var a, b, c, Largest, Smallest : real;

procedure FindMaxMin( NumOne, NumTwo:real; var Max, Min:real );
begin
  if NumOne > NumTwo then
    begin
      Max := NumOne;
      Min := NumTwo;
    end
  else
    begin
      Max := NumTwo;
      Min := NumOne;
    end
end;

procedure swapa( var X:integer; var Y:integer );
begin
  Y := Y + X;
  X := Y - X;
  Y := Y - X
end;

procedure swapb( var X:integer; var Y:integer );
var
  temp : integer;
begin
  temp := X;
  X := Y;
  Y := temp
end;

procedure funnyswap( X:integer; Y:integer );
var
  temp : integer;
begin
  temp := X;
  X := Y;
  Y := temp
end;

begin
  a := 5;
  b := 10;
  swapa(a,b);
  swapb(a,b);
  funnyswap(a,b);
  FindMaxMin(a,b,Largest,Smallest);
  c := Largest;
end.

Figure 7. Example Pascal program

Figures 8, 9, and 10 depict the output of AUTOSPEC when applied to the program code given in Figure 7, where the notation id{scope}instance is used to indicate a variable id with scope defined by the referencing environment for scope.


program MaxMin ( input, output );
var a, b, c, Largest, Smallest : real;

procedure FindMaxMin( NumOne, NumTwo:real; var Max, Min:real );
begin
  if (NumOne > NumTwo) then
    begin
      Max := NumOne;   (* Max{2}1 = NumOne0 & U *)
      Min := NumTwo;   (* Min{2}1 = NumTwo0 & U *)
    end                I: (* (Max{2}1 = NumOne0 & Min{2}1 = NumTwo0) & U *)
  else
    begin
      Max := NumTwo;   (* Max{2}1 = NumTwo0 & U *)
      Min := NumOne;   (* Min{2}1 = NumOne0 & U *)
    end                J: (* (Max{2}1 = NumTwo0 & Min{2}1 = NumOne0) & U *)
                       K: (* (((NumOne0 > NumTwo0) &
                               (Max{0}1 = NumOne0 & Min{0}1 = NumTwo0)) |
                              (not (NumOne0 > NumTwo0) &
                               (Max{0}1 = NumTwo0 & Min{0}1 = NumOne0))) & U *)
end                    L: (* (((NumOne0 > NumTwo0) &
                               (Max{0}1 = NumOne0 & Min{0}1 = NumTwo0)) |
                              (not (NumOne0 > NumTwo0) &
                               (Max{0}1 = NumTwo0 & Min{0}1 = NumOne0))) & U *)

Figure 8. Output created by applying AUTOSPEC to example

The instance identifier is used to provide an ordering of the assignments to a variable. The scope identifier has two purposes. When scope is an integer, it indicates the level of nesting within the current program or procedure. When scope is an identifier, it provides information about variables specified in a different context. For instance, if a call to some arbitrary procedure called foo is invoked, then specifications for variables local to foo are labeled with an integer scope. Upon return, the specification of the calling procedure will have references to variables local to foo. Although the variables being referenced are outside the scope of the calling procedure, a specification of the input and output parameters for foo can provide valuable information, such as the logic used to obtain the specification for the output variables of foo. As such, in the specification for the variables local to foo but outside the scope of the calling procedure, we use the scope label foo. Therefore, if we have a variable q local to foo, it might appear in a specification outside its local context as q{foo}4, where "4" indicates the fourth instance of variable q in the context of foo.


In addition to the notations for variables, we use the notation '|' to denote a logical-or, '&' to denote a logical-and, and the symbols '(*' and '*)' to delimit comments (i.e., specifications).

In Figure 8, the code for the procedure FindMaxMin contains an alternation statement, where lines I, J, K, and L specify the guarded commands of the alternation statement (I and J), the effect of the alternation statement (K), and the effect of the entire procedure (L), respectively.

Of particular interest are the specifications for the swap procedures given in Figure 9, named swapa and swapb. The variables X and Y are specified using the notation described above. As such, the first assignment to Y is written using Y{0}1, where Y is the variable, '{0}' describes the level of nesting (here, it is zero), and '1' is the historical subscript, the '1' indicating the first instance of Y after the initial value. The final comment for swapa (line M), which gives the specification for the entire procedure, reads as:

(* (Y{0}2 = X0 & X{0}1 = Y0 & Y{0}1 = Y0 + X0) & U *)

where Y{0}2 = X0 is the specification of the final value of Y, and X{0}1 = Y0 is the specification of the final value of X. In this case, the intermediate value of Y, denoted Y{0}1, with value Y0 + X0, is not considered in the final value of Y.

Procedure swapb uses a temporary variable algorithm for swap. Line N is the specification after the execution of the last line and reads as:

(* (Y{0}1 = X0 & X{0}1 = Y0 & temp{0}1 = X0) & U *)

where Y{0}1 = X0 is the specification of the final value of Y, and X{0}1 = Y0 is the specification of the final value of X.

Although each implementation of the swap operation is different, the code in each procedure effectively produces the same results, a property appropriately captured by the respective specifications for swapa and swapb with respect to the final values of the variables X and Y.
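Stated as a single formula (our paraphrase rather than part of the AUTOSPEC output, writing X_0, Y_0 for initial values and X_f, Y_f for final values instead of the tool's instance notation), both swapa and swapb satisfy

\[
(Y_{f} = X_{0}) \;\wedge\; (X_{f} = Y_{0}),
\]

where Y_f is the instance Y{0}2 for swapa and Y{0}1 for swapb, and X_f is X{0}1 in both cases. The remaining conjuncts (Y{0}1 = Y0 + X0 for swapa, temp{0}1 = X0 for swapb) constrain only intermediate or local values and do not affect the final values of X and Y.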

In addition, Figure 10 shows the formal specification of the funnyswap procedure. The semantics of the funnyswap procedure are similar to those of swapb. However, the parameter passing scheme used in this procedure is pass by value.

The specification of the main begin-end block of the program MaxMin is given in Figure 10. There are eight lines of interest, labeled I, J, K, L, M, N, O, and P, respectively. Lines I and J specify the effects of assignment statements. The specification at line K demonstrates the use of identifier scope labels, where in this case, we see the specification of variables X and Y from the context of swapa. Line L is another example of the same idea, where the specifications of variables from the context of swapb (X and Y) are given. Because funnyswap passes its parameters by value, no variables local to the main program are affected by the call to funnyswap, and thus the specification shows no change in variable values, as shown by line M of Figure 10. The effect of the call to procedure FindMaxMin provides another example of the specification of a procedure call (line N). Finally, line P is the specification of the entire program, with every precondition propagated to the final postcondition as described in Section 3.1. Here, of interest are the final values of the variables that are local to the program MaxMin (i.e., a, b, and c).


procedure swapa( var X:integer; var Y:integer );
begin
  Y := (Y + X);    (* (Y{0}1 = (Y0 + X0)) & U *)
  X := (Y - X);    (* (X{0}1 = ((Y0 + X0) - X0)) & U *)
  Y := (Y - X);    (* (Y{0}2 = ((Y0 + X0) - ((Y0 + X0) - X0))) & U *)
end            M: (* (Y{0}2 = X0 & X{0}1 = Y0 & Y{0}1 = Y0 + X0) & U *)

procedure swapb( var X:integer; var Y:integer );
var
  temp : integer;
begin
  temp := X;       (* (temp{0}1 = X0) & U *)
  X := Y;          (* (X{0}1 = Y0) & U *)
  Y := temp;       (* (Y{0}1 = X0) & U *)
end            N: (* (Y{0}1 = X0 & X{0}1 = Y0 & temp{0}1 = X0) & U *)

procedure funnyswap( X:integer; Y:integer );
var
  temp : integer;
begin
  temp := X;       (* (temp{0}1 = X0) & U *)
  X := Y;          (* (X{0}1 = Y0) & U *)
  Y := temp;       (* (Y{0}1 = X0) & U *)
end               (* (Y{0}1 = X0 & X{0}1 = Y0 & temp{0}1 = X0) & U *)

Figure 9. Output created by applying AUTOSPEC to example (cont.).

Thus, according to the rules for historical subscripts, the instances a{0}3, b{0}3, and c{0}1 are of interest. In addition, by propagating the preconditions for each statement, the logic that was used to obtain the values for the variables of interest can be analyzed.

6. Related Work

Previously, formal approaches to reverse engineering have used the semantics of the weakest precondition predicate transformer wp as the underlying formalism of their technique. The Maintainer's Assistant uses a knowledge-based transformational approach to construct formal specifications from program code via the use of a Wide-Spectrum Language (WSL) (Ward et al., 1989). A WSL is a language that uses both specification and imperative language constructs.


begin                (* Main Program for MaxMin *)
  a := 5;          I: (* a{0}1 = 5 & U *)
  b := 10;         J: (* b{0}1 = 10 & U *)
  swapa(a,b);      K: (* (b{0}2 = 5 &
                          (a{0}2 = 10 & (Y{swapa}2 = 5 &
                           (X{swapa}1 = 10 & Y{swapa}1 = 15)))) & U *)
  swapb(a,b);      L: (* (b{0}3 = 10 &
                          (a{0}3 = 5 & (Y{swapb}1 = 10 &
                           (X{swapb}1 = 5 & temp{swapb}1 = 10)))) & U *)
  funnyswap(a,b);  M: (* (Y{funnyswap}1 = 5 & X{funnyswap}1 = 10 &
                          temp{funnyswap}1 = 5) & U *)
  FindMaxMin(a,b,Largest,Smallest);
                   N: (* (Smallest{0}1 = Min{FindMaxMin}1 &
                          Largest{0}1 = Max{FindMaxMin}1 &
                          (((5 > 10) &
                            (Max{FindMaxMin}1 = 5 & Min{FindMaxMin}1 = 10)) |
                           (not (5 > 10) &
                            (Max{FindMaxMin}1 = 10 & Min{FindMaxMin}1 = 5)))) & U *)
  c := Largest;    O: (* c{0}1 = Max{FindMaxMin}1 & U *)
end.               P: (* ((c{0}1 = Max{FindMaxMin}1) &
                          (Smallest{0}1 = Min{FindMaxMin}1 &
                           Largest{0}1 = Max{FindMaxMin}1 &
                           (((5 > 10) &
                             (Max{FindMaxMin}1 = 5 & Min{FindMaxMin}1 = 10)) |
                            (not (5 > 10) &
                             (Max{FindMaxMin}1 = 10 & Min{FindMaxMin}1 = 5))))) &
                          (Y{funnyswap}1 = 5 & X{funnyswap}1 = 10 &
                           temp{funnyswap}1 = 5) &
                          (b{0}3 = 10 & a{0}3 = 5 &
                           (Y{swapb}1 = 10 & X{swapb}1 = 5 & temp{swapb}1 = 10)) &
                          (b{0}2 = 5 & a{0}2 = 10 &
                           (Y{swapa}2 = 5 & X{swapa}1 = 10 & Y{swapa}1 = 15)) &
                          (b{0}1 = 10 & a{0}1 = 5) & U *)

Figure 10. Output created by applying AUTOSPEC to example (cont.).

A knowledge base manages the correctness-preserving transformations of concrete, implementation constructs in a WSL to abstract specification constructs in the same WSL.


REDO (Lano and Breuer, 1989) (Restructuring, Maintenance, Validation and Documentation of Software Systems) is an Esprit II project whose objective is to improve applications by making them more maintainable through the use of reverse engineering techniques. The approach used to reverse engineer COBOL involves the development of general guidelines for the process of deriving objects and specifications from program code as well as providing a framework for formally reasoning about objects (Haughton and Lano, 1991).

In each of these approaches, the applied formalisms are based on the semantics of the weakest precondition predicate transformer wp. Some differences in applying wp and sp are that wp is a backward rule for program semantics and assumes a total correctness model of execution. However, the total correctness interpretation has no forward rule (i.e., no strongest total postcondition stp (Dijkstra and Scholten, 1990)). By using a partial correctness model of execution, both a forward rule (sp) and a backward rule (wlp) can be used to verify and refine formal specifications generated by program understanding and reverse engineering tasks. The main difference between the two approaches is the ability to directly apply the strongest postcondition predicate transformer to code to construct formal specifications, versus using the weakest precondition predicate transformer as a guideline for constructing formal specifications.
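For the assignment statement, the contrast between the backward and forward rules can be made concrete. The following reminder uses the standard Dijkstra/Gries formulations (and assumes the expression e is everywhere well defined, so that wp and wlp coincide for assignment); it is included here only for orientation:

\[
wp(x := e,\; R) \;=\; R^{x}_{e}, \qquad
sp(x := e,\; Q) \;=\; (\exists v :: Q^{x}_{v} \wedge x = e^{x}_{v}).
\]

For example, taking the statement x := x + 1,

\[
wp(x := x + 1,\; x > 1) \;=\; (x + 1 > 1) \;\equiv\; (x > 0), \qquad
sp(x := x + 1,\; x > 0) \;=\; (\exists v :: v > 0 \wedge x = v + 1) \;\equiv\; (x > 1).
\]

The backward rule transforms a desired postcondition into a precondition, while the forward rule transforms a known precondition into the strongest property guaranteed afterwards, which is the direction needed when deriving specifications from existing code.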

7. Conclusions and Future Investigations

Formal methods provide many benefits in the development of software. Automating the process of abstracting formal specifications from program code is a sought-after goal but, unfortunately, is not yet completely realizable. However, by providing tools that support the reverse engineering of software, much can be learned about the functionality of a system.

The level of abstraction of the specifications constructed using the techniques described in this paper is the "as-built" level; that is, the specifications contain implementation-specific information. For straight-line programs (programs without iteration or recursion), the techniques described herein can be applied in order to obtain a formal specification from program code. As such, automated techniques for verifying the correctness of straight-line programs can be facilitated.

Since our technique for reverse engineering is based on the use of strongest postcondition for deriving formal specifications from program code, the application of the technique to other programming languages can be achieved by defining the formal semantics of a programming language using strongest postcondition, and then applying those semantics to the programming constructs of a program. Our current investigations into the use of strongest postcondition for reverse engineering focus on three areas. First, we are extending our method to encompass all major facets of imperative programming constructs, including iteration and recursion. To this end, we are in the process of defining the formal semantics of the ANSI C programming language using strongest postcondition and are applying our techniques to a NASA mission control application for unmanned spacecraft. Second, methods for constructing higher level abstractions from lower level abstractions are being investigated. Finally, a rigorous technique for re-engineering specifications from the imperative programming paradigm to the object-oriented programming paradigm is being developed (Gannod and Cheng, 1993). Directly related to this work is the potential for


applying the results to facilitate software reuse, where automated reasoning is applied to the specifications of existing components to determine reusability (Jeng and Cheng, 1992).

Acknowledgments

The authors greatly appreciate the comments and suggestions from the anonymous referees. Also, the authors wish to thank Linda Wills for her efforts in organizing this special issue. Finally, the authors would like to thank the participants of the IEEE 1995 Working Conference on Reverse Engineering for their feedback and comments on an earlier version of this paper.

This is a revised and extended version of "Strongest Postcondition Semantics as the Formal Basis for Reverse Engineering" by G.C. Gannod and B.H.C. Cheng, which first appeared in the Proceedings of the Second Working Conference on Reverse Engineering, IEEE Computer Society Press, pp. 188-197, July 1995.

Appendix A

Motivations for Notation and Removal of Quantification

Section 3.1 states a conjecture that the removal of the quantification for the initial values of a variable is valid if the precondition Q has a conjunct that specifies the textual substitution. This Appendix discusses this conjecture. Recall that

\[
sp(x := e,\; Q) \;=\; (\exists v :: Q^{x}_{v} \,\wedge\, x = e^{x}_{v}) \qquad \mathrm{(A.1)}
\]

There are two goals that must be satisfied in order to use the definition of strongest postcondition for assignment. They are:

1. Elimination of the existential quantifier

2. Development and use of a traceable notation.

Eliminating the Quantifier. First, we address the elimination of the existential quantifier. Consider the RHS of definition (A.1). Let y be a variable such that

\[
(Q^{x}_{y} \,\wedge\, x = e^{x}_{y}) \;\Rightarrow\; (\exists v :: Q^{x}_{v} \,\wedge\, x = e^{x}_{v}). \qquad \mathrm{(A.2)}
\]

Define sp_ρ(x := e, Q) (pronounced "s-p-rho") as the strongest postcondition for assignment with the quantifier removed. That is,

\[
sp_{\rho}(x := e,\; Q) \;=\; (Q^{x}_{y} \,\wedge\, x = e^{x}_{y}) \quad \text{for some } y. \qquad \mathrm{(A.3)}
\]

Given the definition of sp_ρ, it follows that

\[
sp_{\rho}(x := e,\; Q) \;\Rightarrow\; sp(x := e,\; Q). \qquad \mathrm{(A.4)}
\]


As such, the specification of the assignment statement can be made simpler if the y from equation (A.3) can either be identified explicitly or named implicitly. The choice of y must be made carefully. For instance, consider the following. Let Q := P ∧ (x = z) such that P contains no free occurrences of x. Choosing an arbitrary a for y in (A.3) leads to the following derivation:

\[
\begin{array}{l}
sp_{\rho}(x := e,\; Q) \\
\;\; = \;\; (Q := P \wedge (x = z)) \\
\qquad (P \wedge (x = z))^{x}_{a} \,\wedge\, (x = e^{x}_{a}) \\
\;\; = \;\; \text{(textual substitution)} \\
\qquad P^{x}_{a} \wedge (x = z)^{x}_{a} \wedge (x = e^{x}_{a}) \\
\;\; = \;\; \text{($P$ has no free occurrences of $x$; textual substitution)} \\
\qquad P \wedge (a = z) \wedge (x = e^{x}_{a}) \\
\;\; = \;\; (a = z) \\
\qquad P \wedge (a = z) \wedge (x = (e^{x}_{a})^{a}_{z}) \\
\;\; = \;\; \text{(textual substitution)} \\
\qquad P \wedge (a = z) \wedge (x = e^{x}_{z}).
\end{array}
\]

At first glance, this choice of y would seem to satisfy the first goal, namely removal of the quantification. However, this is not the case. Suppose P were replaced with P′ ∧ (a ≠ z). The derivation would lead to

\[
sp_{\rho}(x := e,\; Q) \;=\; P' \wedge (a \neq z) \wedge (a = z) \wedge (x = e^{x}_{a}).
\]

This is unacceptable because it leads to a contradiction, meaning that the specification of a program describes impossible behaviour. Ideally, it is desired that the specification of the assignment statement satisfy two requirements. It must:

1. Describe the behaviour of the assignment of the variable x, and

2. Adjust the precondition Q so that the free occurrences of x are replaced with the value of x before the assignment is encountered.

It can be proven that, through successive assignments to a variable x, the specification sp_ρ will have only one conjunct of the form (x = β), where β is an expression. Informally, we note that each successive application of sp_ρ uses a textual substitution that eliminates free references to x in the precondition and introduces a conjunct of the form (x = β).

The convention used by the approach described in this paper is to choose for y the expression β. If no β can be identified, use a placeholder γ such that the precondition Q has no occurrence of γ. As an example, let y in equation (A.3) be z, and Q := P ∧ (x = z). Then

\[
sp_{\rho}(x := e,\; Q) \;=\; P \wedge (z = z) \wedge (x = e^{x}_{z}).
\]

Notice that the last conjunct in each of the derivations is (x = e^x_z) and that, since P contains no free occurrences of x, P is an invariant.


Notation. Define sp_ρι (pronounced "s-p-rho-iota") as the strongest postcondition for assignment with the quantifier removed and with indices. Formally, sp_ρι has the form

\[
sp_{\rho\iota}(x := e,\; Q) \;=\; (Q^{x}_{y} \,\wedge\, x_{k} = e^{x}_{y}) \quad \text{for some } y. \qquad \mathrm{(A.5)}
\]

Again, an appropriate y must be chosen. Let Q := P ∧ (x_i = y), where P has no occurrence of x other than i subscripted x's of the form (x_j = e_j), 0 < j < i. Based on the previous discussion, choose y to be the RHS of the relation (x_i = y). As such, the definition of sp_ρι can be modified to appear as

\[
sp_{\rho\iota}(x := e,\; Q) \;=\; ((P \wedge (x_{i} = y))^{x}_{y} \,\wedge\, x_{i+1} = e^{x}_{y}) \quad \text{for some } y. \qquad \mathrm{(A.6)}
\]

Consider the following example, where subscripts are used to show the effects of two consecutive assignments to the variable x. Let Q := P ∧ (x_i = a), and let the assignment statement be x := e. Application of sp_ρι yields

\[
\begin{array}{l}
sp_{\rho\iota}(x := e,\; Q) \\
\;\; = \;\; (P \wedge (x_{i} = a))^{x}_{a} \,\wedge\, (x_{i+1} = e)^{x}_{a} \\
\;\; = \;\; \text{(textual substitution)} \\
\qquad P^{x}_{a} \wedge (x_{i} = a)^{x}_{a} \wedge (x_{i+1} = e)^{x}_{a} \\
\;\; = \;\; \text{(textual substitution)} \\
\qquad P \wedge (x_{i} = a) \wedge (x_{i+1} = e^{x}_{a})
\end{array}
\]

A subsequent application of sp_ρι on the statement x := f, subject to Q' := Q ∧ (x_{i+1} = e^x_a), has the following derivation:

\[
\begin{array}{l}
sp_{\rho\iota}(x := f,\; Q') \\
\;\; = \;\; (P \wedge (x_{i} = a) \wedge (x_{i+1} = e^{x}_{a}))^{x}_{e^{x}_{a}} \,\wedge\, x_{i+2} = f^{x}_{e^{x}_{a}} \\
\;\; = \;\; \text{(textual substitution)} \\
\qquad P^{x}_{e^{x}_{a}} \wedge (x_{i} = a)^{x}_{e^{x}_{a}} \wedge (x_{i+1} = e^{x}_{a})^{x}_{e^{x}_{a}} \wedge x_{i+2} = f^{x}_{e^{x}_{a}} \\
\;\; = \;\; \text{($P$ has no free $x$; textual substitution)} \\
\qquad P \wedge (x_{i} = a) \wedge (x_{i+1} = e^{x}_{a}) \wedge x_{i+2} = f^{x}_{e^{x}_{a}} \\
\;\; = \;\; \text{(definition of $Q$)} \\
\qquad Q \wedge (x_{i+1} = e^{x}_{a}) \wedge x_{i+2} = f^{x}_{e^{x}_{a}} \\
\;\; = \;\; \text{(definition of $Q'$)} \\
\qquad Q' \wedge x_{i+2} = f^{x}_{e^{x}_{a}}
\end{array}
\]

Therefore, it is observed that by using historical subscripts, the construction of the specification of the assignment statements involves the propagation of the precondition Q as an invariant conjoined with the specification of the effects of setting a variable to a dependent value. This convention makes the evaluation of a specification annotation traceable by avoiding the elimination of descriptions of variables and their values at certain steps in the program. This is especially helpful in the case where choice statements (alternation and iteration) create alternative values for specific variable instances.
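As a concrete illustration of these rules (our own small example, not drawn from the paper's figures; the symbolic initial value a is hypothetical), consider the two assignments x := x + 1 followed by x := 2 * x, starting from the precondition Q := (x_0 = a):

\[
\begin{array}{l}
sp_{\rho\iota}(x := x + 1,\; x_{0} = a) \;=\; (x_{0} = a) \wedge (x_{1} = a + 1) \\[2pt]
sp_{\rho\iota}(x := 2 * x,\; (x_{0} = a) \wedge (x_{1} = a + 1)) \;=\; (x_{0} = a) \wedge (x_{1} = a + 1) \wedge (x_{2} = 2 * (a + 1))
\end{array}
\]

Each application keeps the earlier conjuncts intact and adds exactly one new conjunct for the newest instance of x, so the entire history of values assigned to x remains visible in the final specification.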


References

Byrne, Eric J. A Conceptual Foundation for Software Re-engineering. In Proceedings for the Conference on Software Maintenance, pages 226-235. IEEE, 1992.
Byrne, Eric J. and Gustafson, David A. A Software Re-engineering Process Model. In COMPSAC. IEEE, 1992.
Cheng, Betty H.C. Applying formal methods in automated software development. Journal of Computer and Software Engineering, 2(2):137-164, 1994.
Cheng, Betty H.C. and Gannod, Gerald C. Abstraction of Formal Specifications from Program Code. In Proceedings for the IEEE 3rd International Conference on Tools for Artificial Intelligence, pages 125-128. IEEE, 1991.
Chikofsky, Elliot J. and Cross, James H. Reverse Engineering and Design Recovery: A Taxonomy. IEEE Software, 7(1):13-17, January 1990.
Dijkstra, Edsger W. A Discipline of Programming. Prentice Hall, 1976.
Dijkstra, Edsger W. and Scholten, Carel S. Predicate Calculus and Program Semantics. Springer-Verlag, 1990.
Flor, Victoria Slind. Ruling's Dicta Causes Uproar. The National Law Journal, July 1991.
Gannod, Gerald C. and Cheng, Betty H.C. A Two Phase Approach to Reverse Engineering Using Formal Methods. Lecture Notes in Computer Science: Formal Methods in Programming and Their Applications, 735:335-348, July 1993.
Gannod, Gerald C. and Cheng, Betty H.C. Facilitating the Maintenance of Safety-Critical Systems Using Formal Methods. The International Journal of Software Engineering and Knowledge Engineering, 4(2):183-204, 1994.
Gries, David. The Science of Programming. Springer-Verlag, 1981.
Haughton, H.P. and Lano, Kevin. Objects Revisited. In Proceedings for the Conference on Software Maintenance, pages 152-161. IEEE, 1991.
Hoare, C.A.R. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576-580, October 1969.
Jeng, Jun-jang and Cheng, Betty H.C. Using Automated Reasoning to Determine Software Reuse. International Journal of Software Engineering and Knowledge Engineering, 2(4):523-546, December 1992.
Katz, Shmuel and Manna, Zohar. Logical Analysis of Programs. Communications of the ACM, 19(4):188-206, April 1976.
Lano, Kevin and Breuer, Peter T. From Programs to Z Specifications. In John E. Nicholls, editor, Z User Workshop, pages 46-70. Springer-Verlag, 1989.
Leveson, Nancy G. and Turner, Clark S. An Investigation of the Therac-25 Accidents. IEEE Computer, pages 18-41, July 1993.
Osborne, Wilma M. and Chikofsky, Elliot J. Fitting pieces to the maintenance puzzle. IEEE Software, 7(1):11-12, January 1990.
Ward, M., Calliss, F.W., and Munro, M. The Maintainer's Assistant. In Proceedings for the Conference on Software Maintenance. IEEE, 1989.
Wing, Jeannette M. A Specifier's Introduction to Formal Methods. IEEE Computer, 23(9):8-24, September 1990.
Yourdon, E. and Constantine, L. Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design. Yourdon Press, 1978.

Automated Software Engineering, 3, 165-172 (1996) © 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Recent Trends and Open Issues in Reverse Engineering

LINDA M. WILLS                                              linda.wills@ee.gatech.edu
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250

JAMES H. CROSS II                                           cross@eng.auburn.edu
Auburn University, Computer Science and Engineering, 107 Dunstan Hall, Auburn University, AL 36849

Abstract. This paper discusses recent trends in the field of reverse engineering, particularly those highlighted at the Second Working Conference on Reverse Engineering, held in July 1995. The trends observed include increased orientation toward tasks, grounding in complex real-world applications, guidance from empirical study, analysis of non-code sources, and increased formalization. The paper also summarizes open research issues and provides pointers to future events and sources of information in this area.

1. Introduction

Researchers in reverse engineering use a variety of metaphors to describe the role their work plays in software development and evolution. They are detectives, piecing together clues incrementally discovered about a system's design and what "crimes" were committed in its evolution. They are rescuers, salvaging huge software investments, left stranded by shifting hardware platforms and operating systems. Some practice radiology, finding ways of viewing internal structures, obscured by and entangled with other parts of the software "organism": objects in procedural programs, logical data models in relational databases, and data and control flow "circulatory and nervous systems." Others are software archeologists (Chikofsky, 1995), reconstructing models of structures buried in the accumulated deposits of software patches and fixes; inspectors, measuring compliance with design, coding, and documentation standards; foreign language interpreters, translating software in one language to another; and treasure hunters and miners, searching for gems to extract, polish, and save in a reuse library.

Although working from diverse points of view, reverse engineering researchers have a common goal of recovering information from existing software systems. Conceptual complexity is the software engineer's worst enemy. It directly affects costs and ultimately the reliability of the delivered system. Comprehension of existing systems is the underlying goal of reverse engineering technology. By examining and analyzing the system, the reverse engineering process generates multiple views of the system that highlight its salient features and delineate its components and the relationships between them (Chikofsky and Cross, 1990).

Recovering this information makes possible a wide array of critical software engineering activities, including those mentioned above. The prospect of being able to provide tools and methodologies to assist and automate portions of the reverse engineering process is


an appealing one. Reverse engineering is an area of tremendous economic importance to the software industry not only in saving valuable existing assets, but also in facilitating the development of new software.

From the many different metaphors used to describe the diverse roles that reverse engineering plays, it is apparent that supporting and semi-automating the process is a complex, multifarious problem. There are many different types of information to extract and many different task situations, with varying availability and accuracy of information about the software. A variety of approaches and skills is required to attack this problem.

To help achieve coherence and facilitate communication in this rapidly growing field, researchers and practitioners have been meeting at the Working Conference on Reverse Engineering, the first of which was held in May 1993 (Waters and Chikofsky, 1993). The Working Conference provides a forum for researchers to discuss as a group current research directions and challenges to the field. The adjective "working" in the title emphasizes the conference's format of interspersing significant periods of discussion with paper presentations. The Second Working Conference on Reverse Engineering (Wills et al., 1995) was held in July 1995, organized by general chair Elliot Chikofsky of Northeastern University and the DMR Group, and by program co-chairs Philip Newcomb of The Software Revolution and Linda Wills of Georgia Institute of Technology.

This article uses highlights and observations from the Second Working Conference on Reverse Engineering to present a recent snapshot of where we are with respect to our overall goals, what new trends are apparent in the field, and where we are heading. It also points out areas where hopefully more research attention will be drawn in the future. Finally, it provides pointers to future conferences and workshops in this area and places to find additional information.

2. Increased Task-Orientation

The diverse set of metaphors listed above indicates the variety of tasks in which reverse engineering plays a significant role. Different tasks place different demands on the reverse engineering process. The issue in reverse engineering is not only how to extract information from an existing system, but which information should be extracted and in what form it should be made accessible. Researchers are recognizing the need to tailor reverse engineering tools toward recovering information relevant to the task at hand. Mechanisms for focused, goal-driven inquiries about a software system are actively being developed.

Dynamic Documentation. A topic of considerable interest is automatically generating accessible, dynamic documentation from legacy systems. Lewis Johnson coined the phrase "explanation on demand" for this type of documentation technology (Johnson, 1995). The strategy is to concentrate on generating only documentation that addresses specific tasks, rather than generating all possible documentation whether it is needed or not.

Two important open issues are: what formalisms are appropriate for documentation, and how well do existing formalisms match the particular tasks maintainers have to perform? These issues are relevant to documentation at all levels of abstraction. For example, a similar issue arises in program understanding: what kinds of formal design representations


should be used as a target for program understanding systems? How can multiple models of design abstractions be extracted, viewed, and integrated?

Varying the Depth of Analysis. Depending on the task, different levels of analysis power are required. For example, recent advances have been made in using analysis techniques to detect duplicate fragments of code in large software systems (Baker, 1995, Kontogiannis et al., 1995). This is useful in identifying candidates for reuse and in preventing inconsistent maintenance of conceptually related code. If a user were interested only in detecting instances of "cut-and-paste" reuse, it would be sufficient to find similarities based on matching syntactic features (e.g., constant and function names, variable usage, and keywords), without actually understanding the redundant pieces. The depth of analysis must be increased, however, if more complex, semantic similarities are to be detected, for example, for the task of identifying families of reusable components that all embody the same mathematical equations or business rules.

Interactive Tools. Related to the issue of providing flexibility in task-oriented tools is the degree of automation and interaction the tools have with people (programmers, maintainers, and domain experts (Quilici and Chin, 1995)). How is the focusing done? Who is controlling the depth of analysis and level of effort? The reverse engineering process is characterized by a search for knowledge about a design artifact with limited sources of information available. The person and the tool each bring different types of interpretive skills and information sources to the discovery process. A person can often see global patterns in data or subtle connections to informal domain concepts that would be difficult for tools based on current technology to uncover. Successful collaboration will depend on finding ways to leverage the respective abilities of the collaborators. The division of labor will be influenced by the task and environmental situation.

3. Attacking Industrial-Strength Problems

The types of problems that are driving reverse engineering research come from real-world systems and applications. Early work tended to focus on simplified versions of reverse engineering problems, often using data that did not always scale up to more realistic problems (Selfridge et al., 1993). This helped in initial explorations of techniques that have since matured.

At the Working Conference, several researchers reported on the application of reverse engineering techniques to practical industrial problems with results of significant economic importance. The software and legacy systems to which their techniques are being applied are quite complex, large, and diverse. Examples include a public key encryption program, industrial invoicing systems, the X window system, and software for analyzing data sent back from space missions. Consequently, the types of information being extracted from existing software span a wide range, including specifications, business rules, objects, and, more recently, architectural features. A good example of a large scale application was provided by Philip Newcomb (Newcomb, 1995), who presented a tool, called the Legacy System Cataloging Facility. This tool supports modeling, analyzing, and transforming legacy systems on an enterprise scale by providing a mechanism for efficiently storing and managing huge models of information systems at Boeing Computer Services.


Current applications are pushing the limits of existing techniques in terms of scalability and feasibility. Exploring these issues and developing new techniques in the context of real-world systems and problems is critical.

4. More Empirical Studies

One of the prerequisites in addressing real-world, economically significant reverse engineering problems is understanding what the problems are and establishing requirements on what it would take to solve them. Researchers are recognizing the necessity of conducting studies that examine what practitioners are doing currently, what is needed to support them, and how well (or poorly) the existing technology is meeting their needs.

The results of one such full-scale case study were presented at the Working Conference by Piernicola Fiore (Fiore et al., 1995). The study focused on a reverse engineering project at a software factory (Basica S.p.A. in Italy) to reverse engineer banking software. Based on an analysis of productivity, the study identified the need for adaptable automated tools. Results indicated that cost is not necessarily related to the number of lines of code, and that both the data and the program need distinct econometric models.

In addition to this formal, empirical investigation, some informal studies were reported at the Working Conference. During a panel discussion, Lewis Johnson described his work on dynamic, accessible documentation, which was driven by studies of inquiry episodes gathered from newsgroups. This helped to determine what types of questions software users and maintainers typically ask. (Blaha and Premerlani, 1995) reported on idiosyncracies they observed in relational database designs, many of which are in commercial software products!

Empirical data is useful not only in driving and guiding reverse engineering technology development, but also in estimating the effort involved in reverse engineering a given system. This can influence a software engineer's decisions about whether to reengineer a system or opt for continued maintenance or a complete redesign (Newcomb, 1995). While the value of case studies is widely recognized, relatively few have been conducted thus far.

Closely related to this problem is the critical need for publicly available data sets that embody representative reverse engineering problems (e.g., a legacy database system including all its associated documentation (Selfridge et al., 1993)). Adopting these as standard test data sets would enable researchers to quantitatively compare results and set clear milestones for measuring progress in the field.

Unfortunately, it is difficult to find data sets that can be agreed upon as being representative of those found in common reverse engineering situations. They must not be proprietary and they must be made easily accessible. Papers describing case studies and available data sets would significantly contribute to advancing the research in this field (Selfridge et al., 1993) and are actively sought by the Working Conference.

5. Looking Beyond Code for Sources of Information

In trying to understand aspects of a software system, a reverse engineer uses all the sources of information available. In the past, most reverse engineering research focused on supporting


the recovery of information solely from the source code. Recently, the value of non-code system documents as rich sources of information has been recognized. Documents associated with the source code often contain information that is difficult to capture in the source code itself, such as design rationale, connections to "human-oriented" concepts (Biggerstaff et al., 1994), or the history of evolutionary steps that went into creating the software.

For example, at the Working Conference, analysis techniques were presented that automatically derived test cases from reference manuals and structured requirements (Lutsky, 1995), business rules and a domain lexicon from structured analysis specifications (Leite and Cerqueira, 1995), and formal semantics from dataflow diagrams (Butler et al., 1995).

A crucial open issue in this area of exploration is what happens when one source of information is inaccurate or inconsistent with another source of information, particularly the code. Who is the final arbiter? Often it is valuable simply to detect such inconsistencies, as is the case in generating test cases.

6. Increased Formalization

When a field is just beginning to form, it is common for researchers to try many different informal techniques and experimental methodologies to get a handle on the complex problems they face. As the field matures, researchers start to formalize their methods and the underlying theory. The field of reverse engineering is starting to see this type of growth.

A fruitful interplay is emerging between prototyping and experimenting with new techniques, which are sketched out informally, and the process of formalization, which tries to provide an underlying theoretical basis for these informal techniques. This helps make the methods more precise and less prone to ambiguous results. Formal methods contribute to the validation of reverse engineering technology and to a clearer understanding of fundamental reverse engineering problems.

While formal methods, with their well-defined notations, also have a tremendous potential for facilitating automation, the current state-of-the-art focuses on small programs. This raises issues of practicality, feasibility and scalability. A promising strategy is to explore how formal methods can be used in conjunction with other approaches, for example, coupling pattern matching with symbolic execution.

Although the formal notations lend themselves to machine manipulation, they tend to introduce a communication barrier between the reverse engineer who is not familiar with formal methods and the machine. Making reverse engineering tools based on formal methods accessible to practicing engineers will require the support of interfaces to the formal notations, including graphical notations and domain-oriented representations, such as those being explored in applying formal methods to component-based reuse (Lowry et al., 1994).

7. Challenges for the Future

Other issues not specifically addressed by papers presented at the Working Conference include:


• How do we validate and test reverse engineering technology?

• How do we measure its potential impact? How can we support the critical task of assessment that should precede any reverse engineering activity? This includes determining how amenable an artifact is to reverse engineering, what outcome is expected, the estimated cost of the reverse engineering project, and the anticipated cost of not reverse engineering. Most reverse engineering research assumes that reverse engineering will be performed and thus overlooks this critical assessment task, which needs tools and methodologies to support it.

• What can we do now to prevent the software systems we are currently creating from becoming the incomprehensible legacy systems of tomorrow? For example, what new problems does object-oriented code present? What types of programming language features, documentation, or design techniques are helpful for later comprehension and evolution of the software?

• A goal of reverse engineering research is to raise the conceptual level at which software tools interact and communicate with software engineers, domain experts, and end users. This raises issues concerning how to most effectively acquire, refine, and use knowledge of the application domain. How can it be used to organize and present information extracted in terms the tool user can readily comprehend? What new presentation and visualization techniques are useful? How can domain knowledge be captured from non-code sources? What new techniques are needed to reverse engineer programs written in non-traditional, domain-oriented "languages," such as spreadsheets, database queries, grammar-based specifications, and hardware description languages?

• A clearer articulation of the reverse engineering process is needed. What is the life-cycle of a reverse engineering activity and how does it relate to the forward engineering life-cycle? Can one codify the best practices of reverse engineers, and thereby improve the effectiveness of reverse engineering generally?

• What is management's role in the success of reverse engineering technology? From the perspective of management, reverse engineering is often seen as a temporary set of activities, focused on short-term transition. As such, management is reluctant to invest heavily in reverse engineering research, education, and application. In reality, reverse engineering can be used in forward engineering as well as maintenance to better control conceptual complexity across the life-cycle of evolving software.

8. Conclusion and Future Events

This article has highlighted the key trends in the field of reverse engineering that we observed at the Second Working Conference. More details about the WCRE presentations and discussions are given in (Cross et al., 1995). The 1993 and 1995 WCRE proceedings are available from IEEE Computer Society Press.

Even more important than the trends and ideas discussed is the energy and enthusiasm shared by the research community. Even though the problems being attacked are complex,


they are intensely interesting and highly relevant to many software-related activities. One of the hallmarks of the Working Conference is that Elliot Chikofsky manages to come up with amusing reverse engineering puzzles that allow attendees to revel in the reverse engineering process. For example, at the First Working Conference, he challenged attendees to reverse engineer jokes given only their punch-lines. This year, he created a "reverse taxonomy" of tongue-in-cheek definitions that needed to be reverse engineered into computing-related words.

The next Working Conference is planned for November 8-10, 1996 in Monterey, CA. It will be held in conjunction with the 1996 International Conference on Software Maintenance (ICSM). Further information on the upcoming Working Conference can be found at http://www.ee.gatech.edu/conferences/WCRE or by sending mail to wcre@computer.org. Other future events related to reverse engineering include:

• the Workshop on Program Comprehension, which was held in conjunction with the International Conference on Software Engineering in March, 1996 in Berlin, Germany;

• the International Workshop on Computer-Aided Software Engineering (CASE), which is being planned for London, England, in the Summer of 1997; and

• the Reengineering Forum, a commercially-oriented meeting, which complements the Working Conference and is being held June 27-28, 1996 in St. Louis, MO.

Acknowledgments

This article is based, in part, on notes taken by rapporteurs at the Second Working Conference on Reverse Engineering: Gerardo Canfora, David Eichmann, Jean-Luc Hainaut, Lewis Johnson, Julio Cesar Leite, Ettore Merlo, Michael Olsem, Alex Quilici, Howard Reubenstein, Spencer Rugaber, and Mark Wilson. We also appreciate comments from Lewis Johnson which contributed to our list of challenges.

Notes

1. Some examples of Elliot's reverse taxonomy: (A) a suggestion made to a computer; (B) the answer when asked "what is that bag the Blue Jays batter runs to after hitting the ball?"; (C) an instrument used for entering errors into a system. Answers: (A) command; (B) database; (C) keyboard.

References

Baker, B. On finding duplication and near-duplication in large software systems. In (Wills et al., 1995), pages 86-95.
Biggerstaff, T., B. Mitbander, and D. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72-83, May 1994.
Blaha, M. and W. Premerlani. Observed idiosyncracies of relational database designs. In (Wills et al., 1995), pages 116-125.
Butler, G., P. Grogono, R. Shinghal, and I. Tjandra. Retrieving information from data flow diagrams. In (Wills et al., 1995), pages 22-29.
Chikofsky, E. Message from the general chair. In (Wills et al., 1995) (contains a particularly vivid analogy to archeology), page ix.
Chikofsky, E. and J. Cross. Reverse engineering and design recovery: A taxonomy. IEEE Software, pages 13-17, January 1990.
Cross, J., A. Quilici, L. Wills, P. Newcomb, and E. Chikofsky. Second working conference on reverse engineering summary report. ACM SIGSOFT Software Engineering Notes, 20(5):23-26, December 1995.
Fiore, P., F. Lanubile, and G. Visaggio. Analyzing empirical data from a reverse engineering project. In (Wills et al., 1995), pages 106-114.
Johnson, W. L. Interactive explanation of software systems. In Proc. 10th Knowledge-Based Software Engineering Conference, pages 155-164, Boston, MA, 1995. IEEE Computer Society Press.
Kontogiannis, K., R. DeMori, M. Bernstein, M. Galler, and E. Merlo. Pattern matching for design concept localization. In (Wills et al., 1995), pages 96-103.
Leite, J. and P. Cerqueira. Recovering business rules from structured analysis specifications. In (Wills et al., 1995), pages 13-21.
Lowry, M., A. Philpot, T. Pressburger, and I. Underwood. A formal approach to domain-oriented software design environments. In Proc. 9th Knowledge-Based Software Engineering Conference, pages 48-57, Monterey, CA, 1994.
Lutsky, P. Automating testing by reverse engineering of software documentation. In (Wills et al., 1995), pages 8-12.
Newcomb, P. Legacy system cataloging facility. In (Wills et al., 1995), pages 52-60, July 1995.
Quilici, A. and D. Chin. Decode: A cooperative environment for reverse-engineering legacy software. In (Wills et al., 1995), pages 156-165.
Selfridge, P., R. Waters, and E. Chikofsky. Challenges to the field of reverse engineering - A position paper. In Proc. of the First Working Conference on Reverse Engineering, pages 144-150, Baltimore, MD, May 1993. IEEE Computer Society Press.
Waters, R. and E. Chikofsky, editors. Proc. of the First Working Conference on Reverse Engineering, Baltimore, MD, May 1993. IEEE Computer Society Press.
Wills, L., P. Newcomb, and E. Chikofsky, editors. Proc. of the Second Working Conference on Reverse Engineering, Toronto, Ontario, July 1995. IEEE Computer Society Press.

Automated Software Engineering 3, 173-178 (1996) (c) 1996 Kluwer Academic Publishers. Manufactured in The Netherlands.

Desert Island Column

JOHN DOBSON                                                 john.dobson@newcastle.ac.uk
Centre for Software Reliability, Bedson Building, University of Newcastle, Newcastle NE1 7RU, U.K.

When I started preparing for this article, I looked along my bookshelves to see what books I had on software engineering. There were none. It is not that software engineering has not been part of my life, but that I have not read anything on it as a subject that I wished to keep in order to read again. There were books on software, and books on engineering, and books on many a subject of interest to software engineers such as architecture and language. In fact these categories provided more than enough for me to wish to take to the desert island, so making the selection provided an enjoyable evening. I also chose to limit my quota to six (or maybe the editor did, I forget).

Since there was an element of choice involved, I made for myself some criteria: it had to be a book that I had read and enjoyed reading, it had to have (or have had) some significance for me in my career, either in terms of telling me how to do something or increasing my understanding, it had to be relevant to the kind of intellectual exercise we engage in when we are engineering software, and it had to be well-written. Of these, the last was the most important. There is a pleasure to be gained from reading a well-written book simply because it is written well. That doesn't necessarily mean easy to read; it means that there is a just and appropriate balance between what the writer has brought to the book and what the reader needs to bring in order to get the most out of it. All of my chosen books are well-written. All are worth reading for the illumination they shed on software engineering from another source, and I hope you will read them for that reason.

First, a book on engineering: To Engineer is Human, by Petroski (1985). Actually it is not so much about engineering (understood as meaning civil engineering) as about the history of civil engineering. Perhaps that is why I have no books on software engineering: the discipline is not yet old enough to have a decent history, and so there is not much of interest to say about it. What is interesting about Petroski's book, though, is the way it can be used as a base text for a future book on the history of software engineering, for it shows how the civil engineering discipline (particularly the building of bridges) has developed through disaster. The major bridge disasters of civil engineering history—Tay Bridge, Tacoma Narrows—have their analogues in our famous disasters—Therac, the London Ambulance Service. The importance of disasters lies, of course, in what is learnt from them; and this means that they have to be well documented. The examples of software disasters that I gave have been documented, the London Ambulance Service particularly, but these are in the minority. There must be many undocumented disasters in software engineering, from which as a result nothing has been learnt. This is yet another example of the main trouble with software being its invisibility, which is why engineering it is so hard. It is probably not possible, at least in the western world, to have a major disaster in civil engineering which can be completely concealed; Petroski's book shows just how this has helped the development of the discipline.


What makes Petroski's book so pleasant to read is the stress he places on engineering as a human activity and on the forces that drive engineers. Engineering is something that is born in irritation with something that is not as good as it could have been, a matter of making bad design better. But of course this is only part of the story. There is the issue of what the artifact is trying to achieve to consider. Engineering design lies in the details, the "minutely organized Particulars" as Blake calls them¹. But what about the general principles, the grand scheme of things in which the particulars have a place, the "generalizing Demonstrations of the Rational Power"? In a word, the architecture—and of course the architect (the "Scoundrel, Hypocrite and Flatterer" who appeals to the "General Good"?).

It seems that the software engineer's favourite architect is Christopher Alexander. A number of colleagues have been influenced by that remarkable book A Pattern Language (Alexander et al., 1977), which is the architects' version of a library of reusable object classes. But for all its influence over software architects (its influence over real architects is, I think, much less noticeable), it is not the one I have chosen to take with me. Alexander's vision of the architectural language has come out of his vision of the architectural process, which he describes in an earlier book. The Timeless Way of Building (Alexander, 1979). He sees the creation of pattern languages as being an expression of the actions of ordinary people who shape buildings for themselves instead of having the architect do it for them. The role of the architect is that of a facilitator, helping people to decide for themselves what it is they want. This is a process which Alexander believes has to be rediscovered, since the languages have broken down, are no longer shared, because the architects and planners have taken them for themselves.

There is much talk these days of empowerment. I am not sure what it means, though I am sure that a lot of people who use it do not know what it means either. When it is not being used merely as a fashionable management slogan, empowerment seems to be a recognition of the embodiment in an artifact of the Tao, the quality without a name. As applied to architecture, this quality has nothing to do with the architecture of the building or with the processes it supports and which stem from it. The architecture and architectural process should serve to release a more basic understanding which is native to us. We find that we already know how to make the building live, but that the power has been frozen in us. Architectural empowerment is the unfreezing of this ability.

The Timeless Way of Building is an exploration of this Zen-like way of doing architecture. Indeed the book could have been called Zen and the Art of Architecture, but fortunately it was not. A cynical friend of mine commented, after he had read the book, "It is good to have thought like that"—the implication being that people who have been through that stage are more mature in their thinking than those who have not or who are still in it. I can see what he means but I think he is being unfair. I do not think we have really given this way of building systems a fair try. Christopher Alexander has, of course, and the results are described in two of his other books, The Production of Houses (Alexander et al., 1985) and The Oregon Experiment (Alexander et al., 1975). Reading between the lines of these two books does seem to indicate that the process was perhaps not as successful as it might have been and I think there is probably scope for an architectural process engineer to see what could be done to improve the process design. Some experiments in designing computer systems that way have been performed. One good example is described in Pelle Ehn's book Work-Oriented


Design of Computer Artifacts (Ehn, 1988), which has clearly been influenced by Alexander's view of the architectural process. It also shares Alexander's irritating tendency to give the uneasy impression that the project was not quite as successful as claimed. But nevertheless I think these books of Alexander's should be required reading, particularly for those who like to acknowledge the influence of A Pattern Language. Perhaps The Timeless Way of Building and The Production of Houses will come to have the same influence on the new breed of requirements engineers as A Pattern Language has had on software engineers. That would be a good next stage of development for requirements engineering to go through.

If there is something about the architectural process that somehow embodies the human spirit, then there is something about the architectural product that embodies the human intellect. It sometimes seems as if computers have taken over almost every aspect of human intellectual endeavour, from flying aeroplanes to painting pictures. Where is it all going to end—indeed will it ever end? Is there anything that they can't do?

Well of course there is, and their limitations are provocatively explored in Hubert Dreyfus' famous book What Computers Can't Do (Dreyfus, 1979), which is my third selection.

For those who have yet to read this book, it is an enquiry into the basic philosophical presuppositions of the artificial intelligence domain. It raises some searching questions about the nature and use of intelligence in our society. It is also a reaction against some of the more exaggerated claims of proponents of artificial intelligence, claims which, however they may deserve respect for their usefulness and authority, have not been found agreeable to experience (as Gibbon remarked about the early Christian belief in the nearness of the end of the world).

Now it is too easy, and perhaps a bit unfair, to tease the AI community with some of the sillier sayings of their founders. Part of the promotion of any new discipline must involve a certain amount of overselling (look at that great engineer Brunel, for example). I do not wish to engage in that debate again here, but it is worth remembering that some famous names in software engineering have, on occasion, said things which perhaps they now wish they had not said. It would be very easy to write a book which does for software engineering what What Computers Can't Do did for artificial intelligence: raise a few deep issues, upset a lot of people, remind us all that when we cease to think about something we start to say stupid things and make unwarranted claims. It might be harder to do it with Dreyfus' panache, rhetoric, and philosophic understanding. I do find with What Computers Can't Do, though, that the rhetoric gets in the way a bit. A bit more dialectic would not come amiss. But the book is splendid reading.

Looking again at the first three books I have chosen, I note that all of them deal with the human and not the technical side of software capabilities, design and architecture. One of the great developments in software engineering came when it was realised and accepted that the creation of software was a branch of mathematics, with mathematical notions of logic and proof. The notion of proof is a particularly interesting one when it is applied to software, since it is remarkable how shallow and uninteresting the theorems and proofs about the behaviour of programs usually are. Where are the new concepts that make for great advances in mathematical proofs?

The best book I know that explores the nature of proof is Imre Lakatos' Proofs and Refutations (Lakatos, 1976) (subtitled The Logic of Mathematical Discovery—making the


point that proofs and refutations lead to discoveries, all very Hegelian). This surely is a deathless work which so cleverly explores the nature of proof, the role of counterexamples in producing new proofs by redefining concepts, and the role of formalism in convincing a mathematician. In a way, it describes the history of mathematical proof in the way that To Engineer is Human describes the history of engineering (build it; oh dear, it's fallen down; build it again, but better this time). What makes Proofs and Refutations so memorable is its cleverness, its intellectual fun, its wit. But the theorem discussed is just an ordinary invariant theorem (Euler's formula relating vertices, edges and faces of a polyhedron: V - E + F = 2), and its proof is hardly a deep one, either. But Lakatos makes all sorts of deep discussion come out of this simple example: the role of formalism in the advancement of understanding, the relationship between the certainty of a formal proof and the meaning of the denotational terms in the proof, the process of concept formation. To the extent that software engineering is a branch of mathematics, the discussion of the nature of mathematics (and there is no better discussion anywhere) is of relevance to software engineers.

Mathematics is not, of course, the only discipline of relevance to software engineering. Since computer systems have to take their place in the world of people, they have to respect that social world. I have lots of books on that topic on my bookshelf, and the one that currently I like the best is Computers in Context by Dahlbom and Mathiassen (1993), but it is not the one that I would choose to take to my desert island. Instead, I would prefer to be accompanied by Women, Fire and Dangerous Things by Lakoff (1987). The subtitle of this book is What Categories Reveal about the Mind. The title comes from the fact that in the Dyirbal language of Australia, the words for women, fire, and dangerous things are all placed in one category, but not because women are considered fiery or dangerous. Any object-oriented software engineer should, of course, be intensely interested in how people do categorise things and what the attributes are that are common to each category (since this will form the basis of the object model and schema). I find very little in my books on object-oriented requirements and design that tells me how to do this, except that many books tell me it is not easy and requires a lot of understanding of the subject domain, something which I know already but which lacks the concreteness of practical guidance. What Lakoff's book does is to tell you what the basis of linguistic categorisation actually is. (But I'm not going to tell you; my aim is to get you to read this book as well.) With George Lakoff telling you about the linguistic basis for object classification and Christopher Alexander telling you about how to go about finding out what a person or organisation's object classification is, you are beginning to get enough knowledge to design a computer system for them.

However, you should be aware that the Lakoff book contains fundamental criticisms of the objectivist stance, which believes that meaning is a matter of truth and reference (i.e., that it concerns the relationship between symbols and things in the world) and that there is a single correct way of understanding what is and what is not true. There is some debate about the objectivist stance and its relation to software (see the recent book Information Systems Development and Data Modelling by Hirschheim, Klein and Lyytinen (1995) for a fair discussion), but most software engineers seem reluctant to countenance any alternative view. Perhaps this is because the task of empowering people to construct their own reality, which is what all my chosen books so far are about, is seen as a task not fit, too subversive, for any decently engineered software to engage in. (Or maybe it is just too hard.)

My final choice goes against my self-denying ordinance not to make fun of the artificial intelligentsia. It is the funniest novel about computers ever written, and one of the great classics of comedy literature: The Tin Men by Frayn (1965). For those who appreciate such things, it also contains (in its last chapter) the best and most humorous use of self-reference ever published, though you have to read the whole book to get the most enjoyment out of it. For a book which was written more than thirty years ago, it still seems very pointed, hardly dated at all. I know of some institutions that claim as a matter of pride to have been the original for the fictitious William Morris Institute of Automation Research (a stroke of inspiration there!). They still could be; the technology may have been updated but the same individual types are still there, and the same meretricious research perhaps—constructing machines to invent the news in the newspapers, to write bonkbusters, to do good and say their prayers, to play all the world's sport and watch it—while the management gets on with more stimulating and demanding tasks, such as organising the official visit which the Queen is making to the Institute to open the new wing.

So there it is. I have tried to select a representative picture of engineering design, of the architecture of software artifacts, of the limitations and powers of mathematical formalisation of software, of the language software embodies and of the institutions in which software research is carried out. Together they say something about my view, not so much of the technical detail of software engineering, but of the historical, architectural, intellectual and linguistic context in which it takes place. So although none of these books is about software engineering, all are relevant since they show that what is true of our discipline is true of other disciplines also, and therefore we can learn from them and use their paradigms as our own.

There are many other books from other disciplines of relevance to computing that I am particularly sorry to leave behind, Wassily Kandinsky's book Point and Line to Plane (Kandinsky, 1979) (which attempts to codify the rules of artistic composition) perhaps the most. Now for my next trip to a desert island, I would like to take, in addition to the Kandinsky, [that's enough books, Ed.].

Note

1. Jerusalem, Part III, plate 55.

References

Alexander, C. 1979. The Timeless Way of Building. New York: Oxford University Press.
Alexander, C., Ishikawa, S., and Silverstein, M. 1977. A Pattern Language. New York: Oxford University Press.
Alexander, C., Martinez, J., and Corner, D. 1985. The Production of Houses. New York: Oxford University Press.
Alexander, C., Silverstein, M., Angel, S., Ishikawa, S., and Abrams, D. 1975. The Oregon Experiment. New York: Oxford University Press.
Dahlbom, B. and Mathiassen, L. 1993. Computers in Context. Cambridge, MA and Oxford, UK: NCC Blackwell.
Dreyfus, H.L. 1979. What Computers Can't Do (revised edition). New York: Harper & Row.
Ehn, P. 1988. Work-Oriented Design of Computer Artifacts. Stockholm: Arbetslivscentrum (ISBN 91-86158-45-7).
Frayn, M. 1965. The Tin Men. London: Collins (republished by Penguin Books, 1995).
Hirschheim, R., Klein, H.K., and Lyytinen, K. 1995. Information Systems Development and Data Modelling. Cambridge University Press.
Kandinsky, W. 1979. Point and Line to Plane. Trans. H. Dearstyne and H. Rebay (Eds.). New York: Dover (originally published 1926, in German).
Lakatos, I. 1976. Proofs and Refutations. J. Worrall and E. Zahar (Eds.). Cambridge University Press.
Lakoff, G. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press.
Petroski, H. 1985. To Engineer is Human. New York: St. Martin's Press.

Automated Software Engineering

An International Journal

Instructions for Authors

Authors are encouraged to submit high quality, original work that has neither appeared in, nor is under consideration by, other journals.

PROCESS FOR SUBMISSION

1. Authors should submit five hard copies of their final manuscript to:

Mrs. Judith A. Kemp
AUTOMATED SOFTWARE ENGINEERING Editorial Office
Kluwer Academic Publishers
101 Philip Drive
Norwell, MA 02061
Tel.: 617-871-6300
FAX: 617-871-6528
E-mail: jkemp@wkap.com

2. Authors are strongly encouraged to use Kluwer's LaTeX journal style file. Please see the ELECTRONIC SUBMISSION section below.

3. Enclose with each manuscript, on a separate page, from three to five key words.

4. Enclose originals for the illustrations, in the style described below, for one copy of the manuscript. Photocopies of the figures may accompany the remaining copies of the manuscript. Alternatively, original illustrations may be submitted after the paper has been accepted.

5. Enclose a separate page giving the preferred address of the contact author for correspondence and return of proofs. Please include a telephone number, fax number and email address, if available.

6. If possible, send an electronic mail message to <jkkluwer@world.std.com> at the time your manuscript is submitted, including the title, the names of the authors, and an abstract. This will help the journal expedite the refereeing of the manuscript.

7. The refereeing is done by anonymous reviewers.

STYLE FOR MANUSCRIPT

1. Typeset, double or 1½ spaced; use one side of the sheet only (laser printed, typewritten, and good quality duplication acceptable).

2. Use an informative title for the paper and include an abstract of 100 to 250 words at the head of the manuscript. The abstract should be a carefully worded description of the problem addressed, the key ideas introduced, and the results. Abstracts will be printed with the article.

3. Provide a separate double-spaced sheet listing all footnotes, beginning with "Affiliation of author" and continuing with numbered references. Acknowledgment of financial support may be given if appropriate.

References should appear in a separate bibliography at the end of the paper in alphabetical order with items referred to in the text by author and date of publication in parentheses, e.g., (Marr, 1982). References should be complete, in the following style:

Style for papers: Authors, last names followed by first initials, year of publication, title, volume, inclusive page numbers.

Style for books: Authors, year of publication, title, publisher and location, chapter and page numbers (if desired). Examples as follows:

(Book) Marr, D. 1982. Vision, a Computational Investigation into the Human Representation & Processing of Visual Information. San Francisco: Freeman.

(Journal) Rosenfeld, A. and Thurston, M. 1971. Edge and curve detection for visual scene analysis. IEEE Trans. Comput., C-20:562-569.

(Conference Proceedings) Witkin, A. 1983. Scale-space filtering. Proc. Int. Joint Conf. Artif. Intell., Karlsruhe, West Germany, pp. 1019-1021.

(Lab. memo) Yuille, A.L. and Poggio, T. 1983. Scaling theorems for zero crossings. M.I.T. Artif. Intell. Lab., Massachusetts Inst. Technol., Cambridge, MA, A.I. Memo 722.

Type or mark mathematical copy exactly as it should appear in print. Journal style for letter symbols is as follows: variables, italic type (indicated by underline); constants, roman text type; matrices and vectors, boldface type (indicated by wavy underline). In word-processor manuscripts, use the appropriate typeface. It will be assumed that letters in displayed equations are to be set in italic type unless you mark them otherwise. All letter symbols in text discussion must be marked if they should be italic or boldface. Indicate best breaks for equations in case they will not fit on one line.

ELECTRONIC SUBMISSION PROCEDURE

Upon acceptance for publication, the preferred format of submission is the Kluwer LaTeX journal style file. The style file may be accessed through a gopher site by means of the following commands:

Internet: gopher gopher.wkap.nl (or IP number 192.87.90.1)

WWW URL: gopher://gopher.wkap.nl

- Submitting and Author Instructions
- Submitting to a Journal
- Choose Journal Discipline
- Choose Journal Listing
- Submitting Camera Ready

Authors are encouraged to read the "About this menu" file.

If you do not have access to gopher or have questions, please send e-mail to:

srumsey@wkap.com

The Kluwer LaTeX journal style file is the preferred format, and we urge all authors to use this style for existing and future papers; however, we will accept other common formats (e.g., WordPerfect or Microsoft Word) as well as ASCII (text only) files. Also, we accept FrameMaker documents as "text only" files. Note, it is also helpful to supply both the source and ASCII files of a paper. Please submit PostScript files for figures as well as separate, original figures in camera-ready form. A PostScript figure file should be named after its figure number, e.g., fig1.eps or circle1.eps.

ELECTRONIC DELIVERY

IMPORTANT - Hard copy of the ACCEPTED paper (along with separate, original figures in camera-ready form) should still be mailed to the appropriate Kluwer department. The hard copy must match the electronic version, and any changes made to the hard copy must be incorporated into the electronic version.

Via electronic mail

1. Please e-mail ACCEPTED, FINAL paper to

KAPfiles@wkap.com

2. Recommended formats for sending files via e-mail:
   a. Binary files - uuencode or binhex
   b. Compressing files - compress, pkzip, gunzip
   c. Collecting files - tar

3. The e-mail message should include the author's last name, the name of the journal to which the paper has been accepted, and the type of file (e.g., LaTeX or ASCII).

Via disk

1. Label a 3.5 inch floppy disk with the operating system and word processing program (e.g., DOS/WordPerfect 5.0) along with the authors' names, manuscript title, and name of journal to which the paper has been accepted.

2. Mail disk to

Kluwer Academic Publishers
Desktop Department
101 Philip Drive, Assinippi Park
Norwell, MA 02061

If you have any questions about the above procedures, please send e-mail to:

srumsey@wkap.com

STYLE FOR ILLUSTRATIONS

1. Originals for illustrations should be sharp, noise-free, and of good contrast. We regret that we cannot provide drafting or art service.

2. Line drawings should be laser printer output or India ink on paper or board. Use 8½ by 11-inch (22 x 29 cm) sheets if possible, to simplify handling of the manuscript.

3. Each figure should be mentioned in the text and numbered consecutively using Arabic numerals. In one of your copies, which you should clearly distinguish, specify the desired location of each figure in the text but place the original figure itself on a separate page. In the remainder of copies, which will be read by the reviewers, include the illustration itself at the relevant place in the text.

4. Number each table consecutively using Arabic numerals. Please label any material that can be typeset as a table, reserving the term "figure" for material that has been drawn. Specify the desired location of each table in the text, but place the table itself on a separate page following the text. Type a brief title above each table.

5. All lettering should be large enough to permit legible reduction.

6. Photographs should be glossy prints, of good contrast and gradation, and any reasonable size.

7. Number each original on the back.

8. Provide a separate sheet listing all figure captions, in proper style for the typesetter, e.g., "Fig. 3. Examples of the fault coverage of random vectors in (a) combinational and (b) sequential circuits."

PROOFING

Page proofs for articles to be included in a journal issue will be sent to the contact author for proofing, unless otherwise informed. The proofread copy should be received back by the Publisher within 72 hours.

COPYRIGHT

It is the policy of Kluwer Academic Publishers to own the copyright of all contributions it publishes. To comply with the U.S. Copyright Law, authors are required to sign a copyright transfer form before publication. This form returns to authors and their employers full rights to reuse their material for their own purposes. Authors must submit a signed copy of this form with their manuscript.

REPRINTS

Each group of authors will be entitled to 50 free reprints of their paper.

REVERSE ENGINEERING brings together in one place important contributions and up-to-date research results in this important area.

REVERSE ENGINEERING serves as an excellent reference, providing insight into some of the most important research issues in the field.

ISBN 0-7923-9756-8

